Re: UnifiedHighlighter and extraction of exact hit offset ranges

2017-01-12 Thread Timothy Rodriguez (BLOOMBERG/ 120 PARK)
It's arguably per request, imo.  Even per field, you may want different
formats depending on the request (and field).  That would potentially require a
different parameter to specify the formatter to use for a given request (and a
way to configure multiple formatters, perhaps with a default).


Re: UnifiedHighlighter and extraction of exact hit offset ranges

2017-01-11 Thread Dawid Weiss
> I'm guessing what you're seeing is from browsing the 6.3 code. The
> extensibility has been improved and committed for 6.4; see CHANGES.txt and
> LUCENE-7559 which did it.  In particular, all Passage methods are now
> public.

Ah, surely I am!... Sorry about that -- I've been updating/modifying
code that has to rely on a published Lucene version and didn't even
think to look at the master branch. I should have, sorry!

> I agree that OffsetsEnum methods should be public so that someone could
> override FieldHighlighter#highlightOffsetsEnums usefully. This is an
> oversight; good catch!  We should further enhance
> TestUnifiedHighlighterExtensibility to help us check for this.  I'll file an
> issue.  Come to think of it... one could argue LUCENE-7559 isn't really done
> as its scope should have included OffsetsEnums methods.

It hasn't been published yet (LUCENE-7559) so sure, I'd reopen it and
fix it as part of that issue, thank you!

D.

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



Re: UnifiedHighlighter and extraction of exact hit offset ranges

2017-01-11 Thread David Smiley
If the generics could be contained _instead of_ spreading to the UH class
itself (making UH typed), I think it could be nice.  But given the
per-field possible settings for formatting... that in particular makes
balancing these concerns hard.  I guess in the end Object isn't too bad
since it's limited to the advanced method use-case
(highlightFieldsAsObjects).


Re: UnifiedHighlighter and extraction of exact hit offset ranges

2017-01-11 Thread David Smiley
Dawid,

I'm guessing what you're seeing is from browsing the 6.3 code. The
extensibility has been improved and committed for 6.4; see CHANGES.txt and
LUCENE-7559 which did it.  In particular, all Passage methods are now
public.

I agree that OffsetsEnum methods should be public so that someone could
override FieldHighlighter#highlightOffsetsEnums usefully. This is an
oversight; good catch!  We should further
enhance TestUnifiedHighlighterExtensibility to help us check for this.
I'll file an issue.  Come to think of it... one could argue LUCENE-7559
isn't really done as its scope should have included OffsetsEnums methods.

*Jim:* can I change some visibility there for getting this into 6.4 as part
of the same issue?  Very low risk of course.  If not; no big deal.

~ David


Re: UnifiedHighlighter and extraction of exact hit offset ranges

2017-01-11 Thread Timothy Rodriguez (BLOOMBERG/ 120 PARK)
While we were open sourcing it, I had tried creating a patch to generify it,
but the generics did wind up all over the place. Ultimately the
UnifiedHighlighter would need to be generic itself so it can ensure the passage
formatters etc. are of the same type. (Or alternatively, generic passage
formatters are passed in per request.) I wound up dumping the changes because
they were quite substantial and they'd also push it further from the
PostingsHighlighter.

I'm hoping to get back to trying that again in the future. It'd be nice to have
a generic PassageFormatter.

-Tim

Sent from Bloomberg Professional for iPhone


Re: UnifiedHighlighter and extraction of exact hit offset ranges

2017-01-11 Thread Dawid Weiss
Thanks David!

That's almost exactly what I ended up doing. I don't mind casting
Object to my own type; you can always make it a covariant override in
your subclass (which you have to do to access those expert-level
methods anyway).
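
As a rough illustration of the covariant-override remark, here is a plain-Java
sketch. ObjectReturningSketch and OffsetReturningSketch are hypothetical
stand-ins, not Lucene classes; the only point is that a subclass may narrow an
Object return type to its own type, so callers of the subclass need no cast.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a base class whose expert-level method returns Object.
class ObjectReturningSketch {
  public Object highlight(String content, String term) {
    return content.replace(term, "<b>" + term + "</b>");
  }
}

// The subclass narrows the return type (a covariant override): List<int[]> is an Object.
class OffsetReturningSketch extends ObjectReturningSketch {
  @Override
  public List<int[]> highlight(String content, String term) {
    List<int[]> ranges = new ArrayList<>();
    if (term.isEmpty()) {
      return ranges; // nothing to search for
    }
    // Collect [start, end) character offsets of every non-overlapping occurrence.
    for (int at = content.indexOf(term); at >= 0; at = content.indexOf(term, at + term.length())) {
      ranges.add(new int[] {at, at + term.length()});
    }
    return ranges;
  }

  public static void main(String[] args) {
    // No cast needed when the static type is the subclass.
    List<int[]> ranges = new OffsetReturningSketch().highlight("abc abc", "abc");
    for (int[] r : ranges) {
      System.out.println(r[0] + "-" + r[1]);
    }
  }
}
```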

I still kind of think startOffset/endOffset and other related methods
could be made public to allow tinkering with them in
FieldHighlighter#highlightOffsetsEnums (otherwise this method is
protected for overriding, but useless in practice).

There is another API problem I found too. If you wish to override
FieldHighlighter.getSummaryPassagesNoHighlight you can't return
anything sensible because Passage is final, contains only
package-private fields and addMatch is package-private too. So you
can't create a "custom" passage.

I can file an issue and provide a patch if these changes are not
against the design of the unified highlighter?

Dawid


Re: UnifiedHighlighter and extraction of exact hit offset ranges

2017-01-11 Thread David Smiley
Hi Dawid,

You could write a trivial PassageFormatter that simply returns the Passage
list instead of doing formatting.  Passages contain offsets. And yes,
WholeBreakIterator if you don't need passage fragmentation. Unless I'm
missing some aspect of your requirements, this doesn't involve any internal
highlighter customizing.  Perhaps Javadocs could be improved to make this
more clear... and perhaps this Passage-returning PassageFormatter could be
included to clarify how it's done.  I recall doing or seeing this a few
months ago but I'm not sure.
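
A minimal sketch of that idea, using hypothetical stand-in types rather than
the real Lucene classes (SketchPassage, SketchFormatter and OffsetsOnlyFormatter
are made-up names, and the real Passage and PassageFormatter signatures may
differ): the "formatter" formats nothing and simply hands back the match
offsets it was given.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a passage: it only records match offsets.
final class SketchPassage {
  private final List<int[]> matches = new ArrayList<>();

  void addMatch(int startOffset, int endOffset) {
    matches.add(new int[] {startOffset, endOffset});
  }

  List<int[]> matches() {
    return matches;
  }
}

// Hypothetical stand-in for a formatter contract that is declared to return Object.
abstract class SketchFormatter {
  abstract Object format(SketchPassage[] passages, String content);
}

// A "formatter" that formats nothing and hands back the collected offsets.
final class OffsetsOnlyFormatter extends SketchFormatter {
  @Override
  Object format(SketchPassage[] passages, String content) {
    List<int[]> ranges = new ArrayList<>();
    for (SketchPassage passage : passages) {
      ranges.addAll(passage.matches());
    }
    return ranges;
  }

  public static void main(String[] args) {
    // One passage spanning the whole field, as with a whole-content break iterator.
    SketchPassage whole = new SketchPassage();
    whole.addMatch(4, 9);   // "quick"
    whole.addMatch(16, 19); // "fox"
    Object result =
        new OffsetsOnlyFormatter().format(new SketchPassage[] {whole}, "the quick brown fox");
    @SuppressWarnings("unchecked")
    List<int[]> ranges = (List<int[]>) result; // the caller casts Object back to its own type
    for (int[] r : ranges) {
      System.out.println(r[0] + "-" + r[1]);
    }
  }
}
```

The cast at the call site is the price of the Object-returning contract.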

One ugly aspect of the API (shared with its PostingsHighlighter lineage)
related to this discussion is that the PassageFormatter is declared to
return Object.  It's kinda hard to rectify it to be typed, perhaps with
generics, while also not spilling lots of generics to other places (the UH
itself) just because of this.  Perhaps UH.highlightFieldsAsObjects() could
be modified to take a Class to thus provide a type for the output... and
maybe the PassageFormatter could declare not only with generics but with a
method what types of results it produces.  I'm curious what you think.
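
One way the Class-token idea could look, sketched in plain Java. All names here
(TypedFormatter, SnippetFormatter, highlightAs) are hypothetical and none of
this is the UnifiedHighlighter API; the sketch only shows how a single generic
method plus a resultType() method on the formatter can give a typed result
while the highlighter class itself stays non-generic.

```java
// Hypothetical formatter that declares its result type via generics and via a method.
interface TypedFormatter<T> {
  Class<T> resultType();

  T format(String content, int[][] matchOffsets);
}

// Example formatter producing a String with <b> tags around each [start, end) match.
final class SnippetFormatter implements TypedFormatter<String> {
  @Override
  public Class<String> resultType() {
    return String.class;
  }

  @Override
  public String format(String content, int[][] matchOffsets) {
    StringBuilder sb = new StringBuilder();
    int last = 0;
    for (int[] m : matchOffsets) { // assumes sorted, non-overlapping offsets
      sb.append(content, last, m[0]).append("<b>").append(content, m[0], m[1]).append("</b>");
      last = m[1];
    }
    return sb.append(content, last, content.length()).toString();
  }
}

final class TypedHighlightSketch {
  // Only this method carries a type parameter; the enclosing class is not generic.
  static <T> T highlightAs(
      Class<T> type, TypedFormatter<?> formatter, String content, int[][] matchOffsets) {
    if (!type.isAssignableFrom(formatter.resultType())) {
      throw new IllegalArgumentException(
          "formatter produces " + formatter.resultType().getName() + ", not " + type.getName());
    }
    return type.cast(formatter.format(content, matchOffsets));
  }

  public static void main(String[] args) {
    String snippet =
        highlightAs(String.class, new SnippetFormatter(), "quick brown fox", new int[][] {{6, 11}});
    System.out.println(snippet);
  }
}
```

The type check happens at run time, so a mismatched formatter fails fast with an
exception rather than at compile time; that is the trade-off for keeping
generics out of the highlighter class.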

~ David

-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com


Re: UnifiedHighlighter and extraction of exact hit offset ranges

2017-01-11 Thread Dawid Weiss
To follow-up: I hacked into the offsets by passing WholeBreakIterator
and a custom PassageFormatter that just returns the matches from the
singleton resulting passage. This is suboptimal though, as there's
still some complex logic going on in highlightOffsetsEnums that could
be avoided.

Dawid





UnifiedHighlighter and extraction of exact hit offset ranges

2017-01-11 Thread Dawid Weiss
Can any of the folks who contributed to UnifiedHighlighter (David?)
clarify my thinking here?

I have a requirement to extract (for a set of search results) a list
of exact "hit" ranges (field offsets, with support for multi-term
queries and span queries). Obviously, I'm only talking about queries
that relate to field content somehow, but this has always been quite
problematic and required the use of multiple helper classes
(WeightedSpanTermExtractor, MultiTermHighlighting, etc.) and pretty
hairy logic.

So I turned to look at UnifiedHighlighter for help.

Seems like the right way (?) to do it would be to override (and abuse)
UnifiedHighlighter's getFieldHighlighter method and return a field
highlighter with an override of:

protected Passage[] highlightOffsetsEnums(List<OffsetsEnum> offsetsEnums) throws IOException {

so that I can capture and return a separate Passage for each
OffsetsEnum (I have my own code to deal with overlaps and merging, so
I can skip this entirely). Then, with a custom no-op PassageFormatter
I could simply get a list of those offsets.
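
For readers wondering what the downstream overlap handling might involve, here
is a small sketch. It is not the author's code (which is not shown in this
thread); HitRangeMerger is a hypothetical name, and ranges are assumed to be
[start, end) character offsets. It sorts the raw hit ranges and merges any that
overlap or touch.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: merges overlapping or touching [start, end) hit ranges.
final class HitRangeMerger {
  static List<int[]> merge(List<int[]> ranges) {
    // Sort a copy by start offset so the input list is left untouched.
    List<int[]> sorted = new ArrayList<>(ranges);
    sorted.sort(Comparator.comparingInt((int[] r) -> r[0]));

    List<int[]> merged = new ArrayList<>();
    for (int[] r : sorted) {
      int[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
      if (last != null && r[0] <= last[1]) {
        // Overlaps or touches the previous range: extend it.
        last[1] = Math.max(last[1], r[1]);
      } else {
        // Start a new range (copied, so the caller's arrays are not modified).
        merged.add(new int[] {r[0], r[1]});
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    List<int[]> raw = new ArrayList<>();
    raw.add(new int[] {5, 10});
    raw.add(new int[] {0, 3});
    raw.add(new int[] {8, 12});
    for (int[] r : merge(raw)) {
      System.out.println(r[0] + "-" + r[1]);
    }
  }
}
```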

The problem with this approach is that there is currently no way to
access offsets in OffsetsEnum -- everything else is protected (so
subclassable), but the OffsetsEnum accessors are package-private.
Namely these two:

  int startOffset() throws IOException {
    return postingsEnum.startOffset();
  }

  int endOffset() throws IOException {
    return postingsEnum.endOffset();
  }

Should these two be protected to allow such customizations? (I agree
it's *very* low-level, but I have a practical use case where this
would be useful.)

Am I on the right track here?

Separately from that, I think it'd be nice to have some sort of
generic utility that, for a given document (or a set of documents)
would return such hit ranges... UnifiedHighlighter seems

Dawid
