[jira] [Comment Edited] (LUCENE-9093) Unified highlighter with word separator never gives context to the left

Jira Sun, 15 Dec 2019 06:06:31 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996736#comment-16996736
 ]


Nándor Mátravölgyi edited comment on LUCENE-9093 at 12/15/19 2:05 PM:
----------------------------------------------------------------------

We have the same idea about how chained BreakIterators could be used to align
the match in a more pleasing way. I also agree that some changes to
FieldHighlighter will be necessary to handle overlaps. Your suggestion there is
to also highlight the matches that were already included in a previous
Passage. I think avoiding overlaps completely is preferable: the snippets would
not be redundant, and it would implicitly solve the problem of having to
highlight some matches more than once.

These are examples of what the least favorable edge cases would look like when 
we strictly avoid overlaps, but want to have centered match alignment. The 
search query is "field" and the original text is:
{noformat}
If set to false, or if there is no match in the alternate field either, the 
alternate field will be shown without highlighting, but could be marked by 
other processors.{noformat}
If the search has a fragsize of around 50, the first "field" match will be
centered properly. The next one will be left-aligned because the preceding text
has already been used for a passage.
{noformat}
[
  "in the alternate <b>field</b> either, the alternate",
  "<b>field</b> will be shown without highlighting, but"
]{noformat}
If the search has a fragsize of around 60, the first "field" match will be
centered properly. The next one will be right-aligned because it sits at the
very end of the passage made for the first match.
{noformat}
[
 "match in the alternate <b>field</b> either, the alternate <b>field</b>"
]{noformat}
Now the question is: which of these is closer to what we want to see? I'd say 
either "worst" edge case would be much better than the constantly left-aligned 
matches we have currently. Note: these are close to how the other highlighters 
behave when they have near-boundary matches.
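
To make the overlap rule concrete, here is a minimal Java sketch of the
behaviour described above. It is not Lucene code: the class CenteredPassages
and its passages method are hypothetical, it uses a plain
java.text.BreakIterator word instance instead of the highlighter's own
BreakIterator, and it finds matches with a simple substring search. Each
passage tries to put half of the spare fragsize to the left of the match, but
never starts before the end of the previous passage, and a match that already
falls inside the previous passage does not get a passage of its own.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch, not Lucene code: center each passage on its match,
// but never reuse text that an earlier passage already covers.
class CenteredPassages {

    // Returns [start, end) offsets of one passage per uncovered occurrence of term.
    static List<int[]> passages(String text, String term, int fragsize) {
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        words.setText(text);
        List<int[]> result = new ArrayList<>();
        int prevEnd = 0; // end of the previous passage; text before it is off limits
        for (int m = text.indexOf(term); m >= 0; m = text.indexOf(term, m + 1)) {
            if (m < prevEnd) {
                continue; // this match is already shown by the previous passage
            }
            int mEnd = m + term.length();
            int spare = Math.max(0, fragsize - term.length());
            // Aim for half of the spare budget on the left, clamped to unused text.
            int start = Math.max(prevEnd, m - spare / 2);
            if (!words.isBoundary(start)) {
                start = Math.min(m, words.following(start)); // do not begin mid-word
            }
            while (start < m && Character.isWhitespace(text.charAt(start))) {
                start++;
            }
            int end = Math.min(text.length(), Math.max(mEnd, start + fragsize));
            if (!words.isBoundary(end)) {
                end = Math.max(mEnd, words.preceding(end)); // do not end mid-word
            }
            while (end > mEnd && Character.isWhitespace(text.charAt(end - 1))) {
                end--;
            }
            result.add(new int[] {start, end});
            prevEnd = end;
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "If set to false, or if there is no match in the alternate field either, "
                + "the alternate field will be shown without highlighting, but could be marked "
                + "by other processors.";
        for (int[] p : passages(text, "field", 45)) {
            System.out.println(text.substring(p[0], p[1]));
        }
    }
}
```

With the sample text above and a fragsize of 45, this sketch should produce the
two passages from the first example; with a fragsize of 60, both matches land
in a single passage.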

Regarding the question of abstraction: I have not found a reason to think we
need to replace the BreakIterators with a new interface. I think the bulk of
FastVectorHighlighter's fragment builder abstraction is about tracking the
matches and highlighting the terms with different styles. (Note that I have
only looked through it briefly.)

Just for the sake of completeness: for what I would like to do, a different
concept of fragment length and snippet limit would work better. What I really
want is an excerpt of the document that shows valuable matches with a few words
of context around them, while the whole highlight is no longer than N
characters. Right now I use fragsize=90 and snippets=3 because I want something
that is no longer than 300 chars. Ideally, the highlighter would determine
which differently sized fragments yield the best excerpt. A dense cluster of
matches could form one 180-char fragment, while two isolated matches would form
two 50-char fragments. This could be better than forcing the fragments to be
uniform in size.
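
As a rough illustration of that idea, here is a minimal Java sketch. It is
hypothetical and not part of Lucene: the class BudgetedExcerpt, the fixed
per-match context window, and the "most matches first" ranking are all
assumptions made for the example. Matches whose context windows touch are
merged into one larger fragment, and fragments are then kept, densest first,
as long as the total stays within the N-character budget. Snapping fragment
edges to word boundaries is left out to keep it short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not Lucene code: variable-size fragments under one
// overall character budget instead of a fixed fragsize and snippet count.
class BudgetedExcerpt {

    // matches: [start, end) offsets sorted by start.
    // Returns fragments as {start, end, matchCount} in document order.
    static List<int[]> fragments(int textLength, int[][] matches, int context, int maxTotal) {
        // 1. Give each match a context window and merge windows that touch.
        List<int[]> merged = new ArrayList<>();
        for (int[] m : matches) {
            int start = Math.max(0, m[0] - context);
            int end = Math.min(textLength, m[1] + context);
            int[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && start <= last[1]) {
                last[1] = Math.max(last[1], end); // grow the existing cluster
                last[2]++;
            } else {
                merged.add(new int[] {start, end, 1});
            }
        }
        // 2. Prefer fragments with more matches (the sort is stable, so ties
        //    stay in document order) and keep those that fit the budget.
        List<int[]> ranked = new ArrayList<>(merged);
        ranked.sort((a, b) -> Integer.compare(b[2], a[2]));
        List<int[]> kept = new ArrayList<>();
        int used = 0;
        for (int[] f : ranked) {
            int length = f[1] - f[0];
            if (used + length <= maxTotal) {
                kept.add(f);
                used += length;
            }
        }
        // 3. Present the kept fragments in document order.
        kept.sort((a, b) -> Integer.compare(a[0], b[0]));
        return kept;
    }

    public static void main(String[] args) {
        int[][] matches = {{100, 105}, {120, 125}, {140, 145}, {500, 505}, {800, 805}};
        for (int[] f : fragments(1000, matches, 20, 300)) {
            System.out.println(f[0] + ".." + f[1] + " (" + f[2] + " matches)");
        }
    }
}
```

Here the three nearby matches merge into one longer fragment while the two
isolated matches each get a short one, which is the kind of non-uniform result
I have in mind.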


> Unified highlighter with word separator never gives context to the left
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-9093
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9093
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: Tim Retout
>            Priority: Major
>
> When using the unified highlighter with hl.bs.type=WORD, I am not able to get 
> context to the left of the matches returned; only words to the right of each 
> match are shown.  I see this behaviour on both Solr 6.4 and Solr 7.1.
> Without context to the left of a match, the highlighted snippets are much 
> less useful for understanding where the match appears in a document.
> As an example, using the techproducts data with Solr 7.1, given a search for 
> "apple", highlighting the "features" field:
> http://localhost:8983/solr/techproducts/select?hl.fl=features&hl=on&q=apple&hl.bs.type=WORD&hl.fragsize=30&hl.method=unified
> I see this snippet:
> "<em>Apple</em> Lossless, H.264 video"
> Note that "Apple" is anchored to the left.  Compare with the original 
> highlighter:
> http://localhost:8983/solr/techproducts/select?hl.fl=features&hl=on&q=apple&hl.fragsize=30
> And the match has context either side:
> ", Audible, <em>Apple</em> Lossless, H.264 video"
> (To complicate this, in general I am not sure that the unified highlighter is 
> respecting the hl.fragsize parameter, although [SOLR-9935] suggests support 
> was added.  I included the hl.fragsize param in the unified URL too, but it's 
> making no difference unless set to 0.)
