[
https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16996736#comment-16996736
]
Nándor Mátravölgyi edited comment on LUCENE-9093 at 12/15/19 2:05 PM:
----------------------------------------------------------------------
We have the same idea about how chained BreakIterators could be used to align
the match in a more pleasing way. I also agree that some changes to
FieldHighlighter will be necessary to handle overlaps. Your suggestion for that
is to also highlight the matches that were included in a previous Passage. I
think avoiding the overlaps completely is preferable: the snippets would not be
redundant, and it would implicitly remove the need to highlight some matches
more than once.
These are examples of what the least favorable edge cases would look like when
we strictly avoid overlaps, but want to have centered match alignment. The
search query is "field" and the original text is:
{noformat}
If set to false, or if there is no match in the alternate field either, the
alternate field will be shown without highlighting, but could be marked by
other processors.{noformat}
If the search has a fragsize of around 50, the first "field" word will be
aligned properly. The next one will be left-aligned because the preceding text
has already been used for a passage.
{noformat}
[
"in the alternate <b>field</b> either, the alternate",
"<b>field</b> will be shown without highlighting, but"
]{noformat}
If the search has a fragsize of around 60, the first "field" word will be
aligned properly. The next one will be right-aligned because it is at the very
end of the passage made for the first match.
{noformat}
[
"match in the alternate <b>field</b> either, the alternate <b>field</b>"
]{noformat}
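To make the overlap-avoidance rule concrete, here is a minimal Python sketch.
It is not Lucene code: centered_passages and its parameters are hypothetical
names, and plain whitespace stands in for a word BreakIterator. It centers a
window of about fragsize characters on each match, shifts the window right when
the preceding text already belongs to an earlier passage, and then snaps both
edges inward to word boundaries. With fragsize=50 it gives a centered first
passage and a left-aligned second one. Because it only snaps inward, it does
not reproduce the right-aligned fragsize=60 case.

```python
import re

TEXT = ("If set to false, or if there is no match in the alternate field "
        "either, the alternate field will be shown without highlighting, "
        "but could be marked by other processors.")


def _is_word_start(text, i):
    return not text[i].isspace() and (i == 0 or text[i - 1].isspace())


def _is_word_end(text, i):
    return i == len(text) or (text[i].isspace() and not text[i - 1].isspace())


def centered_passages(text, term, fragsize):
    """Return non-overlapping passages, centering each match when possible.

    Assumes the term contains no whitespace and is shorter than fragsize.
    """
    passages = []
    prev_end = 0  # end offset of the last emitted passage
    for m in re.finditer(re.escape(term), text):
        if passages and m.end() <= prev_end:
            continue  # already shown inside the previous passage
        center = (m.start() + m.end()) // 2
        # Ideal centered window, shifted right if earlier text is used up.
        lo = min(max(center - fragsize // 2, prev_end), m.start())
        hi = max(min(lo + fragsize, len(text)), m.end())
        # Snap both edges inward to word boundaries without cutting the match.
        while lo < m.start() and not _is_word_start(text, lo):
            lo += 1
        while hi > m.end() and not _is_word_end(text, hi):
            hi -= 1
        passages.append(text[lo:hi])
        prev_end = hi
    return passages


if __name__ == "__main__":
    for passage in centered_passages(TEXT, "field", 50):
        print(repr(passage))
```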
Now the question is: which of these is closer to what we want to see? I'd say
either "worst" edge case would be much better than the constantly left-aligned
matches we have currently. Note: these are close to how the other highlighters
behave when they have near-boundary matches.
Regarding the question of abstraction: I have not found a reason to think we
need to replace the BreakIterators with a new interface. I think the bulk of
the FastVectorHighlighter's fragment builder abstraction is about tracking the
matches and highlighting the terms with different styles. (Note: I've only
looked through it briefly.)
Just for the sake of completeness: for what I would like to do, a different
concept of fragment length and snippet limit would be better. What I really
want is an excerpt of the document that shows valuable matches with a few words
of context around them, while the whole highlight is no longer than N
characters. Right now I have fragsize=90 and snippets=3 configured because I
want something that is not longer than 300 chars. Ideally, the highlighter
would determine which combination of differently sized fragments yields the
best excerpt. A dense cluster of matches could form one 180-char fragment,
while two isolated matches would form two 50-char fragments. This could be
better than forcing the fragments to be uniform in size.
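As a rough illustration of that budget idea, the following Python sketch plans
variable-sized fragments. It is only a sketch under my own assumptions, not
anything the highlighter offers today: plan_fragments, budget and context are
hypothetical names, and "best" is approximated by simply preferring the
clusters with the most matches. Each match gets a window of context characters
on both sides, touching windows are merged into one cluster, and clusters are
picked until the total character budget is spent.

```python
def plan_fragments(matches, text_len, budget, context=20):
    """Pick variable-sized (start, end) fragments within a total char budget.

    matches: list of (start, end) offsets of the highlighted terms.
    """
    # Step 1: build a context window per match and merge touching windows.
    clusters = []  # each entry is [start, end, match_count]
    for start, end in sorted(matches):
        lo = max(0, start - context)
        hi = min(text_len, end + context)
        if clusters and lo <= clusters[-1][1]:
            clusters[-1][1] = max(clusters[-1][1], hi)
            clusters[-1][2] += 1
        else:
            clusters.append([lo, hi, 1])

    # Step 2: greedily take the densest clusters first (ties: shorter first),
    # skipping any cluster that would exceed the remaining budget.
    chosen, used = [], 0
    for lo, hi, _count in sorted(clusters, key=lambda c: (-c[2], c[1] - c[0])):
        if used + (hi - lo) <= budget:
            chosen.append((lo, hi))
            used += hi - lo
    return sorted(chosen)


if __name__ == "__main__":
    matches = [(100, 105), (120, 125), (140, 145), (500, 505), (900, 905)]
    print(plan_fragments(matches, text_len=1000, budget=300))
```

Here the three nearby matches end up in one longer fragment and the two
isolated matches get short fragments of their own, all within the 300-char
budget. A real implementation would still need to snap the edges to word or
sentence boundaries.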
> Unified highlighter with word separator never gives context to the left
> -----------------------------------------------------------------------
>
> Key: LUCENE-9093
> URL: https://issues.apache.org/jira/browse/LUCENE-9093
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/highlighter
> Reporter: Tim Retout
> Priority: Major
>
> When using the unified highlighter with hl.bs.type=WORD, I am not able to get
> context to the left of the matches returned; only words to the right of each
> match are shown. I see this behaviour on both Solr 6.4 and Solr 7.1.
> Without context to the left of a match, the highlighted snippets are much
> less useful for understanding where the match appears in a document.
> As an example, using the techproducts data with Solr 7.1, given a search for
> "apple", highlighting the "features" field:
> http://localhost:8983/solr/techproducts/select?hl.fl=features&hl=on&q=apple&hl.bs.type=WORD&hl.fragsize=30&hl.method=unified
> I see this snippet:
> "<em>Apple</em> Lossless, H.264 video"
> Note that "Apple" is anchored to the left. Compare with the original
> highlighter:
> http://localhost:8983/solr/techproducts/select?hl.fl=features&hl=on&q=apple&hl.fragsize=30
> And the match has context either side:
> ", Audible, <em>Apple</em> Lossless, H.264 video"
> (To complicate this, in general I am not sure that the unified highlighter is
> respecting the hl.fragsize parameter, although [SOLR-9935] suggests support
> was added. I included the hl.fragsize param in the unified URL too, but it's
> making no difference unless set to 0.)