[
https://issues.apache.org/jira/browse/LUCENE-9093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17002548#comment-17002548
]
Nándor Mátravölgyi edited comment on LUCENE-9093 at 12/24/19 12:04 AM:
-----------------------------------------------------------------------
I could look into making this a github PR tomorrow... I'll change the default
fragalign to 0.5 as well.
It also works in SENTENCE mode, but the results won't be as accurate in some
cases. Let me elaborate.
In any mode the selected BreakIterator (WORD, SEPARATOR, SENTENCE, etc.) makes
the decision on where a slice can happen. The first slice always contains the
match. The LengthGoalBreakIterator decides on which side of the first slice
the selected BI should add more slices. The logic is generic and works
regardless of the underlying BI. Since the snippet is grown until it
reaches fragsize, the size of the last slice added determines how big
the overshoot is. Examples in SENTENCE mode:
Example text: _Hello Susan! I cannot believe the weather is unreal again! The
sky is green. I hope Mrs Smith will bring an umbrella for the picnic. Let's not
panic._
1. If the fragsize is smaller than the first slice (sentence in this case), no
expansion will happen in either direction. Note that fragalign is N/A in this
case.
{noformat}
q=sky&hl.fragalign=0.5&hl.fragsize=10 makes snippet length of 17
The <b>sky</b> is green.{noformat}
2. If the fragsize is bigger than the first slice and the fragalign is 0.5, the
slice will be expanded on the left first and then on the right if any space is
left.
{noformat}
q=sky&hl.fragalign=0.5&hl.fragsize=30 makes snippet length of 63
I cannot believe the weather is unreal again! The <b>sky</b> is green.
q=sky&hl.fragalign=0.5&hl.fragsize=80 makes snippet length of 119
I cannot believe the weather is unreal again! The <b>sky</b> is green. I hope
Mrs Smith will bring an umbrella for the picnic.
q=sky&hl.fragalign=0.5&hl.fragsize=120 makes snippet length of 132
Hello Susan! I cannot believe the weather is unreal again! The <b>sky</b> is
green. I hope Mrs Smith will bring an umbrella for the picnic.{noformat}
3. If the fragsize is bigger than the first slice and the fragalign is 0, the
slice will be expanded on the right only. (the match is anchored to
0/left/begin)
{noformat}
q=sky&hl.fragalign=0.0&hl.fragsize=30 makes snippet length of 73
The <b>sky</b> is green. I hope Mrs Smith will bring an umbrella for the picnic.
q=sky&hl.fragalign=0.0&hl.fragsize=80 makes snippet length of 90
The <b>sky</b> is green. I hope Mrs Smith will bring an umbrella for the
picnic. Let's not panic.{noformat}
4. If the fragsize is bigger than the first slice and the fragalign is 1, the
slice will be expanded on the left only. (the match is anchored to 1/right/end)
{noformat}
q=sky&hl.fragalign=1.0&hl.fragsize=30 makes snippet length of 63
I cannot believe the weather is unreal again! The <b>sky</b> is green.
q=sky&hl.fragalign=1.0&hl.fragsize=70 makes snippet length of 76
Hello Susan! I cannot believe the weather is unreal again! The <b>sky</b> is
green.{noformat}
In the above examples there are big overshoots of the fragsize: 63 instead of
30 (+110%) and 119 instead of 80 (+49%). These would also occur if the
fragalign were 0.1, but the alignment would be even less accurate in cases
where the left expansion overshoots:
{noformat}
q=sky&hl.fragalign=0.1&hl.fragsize=30 makes snippet length of 63
I cannot believe the weather is unreal again! The <b>sky</b> is green.{noformat}
This is because the order of expansion is strictly left first. I guess this
could be improved if so desired.
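To make the expansion order concrete, here is a rough Java sketch of a toy model that reproduces the snippet lengths from the examples above. It is not the actual LengthGoalBreakIterator code: the class and method names (SnippetGrowthSketch, grow) and the rule "expand left until the left context reaches fragsize * fragalign" are assumptions picked to match the examples, and slices are treated as trimmed sentences joined by a single space.

```java
import java.util.List;

class SnippetGrowthSketch {
    /**
     * Toy model (assumption, not Lucene code): grows a snippet around the
     * slice at matchIdx, strictly left first, then right.
     */
    static String grow(List<String> slices, int matchIdx, int fragsize, float fragalign) {
        int start = matchIdx;
        int end = matchIdx;
        int total = slices.get(matchIdx).length(); // the first slice always holds the match
        int leftContext = 0;                       // characters added to the left so far
        float leftBudget = fragsize * fragalign;   // assumed share of fragsize for the left side

        // Left first: stop once fragsize is reached or the left share is used up.
        while (total < fragsize && leftContext < leftBudget && start > 0) {
            start--;
            int added = slices.get(start).length() + 1; // +1 for the joining space
            total += added;
            leftContext += added;
        }
        // Then right, if any space is left.
        while (total < fragsize && end < slices.size() - 1) {
            end++;
            total += slices.get(end).length() + 1;
        }
        return String.join(" ", slices.subList(start, end + 1));
    }

    public static void main(String[] args) {
        List<String> slices = List.of(
            "Hello Susan!",
            "I cannot believe the weather is unreal again!",
            "The sky is green.",
            "I hope Mrs Smith will bring an umbrella for the picnic.",
            "Let's not panic.");
        // fragalign=0.1 still expands left by a whole sentence, as in the example above.
        String snippet = grow(slices, 2, 30, 0.1f);
        System.out.println(snippet.length() + ": " + snippet);
    }
}
```

In this model a fragalign of 0.1 only asks for 3 characters on the left, but the smallest unit that can be added is a whole sentence, so the left side overshoots and nothing is left for the right side.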
In summary, to ensure the accuracy of fragsize & fragalign parameters, they
have to be proportional to the approximate size of the slices. Here's how the
worst expected overshoot can be calculated:
{noformat}
float WorstOvershootPercent(float fragsize, float avgSliceLength) {
  // worst case: the snippet is one character short of fragsize when the last slice is added
  return ((((fragsize-1)+avgSliceLength) / fragsize)-1)*100;
}
WORD: (words are usually 12-25 characters at most)
WorstOvershootPercent(15, 12) => 73.33%
WorstOvershootPercent(100, 25) => 24.00%
WorstOvershootPercent(300, 25) => 8.00%
SENTENCE: (a sentence can be very long)
WorstOvershootPercent(300, 300) => 99.67%
WorstOvershootPercent(300, 500) => 166.33%
WorstOvershootPercent(2000, 300) => 14.95%
WorstOvershootPercent(2000, 500) => 24.95%{noformat}
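For anyone who wants to try other values, the calculation above can be run as plain Java. This is only a sketch: the class name WorstOvershoot is made up for illustration, and the second method is just the same formula simplified algebraically to (sliceLength - 1) / fragsize * 100.

```java
import java.util.Locale;

class WorstOvershoot {
    /** Same formula as above: the snippet is at fragsize-1 when one more slice is added. */
    static double worstOvershootPercent(double fragsize, double sliceLength) {
        return ((((fragsize - 1) + sliceLength) / fragsize) - 1) * 100;
    }

    /** Algebraically equivalent short form. */
    static double worstOvershootPercentSimplified(double fragsize, double sliceLength) {
        return (sliceLength - 1) / fragsize * 100;
    }

    public static void main(String[] args) {
        double[][] cases = { {15, 12}, {100, 25}, {300, 25}, {300, 300}, {2000, 500} };
        for (double[] c : cases) {
            // Print fragsize, slice length and the worst-case overshoot in percent.
            System.out.println(String.format(Locale.ROOT, "fragsize=%.0f slice=%.0f => %.2f%%",
                c[0], c[1], worstOvershootPercent(c[0], c[1])));
        }
    }
}
```

The short form shows why the overshoot only becomes small when fragsize is large compared to the slice length.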
The other highlighters have similar rules for this. The only thing that can
easily improve this in some cases is to search for the length closest to the
fragsize instead of the minimum. The LengthGoalBreakIterator has a
closestTo mode, but it's not usable because it would require yet another
parameter. ([view on
github|https://github.com/apache/lucene-solr/blob/1be5b689640fe4d1bf0ae3fd19c5fe93b20a77ef/solr/core/src/java/org/apache/solr/highlight/UnifiedSolrHighlighter.java#L330])
Using that mode could produce an undershoot that is closer to the desired size
than the overshoot.
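As a rough illustration of the difference between the two goals, here is a small Java sketch. It does not use the real LengthGoalBreakIterator; the class LengthGoalModes and both methods are hypothetical, and the input is simply the list of snippet lengths available at successive slice boundaries (17, 73, 90 are the lengths from the fragalign=0.0 examples).

```java
import java.util.List;

class LengthGoalModes {
    /** Minimum-length goal: the first candidate that reaches fragsize (else the longest). */
    static int minLength(List<Integer> candidateLengths, int fragsize) {
        for (int len : candidateLengths) {
            if (len >= fragsize) {
                return len;
            }
        }
        return candidateLengths.get(candidateLengths.size() - 1);
    }

    /** Closest-to goal: the candidate with the smallest distance to fragsize. */
    static int closestTo(List<Integer> candidateLengths, int fragsize) {
        int best = candidateLengths.get(0);
        for (int len : candidateLengths) {
            // On a tie the shorter candidate is kept, because it was seen first.
            if (Math.abs(len - fragsize) < Math.abs(best - fragsize)) {
                best = len;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Integer> lengths = List.of(17, 73, 90); // ascending candidate snippet lengths
        System.out.println("minLength: " + minLength(lengths, 30));
        System.out.println("closestTo: " + closestTo(lengths, 30));
    }
}
```

With fragsize=30 the minimum-length goal has to take 73 (an overshoot of 43), while the closest-to goal would settle for 17 (an undershoot of 13).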
> Unified highlighter with word separator never gives context to the left
> -----------------------------------------------------------------------
>
> Key: LUCENE-9093
> URL: https://issues.apache.org/jira/browse/LUCENE-9093
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/highlighter
> Reporter: Tim Retout
> Priority: Major
> Attachments: LUCENE-9093.patch
>
>
> When using the unified highlighter with hl.bs.type=WORD, I am not able to get
> context to the left of the matches returned; only words to the right of each
> match are shown. I see this behaviour on both Solr 6.4 and Solr 7.1.
> Without context to the left of a match, the highlighted snippets are much
> less useful for understanding where the match appears in a document.
> As an example, using the techproducts data with Solr 7.1, given a search for
> "apple", highlighting the "features" field:
> http://localhost:8983/solr/techproducts/select?hl.fl=features&hl=on&q=apple&hl.bs.type=WORD&hl.fragsize=30&hl.method=unified
> I see this snippet:
> "<em>Apple</em> Lossless, H.264 video"
> Note that "Apple" is anchored to the left. Compare with the original
> highlighter:
> http://localhost:8983/solr/techproducts/select?hl.fl=features&hl=on&q=apple&hl.fragsize=30
> And the match has context either side:
> ", Audible, <em>Apple</em> Lossless, H.264 video"
> (To complicate this, in general I am not sure that the unified highlighter is
> respecting the hl.fragsize parameter, although [SOLR-9935] suggests support
> was added. I included the hl.fragsize param in the unified URL too, but it's
> making no difference unless set to 0.)