Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified 
highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361734439
 
 

 ##########
 File path: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##########
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx 
is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-    return baseIter.preceding(offset); // no change needed
+    final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // 
convert DONE to 0
+    fragmentEndFromPreceding = baseIter.following(fragmentStart);
+    if (fragmentEndFromPreceding == DONE) {
+      fragmentEndFromPreceding = baseIter.last();
+    }
+    final int centerLength = fragmentEndFromPreceding - fragmentStart;
+    final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * 
fragmentAlignment);
 
 Review comment:
   > Also; couldn't we subtract out the match length itself if it's unduly 
influencing the final snippet size?
   
   My original patch was so convoluted because that cannot be done easily in 
the current implementation with the BreakIterator base class. The call to 
`preceding()` on the LGBI receives the `matchStart` and returns the final start 
of the snippet. Since this method has no way to know the other end of the match 
it cannot adjust to the length of the match. Using private state the 
`following()` could know it, but that  would only be able to adjust on the 
right hand side and not the left.
   Something like `public int[] getPassageRange(int matchStart, int matchEnd) { 
... }` would allow a better implementation to be written. But that requires 
changing `protected final BreakIterator breakIterator;` inside 
`FieldHighlighter` to a custom BI base that has such method.
   
   > Realistically, do you imagine a user would want to use fragalign of 
anything other than 0.5?
   
   I'd say the margin values of `fragalign` are of little use, but having the 
ability to set it in the range of`0.35-0.65` could be worthwhile.
   
   I'm not sure if a `slop` parameter similar to RegexFragmenter's would be too 
beneficial.
   - The `RegexFragmenter` processes the whole stored text (until 
`hl.regex.maxAnalyzedChars`) and creates a list of breaks called `hotSpots`. 
Interestingly it also uses the tokenizer's offsets from the index (I think), 
because even if a break would be in the middle of a word it will not break it 
in half. Unless I'm completely missing something the regex seems to only give a 
strong recommendation to where a break should go.
   - Implementing an acceptable range (`slop`) would require to build some list 
of `hotSpots` or at least to look for them when extracting the fragment. For 
this we could only use `preceding()` and `following()` on the wrapped BI, which 
we wanted to avoid if possible for performance reasons.
   - For WORD BI the current solution is pretty accurate and is as optimal as 
it gets. (single call of `preceding()` and `following()` on the underlying BI)
   - For SENTENCE BI the slop would not help enough to be worth the trouble I 
think. Let's say I have to extend the snippet on the right side with 100 chars 
of a slop 0.4 (valid range: 60-140):
     - If the next sentence is 300 chars long there's nothing better to do what 
we already do.
     - If the next sentences have the lengths of [30, 50, 200, 40] the better 
choice would be to have the first two sentences appended with a sum length of 
80. (that's only 20 off from the goal) For this the LGBI can be created with 
`closestTo` mode instead of  `minLength` mode. It would cost an extra call of 
`preceding()` and `following()` each but would produce the better result 
instead of appending 3 of the following sentences with 280 total length.
   
   So we already have a `slop` of sorts... it's just a boolean that cannot be 
turned on right now as it is missing a parameter. I've mentioned [this 
todo](https://github.com/apache/lucene-solr/blob/e06ad4cfb587e1bb69a80c0b531df0e52c2da89b/solr/core/src/java/org/apache/solr/highlight/UnifiedSolrHighlighter.java#L330)
 about it before.
   
   After looking though all of this, my suggestion is to:
   - keep `hl.fragalign` and have it be `0.5` by default
   - add `hl.closestTo` with default `false`
     - a better parameter name would be welcome
     - we could consider making it automatically turned on if the BI is SENTENCE
   - do not worry about keeping a backward-compatible behavior with this change

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to