Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361734439
########## File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java ########## @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break // called at start of new Passage given first word start offset @Override public int preceding(int offset) { - return baseIter.preceding(offset); // no change needed + final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0 + fragmentEndFromPreceding = baseIter.following(fragmentStart); + if (fragmentEndFromPreceding == DONE) { + fragmentEndFromPreceding = baseIter.last(); + } + final int centerLength = fragmentEndFromPreceding - fragmentStart; + final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * fragmentAlignment); Review comment: > Also; couldn't we subtract out the match length itself if it's unduly influencing the final snippet size? My original patch was so convoluted because that cannot be done easily in the current implementation with the BreakIterator base class. The call to `preceding()` on the LGBI receives the `matchStart` and returns the final start of the snippet. Since this method has no way to know the other end of the match it cannot adjust to the length of the match. Using private state the `following()` could know it, but that would only be able to adjust on the right hand side and not the left. Something like `public int[] getPassageRange(int matchStart, int matchEnd) { ... }` would allow a better implementation to be written. But that requires changing `protected final BreakIterator breakIterator;` inside `FieldHighlighter` to a custom BI base that has such method. > Realistically, do you imagine a user would want to use fragalign of anything other than 0.5? I'd say the margin values of `fragalign` are of little use, but having the ability to set it in the range of`0.35-0.65` could be worthwhile. I'm not sure if a `slop` parameter similar to RegexFragmenter's would be too beneficial. - The `RegexFragmenter` processes the whole stored text (until `hl.regex.maxAnalyzedChars`) and creates a list of breaks called `hotSpots`. Interestingly it also uses the tokenizer's offsets from the index (I think), because even if a break would be in the middle of a word it will not break it in half. Unless I'm completely missing something the regex seems to only give a strong recommendation to where a break should go. - Implementing an acceptable range (`slop`) would require to build some list of `hotSpots` or at least to look for them when extracting the fragment. For this we could only use `preceding()` and `following()` on the wrapped BI, which we wanted to avoid if possible for performance reasons. - For WORD BI the current solution is pretty accurate and is as optimal as it gets. (single call of `preceding()` and `following()` on the underlying BI) - For SENTENCE BI the slop would not help enough to be worth the trouble I think. Let's say I have to extend the snippet on the right side with 100 chars of a slop 0.4 (valid range: 60-140): - If the next sentence is 300 chars long there's nothing better to do what we already do. - If the next sentences have the lengths of [30, 50, 200, 40] the better choice would be to have the first two sentences appended with a sum length of 80. (that's only 20 off from the goal) For this the LGBI can be created with `closestTo` mode instead of `minLength` mode. It would cost an extra call of `preceding()` and `following()` each but would produce the better result instead of appending 3 of the following sentences with 280 total length. So we already have a `slop` of sorts... it's just a boolean that cannot be turned on right now as it is missing a parameter. I've mentioned [this todo](https://github.com/apache/lucene-solr/blob/e06ad4cfb587e1bb69a80c0b531df0e52c2da89b/solr/core/src/java/org/apache/solr/highlight/UnifiedSolrHighlighter.java#L330) about it before. After looking though all of this, my suggestion is to: - keep `hl.fragalign` and have it be `0.5` by default - add `hl.closestTo` with default `false` - a better parameter name would be welcome - we could consider making it automatically turned on if the BI is SENTENCE - do not worry about keeping a backward-compatible behavior with this change ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org