[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361847732 ## File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java ## @@ -173,8 +205,30 @@ private int moveToBreak(int idx) { // precondition: idx is a known break // called at start of new Passage given first word start offset @Override - public int preceding(int offset) { -return baseIter.preceding(offset); // no change needed + public int preceding(int matchStartIndex) { +final int targetIdx = (matchStartIndex - 1) - (int)(lengthGoal * fragmentAlignment); +if (targetIdx <= 0) { + return 0; +} +final int beforeIdx = baseIter.preceding(targetIdx + 1); +if (beforeIdx == DONE) { + return 0; Review comment: This again depends on what we want to accomplish. The original and correct behavior of a BreakIterator would be to return the DONE. Since I'm working on making the LGBI more correct, should I change these? Although `FieldHighlighter` will have to check for DONE if I remove these checks. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361840892 ## File path: lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/LengthGoalBreakIteratorTest.java ## @@ -39,65 +41,142 @@ // 0 1 // 01234567890123456789 static final String CONTENT = "Aa bb. Cc dd. Ee ff"; + static final String CONTENT2 = "Aa bb Cc dd X Ee ff Gg hh."; + + public void testFragmentAlignmentConstructor() throws IOException { +BreakIterator baseBI = new CustomSeparatorBreakIterator('.'); +// test fragmentAlignment validation +float[] valid_aligns = {0.f, 0.f, 0.5f, 0.99f, 1.f}; +for (float alignment : valid_aligns) { + LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment); +} +float[] invalid_aligns = {-0.01f, -1.f, 1.5f, Float.NaN, Float.NEGATIVE_INFINITY, Float.POSITIVE_INFINITY}; +for (float alignment : invalid_aligns) { + expectThrows(IllegalArgumentException.class, () -> { +LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment); + }); +} +// test backwards compatibility constructors +String backwardCompString = LengthGoalBreakIterator.createClosestToLength(baseBI, 50).toString(); +assertTrue(backwardCompString, backwardCompString.contains("fragAlign=0.0")); +backwardCompString = LengthGoalBreakIterator.createMinLength(baseBI, 50).toString(); +assertTrue(backwardCompString, backwardCompString.contains("fragAlign=0.0")); + } public void testTargetLen() throws IOException { // "goal" means target length goal to find closest break // at first word: Query query = query("aa"); -assertEquals("almost two sent", -"Aa bb.", highlightClosestToLen(CONTENT, query, 9)); Review comment: **EDIT**: NVM this, I've checked this and it was never a problem, but could be in the proposed version. I've not checked this but thinking about it makes me believe that previous to my changes this could have happened regardless, since the `following()` was called with `start` instead of `end - 1`. The reason it was not a problem could be that it would only happen at a ridiculously low `fragsize`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361840892 ## File path: lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/LengthGoalBreakIteratorTest.java ## @@ -39,65 +41,142 @@ // 0 1 // 01234567890123456789 static final String CONTENT = "Aa bb. Cc dd. Ee ff"; + static final String CONTENT2 = "Aa bb Cc dd X Ee ff Gg hh."; + + public void testFragmentAlignmentConstructor() throws IOException { +BreakIterator baseBI = new CustomSeparatorBreakIterator('.'); +// test fragmentAlignment validation +float[] valid_aligns = {0.f, 0.f, 0.5f, 0.99f, 1.f}; +for (float alignment : valid_aligns) { + LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment); +} +float[] invalid_aligns = {-0.01f, -1.f, 1.5f, Float.NaN, Float.NEGATIVE_INFINITY, Float.POSITIVE_INFINITY}; +for (float alignment : invalid_aligns) { + expectThrows(IllegalArgumentException.class, () -> { +LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment); + }); +} +// test backwards compatibility constructors +String backwardCompString = LengthGoalBreakIterator.createClosestToLength(baseBI, 50).toString(); +assertTrue(backwardCompString, backwardCompString.contains("fragAlign=0.0")); +backwardCompString = LengthGoalBreakIterator.createMinLength(baseBI, 50).toString(); +assertTrue(backwardCompString, backwardCompString.contains("fragAlign=0.0")); + } public void testTargetLen() throws IOException { // "goal" means target length goal to find closest break // at first word: Query query = query("aa"); -assertEquals("almost two sent", -"Aa bb.", highlightClosestToLen(CONTENT, query, 9)); Review comment: I've not checked this but thinking about it makes me believe that previous to my changes this could have happened regardless, since the `following()` was called with `start` instead of `end - 1`. The reason it was not a problem could be that it would only happen at a ridiculously low `fragsize`. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361840489 ## File path: lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/LengthGoalBreakIteratorTest.java ## @@ -39,65 +41,142 @@ // 0 1 // 01234567890123456789 static final String CONTENT = "Aa bb. Cc dd. Ee ff"; + static final String CONTENT2 = "Aa bb Cc dd X Ee ff Gg hh."; + + public void testFragmentAlignmentConstructor() throws IOException { +BreakIterator baseBI = new CustomSeparatorBreakIterator('.'); +// test fragmentAlignment validation +float[] valid_aligns = {0.f, 0.f, 0.5f, 0.99f, 1.f}; +for (float alignment : valid_aligns) { + LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment); +} +float[] invalid_aligns = {-0.01f, -1.f, 1.5f, Float.NaN, Float.NEGATIVE_INFINITY, Float.POSITIVE_INFINITY}; +for (float alignment : invalid_aligns) { + expectThrows(IllegalArgumentException.class, () -> { +LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment); + }); +} +// test backwards compatibility constructors +String backwardCompString = LengthGoalBreakIterator.createClosestToLength(baseBI, 50).toString(); +assertTrue(backwardCompString, backwardCompString.contains("fragAlign=0.0")); +backwardCompString = LengthGoalBreakIterator.createMinLength(baseBI, 50).toString(); +assertTrue(backwardCompString, backwardCompString.contains("fragAlign=0.0")); + } public void testTargetLen() throws IOException { // "goal" means target length goal to find closest break // at first word: Query query = query("aa"); -assertEquals("almost two sent", -"Aa bb.", highlightClosestToLen(CONTENT, query, 9)); Review comment: I've already thought about trying that, but there's one case I think could break and I didn't see it worthwhile to investigate further. Since the index tokenizer and the highlight separator may work on completely different characters, the separator could break inside the match if I allow it to process the chars of the match itself. In that case with very low `fragsize` or very uneven `fragAlignRatio` a break inside the match could be accepted and the [first assertion](https://github.com/apache/lucene-solr/blob/3ae1a0b3bad84cdfaa3941b87a1a7fcad63a66d4/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/Passage.java#L45) in `Passage.addMatch` would fail. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361839121 ## File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldHighlighter.java ## @@ -159,8 +160,9 @@ public Object highlightFieldForDoc(LeafReader reader, int docId, String content) break; } // advance breakIterator -passage.setStartOffset(Math.max(this.breakIterator.preceding(start + 1), 0)); -passage.setEndOffset(Math.min(this.breakIterator.following(start), contentLength)); +passage.setStartOffset(Math.max(this.breakIterator.preceding(start + 1), lastPassageEnd)); Review comment: > Without this small proposal, the length of this passage will be undersized That's incorrect. Such a fragment will be undersized either way. The current approach has the `fragsize` split up by `fragAlignRatio` statically. Even if there is not fulfilled expansion on the left, that won't be used on the right. We would only be moving the point where the `fragsize` on the left is truncated. BTW in the results these would-be-overlapping fragments get merged into a single snippet. So they'll become a bigger one instead of one normal- and one undersized. The only sure place we will receive an undersized snippet is when a match is at the very beginning or the end of the text. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361750286 ## File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java ## @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break // called at start of new Passage given first word start offset @Override public int preceding(int offset) { -return baseIter.preceding(offset); // no change needed +final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0 +fragmentEndFromPreceding = baseIter.following(fragmentStart); +if (fragmentEndFromPreceding == DONE) { + fragmentEndFromPreceding = baseIter.last(); +} +final int centerLength = fragmentEndFromPreceding - fragmentStart; +final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * fragmentAlignment); Review comment: > hl.closestTo ... I wish it simply always worked this way... I'm guessing whoever wrote the original UH decided against it for performance reasons. There is a [comment](https://github.com/apache/lucene-solr/blob/e06ad4cfb587e1bb69a80c0b531df0e52c2da89b/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java#L149) to suggest that. Also note that depending on which index is closer the BI may had to invoke `following()` one extra time inside `moveToBreak()`. I think `moveToBreak()` is only necessary when a default-summary is being made and I can remove it from the code when a match is being highlighted. The performance hit would be depending on the fragment's length that contains the goal index. In WORD mode I'd hope this is negligible. It will necessarily be worse for SENTENCE. You've brought up a new point to keep in mind: simplifying the options. I'll be back with updated code that we can discuss further. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361734439 ## File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java ## @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break // called at start of new Passage given first word start offset @Override public int preceding(int offset) { -return baseIter.preceding(offset); // no change needed +final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0 +fragmentEndFromPreceding = baseIter.following(fragmentStart); +if (fragmentEndFromPreceding == DONE) { + fragmentEndFromPreceding = baseIter.last(); +} +final int centerLength = fragmentEndFromPreceding - fragmentStart; +final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * fragmentAlignment); Review comment: > Also; couldn't we subtract out the match length itself if it's unduly influencing the final snippet size? My original patch was so convoluted because that cannot be done easily in the current implementation with the BreakIterator base class. The call to `preceding()` on the LGBI receives the `matchStart` and returns the final start of the snippet. Since this method has no way to know the other end of the match it cannot adjust to the length of the match. Using private state the `following()` could know it, but that would only be able to adjust on the right hand side and not the left. Something like `public int[] getPassageRange(int matchStart, int matchEnd) { ... }` would allow a better implementation to be written. But that requires changing `protected final BreakIterator breakIterator;` inside `FieldHighlighter` to a custom BI base that has such method. > Realistically, do you imagine a user would want to use fragalign of anything other than 0.5? I'd say the margin values of `fragalign` are of little use, but having the ability to set it in the range of`0.35-0.65` could be worthwhile. I'm not sure if a `slop` parameter similar to RegexFragmenter's would be too beneficial. - The `RegexFragmenter` processes the whole stored text (until `hl.regex.maxAnalyzedChars`) and creates a list of breaks called `hotSpots`. Interestingly it also uses the tokenizer's offsets from the index (I think), because even if a break would be in the middle of a word it will not break it in half. Unless I'm completely missing something the regex seems to only give a strong recommendation to where a break should go. - Implementing an acceptable range (`slop`) would require to build some list of `hotSpots` or at least to look for them when extracting the fragment. For this we could only use `preceding()` and `following()` on the wrapped BI, which we wanted to avoid if possible for performance reasons. - For WORD BI the current solution is pretty accurate and is as optimal as it gets. (single call of `preceding()` and `following()` on the underlying BI) - For SENTENCE BI the slop would not help enough to be worth the trouble I think. Let's say I have to extend the snippet on the right side with 100 chars of a slop 0.4 (valid range: 60-140): - If the next sentence is 300 chars long there's nothing better to do what we already do. - If the next sentences have the lengths of [30, 50, 200, 40] the better choice would be to have the first two sentences appended with a sum length of 80. (that's only 20 off from the goal) For this the LGBI can be created with `closestTo` mode instead of `minLength` mode. It would cost an extra call of `preceding()` and `following()` each but would produce the better result instead of appending 3 of the following sentences with 280 total length. So we already have a `slop` of sorts... it's just a boolean that cannot be turned on right now as it is missing a parameter. I've mentioned [this todo](https://github.com/apache/lucene-solr/blob/e06ad4cfb587e1bb69a80c0b531df0e52c2da89b/solr/core/src/java/org/apache/solr/highlight/UnifiedSolrHighlighter.java#L330) about it before. After looking though all of this, my suggestion is to: - keep `hl.fragalign` and have it be `0.5` by default - add `hl.closestTo` with default `false` - a better parameter name would be welcome - we could consider making it automatically turned on if the BI is SENTENCE - do not worry about keeping a backward-compatible behavior with this change This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at:
[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361663103 ## File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java ## @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break // called at start of new Passage given first word start offset @Override public int preceding(int offset) { -return baseIter.preceding(offset); // no change needed +final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0 +fragmentEndFromPreceding = baseIter.following(fragmentStart); +if (fragmentEndFromPreceding == DONE) { + fragmentEndFromPreceding = baseIter.last(); +} +final int centerLength = fragmentEndFromPreceding - fragmentStart; +final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * fragmentAlignment); Review comment: The change is the same in WORD mode too, but the match and the fragment of the match is mostly the same. It's just not that apparent compared to the SENTENCE mode. I'll try and clarify all the expressions: - Source text: `I'm looking for a beach ball for my vacation. I've lost the old one. I liked that.` - Query text: `vacation` - Match term is `Match{start=36, end=44, text=vacation}` thus `matchsize` is `8` - Match-fragment is the fragment that contains the match according to the underlying BI: - WORD: `vacation` because it breaks on space and period. - SENTENCE: `I'm looking for a beach ball for my vacation.` because it breaks on period. - `hl.bs.type=SENTENCE=10=0.0`: - Match fragment `I'm looking for a beach ball for my vacation.` has 1 extra char on the right, but 10 is requested. Will try to append next fragments to reach `match.end+((1-fragalign)*fragsize)` (which is `54`). - Next break after `54` is the end of the second sentence so the result snippet will be: `I'm looking for a beach ball for my vacation. I've lost the old one.` - `hl.bs.type=SENTENCE=10=0.5`: - Match fragment `I'm looking for a beach ball for my vacation.` has 36 extra chars on the left, while 5 is requested. Also 5 are requested on the right so the extension will happen just like in the previous example. The result will be `I'm looking for a beach ball for my vacation. I've lost the old one.`. - `hl.bs.type=SENTENCE=10=1.0`: - Match fragment `I'm looking for a beach ball for my vacation.` has 36 extra chars on the left, while 10 is requested. No extra chars are requested on the right so the match fragment is the result. As you can see the `fragsize` is **not** about the final size of the snippet, but about the text around the match. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361663103 ## File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java ## @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break // called at start of new Passage given first word start offset @Override public int preceding(int offset) { -return baseIter.preceding(offset); // no change needed +final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0 +fragmentEndFromPreceding = baseIter.following(fragmentStart); +if (fragmentEndFromPreceding == DONE) { + fragmentEndFromPreceding = baseIter.last(); +} +final int centerLength = fragmentEndFromPreceding - fragmentStart; +final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * fragmentAlignment); Review comment: The change is the same in WORD mode too, but the match and the fragment of the match is mostly the same. It's just not that apparent compared to the SENTENCE mode. I'll try and clarify all the expressions: - Source text: `I'm looking for a beach ball for my vacation. I've lost the old one. I liked that.` - Query text: `vacation` - Match term is `Match{start=36, end=44, text=vacation}` thus `matchsize` is `8` - Match-fragment is the fragment that contains the match according to the underlying BI: - WORD: `vacation` because it breaks on space and period. - SENTENCE: `I'm looking for a beach ball for my vacation.` because it breaks on period. - `hl.bs.type=SENTENCE=10=0.0`: - Match fragment `I'm looking for a beach ball for my vacation.` has 1 extra char on the right, but 10 is requested. Will try to append next fragments to reach `match.end+((1-fragalign)-fragsize)` (which is `54`). - Next break after `54` is the end of the second sentence so the result snippet will be: `I'm looking for a beach ball for my vacation. I've lost the old one.` - `hl.bs.type=SENTENCE=10=0.5`: - Match fragment `I'm looking for a beach ball for my vacation.` has 36 extra chars on the left, while 5 is requested. Also 5 are requested on the right so the extension will happen just like in the previous example. The result will be `I'm looking for a beach ball for my vacation. I've lost the old one.`. - `hl.bs.type=SENTENCE=10=1.0`: - Match fragment `I'm looking for a beach ball for my vacation.` has 36 extra chars on the left, while 10 is requested. No extra chars are requested on the right so the match fragment is the result. As you can see the `fragsize` is **not** about the final size of the snippet, but about the text around the match. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361663103 ## File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java ## @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break // called at start of new Passage given first word start offset @Override public int preceding(int offset) { -return baseIter.preceding(offset); // no change needed +final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0 +fragmentEndFromPreceding = baseIter.following(fragmentStart); +if (fragmentEndFromPreceding == DONE) { + fragmentEndFromPreceding = baseIter.last(); +} +final int centerLength = fragmentEndFromPreceding - fragmentStart; +final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * fragmentAlignment); Review comment: The change is the same in WORD mode too, but the match and the fragment of the match is mostly the same. It's just not that apparent compared to the SENTENCE mode. I'll try and clarify all the expressions: - Source text: `I'm looking for a beach ball for my vacation. I've lost the old one. I liked that.` - Query text: `vacation` - Match term is `Match{start=36, end=44, text=vacation}` thus `matchsize` is `8` - Match-fragment is the fragment that contains the match according to the underlying BI: - WORD: `vacation` because it breaks on space and period. - SENTENCE: `I'm looking for a beach ball for my vacation.` because it breaks on period. - `hl.bs.type=SENTENCE=10=0.0`: - Match fragment `I'm looking for a beach ball for my vacation.` has 1 extra char on the right, but 10 is requested. Will try to append next fragments to reach `match.end+((1-fragalign)-fragsize)` (which is `54`). - Next break after `54` is the end of the second sentence so the result snippet will be: `I'm looking for a beach ball for my vacation. I've lost the old one.` - `hl.bs.type=SENTENCE=10=0.5`: - Match fragment `I'm looking for a beach ball for my vacation.` has 36 extra chars on the left, while 5 is requested. Also 5 are requested on the right so the extension will happen just like in the previous example. The result will be `I'm looking for a beach ball for my vacation. I've lost the old one.`. - `hl.bs.type=SENTENCE=10=1.0`: - Match fragment `I'm looking for a beach ball for my vacation.` has 36 extra chars on the left, while 10 is requested. No extra chars are requested on the right so the match fragment is the result. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361655197 ## File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java ## @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break // called at start of new Passage given first word start offset @Override public int preceding(int offset) { -return baseIter.preceding(offset); // no change needed +final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0 +fragmentEndFromPreceding = baseIter.following(fragmentStart); +if (fragmentEndFromPreceding == DONE) { + fragmentEndFromPreceding = baseIter.last(); +} +final int centerLength = fragmentEndFromPreceding - fragmentStart; +final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * fragmentAlignment); Review comment: Correction: the snippet size is not `fragsize+matchsize`, but `min(sentence_left_of_match, fragsize*fragalign) + matchsize + min(sentence_right_of_match, fragsize*(1-fragalign))`. So if I want a left aligned highlight in SENTENCE mode, but the match is on the right end of the sentence, the whole `fragsize` would be added on the right. This is less accurate if you're concerned about the result snippet's size, but more accurate if you're concerned about the alignment of the match. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361533868 ## File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java ## @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break // called at start of new Passage given first word start offset @Override public int preceding(int offset) { -return baseIter.preceding(offset); // no change needed +final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0 +fragmentEndFromPreceding = baseIter.following(fragmentStart); +if (fragmentEndFromPreceding == DONE) { + fragmentEndFromPreceding = baseIter.last(); +} +final int centerLength = fragmentEndFromPreceding - fragmentStart; +final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * fragmentAlignment); Review comment: On the contrary! This suggestion is a little different than what I've thought of, so I'll have to rework the whole thing, but it has a great quality: This would make the length of the match irrelevant for the GLBI logic. I'm not going to go through all the details, but the code can be a little more optimal and less... hacky. I should have read this before writing [my previous comment](https://github.com/apache/lucene-solr/pull/1123#discussion_r361529545), now it is deprecated. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361530659 ## File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldHighlighter.java ## @@ -159,8 +160,9 @@ public Object highlightFieldForDoc(LeafReader reader, int docId, String content) break; } // advance breakIterator -passage.setStartOffset(Math.max(this.breakIterator.preceding(start + 1), 0)); Review comment: Even if this was not necessary before, the current version with `lastPassageEnd` does need `max()`. Without it the passages could overlap. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361529545 ## File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldHighlighter.java ## @@ -159,8 +160,9 @@ public Object highlightFieldForDoc(LeafReader reader, int docId, String content) break; } // advance breakIterator -passage.setStartOffset(Math.max(this.breakIterator.preceding(start + 1), 0)); -passage.setEndOffset(Math.min(this.breakIterator.following(start), contentLength)); +passage.setStartOffset(Math.max(this.breakIterator.preceding(start + 1), lastPassageEnd)); +lastPassageEnd = Math.min(this.breakIterator.following(start), contentLength); Review comment: Changing the `following(start)` to `following(end - 1)` would have been a great idea before this changeset. However these two calls on the `LengthGoalBreakIterator` are very specifically crafted. The `preceding()` prepares state data that `following()` will use. Currently the workflow is: 1. The call to `preceding(start + 1)` will identify the `startIdx` and `endIdx` around the match by the underlying BI. This `startIdx` can be earlier than the start of the match. Similarly this `endIdx` can be later than the end of the match. (this is the usual case for SENTENCE BI) Both are needed at this point to know how long the match is. Then `fragsize-(endIdx-startIdx)` is the remaining size to give as context around the first fragment. Then `context_size*fragalign` is the length `preceding()` will try to prepend to the fragment. Finally it will return the final start position. 2. The call to `following(start)` ignores the argument because the end of the match's fragment (`endIdx`) was already found in `preceding()`. `following()` will try to append the missing length also prepared by `preceding()`. Then return the final end position. This is what I was referring to in my comment when I submitted the first version of the patch. Ideally there was a single method on the `LengthGoalBreakIterator` that would receive the start **and** end positions of the match to work with. That way the processing would be more optimal. The underlying BI would not need to iterate over the content from start to end when searching for the `endIdx`. It could start from `end - 1` of the match. I have tried to change up this part of the code more, but a whole bunch of tests broke. I think there is some other BreakIterator subclass that gets passed around here. So this part can't be rewritten to cater to this specific behavior without greater changes. Finally, `max(preceding(start + 1) ,lastPassageEnd)` is needed because the `preceding()` call can go backwards and create overlaps with the previous passage. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361492729 ## File path: lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/LengthGoalBreakIteratorTest.java ## @@ -41,6 +43,31 @@ // 01234567890123456789 static final String CONTENT = "Aa bb. Cc dd. Ee ff"; + public void testFragmentAlignmentConstructor() throws IOException { +BreakIterator baseBI = new CustomSeparatorBreakIterator('.'); +// test fragmentAlignment validation +float[] valid_aligns = {0.f, 0.f, 0.5f, 0.99f, 1.f}; +for (float alignment : valid_aligns) { + LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment); +} +float[] invalid_aligns = {-0.01f, -1.f, 1.5f, Float.NaN, Float.NEGATIVE_INFINITY, Float.POSITIVE_INFINITY}; +for (float alignment : invalid_aligns) { + try { +LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment); +Assert.fail("Expected IllegalArgumentException for "+alignment); + } catch (IllegalArgumentException e) { +if (!e.getMessage().contains("fragmentAlignment")) { + throw e; +} + } +} +// test backwards compatibility constructors +String backwardCompString = LengthGoalBreakIterator.createClosestToLength(baseBI, 50).toString(); Review comment: I wasn't sure to write those or not. Are they an overkill? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361491499 ## File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java ## @@ -33,23 +33,45 @@ private final BreakIterator baseIter; private final int lengthGoal; + private final float fragmentAlignment; // how much text to align before match-fragment, valid in range [0, 1] private final boolean isMinimumLength; // if false then is "closest to" length + private int fragmentEndFromPreceding; // store the match-break end for reuse in following() + private int fragmentEndFollowingLengthGoalFromPreceding; // store the remaining length to collect in following() /** Breaks will be at least {@code minLength} apart (to the extent possible). */ + public static LengthGoalBreakIterator createMinLength(BreakIterator baseIter, int minLength, +float fragmentAlignment) { +return new LengthGoalBreakIterator(baseIter, minLength, fragmentAlignment,true); + } + + /** For backwards compatibility you can initialise the break iterator without fragmentAlignment. */ public static LengthGoalBreakIterator createMinLength(BreakIterator baseIter, int minLength) { -return new LengthGoalBreakIterator(baseIter, minLength, true); +return createMinLength(baseIter, minLength, 0.f); } /** Breaks will be on average {@code targetLength} apart; the closest break to this target (before or after) * is chosen. */ + public static LengthGoalBreakIterator createClosestToLength(BreakIterator baseIter, int targetLength, + float fragmentAlignment) { +return new LengthGoalBreakIterator(baseIter, targetLength, fragmentAlignment, false); + } + + /** For backwards compatibility you can initialise the break iterator without fragmentAlignment. */ public static LengthGoalBreakIterator createClosestToLength(BreakIterator baseIter, int targetLength) { -return new LengthGoalBreakIterator(baseIter, targetLength, false); +return createClosestToLength(baseIter, targetLength, 0.f); } - private LengthGoalBreakIterator(BreakIterator baseIter, int lengthGoal, boolean isMinimumLength) { + private LengthGoalBreakIterator(BreakIterator baseIter, int lengthGoal, float fragmentAlignment, + boolean isMinimumLength) { this.baseIter = baseIter; this.lengthGoal = lengthGoal; +if (fragmentAlignment < 0.f || fragmentAlignment > 1.f || !Float.isFinite(fragmentAlignment)) { + throw new IllegalArgumentException("fragmentAlignment must be >= zero and <= one"); +} +this.fragmentAlignment = Math.max(Math.min(fragmentAlignment, 1.f), 0.f); Review comment: I was too preoccupied with the testing, so I forgot about that. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361480642 ## File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java ## @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break // called at start of new Passage given first word start offset @Override public int preceding(int offset) { -return baseIter.preceding(offset); // no change needed +final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0 +fragmentEndFromPreceding = baseIter.following(fragmentStart); Review comment: AH I see! I somehow misinterpreted your question. You're mostly correct. The `baseIter.following(fragmentStart)` should be `baseIter.following(offset - 1)`. The minus one is needed for the very unlikely edge case where the match is a single character and the user requests the absolute minimal length for highlight. I added a little test that checks for that, you can see for yourself. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361432710 ## File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java ## @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break // called at start of new Passage given first word start offset @Override public int preceding(int offset) { -return baseIter.preceding(offset); // no change needed +final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0 +fragmentEndFromPreceding = baseIter.following(fragmentStart); Review comment: A little correction: the **offset** parameter is the start of the match and the fragmentStart could be different because of the issues described above. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361426082 ## File path: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java ## @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx is a known break // called at start of new Passage given first word start offset @Override public int preceding(int offset) { -return baseIter.preceding(offset); // no change needed +final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // convert DONE to 0 +fragmentEndFromPreceding = baseIter.following(fragmentStart); Review comment: Unfortunately no. The fragmentStart argument is the start of the match that could be anything depending on the tokenizer in the index analyzer chain. Even if we assume it's the start of a word or a phrase, the underlying BI can break on different places. In case of SENTENCE the preceding() call here will find the beginning of the sentence. In case of SEPARATOR, which is customizable by query, the breaks can be anywhere else. We could only assume fragmentStart is a break point if the underlying BI would be the same as the tokenizer in the index analyzer chain. (I'm not sure, but the query analyzer chain could be different I think.) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org