[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

2019-12-29 Thread GitBox
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified 
highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361847732
 
 

 ##
 File path: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##
 @@ -173,8 +205,30 @@ private int moveToBreak(int idx) { // precondition: idx 
is a known break
 
   // called at start of new Passage given first word start offset
   @Override
-  public int preceding(int offset) {
-return baseIter.preceding(offset); // no change needed
+  public int preceding(int matchStartIndex) {
+final int targetIdx = (matchStartIndex - 1) - (int)(lengthGoal * 
fragmentAlignment);
+if (targetIdx <= 0) {
+  return 0;
+}
+final int beforeIdx = baseIter.preceding(targetIdx + 1);
+if (beforeIdx == DONE) {
+  return 0;
 
 Review comment:
   This again depends on what we want to accomplish. The original and correct 
behavior of a BreakIterator would be to return the DONE. Since I'm working on 
making the LGBI more correct, should I change these? Although 
`FieldHighlighter` will have to check for DONE if I remove these checks.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

2019-12-29 Thread GitBox
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified 
highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361840892
 
 

 ##
 File path: 
lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/LengthGoalBreakIteratorTest.java
 ##
 @@ -39,65 +41,142 @@
   //  0 1
   //  01234567890123456789
   static final String CONTENT = "Aa bb. Cc dd. Ee ff";
+  static final String CONTENT2 = "Aa bb Cc dd X Ee ff Gg hh.";
+
+  public void testFragmentAlignmentConstructor() throws IOException {
+BreakIterator baseBI = new CustomSeparatorBreakIterator('.');
+// test fragmentAlignment validation
+float[] valid_aligns = {0.f, 0.f, 0.5f, 0.99f, 1.f};
+for (float alignment : valid_aligns) {
+  LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment);
+}
+float[] invalid_aligns = {-0.01f, -1.f, 1.5f, Float.NaN, 
Float.NEGATIVE_INFINITY, Float.POSITIVE_INFINITY};
+for (float alignment : invalid_aligns) {
+  expectThrows(IllegalArgumentException.class, () -> {
+LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment);
+  });
+}
+// test backwards compatibility constructors
+String backwardCompString = 
LengthGoalBreakIterator.createClosestToLength(baseBI, 50).toString();
+assertTrue(backwardCompString, 
backwardCompString.contains("fragAlign=0.0"));
+backwardCompString = LengthGoalBreakIterator.createMinLength(baseBI, 
50).toString();
+assertTrue(backwardCompString, 
backwardCompString.contains("fragAlign=0.0"));
+  }
 
   public void testTargetLen() throws IOException {
 // "goal" means target length goal to find closest break
 
 // at first word:
 Query query = query("aa");
-assertEquals("almost two sent",
-"Aa bb.", highlightClosestToLen(CONTENT, query, 9));
 
 Review comment:
   **EDIT**: NVM this, I've checked this and it was never a problem, but could 
be in the proposed version.
   I've not checked this but thinking about it makes me believe that previous 
to my changes this could have happened regardless, since the `following()` was 
called with `start` instead of `end - 1`. The reason it was not a problem could 
be that it would only happen at a ridiculously low `fragsize`.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

2019-12-29 Thread GitBox
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified 
highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361840892
 
 

 ##
 File path: 
lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/LengthGoalBreakIteratorTest.java
 ##
 @@ -39,65 +41,142 @@
   //  0 1
   //  01234567890123456789
   static final String CONTENT = "Aa bb. Cc dd. Ee ff";
+  static final String CONTENT2 = "Aa bb Cc dd X Ee ff Gg hh.";
+
+  public void testFragmentAlignmentConstructor() throws IOException {
+BreakIterator baseBI = new CustomSeparatorBreakIterator('.');
+// test fragmentAlignment validation
+float[] valid_aligns = {0.f, 0.f, 0.5f, 0.99f, 1.f};
+for (float alignment : valid_aligns) {
+  LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment);
+}
+float[] invalid_aligns = {-0.01f, -1.f, 1.5f, Float.NaN, 
Float.NEGATIVE_INFINITY, Float.POSITIVE_INFINITY};
+for (float alignment : invalid_aligns) {
+  expectThrows(IllegalArgumentException.class, () -> {
+LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment);
+  });
+}
+// test backwards compatibility constructors
+String backwardCompString = 
LengthGoalBreakIterator.createClosestToLength(baseBI, 50).toString();
+assertTrue(backwardCompString, 
backwardCompString.contains("fragAlign=0.0"));
+backwardCompString = LengthGoalBreakIterator.createMinLength(baseBI, 
50).toString();
+assertTrue(backwardCompString, 
backwardCompString.contains("fragAlign=0.0"));
+  }
 
   public void testTargetLen() throws IOException {
 // "goal" means target length goal to find closest break
 
 // at first word:
 Query query = query("aa");
-assertEquals("almost two sent",
-"Aa bb.", highlightClosestToLen(CONTENT, query, 9));
 
 Review comment:
   I've not checked this but thinking about it makes me believe that previous 
to my changes this could have happened regardless, since the `following()` was 
called with `start` instead of `end - 1`. The reason it was not a problem could 
be that it would only happen at a ridiculously low `fragsize`.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

2019-12-29 Thread GitBox
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified 
highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361840489
 
 

 ##
 File path: 
lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/LengthGoalBreakIteratorTest.java
 ##
 @@ -39,65 +41,142 @@
   //  0 1
   //  01234567890123456789
   static final String CONTENT = "Aa bb. Cc dd. Ee ff";
+  static final String CONTENT2 = "Aa bb Cc dd X Ee ff Gg hh.";
+
+  public void testFragmentAlignmentConstructor() throws IOException {
+BreakIterator baseBI = new CustomSeparatorBreakIterator('.');
+// test fragmentAlignment validation
+float[] valid_aligns = {0.f, 0.f, 0.5f, 0.99f, 1.f};
+for (float alignment : valid_aligns) {
+  LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment);
+}
+float[] invalid_aligns = {-0.01f, -1.f, 1.5f, Float.NaN, 
Float.NEGATIVE_INFINITY, Float.POSITIVE_INFINITY};
+for (float alignment : invalid_aligns) {
+  expectThrows(IllegalArgumentException.class, () -> {
+LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment);
+  });
+}
+// test backwards compatibility constructors
+String backwardCompString = 
LengthGoalBreakIterator.createClosestToLength(baseBI, 50).toString();
+assertTrue(backwardCompString, 
backwardCompString.contains("fragAlign=0.0"));
+backwardCompString = LengthGoalBreakIterator.createMinLength(baseBI, 
50).toString();
+assertTrue(backwardCompString, 
backwardCompString.contains("fragAlign=0.0"));
+  }
 
   public void testTargetLen() throws IOException {
 // "goal" means target length goal to find closest break
 
 // at first word:
 Query query = query("aa");
-assertEquals("almost two sent",
-"Aa bb.", highlightClosestToLen(CONTENT, query, 9));
 
 Review comment:
   I've already thought about trying that, but there's one case I think could 
break and I didn't see it worthwhile to investigate further.
   
   Since the index tokenizer and the highlight separator may work on completely 
different characters, the separator could break inside the match if I allow it 
to process the chars of the match itself. In that case with very low `fragsize` 
or very uneven `fragAlignRatio` a break inside the match could be accepted and 
the [first 
assertion](https://github.com/apache/lucene-solr/blob/3ae1a0b3bad84cdfaa3941b87a1a7fcad63a66d4/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/Passage.java#L45)
 in `Passage.addMatch` would fail.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

2019-12-29 Thread GitBox
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified 
highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361839121
 
 

 ##
 File path: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldHighlighter.java
 ##
 @@ -159,8 +160,9 @@ public Object highlightFieldForDoc(LeafReader reader, int 
docId, String content)
   break;
 }
 // advance breakIterator
-passage.setStartOffset(Math.max(this.breakIterator.preceding(start + 
1), 0));
-passage.setEndOffset(Math.min(this.breakIterator.following(start), 
contentLength));
+passage.setStartOffset(Math.max(this.breakIterator.preceding(start + 
1), lastPassageEnd));
 
 Review comment:
   > Without this small proposal, the length of this passage will be undersized
   
   That's incorrect. Such a fragment will be undersized either way. The current 
approach has the `fragsize` split up by `fragAlignRatio` statically. Even if 
there is not fulfilled expansion on the left, that won't be used on the right. 
We would only be moving the point where the `fragsize` on the left is truncated.
   
   BTW in the results these would-be-overlapping fragments get merged into a 
single snippet. So they'll become a bigger one instead of one normal- and one 
undersized. The only sure place we will receive an undersized snippet is when a 
match is at the very beginning or the end of the text.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

2019-12-27 Thread GitBox
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified 
highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361750286
 
 

 ##
 File path: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx 
is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-return baseIter.preceding(offset); // no change needed
+final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // 
convert DONE to 0
+fragmentEndFromPreceding = baseIter.following(fragmentStart);
+if (fragmentEndFromPreceding == DONE) {
+  fragmentEndFromPreceding = baseIter.last();
+}
+final int centerLength = fragmentEndFromPreceding - fragmentStart;
+final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * 
fragmentAlignment);
 
 Review comment:
   > hl.closestTo ... I wish it simply always worked this way...
   
   I'm guessing whoever wrote the original UH decided against it for 
performance reasons. There is a  
[comment](https://github.com/apache/lucene-solr/blob/e06ad4cfb587e1bb69a80c0b531df0e52c2da89b/lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java#L149)
 to suggest that. Also note that depending on which index is closer the BI may 
had to invoke `following()` one extra time inside `moveToBreak()`. I think 
`moveToBreak()` is only necessary when a default-summary is being made and I 
can remove it from the code when a match is being highlighted.
   
   The performance hit would be depending on the fragment's length that 
contains the goal index. In WORD mode I'd hope this is negligible. It will 
necessarily be worse for SENTENCE.
   
   You've brought up a new point to keep in mind: simplifying the options. I'll 
be back with updated code that we can discuss further.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

2019-12-27 Thread GitBox
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified 
highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361734439
 
 

 ##
 File path: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx 
is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-return baseIter.preceding(offset); // no change needed
+final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // 
convert DONE to 0
+fragmentEndFromPreceding = baseIter.following(fragmentStart);
+if (fragmentEndFromPreceding == DONE) {
+  fragmentEndFromPreceding = baseIter.last();
+}
+final int centerLength = fragmentEndFromPreceding - fragmentStart;
+final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * 
fragmentAlignment);
 
 Review comment:
   > Also; couldn't we subtract out the match length itself if it's unduly 
influencing the final snippet size?
   
   My original patch was so convoluted because that cannot be done easily in 
the current implementation with the BreakIterator base class. The call to 
`preceding()` on the LGBI receives the `matchStart` and returns the final start 
of the snippet. Since this method has no way to know the other end of the match 
it cannot adjust to the length of the match. Using private state the 
`following()` could know it, but that  would only be able to adjust on the 
right hand side and not the left.
   Something like `public int[] getPassageRange(int matchStart, int matchEnd) { 
... }` would allow a better implementation to be written. But that requires 
changing `protected final BreakIterator breakIterator;` inside 
`FieldHighlighter` to a custom BI base that has such method.
   
   > Realistically, do you imagine a user would want to use fragalign of 
anything other than 0.5?
   
   I'd say the margin values of `fragalign` are of little use, but having the 
ability to set it in the range of`0.35-0.65` could be worthwhile.
   
   I'm not sure if a `slop` parameter similar to RegexFragmenter's would be too 
beneficial.
   - The `RegexFragmenter` processes the whole stored text (until 
`hl.regex.maxAnalyzedChars`) and creates a list of breaks called `hotSpots`. 
Interestingly it also uses the tokenizer's offsets from the index (I think), 
because even if a break would be in the middle of a word it will not break it 
in half. Unless I'm completely missing something the regex seems to only give a 
strong recommendation to where a break should go.
   - Implementing an acceptable range (`slop`) would require to build some list 
of `hotSpots` or at least to look for them when extracting the fragment. For 
this we could only use `preceding()` and `following()` on the wrapped BI, which 
we wanted to avoid if possible for performance reasons.
   - For WORD BI the current solution is pretty accurate and is as optimal as 
it gets. (single call of `preceding()` and `following()` on the underlying BI)
   - For SENTENCE BI the slop would not help enough to be worth the trouble I 
think. Let's say I have to extend the snippet on the right side with 100 chars 
of a slop 0.4 (valid range: 60-140):
 - If the next sentence is 300 chars long there's nothing better to do what 
we already do.
 - If the next sentences have the lengths of [30, 50, 200, 40] the better 
choice would be to have the first two sentences appended with a sum length of 
80. (that's only 20 off from the goal) For this the LGBI can be created with 
`closestTo` mode instead of  `minLength` mode. It would cost an extra call of 
`preceding()` and `following()` each but would produce the better result 
instead of appending 3 of the following sentences with 280 total length.
   
   So we already have a `slop` of sorts... it's just a boolean that cannot be 
turned on right now as it is missing a parameter. I've mentioned [this 
todo](https://github.com/apache/lucene-solr/blob/e06ad4cfb587e1bb69a80c0b531df0e52c2da89b/solr/core/src/java/org/apache/solr/highlight/UnifiedSolrHighlighter.java#L330)
 about it before.
   
   After looking though all of this, my suggestion is to:
   - keep `hl.fragalign` and have it be `0.5` by default
   - add `hl.closestTo` with default `false`
 - a better parameter name would be welcome
 - we could consider making it automatically turned on if the BI is SENTENCE
   - do not worry about keeping a backward-compatible behavior with this change


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:

[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

2019-12-27 Thread GitBox
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified 
highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361663103
 
 

 ##
 File path: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx 
is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-return baseIter.preceding(offset); // no change needed
+final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // 
convert DONE to 0
+fragmentEndFromPreceding = baseIter.following(fragmentStart);
+if (fragmentEndFromPreceding == DONE) {
+  fragmentEndFromPreceding = baseIter.last();
+}
+final int centerLength = fragmentEndFromPreceding - fragmentStart;
+final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * 
fragmentAlignment);
 
 Review comment:
   The change is the same in WORD mode too, but the match and the fragment of 
the match is mostly the same. It's just not that apparent compared to the 
SENTENCE mode.
   
   I'll try and clarify all the expressions:
   - Source text: `I'm looking for a beach ball for my vacation. I've lost the 
old one. I liked that.`
   - Query text: `vacation`
   - Match term is `Match{start=36, end=44, text=vacation}` thus `matchsize` is 
`8`
   - Match-fragment is the fragment that contains the match according to the 
underlying BI:
 - WORD: `vacation` because it breaks on space and period.
 - SENTENCE: `I'm looking for a beach ball for my vacation.` because it 
breaks on period.
   - `hl.bs.type=SENTENCE=10=0.0`:
 - Match fragment `I'm looking for a beach ball for my vacation.` has 1 
extra char on the right, but 10 is requested. Will try to append next fragments 
to reach `match.end+((1-fragalign)*fragsize)` (which is `54`).
 - Next break after `54` is the end of the second sentence so the result 
snippet will be: `I'm looking for a beach ball for my vacation. I've lost the 
old one.`
   - `hl.bs.type=SENTENCE=10=0.5`:
 - Match fragment `I'm looking for a beach ball for my vacation.` has 36 
extra chars on the left, while 5 is requested. Also 5 are requested on the 
right so the extension will happen just like in the previous example. The 
result will be `I'm looking for a beach ball for my vacation. I've lost the old 
one.`.
   - `hl.bs.type=SENTENCE=10=1.0`:
 - Match fragment `I'm looking for a beach ball for my vacation.` has 36 
extra chars on the left, while 10 is requested. No extra chars are requested on 
the right so the match fragment is the result.
   
   As you can see the `fragsize` is **not** about the final size of the 
snippet, but about the text around the match.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

2019-12-27 Thread GitBox
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified 
highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361663103
 
 

 ##
 File path: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx 
is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-return baseIter.preceding(offset); // no change needed
+final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // 
convert DONE to 0
+fragmentEndFromPreceding = baseIter.following(fragmentStart);
+if (fragmentEndFromPreceding == DONE) {
+  fragmentEndFromPreceding = baseIter.last();
+}
+final int centerLength = fragmentEndFromPreceding - fragmentStart;
+final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * 
fragmentAlignment);
 
 Review comment:
   The change is the same in WORD mode too, but the match and the fragment of 
the match is mostly the same. It's just not that apparent compared to the 
SENTENCE mode.
   
   I'll try and clarify all the expressions:
   - Source text: `I'm looking for a beach ball for my vacation. I've lost the 
old one. I liked that.`
   - Query text: `vacation`
   - Match term is `Match{start=36, end=44, text=vacation}` thus `matchsize` is 
`8`
   - Match-fragment is the fragment that contains the match according to the 
underlying BI:
 - WORD: `vacation` because it breaks on space and period.
 - SENTENCE: `I'm looking for a beach ball for my vacation.` because it 
breaks on period.
   - `hl.bs.type=SENTENCE=10=0.0`:
 - Match fragment `I'm looking for a beach ball for my vacation.` has 1 
extra char on the right, but 10 is requested. Will try to append next fragments 
to reach `match.end+((1-fragalign)-fragsize)` (which is `54`).
 - Next break after `54` is the end of the second sentence so the result 
snippet will be: `I'm looking for a beach ball for my vacation. I've lost the 
old one.`
   - `hl.bs.type=SENTENCE=10=0.5`:
 - Match fragment `I'm looking for a beach ball for my vacation.` has 36 
extra chars on the left, while 5 is requested. Also 5 are requested on the 
right so the extension will happen just like in the previous example. The 
result will be `I'm looking for a beach ball for my vacation. I've lost the old 
one.`.
   - `hl.bs.type=SENTENCE=10=1.0`:
 - Match fragment `I'm looking for a beach ball for my vacation.` has 36 
extra chars on the left, while 10 is requested. No extra chars are requested on 
the right so the match fragment is the result.
   
   As you can see the `fragsize` is **not** about the final size of the 
snippet, but about the text around the match.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

2019-12-27 Thread GitBox
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified 
highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361663103
 
 

 ##
 File path: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx 
is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-return baseIter.preceding(offset); // no change needed
+final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // 
convert DONE to 0
+fragmentEndFromPreceding = baseIter.following(fragmentStart);
+if (fragmentEndFromPreceding == DONE) {
+  fragmentEndFromPreceding = baseIter.last();
+}
+final int centerLength = fragmentEndFromPreceding - fragmentStart;
+final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * 
fragmentAlignment);
 
 Review comment:
   The change is the same in WORD mode too, but the match and the fragment of 
the match is mostly the same. It's just not that apparent compared to the 
SENTENCE mode.
   
   I'll try and clarify all the expressions:
   - Source text: `I'm looking for a beach ball for my vacation. I've lost the 
old one. I liked that.`
   - Query text: `vacation`
   - Match term is `Match{start=36, end=44, text=vacation}` thus `matchsize` is 
`8`
   - Match-fragment is the fragment that contains the match according to the 
underlying BI:
 - WORD: `vacation` because it breaks on space and period.
 - SENTENCE: `I'm looking for a beach ball for my vacation.` because it 
breaks on period.
   - `hl.bs.type=SENTENCE=10=0.0`:
 - Match fragment `I'm looking for a beach ball for my vacation.` has 1 
extra char on the right, but 10 is requested. Will try to append next fragments 
to reach `match.end+((1-fragalign)-fragsize)` (which is `54`).
 - Next break after `54` is the end of the second sentence so the result 
snippet will be: `I'm looking for a beach ball for my vacation. I've lost the 
old one.`
   - `hl.bs.type=SENTENCE=10=0.5`:
 - Match fragment `I'm looking for a beach ball for my vacation.` has 36 
extra chars on the left, while 5 is requested. Also 5 are requested on the 
right so the extension will happen just like in the previous example. The 
result will be `I'm looking for a beach ball for my vacation. I've lost the old 
one.`.
   - `hl.bs.type=SENTENCE=10=1.0`:
 - Match fragment `I'm looking for a beach ball for my vacation.` has 36 
extra chars on the left, while 10 is requested. No extra chars are requested on 
the right so the match fragment is the result.
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

2019-12-27 Thread GitBox
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified 
highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361655197
 
 

 ##
 File path: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx 
is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-return baseIter.preceding(offset); // no change needed
+final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // 
convert DONE to 0
+fragmentEndFromPreceding = baseIter.following(fragmentStart);
+if (fragmentEndFromPreceding == DONE) {
+  fragmentEndFromPreceding = baseIter.last();
+}
+final int centerLength = fragmentEndFromPreceding - fragmentStart;
+final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * 
fragmentAlignment);
 
 Review comment:
   Correction: the snippet size is not `fragsize+matchsize`, but 
`min(sentence_left_of_match, fragsize*fragalign) + matchsize + 
min(sentence_right_of_match, fragsize*(1-fragalign))`.
   So if I want a left aligned highlight in SENTENCE mode, but the match is on 
the right end of the sentence, the whole `fragsize` would be added on the right.
   This is less accurate if you're concerned about the result snippet's size, 
but more accurate if you're concerned about the alignment of the match.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

2019-12-26 Thread GitBox
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified 
highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361533868
 
 

 ##
 File path: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx 
is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-return baseIter.preceding(offset); // no change needed
+final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // 
convert DONE to 0
+fragmentEndFromPreceding = baseIter.following(fragmentStart);
+if (fragmentEndFromPreceding == DONE) {
+  fragmentEndFromPreceding = baseIter.last();
+}
+final int centerLength = fragmentEndFromPreceding - fragmentStart;
+final int extraPrecedingLengthGoal = (int)((lengthGoal - centerLength) * 
fragmentAlignment);
 
 Review comment:
   On the contrary!
   
   This suggestion is a little different than what I've thought of, so I'll 
have to rework the whole thing, but it has a great quality: This would make the 
length of the match irrelevant for the GLBI logic. I'm not going to go through 
all the details, but the code can be a little more optimal and less... hacky.
   
   I should have read this before writing [my previous 
comment](https://github.com/apache/lucene-solr/pull/1123#discussion_r361529545),
 now it is deprecated.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

2019-12-26 Thread GitBox
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified 
highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361530659
 
 

 ##
 File path: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldHighlighter.java
 ##
 @@ -159,8 +160,9 @@ public Object highlightFieldForDoc(LeafReader reader, int 
docId, String content)
   break;
 }
 // advance breakIterator
-passage.setStartOffset(Math.max(this.breakIterator.preceding(start + 
1), 0));
 
 Review comment:
   Even if this was not necessary before, the current version with 
`lastPassageEnd` does need `max()`. Without it the passages could overlap.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

2019-12-26 Thread GitBox
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified 
highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361529545
 
 

 ##
 File path: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldHighlighter.java
 ##
 @@ -159,8 +160,9 @@ public Object highlightFieldForDoc(LeafReader reader, int 
docId, String content)
   break;
 }
 // advance breakIterator
-passage.setStartOffset(Math.max(this.breakIterator.preceding(start + 
1), 0));
-passage.setEndOffset(Math.min(this.breakIterator.following(start), 
contentLength));
+passage.setStartOffset(Math.max(this.breakIterator.preceding(start + 
1), lastPassageEnd));
+lastPassageEnd = Math.min(this.breakIterator.following(start), 
contentLength);
 
 Review comment:
   Changing the `following(start)` to `following(end - 1)` would have been a 
great idea before this changeset. However these two calls on the 
`LengthGoalBreakIterator` are very specifically crafted. The `preceding()` 
prepares state data that `following()` will use. Currently the workflow is:
   1. The call to `preceding(start + 1)` will identify the `startIdx` and 
`endIdx` around the match by the underlying BI. This `startIdx` can be earlier 
than the start of the match. Similarly this `endIdx` can be later than the end 
of the match. (this is the usual case for SENTENCE BI) Both are needed at this 
point to know how long the match is. Then `fragsize-(endIdx-startIdx)` is the 
remaining size to give as context around the first fragment. Then 
`context_size*fragalign` is the length `preceding()` will try to prepend to the 
fragment. Finally it will return the final start position.
   2. The call to `following(start)` ignores the argument because the end of 
the match's fragment (`endIdx`) was already found in `preceding()`. 
`following()` will try to append the missing length also prepared by 
`preceding()`. Then return the final end position.
   
   This is what I was referring to in my comment when I submitted the first 
version of the patch. Ideally there was a single method on the 
`LengthGoalBreakIterator` that would receive the start **and** end positions of 
the match to work with. That way the processing would be more optimal. The 
underlying BI would not need to iterate over the content from start to end when 
searching for the `endIdx`. It could start from `end - 1` of the match.
   
   I have tried to change up this part of the code more, but a whole bunch of 
tests broke. I think there is some other BreakIterator subclass that gets 
passed around here. So this part can't be rewritten to cater to this specific 
behavior without greater changes.
   
   Finally, `max(preceding(start + 1) ,lastPassageEnd)` is needed because the 
`preceding()` call can go backwards and create overlaps with the previous 
passage.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

2019-12-26 Thread GitBox
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified 
highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361492729
 
 

 ##
 File path: 
lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/LengthGoalBreakIteratorTest.java
 ##
 @@ -41,6 +43,31 @@
   //  01234567890123456789
   static final String CONTENT = "Aa bb. Cc dd. Ee ff";
 
+  public void testFragmentAlignmentConstructor() throws IOException {
+BreakIterator baseBI = new CustomSeparatorBreakIterator('.');
+// test fragmentAlignment validation
+float[] valid_aligns = {0.f, 0.f, 0.5f, 0.99f, 1.f};
+for (float alignment : valid_aligns) {
+  LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment);
+}
+float[] invalid_aligns = {-0.01f, -1.f, 1.5f, Float.NaN, 
Float.NEGATIVE_INFINITY, Float.POSITIVE_INFINITY};
+for (float alignment : invalid_aligns) {
+  try {
+LengthGoalBreakIterator.createClosestToLength(baseBI, 50, alignment);
+Assert.fail("Expected IllegalArgumentException for "+alignment);
+  } catch (IllegalArgumentException e) {
+if (!e.getMessage().contains("fragmentAlignment")) {
+  throw e;
+}
+  }
+}
+// test backwards compatibility constructors
+String backwardCompString = 
LengthGoalBreakIterator.createClosestToLength(baseBI, 50).toString();
 
 Review comment:
   I wasn't sure to write those or not. Are they an overkill?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

2019-12-26 Thread GitBox
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified 
highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361491499
 
 

 ##
 File path: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##
 @@ -33,23 +33,45 @@
 
   private final BreakIterator baseIter;
   private final int lengthGoal;
+  private final float fragmentAlignment; // how much text to align before 
match-fragment, valid in range [0, 1]
   private final boolean isMinimumLength; // if false then is "closest to" 
length
+  private int fragmentEndFromPreceding; // store the match-break end for reuse 
in following()
+  private int fragmentEndFollowingLengthGoalFromPreceding; // store the 
remaining length to collect in following()
 
   /** Breaks will be at least {@code minLength} apart (to the extent 
possible). */
+  public static LengthGoalBreakIterator createMinLength(BreakIterator 
baseIter, int minLength,
+float 
fragmentAlignment) {
+return new LengthGoalBreakIterator(baseIter, minLength, 
fragmentAlignment,true);
+  }
+
+  /** For backwards compatibility you can initialise the break iterator 
without fragmentAlignment. */
   public static LengthGoalBreakIterator createMinLength(BreakIterator 
baseIter, int minLength) {
-return new LengthGoalBreakIterator(baseIter, minLength, true);
+return createMinLength(baseIter, minLength, 0.f);
   }
 
   /** Breaks will be on average {@code targetLength} apart; the closest break 
to this target (before or after)
* is chosen. */
+  public static LengthGoalBreakIterator createClosestToLength(BreakIterator 
baseIter, int targetLength,
+  float 
fragmentAlignment) {
+return new LengthGoalBreakIterator(baseIter, targetLength, 
fragmentAlignment, false);
+  }
+
+  /** For backwards compatibility you can initialise the break iterator 
without fragmentAlignment. */
   public static LengthGoalBreakIterator createClosestToLength(BreakIterator 
baseIter, int targetLength) {
-return new LengthGoalBreakIterator(baseIter, targetLength, false);
+return createClosestToLength(baseIter, targetLength, 0.f);
   }
 
-  private LengthGoalBreakIterator(BreakIterator baseIter, int lengthGoal, 
boolean isMinimumLength) {
+  private LengthGoalBreakIterator(BreakIterator baseIter, int lengthGoal, 
float fragmentAlignment,
+  boolean isMinimumLength) {
 this.baseIter = baseIter;
 this.lengthGoal = lengthGoal;
+if (fragmentAlignment < 0.f || fragmentAlignment > 1.f || 
!Float.isFinite(fragmentAlignment)) {
+  throw new IllegalArgumentException("fragmentAlignment must be >= zero 
and <= one");
+}
+this.fragmentAlignment = Math.max(Math.min(fragmentAlignment, 1.f), 0.f);
 
 Review comment:
   I was too preoccupied with the testing, so I forgot about that.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

2019-12-26 Thread GitBox
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified 
highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361480642
 
 

 ##
 File path: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx 
is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-return baseIter.preceding(offset); // no change needed
+final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // 
convert DONE to 0
+fragmentEndFromPreceding = baseIter.following(fragmentStart);
 
 Review comment:
   AH I see! I somehow misinterpreted your question. You're mostly correct. The 
`baseIter.following(fragmentStart)` should be `baseIter.following(offset - 1)`. 
The minus one is needed for the very unlikely edge case where the match is a 
single character and the user requests the absolute minimal length for 
highlight. I added a little test that checks for that, you can see for yourself.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

2019-12-26 Thread GitBox
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified 
highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361432710
 
 

 ##
 File path: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx 
is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-return baseIter.preceding(offset); // no change needed
+final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // 
convert DONE to 0
+fragmentEndFromPreceding = baseIter.following(fragmentStart);
 
 Review comment:
   A little correction: the **offset** parameter is the start of the match and 
the fragmentStart could be different because of the issues described above.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified highlighter with word separator never gives context to the left

2019-12-26 Thread GitBox
Traktormaster commented on a change in pull request #1123: LUCENE-9093: Unified 
highlighter with word separator never gives context to the left
URL: https://github.com/apache/lucene-solr/pull/1123#discussion_r361426082
 
 

 ##
 File path: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/LengthGoalBreakIterator.java
 ##
 @@ -174,7 +175,48 @@ private int moveToBreak(int idx) { // precondition: idx 
is a known break
   // called at start of new Passage given first word start offset
   @Override
   public int preceding(int offset) {
-return baseIter.preceding(offset); // no change needed
+final int fragmentStart = Math.max(baseIter.preceding(offset), 0); // 
convert DONE to 0
+fragmentEndFromPreceding = baseIter.following(fragmentStart);
 
 Review comment:
   Unfortunately no. The fragmentStart argument is the start of the match that 
could be anything depending on the tokenizer in the index analyzer chain. Even 
if we assume it's the start of a word or a phrase, the underlying BI can break 
on different places. In case of SENTENCE the preceding() call here will find 
the beginning of the sentence. In case of SEPARATOR, which is customizable by 
query, the breaks can be anywhere else.
   We could only assume fragmentStart is a break point if the underlying BI 
would be the same as the tokenizer in the index analyzer chain. (I'm not sure, 
but the query analyzer chain could be different I think.)


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org