elliotzlin opened a new pull request, #1069: URL: https://github.com/apache/lucene/pull/1069
### Description

[LUCENE-2587](https://issues.apache.org/jira/browse/LUCENE-2587)

The issue has a good write-up of the bug. To summarize: we start each new fragment at the end offset of the previous fragment instead of at the start offset of the fragment's first token, which can pull spurious un-analyzed characters into the fragment. In the test case, for example, punctuation is analyzed out when the string is tokenized, yet the highlighted fragment containing the hit starts with a period (`.`).

The fix starts each new fragment at the start offset of the token that leads it. We also store the end offset of the preceding fragment so that we can still determine whether contiguous fragments can be merged.
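To make the offset difference concrete, here is a minimal, self-contained sketch. It is not the highlighter code touched by this PR: `Token`, `Fragment`, `previousEnd`, and both helper methods are hypothetical names invented for illustration, and the sample string is made up. It assumes an analyzer that drops punctuation and reports character offsets for each token.

```java
import java.util.Arrays;
import java.util.List;

class FragmentOffsetSketch {

    /** Hypothetical token: character offsets into the original text. */
    static final class Token {
        final int start;
        final int end;

        Token(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    /** Hypothetical fragment: its own span plus the end offset of the fragment before it. */
    static final class Fragment {
        final int start;
        final int end;
        final int previousEnd;

        Fragment(int start, int end, int previousEnd) {
            this.start = start;
            this.end = end;
            this.previousEnd = previousEnd;
        }

        /** Contiguity check based on the stored end offset, not on this fragment's start. */
        boolean follows(Fragment other) {
            return previousEnd == other.end;
        }
    }

    /** Behavior described as the bug: the new fragment begins where the previous one ended. */
    static Fragment startAtPreviousEnd(Fragment previous, List<Token> tokens) {
        int end = tokens.get(tokens.size() - 1).end;
        return new Fragment(previous.end, end, previous.end);
    }

    /** Behavior described as the fix: the new fragment begins at its first token's start offset. */
    static Fragment startAtFirstToken(Fragment previous, List<Token> tokens) {
        int end = tokens.get(tokens.size() - 1).end;
        return new Fragment(tokens.get(0).start, end, previous.end);
    }

    static String text(String source, Fragment fragment) {
        return source.substring(fragment.start, fragment.end);
    }

    public static void main(String[] args) {
        String source = "foo bar. baz qux";
        // First fragment covers "foo bar" (offsets 0..7); the period at offset 7 is not a token.
        Fragment first = new Fragment(0, 7, 0);
        // Tokens of the second fragment: "baz" (9..12) and "qux" (13..16).
        List<Token> second = Arrays.asList(new Token(9, 12), new Token(13, 16));

        System.out.println(text(source, startAtPreviousEnd(first, second))); // prints ". baz qux"
        System.out.println(text(source, startAtFirstToken(first, second)));  // prints "baz qux"
        System.out.println(startAtFirstToken(first, second).follows(first)); // prints "true"
    }
}
```

In this sketch, once the second fragment starts at offset 9, comparing its start with the first fragment's end (7) no longer detects that the two are adjacent. That is presumably why the end offset of the preceding fragment has to be stored separately.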
