[ https://issues.apache.org/jira/browse/SOLR-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848271#action_12848271 ]
Sanjoy Ghosh commented on SOLR-1826:
Hi,
I investigated this some more. The problem seems to be in
org.apache.lucene.search.highlight.TokenSources.java. The method
public static TokenStream getTokenStream(TermPositionVector tpv, boolean tokenPositionsGuaranteedContiguous) {
ends with the following code, which is meant to sort the tokens into original
document order:
    Arrays.sort(tokensInOriginalOrder, new Comparator(){
        public int compare(Object o1, Object o2)
        {
            Token t1=(Token) o1;
            Token t2=(Token) o2;
            if(t1.startOffset()>t2.endOffset())
                return 1;
            if(t1.startOffset()<t2.startOffset())
                return -1;
            return 0;
        }});
This does not sort the tokens into the correct original order. For
highlighting to work correctly, the order should be
lorem, power, powershotcom, shot, com, ipsum.
Instead we are getting lorem, power, com, powershotcom, shot, ipsum, which
confuses TokenGroup.isDistinct().
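To illustrate why the comparator quoted above can produce an unstable order, here is a minimal standalone sketch. It does not use Lucene: Span is a hypothetical stand-in for Token, the offsets assume the indexed text is "lorem PowerShot.com ipsum", and the comparison operators are assumed to be > and <. BY_START_THEN_END is only one possible total order, not necessarily the fix that should go into Lucene or Solr.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

public class OffsetOrderSketch {

    /** Hypothetical stand-in for a Lucene Token: a term plus its character offsets. */
    static final class Span {
        final String term;
        final int start;
        final int end;

        Span(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** The comparison logic from the quoted comparator (operators assumed to be > and <). */
    static int quotedCompare(Span t1, Span t2) {
        if (t1.start > t2.end) return 1;
        if (t1.start < t2.start) return -1;
        return 0;
    }

    /** One possible total order: start offset first, end offset as the tie-breaker. */
    static final Comparator<Span> BY_START_THEN_END =
            Comparator.comparingInt((Span s) -> s.start).thenComparingInt(s -> s.end);

    /** Sorts a copy of the spans with BY_START_THEN_END and returns the terms in order. */
    static List<String> sortedTerms(Span... spans) {
        Span[] copy = spans.clone();
        Arrays.sort(copy, BY_START_THEN_END);
        List<String> terms = new ArrayList<>();
        for (Span s : copy) {
            terms.add(s.term);
        }
        return terms;
    }

    public static void main(String[] args) {
        // Assumed offsets for the text "lorem PowerShot.com ipsum".
        Span lorem = new Span("lorem", 0, 5);
        Span power = new Span("power", 6, 11);
        Span powershotcom = new Span("powershotcom", 6, 19);
        Span shot = new Span("shot", 11, 15);
        Span com = new Span("com", 16, 19);
        Span ipsum = new Span("ipsum", 20, 25);

        // The quoted logic is not antisymmetric for overlapping tokens:
        System.out.println(quotedCompare(powershotcom, shot)); // prints -1
        System.out.println(quotedCompare(shot, powershotcom)); // prints 0

        // A total order gives the expected sequence regardless of input order.
        System.out.println(sortedTerms(lorem, power, com, powershotcom, shot, ipsum));
        // prints [lorem, power, powershotcom, shot, com, ipsum]
    }
}
```

Because compare(powershotcom, shot) returns -1 while compare(shot, powershotcom) returns 0, the quoted logic breaks the Comparator contract for overlapping tokens, so the result of Arrays.sort depends on the input order.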
I would be happy to fix this bug.
Should we fix this as a Lucene bug, or fix it in Solr by creating a new
TokenStream that handles overlapping tokens correctly?
highlighting breaks when using WordDelimiterFilter and setting
termOffsets=true
---
Key: SOLR-1826
URL: https://issues.apache.org/jira/browse/SOLR-1826
Project: Solr
Issue Type: Bug
Components: highlighter
Affects Versions: 1.4
Reporter: Stefan Oestreicher
Attachments: SOLR-1826.txt, SOLR-1826.txt, SOLR-1826.txt
When using the WordDelimiterFilter and setting termOffsets to true, the
highlighting breaks in certain cases. This did not happen in the 1.3 release.
For example, if I index the term PowerShot.com and search for {{pow*}}, the
highlighting snippet contains {{<em>Power</em><em>PowerShot.com</em>}}.
I will attach a patch which adds tests to the highlighter unit test to
demonstrate the issue.