[ 
https://issues.apache.org/jira/browse/SOLR-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848271#action_12848271
 ] 

Sanjoy Ghosh commented on SOLR-1826:
------------------------------------

Hi,

I investigated this some more.  The problem seems to be in
org.apache.lucene.search.highlight.TokenSources.  The method

public static TokenStream getTokenStream(TermPositionVector tpv,
        boolean tokenPositionsGuaranteedContiguous)

ends with the following code, which is meant to sort the tokens into
original document order:

            Arrays.sort(tokensInOriginalOrder, new Comparator(){
                public int compare(Object o1, Object o2)
                {
                    Token t1=(Token) o1;
                    Token t2=(Token) o2;
                    if(t1.startOffset()>t2.endOffset())
                        return 1;
                    if(t1.startOffset()<t2.startOffset())
                        return -1;
                    return 0;
                }});

This does not sort the tokens into their original order.  For highlighting
to work correctly, the order should be

    lorem, power, powershotcom, shot, com, ipsum

Instead we are getting

    lorem, power, com, powershotcom, shot, ipsum

which confuses TokenGroup.isDistinct().
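
One possible direction, as a rough sketch rather than a proposed patch: order
the tokens by start offset and break ties by end offset, so that overlapping
tokens always compare consistently.  The comparator quoted above returns 0
whenever two tokens overlap (for example, with the offsets assumed below,
"com" at 16-19 against "powershotcom" at 6-19), so their relative order is
left to the sort.  The Tok class here is a hypothetical stand-in for Lucene's
Token, and the offsets are my assumption for the text
"lorem PowerShot.com ipsum"; they are not taken from the failing test.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class OffsetOrderSketch {

    // Hypothetical stand-in for Lucene's Token: a term plus its character offsets.
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    // Total order: start offset first, then end offset to break ties.
    static final Comparator<Tok> BY_OFFSETS =
        Comparator.comparingInt((Tok t) -> t.start).thenComparingInt(t -> t.end);

    // Returns the terms of the given tokens sorted with BY_OFFSETS.
    static List<String> sortedTerms(List<Tok> tokens) {
        List<Tok> copy = new ArrayList<>(tokens);
        copy.sort(BY_OFFSETS);
        List<String> terms = new ArrayList<>();
        for (Tok t : copy) {
            terms.add(t.term);
        }
        return terms;
    }

    // Assumed offsets for "lorem PowerShot.com ipsum", listed term by term
    // (alphabetically), which is not document order.
    static List<String> demo() {
        List<Tok> tokens = Arrays.asList(
            new Tok("com", 16, 19),
            new Tok("ipsum", 20, 25),
            new Tok("lorem", 0, 5),
            new Tok("power", 6, 11),
            new Tok("powershotcom", 6, 19),
            new Tok("shot", 11, 15));
        return sortedTerms(tokens);
    }

    public static void main(String[] args) {
        // Prints [lorem, power, powershotcom, shot, com, ipsum]
        System.out.println(demo());
    }
}
```

With these assumed offsets the sketch yields the order listed above.  Whether
end offset is the right tie-breaker for every filter configuration is
something I have not checked.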

I would be happy to fix this bug.

Should we fix this as a Lucene bug, or fix it in Solr by creating a new
TokenStream that handles overlapping tokens correctly?

> highlighting breaks when using WordDelimiterFilter and setting 
> termOffsets=true
> -------------------------------------------------------------------------------
>
>                 Key: SOLR-1826
>                 URL: https://issues.apache.org/jira/browse/SOLR-1826
>             Project: Solr
>          Issue Type: Bug
>          Components: highlighter
>    Affects Versions: 1.4
>            Reporter: Stefan Oestreicher
>         Attachments: SOLR-1826.txt, SOLR-1826.txt, SOLR-1826.txt
>
>
> When using the WordDelimiterFilter and setting termOffsets to true the 
> highlighting breaks in certain cases. This did not happen in the 1.3 release.
> For example, if I index the term "PowerShot.com" and search for {{pow*}} the 
> highlighting snippet contains {{<em>Power</em><em>PowerShot.com</em>}}.
> I will attach a patch which adds tests to the highlighter unittest to 
> demonstrate the issue.

