[jira] Commented: (SOLR-1826) highlighting breaks when using WordDelimiterFilter and setting termOffsets=true

2010-03-24 Thread Sanjoy Ghosh (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12849071#action_12849071
 ] 

Sanjoy Ghosh commented on SOLR-1826:


Just uploaded a patch that should fix this bug.  Please let me know if this is 
not the right fix.

 highlighting breaks when using WordDelimiterFilter and setting 
 termOffsets=true
 ---

 Key: SOLR-1826
 URL: https://issues.apache.org/jira/browse/SOLR-1826
 Project: Solr
  Issue Type: Bug
  Components: highlighter
Affects Versions: 1.4
Reporter: Stefan Oestreicher
 Attachments: SOLR-1826.patch, SOLR-1826.txt, SOLR-1826.txt, 
 SOLR-1826.txt


 When using the WordDelimiterFilter and setting termOffsets to true, the 
 highlighting breaks in certain cases. This did not happen in the 1.3 release.
 For example, if I index the term PowerShot.com and search for {{pow*}}, the 
 highlighting snippet contains {{<em>Power</em><em>PowerShot.com</em>}}.
 I will attach a patch which adds tests to the highlighter unit test to 
 demonstrate the issue.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (SOLR-1826) highlighting breaks when using WordDelimiterFilter and setting termOffsets=true

2010-03-22 Thread Sanjoy Ghosh (JIRA)

[ 
https://issues.apache.org/jira/browse/SOLR-1826?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12848271#action_12848271
 ] 

Sanjoy Ghosh commented on SOLR-1826:


Hi,

I investigated this some more.  The problem seems to be in 
org.apache.lucene.search.highlight.TokenSources, in the method

public static TokenStream getTokenStream(TermPositionVector tpv, boolean tokenPositionsGuaranteedContiguous) {

which ends with the following code to sort the tokens into original document 
order:

Arrays.sort(tokensInOriginalOrder, new Comparator(){
    public int compare(Object o1, Object o2)
    {
        Token t1=(Token) o1;
        Token t2=(Token) o2;
        if(t1.startOffset()>t2.endOffset())
            return 1;
        if(t1.startOffset()<t2.startOffset())
            return -1;
        return 0;
    }});

This does not sort the tokens into the right original order.  For 
highlighting to work correctly, the order should be

lorem, power, powershotcom, shot, com, ipsum

Instead we are getting

lorem, power, com, powershotcom, shot, ipsum

which confuses TokenGroup.isDistinct().
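
To make the ordering problem concrete, here is a small standalone sketch 
(no Lucene dependency).  Everything in it is illustrative: the Tok class, the 
sample text "lorem PowerShot.com ipsum" and its character offsets are my 
assumptions, and legacyCompare is only my reading of the comparison quoted 
above (start offset compared with ">" against the other token's end offset, 
then with "<" against its start offset).  The start-then-end ordering is one 
candidate that yields the expected sequence; it is not necessarily what the 
attached patch does.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class OffsetOrderSketch {

    /** Hypothetical stand-in for a Lucene Token: term text plus character offsets. */
    static final class Tok {
        final String term;
        final int start;
        final int end;

        Tok(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** The quoted comparison on raw offsets (operators assumed to be > and <). */
    static int legacyCompare(int start1, int end1, int start2, int end2) {
        if (start1 > end2) return 1;
        if (start1 < start2) return -1;
        return 0;
    }

    /** Candidate ordering: start offset first, end offset breaks ties. */
    static final Comparator<Tok> BY_OFFSETS =
            Comparator.comparingInt((Tok t) -> t.start).thenComparingInt(t -> t.end);

    /** Assumed tokens for "lorem PowerShot.com ipsum", listed in term (alphabetical) order. */
    static List<Tok> sample() {
        List<Tok> tokens = new ArrayList<>();
        tokens.add(new Tok("com", 16, 19));
        tokens.add(new Tok("ipsum", 20, 25));
        tokens.add(new Tok("lorem", 0, 5));
        tokens.add(new Tok("power", 6, 11));
        tokens.add(new Tok("powershotcom", 6, 19));
        tokens.add(new Tok("shot", 11, 15));
        return tokens;
    }

    /** Sorts the sample tokens with BY_OFFSETS and joins the terms with spaces. */
    static String sortedTerms() {
        List<Tok> tokens = sample();
        tokens.sort(BY_OFFSETS);
        StringBuilder sb = new StringBuilder();
        for (Tok t : tokens) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(t.term);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(sortedTerms());
        // "com" vs "powershotcom" and the reverse: the two results are not opposites.
        System.out.println(legacyCompare(16, 19, 6, 19) + " vs " + legacyCompare(6, 19, 16, 19));
    }
}
```

With these assumed offsets, comparing "com" against "powershotcom" gives 0 
while the reverse comparison gives -1, so the quoted comparison is not 
symmetric for overlapping tokens and the resulting order depends on how the 
sort happens to visit them.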

I would be happy to fix this bug.  

Should we fix this as a Lucene bug, or fix it in Solr by creating a new 
TokenStream that handles overlapping tokens correctly?

 highlighting breaks when using WordDelimiterFilter and setting 
 termOffsets=true
 ---

 Key: SOLR-1826
 URL: https://issues.apache.org/jira/browse/SOLR-1826
 Project: Solr
  Issue Type: Bug
  Components: highlighter
Affects Versions: 1.4
Reporter: Stefan Oestreicher
 Attachments: SOLR-1826.txt, SOLR-1826.txt, SOLR-1826.txt


 When using the WordDelimiterFilter and setting termOffsets to true, the 
 highlighting breaks in certain cases. This did not happen in the 1.3 release.
 For example, if I index the term PowerShot.com and search for {{pow*}}, the 
 highlighting snippet contains {{<em>Power</em><em>PowerShot.com</em>}}.
 I will attach a patch which adds tests to the highlighter unit test to 
 demonstrate the issue.
