Highlighting overlapping tokens

Pierre GOSSE Mon, 17 Jan 2011 07:27:16 -0800

Hi all,

I'm having an issue when highlighting fields that have overlapping tokens. 
A bug was opened in Jira some years ago 
(https://issues.apache.org/jira/browse/LUCENE-627), but I'm a bit confused 
about it. In Jira the bug's status is "resolved", yet I still get the exact 
same problem with an unmodified Lucene 2.9.3.


To find out what was going on, I checked 
org.apache.lucene.search.highlight.TokenSources, which rebuilds a TokenStream 
from TermVectors, and I found that the tokens were not sorted by offset, as 
one would expect.

When sorting tokens, the following comparator is used:

        public int compare(Object o1, Object o2)
        {
                Token t1=(Token) o1;
                Token t2=(Token) o2;
                if(t1.startOffset()>t2.endOffset())
                        return 1;
                if(t1.startOffset()<t2.startOffset())
                        return -1;
                return 0;
        }

I'm not sure why endOffset is used instead of startOffset in the first test 
(it looks like a typo), but with non-overlapping tokens this works just fine. 

With overlapping tokens, however, the longest token gets pushed to the end of 
its "overlapping zone": (big,3,6), (fish,7,11), ({big fish},3,11) end up 
sorted in this exact order, where I would have expected (big,3,6) ({big 
fish},3,11) (fish,7,11) or ({big fish},3,11) (big,3,6) (fish,7,11).
Highlighting with the term "{big fish}" then builds a fragment by 
concatenating "big", "{big fish}", and "fish", giving this phrase: 
"big<em>big fish</em> fish".
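
To make the ordering problem concrete, here is a minimal sketch that applies 
the same comparison logic to the three tokens from the example. Tok is a 
hypothetical stand-in for Lucene's Token (text plus offsets only), and the 
class and method names are mine, not Lucene's. It only prints pairwise 
results; the final order a sort produces may depend on the sort 
implementation and on the input order.

```java
final class OriginalComparatorDemo {
    // Hypothetical stand-in for Lucene's Token: text and offsets only.
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Same logic as the comparator quoted above, applied to the stand-in type.
    static int originalCompare(Tok t1, Tok t2) {
        if (t1.start > t2.end)
            return 1;
        if (t1.start < t2.start)
            return -1;
        return 0;
    }

    public static void main(String[] args) {
        Tok big = new Tok("big", 3, 6);
        Tok fish = new Tok("fish", 7, 11);
        Tok bigFish = new Tok("{big fish}", 3, 11);

        System.out.println(originalCompare(big, fish));     // -1: big sorts before fish
        System.out.println(originalCompare(fish, bigFish)); // 0: treated as equal
        System.out.println(originalCompare(big, bigFish));  // 0: treated as equal
        System.out.println(originalCompare(bigFish, fish)); // -1: not symmetric with the 0 above
    }
}
```

Since (fish, {big fish}) compares as 0, a sort that sees the tokens in the 
order big, fish, {big fish} has no reason to move {big fish} ahead of fish. 
The last line also shows that swapping the arguments gives a different 
answer, so the comparison is not even symmetric.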

I tested a quick fix by changing the comparator above as follows:

        public int compare(Object o1, Object o2)
        {
                Token t1 = (Token)o1;
                Token t2 = (Token)o2;
                if (t1.startOffset() > t2.startOffset())
                        return 1;
                if (t1.startOffset() < t2.startOffset())
                        return -1;
                if (t1.endOffset() < t2.endOffset())
                        return -1;
                if (t1.endOffset() > t2.endOffset())
                        return 1;
                return 0;
        }
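
As a quick check of that ordering outside Lucene, the sketch below sorts the 
three example tokens by start offset, then by end offset. Tok is again a 
hypothetical stand-in for Lucene's Token, and the class, field, and method 
names are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

final class FixedComparatorDemo {
    // Hypothetical stand-in for Lucene's Token: text and offsets only.
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    // Order by start offset first, then by end offset (shorter token first).
    static final Comparator<Tok> BY_START_THEN_END = new Comparator<Tok>() {
        public int compare(Tok t1, Tok t2) {
            if (t1.start != t2.start)
                return t1.start < t2.start ? -1 : 1;
            if (t1.end != t2.end)
                return t1.end < t2.end ? -1 : 1;
            return 0;
        }
    };

    // Returns the token texts in sorted order, leaving the input list untouched.
    static List<String> sortedTexts(List<Tok> tokens) {
        List<Tok> copy = new ArrayList<Tok>(tokens);
        Collections.sort(copy, BY_START_THEN_END);
        List<String> texts = new ArrayList<String>();
        for (Tok t : copy)
            texts.add(t.text);
        return texts;
    }

    public static void main(String[] args) {
        List<Tok> tokens = Arrays.asList(
                new Tok("big", 3, 6),
                new Tok("fish", 7, 11),
                new Tok("{big fish}", 3, 11));
        System.out.println(sortedTexts(tokens)); // [big, {big fish}, fish]
    }
}
```

Because this comparison is a consistent ordering on (startOffset, endOffset), 
the result does not depend on the order in which the tokens arrive.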

Highlighting behavior is now correct as far as I have tested it. 

Maybe the original sort order has a purpose I don't understand, but to me 
this slight modification seems to fix everything. What should I do? (I'm very 
new to this list and this community.) 

If someone with a better understanding of the Lucene highlighter could give me 
some feedback, I would be grateful.

Thanks for your time.

Pierre

