Hi all,

I'm having an issue when highlighting fields that have overlapping tokens. A bug was opened in Jira some years ago (https://issues.apache.org/jira/browse/LUCENE-627), but I'm a bit confused about it. In Jira the bug's status is "Resolved", yet I still get the exact same problem with an unmodified Lucene 2.9.3.
To find out what was going on, I checked org.apache.lucene.search.highlight.TokenSources, which rebuilds a TokenStream from TermVectors, and I found that the tokens were not sorted by offset as one would expect. When sorting tokens, the following comparer is used:

    public int compare(Object o1, Object o2) {
        Token t1 = (Token) o1;
        Token t2 = (Token) o2;
        if (t1.startOffset() > t2.endOffset())
            return 1;
        if (t1.startOffset() < t2.startOffset())
            return -1;
        return 0;
    }

I'm not sure why endOffset is used instead of startOffset in the first test (it looks like a typo). With non-overlapping tokens this works just fine, but with overlapping tokens the longest token gets pushed to the end of its "overlapping zone":

    (big,3,6), (fish,7,11), ({big fish},3,11)

would end up sorted in this exact order, where I would have expected either

    (big,3,6), ({big fish},3,11), (fish,7,11)

or

    ({big fish},3,11), (big,3,6), (fish,7,11)

Highlighting with the term "{big fish}" builds a fragment by concatenating "big", "{big fish}" and "fish", giving this phrase:

    big<em>big fish</em> fish

I tested a quick fix by changing the comparer above so that it orders by start offset first and then by end offset:

    public int compare(Object o1, Object o2) {
        Token t1 = (Token) o1;
        Token t2 = (Token) o2;
        if (t1.startOffset() > t2.startOffset())
            return 1;
        if (t1.startOffset() < t2.startOffset())
            return -1;
        if (t1.endOffset() < t2.endOffset())
            return -1;
        if (t1.endOffset() > t2.endOffset())
            return 1;
        return 0;
    }

Highlighting now behaves correctly as far as I have tested it. Maybe the original sort order has a purpose I don't understand, but to me this slight modification seems to fix everything.

What should I do? (I'm very new to this list and this community.) If someone with a better understanding of the Lucene highlighter could give me some feedback, I would be grateful.

Thanks for your time.

Pierre
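
P.S. In case it helps anyone look at this without building an index, here is a minimal standalone sketch of the two comparers quoted above. It does not use Lucene at all: Tok is a hypothetical stand-in for Token that only carries the text and the two offsets, and the class and method names (OverlapSortDemo, ORIGINAL, PROPOSED, sorted) are made up for this illustration. It only shows that the original comparer gives different answers depending on argument order for the overlapping pair, and what order the start-then-end comparer produces. I have deliberately not printed a sort result for the original comparer, since what a sort does with a comparer like that can depend on the sort implementation and on the input order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class OverlapSortDemo {

    /** Hypothetical stand-in for Lucene's Token: text plus start/end offsets. */
    static final class Tok {
        final String text;
        final int start;
        final int end;

        Tok(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "(" + text + "," + start + "," + end + ")";
        }
    }

    /** Same tests as the comparer quoted from TokenSources (start vs. end, then start vs. start). */
    static final Comparator<Tok> ORIGINAL = (t1, t2) -> {
        if (t1.start > t2.end) return 1;
        if (t1.start < t2.start) return -1;
        return 0;
    };

    /** Same tests as the proposed fix: order by start offset, then by end offset. */
    static final Comparator<Tok> PROPOSED = (t1, t2) -> {
        if (t1.start > t2.start) return 1;
        if (t1.start < t2.start) return -1;
        if (t1.end < t2.end) return -1;
        if (t1.end > t2.end) return 1;
        return 0;
    };

    /** Returns a sorted copy of the given tokens, leaving the arguments untouched. */
    static List<Tok> sorted(Comparator<Tok> cmp, Tok... tokens) {
        List<Tok> list = new ArrayList<>(Arrays.asList(tokens));
        list.sort(cmp);
        return list;
    }

    public static void main(String[] args) {
        Tok big = new Tok("big", 3, 6);
        Tok fish = new Tok("fish", 7, 11);
        Tok bigFish = new Tok("{big fish}", 3, 11);

        // The original comparer says "{big fish} < fish" one way round,
        // but "fish == {big fish}" the other way round.
        System.out.println(ORIGINAL.compare(bigFish, fish)); // prints -1
        System.out.println(ORIGINAL.compare(fish, bigFish)); // prints 0

        // The proposed comparer gives one well-defined order.
        System.out.println(sorted(PROPOSED, big, fish, bigFish));
        // prints [(big,3,6), ({big fish},3,11), (fish,7,11)]
    }
}
```

The second point is why I suspect the original is a typo rather than intentional: the two calls on the same pair of tokens do not agree with each other, whereas the start-then-end version is symmetric.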