On Aug 24, 2005, at 7:47 PM, Fred Toth wrote:
However, after reviewing recent discussions about highlighting,
and struggling with our own highlighting issues, I'm wondering if
there's a better way.

Here's one way. This is the algo used by a developer's version of my Perl/C search engine library Kinosearch, which is on hold in Alpha while I have a go at optimizing Plucene.

Every ResultSet is an object which keeps track of document numbers, scores, and positions.

When two ResultSets are merged, the positions associated with document 1 in ResultSet A and the positions associated with the same document 1 in ResultSet B are merged into a single array.

Positions are used by the phrase matching engine while two ResultSets with a phrase relationship are being merged. In the resulting merged ResultSet, positions which didn't take part in a phrase match will have been filtered out.

[I can build an ascii illustration of this if it isn't clear.]

Conveniently, when we arrive at a final result set, each document is associated with an array of positions. If a search term wasn't part of a phrase query, then all the positions it occurred in are represented. If a term was part of a phrase query, only the positions that were part of a phrase match are represented.

Next, Kinosearch builds an array of actual token start offsets (measured in characters) in the target document. (The start offsets are stored using delta encoding alongside stored fields at index- time). The start offsets which represent matched terms are extracted out into an array. Each position is assigned a score based on its distance to other positions within a limited range (using an inverse log formula). The position with the highest score determines where the excerpt will be taken from.

Since we know the token start offsets, it's trivial to insert the highlight tags.

The downside of this approach is that it's quite expensive to load and keep track of all those positions during set merging. The upsides are that it is unnecessary to load the full analysis apparatus, it works fine with stemmed words, and there's no need to go back to retrieve positions from disk later. The larger your index, the more the downsides outweigh the upsides, because the work necessary to process the increasing number of positions grows linearly with index size, while the amount of work to perform post- search analysis on matched documents stays constant.

It occurs to me that a hybrid approach may be possible which addresses the phrase-matching conundrum. I'm not yet familiar with the guts of Lucene's searching, but from a high level it looks similar, so this might work...

Keep track of positions matched during phrase-matching. Use those for highlighting terms which are part of the phrase match. Use post- search analysis for highlighting anything that isn't part of a phrase.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

Reply via email to