Re: Thinking about better highlighting

Marvin Humphrey Fri, 26 Aug 2005 01:29:02 -0700

On Aug 24, 2005, at 7:47 PM, Fred Toth wrote:

However, after reviewing recent discussions about highlighting,
and struggling with our own highlighting issues, I'm wondering if
there's a better way.

Here's one way. This is the algo used by a developer's version of my Perl/C search engine library Kinosearch, which is on hold in Alpha while I have a go at optimizing Plucene.

Every ResultSet is an object which keeps track of document numbers, scores, and positions.

When two ResultSets are merged, the positions associated with document 1 in ResultSet A and the positions associated with the same document 1 in ResultSet B are merged into a single array.
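
A rough sketch of that merge step, in Python rather than Kinosearch's Perl/C. It assumes a ResultSet can be modeled as a plain dict mapping document number to a sorted list of positions, and that the merge is a union (as for an OR query). The name merge_result_sets is made up for illustration, and scores are left out.

```python
from heapq import merge


def merge_result_sets(a, b):
    """Union-merge two {doc_num: sorted positions} mappings.

    Hypothetical model of a ResultSet: scores are omitted, and a document
    present in only one set keeps its positions unchanged.
    """
    merged = {}
    for doc_num in a.keys() | b.keys():
        # heapq.merge interleaves two already-sorted sequences in order.
        merged[doc_num] = list(merge(a.get(doc_num, []), b.get(doc_num, [])))
    return merged


set_a = {1: [3, 17], 2: [5]}
set_b = {1: [4, 40]}
combined = merge_result_sets(set_a, set_b)
```

Here document 1 ends up with a single position array drawn from both sets, and document 2 keeps the one position it had in set A.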

Positions are used by the phrase matching engine while two ResultSets with a phrase relationship are being merged. In the resulting merged ResultSet, positions which didn't take part in a phrase match will have been filtered out.


[I can build an ascii illustration of this if it isn't clear.]
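
In lieu of an illustration, here is a minimal sketch of the phrase filtering for a two-term phrase. It is Python rather than Perl/C, and it assumes positions are token indexes, so the second term must sit at exactly position + 1. The function name phrase_merge and the dict-based ResultSet model are assumptions for illustration only.

```python
def phrase_merge(first, second):
    """Merge two {doc_num: sorted positions} mappings for adjacent phrase terms.

    Only positions that take part in a phrase match survive; documents
    with no phrase match are dropped entirely.
    """
    merged = {}
    for doc_num in first.keys() & second.keys():
        second_positions = set(second[doc_num])
        kept = []
        for pos in first[doc_num]:
            # The second term has to follow the first term immediately.
            if pos + 1 in second_positions:
                kept.extend((pos, pos + 1))
        if kept:
            merged[doc_num] = sorted(set(kept))
    return merged


first_term = {1: [3, 17], 2: [5]}
second_term = {1: [4, 40], 2: [9]}
phrase_hits = phrase_merge(first_term, second_term)
```

In document 1, positions 3 and 4 form a phrase match and are kept, while 17 and 40 are filtered out. Document 2 contains both terms but never adjacent, so it drops out of the merged set.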

Conveniently, when we arrive at a final result set, each document is associated with an array of positions. If a search term wasn't part of a phrase query, then all the positions it occurred in are represented. If a term was part of a phrase query, only the positions that were part of a phrase match are represented.

Next, Kinosearch builds an array of actual token start offsets (measured in characters) in the target document. (The start offsets are stored using delta encoding alongside stored fields at index-time). The start offsets which represent matched terms are extracted out into an array. Each position is assigned a score based on its distance to other positions within a limited range (using an inverse log formula). The position with the highest score determines where the excerpt will be taken from.
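
A sketch of those steps in Python, with several assumptions flagged. The exact inverse log formula isn't given above, so the weighting 1 / log(2 + distance) is only a stand-in. The 50-character window, the function names, and the convention that the first delta is the first token's absolute offset are likewise made up for illustration.

```python
import math
from itertools import accumulate


def decode_start_offsets(deltas):
    """Turn delta-encoded start offsets back into absolute character offsets."""
    return list(accumulate(deltas))


def best_excerpt_offset(deltas, matched_positions, window=50):
    """Return the start offset of the highest-scoring matched token, or None.

    Assumed scoring: each matched token earns 1 / log(2 + distance) for every
    other matched token within `window` characters. This is a guess at an
    "inverse log" weighting, not the actual Kinosearch formula.
    """
    offsets = decode_start_offsets(deltas)
    hits = [offsets[pos] for pos in matched_positions]
    if not hits:
        return None

    def score(i):
        total = 0.0
        for j, other in enumerate(hits):
            distance = abs(hits[i] - other)
            if j != i and distance <= window:
                total += 1.0 / math.log(2 + distance)
        return total

    return hits[max(range(len(hits)), key=score)]


deltas = [0, 5, 5, 100, 5, 5, 5, 200]
excerpt_center = best_excerpt_offset(deltas, [0, 3, 4, 5, 7])
```

The matched tokens here start at offsets 0, 110, 115, 120 and 325. The one at 115 has two close neighbors and scores highest, so the excerpt would be taken from around that offset.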

Since we know the token start offsets, it's trivial to insert the highlight tags.
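
For instance, a minimal Python sketch of the tagging step. Only start offsets are mentioned above, so this assumes a token ends at the first non-word character after its start. The function name highlight and the default tags are hypothetical.

```python
import re

WORD = re.compile(r"\w+")


def highlight(text, start_offsets, open_tag="<b>", close_tag="</b>"):
    """Wrap the token beginning at each start offset in highlight tags.

    Assumption: a token runs from its start offset to the next non-word
    character. Offsets that don't land on a word character are skipped.
    """
    pieces = []
    cursor = 0
    for start in sorted(set(start_offsets)):
        if start < cursor:
            continue  # offset falls inside a token that is already tagged
        match = WORD.match(text, start)
        if match is None:
            continue
        pieces.append(text[cursor:start])
        pieces.append(open_tag + match.group() + close_tag)
        cursor = match.end()
    pieces.append(text[cursor:])
    return "".join(pieces)


tagged = highlight("the quick brown fox", [4, 16])
```

With the matched tokens starting at offsets 4 and 16, the result is "the <b>quick</b> brown <b>fox</b>".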

The downside of this approach is that it's quite expensive to load and keep track of all those positions during set merging. The upsides are that it is unnecessary to load the full analysis apparatus, it works fine with stemmed words, and there's no need to go back to retrieve positions from disk later. The larger your index, the more the downsides outweigh the upsides, because the work necessary to process the increasing number of positions grows linearly with index size, while the amount of work to perform post-search analysis on matched documents stays constant.

It occurs to me that a hybrid approach may be possible which addresses the phrase-matching conundrum. I'm not yet familiar with the guts of Lucene's searching, but from a high level it looks similar, so this might work...

Keep track of positions matched during phrase-matching. Use those for highlighting terms which are part of the phrase match. Use post-search analysis for highlighting anything that isn't part of a phrase.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

