Re: Thinking about better highlighting

mark harwood Thu, 25 Aug 2005 02:11:54 -0700

Unfortunately I've not had the time to address the
phrase highlighting issues in the current highlighter
but I think I've an idea as to how best to fix it:


I would suggest rewriting the highlighter to use Spans
rather than Terms to find the relevant sections in a text.
Most of the code required for such a solution already
exists in one form or another but needs bringing
together:

* The SpansExtractor class here:
http://issues.apache.org/bugzilla/show_bug.cgi?id=35518
This can be used to get Spans for a given query and
IndexReader to show where all query hits for a
document lie.

* The contrib section includes a MemoryIndex that can
provide a fast IndexReader for a single document
(faster than using RAMDirectory).

* The LuceneInAction (LIA) code example for SpanQueries
includes a rudimentary highlighter that uses Spans to
control where markup is introduced given a collection
of Spans (this does not attempt to summarise long docs,
however).
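
To illustrate the last point, here is a minimal sketch of
span-driven markup. It is not the LIA code: the Range class
and the highlight method are hypothetical names, and it
assumes the spans have already been mapped to character
offsets (Lucene Spans report token positions, so a real
implementation would need that mapping step first).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class SpanMarkup {
    /** Half-open character range [start, end). Hypothetical type, not Lucene's Spans. */
    static final class Range {
        final int start, end;
        Range(int start, int end) { this.start = start; this.end = end; }
    }

    /** Wraps each (merged) range of the text in the given open/close markup. */
    static String highlight(String text, List<Range> ranges, String open, String close) {
        List<Range> sorted = new ArrayList<>(ranges);
        sorted.sort(Comparator.comparingInt((Range r) -> r.start));

        // Clamp to the text bounds and merge overlapping or touching ranges,
        // so that nested phrase hits never produce interleaved tags.
        List<Range> merged = new ArrayList<>();
        for (Range r : sorted) {
            int s = Math.max(0, r.start);
            int e = Math.min(text.length(), r.end);
            if (s >= e) continue; // empty or out-of-bounds range
            if (!merged.isEmpty() && s <= merged.get(merged.size() - 1).end) {
                Range last = merged.remove(merged.size() - 1);
                merged.add(new Range(last.start, Math.max(last.end, e)));
            } else {
                merged.add(new Range(s, e));
            }
        }

        // Copy the text through, inserting markup around each merged range.
        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (Range r : merged) {
            out.append(text, pos, r.start)
               .append(open).append(text, r.start, r.end).append(close);
            pos = r.end;
        }
        out.append(text, pos, text.length());
        return out.toString();
    }
}
```

Because the markup follows whole spans rather than
individual terms, a phrase hit is wrapped as one unit and
stray occurrences of its terms elsewhere are left alone.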


The overall approach using this code would be to index
each doc to be highlighted in MemoryIndex, run
SpansExtractor using the (rewritten) user query and
the MemoryIndex's IndexReader, then give the resulting
spans to an adapted LIA highlighter/summariser.

Some issues with this:
1) The contrib section would now have inter-project
dependencies (highlighter -> MemIndex) which would
need to be catered for in the Ant build process.
2) We may need to think about how we factor in IDF
weighting of individual terms to the summarising
process so that the more important terms influence the
selection of highlights.
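
As a rough sketch of point 2, candidate fragments could be
ranked by the summed IDF of the distinct query terms they
contain. Everything here is an assumption for illustration:
the class and method names are hypothetical, the IDF form
(1 + ln(numDocs / (docFreq + 1))) is just one common
choice, and the document frequencies would have to come
from the main index rather than the single-document
MemoryIndex.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;

final class IdfFragmentScorer {
    /** One common IDF form; rarer terms get a larger weight. */
    static double idf(int docFreq, int numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    /**
     * Scores a fragment by summing the IDF of each distinct query term in it.
     * docFreqs holds only query terms, so other tokens contribute nothing.
     */
    static double score(List<String> fragmentTerms, Map<String, Integer> docFreqs, int numDocs) {
        double total = 0.0;
        for (String term : new HashSet<>(fragmentTerms)) {
            Integer df = docFreqs.get(term);
            if (df != null) {
                total += idf(df, numDocs);
            }
        }
        return total;
    }

    /** Returns the index of the highest-scoring fragment, or -1 if there are none. */
    static int best(List<List<String>> fragments, Map<String, Integer> docFreqs, int numDocs) {
        int bestIndex = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < fragments.size(); i++) {
            double s = score(fragments.get(i), docFreqs, numDocs);
            if (s > bestScore) {
                bestScore = s;
                bestIndex = i;
            }
        }
        return bestIndex;
    }
}
```

With this weighting, a fragment containing one rare query
term outranks a fragment that only repeats a very common
one.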


Does this sound reasonable?





                