Re: Thinking about better highlighting

Fred Toth Thu, 25 Aug 2005 08:21:23 -0700

Based on this discussion, I've gone back and re-read everything
in LIA on SpanQuery, etc.


Isn't this just another manifestation of the same problem? How
do I reliably and correctly convert an arbitrary Lucene query into its
equivalent SpanQuery?

Here's one, for example:

+text:"jurassic barnea" +author:zofer +year:[1987 TO 1987]

As you can imagine, ideally I would like to display the document with
the phrase highlighted, the author name highlighted and the year
highlighted.

Am I correct that there is no simple mechanism to get from the
above "standard" Lucene query to a SpanQuery that can give me
the offsets of the terms that actually matched? Am I forced to
pick apart the query, essentially reparsing it with a different
methodology, just to get "close" to what Lucene has already done?

Even if I could reliably convert a standard phrase query to a
SpanQuery, that's just the tip of the iceberg, right? What about prefix
queries, complex booleans, etc.? Is this a slippery slope?

Isn't it true that Lucene has already identified (somewhere) exactly which
occurrences of "jurassic" and "barnea" caused the phrase match?
I like the idea of reindexing and requerying the matched document at
highlight time, but I'm still lost on how to convert everything to SpanQuery
variants.
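
To make the "which occurrences caused the phrase match" question concrete,
here is a minimal sketch of how an exact phrase can be pinned down from
per-term token positions. This is not Lucene code: PhrasePositions and
phraseStarts are hypothetical names, and the sketch assumes the positions of
each phrase term (for one document and one field) are already available as
sorted arrays, with at least one term in the phrase.

```java
// Hypothetical helper, not part of Lucene. Illustrates exact phrase matching
// over token positions for a single document and field.
final class PhrasePositions {

    /**
     * positions[i] holds the sorted token positions of the i-th phrase term.
     * Returns every position where the whole phrase starts, i.e. every p such
     * that term i occurs at position p + i for all i.
     */
    static int[] phraseStarts(int[][] positions) {
        int[] first = positions[0];
        int[] hits = new int[first.length];
        int count = 0;
        for (int start : first) {
            boolean match = true;
            // Each later term must sit exactly i positions after the first term.
            for (int i = 1; i < positions.length && match; i++) {
                match = java.util.Arrays.binarySearch(positions[i], start + i) >= 0;
            }
            if (match) {
                hits[count++] = start;
            }
        }
        return java.util.Arrays.copyOf(hits, count);
    }
}
```

For example, if "jurassic" occurs at positions 0, 5 and 9 and "barnea" at
1, 7 and 10, phraseStarts returns the starts 0 and 9; the "jurassic" at
position 5 is not part of any phrase match and should not be highlighted.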

Or am I missing something here (always a distinct possibility)?

Thanks,

Fred

At 05:11 AM 8/25/2005, you wrote:

Unfortunately I've not had the time to address the
phrase highlighting issues in the current highlighter
but I think I've an idea as to how best to fix it:

I would suggest rewriting the highlighter to use Spans
rather than Terms to find the relevant sections in a text.
Most of the code required for such a solution is
around in one form or another but needs bringing
together:

* The SpansExtractor class here:
http://issues.apache.org/bugzilla/show_bug.cgi?id=35518
This can be used to get Spans for a given query and
IndexReader to show where all query hits for a
document lie.

* The contrib section includes a MemoryIndex that can
provide a fast IndexReader for a single document
(faster than using RAMDirectory).

* The LuceneInAction code example for SpanQueries
includes a rudimentary highlighter that uses Spans to
control where markup is introduced given a collection
of Spans (This does not attempt to summarise long docs
however).
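
As a rough illustration of the last point, the sketch below shows how a
collection of spans could drive where markup is introduced. It is not the
LIA example itself: SpanMarkup and highlight are hypothetical names, and the
sketch assumes each token's character offsets are known, that spans are given
as token positions with an exclusive end, and that the spans are sorted and
do not overlap.

```java
// Hypothetical helper, not taken from Lucene or the LIA code.
final class SpanMarkup {

    /**
     * tokenOffsets[i] = {startChar, endChar} of token i in the text.
     * spans[j] = {startPos, endPos} in token positions, endPos exclusive.
     * Spans are assumed to be sorted and non-overlapping.
     */
    static String highlight(String text, int[][] tokenOffsets, int[][] spans) {
        StringBuilder out = new StringBuilder();
        int cursor = 0; // next character of the text not yet copied
        for (int[] span : spans) {
            int from = tokenOffsets[span[0]][0];     // first char of first token
            int to = tokenOffsets[span[1] - 1][1];   // end char of last token
            out.append(text, cursor, from)
               .append("<b>")
               .append(text, from, to)
               .append("</b>");
            cursor = to;
        }
        // Copy whatever follows the last span unchanged.
        return out.append(text.substring(cursor)).toString();
    }
}
```

With the text "the jurassic barnea fauna", token offsets {0,3}, {4,12},
{13,19}, {20,25} and a single span {1,3}, this yields
"the <b>jurassic barnea</b> fauna". Summarising long documents would still
need a separate fragment selection step.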


The overall approach using this code would be to index
each doc to be highlighted in MemoryIndex, run
SpansExtractor using the (rewritten) user query and
the MemoryIndex's IndexReader, give the resulting
spans to an adapted LIA highlighter/summariser.

Some issues with this:
1) The contrib section would now have inter-project
dependencies (highlighter -> MemoryIndex) which would
need to be catered for in the Ant build process.
2) We may need to think about how we factor in IDF
weighting of individual terms to the summarising
process so that the more important terms influence the
selection of highlights.
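
One possible way to factor IDF into the summarising step is sketched below.
It is only an illustration: IdfFragmentScorer, idf and bestFragment are
hypothetical names, the idf formula is one common formulation (log of the
document count over document frequency plus one, plus one) rather than a
statement of what the highlighter does, and document frequencies are assumed
to have been looked up from the main index beforehand.

```java
// Hypothetical helper, not part of Lucene or the contrib highlighter.
final class IdfFragmentScorer {

    /** One common idf formulation; rarer terms get larger weights. */
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    /**
     * fragmentDocFreqs[f] lists the document frequency of each query term
     * that was hit inside candidate fragment f. Returns the index of the
     * fragment with the highest summed idf, or -1 if no fragment has a hit.
     */
    static int bestFragment(int[][] fragmentDocFreqs, int numDocs) {
        int best = -1;
        double bestScore = 0.0;
        for (int f = 0; f < fragmentDocFreqs.length; f++) {
            double score = 0.0;
            for (int docFreq : fragmentDocFreqs[f]) {
                score += idf(docFreq, numDocs);
            }
            if (score > bestScore) {
                bestScore = score;
                best = f;
            }
        }
        return best;
    }
}
```

With 1000 documents, a fragment containing one rare term (document frequency
1) outscores a fragment containing two common terms (document frequencies 500
and 400), so the rarer hit would be preferred when choosing what to show.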


Does this sound reasonable?








---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



