Re: Thinking about better highlighting

Giulio Cesare Solaroli Thu, 25 Aug 2005 01:04:36 -0700

Hi,

In our setup, we use two completely separate processes, one to
search and another to highlight the found documents. Both of these
processes are used by other applications through XML-RPC calls.


Our index is used to search the content of an Oracle DB; for this
reason, there are no stored fields in the Lucene index besides the ID
of the document in the DB schema (not the Lucene docID, but the
primary key of the DB record where the document content is
stored).

The ID returned by the Lucene search result is used to fetch the
document content from the DB.

When we need to highlight some text, the main application invokes the
highlighter application, passing it the text of the original query and
the content of the document.
The highlighter application creates a brand new in-memory index and
indexes the passed document straight away. After indexing the
document, it can find the matching text and return the original
text, enhanced with tags marking the text to highlight.
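
As a rough illustration of that last step, here is a minimal sketch in
Python. It is not the actual highlighter described above (which builds
an in-memory Lucene index); it only shows the general idea of analyzing
the query and the document the same way and wrapping the matching
spans of the original text in tags. The names analyze and highlight,
the toy word-splitting analyzer, and the <b> tags are all assumptions
made for the example.

```python
import re

# Toy analyzer (an assumption for this sketch): split on word characters
# and lowercase each term, keeping the character offsets of each token.
TOKEN = re.compile(r"\w+")


def analyze(text):
    """Return a list of (term, start, end) tuples for the given text."""
    return [(m.group().lower(), m.start(), m.end()) for m in TOKEN.finditer(text)]


def highlight(query, content, pre="<b>", post="</b>"):
    """Return the original content with every query term wrapped in tags."""
    query_terms = {term for term, _, _ in analyze(query)}
    pieces = []
    last = 0
    for term, start, end in analyze(content):
        if term in query_terms:
            # Copy the untouched text before the match, then the tagged match.
            pieces.append(content[last:start])
            pieces.append(pre + content[start:end] + post)
            last = end
    pieces.append(content[last:])
    return "".join(pieces)
```

Note that this sketch highlights each query term wherever it occurs; it
makes no attempt to honour phrase or proximity constraints.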

The only requirement for this arrangement to work is that the
searcher and the highlighter share the same analyzer implementation
with the process that has populated the main index. Otherwise, the
highlighter could be a completely independent application.
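
To see why the analyzer has to be shared, consider this small Python
sketch. The two toy analyzers and the matching_terms helper are
hypothetical and stand in for real Lucene analyzers; the point is only
that a document found with one analysis can yield no highlightable
terms when it is re-analyzed differently.

```python
import re


def lowercase_analyzer(text):
    """Toy analyzer assumed to be the one used to build the main index."""
    return [t.lower() for t in re.findall(r"\w+", text)]


def case_sensitive_analyzer(text):
    """A different toy analyzer that keeps the original case."""
    return re.findall(r"\w+", text)


def matching_terms(query, content, analyzer):
    """Return the content terms that match the query under one analyzer."""
    query_terms = set(analyzer(query))
    return [t for t in analyzer(content) if t in query_terms]


content = "Lucene highlights Lucene terms"
query = "lucene"

# Same analyzer as the index: both occurrences are found.
same = matching_terms(query, content, lowercase_analyzer)

# Mismatched analyzer: "Lucene" never equals "lucene", so nothing is found.
different = matching_terms(query, content, case_sensitive_analyzer)
```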

Our arrangement is probably not convenient for every use case, but it
has worked fine for us.

Hope this helps.

Regards,

Giulio Cesare


On 8/25/05, Paul Elschot <[EMAIL PROTECTED]> wrote:
> On Thursday 25 August 2005 04:47, Fred Toth wrote:
> > All,
> >
> > First, my thanks to those who've contributed to the current
> > best practices for highlighting. We use your code!
> >
> > However, after reviewing recent discussions about highlighting,
> > and struggling with our own highlighting issues, I'm wondering if
> > there's a better way.
> >
> > Others have certainly thought more about this (but I've thought
> > about it a lot).
> >
> > Isn't it true that the fundamental problem is that all of the highlighting
> > approaches are struggling with trying to recreate what the lucene core
> > has already done at search time?
> 
> Because storing those results takes memory, and most of these results
> would not be needed later on. In Lucene only the score is kept during
> search, and then only when it is high enough.
> One could extend the search core to keep only the highlighting info of
> these higher scoring docs, but that would slow down searching.
> 
> >
> > My simplest example is a phrase query, "brown fox". Why should we
> > have to attempt to simulate what lucene does in the highlighting code?
> > There are several attempts out there to solve this using various approaches,
> > span queries, custom hacks, etc., but all suffer from the same problem.
> > Namely, it's a lot of difficult work to correctly find the same terms
> > in the highlighting code that lucene has already found moments before.
> > So we end up highlighting "brown" and "fox" wherever they occur, not
> > just the phrase.
> >
> > I read with interest the recent discussion of using span queries to search
> > a single document to determine phrases, taking into account slop factor,
> > etc.
> ...
> 
> Getting PhraseQuery to work as a SpanQuery and as efficiently as it works
> now will not be straightforward but it might be possible.
> The NearSpansOrdered posted here might be a good starting point:
> http://issues.apache.org/bugzilla/show_bug.cgi?id=35823
> 
> One approach could be to redo the search, but limited to
> the documents to be highlighted, and only gathering the highlight
> positions from the Spans during that redone search.
> This can be fast when the search was just done and most index info
> needed is still cached by the operating system, and has the
> advantage that the highlights will be the same as the ones used to
> compute the document scores.
> 
> Regards,
> Paul Elschot
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
> 
>

