Hi,

In our setup we use two completely separate processes, one to search and another to highlight the documents found. Both processes are used by other applications through XML-RPC calls.

Our index is used to search the content of an Oracle DB. For this reason, the Lucene index has no stored fields besides the ID of the document in the DB schema (not the Lucene docID, but the primary key of the DB record where the document content is stored). The ID returned in the Lucene search result is used to fetch the document content from the DB.

When we need to highlight some text, the main application invokes the highlighter application, passing the text of the original query and the content of the document. The highlighter application creates a brand new in-memory index and indexes the passed document straight away. After indexing the document, it can find the matching text and return the original text, enhanced with tags marking the text to highlight.

The only requirement for this arrangement to work is that the searcher and the highlighter share the same analyzer implementation with the process that populated the main index. Otherwise, the highlighter could be a completely independent application.

Our arrangement is probably not convenient for every use, but it has worked fine for us.

Hope this helps.

Regards,

Giulio Cesare

On 8/25/05, Paul Elschot <[EMAIL PROTECTED]> wrote:
> On Thursday 25 August 2005 04:47, Fred Toth wrote:
> > All,
> >
> > First, my thanks to those who've contributed to the current
> > best practices for highlighting. We use your code!
> >
> > However, after reviewing recent discussions about highlighting,
> > and struggling with our own highlighting issues, I'm wondering if
> > there's a better way.
> >
> > Others have certainly thought more about this (but I've thought
> > about it a lot).
> >
> > Isn't it true that the fundamental problem is that all of the highlighting
> > approaches are struggling with trying to recreate what the lucene core
> > has already done at search time?
>
> Because storing those results takes memory, and most of these results
> would not be needed later on. In Lucene only the score is kept during
> search, and then only when it is high enough.
> One could extend the search core to keep only the highlighting info of
> these higher scoring docs, but that would slow down searching.
>
> >
> > My simplest example is a phrase query, "brown fox". Why should we
> > have to attempt to simulate what lucene does in the highlighting code?
> > There are several attempts out there to solve this using various approaches,
> > span queries, custom hacks, etc., but all suffer from the same problem.
> > Namely, it's a lot of difficult work to correctly find the same terms
> > in the highlighting code that lucene has already found moments before.
> > So we end up highlighting "brown" and "fox" wherever they occur, not
> > just the phrase.
> >
> > I read with interest the recent discussion of using span queries to search
> > a single document to determine phrases, taking into account slop factor,
> > etc.
> ...
>
> Getting PhraseQuery to work as a SpanQuery and as efficiently as it works
> now will not be straightforward but it might be possible.
> The NearSpansOrdered posted here might be a good starting point:
> http://issues.apache.org/bugzilla/show_bug.cgi?id=35823
>
> One approach could be to redo the search, but limited to
> the documents to be highlighted, and only gathering the highlight
> positions from the Spans during that redone search.
> This can be fast when the search was just done and most index info
> needed is still cached by the operating system, and has the
> advantage that the highlights will be the same as the ones used to
> compute the document scores.
>
> Regards,
> Paul Elschot
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]
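
A minimal sketch of the highlighter step described at the top of this message, and of why the shared analyzer matters. It deliberately does not use the Lucene API: the class `HighlightSketch`, the methods `normalize`, `queryTerms` and `highlight`, the toy analysis rule (runs of letters or digits, lowercased) and the `<b>` tags are all hypothetical stand-ins chosen for illustration. The only point it tries to show is that the query text and the document text pass through the same analysis, so a term that matched at search time is also found at highlight time.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: a toy stand-in for "index the single document in
// memory, then mark what the query matches". Not Lucene code.
class HighlightSketch {
    // Toy "analyzer", part 1: a token is a run of letters or digits.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+");

    // Toy "analyzer", part 2: tokens are compared in lowercase.
    static String normalize(String token) {
        return token.toLowerCase(Locale.ROOT);
    }

    // Analyze the query text into the set of terms to look for.
    static Set<String> queryTerms(String query) {
        Set<String> terms = new HashSet<>();
        Matcher m = TOKEN.matcher(query);
        while (m.find()) {
            terms.add(normalize(m.group()));
        }
        return terms;
    }

    // Analyze the document text with the SAME rules and wrap every token
    // whose normalized form is a query term. Text between tokens
    // (spaces, punctuation) is copied through unchanged.
    static String highlight(String query, String content) {
        Set<String> terms = queryTerms(query);
        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(content);
        int last = 0;
        while (m.find()) {
            out.append(content, last, m.start());
            if (terms.contains(normalize(m.group()))) {
                out.append("<b>").append(m.group()).append("</b>");
            } else {
                out.append(m.group());
            }
            last = m.end();
        }
        out.append(content, last, content.length());
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: The quick <b>Brown</b> <b>fox</b>; a <b>brown</b> dog.
        System.out.println(highlight("brown fox", "The quick Brown fox; a brown dog."));
    }
}
```

If `highlight` lowercased the query but not the document (or the other way round), "Brown" would be found by the search and missed by the highlighter, which is the mismatch the shared-analyzer requirement avoids. The sketch also shows the limitation Fred describes in the quoted text: matching term by term marks "brown" wherever it occurs, not only inside the phrase "brown fox".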