Unfortunately I've not had the time to address the phrase highlighting issues in the current highlighter, but I think I have an idea as to how best to fix them:
I would suggest rewriting the highlighter to use Spans, not Terms, to find the relevant sections in a text. Most of the code required for such a solution is around in one form or another but needs bringing together:

* The SpansExtractor class here: http://issues.apache.org/bugzilla/show_bug.cgi?id=35518
  This can be used to get Spans for a given query and IndexReader to show where all query hits for a document lie.
* The contrib section includes a MemoryIndex that can provide a fast IndexReader for a single document (faster than using RAMDirectory).
* The LuceneInAction code example for SpanQueries includes a rudimentary highlighter that uses Spans to control where markup is introduced given a collection of Spans (this does not attempt to summarise long docs, however).

The overall approach using this code would be to index each doc to be highlighted in MemoryIndex, run SpansExtractor using the (rewritten) user query and the MemoryIndex's IndexReader, and give the resulting spans to an adapted LIA highlighter/summariser.

Some issues with this:

1) The contrib section would now have inter-project dependencies (highlighter -> MemIndex) which would need to be catered for in the Ant build process.
2) We may need to think about how we factor in IDF weighting of individual terms to the summarising process so that the more important terms influence the selection of highlights.

Does this sound reasonable?
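
To make the last step of that pipeline concrete, here is a minimal sketch in plain Java with no Lucene dependency. It only illustrates how a collection of spans could drive where markup is introduced; it is not the SpansExtractor or LIA code. TokenSpan and highlight are hypothetical names, whitespace splitting stands in for a real Analyzer, and spans are assumed to be half-open ranges of token positions. It does no summarising and no IDF weighting.

```java
import java.util.ArrayList;
import java.util.List;

public class SpanMarkupSketch {

    // Hypothetical stand-in for one span hit: token positions [start, end).
    public static final class TokenSpan {
        final int start;
        final int end;

        public TokenSpan(int start, int end) {
            this.start = start;
            this.end = end;
        }
    }

    // Wraps each run of tokens covered by at least one span in <b>...</b>.
    public static String highlight(String text, List<TokenSpan> spans) {
        // Record the character offsets of every whitespace-delimited token
        // (an assumption; a real highlighter would take these from the Analyzer).
        List<int[]> offsets = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i >= text.length()) {
                break;
            }
            int tokenStart = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            offsets.add(new int[] {tokenStart, i});
        }

        // Mark the token positions covered by any span; overlapping spans merge.
        boolean[] hit = new boolean[offsets.size()];
        for (TokenSpan span : spans) {
            int from = Math.max(0, span.start);
            int to = Math.min(hit.length, span.end);
            for (int p = from; p < to; p++) {
                hit[p] = true;
            }
        }

        // Copy the text through, opening a tag at the start of each run of
        // hit tokens and closing it after the last token of that run.
        StringBuilder out = new StringBuilder();
        int copiedUpTo = 0;
        int p = 0;
        while (p < hit.length) {
            if (!hit[p]) {
                p++;
                continue;
            }
            int runStart = p;
            while (p < hit.length && hit[p]) {
                p++;
            }
            int from = offsets.get(runStart)[0];
            int to = offsets.get(p - 1)[1];
            out.append(text, copiedUpTo, from);
            out.append("<b>").append(text, from, to).append("</b>");
            copiedUpTo = to;
        }
        out.append(text, copiedUpTo, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        List<TokenSpan> spans = new ArrayList<>();
        spans.add(new TokenSpan(1, 3)); // a phrase hit on "quick brown"
        System.out.println(highlight("the quick brown fox", spans));
        // prints: the <b>quick brown</b> fox
    }
}
```

Because the markup follows span boundaries rather than individual terms, a phrase hit is tagged as one unit and a stray occurrence of one of its terms elsewhere in the document is left alone, which is the behaviour the current term-based highlighter gets wrong.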