Can you describe this in a little more detail; I'm not exactly sure what you
mean.

Break your large text documents into multiple Lucene documents. Rather than dividing them into entirely discrete chunks of text, consider storing/indexing *overlapping* sections of text, with an overlap as big as the largest "slop" factor you use on Phrase/Span queries, so that you don't cut any potential phrases in half and fail to match them. For example:

This non-overlapping indexing scheme will not match a search for "George Bush":

   Doc 1 = "....  outgoing president George "
   Doc 2 = "Bush stated that ..."

while this overlapping scheme will match:

   Doc 1 = "....  outgoing president George "
   Doc 2 = "president George Bush stated that ..."

This fragmenting approach helps avoid the performance cost of highlighting very large documents.
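A rough sketch of the overlap logic, independent of any particular Lucene version (the chunk size and overlap values here are illustrative, not from the original post):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OverlappingChunker {

    // Splits a token sequence into chunks of chunkSize tokens, where each
    // chunk repeats the last `overlap` tokens of the previous chunk.
    // `overlap` should be at least as big as the largest slop you use
    // on Phrase/Span queries, so no phrase is ever cut in half.
    static List<String> chunk(List<String> tokens, int chunkSize, int overlap) {
        List<String> chunks = new ArrayList<>();
        int step = chunkSize - overlap;
        for (int start = 0; start < tokens.size(); start += step) {
            int end = Math.min(start + chunkSize, tokens.size());
            chunks.add(String.join(" ", tokens.subList(start, end)));
            if (end == tokens.size()) break; // last chunk reached
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
            "the", "outgoing", "president", "George", "Bush",
            "stated", "that", "the", "economy", "was", "strong");
        // With chunkSize 5 and overlap 2, the phrase "George Bush"
        // survives intact in at least one chunk.
        for (String c : chunk(tokens, 5, 2)) {
            System.out.println(c);
        }
    }
}
```

Each chunk would then be indexed as its own Lucene document, along with a stored field identifying the original input document.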

The remaining issue is to remove duplicates from your search results when you match multiple chunks: e.g. Lucene Docs #1 and #2 both refer to Input Doc #1 and will both match a search for "president". You will need to store a field for the "original document number" and remove any duplicates (or merge them in the display, if that is what is required).
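The de-duplication step can be sketched like this, assuming hits arrive in score order and each carries the stored original-document id (the `Hit` class and field names here are hypothetical, not part of any Lucene API):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class HitDeduper {

    // Pairs a chunk's score with the stored "original document" id.
    static class Hit {
        final int origDocId;
        final float score;
        Hit(int origDocId, float score) {
            this.origDocId = origDocId;
            this.score = score;
        }
    }

    // Keeps only the first chunk seen per original document; because the
    // input list is assumed ranked by score, that is the best-scoring one.
    static List<Hit> dedupe(List<Hit> rankedHits) {
        Set<Integer> seen = new HashSet<>();
        List<Hit> out = new ArrayList<>();
        for (Hit h : rankedHits) {
            if (seen.add(h.origDocId)) {
                out.add(h);
            }
        }
        return out;
    }
}
```

Instead of dropping the extra chunks, you could equally collect them into a list per original document if you want to merge them in the display.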

Cheers,
Mark


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
