highlighting performance

Mike Sokolov Mon, 20 Jun 2011 13:20:48 -0700

Our apps use highlighting, and I expect that highlighting is an expensive operation since it requires processing the text of the documents, but I ran a test and was surprised just how expensive it is. I made a test index with three fields: path, modified, and contents. I made the index using org.apache.lucene.demo.IndexFiles modified so that the contents field is stored and analyzed:


          doc.add(new Field("contents", false, buf.toString(),
                  Store.YES, Index.ANALYZED, TermVector.WITH_POSITIONS_OFFSETS));

There are about 8000 documents in the index, and the contents field averages around 7500 bytes. The total index directory size is about 242M.

I ran a modified version of the demo.SearchFiles class that doesn't print anything out (printing results takes most of the time for faster queries), and runs random queries drawn from the text of the documents: these are a mix of (mostly) term queries, and about 20% phrase queries (that are phrases from the text).
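For anyone who wants to reproduce a similar query mix, here is a minimal sketch of drawing random queries from document text. It is not the code used in the test: the QuerySampler class is hypothetical, it assumes whitespace tokenization and two-word phrases, and it does no escaping of query-parser syntax characters.

```java
import java.util.Random;

/** Hypothetical helper: draws random term and phrase queries from a text. */
final class QuerySampler {
    private final String[] words;
    private final Random random;
    private final double phraseRatio;

    /** phraseRatio is the fraction of queries that should be phrases, e.g. 0.2. */
    QuerySampler(String text, double phraseRatio, long seed) {
        // Assumption: plain whitespace splitting stands in for a real analyzer.
        this.words = text.trim().split("\\s+");
        this.phraseRatio = phraseRatio;
        this.random = new Random(seed);
    }

    /** Returns one word, or (with probability phraseRatio) a quoted two-word phrase. */
    String next() {
        if (words.length >= 2 && random.nextDouble() < phraseRatio) {
            int i = random.nextInt(words.length - 1);
            return "\"" + words[i] + " " + words[i + 1] + "\"";
        }
        return words[random.nextInt(words.length)];
    }
}
```

Calling new QuerySampler(text, 0.2, seed).next() repeatedly gives roughly 80% single terms and 20% quoted phrases, all of which occur in the text; the fixed seed makes a run repeatable.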

I compared a few cases: no field access, un-highlighted retrieval, and highlighting with Highlighter and with FastVectorHighlighter, always asking for the 10 top-scoring docs per query, and running at least 1000 queries for each case.
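The qps figures below come from timing a batch of queries. A minimal sketch of that kind of measurement follows; the QpsTimer class and its warm-up parameter are hypothetical and not taken from the modified SearchFiles, and the Runnable stands in for one query plus whatever retrieval or highlighting is being measured.

```java
/** Hypothetical helper: times a batch of queries and reports queries per second. */
final class QpsTimer {
    /** Converts a query count and an elapsed time in nanoseconds into qps. */
    static double qps(int queries, long elapsedNanos) {
        return queries / (elapsedNanos / 1e9);
    }

    /** Runs the task warmup times untimed, then queries times timed, and returns qps. */
    static double measure(int warmup, int queries, Runnable task) {
        for (int i = 0; i < warmup; i++) {
            task.run(); // let the JIT and the OS file cache settle before timing
        }
        long start = System.nanoTime();
        for (int i = 0; i < queries; i++) {
            task.run();
        }
        // Guard against a zero elapsed time for trivially fast tasks.
        long elapsed = Math.max(1L, System.nanoTime() - start);
        return qps(queries, elapsed);
    }
}
```

Each case would then pass its own task, for example one that only searches, one that also loads the stored document, and one that also highlights.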

No field access at all gets about 7000 qps; basically we just call searcher.search(query, 10)


Then there is a big cost for retrieving the stored documents from the index:

Retrieving each document (calling searcher.doc(docID)) and the path field only (a small field) gets about 250 qps

As a comparison, if I don't store the contents field in the index (and don't retrieve it at all), I get similar performance to the no retrieval case (around 7000 qps). OK - so there is a fair amount of I/O required to retrieve the stored doc; this may be unavoidable, although do consider that for highlighting only a small portion of the doc may ultimately be required.


Then another big penalty is paid for highlighting:

Highlighter gets about 60 qps

And finally I am really mystified about this one:

FastVectorHighlighter gets about 20 qps. There is a lot of variance here (say 9-44 qps), although always worse than Highlighter.


If these results hold up I'll be astonished, since they imply:

(1) FVH is not fast

(2) Highlighting consumes most processing time (around 80%) in the best case, as compared to just retrieving un-highlighted documents.

and the follow-on is that, at least for users that need highlighting, there is hardly any point in optimizing anything else!
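The "around 80%" figure can be checked from the throughput numbers above: 250 qps is 4 ms per query, 60 qps is about 16.7 ms, and 20 qps is 50 ms. A small sketch of that arithmetic, assuming the cases differ only in the added highlighting work (the HighlightCost class is hypothetical):

```java
/** Hypothetical helper: turns the measured qps figures into per-query costs. */
final class HighlightCost {
    /** Average milliseconds spent per query at a given throughput. */
    static double millisPerQuery(double qps) {
        return 1000.0 / qps;
    }

    /**
     * Fraction of per-query time attributable to the extra work in the slower
     * case, i.e. (tSlow - tBase) / tSlow, which simplifies to 1 - slowQps / baseQps.
     */
    static double addedShare(double baseQps, double slowQps) {
        return 1.0 - slowQps / baseQps;
    }
}
```

With retrieval alone at 250 qps as the baseline, addedShare(250, 60) is 0.76 for Highlighter and addedShare(250, 20) is 0.92 for FastVectorHighlighter.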

I thought maybe FVH required a lot of memory, so I raised the heap to -Xmx512m (from the default: 64m I think), but this had no effect.

I also tried optimizing the index, and although this improved query performance somewhat across the board, it actually accentuated the cost of highlighting since the most marked improvement was in the basic unhighlighted query.


Here is what the highlighting looks like:

For FVH we allocate a single SimpleFragListBuilder, SimpleFragmentsBuilder, preTags[1], postTags[1] and DefaultEncoder so these don't have to be created for each query. We also cache the FastVectorHighlighter itself, and we call:

highlighter.getBestFragment(highlighter.getFieldQuery(query), searcher.getIndexReader(), hits[i].doc, "contents", 40, flb, fb, preTags, postTags, encoder);


once for each result.

In the Highlighter case, we also cache the Highlighter and call:

highlighter.getBestFragment(analyzer, "contents", doc.get("contents"));

Does this performance profile match up with your expectations? Did I do something stupid? Please let me know if I can provide more info. I'm considering what can be done to speed up highlighting, but don't want to go off half-cocked.


--
Michael Sokolov
Engineering Director
www.ifactory.com

