highlighting performance

2011-06-20 Thread Mike Sokolov
Our apps use highlighting, and I expected highlighting to be an expensive operation since it requires processing the text of the documents, but I ran a test and was surprised at just how expensive it is. I made a test index with three fields: path, modified, and contents. I made the index using

Re: highlighting performance

2011-06-20 Thread Koji Sekiguchi
Mike, FVH used to be faster for large docs. I wrote the FVH section for Lucene in Action, and it says: In contrib/benchmark (covered in appendix C), there’s an algorithm file called highlight-vs-vector-highlight.alg that lets you see the difference between two highlighters in processing time. As of

Re: highlighting performance

2011-06-20 Thread Michael Sokolov
Koji, I'm not familiar with the benchmarking system, but maybe I'll see if I can run that benchmark on my test data as a point of comparison. Thanks for the pointer! -Mike

Re: highlighting performance

2011-06-21 Thread Michael Sokolov
I did that, and the benchmark indicates FVH is 10x faster than Highlighter now. I ran with a subset of the Wikipedia data since I didn't want to deal with the whole thing. I'm trying to reconcile these weirdly varying results. One difference is that the benchmark doesn't use PhraseQueries -

Re: highlighting performance

2011-06-21 Thread Michael Sokolov
OK - it seems as if there is a blow-up in FieldPhraseList if a document has a large number of occurrences of a term that is in the query. In one example, I searched for "1", and this occurs just under 2000 times in one of my test documents (as the value of HTML attributes). Admittedly a weird
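
One way to find documents like the one described above, before profiling the highlighter, is to count how often each query term occurs per document. The following is a minimal sketch and makes several assumptions: the function names, the regex tokenizer (a stand-in for whatever analyzer the index really uses), and the threshold of 1000 are all hypothetical, and nothing here calls Lucene or reproduces FieldPhraseList's behavior.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Count lowercase alphanumeric tokens (assumed stand-in for a real analyzer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def flag_heavy_docs(docs, query_terms, threshold=1000):
    """Return (doc_id, term, count) for each query term whose count
    in a document reaches the (hypothetical) threshold."""
    flagged = []
    for doc_id, text in docs.items():
        freqs = term_frequencies(text)
        for term in query_terms:
            count = freqs[term.lower()]
            if count >= threshold:
                flagged.append((doc_id, term, count))
    return flagged


# Made-up example: an HTML-like document where "1" appears only as an attribute value.
docs = {
    "a.html": '<td colspan="1">' * 1990,
    "b.html": "plain text with 1 number",
}
heavy = flag_heavy_docs(docs, ["1"])
```

Documents flagged this way could then be timed separately to check whether they account for the slow highlighting runs.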

Re: highlighting performance

2011-06-22 Thread Itamar Syn-Hershko
I'm not intimately familiar with FVH myself, but that sounds reasonable. Tests usually don't lie. I'd definitely like to see a patched version that avoids that! Itamar.