On 08/12/2011 01:41, Marvin Humphrey wrote:
These numbers are great, and in line with some benchmarks I was also running
today (raw data below).  StandardTokenizer and Normalizer are considerably
faster than RegexTokenizer and the current implementation of CaseFolder, and
thus the proposed EasyAnalyzer (StandardTokenizer, Normalizer,
SnowballStemmer) outperforms PolyAnalyzer (CaseFolder, RegexTokenizer,
SnowballStemmer) by a wide margin:

     Time to index 1000 docs (10 reps, truncated mean)
     =================================================
     PolyAnalyzer   0.576 secs
     EasyAnalyzer   0.436 secs
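
For reference, here is roughly what the "truncated mean" calculation
looks like, as a toy C sketch (made-up timings; the exact amount of
trimming is an assumption, and this is not the actual benchmark script):

    #include <stdio.h>
    #include <stdlib.h>

    /* Compare doubles for qsort. */
    static int
    cmp_double(const void *a, const void *b) {
        double x = *(const double*)a;
        double y = *(const double*)b;
        return (x > y) - (x < y);
    }

    /* Sort the timings, drop the fastest and slowest rep, and average
     * the rest.  Assumes n > 2. */
    static double
    truncated_mean(double *times, size_t n) {
        qsort(times, n, sizeof(double), cmp_double);
        double sum = 0.0;
        for (size_t i = 1; i + 1 < n; i++) {
            sum += times[i];
        }
        return sum / (double)(n - 2);
    }

    int
    main(void) {
        double reps[10] = { 0.44, 0.43, 0.45, 0.43, 0.44,
                            0.46, 0.43, 0.44, 0.43, 0.52 };
        printf("truncated mean: %.3f secs\n", truncated_mean(reps, 10));
        return 0;
    }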

Here is more data from a real-world indexing run:

    RegexTokenizer + CaseFolder:     139 secs
    StandardTokenizer + Normalizer:  112 secs

Can't wait for StandardTokenizer to land in trunk!

I don't have any further work planned, so the branch is ready to be merged.

It's also interesting that moving the tokenizer in front of the case
folder or normalizer always gave me faster results.

Yes, I get the same results.  When I first saw the effect, I thought it
might be a stack-memory-vs-malloc'd-buffer issue in Normalizer, but I was
surprised that CaseFolder behaved the same way.  I have no explanation,
but the results certainly argue for starting analysis with tokenization.
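
The idiom behind that stack-memory-vs-malloc'd-buffer hypothesis looks
roughly like this (a generic C sketch, not the actual CaseFolder or
Normalizer code): short inputs, i.e. typical tokens, are handled in a
stack scratch buffer, and only long inputs such as whole documents pay
for a malloc.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <ctype.h>

    #define SCRATCH_SIZE 256

    /* Case-fold `len` bytes of ASCII from `in` into the caller's `out`
     * (at least len + 1 bytes).  Short inputs are folded in a stack
     * scratch buffer; only long inputs trigger a malloc.  In real
     * Unicode folding the output size can differ from the input size,
     * which is why scratch space is needed in the first place. */
    static int
    fold_case(const char *in, size_t len, char *out) {
        char  scratch[SCRATCH_SIZE];
        char *buf = len < SCRATCH_SIZE ? scratch : malloc(len);
        if (!buf) { return -1; }
        for (size_t i = 0; i < len; i++) {
            buf[i] = (char)tolower((unsigned char)in[i]);
        }
        memcpy(out, buf, len);
        out[len] = '\0';
        if (buf != scratch) { free(buf); }
        return 0;
    }

    int
    main(void) {
        char out[16];
        if (fold_case("MiXeD CaSe", 10, out) == 0) { printf("%s\n", out); }
        return 0;
    }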

In Normalizer, it's probably because we have to scan the whole document
twice to determine the buffer size.  That double scan rarely, if ever,
happens when working with tokenized words.
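
To make that concrete, here's a toy two-pass normalizer in C.  A
made-up whitespace-collapsing transform stands in for real Unicode
normalization; the point is only the shape: because the output size
isn't known in advance, pass one scans the text just to measure it, and
pass two writes it.  For a whole document both passes touch every byte,
while for a single short token they're trivially cheap.

    #include <stdio.h>
    #include <stdlib.h>
    #include <ctype.h>

    /* One pass of a toy normalization that collapses whitespace runs
     * to single spaces.  With out == NULL it only measures; otherwise
     * it writes.  The output size differs from the input size, which
     * is why a measuring pass is needed before allocating. */
    static size_t
    norm_pass(const char *in, size_t len, char *out) {
        size_t n = 0;
        int in_space = 0;
        for (size_t i = 0; i < len; i++) {
            if (isspace((unsigned char)in[i])) {
                if (!in_space) {
                    if (out) { out[n] = ' '; }
                    n++;
                }
                in_space = 1;
            }
            else {
                if (out) { out[n] = in[i]; }
                n++;
                in_space = 0;
            }
        }
        return n;
    }

    static char*
    normalize(const char *in, size_t len, size_t *out_len) {
        size_t needed = norm_pass(in, len, NULL);  /* pass 1: full scan */
        char *out = malloc(needed + 1);
        if (!out) { return NULL; }
        *out_len = norm_pass(in, len, out);        /* pass 2: full scan */
        out[*out_len] = '\0';
        return out;
    }

    int
    main(void) {
        size_t n;
        char *s = normalize("two   words", 11, &n);
        if (s) { printf("%s (%zu bytes)\n", s, n); free(s); }
        return 0;
    }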

Also, the benefit of running the normalizer or case folder before the
tokenizer isn't that great, because tokens and most of the text buffers
are reused.  So we don't really save on allocations.
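
Here's a sketch of the buffer reuse I mean (hypothetical names, not the
actual Lucy token API): the scratch buffer grows to the largest token
seen so far and is then recycled, so the per-token path allocates
rarely regardless of where the normalizer sits in the chain.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <ctype.h>

    typedef struct { char *ptr; size_t cap; } Buf;

    /* Grow the buffer only when the current capacity is too small. */
    static int
    buf_reserve(Buf *b, size_t needed) {
        if (needed > b->cap) {
            char *p = realloc(b->ptr, needed);
            if (!p) { return -1; }
            b->ptr = p;
            b->cap = needed;
        }
        return 0;
    }

    int
    main(void) {
        const char *tokens[] = { "The", "QUICK", "Brown", "FOXES" };
        Buf scratch = { NULL, 0 };
        for (size_t t = 0; t < 4; t++) {
            size_t len = strlen(tokens[t]);
            /* After the first few tokens this rarely reallocates. */
            if (buf_reserve(&scratch, len + 1) != 0) { break; }
            for (size_t i = 0; i <= len; i++) {  /* copy incl. the NUL */
                scratch.ptr[i] = (char)tolower((unsigned char)tokens[t][i]);
            }
            printf("%s\n", scratch.ptr);
        }
        free(scratch.ptr);
        return 0;
    }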

Nick
