On Apr 4, 2007, at 10:05 AM, Michael McCandless wrote:

>> (: Ironically, the numbers for Lucene on that page are a little
>> better than they should be because of a sneaky bug.  I would have
>> made updating the results a priority if they'd gone the other way. :)

> Hrm.  It would be nice to have a hard comparison of Lucene, KS (and
> Ferret and others?).

Doing honest, rigorous benchmarking is exacting and labor-intensive. Publishing results tends to ignite flame wars I don't have time for.

The main point that I wanted to make with that page was that KS was a lot faster than Plucene, and that it was in Lucene's ballpark. Having made that point, I've moved on. The benchmarking code is still very useful for internal development and I use it frequently.

At some point I would like to port the benchmarking work that has been contributed to Lucene of late, but I'm waiting for that code base to settle down first. After that happens, I'll probably make a pass and publish some results. Better to spend the time preparing one definitive presentation than to have to rebut every idiot's latest wildly inaccurate shootout.

>> ... However, Lucene has been tuned by an army of developers over the
>> years, while KS is young yet and still has many opportunities for
>> optimization.  Current svn trunk for KS is about twice as fast for
>> indexing as when I did those benchmarking tests.

> Wow, that's an awesome speedup!

The big bottleneck for KS has been its Tokenizer class. There's only one such class in KS, and it's regex-based. A few weeks ago, I finally figured out how to hook it into Perl's regex engine at the C level. The regex engine is not an official part of Perl's C API, so I wouldn't do this if I didn't have to, but the tokenizing loop is only about 100 lines of code and the speedup is dramatic.
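To illustrate what "regex-based" means here, a minimal sketch in Python rather than KinoSearch's actual Perl/C code: the tokenizer is just a loop over successive regex matches, recording each token's text and offsets. The pattern and the `tokenize` name are assumptions for illustration, not the KS API.

```python
import re

# Assumed token pattern (hypothetical, for illustration): a run of word
# characters, optionally containing internal apostrophes such as "isn't".
TOKEN_RE = re.compile(r"\w+(?:'\w+)*")

def tokenize(text, token_re=TOKEN_RE):
    """Yield (token_text, start_offset, end_offset) for each regex match."""
    for match in token_re.finditer(text):
        yield match.group(0), match.start(), match.end()

tokens = list(tokenize("KS isn't slow."))
# tokens == [("KS", 0, 2), ("isn't", 3, 8), ("slow", 9, 13)]
```

The whole tokenizer is that one matching loop, which is why the cost of each call into the regex engine dominates.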

I've also squeezed out another 30-40% by changing the implementation in ways which have gradually winnowed down the number of malloc() calls. Some of the techniques may be applicable to Lucene; I'll get around to firing up JIRA issues describing them someday.

> So KS is faster than Lucene today?

I haven't tested recent versions of Lucene. I believe that the current svn trunk for KS is faster for indexing than Lucene 1.9.1. But... A) I don't have an official release out with the current Tokenizer code, B) I have no immediate plans to prepare further published benchmarks, and C) it's not really important, because so long as the numbers are close you'd be nuts to choose one engine or the other based on that criterion rather than, say, what language your development team speaks. KinoSearch scales to multiple machines, too.

Looking to the future, I wouldn't be surprised if Lucene edged ahead and stayed slightly ahead speed-wise, because I'm prepared to make some sacrifices for the sake of keeping KinoSearch's core API simple and the code base as small as possible. I'd rather maintain a single, elegant, useful, flexible, plenty fast regex-based Tokenizer than the slew of Tokenizers Lucene offers, for instance. It might be at a slight disadvantage going mano a mano against Lucene's WhitespaceTokenizer, but that's fine.
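A rough sketch of that trade-off, again in Python and not the real KinoSearch classes: one tokenizer parameterized by a pattern can stand in for several specialized ones. The `RegexTokenizer` class and both patterns are hypothetical; the `\S+` pattern only approximates what a dedicated whitespace tokenizer does.

```python
import re

class RegexTokenizer:
    """Hypothetical single tokenizer whose behavior is set by its pattern."""

    def __init__(self, pattern=r"\w+(?:'\w+)*"):
        # Compile once; every call to tokenize() reuses the compiled regex.
        self.token_re = re.compile(pattern)

    def tokenize(self, text):
        # The pattern has no capturing groups, so findall returns whole matches.
        return self.token_re.findall(text)

word_tokenizer = RegexTokenizer()               # word-like tokens
whitespace_tokenizer = RegexTokenizer(r"\S+")   # runs of non-whitespace

text = "plenty-fast, regex-based tokenizer"
words = word_tokenizer.tokenize(text)
chunks = whitespace_tokenizer.tokenize(text)
```

A purpose-built whitespace splitter (here, `text.split()`) gives the same tokens as the `\S+` configuration and would likely be somewhat faster, which is the slight disadvantage being accepted.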

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


