Re: improve how IndexWriter uses RAM to buffer added documents

Marvin Humphrey Wed, 04 Apr 2007 21:15:08 -0700


On Apr 4, 2007, at 10:05 AM, Michael McCandless wrote:

(: Ironically, the numbers for Lucene on that page are a little
better than they should be because of a sneaky bug.  I would have
made updating the results a priority if they'd gone the other way. :)


Hrm.  It would be nice to have a hard comparison of Lucene, KS (and
Ferret and others?).

Doing honest, rigorous benchmarking is exacting and labor-intensive. Publishing results tends to ignite flame wars I don't have time for.

The main point that I wanted to make with that page was that KS was a lot faster than Plucene, and that it was in Lucene's ballpark. Having made that point, I've moved on. The benchmarking code is still very useful for internal development and I use it frequently.

At some point I would like to port the benchmarking work that has been contributed to Lucene of late, but I'm waiting for that code base to settle down first. After that happens, I'll probably make a pass and publish some results. Better to spend the time preparing one definitive presentation than to have to rebut every idiot's latest wildly inaccurate shootout.

... However, Lucene has been tuned by an army of developers over the
years, while KS is young yet and still has many opportunities for
optimization.  Current svn trunk for KS is about twice as fast for
indexing as when I did those benchmarking tests.


Wow, that's an awesome speedup!

The big bottleneck for KS has been its Tokenizer class. There's only one such class in KS, and it's regex-based. A few weeks ago, I finally figured out how to hook it into Perl's regex engine at the C level. The regex engine is not an official part of Perl's C API, so I wouldn't do this if I didn't have to, but the tokenizing loop is only about 100 lines of code and the speedup is dramatic.
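
For anyone unfamiliar with the approach, the general shape of a regex-based tokenizer can be sketched roughly as below. This is an illustration in Python, not KinoSearch's actual Perl/C code: the Token type, the tokenize() function, and the default pattern are all hypothetical names and choices made up for the example.

```python
import re
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    """Hypothetical token record: matched text plus its offsets."""
    text: str
    start: int  # offset of the first character in the source string
    end: int    # offset one past the last character


# Assumed default pattern (not taken from KinoSearch): runs of word
# characters, optionally joined by internal apostrophes.
DEFAULT_PATTERN = r"\w+(?:'\w+)*"


def tokenize(text: str, pattern: str = DEFAULT_PATTERN) -> Iterator[Token]:
    """Yield one Token per non-overlapping regex match, left to right."""
    for match in re.finditer(pattern, text):
        yield Token(match.group(), match.start(), match.end())


tokens = list(tokenize("Lucene's ballpark, twice as fast"))
whitespace_tokens = list(tokenize("a  b\tc", r"\S+"))
```

Because the pattern is just a parameter, one class can cover the cases that would otherwise need separate tokenizers; passing r"\S+" gives plain whitespace splitting, for example.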

I've also squeezed out another 30-40% by changing the implementation in ways which have gradually winnowed down the number of malloc() calls. Some of the techniques may be applicable to Lucene; I'll get around to firing up JIRA issues describing them someday.

So KS is faster than Lucene today?

I haven't tested recent versions of Lucene. I believe that the current svn trunk for KS is faster for indexing than Lucene 1.9.1. But... A) I don't have an official release out with the current Tokenizer code, B) I have no immediate plans to prepare further published benchmarks, and C) it's not really important, because so long as the numbers are close you'd be nuts to choose one engine or the other based on that criterion rather than, say, what language your development team speaks. KinoSearch scales to multiple machines, too.

Looking to the future, I wouldn't be surprised if Lucene edged ahead and stayed slightly ahead speed-wise, because I'm prepared to make some sacrifices for the sake of keeping KinoSearch's core API simple and the code base as small as possible. I'd rather maintain a single, elegant, useful, flexible, plenty fast regex-based Tokenizer than the slew of Tokenizers Lucene offers, for instance. It might be at a slight disadvantage going mano a mano against Lucene's WhitespaceTokenizer, but that's fine.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



