Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-30 Thread Yonik Seeley

On 4/30/07, Michael McCandless (JIRA) <[EMAIL PROTECTED]> wrote:

After discussion on java-dev last time, I decided to retry the
"persistent hash" approach, where the Postings hash lasts across many
docs and then a single flush produces a partial segment containing all
of those docs.  This is in contrast to the previous approach where
each doc makes its own segment and then they are merged.

It turns out this is even faster than my previous approach,


Go, Mike, go!


With this new approach, as I process each term in the document I
immediately write the prox/freq in their compact (vints) format into
shared byte[] buffers, rather than accumulating int[] arrays that then
need to be re-processed into the vint encoding.  This speeds things up
because we don't double-process the postings.


Good idea!


 It also uses less
per-document RAM overhead because intermediate postings are stored as
vints not as ints.
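
[In rough sketch form, the shared-buffer VInt write described above might look like this. It follows Lucene's documented VInt convention (low 7 data bits per byte, high bit set as a continuation flag on all but the last byte); the function and buffer names are illustrative, not actual Lucene code.]

```python
def write_vint(buf: bytearray, value: int) -> None:
    # Lucene-style VInt: low 7 bits per byte; the high (continuation)
    # bit is set on every byte except the last.
    while value & ~0x7F:
        buf.append((value & 0x7F) | 0x80)
        value >>= 7
    buf.append(value)

# One shared buffer accumulates freq/prox deltas for many documents,
# so postings never sit around as full 4-byte ints before flushing.
shared = bytearray()
for delta in (1, 127, 128, 16384):
    write_vint(shared, delta)
```

[The four deltas above take 1 + 1 + 2 + 3 = 7 bytes, versus 16 bytes as an int[], which is the per-document RAM saving being described.]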


I'm just trying to follow along at a high level...how do you handle
intermediate termdocs?

-Yonik

-
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-06 Thread Michael McCandless
"Marvin Humphrey" <[EMAIL PROTECTED]> wrote:
>
> On Apr 5, 2007, at 5:26 PM, Michael McCandless wrote:
> 
> >> What we need to do is cut down on decompression and conflict
> >> resolution costs when reading from one segment to another.  KS has
> >> solved this problem for stored fields.  Field defs are global and
> >> field values are keyed by name rather than field number in the field
> >> data file.  Benefits:
> >>
> >>* Whole documents can be read from one segment to
> >>  another as blobs.
> >>* No flags byte.
> >>* No remapping of field numbers.
> >>* No conflict resolution at all.
> >>* Compressed, uncompressed... doesn't matter.
> >>* Less code.
> >>* The possibility of allowing the user to provide their
> >>  own subclass for reading and writing fields. (For
> >>  Lucy, in the language of your choice.)
> >
> > I hear you, and I really really love those benefits, but, we just
> > don't have this freedom with Lucene.
> 
> Yeah, too bad.  This is one area where Lucene and Lucy are going to  
> differ.  Balmain and I are of one mind about global field defs.
> 
> > I think the ability to suddenly birth a new field,
> 
> You can do that in KS as of version 0.20_02.  :)

Excellent!

> > or change a field's attributes like "has vectors", "stores norms",
> > etc., with a new document,
> 
> Can't do that, though, and I make no apologies.  I think it's a  
> misfeature.

Alas, I don't think we (Lucene) can change this now.

> > I suppose if we had a
> > single mapping of field names -> numbers in the index, that would gain
> > us many of the above benefits?  Hmmm.
> 
> You'll still have to be able to remap field numbers when adding  
> entire indexes.

True, but it'd still be good progress for the frequent case of
adding/deleting docs to an existing index.  Progress not perfection...

> > Here's one idea I just had: assuming there are no deletions, you can
> > almost do a raw bytes copy from input segment to output (merged)
> > segment of the postings for a given term X.  I think for prox postings
> > you can.
> 
> You can probably squeeze out some nice gains using a skipVInt()
> function, even with deletions.

Good point.  I think likewise with copyVInt(int numToCopy).
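
[A skipVInt() of the kind suggested above can step over encoded values without decoding them, using only the continuation bit. A minimal sketch, with names taken from the discussion rather than any shipped API:]

```python
def skip_vints(data: bytes, pos: int, n: int) -> int:
    # Step over n VInts without decoding: each value ends at the
    # first byte whose high (continuation) bit is clear.
    for _ in range(n):
        while data[pos] & 0x80:
            pos += 1
        pos += 1
    return pos

# 127, 128, and 5 encoded back to back: b'\x7f', b'\x80\x01', b'\x05'
data = b'\x7f\x80\x01\x05'
```

[Because skipping never materializes the values, a merger can walk past deleted docs' entries cheaply and bulk-copy the rest.]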

> > But for freq postings, you can't, because they are delta coded.
> 
> I'm working on this task right now for KS.
> 
> KS implements the "Flexible Indexing" paradigm, so all posting data  
> goes in a single file.
> 
> I've applied an additional constraint to KS:  Every binary file must  
> consist of one type of record repeated over and over.  Every indexed  
> field gets its own dedicated posting file with the suffix .pNNN to  
> allow per-field posting formats.
> 
> The I/O code is isolated in subclasses of a new class called  
> "Stepper":  You can turn any Stepper loose on its file and read it  
> from top to tail.  When the file format changes, Steppers will get  
> archived, like old plugins.
> 
> My present task is to write the code for the Stepper subclasses  
> MatchPosting, ScorePosting, and RichPosting. (PayloadPosting can  
> wait.)  As I write them, I will see if I can figure out a format that
> can be merged as speedily as possible.  Perhaps the precise variant  
> of delta encoding used in Lucene's .frq file should be avoided.

Neat!  Yes, designing the file format to accommodate "merging"
efficiently (plus searching of course) is a good idea since we lose so
much indexing time to this.

> > Except: it's only the first entry of the incoming segment's freq
> > postings that needs to be re-interpreted?  So you could read that one,
> > encode the delta based on "last docID" for previous segment (I think
> > we'd have to store this in index, probably only if termFreq >
> > threshold), and then copyBytes the rest of the posting?  I will try
> > this out on the merges I'm doing in LUCENE-843; I think it should
> > work and make merging faster (assuming no deletes)?
> 
> Ugh, more special case code.
> 
> I have to say, I started trying to go over your patch, and the  
> overwhelming impression I got coming back to this part of the Lucene  
> code base in earnest for the first time since using 1.4.3 as a  
> porting reference was: simplicity seems to be nobody's priority these  
> days.

Unfortunately this is just a tough tradeoff... higher performance code
is often not "simple".  I also still need to clean up the code, add
comments, etc, but even after that, it's not going to look "simple".
I think this is just the reality of performance optimization.

Mike




Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Marvin Humphrey


On Apr 5, 2007, at 5:26 PM, Michael McCandless wrote:


What we need to do is cut down on decompression and conflict
resolution costs when reading from one segment to another.  KS has
solved this problem for stored fields.  Field defs are global and
field values are keyed by name rather than field number in the field
data file.  Benefits:

   * Whole documents can be read from one segment to
 another as blobs.
   * No flags byte.
   * No remapping of field numbers.
   * No conflict resolution at all.
   * Compressed, uncompressed... doesn't matter.
   * Less code.
   * The possibility of allowing the user to provide their
 own subclass for reading and writing fields. (For
 Lucy, in the language of your choice.)


I hear you, and I really really love those benefits, but, we just
don't have this freedom with Lucene.


Yeah, too bad.  This is one area where Lucene and Lucy are going to  
differ.  Balmain and I are of one mind about global field defs.



I think the ability to suddenly birth a new field,


You can do that in KS as of version 0.20_02.  :)


or change a field's attributes like "has vectors", "stores norms",
etc., with a new document,


Can't do that, though, and I make no apologies.  I think it's a  
misfeature.



I suppose if we had a
single mapping of field names -> numbers in the index, that would gain
us many of the above benefits?  Hmmm.


You'll still have to be able to remap field numbers when adding  
entire indexes.



Here's one idea I just had: assuming there are no deletions, you can
almost do a raw bytes copy from input segment to output (merged)
segment of the postings for a given term X.  I think for prox postings
you can.


You can probably squeeze out some nice gains using a skipVInt()
function, even with deletions.



But for freq postings, you can't, because they are delta coded.


I'm working on this task right now for KS.

KS implements the "Flexible Indexing" paradigm, so all posting data  
goes in a single file.


I've applied an additional constraint to KS:  Every binary file must  
consist of one type of record repeated over and over.  Every indexed  
field gets its own dedicated posting file with the suffix .pNNN to  
allow per-field posting formats.


The I/O code is isolated in subclasses of a new class called  
"Stepper":  You can turn any Stepper loose on its file and read it  
from top to tail.  When the file format changes, Steppers will get  
archived, like old plugins.


My present task is to write the code for the Stepper subclasses  
MatchPosting, ScorePosting, and RichPosting. (PayloadPosting can  
wait.)  As I write them, I will see if I can figure out a format that
can be merged as speedily as possible.  Perhaps the precise variant  
of delta encoding used in Lucene's .frq file should be avoided.



Except: it's only the first entry of the incoming segment's freq
postings that needs to be re-interpreted?  So you could read that one,
encode the delta based on "last docID" for previous segment (I think
we'd have to store this in index, probably only if termFreq >
threshold), and then copyBytes the rest of the posting?  I will try
this out on the merges I'm doing in LUCENE-843; I think it should
work and make merging faster (assuming no deletes)?


Ugh, more special case code.

I have to say, I started trying to go over your patch, and the  
overwhelming impression I got coming back to this part of the Lucene  
code base in earnest for the first time since using 1.4.3 as a  
porting reference was: simplicity seems to be nobody's priority these  
days.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/






Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Michael McCandless

"Marvin Humphrey" <[EMAIL PROTECTED]> wrote:
> On Apr 5, 2007, at 12:06 PM, Michael McCandless wrote:
> 
> >>> (I think for KS you "add" a previous segment not that
> >>> differently from how you "add" a document)?
> >>
> >> Yeah.  KS has to decompress and serialize posting content, which sux.
> >>
> >> The one saving grace is that with the Fibonacci merge schedule and
> >> the seg-at-a-time indexing strategy, segments don't get merged nearly
> >> as often as they do in Lucene.
> >
> > Yeah we need to work on this one.
> 
> What we need to do is cut down on decompression and conflict  
> resolution costs when reading from one segment to another.  KS has  
> solved this problem for stored fields.  Field defs are global and  
> field values are keyed by name rather than field number in the field  
> data file.  Benefits:
> 
>* Whole documents can be read from one segment to
>  another as blobs.
>* No flags byte.
>* No remapping of field numbers.
>* No conflict resolution at all.
>* Compressed, uncompressed... doesn't matter.
>* Less code.
>* The possibility of allowing the user to provide their
>  own subclass for reading and writing fields. (For
>  Lucy, in the language of your choice.)

I hear you, and I really really love those benefits, but, we just
don't have this freedom with Lucene.

I think the ability to suddenly birth a new field, or change a field's
attributes like "has vectors", "stores norms", etc., with a new
document, is something we just can't break at this point with Lucene?

If we could get those benefits without breaking backwards
compatibility then that would be awesome.  I suppose if we had a
single mapping of field names -> numbers in the index, that would gain
us many of the above benefits?  Hmmm.

> What I haven't got yet is a way to move terms and postings  
> economically from one segment to another.  But I'm working on it.  :)

Here's one idea I just had: assuming there are no deletions, you can
almost do a raw bytes copy from input segment to output (merged)
segment of the postings for a given term X.  I think for prox postings
you can.  But for freq postings, you can't, because they are delta
coded.

Except: it's only the first entry of the incoming segment's freq
postings that needs to be re-interpreted?  So you could read that one,
encode the delta based on "last docID" for previous segment (I think
we'd have to store this in index, probably only if termFreq >
threshold), and then copyBytes the rest of the posting?  I will try
this out on the merges I'm doing in LUCENE-843; I think it should
work and make merging faster (assuming no deletes)?
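
[The arithmetic of that idea, sketched: only the first docID delta of the incoming segment's postings is relative to the wrong baseline, so re-encode it against the merged segment's last docID and copy the remaining bytes verbatim. This deliberately ignores Lucene's actual .frq packing (freq folded into the delta's low bit) and skip data; the helper names are made up for illustration, not LUCENE-843 code.]

```python
def read_vint(data: bytes, pos: int):
    # Decode one VInt starting at pos; return (value, next position).
    value, shift = 0, 0
    while True:
        b = data[pos]
        pos += 1
        value |= (b & 0x7F) << shift
        if not b & 0x80:
            return value, pos
        shift += 7

def write_vint(value: int) -> bytes:
    out = bytearray()
    while value & ~0x7F:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)

def splice_postings(postings: bytes, seg_base_doc: int, prev_last_doc: int) -> bytes:
    # The first delta is relative to the incoming segment's start; rebase
    # it against the last docID already written to the merged segment,
    # then bulk-copy the rest of the bytes untouched.
    first_delta, pos = read_vint(postings, 0)
    new_delta = (seg_base_doc + first_delta) - prev_last_doc
    return write_vint(new_delta) + postings[pos:]
```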

> > One thing that irks me about the
> > current Lucene merge policy (besides that it gets confused when you
> > flush-by-RAM-usage) is that it's a "pay it forward" design so you're
> > always over-paying when you build a given index size.  With KS's
> > Fibonacci merge policy, you don't.  LUCENE-854 has some more details.
> 
> However, even under Fibo, when you get socked with a big merge, you  
> really get socked.  It bothers me that the time for adding to your  
> index can vary so unpredictably.

Yeah, I think that's best solved by concurrency (either with threads
or with our own "scheduling" eg on adding a doc you go and merge
another N terms in the running merge)?  There have been several
proposals recently for making Lucene's merging concurrent
(backgrounded), as part of LUCENE-847.

> > Segment merging really is costly.  In building a large (86 GB, 10 MM
> > docs) index, 65.6% of the time was spent merging!  Details are in
> > LUCENE-856...
> 
> > This is a great model.  Are there Python bindings to Lucy yet/coming?
> 
> I'm sure that they will appear once the C core is ready.  The  
> approach I am taking is to make some high-level design decisions  
> collaboratively on lucy-dev, then implement them in KS.  There's a  
> large amount of code that has been written according to our specs  
> that is working in KS and ready to commit to Lucy after trivial  
> changes.  There's more that's ready for review.  However, release of  
> KS 0.20 is taking priority, so code flow into the Lucy repository has  
> slowed.

OK, good to hear.

> I'll also be looking for a job in about a month.  That may slow us  
> down some more, though it won't stop things --  I've basically  
> decided that I'll do what it takes to get Lucy off the ground.  I'll go
> with something stopgap if nothing materializes which is compatible  
> with that commitment.

Whoa, I'm sorry to hear that :(  I hope you land, quickly, somewhere
that takes Lucy/KS seriously.  It's clearly excellent work.

Mike




Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Marvin Humphrey


On Apr 5, 2007, at 12:06 PM, Michael McCandless wrote:


(I think for KS you "add" a previous segment not that
differently from how you "add" a document)?


Yeah.  KS has to decompress and serialize posting content, which sux.

The one saving grace is that with the Fibonacci merge schedule and
the seg-at-a-time indexing strategy, segments don't get merged nearly
as often as they do in Lucene.


Yeah we need to work on this one.


What we need to do is cut down on decompression and conflict  
resolution costs when reading from one segment to another.  KS has  
solved this problem for stored fields.  Field defs are global and  
field values are keyed by name rather than field number in the field  
data file.  Benefits:


  * Whole documents can be read from one segment to
another as blobs.
  * No flags byte.
  * No remapping of field numbers.
  * No conflict resolution at all.
  * Compressed, uncompressed... doesn't matter.
  * Less code.
  * The possibility of allowing the user to provide their
own subclass for reading and writing fields. (For
Lucy, in the language of your choice.)

What I haven't got yet is a way to move terms and postings  
economically from one segment to another.  But I'm working on it.  :)



One thing that irks me about the
current Lucene merge policy (besides that it gets confused when you
flush-by-RAM-usage) is that it's a "pay it forward" design so you're
always over-paying when you build a given index size.  With KS's
Fibonacci merge policy, you don't.  LUCENE-854 has some more details.


However, even under Fibo, when you get socked with a big merge, you  
really get socked.  It bothers me that the time for adding to your  
index can vary so unpredictably.



Segment merging really is costly.  In building a large (86 GB, 10 MM
docs) index, 65.6% of the time was spent merging!  Details are in
LUCENE-856...



This is a great model.  Are there Python bindings to Lucy yet/coming?


I'm sure that they will appear once the C core is ready.  The  
approach I am taking is to make some high-level design decisions  
collaboratively on lucy-dev, then implement them in KS.  There's a  
large amount of code that has been written according to our specs  
that is working in KS and ready to commit to Lucy after trivial  
changes.  There's more that's ready for review.  However, release of  
KS 0.20 is taking priority, so code flow into the Lucy repository has  
slowed.


I'll also be looking for a job in about a month.  That may slow us  
down some more, though it won't stop things --  I've basically  
decided that I'll do what it takes to get Lucy off the ground.  I'll go
with something stopgap if nothing materializes which is compatible  
with that commitment.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/






Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Michael McCandless

"Marvin Humphrey" <[EMAIL PROTECTED]> wrote:

> > (I think for KS you "add" a previous segment not that
> > differently from how you "add" a document)?
> 
> Yeah.  KS has to decompress and serialize posting content, which sux.
> 
> The one saving grace is that with the Fibonacci merge schedule and  
> the seg-at-a-time indexing strategy, segments don't get merged nearly  
> as often as they do in Lucene.

Yeah we need to work on this one.  One thing that irks me about the
current Lucene merge policy (besides that it gets confused when you
flush-by-RAM-usage) is that it's a "pay it forward" design so you're
always over-paying when you build a given index size.  With KS's
Fibonacci merge policy, you don't.  LUCENE-854 has some more details.

Segment merging really is costly.  In building a large (86 GB, 10 MM
docs) index, 65.6% of the time was spent merging!  Details are in
LUCENE-856...

> > On C) I think it is important so the many ports of Lucene can "compare
> > notes" and "cross fertilize".
> 
> Well, if you port Lucene's benchmarking stuff to Perl/C, I'll apply  
> the patch. ;)

I hear you!

> Cross-fertilization is a powerful tool for stimulating algorithmic  
> innovation.  Exhibit A: our unfolding collaborative successes.

Couldn't agree more.

> That's why it was built into the Lucy proposal:
> 
>  [Lucy's C engine] will provide core, performance-critical
>  functionality, but leave as much up to the higher-level
>  language as possible.
> 
> Users from diverse communities approach problems from different  
> angles and come up with different solutions.  The best ones will  
> propagate across Lucy bindings.
> 
> The only problem is that since Dave Balmain has been much less  
> available than we expected, it's been largely up to me to get Lucy to  
> critical mass where other people can start writing bindings.

This is a great model.  Are there Python bindings to Lucy yet/coming?

> > But does KS give its users a choice in Tokenizer?
> 
> You supply a regular expression which matches one token.
> 
># Presto! A WhiteSpaceTokenizer:
>my $tokenizer = KinoSearch::Analysis::Tokenizer->new(
>token_re => qr/\S+/
>);
> 
> > Or, can users pre-tokenize their fields themselves?
> 
> TokenBatch provides an API for bulk addition of tokens; you can  
> subclass Analyzer to exploit that.

Ahh, I get it.  Nice!

Mike




Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Marvin Humphrey


On Apr 5, 2007, at 3:58 AM, Michael McCandless wrote:


Marvin do you have any sense of what the equivalent cost is
in KS


It's big.  I don't have any good optimizations to suggest in this area.


(I think for KS you "add" a previous segment not that
differently from how you "add" a document)?


Yeah.  KS has to decompress and serialize posting content, which sux.

The one saving grace is that with the Fibonacci merge schedule and  
the seg-at-a-time indexing strategy, segments don't get merged nearly  
as often as they do in Lucene.



I share large int[] blocks and char[] blocks
across Postings and re-use them.  Etc.


Interesting.  I will have to try something like that!


On C) I think it is important so the many ports of Lucene can "compare
notes" and "cross fertilize".


Well, if you port Lucene's benchmarking stuff to Perl/C, I'll apply  
the patch. ;)


Cross-fertilization is a powerful tool for stimulating algorithmic  
innovation.  Exhibit A: our unfolding collaborative successes.


That's why it was built into the Lucy proposal:

[Lucy's C engine] will provide core, performance-critical
functionality, but leave as much up to the higher-level
language as possible.

Users from diverse communities approach problems from different  
angles and come up with different solutions.  The best ones will  
propagate across Lucy bindings.


The only problem is that since Dave Balmain has been much less  
available than we expected, it's been largely up to me to get Lucy to  
critical mass where other people can start writing bindings.



Performance certainly isn't everything.


That's a given in scripting language culture.  Most users are  
concerned with minimizing developer time above all else.  Ergo, my  
emphasis on API design and simplicity.



But does KS give its users a choice in Tokenizer?


You supply a regular expression which matches one token.

  # Presto! A WhiteSpaceTokenizer:
  my $tokenizer = KinoSearch::Analysis::Tokenizer->new(
  token_re => qr/\S+/
  );


Or, can users pre-tokenize their fields themselves?


TokenBatch provides an API for bulk addition of tokens; you can  
subclass Analyzer to exploit that.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/






Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-05 Thread Michael McCandless
"Marvin Humphrey" <[EMAIL PROTECTED]> wrote:
> On Apr 4, 2007, at 10:05 AM, Michael McCandless wrote:
> 
> >> (: Ironically, the numbers for Lucene on that page are a little
> >> better than they should be because of a sneaky bug.  I would have
> >> made updating the results a priority if they'd gone the other  
> >> way.  :)
> >
> > Hrm.  It would be nice to have hard comparison of the Lucene, KS (and
> > Ferret and others?).
> 
> Doing honest, rigorous benchmarking is exacting and labor-intensive.   
> Publishing results tends to ignite flame wars I don't have time for.
> 
> The main point that I wanted to make with that page was that KS was a  
> lot faster than Plucene, and that it was in Lucene's ballpark.   
> Having made that point, I've moved on.  The benchmarking code is  
> still very useful for internal development and I use it frequently.

Agreed.  Though, if the benchmarking is done in a way that anyone
could download & re-run it (eg as part of Lucene's new & developing
benchmark framework), it should help to keep flaming in check.

Accurate & well communicated benchmark results both within each
variant/port of Lucene and across them is crucial for all of us making
iterative progress on performance.

> At some point I would like to port the benchmarking work that has  
> been contributed to Lucene of late, but I'm waiting for that code  
> base to settle down first.  After that happens, I'll probably make a  
> pass and publish some results.  Better to spend the time preparing  
> one definitive presentation than to have to rebut every idiot's  
> latest wildly inaccurate shootout.

Excellent!

> >> ... However, Lucene has been tuned by an army of developers over the
> >> years, while KS is young yet and still had many opportunities for
> >> optimization.  Current svn trunk for KS is about twice as fast for
> >> indexing as when I did those benchmarking tests.
> >
> > Wow, that's an awesome speedup!
> 
> The big bottleneck for KS has been its Tokenizer class.  There's only  
> one such class in KS, and it's regex-based.  A few weeks ago, I  
> finally figured out how to hook it into Perl's regex engine at the C  
> level.  The regex engine is not an official part of Perl's C API, so  
> I wouldn't do this if I didn't have to, but the tokenizing loop is  
> only about 100 lines of code and the speedup is dramatic.

Tokenization is a very big part of Lucene's indexing time as well.

StandardAnalyzer is very time consuming.  When I switched to testing
with WhitespaceAnalyzer, it was quite a bit faster (I don't have exact
numbers).  Then when I created and switched to SimpleSpaceAnalyzer
(just splits on the space character, and doesn't do new String(...)
for every token; instead it makes offset+length slices into a char[]
array), it was even faster.
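
[SimpleSpaceAnalyzer was a private experiment, so the following is only a guess at the technique: split on the space character alone and hand back (offset, length) slices into the original buffer instead of allocating a new string per token.]

```python
def space_tokens(text: str):
    # Yield (offset, length) pairs over the original buffer; no
    # per-token string allocation, and only ' ' is a separator.
    start, n = 0, len(text)
    for i in range(n + 1):
        if i == n or text[i] == ' ':
            if i > start:
                yield (start, i - start)
            start = i + 1
```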

This is why the "your mileage may vary" caveat is extremely important.
For most users of Lucene, I'd expect that 1) retrieving the doc from
whatever its source is, and 2) tokenizing, take a substantial amount
of time.  So the gains I'm seeing in my benchmarks won't usually be
seen by normal applications unless these applications have already
optimized their doc retrieval/tokenization.

And now that indexing each document is so fast, segment merging has
become a BIG part (66% in my "large index" test in LUCENE-856) of
indexing.  Marvin do you have any sense of what the equivalent cost is
in KS (I think for KS you "add" a previous segment not that
differently from how you "add" a document)?
 
> I've also squeezed out another 30-40% by changing the implementation  
> in ways which have gradually winnowed down the number of malloc()  
> calls.  Some of the techniques may be applicable to Lucene; I'll get  
> around to firing up JIRA issues describing them someday.

This generally was my approach in LUCENE-843 (minimize "new
Object()").  I re-use Posting objects, the hash for Posting objects,
byte buffers, etc.  I share large int[] blocks and char[] blocks
across Postings and re-use them.  Etc.
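
[The block reuse described above might look roughly like this free list of fixed-size blocks, handed out as documents are indexed and recycled on flush. Class and method names are illustrative, not Lucene's.]

```python
class IntBlockPool:
    # Free list of fixed-size int blocks, reused across documents so
    # the indexer stops allocating fresh arrays per posting.
    def __init__(self, block_size: int = 1024):
        self.block_size = block_size
        self._free = []

    def get(self) -> list:
        return self._free.pop() if self._free else [0] * self.block_size

    def release(self, block: list) -> None:
        # Caller promises not to touch the block after release.
        self._free.append(block)

pool = IntBlockPool(block_size=8)
a = pool.get()
pool.release(a)
b = pool.get()   # the same block comes back; no new allocation
```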

The one thing that still baffles me is: I can't get a persistent
Posting hash to be any faster.  I still reset the Posting hash with
every document, but I had variants in my iterations that kept the
Postings hash between documents (just flushing the int[]'s
periodically).  I had expected that leaving Posting instances in the
hash, esp. for frequent terms, would be a win, but so far I haven't
seen that empirically.

> > So KS is faster than Lucene today?
> 
> I haven't tested recent versions of Lucene.  I believe that the  
> current svn trunk for KS is faster for indexing than Lucene 1.9.1.   
> But... A) I don't have an official release out with the current  
> Tokenizer code, B) I have no immediate plans to prepare further  
> published benchmarks, and C) it's not really important, because so  
> long as the numbers are close you'd be nuts to choose one engine or  
> the other based on that criteria rather than, say, what language your  
> development team speaks.  KinoSearch scales to multiple machines, too.

Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-04 Thread Marvin Humphrey


On Apr 4, 2007, at 10:05 AM, Michael McCandless wrote:


(: Ironically, the numbers for Lucene on that page are a little
better than they should be because of a sneaky bug.  I would have
made updating the results a priority if they'd gone the other  
way.  :)


Hrm.  It would be nice to have hard comparison of the Lucene, KS (and
Ferret and others?).


Doing honest, rigorous benchmarking is exacting and labor-intensive.   
Publishing results tends to ignite flame wars I don't have time for.


The main point that I wanted to make with that page was that KS was a  
lot faster than Plucene, and that it was in Lucene's ballpark.   
Having made that point, I've moved on.  The benchmarking code is  
still very useful for internal development and I use it frequently.


At some point I would like to port the benchmarking work that has  
been contributed to Lucene of late, but I'm waiting for that code  
base to settle down first.  After that happens, I'll probably make a  
pass and publish some results.  Better to spend the time preparing  
one definitive presentation than to have to rebut every idiot's  
latest wildly inaccurate shootout.



... However, Lucene has been tuned by an army of developers over the
years, while KS is young yet and still had many opportunities for
optimization.  Current svn trunk for KS is about twice as fast for
indexing as when I did those benchmarking tests.


Wow, that's an awesome speedup!


The big bottleneck for KS has been its Tokenizer class.  There's only  
one such class in KS, and it's regex-based.  A few weeks ago, I  
finally figured out how to hook it into Perl's regex engine at the C  
level.  The regex engine is not an official part of Perl's C API, so  
I wouldn't do this if I didn't have to, but the tokenizing loop is  
only about 100 lines of code and the speedup is dramatic.


I've also squeezed out another 30-40% by changing the implementation  
in ways which have gradually winnowed down the number of malloc()  
calls.  Some of the techniques may be applicable to Lucene; I'll get  
around to firing up JIRA issues describing them someday.



So KS is faster than Lucene today?


I haven't tested recent versions of Lucene.  I believe that the  
current svn trunk for KS is faster for indexing than Lucene 1.9.1.   
But... A) I don't have an official release out with the current  
Tokenizer code, B) I have no immediate plans to prepare further  
published benchmarks, and C) it's not really important, because so  
long as the numbers are close you'd be nuts to choose one engine or  
the other based on that criteria rather than, say, what language your  
development team speaks.  KinoSearch scales to multiple machines, too.


Looking to the future, I wouldn't be surprised if Lucene edged ahead  
and stayed slightly ahead speed-wise, because I'm prepared to make  
some sacrifices for the sake of keeping KinoSearch's core API simple  
and the code base as small as possible.  I'd rather maintain a  
single, elegant, useful, flexible, plenty fast regex-based Tokenizer  
than the slew of Tokenizers Lucene offers, for instance.  It might be  
at a slight disadvantage going mano a mano against Lucene's  
WhiteSpaceTokenizer, but that's fine.


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/






Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-04 Thread Michael McCandless
"Marvin Humphrey" <[EMAIL PROTECTED]> wrote:
> 
> On Apr 3, 2007, at 9:52 AM, Michael McCandless wrote:
> 
> > "Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> >> Wow, very nice results Mike!
> >
> > Thanks :)  I'm just praying I don't have some sneaky bug making
> > the results far better than they really are!!
> 
> That's possible, but I'm confident that the model you're using is  
> capable of the gains you're seeing.  When I benched KinoSearch a year  
> ago against Lucene, KS was getting close, but was still a little  
> behind... 

OK glad to hear that :)  I *think* I don't have such bugs.

> (: Ironically, the numbers for Lucene on that page are a little  
> better than they should be because of a sneaky bug.  I would have  
> made updating the results a priority if they'd gone the other way.  :)

Hrm.  It would be nice to have a hard comparison of Lucene, KS (and
Ferret and others?).
 
> ... However, Lucene has been tuned by an army of developers over the  
> years, while KS is young yet and still had many opportunities for  
> optimization.  Current svn trunk for KS is about twice as fast for  
> indexing as when I did those benchmarking tests.

Wow, that's an awesome speedup!  So KS is faster than Lucene today?

> I look forward to studying your patch in detail at some point to see  
> what you've done differently.  It sounds like you only familiarized  
> yourself with the high-level details of how KS has been working,  
> yes?  Hopefully, you misunderstood and came up with something better. ;)

Exactly!  I very carefully didn't look closely at how KS does
indexing.  I did read your posts on this list and did read the Wiki
page and I think a few other pages describing KS's merge model but
stopped there.  We can compare our approaches in detail at some point
and then cross-fertilize :)

Mike




Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-04 Thread Marvin Humphrey


On Apr 3, 2007, at 9:52 AM, Michael McCandless wrote:


"Yonik Seeley" <[EMAIL PROTECTED]> wrote:

Wow, very nice results Mike!


Thanks :)  I'm just praying I don't have some sneaky bug making
the results far better than they really are!!


That's possible, but I'm confident that the model you're using is  
capable of the gains you're seeing.  When I benched KinoSearch a year  
ago against Lucene, KS was getting close, but was still a little  
behind... 


(: Ironically, the numbers for Lucene on that page are a little  
better than they should be because of a sneaky bug.  I would have  
made updating the results a priority if they'd gone the other way.  :)


... However, Lucene has been tuned by an army of developers over the  
years, while KS is young yet and still had many opportunities for  
optimization.  Current svn trunk for KS is about twice as fast for  
indexing as when I did those benchmarking tests.


I look forward to studying your patch in detail at some point to see  
what you've done differently.  It sounds like you only familiarized  
yourself with the high-level details of how KS has been working,  
yes?  Hopefully, you misunderstood and came up with something better. ;)


Marvin Humphrey
Rectangular Research
http://www.rectangular.com/






Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-03 Thread Michael McCandless

"Yonik Seeley" <[EMAIL PROTECTED]> wrote:
> Wow, very nice results Mike!

Thanks :)  I'm just praying I don't have some sneaky bug making
the results far better than they really are!!  And still plenty
to do...

Mike




Re: improve how IndexWriter uses RAM to buffer added documents

2007-04-03 Thread Yonik Seeley

Wow, very nice results Mike!

-Yonik

On 4/3/07, Michael McCandless (JIRA) <[EMAIL PROTECTED]> wrote:


[ https://issues.apache.org/jira/browse/LUCENE-843?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12486335 ]

Michael McCandless commented on LUCENE-843:
---


Last are the results for small docs (100 tokens = ~550 bytes plain text each):

  2000000 DOCS @ ~550 bytes plain text
  RAM = 32 MB
  NUM THREADS = 1
  MERGE FACTOR = 10


No term vectors nor stored fields

  AUTOCOMMIT = true (commit whenever RAM is full)

old
  2000000 docs in 886.7 secs
  index size = 438M

new
  2000000 docs in 230.5 secs
  index size = 435M

Total Docs/sec:             old  2255.6; new  8676.4 [  284.7% faster]
Docs/MB @ flush:            old   128.0; new  4194.6 [ 3176.2% more]
Avg RAM used (MB) @ flush:  old   107.3; new    37.7 [   64.9% less]


  AUTOCOMMIT = false (commit only once at the end)

old
  2000000 docs in 888.7 secs
  index size = 438M

new
  2000000 docs in 239.6 secs
  index size = 432M

Total Docs/sec:             old  2250.5; new  8348.7 [  271.0% faster]
Docs/MB @ flush:            old   128.0; new  4146.8 [ 3138.9% more]
Avg RAM used (MB) @ flush:  old   108.1; new    38.9 [   64.0% less]



With term vectors (positions + offsets) and 2 small stored fields

  AUTOCOMMIT = true (commit whenever RAM is full)

old
  2000000 docs in 1480.1 secs
  index size = 2.1G

new
  2000000 docs in 462.0 secs
  index size = 2.1G

Total Docs/sec:             old  1351.2; new  4329.3 [  220.4% faster]
Docs/MB @ flush:            old    93.1; new  4194.6 [ 4405.7% more]
Avg RAM used (MB) @ flush:  old   296.4; new    38.3 [   87.1% less]


  AUTOCOMMIT = false (commit only once at the end)

old
  2000000 docs in 1489.4 secs
  index size = 2.1G

new
  2000000 docs in 347.9 secs
  index size = 2.1G

Total Docs/sec:             old  1342.8; new  5749.4 [  328.2% faster]
Docs/MB @ flush:            old    93.1; new  4146.8 [ 4354.5% more]
Avg RAM used (MB) @ flush:  old   297.1; new    38.6 [   87.0% less]



  200000 DOCS @ ~5,500 bytes plain text


No term vectors nor stored fields

  AUTOCOMMIT = true (commit whenever RAM is full)

old
  200000 docs in 397.6 secs
  index size = 415M

new
  200000 docs in 167.5 secs
  index size = 411M

Total Docs/sec:             old   503.1; new  1194.1 [  137.3% faster]
Docs/MB @ flush:            old    81.6; new   406.2 [  397.6% more]
Avg RAM used (MB) @ flush:  old    87.3; new    35.2 [   59.7% less]


  AUTOCOMMIT = false (commit only once at the end)

old
  200000 docs in 394.6 secs
  index size = 415M

new
  200000 docs in 168.4 secs
  index size = 408M

Total Docs/sec:             old   506.9; new  1187.7 [  134.3% faster]
Docs/MB @ flush:            old    81.6; new   432.2 [  429.4% more]
Avg RAM used (MB) @ flush:  old   126.6; new    36.9 [   70.8% less]



With term vectors (positions + offsets) and 2 small stored fields

  AUTOCOMMIT = true (commit whenever RAM is full)

old
  200000 docs in 754.2 secs
  index size = 1.7G

new
  200000 docs in 304.9 secs
  index size = 1.7G

Total Docs/sec:             old   265.2; new   656.0 [  147.4% faster]
Docs/MB @ flush:            old    46.7; new   406.2 [  769.6% more]
Avg RAM used (MB) @ flush:  old    92.9; new    35.2 [   62.1% less]


  AUTOCOMMIT = false (commit only once at the end)

old
  200000 docs in 743.9 secs
  index size = 1.7G

new
  200000 docs in 244.3 secs
  index size = 1.7G

Total Docs/sec:             old   268.9; new   818.7 [  204.5% faster]
Docs/MB @ flush:            old    46.7; new   432.2 [  825.2% more]
Avg RAM used (MB) @ flush:  old    93.0; new    36.6 [   60.6% less]
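The derived figures in these tables follow directly from the raw numbers. A quick sanity check of the small-docs autocommit run (the ~2,000,000-document corpus size is an assumption inferred from the reported times and rates, not stated explicitly here):

```python
def summarize(num_docs, old_secs, new_secs):
    """Recompute the docs/sec rates and speedup percentage
    reported in the benchmark tables."""
    old_rate = num_docs / old_secs
    new_rate = num_docs / new_secs
    pct_faster = (old_secs / new_secs - 1.0) * 100.0
    return old_rate, new_rate, pct_faster

# Small-docs, AUTOCOMMIT=true run (doc count inferred, see above):
old_rate, new_rate, pct = summarize(2_000_000, 886.7, 230.5)
print(f"old {old_rate:.1f} docs/sec, new {new_rate:.1f} docs/sec, "
      f"{pct:.1f}% faster")
# old 2255.6 docs/sec, new 8676.8 docs/sec, 284.7% faster
```

The recomputed new rate (8676.8) differs from the table's 8676.4 in the last digit, which suggests the original timings carried more precision than the figures shown.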

