On Apr 5, 2007, at 3:58 AM, Michael McCandless wrote:

> Marvin do you have any sense of what the equivalent cost is
> in KS

It's big.  I don't have any good optimizations to suggest in this area.

> (I think for KS you "add" a previous segment not that
> differently from how you "add" a document)?

Yeah.  KS has to decompress and serialize posting content, which sux.

The one saving grace is that with the Fibonacci merge schedule and the seg-at-a-time indexing strategy, segments don't get merged nearly as often as they do in Lucene.

> I share large int[] blocks and char[] blocks
> across Postings and re-use them.  Etc.

Interesting.  I will have to try something like that!

> On C) I think it is important so the many ports of Lucene can "compare
> notes" and "cross fertilize".

Well, if you port Lucene's benchmarking stuff to Perl/C, I'll apply the patch. ;)

Cross-fertilization is a powerful tool for stimulating algorithmic innovation. Exhibit A: our unfolding collaborative successes.

That's why it was built into the Lucy proposal:

    [Lucy's C engine] will provide core, performance-critical
    functionality, but leave as much up to the higher-level
    language as possible.

Users from diverse communities approach problems from different angles and come up with different solutions. The best ones will propagate across Lucy bindings.

The only problem is that since Dave Balmain has been much less available than we expected, it's been largely up to me to get Lucy to critical mass where other people can start writing bindings.

> Performance certainly isn't everything.

That's a given in scripting language culture. Most users are concerned with minimizing developer time above all else. Ergo, my emphasis on API design and simplicity.

> But does KS give its users a choice in Tokenizer?

You supply a regular expression which matches one token.

  # Presto! A WhiteSpaceTokenizer:
  my $tokenizer = KinoSearch::Analysis::Tokenizer->new(
      token_re => qr/\S+/
  );
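
To make "a regular expression which matches one token" concrete, here is a minimal plain-Perl sketch of the idea: walk the string and keep every non-overlapping match of the pattern, in order. It does not use or reproduce KinoSearch internals; `tokenize` is a hypothetical helper written only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of KinoSearch: return each successive,
# non-overlapping match of $token_re in $text as one token.
sub tokenize {
    my ( $text, $token_re ) = @_;
    my @tokens;
    while ( $text =~ /$token_re/g ) {
        # @- and @+ hold the start and end offsets of the last match, so
        # this takes the whole match even if the pattern has capture groups.
        push @tokens, substr( $text, $-[0], $+[0] - $-[0] );
    }
    return @tokens;
}

# Runs of non-whitespace: punctuation stays attached to the word.
print join( '|', tokenize( "The quick  brown\tfox.", qr/\S+/ ) ), "\n";
# prints: The|quick|brown|fox.

# Runs of word characters: punctuation is dropped.
print join( '|', tokenize( "The quick  brown\tfox.", qr/\w+/ ) ), "\n";
# prints: The|quick|brown|fox
```

Swapping the pattern is the whole customization: the text between matches is simply discarded, so the regex defines what counts as a token rather than what counts as a separator.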

> Or, can users pre-tokenize their fields themselves?

TokenBatch provides an API for bulk addition of tokens; you can subclass Analyzer to exploit that.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


