>>> For many reasons, we'd prefer option #1. In particular, it involves fewer
>>> keys, making it faster to query.
>> 
>> Do you have numbers proving this?  Btree fan-out is very aggressive, and
>> it takes a lot of keys before you should see any difference in the query
>> speed.
>
> No, I don't have numbers. However, if you think about the numbers, it's
> often at least a factor of 100-1,000 in the number of keys. On one side,
> you have a key for every unique word. On the other, you have a key for
> every word. I don't know enough about how aggressive the fan-out is, but
> a difference of 2-3 orders of magnitude usually causes some sort of
> performance problem.

The question is whether it's creating a new level in the Btree.  If
it is, that's a problem.  If it isn't creating a new level, then it
is unlikely to make a difference.  Btree capacity is the fan-out
(the key density per page) raised to the number of levels, e.g., if
you can fit 100 keys on a page, then you get 100 keys in a one-level
tree, 100^2 in a two-level tree, 100^3 in a three-level tree and so
on.  (Assuming the internal pages are all cached, each page visited
costs an in-memory binary search, which isn't a big problem, either.)
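
For example, here's the back-of-the-envelope arithmetic -- this isn't
Berkeley DB code, and the fan-out of 100 keys per page is only an
assumption, the real number depends on your page size and key/data
sizes:

    #include <stdio.h>

    int
    main(void)
    {
        double fanout = 100.0;          /* assumed keys per page */
        double capacity = fanout;       /* keys a 1-level tree holds */
        int level;

        for (level = 1; level <= 4; ++level) {
            printf("a %d-level tree holds roughly %.0f keys\n",
                level, capacity);
            capacity *= fanout;
        }
        return (0);
    }

As long as the extra keys don't push you into the next level, the
lookup cost stays the same.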

>> And, you're going to pay a price to break up the document lists yourself,
>> which isn't inexpensive.
>
> No, but it allows significant compression of the lists, unless we do
> something like a minimum-difference b-tree, which will probably manage
> similar compression ratios.

Yes, for long-enough lists you could do adaptive algorithms,
which should win over pretty much anything else (including
minimum-difference Btrees).
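
For illustration, here's roughly what compressing a per-word document
list by storing the gaps looks like -- only a sketch; the encode_gaps
routine and the 7-bits-per-byte packing are my own choices, not
anything Berkeley DB does for you:

    #include <stdio.h>

    /*
     * Store the gap between consecutive (sorted, unique) document IDs,
     * 7 bits per byte, high bit set when more bytes follow.  Small gaps
     * -- the common case for frequent words -- take a single byte.
     */
    static unsigned long
    encode_gaps(const unsigned long *ids, unsigned long n, unsigned char *out)
    {
        unsigned long i, len, prev, gap;

        for (i = 0, len = 0, prev = 0; i < n; ++i) {
            gap = ids[i] - prev;
            prev = ids[i];
            do {
                out[len++] = (gap & 0x7f) | (gap > 0x7f ? 0x80 : 0);
                gap >>= 7;
            } while (gap != 0);
        }
        return (len);
    }

    int
    main(void)
    {
        unsigned long ids[] = { 3, 7, 12, 200, 201, 5000 };
        unsigned long n = sizeof(ids) / sizeof(ids[0]);
        unsigned char buf[64];

        printf("%lu IDs, %lu bytes raw, %lu bytes as gaps\n",
            n, n * (unsigned long)sizeof(ids[0]), encode_gaps(ids, n, buf));
        return (0);
    }

An adaptive coder could then be run over the gap bytes to squeeze them
further.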

> From your earlier e-mail:
>> Why is it too much?  How many sites index 100,000 documents?  My
>> bet is that it's not many, and last time I checked, 500MB of
>> disk cost under $7 US.  How many sites are going to benefit from
>> this work?
>
> There are actually more than you'd imagine. Even my "lowly" student
> organization here at Williams College has somewhere around 80,000
> documents right now. The school itself (around 2,000 undergraduates) has
> upwards of 150,000 documents.

Well, yeah, but we're still talking about a problem that can be
solved for $10.

> As you might expect, people indexing lots of documents are also willing to
> invest in development. Currently the word index is on the order of the
> size of the documents indexed -- it's quite clear that compression would
> help. Compression would let large-scale users get around OS limits on
> file sizes (e.g. users bumping into the 2GB limit), and it would help
> smaller-scale users as well, since they'd use less disk and/or get
> better transfer times. Benchmarking indicates that disk latency is
> still a problem for us.

I don't mean to be a jerk, honest, but the obvious answer here is
to switch to a different operating system.  There are lots of free
OS releases that support large filesystems.  Memory is cheap, disk
is cheap, development is very, very expensive.  I expect to support
100+GB caches in Berkeley DB this fall and we already support large
databases (I'm seeing 1TB databases in the field, now).  If you
wait 12 months, this problem will almost certainly go away, and
it's unclear how much of a development effort you can deploy in
under 12 months.

> Suffice it to say
> that we're willing to invest some decent time fixing it using one
> solution or the other. Since we don't know enough about the Berkeley DB
> internals, we're asking you. It seems like either solution requires work
> on our part--either we worry about the off-page issues of #1, or we worry
> about index compression for #2.

We'd take a lot of convincing to work on #1 -- we don't really
think that's the right way to go.

I'm not convinced that the data size numbers you've described
warrant a lot of work on #2, but that's your decision :-), and
I certainly am interested in #2 for other reasons -- I've got
customers with Really Big Data, and I'd like to offer them
compression.

> It sounds to me like you're recommending #2, possibly through a
> minimum-difference B-Tree.

I'd do Huffman encoding first -- it's a much simpler solution
to implement and I think you'll get good results.  (It's
certainly worth the couple hours of work to figure out what kind
of compression is likely to result.)
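
If it helps, those couple of hours can start with something as simple
as this: count byte frequencies in the existing word-index file and
compute the zeroth-order entropy, which a byte-oriented Huffman coder
will approach but can't beat.  (Only a sketch -- the program name and
usage are made up, it's not part of Berkeley DB; link with -lm.)

    #include <math.h>
    #include <stdio.h>

    int
    main(int argc, char *argv[])
    {
        FILE *fp;
        double count[256] = { 0 }, total = 0, bits = 0, p;
        int c, i;

        if (argc != 2 || (fp = fopen(argv[1], "rb")) == NULL) {
            fprintf(stderr, "usage: entropy index-file\n");
            return (1);
        }
        while ((c = getc(fp)) != EOF) {     /* byte frequency histogram */
            ++count[c];
            ++total;
        }
        fclose(fp);

        for (i = 0; i < 256; ++i)           /* H = -sum p * log2(p) */
            if (count[i] > 0) {
                p = count[i] / total;
                bits += count[i] * -(log(p) / log(2.0));
            }
        printf("%.0f bytes in, ~%.0f bytes is the byte-Huffman floor\n",
            total, bits / 8.0);
        return (0);
    }

Run it over one of your existing index files; the output/input ratio is
a reasonable estimate of the best case for simple Huffman coding,
ignoring table overhead.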

> I think we'd also be willing to test the
> changes that you mentioned for converting off-page duplicate sets. Do you
> have a time frame for that?

My guess is that we'll do it sometime in September/October.
I've got customers that I've promised this to, so it's very
high on our list.

Regards,
--keith

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Keith Bostic
Sleepycat Software Inc.         [EMAIL PROTECTED]
394 E. Riding Dr.               +1-978-287-4781
Carlisle, MA 01741-1601         http://www.sleepycat.com

