On Sat, 24 Jul 1999, Sleepycat Software wrote:

> > For many reasons, we'd prefer option #1. In particular, it involves fewer
> > keys, making it faster to query.
> 
> Do you have numbers proving this?  Btree fan-out is very aggressive, and
> it takes a lot of keys before you should see any difference in the query
> speed.

No, I don't have numbers. However, if you think about it, the difference is
often a factor of 100-1,000 in the number of keys: on one side you have a
key for every unique word, on the other a key for every word occurrence. I
don't know enough about how aggressive the fan-out is, but a difference of
2-3 orders of magnitude usually causes some kind of performance problem.

If you're telling me that's not a problem, it's very good news.
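For what it's worth, here's the back-of-the-envelope math I'd been doing (a
rough sketch, assuming a fan-out of roughly 100 entries per internal page;
the real number obviously depends on your page and key sizes):

    // Rough B-tree depth estimate: levels needed to hold n keys when each
    // internal page fans out to roughly f children.
    #include <cmath>
    #include <cstdio>

    int btree_depth(double n_keys, double fanout) {
        return (int)std::ceil(std::log(n_keys) / std::log(fanout));
    }

    int main() {
        const double fanout = 100.0;   // assumed, not measured
        std::printf("100K keys -> %d levels\n", btree_depth(1e5, fanout));  // 3
        std::printf("100M keys -> %d levels\n", btree_depth(1e8, fanout));  // 4
        return 0;
    }

If the fan-out really is that aggressive, a factor of 1,000 more keys only
adds a level or so per lookup, which would explain why you don't expect a
measurable difference.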

> And, you're going to pay a price to break up the document lists yourself,
> which isn't inexpensive.

No, it isn't, but breaking up the lists ourselves allows significant
compression--unless we do something like a minimum-difference B-tree, which
would probably achieve similar compression ratios.
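To make what I mean by compression concrete, something along these lines (a
hypothetical sketch, not current ht://Dig code) is what I'd expect to do
with option #1: keep each word's document list sorted and store the gaps
between IDs, variable-byte encoded so small gaps take a single byte:

    #include <cstdint>
    #include <vector>

    // Delta-plus-varint encoding of a sorted document-ID list.
    std::vector<std::uint8_t>
    compress_doc_list(const std::vector<std::uint32_t>& sorted_ids) {
        std::vector<std::uint8_t> out;
        std::uint32_t prev = 0;
        for (std::uint32_t id : sorted_ids) {
            std::uint32_t gap = id - prev;   // gaps stay small for common words
            prev = id;
            while (gap >= 0x80) {            // 7 data bits per byte, high bit = "more"
                out.push_back((std::uint8_t)((gap & 0x7F) | 0x80));
                gap >>= 7;
            }
            out.push_back((std::uint8_t)gap);
        }
        return out;
    }

For a word that appears in most documents, that cuts each entry from four
bytes down to one or two.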

From your earlier e-mail:
>Why is it too much?  How many sites index 100,000 documents?  My
>bet is that it's not many, and last time I checked, 500MB of
>disk cost under $7 US.  How many sites are going to benefit from
>this work?

There are actually more than you'd imagine. Even my "lowly" student
organization here at Williams College has somewhere around 80,000
documents right now. The school itself (around 2,000 undergraduates) has
upwards of 150,000 documents.

As you might expect, people indexing lots of documents are also willing to
invest in development. Currently the word index is on the order of the size
of the documents indexed, so it's quite clear that compression would help.
It would help large-scale users get around OS limits on file sizes (e.g.
users bumping into the 2GB limit), and it would help smaller-scale users
too, since they'd use less disk and/or get better transfer times. Our
benchmarking indicates that disk latency is still a problem for us.

> And, it's quite reasonable to want to do things like logical joins: join
> the document list for one keyword against the document list for another.
> If you have your own document list, you'll have to do that yourself, if
> you store it as a set of duplicates, Berkeley DB will do it for you.

True. We don't use this right now, but it would be nice.
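If we did keep our own lists, the join itself isn't hard--something like
this merge-style intersection of two sorted document-ID lists (a
hypothetical helper, not existing ht://Dig code) is what I'd expect to
write:

    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Intersect two sorted document-ID lists (one per keyword).
    std::vector<std::uint32_t> intersect(const std::vector<std::uint32_t>& a,
                                         const std::vector<std::uint32_t>& b) {
        std::vector<std::uint32_t> result;
        std::size_t i = 0, j = 0;
        while (i < a.size() && j < b.size()) {
            if (a[i] < b[j])      ++i;
            else if (b[j] < a[i]) ++j;
            else { result.push_back(a[i]); ++i; ++j; }  // in both lists
        }
        return result;
    }

But having Berkeley DB maintain the duplicate sets and do this for us would
certainly be nicer.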

> Well, I'm still not completely convinced -- if you can solve the problem
> with $10 of hardware, I'm not sure how much time it's worth.  :-)

I don't know how successfully I'm going to argue my point. Suffice it to
say that we're willing to invest a decent amount of time fixing this with
one solution or the other. Since we don't know enough about the Berkeley DB
internals, we're asking you. It seems like either solution requires work on
our part: either we worry about the off-page issues of #1, or we worry
about index compression for #2.

It sounds to me like you're recommending #2, possibly through a
minimum-difference B-tree. I think we'd also be willing to test the
changes that you mentioned for converting off-page duplicate sets. Do you
have a time frame for that?

-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/






