> From: [EMAIL PROTECTED]
>> The question is if it's creating a new level in the Btree.  If it
>
> Tests show that 11 million keys create a level-4 btree. In the
> tests I'm inserting a sorted list of keys, in ascending order. I understand
> that 11 million keys should fit in a level-3 btree.

That depends on the page size and the sizes of the key and data
items, of course.
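
As a sanity check on where the 3-versus-4 level boundary falls,
here's a rough back-of-envelope sketch in C; the 20-byte effective
entry size and 60% fill factor are assumptions for illustration,
not Berkeley DB's actual page accounting:

    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        double pagesize = 4096.0;  /* 4KB pages, per the test */
        double entry = 20.0;       /* assumed key + per-item overhead */
        double fill = 0.6;         /* ~40% free space observed */
        double fanout, levels;

        fanout = pagesize * fill / entry;
        levels = ceil(log(11e6) / log(fanout));
        printf("fanout ~%.0f => ~%.0f levels for 11M keys\n",
            fanout, levels);
        return (0);
    }

With those numbers the fanout comes out near 120 and the height at 4
levels; shrink the per-entry overhead or raise the fill factor and
the same 11 million keys fit in 3.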

> Since there is
> no tree balancing, I don't know how to get the optimal case.

Understood, and tree-balancing isn't an option for concurrency
reasons.

> In addition
> to balancing we will also need a re-packer to get rid of the 40% free
> space in leaf pages. I understand that this free space helps when inserting
> new entries, but if I want to publish a read-only database, I want to
> get rid of it.

You can get this by doing a "db_dump | db_load" of your
database, since db_dump writes the records in sorted order.
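
For a read-only publication, something along these lines (the file
names here are invented) should pack the pages nearly full, because
db_load is handed the records already sorted:

    db_dump index.db | db_load packed.db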

>> Well, yeah, but we're still talking a problem that can be solved
>> for $10.
>
>  A friend of mine defends exactly the same point :-) After many
> discussions with him, my final argument is the following:
> We want htdig indexing to be near optimal for
>              . space usage
>              . dynamic updating
>              . scalability
>              . performance
>              . functionality
> because we don't want to write it from scratch next year. If we
> neglect one point or another, that's what is going to happen.

I don't disagree, but at the same time every project has limited
resources, and some of these goals don't play well together.

For example, space usage and dynamic updating are largely
incompatible in disk storage algorithms: improving space usage
makes updates slower.  Scalability and space usage conflict,
too: Berkeley DB stores 16 bits of page offset, so we are
limited to a 64KB page size.  It would be nice to have bigger
pages on some machines, but that means more than 16 bits of
offset, and so space usage takes a hit.
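
To make that concrete: the page size is an application choice up to
the 64KB ceiling.  A minimal sketch against the 3.x-style C API
follows (2.x applications set db_pagesize in the DB_INFO structure
passed to db_open instead; the file name is invented):

    #include <stdio.h>
    #include <db.h>

    int
    main(void)
    {
        DB *dbp;
        int ret;

        if ((ret = db_create(&dbp, NULL, 0)) != 0) {
            fprintf(stderr, "db_create: %s\n", db_strerror(ret));
            return (1);
        }

        /* Must be a power of two; 64KB is the hard upper limit. */
        if ((ret = dbp->set_pagesize(dbp, 64 * 1024)) != 0) {
            dbp->err(dbp, ret, "set_pagesize");
            return (1);
        }

        if ((ret = dbp->open(dbp, "index.db", NULL,
            DB_BTREE, DB_CREATE, 0664)) != 0) {
            dbp->err(dbp, ret, "open");
            return (1);
        }

        (void)dbp->close(dbp, 0);
        return (0);
    }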

Anyway, I don't think we disagree on the rules of the game,
although we're perhaps making different choices in some places.

> People will *not* buy a new hard drive to index their stuff.
> They will index their stuff as long as it fits on their disk,
> or not index it at all.

I don't think that's true.  Yesterday's example of a University
that wants to index some 250K documents -- if they aren't
willing to put $200 into disk to do that, then the project isn't
serious enough to care about.

It doesn't sound to me as if htdig is using that much disk space
for small projects, and the big projects are going to be willing
to buy the $250 of disk it takes to solve the problem.

>> My guess is that we'll do it sometime in September/October.
>> I've got customers that I've promised this to, so it's very
>> high on our list.
>
>  I'll try to implement this starting now. The fact that I've
> been able to build a 60-million-entry tree in a 1.2GB file gives me
> hope. I would really need a hint, though: why do we have 40% free
> space when using variable-length keys/data (varying from 6 bytes to 35
> bytes, average 8 bytes, page size 4K)? Is there a way to reduce this?

I haven't seen your test program, but I'm pretty sure that this is
all based on the fact that one group was inserted in sorted order
and one wasn't.  If you build read-only indexes, then build them and
do a dump/re-load to store them in the smallest possible space.
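
For what it's worth, a minimal sketch of a sorted-order build looks
something like this (the zero-padded key format and file name are
invented for the example; any keys that compare in ascending order
will do):

    #include <stdio.h>
    #include <string.h>
    #include <db.h>

    int
    main(void)
    {
        DB *dbp;
        DBT key, data;
        char buf[32];
        int i, ret;

        if ((ret = db_create(&dbp, NULL, 0)) != 0)
            return (1);
        if ((ret = dbp->open(dbp, "sorted.db", NULL,
            DB_BTREE, DB_CREATE, 0664)) != 0)
            return (1);

        for (i = 0; i < 1000000; ++i) {
            /* Zero-padded keys sort lexically in insert order. */
            (void)snprintf(buf, sizeof(buf), "%010d", i);

            memset(&key, 0, sizeof(key));
            memset(&data, 0, sizeof(data));
            key.data = buf;
            key.size = strlen(buf);
            data.data = buf;
            data.size = strlen(buf);

            if ((ret = dbp->put(dbp, NULL, &key, &data, 0)) != 0) {
                dbp->err(dbp, ret, "put");
                break;
            }
        }

        (void)dbp->close(dbp, 0);
        return (0);
    }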

Regards,
--keith

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Keith Bostic
Sleepycat Software Inc.         [EMAIL PROTECTED]
394 E. Riding Dr.               +1-978-287-4781
Carlisle, MA 01741-1601         http://www.sleepycat.com

