Hi Paul, not really an answer to your questions, I just thought you may find it useful as a confirmation that this packing of integers into (B or some other) Tree is good one.
I have seen Integer set distributions that can profit hugely from the tree organization on top. have look at: http://www.iis.uni-stuttgart.de/intset/ not meant for on disk storage, but the idea is quite similar. cheers, eks ________________________________ From: Paul Elschot <paul.elsc...@xs4all.nl> To: java-dev@lucene.apache.org Sent: Sunday, 18 January, 2009 23:51:36 Subject: Re: Filesystem based bitset On Friday 09 January 2009 22:30:14 Marvin Humphrey wrote: > On Fri, Jan 09, 2009 at 08:11:31PM +0100, Karl Wettin wrote: > > > SSD is pretty close to RAM when it comes to seeking. Wouldn't that > > mean that a bitset stored on an SSD would be more or less as fast as a > > bitset in RAM? > > Provided that your index can fit in the system i/o cache and stay there, you > get the speed of RAM regardless of the underlying permanent storage type. > There's no reason to wait on SSDs before implementing such a feature. Since this started by thinking out loud, I'd like to continue doing that. I've been thinking about how to add a decent skipTo() to something that compresses better than an (Open)BitSet, and this turns out to be an integer set implemented as a B plus tree (all leafs on the same level) of only integers with key/data compression by a frame of reference for every node (see LUCENE-1410). I found a java implementation for a B plus tree on sourceforge: BpLusDotNet in the BplusJ package, see http://bplusdotnet.sourceforge.net/ . This has nice transaction semantics on a file system and it has a BSD licence, so it could be used as a starting point, but: - it only has strings as index values, so it will need quite some simplification to work on integers as keys and data, and - it has no built in compression as far as I could see on first inspection. The questions: Would someone know of a better starting point for a B plus tree of integers with node compression? For example, how close is the current lucene code base to implementing a b plus tree for the doc ids of a single term? How valuable are transaction semantics for such an integer set? It is tempting to try and implement such an integer set by starting from the ground up, but I don't have any practical programming experience with transaction semantics, so it may be better to start from something that has transactions right from the start. Regards, Paul Elschot