Re: Filesystem based bitset

eks dev Mon, 19 Jan 2009 00:14:35 -0800

Hi Paul, 
not really an answer to your questions, I just thought you may find it useful 
as a confirmation that this packing of integers into (B or some other) Tree is 
good one.


I have seen Integer set distributions that can profit hugely from the tree 
organization on top. 
 
have look at: http://www.iis.uni-stuttgart.de/intset/
not meant for on disk storage, but the idea is quite similar.
 
cheers, 
eks






________________________________
From: Paul Elschot <paul.elsc...@xs4all.nl>
To: java-dev@lucene.apache.org
Sent: Sunday, 18 January, 2009 23:51:36
Subject: Re: Filesystem based bitset

 
On Friday 09 January 2009 22:30:14 Marvin Humphrey wrote:
> On Fri, Jan 09, 2009 at 08:11:31PM +0100, Karl Wettin wrote:
> 
> > SSD is pretty close to RAM when it comes to seeking. Wouldn't that 
> > mean that a bitset stored on an SSD would be more or less as fast as a 
> > bitset in RAM? 
> 
> Provided that your index can fit in the system i/o cache and stay there, you
> get the speed of RAM regardless of the underlying permanent storage type.
> There's no reason to wait on SSDs before implementing such a feature.
Since this started by thinking out loud, I'd like to continue doing that.
I've been thinking about how to add a decent skipTo() to something that
compresses better than an (Open)BitSet, and this turns out to be an
integer set implemented as a B plus tree (all leafs on the same level) of
only integers with key/data compression by a frame of reference for
every node (see LUCENE-1410).
I found a java implementation for a B plus tree  on sourceforge: BpLusDotNet
in the BplusJ package, see http://bplusdotnet.sourceforge.net/ .
This has nice transaction semantics on a file system and it has a BSD licence,
so it could be used as a starting point, but:
- it only has strings as index values, so it will need quite some simplification
to work on integers as keys and data, and
- it has no built in compression as far as I could see on first inspection.
The questions:
Would someone know of a better starting point for a B plus tree of integers
with node compression?
For example, how close is the current lucene code base to implementing
a b plus tree for the doc ids of a single term?
How valuable are transaction semantics for such an integer set? It is
tempting to try and implement such an integer set by starting from the
ground up, but I don't have any practical programming experience with
transaction semantics, so it may be better to start from something that
has transactions right from the start.
Regards,
Paul Elschot

Re: Filesystem based bitset

Reply via email to