Re: Hash indexing of HFiles

2011-07-19 Thread Casey Stella
I didn't get a chance to investigate thoroughly or get any benchmarks. We were looking for an alternate indexing strategy to B+ trees (with JDBM2) since we knew the keys a priori, but looking at the source I was a bit daunted at porting it and the license wasn't something that we could use. I sta

Re: Hash indexing of HFiles

2011-07-19 Thread Claudio Martella
This looks great. Actually, more than BDZ, the intriguing part is CHM as it's order preserving. I wonder how it behaves for unseen keys. Do you know about it? What did you find more intriguing on this topic? :) On 7/19/11 3:02 AM, Casey Stella wrote: > I looked into MPH a while ago and came acros

Re: Hash indexing of HFiles

2011-07-18 Thread Casey Stella
I looked into MPH a while ago and came across Sebastiano's work, but was even more intrigued by CMPH (http://cmph.sourceforge.net/), which claims to work on the order of a billion keys. I attempted a java port of BDZ (acyclic random 3-graphs FTW :) at one point, but gave up as I found something el

Re: Hash indexing of HFiles

2011-07-18 Thread Stack
On Mon, Jul 18, 2011 at 9:22 AM, Claudio Martella wrote: > Yes, I had a look at it a while ago. For what I know perfect hashing > doesn't work that good for many elements. With millions of items it > should be computationally expensive and the probability of finding such > a perfect hashing. Did y

Re: Hash indexing of HFiles

2011-07-18 Thread Claudio Martella
On 7/18/11 6:05 PM, Stack wrote: > On Mon, Jul 18, 2011 at 4:04 AM, Claudio Martella > wrote: >> No, you can have collisions, so the index is not perfect (which means >> you can have buckets for colliding keys and empty unused entries in the >> hashtable directory). > Well, if a perfect index is w

Re: Hash indexing of HFiles

2011-07-18 Thread Stack
On Mon, Jul 18, 2011 at 4:04 AM, Claudio Martella wrote: > No, you can have collisions, so the index is not perfect (which means > you can have buckets for colliding keys and empty unused entries in the > hashtable directory). Well, if a perfect index is what you are after, you can generate hashi
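
[Illustrative sketch, not part of the thread.] The non-perfect index described in the quoted text above (a hashtable directory with buckets for colliding keys and some empty, unused entries) could look roughly like the following. This is a minimal sketch under stated assumptions, not HBase code: the function names, the use of CRC32 as the hash, and the in-memory list-of-lists layout are all hypothetical choices made for illustration. The number of slots is set to N, the total number of key/offset pairs, which the thread says is known once a flush or compaction finishes.

```python
import zlib


def build_hash_directory(keys_with_offsets, num_slots):
    """Build a non-perfect hash directory: one bucket (list) per slot.

    keys_with_offsets: iterable of (key_bytes, file_offset) pairs.
    num_slots: directory size; hypothetical choice here is N, the pair count.
    """
    directory = [[] for _ in range(num_slots)]
    for key, offset in keys_with_offsets:
        slot = zlib.crc32(key) % num_slots
        # Colliding keys simply share the same bucket.
        directory[slot].append((key, offset))
    return directory


def lookup(directory, key):
    """Return the stored offset for key, or None if the key is absent."""
    slot = zlib.crc32(key) % len(directory)
    # The stored key is compared, so unseen keys are rejected rather than
    # mapped to an arbitrary offset.
    for stored_key, offset in directory[slot]:
        if stored_key == key:
            return offset
    return None


def directory_stats(directory):
    """Count empty slots and slots holding more than one key."""
    empty = sum(1 for bucket in directory if not bucket)
    colliding = sum(1 for bucket in directory if len(bucket) > 1)
    return empty, colliding


if __name__ == "__main__":
    # Hypothetical data: 1000 row keys with made-up file offsets.
    pairs = [(("row-%d" % i).encode(), i * 64) for i in range(1000)]
    directory = build_hash_directory(pairs, len(pairs))
    print(lookup(directory, b"row-42"))
    print(directory_stats(directory))
```

Because the directory stores each key next to its offset, a lookup costs one hash plus a short bucket scan. A minimal perfect hash, by contrast, would remove the empty slots and the collisions, but it must be constructed over the full key set in advance.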

Re: Hash indexing of HFiles

2011-07-18 Thread Claudio Martella
On 7/16/11 10:08 PM, Stack wrote: > On Fri, Jul 15, 2011 at 10:06 AM, Claudio Martella > wrote: >> On 7/15/11 6:24 PM, Stack wrote: >>> How do you figure the N in the below Claudio? >> N is the total amount of pairs in the sequence file. You know that when >> you finish flushing a memstore or comp

Re: Hash indexing of HFiles

2011-07-16 Thread Eric Charles
out there on Git Hub so you may want to check them out. HTH -Mike Date: Fri, 15 Jul 2011 14:32:50 +0200 From: claudio.marte...@tis.bz.it To: user@hbase.apache.org Subject: Hash indexing of HFiles Hello list, at SIGMOD this year i've seen a spreading of different storage files for HBase, wit

Re: Hash indexing of HFiles

2011-07-16 Thread Stack
On Fri, Jul 15, 2011 at 10:06 AM, Claudio Martella wrote: > On 7/15/11 6:24 PM, Stack wrote: >> How do you figure the N in the below Claudio? > N is the total amount of pairs in the sequence file. You know that when > you finish flushing a memstore or compacting files. So a perfect index? If thi

Re: Hash indexing of HFiles

2011-07-16 Thread Michel Segel
>> There are a couple of other projects out there on Git Hub so you may want to >> check them out. >> >> HTH >> >> -Mike >> >> >>> Date: Fri, 15 Jul 2011 14:32:50 +0200 >>> From: claudio.marte...@tis.bz.it >>> To: user

Re: Hash indexing of HFiles

2011-07-16 Thread Eric Charles
a couple of other projects out there on Git Hub so you may want to check them out. HTH -Mike Date: Fri, 15 Jul 2011 14:32:50 +0200 From: claudio.marte...@tis.bz.it To: user@hbase.apache.org Subject: Hash indexing of HFiles Hello list, at SIGMOD this year i've seen a spreading of diff

Re: Hash indexing of HFiles

2011-07-15 Thread Claudio Martella
mness without having to try to build a separate index. >>> But we're still using the base key for the row. Its not like we're creating >>> a secondary index on a column value. >>> >>> There are a couple of other projects out there on Git Hub so you may want

Re: Hash indexing of HFiles

2011-07-15 Thread Stack
on a column value. >> >> There are a couple of other projects out there on Git Hub so you may want to >> check them out. >> >> HTH >> >> -Mike >> >> >>> Date: Fri, 15 Jul 2011 14:32:50 +0200 >>> From: claudio.marte.

Re: Hash indexing of HFiles

2011-07-15 Thread Claudio Martella
> > There are a couple of other projects out there on Git Hub so you may want to > check them out. > > HTH > > -Mike > > >> Date: Fri, 15 Jul 2011 14:32:50 +0200 >> From: claudio.marte...@tis.bz.it >> To: user@hbase.apache.org >> Subject: Hash indexing

RE: Hash indexing of HFiles

2011-07-15 Thread Michael Segel
l 2011 14:32:50 +0200 > From: claudio.marte...@tis.bz.it > To: user@hbase.apache.org > Subject: Hash indexing of HFiles > > Hello list, > > at SIGMOD this year i've seen a spreading of different storage files for > HBase, with different techniques. My scenario and usage does

Hash indexing of HFiles

2011-07-15 Thread Claudio Martella
Hello list, at SIGMOD this year i've seen a spreading of different storage files for HBase, with different techniques. My scenario and usage doesn't really require range queries, so I thought I'd take advantage of even faster random i/o from hash indexing of data in each sequence file. Does anybo