Excellent summary Matt. Some notes in the below. On Sun, Sep 18, 2011 at 6:43 PM, Matt Corgan <mcor...@hotpads.com> wrote: > ... All of this is relatively easy for the data > and index blocks because they're immutable. Doing it for the memstore is > another story... >
We'd need another data structure completely, wouldn't we? Have you given it any thought? > Here's some contrived data to illustrate. Row key is a (reversed > domain) url. Column family is called "familyName". ColQualifier is a > browser type. Value is the number of views from that browser type. Here's > the current representation <http://pastebin.com/7ks8kzJ2>, the prefix trie's > toString output <http://pastebin.com/cL4AkCPC> for the row keys, and removing > the whitespace <http://pastebin.com/4qiSXNh9> from the toString output. Thats a big diff. > The second problem is the lack of random access to cell offsets within the > data block. (I'm not 100% sure on this one, so please correct me if i'm > wrong). I noticed how bad this problem is when i was storing historical > event logs with 8 byte keys and small values (so ~30b per cell). I had to > increase block size to 256KB because the block indexes were too big to fit > in memory. Then I needed fast random access to these events. The problem > is that there are ~10,000 cells per block, so without random lookups inside > the block, it's seeking through ~5,000 keys for the average lookup. That's > a crazy amount of overhead to retrieve a single cell. Probably the quickest > solution to this problem is to store the offset of each key in a list of > integers at the end of the block so that it's possible to do a binary search > inside the block. That would reduce it to ~14 avg memory accesses to find > the cell. > I'd imagine that we'd not want the index always, just when keys per block went over some size. hfilev2 should help here because we don't load all indices all the time. > But, the prefix trie should actually be way faster than a binary search > because you don't have to keep comparing the beginning of the key to the one > you're looking for. I'll save the details for another email or for code > comments. In general, with a smarter block/index format we could be much > more efficient than the current method of comparing full key byte[] over and > over again. > > Same random access problem also applies to the block index i think (correct > me if i'm wrong here too). > You should stick the above in the issue Matt, or at least refer to this mail in the issue; its great stuff. Thanks, St.Ack