I'm here! Still parsing through the stuff in this e-mail but I agree that many approaches will require rejiggering of KeyValue in a significant way. I have made some attempts at this but nothing is very far. It's definitely something that we are hoping to put some resources into over the summer.
JG > -----Original Message----- > From: Todd Lipcon [mailto:t...@cloudera.com] > Sent: Thursday, June 02, 2011 3:37 PM > To: dev@hbase.apache.org > Subject: Re: prefix compression > > Hey Matt, > > Interesting email, and also something I've been thinking about recently. > > Unfortunately, I think one of the big prerequisites before we can start > thinking about actual compression algorithms is some refactoring around > how KeyValue is used. Currently, KeyValue exposes byte arrays in a lot of > places > - these byte arrays are expected to have discrete sub-ranges that represent > row key, column qualifier, etc. Various parts of the code directly grab these > byte arrays and offsets and do stuff with them (eg comparison, serialization, > etc). > > In order to split the different components up, or do true prefix compression, > we'd need to be able to have KeyValue do some smarter things for > comparison, and never call functions like "getBuffer), getRowKeyOffset(), > getRowKeyLength()" expecting that to show us a byte array range with a row > key in it. > > I think Jonathan Gray has some code hacked together for this, but last I > heard it wasn't in shippable state... Jonathan, you out there? > > As for the particular compression algorithms, a nice first step would be a > very > simplistic one: the first row key of any HFile block would be given in full. > Following that, each key uses a single byte token which identifies the > number of key components shared with the previous row (ie same row, > same col, different ts). This isn't quite as good as true prefix compression, > but for common cases where each cell has three versions, and each row has > many columns, it's probably a huge savings. > > -Todd > > On Thu, Jun 2, 2011 at 3:28 PM, Matt Corgan <mcor...@hotpads.com> > wrote: > > > I've been reading up on prefix compression for a few weeks trying to > > figure out the best approach. Maybe some of you have noticed that 90% > > of my emails to the user list have something to do with it... Now > > that people are working on HFile v2 and the topic of Lucene's FST has > > come up, I thought it would be a good time to share some thoughts. My > > goal here is definitely not asking anyone to implement this, but just > > to start a discussion that gets the proper jiras in place. I have > > spent a little time coding an initial rowKey trie builder that could > > be the start of a pluggable implementation. > > > > This is a big topic to address in one email, but I don't think the > > pieces make sense on their own. Here's an outline with some thoughts: > > > > * refer to prefix compression as "compaction" to avoid interfering > > with traditional "compression" > > > > * main goal is to reduce size of data in memory to increase effective > > capacity of block index, block cache, and memstores > > * disk access takes thousands of times longer than memory access, > > especially over hdfs/network > > > > * secondary goal is to speed up the processing of in memory data > > * memory access takes 200+ cpu cycles, so must use cache friendly > > algorithms > > * structure trie node sizes around typical 64B cache line size > > * investigate java's memory prefetching capabilities, possibly > > increasing to ~256B nodes > > * research > > paper< > > > http://www.google.com/url?sa=t&source=web&cd=5&sqi=2&ved=0CDcQFj > AE&url > > > =http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3 > D10.1. > > > 1.68.4872%26rep%3Drep1%26type%3Dpdf&rct=j&q=databases%20prefetchi > ng&ei > > =7tnnTcSWLuXTiALSpNCUDA&usg=AFQjCNFpVC9oDsG1- > UG0x6ZsqsIHLqvGDA&sig2=o0 > > 96l8cNrnX69oeBGChUKg > > >about > > modern memory databases and prefetching > > * almost always use variable length integer/long encoding > > > > * need separate compacted structures/algorithms for rowKey, > > colQualifier, and timestamp > > * structures will need to interact with varInt pointers to link > > rowKey<->colQualifier<->timestamp > > > > * i think in the medium to long term (and maybe even short term) it > > would be best to code this from scratch > > > > > > Static row key compaction for block cache and data blocks: > > * avoid schemes where each row is a diff of the previous (expensive > > cpu) > > * utilize the fact that row keys are always sorted (unlike column > > qualifiers) > > * this makes for fast tree construction during flushing and > > compaction > > * low hanging fruit: (rowKeyCompaction=simple) > > * pull off the common prefix of all rows > > * have a flag indicating if row key is repeat of the previous > > * more aggressive: use modified Cache Sensitive Search Trees< > > http://www.cs.columbia.edu/~library/TR-repository/reports/reports-1998 > > /cucs-019-98.pdf > > > > > (rowKeyCompaction=cssTree64, > > cssTree256, etc) > > * modified to accept full byte range 0-255, store full byte strings > > in the tree, remove all pointers, optimize for cache line size, etc, > > but similar idea > > * CSS trees are considered sub-par because they are not easily > > modified, but that is ok in hbase's block cache and data blocks > > * input is a sorted list of byte[], output is a byte[] containing the > > entire prefix trie > > * possibly put all keys in the front of the block, and all values at > > the end, with value offsets/lengths in the keys > > > > Backwards compatibility > > * block cache could quickly compute the trie when loaded, therefore > > working with HFile v1 > > * eventually want to store the computed trie on disk > > * HFile v2 will need to store information about the pluggable data > > block compaction schemes > > > > > > Memstore dynamic row key compaction: > > * this will be much more difficult than the static compaction because > > it needs to accept unsorted byte[]'s and support row locking > > * use modified C-Burstsort< > > http://ww2.cs.mu.oz.au/~rsinha/papers/SinhaRingZobel-2006.pdf> > > for > > dynamic unsorted data > > * fast and compact (best dynamic string sorting algorithm i could > > find) > > * fragmentation friendly (not original goal, but ends up working like > > MSLAB) > > * allocates large (customizable) byte[] for holding multiple leaf values > > * "bursts" the byte[] when they fill up, creating multiple child > > arrays > > * will need to add some sort of lockable object to the leaf of the > > rowKey trie > > > > > > Column qualifier compaction: > > * config param colQualifierCompaction=(none, lookupTable, cssTree64, > > etc) > > * use lookupTable when column qualifiers are a small bounded set > > * store the lookup table in memory (similar to blockIndex or > > fixedFileTrailer) > > * on second thought, this could be difficult without 2-pass > > compaction, but maybe still possible > > * use cssTree when a large or unbounded set, like when appending a > > numerical suffix in wide rows > > > > > > Timestamp compaction: > > * timestampCompaction=(none, hfileDiff, cssTree, flatten) > > * each hfile needs a lowestTimestamp stored in the trailer > > * hflieDiff would store a varLong relative to the lowestTimestamp > > * cssTree would do the same as it does for rowKeys, but the trie > > overhead might wipe out the space savings with such small values > > * flatten would not store a timestamp in each record, rather use the > > hfile's lowestTimestamp for every record > > * many use cases don't care about the timestamp > > * i think this works for maxVersions=1. may need more elaborate > > scheme for multiple versions > > > > > > I'll stop there... sorry for the huge email! I haven't seen much > > discussion on email or jira about how prefix compression should be > > done, so I hope this provides somewhat of a starting point. > > > > Matt > > > > > > -- > Todd Lipcon > Software Engineer, Cloudera