Hey Matt,

Interesting email, and also something I've been thinking about recently.

Unfortunately, I think one of the big prerequisites before we can start
thinking about actual compression algorithms is some refactoring around how
KeyValue is used. Currently, KeyValue exposes byte arrays in a lot of places
- these byte arrays are expected to have discrete sub-ranges that represent
row key, column qualifier, etc. Various parts of the code directly grab
these byte arrays and offsets and do stuff with them (eg comparison,
serialization, etc).

In order to split the different components up, or do true prefix
compression, we'd need to be able to have KeyValue do some smarter things
for comparison, and never call functions like "getBuffer),
getRowKeyOffset(), getRowKeyLength()" expecting that to show us a byte array
range with a row key in it.

I think Jonathan Gray has some code hacked together for this, but last I
heard it wasn't in shippable state... Jonathan, you out there?

As for the particular compression algorithms, a nice first step would be a
very simplistic one: the first row key of any HFile block would be given in
full. Following that, each key uses a single byte token which identifies the
number of key components shared with the previous row (ie same row, same
col, different ts). This isn't quite as good as true prefix compression, but
for common cases where each cell has three versions, and each row has many
columns, it's probably a huge savings.

-Todd

On Thu, Jun 2, 2011 at 3:28 PM, Matt Corgan <mcor...@hotpads.com> wrote:

> I've been reading up on prefix compression for a few weeks trying to figure
> out the best approach.  Maybe some of you have noticed that 90% of my
> emails
> to the user list have something to do with it...  Now that people are
> working on HFile v2 and the topic of Lucene's FST has come up, I thought it
> would be a good time to share some thoughts.  My goal here is definitely
> not
> asking anyone to implement this, but just to start a discussion that gets
> the proper jiras in place.  I have spent a little time coding an initial
> rowKey trie builder that could be the start of a pluggable implementation.
>
> This is a big topic to address in one email, but I don't think the pieces
> make sense on their own.  Here's an outline with some thoughts:
>
> * refer to prefix compression as "compaction" to avoid interfering with
> traditional "compression"
>
> * main goal is to reduce size of data in memory to increase effective
> capacity of block index, block cache, and memstores
>    * disk access takes thousands of times longer than memory access,
> especially over hdfs/network
>
> * secondary goal is to speed up the processing of in memory data
>    * memory access takes 200+ cpu cycles, so must use cache friendly
> algorithms
>    * structure trie node sizes around typical 64B cache line size
>    * investigate java's memory prefetching capabilities, possibly
> increasing to ~256B nodes
>        * research
> paper<
> http://www.google.com/url?sa=t&source=web&cd=5&sqi=2&ved=0CDcQFjAE&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.68.4872%26rep%3Drep1%26type%3Dpdf&rct=j&q=databases%20prefetching&ei=7tnnTcSWLuXTiALSpNCUDA&usg=AFQjCNFpVC9oDsG1-UG0x6ZsqsIHLqvGDA&sig2=o096l8cNrnX69oeBGChUKg
> >about
> modern memory databases and prefetching
>    * almost always use variable length integer/long encoding
>
> * need separate compacted structures/algorithms for rowKey, colQualifier,
> and timestamp
>    * structures will need to interact with varInt pointers to link
> rowKey<->colQualifier<->timestamp
>
> * i think in the medium to long term (and maybe even short term) it would
> be
> best to code this from scratch
>
>
> Static row key compaction for block cache and data blocks:
> * avoid schemes where each row is a diff of the previous (expensive cpu)
> * utilize the fact that row keys are always sorted (unlike column
> qualifiers)
>    * this makes for fast tree construction during flushing and compaction
> * low hanging fruit: (rowKeyCompaction=simple)
>    * pull off the common prefix of all rows
>    * have a flag indicating if row key is repeat of the previous
> * more aggressive: use modified Cache Sensitive Search
> Trees<
> http://www.cs.columbia.edu/~library/TR-repository/reports/reports-1998/cucs-019-98.pdf
> >
> (rowKeyCompaction=cssTree64,
> cssTree256, etc)
>    * modified to accept full byte range 0-255, store full byte strings in
> the tree, remove all pointers, optimize for cache line size, etc, but
> similar idea
>    * CSS trees are considered sub-par because they are not easily modified,
> but that is ok in hbase's block cache and data blocks
> * input is a sorted list of byte[], output is a byte[] containing the
> entire
> prefix trie
> * possibly put all keys in the front of the block, and all values at the
> end, with value offsets/lengths in the keys
>
> Backwards compatibility
> * block cache could quickly compute the trie when loaded, therefore working
> with HFile v1
>    * eventually want to store the computed trie on disk
> * HFile v2 will need to store information about the pluggable data block
> compaction schemes
>
>
> Memstore dynamic row key compaction:
> * this will be much more difficult than the static compaction because it
> needs to accept unsorted byte[]'s and support row locking
> * use modified C-Burstsort<
> http://ww2.cs.mu.oz.au/~rsinha/papers/SinhaRingZobel-2006.pdf>
> for
> dynamic unsorted data
> * fast and compact (best dynamic string sorting algorithm i could find)
> * fragmentation friendly (not original goal, but ends up working like
> MSLAB)
>    * allocates large (customizable) byte[] for holding multiple leaf values
>    * "bursts" the byte[] when they fill up, creating multiple child arrays
> * will need to add some sort of lockable object to the leaf of the rowKey
> trie
>
>
> Column qualifier compaction:
> * config param colQualifierCompaction=(none, lookupTable, cssTree64, etc)
> * use lookupTable when column qualifiers are a small bounded set
>    * store the lookup table in memory (similar to blockIndex or
> fixedFileTrailer)
>    * on second thought, this could be difficult without 2-pass compaction,
> but maybe still possible
> * use cssTree when a large or unbounded set, like when appending a
> numerical
> suffix in wide rows
>
>
> Timestamp compaction:
> * timestampCompaction=(none, hfileDiff, cssTree, flatten)
> * each hfile needs a lowestTimestamp stored in the trailer
> * hflieDiff would store a varLong relative to the lowestTimestamp
> * cssTree would do the same as it does for rowKeys, but the trie overhead
> might wipe out the space savings with such small values
> * flatten would not store a timestamp in each record, rather use the
> hfile's
> lowestTimestamp for every record
>    * many use cases don't care about the timestamp
>    * i think this works for maxVersions=1.  may need more elaborate scheme
> for multiple versions
>
>
> I'll stop there...  sorry for the huge email!  I haven't seen much
> discussion on email or jira about how prefix compression should be done, so
> I hope this provides somewhat of a starting point.
>
> Matt
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Reply via email to