> Memstore dynamic row key compaction

This is interesting, though I think extremely difficult.  We need this
for Lucene realtime search with the RAM terms dictionary, which is
slated to use a ConcurrentSkipListMap, i.e., it's lock-free and it works.
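
Just to illustrate the shape of that, a toy version (not the actual
Lucene code; names are placeholders):

    import java.util.Map;
    import java.util.concurrent.ConcurrentSkipListMap;

    public class ToyRAMTermsDict {
        // term -> file pointer into the in-RAM postings; the skip list
        // keeps terms sorted and supports lock-free concurrent access
        private final ConcurrentSkipListMap<String, Long> terms =
            new ConcurrentSkipListMap<String, Long>();

        public void add(String term, long postingsFP) {
            terms.put(term, postingsFP);
        }

        // seek to the first term >= the given text, TermsEnum-style
        public Map.Entry<String, Long> seekCeil(String text) {
            return terms.ceilingEntry(text);
        }
    }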

As Todd mentions, KeyValue's need for access to the underlying
byte[] does create some difficulties with regard to how row ids would work.

I think we could start with making the block index more efficient.
And prefix compression would be a big win.  Is there a Jira for that?
Sounds like we need to make the HFile pluggable first regardless?

On Thu, Jun 2, 2011 at 3:28 PM, Matt Corgan <mcor...@hotpads.com> wrote:
> I've been reading up on prefix compression for a few weeks trying to figure
> out the best approach.  Maybe some of you have noticed that 90% of my emails
> to the user list have something to do with it...  Now that people are
> working on HFile v2 and the topic of Lucene's FST has come up, I thought it
> would be a good time to share some thoughts.  My goal here is definitely not
> to ask anyone to implement this, but just to start a discussion that gets
> the proper jiras in place.  I have spent a little time coding an initial
> rowKey trie builder that could be the start of a pluggable implementation.
>
> This is a big topic to address in one email, but I don't think the pieces
> make sense on their own.  Here's an outline with some thoughts:
>
> * refer to prefix compression as "compaction" to avoid confusion with
> traditional "compression"
>
> * main goal is to reduce size of data in memory to increase effective
> capacity of block index, block cache, and memstores
>    * disk access takes thousands of times longer than memory access,
> especially over hdfs/network
>
> * secondary goal is to speed up the processing of in memory data
>    * memory access takes 200+ cpu cycles, so we must use cache-friendly
> algorithms
>    * structure trie node sizes around typical 64B cache line size
>    * investigate java's memory prefetching capabilities, possibly
> increasing to ~256B nodes
>        * research paper about modern memory databases and prefetching:
> http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.68.4872&rep=rep1&type=pdf
>    * almost always use variable length integer/long encoding
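>
> To make the varint point concrete, the usual 7-bit scheme looks like
> this (a minimal sketch, untested; names are placeholders):
>
>     public class VarLong {
>         // 7 payload bits per byte, high bit set means "more bytes follow";
>         // small values (the common case) take 1-2 bytes instead of 8
>         public static int writeVarLong(long v, byte[] dest, int offset) {
>             while ((v & ~0x7FL) != 0) {
>                 dest[offset++] = (byte) ((v & 0x7FL) | 0x80L);
>                 v >>>= 7;
>             }
>             dest[offset++] = (byte) v;
>             return offset;  // caller continues writing at the returned offset
>         }
>     }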
>
> * need separate compacted structures/algorithms for rowKey, colQualifier,
> and timestamp
>    * structures will need to interact with varInt pointers to link
> rowKey<->colQualifier<->timestamp
>
> * I think in the medium to long term (and maybe even short term) it would be
> best to code this from scratch
>
>
> Static row key compaction for block cache and data blocks:
> * avoid schemes where each row is a diff of the previous (cpu expensive)
> * utilize the fact that row keys are always sorted (unlike column
> qualifiers)
>    * this makes for fast tree construction during flushing and compaction
> * low hanging fruit: (rowKeyCompaction=simple; rough sketch after this list)
>    * pull off the common prefix shared by all rows
>    * have a flag indicating whether a row key is a repeat of the previous one
> * more aggressive: use modified Cache Sensitive Search Trees
> (rowKeyCompaction=cssTree64, cssTree256, etc):
> http://www.cs.columbia.edu/~library/TR-repository/reports/reports-1998/cucs-019-98.pdf
>    * modified to accept full byte range 0-255, store full byte strings in
> the tree, remove all pointers, optimize for cache line size, etc, but
> similar idea
>    * CSS trees are considered sub-par because they are not easily modified,
> but that is ok in hbase's block cache and data blocks
> * input is a sorted list of byte[], output is a byte[] containing the entire
> prefix trie
> * possibly put all keys in the front of the block, and all values at the
> end, with value offsets/lengths in the keys
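>
> To make the "simple" scheme concrete, a rough sketch of an encoder
> (untested; assumes a non-empty sorted input and prefix/suffix lengths
> that fit in one byte where a real version would use varints):
>
>     import java.io.ByteArrayOutputStream;
>     import java.util.Arrays;
>     import java.util.List;
>
>     public class SimpleRowKeyCompacter {
>         // input is sorted, so the prefix shared by ALL keys is just the
>         // common prefix of the first and last key
>         public static byte[] encode(List<byte[]> sortedKeys) {
>             byte[] first = sortedKeys.get(0);
>             byte[] last = sortedKeys.get(sortedKeys.size() - 1);
>             int p = 0;
>             while (p < first.length && p < last.length && first[p] == last[p]) {
>                 p++;
>             }
>             ByteArrayOutputStream out = new ByteArrayOutputStream();
>             out.write(p);                      // shared prefix length
>             out.write(first, 0, p);            // prefix bytes, stored once
>             byte[] prev = null;
>             for (byte[] key : sortedKeys) {
>                 if (prev != null && Arrays.equals(key, prev)) {
>                     out.write(1);              // flag: repeat of previous row
>                 } else {
>                     out.write(0);
>                     out.write(key.length - p); // suffix length
>                     out.write(key, p, key.length - p);
>                 }
>                 prev = key;
>             }
>             return out.toByteArray();
>         }
>     }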
>
> Backwards compatibility
> * block cache could quickly compute the trie when loaded, therefore working
> with HFile v1
>    * eventually want to store the computed trie on disk
> * HFile v2 will need to store information about the pluggable data block
> compaction schemes
>
>
> Memstore dynamic row key compaction:
> * this will be much more difficult than the static compaction because it
> needs to accept unsorted byte[]'s and support row locking
> * use modified C-Burstsort for dynamic unsorted data (toy sketch after
> this list): http://ww2.cs.mu.oz.au/~rsinha/papers/SinhaRingZobel-2006.pdf
> * fast and compact (the best dynamic string sorting algorithm I could find)
> * fragmentation friendly (not original goal, but ends up working like MSLAB)
>    * allocates large (customizable) byte[] for holding multiple leaf values
>    * "bursts" the byte[] when they fill up, creating multiple child arrays
> * will need to add some sort of lockable object to the leaf of the rowKey
> trie
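>
> The burst behavior is roughly this (a toy sketch, untested, assuming
> non-empty keys with lengths < 128; the real thing needs varints, the
> lockable leaf object, etc):
>
>     import java.util.Arrays;
>
>     public class BurstLeaf {
>         private static final int SLAB = 4096;  // customizable slab size
>         private final byte[] slab = new byte[SLAB];
>         private int used = 0;
>
>         // append [length][bytes]; false means full -> caller bursts
>         public boolean add(byte[] key) {
>             if (used + 1 + key.length > slab.length) return false;
>             slab[used++] = (byte) key.length;
>             System.arraycopy(key, 0, slab, used, key.length);
>             used += key.length;
>             return true;
>         }
>
>         // redistribute suffixes into children keyed by leading byte,
>         // stripping that byte (it becomes the trie edge label)
>         public BurstLeaf[] burst() {
>             BurstLeaf[] children = new BurstLeaf[256];
>             int pos = 0;
>             while (pos < used) {
>                 int len = slab[pos++];
>                 int head = slab[pos] & 0xFF;
>                 if (children[head] == null) children[head] = new BurstLeaf();
>                 children[head].add(Arrays.copyOfRange(slab, pos + 1, pos + len));
>                 pos += len;
>             }
>             return children;
>         }
>     }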
>
>
> Column qualifier compaction:
> * config param colQualifierCompaction=(none, lookupTable, cssTree64, etc)
> * use lookupTable when column qualifiers are a small bounded set (toy
> sketch after this list)
>    * store the lookup table in memory (similar to blockIndex or
> fixedFileTrailer)
>    * on second thought, this could be difficult without 2-pass compaction,
> but maybe still possible
> * use cssTree when a large or unbounded set, like when appending a numerical
> suffix in wide rows
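>
> For the lookupTable option above, something along these lines (toy
> sketch, untested; names are placeholders):
>
>     import java.nio.charset.StandardCharsets;
>     import java.util.ArrayList;
>     import java.util.HashMap;
>
>     public class QualifierLookupTable {
>         private final HashMap<String, Integer> ids = new HashMap<String, Integer>();
>         private final ArrayList<byte[]> table = new ArrayList<byte[]>();
>
>         // each distinct qualifier is stored once; KeyValues then carry a
>         // small varint id instead of the full qualifier byte[]
>         public int idFor(byte[] qualifier) {
>             // ISO-8859-1 round-trips arbitrary bytes, so it's a safe map key
>             String key = new String(qualifier, StandardCharsets.ISO_8859_1);
>             Integer id = ids.get(key);
>             if (id == null) {
>                 id = table.size();
>                 table.add(qualifier.clone());
>                 ids.put(key, id);
>             }
>             return id;
>         }
>
>         // the table itself would be persisted like blockIndex/fixedFileTrailer
>         public byte[] qualifierFor(int id) {
>             return table.get(id);
>         }
>     }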
>
>
> Timestamp compaction:
> * timestampCompaction=(none, hfileDiff, cssTree, flatten)
> * each hfile needs a lowestTimestamp stored in the trailer
> * hfileDiff would store a varLong relative to the lowestTimestamp (sketch
> after this list)
> * cssTree would do the same as it does for rowKeys, but the trie overhead
> might wipe out the space savings with such small values
> * flatten would not store a timestamp in each record, rather use the hfile's
> lowestTimestamp for every record
>    * many use cases don't care about the timestamp
>    * I think this works for maxVersions=1.  May need a more elaborate scheme
> for multiple versions
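>
> The hfileDiff idea in one method (sketch, untested):
>
>     public class TimestampDiffCodec {
>         // write ts as a varLong delta from the hfile's lowestTimestamp;
>         // deltas are non-negative and usually small, so they encode in a
>         // byte or two instead of a full 8-byte long
>         public static int writeTimestamp(long ts, long lowestTimestamp,
>                                          byte[] dest, int offset) {
>             long delta = ts - lowestTimestamp;
>             while ((delta & ~0x7FL) != 0) {
>                 dest[offset++] = (byte) ((delta & 0x7FL) | 0x80L);
>                 delta >>>= 7;
>             }
>             dest[offset++] = (byte) delta;
>             return offset;
>         }
>     }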
>
>
> I'll stop there...  sorry for the huge email!  I haven't seen much
> discussion on email or jira about how prefix compression should be done, so
> I hope this provides somewhat of a starting point.
>
> Matt
>
