> Memstore dynamic row key compaction

This is interesting, though I think extremely difficult. We need this for Lucene realtime search with the RAM terms dictionary, which is slated to use a ConcurrentSkipListMap, i.e. it's lock-free and it works.
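(A rough illustration of what I mean, just a sketch rather than the actual Lucene or HBase code; the class and field names below are made up. The point is that a ConcurrentSkipListMap keyed on raw byte[] gives you a lock-free, always-sorted in-memory structure:)

import java.util.Comparator;
import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative sketch only: a lock-free, sorted, in-memory terms dictionary keyed on raw byte[].
public class RamTermsDictSketch {

  // unsigned lexicographic comparator, the same ordering HBase's Bytes.compareTo produces
  static final Comparator<byte[]> BYTES_COMPARATOR = new Comparator<byte[]>() {
    public int compare(byte[] a, byte[] b) {
      int n = Math.min(a.length, b.length);
      for (int i = 0; i < n; i++) {
        int diff = (a[i] & 0xff) - (b[i] & 0xff);
        if (diff != 0) return diff;
      }
      return a.length - b.length;
    }
  };

  private final ConcurrentSkipListMap<byte[], Long> terms =
      new ConcurrentSkipListMap<byte[], Long>(BYTES_COMPARATOR);

  public void add(byte[] term, long postingsPointer) {
    terms.put(term, postingsPointer);  // CAS-based skip list, no external locking required
  }
}

The downside is that every entry still holds the full key bytes, which is exactly the memory cost the compaction ideas below are trying to cut.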
As Todd mentions, the KeyValue's need for access to the underlying byte[] does create some difficulties with regard to how row ids would work. I think we could start with making the block index more efficient. And prefix compression would be a big win. Is there a Jira for that? Sounds like we need to make the HFile pluggable first regardless?

On Thu, Jun 2, 2011 at 3:28 PM, Matt Corgan <mcor...@hotpads.com> wrote:
> I've been reading up on prefix compression for a few weeks trying to figure out the best approach. Maybe some of you have noticed that 90% of my emails to the user list have something to do with it... Now that people are working on HFile v2 and the topic of Lucene's FST has come up, I thought it would be a good time to share some thoughts. My goal here is definitely not asking anyone to implement this, but just to start a discussion that gets the proper jiras in place. I have spent a little time coding an initial rowKey trie builder that could be the start of a pluggable implementation.
>
> This is a big topic to address in one email, but I don't think the pieces make sense on their own. Here's an outline with some thoughts:
>
> * refer to prefix compression as "compaction" to avoid interfering with traditional "compression"
>
> * main goal is to reduce size of data in memory to increase effective capacity of block index, block cache, and memstores
>   * disk access takes thousands of times longer than memory access, especially over hdfs/network
>
> * secondary goal is to speed up the processing of in memory data
>   * memory access takes 200+ cpu cycles, so must use cache friendly algorithms
>   * structure trie node sizes around typical 64B cache line size
>   * investigate java's memory prefetching capabilities, possibly increasing to ~256B nodes
>   * research paper <http://www.google.com/url?sa=t&source=web&cd=5&sqi=2&ved=0CDcQFjAE&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.68.4872%26rep%3Drep1%26type%3Dpdf&rct=j&q=databases%20prefetching&ei=7tnnTcSWLuXTiALSpNCUDA&usg=AFQjCNFpVC9oDsG1-UG0x6ZsqsIHLqvGDA&sig2=o096l8cNrnX69oeBGChUKg> about modern memory databases and prefetching
>   * almost always use variable length integer/long encoding
>
> * need separate compacted structures/algorithms for rowKey, colQualifier, and timestamp
>   * structures will need to interact with varInt pointers to link rowKey<->colQualifier<->timestamp
>
> * i think in the medium to long term (and maybe even short term) it would be best to code this from scratch
>
>
> Static row key compaction for block cache and data blocks:
> * avoid schemes where each row is a diff of the previous (expensive cpu)
> * utilize the fact that row keys are always sorted (unlike column qualifiers)
>   * this makes for fast tree construction during flushing and compaction
> * low hanging fruit: (rowKeyCompaction=simple)
>   * pull off the common prefix of all rows
>   * have a flag indicating if row key is repeat of the previous
> * more aggressive: use modified Cache Sensitive Search Trees <http://www.cs.columbia.edu/~library/TR-repository/reports/reports-1998/cucs-019-98.pdf> (rowKeyCompaction=cssTree64, cssTree256, etc)
>   * modified to accept full byte range 0-255, store full byte strings in the tree, remove all pointers, optimize for cache line size, etc, but similar idea
>   * CSS trees are considered sub-par because they are not easily modified, but that is ok in hbase's block cache and data blocks
>   * input is a sorted list of byte[], output is a byte[] containing the entire prefix trie
> * possibly put all keys in the front of the block, and all values at the end, with value offsets/lengths in the keys
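[interjecting] To make sure I follow the "low hanging fruit" scheme, here is roughly how I picture a block writer encoding a sorted run of row keys: strip the shared prefix once, write a repeat flag per row, and varint-encode the suffix lengths. The class name and byte layout are made up for illustration, not a proposed format:

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of rowKeyCompaction=simple: common prefix + repeat flag + varint suffix lengths.
public class SimpleRowKeyCompactor {

  public static byte[] encode(List<byte[]> sortedRowKeys) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(buffer);

    // 1. pull off the prefix common to all rows and write it once
    byte[] prefix = commonPrefix(sortedRowKeys);
    writeVarInt(out, prefix.length);
    out.write(prefix);
    writeVarInt(out, sortedRowKeys.size());

    // 2. per row: 1-byte flag (1 = repeat of previous row key), otherwise varint suffix length + suffix bytes
    byte[] previous = null;
    for (byte[] key : sortedRowKeys) {
      if (previous != null && Arrays.equals(key, previous)) {
        out.writeByte(1);
      } else {
        out.writeByte(0);
        writeVarInt(out, key.length - prefix.length);
        out.write(key, prefix.length, key.length - prefix.length);
      }
      previous = key;
    }
    return buffer.toByteArray();
  }

  // longest prefix shared by every key in the (sorted, non-empty) list
  static byte[] commonPrefix(List<byte[]> keys) {
    byte[] first = keys.get(0);
    int len = first.length;
    for (byte[] k : keys) {
      int i = 0;
      while (i < len && i < k.length && k[i] == first[i]) {
        i++;
      }
      len = i;
    }
    return Arrays.copyOf(first, len);
  }

  // standard 7-bits-per-byte varint, high bit set on continuation bytes
  static void writeVarInt(DataOutputStream out, int value) throws IOException {
    while ((value & ~0x7f) != 0) {
      out.writeByte((value & 0x7f) | 0x80);
      value >>>= 7;
    }
    out.writeByte(value);
  }
}

A reader would walk the same layout front to back, which keeps scans sequential and cache friendly.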
> Backwards compatibility
> * block cache could quickly compute the trie when loaded, therefore working with HFile v1
> * eventually want to store the computed trie on disk
> * HFile v2 will need to store information about the pluggable data block compaction schemes
>
>
> Memstore dynamic row key compaction:
> * this will be much more difficult than the static compaction because it needs to accept unsorted byte[]'s and support row locking
> * use modified C-Burstsort <http://ww2.cs.mu.oz.au/~rsinha/papers/SinhaRingZobel-2006.pdf> for dynamic unsorted data
>   * fast and compact (best dynamic string sorting algorithm i could find)
>   * fragmentation friendly (not original goal, but ends up working like MSLAB)
>   * allocates large (customizable) byte[] for holding multiple leaf values
>   * "bursts" the byte[] when they fill up, creating multiple child arrays
> * will need to add some sort of lockable object to the leaf of the rowKey trie
>
>
> Column qualifier compaction:
> * config param colQualifierCompaction=(none, lookupTable, cssTree64, etc)
> * use lookupTable when column qualifiers are a small bounded set
>   * store the lookup table in memory (similar to blockIndex or fixedFileTrailer)
>   * on second thought, this could be difficult without 2-pass compaction, but maybe still possible
> * use cssTree when a large or unbounded set, like when appending a numerical suffix in wide rows
>
>
> Timestamp compaction:
> * timestampCompaction=(none, hfileDiff, cssTree, flatten)
> * each hfile needs a lowestTimestamp stored in the trailer
> * hfileDiff would store a varLong relative to the lowestTimestamp
> * cssTree would do the same as it does for rowKeys, but the trie overhead might wipe out the space savings with such small values
> * flatten would not store a timestamp in each record, rather use the hfile's lowestTimestamp for every record
>   * many use cases don't care about the timestamp
>   * i think this works for maxVersions=1. may need more elaborate scheme for multiple versions
>
>
> I'll stop there... sorry for the huge email! I haven't seen much discussion on email or jira about how prefix compression should be done, so I hope this provides somewhat of a starting point.
>
> Matt
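A couple of quick sketches of the simpler schemes above, just to make the discussion concrete. First, colQualifierCompaction=lookupTable: dictionary-encode a small bounded set of qualifiers and store only a small integer id per KeyValue, with the table itself written once near the blockIndex / fixedFileTrailer. The class below is made up for illustration:

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of colQualifierCompaction=lookupTable: map each distinct qualifier to a small id.
public class QualifierLookupTable {

  private final Map<ByteBuffer, Integer> idByQualifier = new HashMap<ByteBuffer, Integer>();
  private final List<byte[]> qualifierById = new ArrayList<byte[]>();

  // writer side: returns the id that would be stored (varint-encoded) in place of the qualifier bytes
  public int encode(byte[] qualifier) {
    ByteBuffer key = ByteBuffer.wrap(qualifier);   // ByteBuffer gives content-based equals/hashCode
    Integer id = idByQualifier.get(key);
    if (id == null) {
      byte[] copy = qualifier.clone();             // defensive copy so later mutation can't corrupt the table
      id = qualifierById.size();
      idByQualifier.put(ByteBuffer.wrap(copy), id);
      qualifierById.add(copy);
    }
    return id;
  }

  // reader side: look the qualifier back up from the table loaded alongside the block index / trailer
  public byte[] decode(int id) {
    return qualifierById.get(id);
  }
}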
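And timestampCompaction=hfileDiff: store each timestamp as a varLong delta against the hfile's lowestTimestamp from the trailer. Again an illustrative sketch, not actual HBase code:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Illustrative sketch of timestampCompaction=hfileDiff: per-record varLong delta from the hfile's lowestTimestamp.
public class TimestampDiffCodec {

  private final long lowestTimestamp;  // would be read from / written to the hfile trailer

  public TimestampDiffCodec(long lowestTimestamp) {
    this.lowestTimestamp = lowestTimestamp;
  }

  public void write(DataOutput out, long timestamp) throws IOException {
    writeVarLong(out, timestamp - lowestTimestamp);  // small non-negative delta
  }

  public long read(DataInput in) throws IOException {
    return lowestTimestamp + readVarLong(in);
  }

  // 7-bits-per-byte varLong, high bit set on continuation bytes
  static void writeVarLong(DataOutput out, long value) throws IOException {
    while ((value & ~0x7fL) != 0) {
      out.writeByte((int) ((value & 0x7f) | 0x80));
      value >>>= 7;
    }
    out.writeByte((int) value);
  }

  static long readVarLong(DataInput in) throws IOException {
    long result = 0;
    int shift = 0;
    while (true) {
      int b = in.readByte() & 0xff;
      result |= ((long) (b & 0x7f)) << shift;
      if ((b & 0x80) == 0) {
        return result;
      }
      shift += 7;
    }
  }
}

Deltas within a single hfile are typically much smaller than a full 8-byte timestamp, so the varLong usually saves several bytes per record.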