[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Matt Corgan updated HBASE-4676: ------------------------------- Status: Patch Available (was: Open) > Prefix Compression - Trie data block encoding > --------------------------------------------- > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver > Affects Versions: 0.96.0 > Reporter: Matt Corgan > Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > hbase-prefix-trie-0.1.jar, PrefixTrie_Format_v1.pdf, > PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be slower than delta encoding, > but with a much higher decode speed. I have not yet created a thorough > benchmark for write speed nor sequential read speed. > Though the trie is reaching a point where it is internally very efficient > (probably within half or a quarter of its max read speed) the way that hbase > currently uses it is far from optimal. The KeyValueScanner and related > classes that iterate through the trie will eventually need to be smarter and > have methods to do things like skipping to the next row of results without > scanning every cell in between. When that is accomplished it will also allow > much faster compactions because the full row key will not have to be compared > as often as it is now. > Current code is on github. The trie code is in a separate project than the > slightly modified hbase. There is an hbase project there as well with the > DeltaEncoding patch applied, and it builds on top of that. > https://github.com/hotpads/hbase/tree/prefix-trie-1 > https://github.com/hotpads/hbase-prefix-trie/tree/hcell-scanners > I'll follow up later with more implementation ideas. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira