[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573548#comment-13573548 ] Matteo Bertozzi commented on HBASE-4676: Is there a reason why the prefix-tree and cell code is inside org.apache.hbase instead of the traditional org.apache.hadoop.hbase? > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan >Priority: Critical > Attachments: 4676-prefix-tree-trunk-v16.patch, > 4676-prefix-tree-trunk-v17.patch, HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v11.patch, > HBASE-4676-prefix-tree-trunk-v12.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v14.patch, > HBASE-4676-prefix-tree-trunk-v15.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the dat
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573247#comment-13573247 ] Hudson commented on HBASE-4676: --- Integrated in HBase-TRUNK #3859 (See [https://builds.apache.org/job/HBase-TRUNK/3859/]) HBASE-4676 Prefix Compression - Trie data block encoding (Matt Corgan) (Revision 1443289) Result = ABORTED tedyu : Files : * /hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncoding.java * /hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/util/ByteRangeTool.java * /hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/util/Bytes.java * /hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/util/test/RedundantKVGenerator.java * /hbase/trunk/hbase-common/src/main/java/org/apache/hbase/cell/CellComparator.java * /hbase/trunk/hbase-common/src/main/java/org/apache/hbase/cell/CellOutputStream.java * /hbase/trunk/hbase-prefix-tree * /hbase/trunk/hbase-prefix-tree/pom.xml * /hbase/trunk/hbase-prefix-tree/src * /hbase/trunk/hbase-prefix-tree/src/main * /hbase/trunk/hbase-prefix-tree/src/main/java * /hbase/trunk/hbase-prefix-tree/src/main/java/org * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/PrefixTreeBlockMeta.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/PrefixTreeCodec.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/PrefixTreeSeeker.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/ArraySearcherPool.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/DecoderFactory.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/PrefixTreeArrayReversibleScanner.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/PrefixTreeArrayScanner.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/PrefixTreeArraySearcher.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/PrefixTreeCell.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/column * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/column/ColumnNodeReader.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/column/ColumnReader.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/row * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/row/RowNodeReader.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/timestamp * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/timestamp/MvccVersionDecoder.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/timestamp/TimestampDecoder.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/EncoderFactory.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/EncoderPool.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/PrefixTreeEncoder.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/ThreadLocalEncoderPool.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/column * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/column/ColumnNodeWriter.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/column/ColumnSectionWriter.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/other * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/other/CellTypeEncoder.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/other/LongEncoder.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/row * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/row/RowNodeWriter.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/row
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573246#comment-13573246 ] Ted Yu commented on HBASE-4676: --- https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/396/console seems to be close to normal. HBase-TRUNK-on-Hadoop-2.0.0 #395 and trunk build #3859 were both on ubuntu1. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan >Priority: Critical > Attachments: 4676-prefix-tree-trunk-v16.patch, > 4676-prefix-tree-trunk-v17.patch, HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v11.patch, > HBASE-4676-prefix-tree-trunk-v12.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v14.patch, > HBASE-4676-prefix-tree-trunk-v15.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need t
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573237#comment-13573237 ] Ted Yu commented on HBASE-4676: --- https://builds.apache.org/job/HBase-TRUNK/3859/console shows some test failures which didn't happen in previous trunk builds. Will investigate tomorrow morning. If trunk builds fail consecutively, will consider reverting. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan >Priority: Critical > Attachments: 4676-prefix-tree-trunk-v16.patch, > 4676-prefix-tree-trunk-v17.patch, HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v11.patch, > HBASE-4676-prefix-tree-trunk-v12.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v14.patch, > HBASE-4676-prefix-tree-trunk-v15.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of opti
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573179#comment-13573179 ] Ted Yu commented on HBASE-4676: --- I ran a few tests, marked failed by HBase-TRUNK-on-Hadoop-2.0.0 #395, locally: 1220 mvn clean test -Dhadoop.profile=2.0 -PlocalTests -Dtest=TestFullLogReconstruction 1222 mvn test -Dhadoop.profile=2.0 -PlocalTests -Dtest=TestDelayedRpc 1223 mvn test -Dhadoop.profile=2.0 -PlocalTests -Dtest=TestOfflineMetaRebuildHole,TestMergeTable They passed. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan >Priority: Critical > Attachments: 4676-prefix-tree-trunk-v16.patch, > 4676-prefix-tree-trunk-v17.patch, HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v11.patch, > HBASE-4676-prefix-tree-trunk-v12.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v14.patch, > HBASE-4676-prefix-tree-trunk-v15.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While prog
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573171#comment-13573171 ] Hudson commented on HBASE-4676: --- Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #395 (See [https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/395/]) HBASE-4676 Prefix Compression - Trie data block encoding (Matt Corgan) (Revision 1443289) Result = FAILURE tedyu : Files : * /hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncoding.java * /hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/util/ByteRangeTool.java * /hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/util/Bytes.java * /hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/util/test/RedundantKVGenerator.java * /hbase/trunk/hbase-common/src/main/java/org/apache/hbase/cell/CellComparator.java * /hbase/trunk/hbase-common/src/main/java/org/apache/hbase/cell/CellOutputStream.java * /hbase/trunk/hbase-prefix-tree * /hbase/trunk/hbase-prefix-tree/pom.xml * /hbase/trunk/hbase-prefix-tree/src * /hbase/trunk/hbase-prefix-tree/src/main * /hbase/trunk/hbase-prefix-tree/src/main/java * /hbase/trunk/hbase-prefix-tree/src/main/java/org * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/PrefixTreeBlockMeta.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/PrefixTreeCodec.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/PrefixTreeSeeker.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/ArraySearcherPool.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/DecoderFactory.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/PrefixTreeArrayReversibleScanner.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/PrefixTreeArrayScanner.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/PrefixTreeArraySearcher.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/PrefixTreeCell.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/column * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/column/ColumnNodeReader.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/column/ColumnReader.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/row * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/row/RowNodeReader.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/timestamp * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/timestamp/MvccVersionDecoder.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/timestamp/TimestampDecoder.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/EncoderFactory.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/EncoderPool.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/PrefixTreeEncoder.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/ThreadLocalEncoderPool.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/column * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/column/ColumnNodeWriter.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/column/ColumnSectionWriter.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/other * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/other/CellTypeEncoder.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/other/LongEncoder.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/row * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/row/RowNodeWriter.java * /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hba
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573044#comment-13573044 ] Ted Yu commented on HBASE-4676: --- Integrated to trunk (patch v17). Thanks for the patch, Matt. Thanks for the review, Stack. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan >Priority: Critical > Attachments: 4676-prefix-tree-trunk-v16.patch, > 4676-prefix-tree-trunk-v17.patch, HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v11.patch, > HBASE-4676-prefix-tree-trunk-v12.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v14.patch, > HBASE-4676-prefix-tree-trunk-v15.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably sho
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573039#comment-13573039 ] Hadoop QA commented on HBASE-4676: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12568325/4676-prefix-tree-trunk-v17.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 105 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. The patch introduces lines longer than 100 {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.coprocessor.example.TestBulkDeleteProtocol Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/4360//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4360//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4360//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4360//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4360//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4360//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4360//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4360//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4360//console This message is automatically generated. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan >Priority: Critical > Attachments: 4676-prefix-tree-trunk-v16.patch, > 4676-prefix-tree-trunk-v17.patch, HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v11.patch, > HBASE-4676-prefix-tree-trunk-v12.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v14.patch, > HBASE-4676-prefix-tree-trunk-v15.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572987#comment-13572987 ] Matt Corgan commented on HBASE-4676: yep - will do. i'll create a few follow on issues as well for a benchmark, tool for testing on cluster data, pooling decoders, etc > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan >Priority: Critical > Attachments: 4676-prefix-tree-trunk-v16.patch, > HBASE-4676-0.94-v1.patch, HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v11.patch, > HBASE-4676-prefix-tree-trunk-v12.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v14.patch, > HBASE-4676-prefix-tree-trunk-v15.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could >
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572986#comment-13572986 ] stack commented on HBASE-4676: -- +1 on integrating. [~mcorgan] Any chance of adding a release note up here in JIRA? > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan >Priority: Critical > Attachments: 4676-prefix-tree-trunk-v16.patch, > HBASE-4676-0.94-v1.patch, HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v11.patch, > HBASE-4676-prefix-tree-trunk-v12.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v14.patch, > HBASE-4676-prefix-tree-trunk-v15.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be slower tha
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572973#comment-13572973 ] Ted Yu commented on HBASE-4676: --- Will integrate later in the evening if there is no further review comment. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan >Priority: Critical > Attachments: 4676-prefix-tree-trunk-v16.patch, > HBASE-4676-0.94-v1.patch, HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v11.patch, > HBASE-4676-prefix-tree-trunk-v12.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v14.patch, > HBASE-4676-prefix-tree-trunk-v15.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be slower than delta e
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572967#comment-13572967 ] Ted Yu commented on HBASE-4676: --- A diff of the findbugs warnings shows the following: {code} tyu$ diff 4348-newPatchFindbugsWarningshbase-server.xml 4354-newPatchFindbugsWarningshbase-server.xml ... 1196a1197,1205 >category="STYLE"> > >start="72" end="290" sourcefile="ThriftServer.java" > sourcepath="org/apache/hadoop/hbase/thrift2/ThriftServer.java"/> > > name="main" signature="([Ljava/lang/String;)V" isStatic="true"> >start="202" end="290" startBytecode="0" endBytecode="1246" > sourcefile="ThriftServer.java" > sourcepath="org/apache/hadoop/hbase/thrift2/ThriftServer.java"/> > > start="284" end="284" startBytecode="479" endBytecode="479" > sourcefile="ThriftServer.java" > sourcepath="org/apache/hadoop/hbase/thrift2/ThriftServer.java"/> > {code} The class was not touched by this JIRA. Looks like HBASE-7757 touched that class. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan >Priority: Critical > Attachments: 4676-prefix-tree-trunk-v16.patch, > HBASE-4676-0.94-v1.patch, HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v11.patch, > HBASE-4676-prefix-tree-trunk-v12.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v14.patch, > HBASE-4676-prefix-tree-trunk-v15.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572700#comment-13572700 ] Ted Yu commented on HBASE-4676: --- https://builds.apache.org/job/PreCommit-HBASE-Build/4354/artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.xml shows there is no findbugs warning. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan >Priority: Critical > Attachments: 4676-prefix-tree-trunk-v16.patch, > HBASE-4676-0.94-v1.patch, HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v11.patch, > HBASE-4676-prefix-tree-trunk-v12.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v14.patch, > HBASE-4676-prefix-tree-trunk-v15.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572697#comment-13572697 ] Hadoop QA commented on HBASE-4676: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12568264/4676-prefix-tree-trunk-v16.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 105 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. The patch introduces lines longer than 100 {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/4354//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4354//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4354//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4354//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4354//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4354//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4354//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4354//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4354//console This message is automatically generated. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan >Priority: Critical > Attachments: 4676-prefix-tree-trunk-v16.patch, > HBASE-4676-0.94-v1.patch, HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v11.patch, > HBASE-4676-prefix-tree-trunk-v12.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v14.patch, > HBASE-4676-prefix-tree-trunk-v15.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > probl
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572589#comment-13572589 ] Ted Yu commented on HBASE-4676: --- For PrefixTreeArrayScanner, there is one findbug warning. See https://builds.apache.org/job/PreCommit-HBASE-Build/4351/artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.xml I guess the long line warning was due to presence of hbase-prefix-tree/pom.xml: +http://maven.apache.org/POM/4.0.0"; xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"; xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd";> > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan >Priority: Critical > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v11.patch, > HBASE-4676-prefix-tree-trunk-v12.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v14.patch, > HBASE-4676-prefix-tree-trunk-v15.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current dr
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572269#comment-13572269 ] Hadoop QA commented on HBASE-4676: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12568182/HBASE-4676-prefix-tree-trunk-v15.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 105 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 findbugs{color}. The patch appears to introduce 1 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. The patch introduces lines longer than 100 {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/4351//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4351//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4351//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4351//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4351//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4351//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4351//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4351//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4351//console This message is automatically generated. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan >Priority: Critical > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v11.patch, > HBASE-4676-prefix-tree-trunk-v12.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v14.patch, > HBASE-4676-prefix-tree-trunk-v15.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * tri
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570602#comment-13570602 ] Ted Yu commented on HBASE-4676: --- The 12 findbugs warnings can be found here: https://builds.apache.org/job/PreCommit-HBASE-Build/4313/artifact/trunk/patchprocess/patchFindbugsWarningshbase-prefix-tree.xml > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan >Priority: Critical > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v11.patch, > HBASE-4676-prefix-tree-trunk-v12.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v14.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570566#comment-13570566 ] Matt Corgan commented on HBASE-4676: Above test did not fail for me. I'll try to nix the findbugs. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan >Priority: Critical > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v11.patch, > HBASE-4676-prefix-tree-trunk-v12.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v14.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be slower than delta encoding, > but with a much higher decode speed. I have not yet created a th
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570405#comment-13570405 ] stack commented on HBASE-4676: -- [~mcorgan] Findbugs is complaining. Given the list of complaints a once over, at least the ones in the new module. If you want to punt for now, and do later, that is fine. I'll just up findbugs tolerated count for now to get the patch in (we are working on getting findbugs overall to zero). Does above failed unit test fail for you? > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan >Priority: Critical > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v11.patch, > HBASE-4676-prefix-tree-trunk-v12.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v14.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of o
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570067#comment-13570067 ] Hadoop QA commented on HBASE-4676: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12567811/HBASE-4676-prefix-tree-trunk-v14.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 105 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 findbugs{color}. The patch appears to introduce 12 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. The patch introduces lines longer than 100 {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.util.TestHBaseFsck Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/4313//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4313//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4313//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4313//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4313//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4313//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4313//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4313//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4313//console This message is automatically generated. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan >Priority: Critical > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v11.patch, > HBASE-4676-prefix-tree-trunk-v12.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v14.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570040#comment-13570040 ] Hadoop QA commented on HBASE-4676: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12567806/HBASE-4676-prefix-tree-trunk-v13.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 105 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 findbugs{color}. The patch appears to introduce 12 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. The patch introduces lines longer than 100 {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/4312//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4312//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4312//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4312//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4312//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4312//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4312//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/4312//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/4312//console This message is automatically generated. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan >Priority: Critical > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v11.patch, > HBASE-4676-prefix-tree-trunk-v12.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v14.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely e
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570027#comment-13570027 ] Matt Corgan commented on HBASE-4676: Oh - that was easy. I have a static final boolean PrefixTreeEncoder.MULITPLE_FAMILIES_POSSIBLE that is set to false. The failing test is testing correctness on multiple families, so good sign that it fails =). I guess I will comment out the test for now. Will upload new patch based on latest trunk. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan >Priority: Critical > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v11.patch, > HBASE-4676-prefix-tree-trunk-v12.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to opti
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570024#comment-13570024 ] Matt Corgan commented on HBASE-4676: I am getting the test failures. Not sure how that happened. Will look into it. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan >Priority: Critical > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v11.patch, > HBASE-4676-prefix-tree-trunk-v12.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be slower than delta encoding, > but with a much higher decode speed. I have not yet created a thorough > benchmark for wri
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570015#comment-13570015 ] stack commented on HBASE-4676: -- Thanks [~mcorgan] bq. See HFileDataBlockEncoderImpl.newOnDiskDataBlockDecodingContext. It will take some work to use it for client/rpc though As was said above by myself already (only I forgot about it but independently rediscovered my conclusion earlier today), EncodedDataBlock is inappropriate to rpc what with its KV fixation -- and worse, ByteBuffers that are supposed to have serialized KVs in them -- and its hfile pollution. I'm going to go basic and just pass a byte array accompanied by a class name to use to decode it; class must implement CellScanner (Thinking on it, and you confirm above), CellScanner is enough; looking at how we write from client to server it is fine... no searching and ditto on way out. One day we could make use of a CellSearcher but a bunch of at least server-internals code would need redoing first On getting this patch in, you have the test failures I report above? Or, let me just run this by jenkins now. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan >Priority: Critical > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v11.patch, > HBASE-4676-prefix-tree-trunk-v12.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few smal
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13569930#comment-13569930 ] Matt Corgan commented on HBASE-4676: Some quick answers for your later questions: {quote}If I wanted to pass a prefixtree block of Cells on rpc, I see that I could write the block with a CellOutputStream/PrefixTreeDecoder but I'm not sure I get how I decode.{quote}I think eventually we want to use the same code that encodes/decodes to disk. See HFileDataBlockEncoderImpl.newOnDiskDataBlockDecodingContext. It will take some work to use it for client/rpc though. Code is somewhat confusing now because it forks for encoded/non-encoded blocks. {quote}I suppose I should aim for CellSearcher being the minimal Interface rpc block readers should implement? It is pretty heavyweight though compared to CellScanner at least. Would CellScanner be too low down?{quote}CellScanner is enough. Current client interface would scan through every cell, explode them to KeyValues, and stick the KeyValue[] into the org.apache.hadoop.hbase.client.Result object. Or are you trying to get rid of Result now? I was thinking that could wait till later. {quote}Can I select a prefixtrie block after this patch is committed and have hfiles read and written w/ prefixtree?{quote}Yes, should be fully functional. {quote}You think we should commit as is? It still needs a bunch of work but I'd be up for getting it in as is and having new issues to do fix up.{quote}Yeah, that's what i'm thinking. Some of the next steps are common to all codecs, like pooling the decoders. Future patches would be smaller and more understandable if the current code is committed. Regarding RPC stuff you're working on: In general, I don't think you should use Prefix-tree specific code for rpc. It should use interfaces like DataBlockEncoder and HFileDataBlockEncoder so that other encoder implementations can be used. For example, the existing encoders like PREFIX and FAST_DIFF are probably better for rpc since rpc does not require random access. Prefix-tree is more expensive to encode. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan >Priority: Critical > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v11.patch, > HBASE-4676-prefix-tree-trunk-v12.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's q
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13569707#comment-13569707 ] Ted Yu commented on HBASE-4676: --- In TestRowEncoder: {code} +searcher = new PrefixTreeArraySearcher(blockMetaReader, blockMetaReader.getRowTreeDepth(), {code} Should we use the following to obtain searcher ? {code} + searcher = DecoderFactory.checkOut(block, true); {code} > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan >Priority: Critical > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v11.patch, > HBASE-4676-prefix-tree-trunk-v12.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be slower t
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13569703#comment-13569703 ] stack commented on HBASE-4676: -- Why remove the TODO? {code} -//TODO compare byte[]'s in reverse since later bytes more likely to differ -return 0 == compareStatic(a, b); +return equalsRow(a, b) +&& equalsFamily(a, b) +&& equalsQualifier(a, b) +&& equalsTimestamp(a, b) +&& equalsType(a, b); } {code} Seems to make sense. How hard to use pb varints in stead of your UVInt? You go via the tool so maybe you could do it therein? You'd have to up your block version I suppose and be able to read the old form? Why CellScanner in +package org.apache.hbase.codec.prefixtree.scanner? Why not use what is over in hbase-common? (Oh, its not there yet...) And CS has this + void resetToBeforeFirstEntry(); Did we say strike this? When I ran tests, I got these failures: Results : Tests in error: org.apache.hadoop.hbase.mapreduce.TestMultithreadedTableMapper: Waiting for startup of standalone server testStopDuringStart(org.apache.hadoop.hbase.master.TestMasterNoCluster): test timed out after 3 milliseconds org.apache.hadoop.hbase.TestZooKeeper: Shutting down org.apache.hadoop.hbase.thrift.TestThriftServer: Shutting down testMultiClusters(org.apache.hadoop.hbase.TestHBaseTestingUtility): org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.access$6(Lorg/apache/hadoop/hbase/client/HConnectionManager$HConnectionImplementation;Lorg/apache/hadoop/hbase/client/HConnectionManager$HConnectionImplementation$MasterProtocolState;)V testLocalHBaseCluster(org.apache.hadoop.hbase.TestLocalHBaseCluster): Shutting down testHDFSRegioninfoMissing(org.apache.hadoop.hbase.util.TestHBaseFsck): org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/root-region-server testRebalanceOnRegionServerNumberChange[0](org.apache.hadoop.hbase.TestRegionRebalancing): Shutting down testRebalanceOnRegionServerNumberChange[1](org.apache.hadoop.hbase.TestRegionRebalancing): Shutting down Tests run: 1193, Failures: 0, Errors: 9, Skipped: 13 They pass for you? If I wanted to pass a prefixtree block of Cells on rpc, I see that I could write the block with a CellOutputStream/PrefixTreeDecoder but I'm not sure I get how I decode. I don't see a PrefixTreeDecoder (that implements CellSearcher). I suppose I should aim for CellSearcher being the minimal Interface rpc block readers should implement? It is pretty heavyweight though compared to CellScanner at least. Would CellScanner be too low down? If an encoding doesn't support CellSearcher, it should not implement or throw unsupportedoperation when you try to do a previous, etc? Ok to expect a null arg constructor on decoders or will I need to pass in context? (or all context they need is in the block?) You think we should commit as is? It still needs a bunch of work but I'd be up for getting it in as is and having new issues to do fix up. Can I select a prefixtrie block after this patch is committed and have hfiles read and written w/ prefixtree? > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan >Priority: Critical > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v11.patch, > HBASE-4676-prefix-tree-trunk-v12.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size,
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13544230#comment-13544230 ] Ted Yu commented on HBASE-4676: --- checkLineLengths () in dev-support/test-patch.sh doesn't exclude pom.xml +1 on patch v13. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v11.patch, > HBASE-4676-prefix-tree-trunk-v12.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be slower than delta encoding, > but with a much higher decode speed. I have not yet created a thorough > benchmark for write speed nor sequential read speed. > Though the trie is reaching a point w
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13544223#comment-13544223 ] Hadoop QA commented on HBASE-4676: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12563341/HBASE-4676-prefix-tree-trunk-v13.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 105 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. The patch introduces lines longer than 100 {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.replication.TestReplication {color:red}-1 core zombie tests{color}. There are 6 zombie test(s): Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3861//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3861//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3861//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3861//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3861//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3861//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3861//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3861//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3861//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3861//console This message is automatically generated. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v11.patch, > HBASE-4676-prefix-tree-trunk-v12.patch, > HBASE-4676-prefix-tree-trunk-v13.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infe
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543932#comment-13543932 ] Ted Yu commented on HBASE-4676: --- @Matt: You may want to take care of the long lines in patch v12. Thanks > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v11.patch, > HBASE-4676-prefix-tree-trunk-v12.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be slower than delta encoding, > but with a much higher decode speed. I have not yet created a thorough > benchmark for write speed nor sequential read speed. > Though the trie is reaching a point where it is internally very efficient > (probably within half
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543635#comment-13543635 ] Hadoop QA commented on HBASE-4676: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12563241/HBASE-4676-prefix-tree-trunk-v12.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 105 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 lineLengths{color}. The patch introduces lines longer than 100 {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3849//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3849//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3849//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3849//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3849//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3849//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3849//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3849//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3849//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3849//console This message is automatically generated. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v11.patch, > HBASE-4676-prefix-tree-trunk-v12.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely elimi
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13539102#comment-13539102 ] Hadoop QA commented on HBASE-4676: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12562286/HBASE-4676-prefix-tree-trunk-v11.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 105 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 2 warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 findbugs{color}. The patch appears to introduce 14 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.client.TestMultiParallel org.apache.hadoop.hbase.client.TestFromClientSide org.apache.hadoop.hbase.replication.TestReplicationWithCompression Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3680//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3680//console This message is automatically generated. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v11.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts t
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13539084#comment-13539084 ] Matt Corgan commented on HBASE-4676: {quote}Since CopyKeyDataBlockEncoder isn't included in the patch, I would suggest removing the above enum.{quote}CopyKeyDataBlockEncoder already exists in the code (not added by me). This comment just lets developers know about it. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v11.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be slower than delta encoding, > but with a much higher decode speed. I have not yet created a thorough > benchmark for write spee
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13539021#comment-13539021 ] Ted Yu commented on HBASE-4676: --- {code} + //COPY_KEY is for benchmarking only + // COPY_KEY(5, createEncoder("org.apache.hadoop.hbase.io.encoding.CopyKeyDataBlockEncoder")), {code} Since CopyKeyDataBlockEncoder isn't included in the patch, I would suggest removing the above enum. There were some javadoc warnings spotted by Hadoop QA. You can use the following command to find the warnings: mvn clean package javadoc:javadoc -DskipTests -DHBasePatchProcess > patchJavadocWarnings.txt {code} [WARNING] Javadoc Warnings [WARNING] /Users/tyu/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/scanner/ReversibleCellScanner.java:42: warning - Tag @link: reference not found: CellScannerPosition.BEFORE_FIRST [WARNING] /Users/tyu/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/scanner/ReversibleCellScanner.java:52: warning - Tag @link: reference not found: CellScannerPosition.BEFORE_FIRST [WARNING] /Users/tyu/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/PrefixTreeArrayScanner.java:207: warning - @return tag has no arguments. [WARNING] /Users/tyu/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/PrefixTreeArrayScanner.java:303: warning - @param argument "forwards:" is not a parameter name. [WARNING] /Users/tyu/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/scanner/CellSearcher.java:77: warning - Tag @link: can't find KeyValueScanner.reseek() in org.apache.hbase.codec.prefixtree.scanner.CellSearcher [WARNING] /Users/tyu/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/column/ColumnNodeWriter.java:85: warning - @return tag has no arguments. [WARNING] /Users/tyu/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/PrefixTreeCodec.java:56: warning - @ is an unknown tag. [WARNING] /Users/tyu/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/PrefixTreeCodec.java:56: warning - @ is an unknown tag. [WARNING] /Users/tyu/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/scanner/ReversibleCellScanner.java:42: warning - Tag @link: reference not found: CellScannerPosition.BEFORE_FIRST [WARNING] /Users/tyu/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/scanner/ReversibleCellScanner.java:52: warning - Tag @link: reference not found: CellScannerPosition.BEFORE_FIRST [WARNING] /Users/tyu/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/util/byterange/ByteRangeSet.java:38: warning - Tag @link: reference not found: ByteRangeHashSet [WARNING] /Users/tyu/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/util/byterange/ByteRangeSet.java:38: warning - Tag @link: reference not found: ByteRangeTreeSet [INFO] {code} The above are warnings generated as described. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13538991#comment-13538991 ] Hadoop QA commented on HBASE-4676: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12562262/HBASE-4676-prefix-tree-trunk-v10.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 105 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 14 warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 findbugs{color}. The patch appears to introduce 14 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.replication.TestReplication {color:red}-1 core zombie tests{color}. There are zombie tests. See build logs for details. Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3677//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3677//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3677//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3677//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3677//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3677//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3677//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3677//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3677//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3677//console This message is automatically generated. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, > HBASE-4676-prefix-tree-trunk-v10.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, > HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding compl
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13537713#comment-13537713 ] Hadoop QA commented on HBASE-4676: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12562043/HBASE-4676-prefix-tree-trunk-v9.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 105 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:green}+1 javadoc{color}. The javadoc tool did not generate any warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 findbugs{color}. The patch appears to introduce 62 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . {color:red}-1 core zombie tests{color}. There are zombie tests. See build logs for details. Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3648//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3648//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3648//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3648//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3648//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3648//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3648//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3648//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3648//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3648//console This message is automatically generated. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, HBASE-4676-prefix-tree-trunk-v1.patch, > HBASE-4676-prefix-tree-trunk-v2.patch, HBASE-4676-prefix-tree-trunk-v3.patch, > HBASE-4676-prefix-tree-trunk-v4.patch, HBASE-4676-prefix-tree-trunk-v5.patch, > HBASE-4676-prefix-tree-trunk-v6.patch, HBASE-4676-prefix-tree-trunk-v7.patch, > HBASE-4676-prefix-tree-trunk-v8.patch, HBASE-4676-prefix-tree-trunk-v9.patch, > hbase-prefix-trie-0.1.jar, PrefixTrie_Format_v1.pdf, > PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of s
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13535541#comment-13535541 ] Ted Yu commented on HBASE-4676: --- {code} +++ b/hbase-prefix-tree/src/test/java/org/apache/hbase/codec/prefixtree/timestamp/data/TestTimestamps1.java +++ b/hbase-prefix-tree/src/test/java/org/apache/hbase/codec/prefixtree/timestamp/data/TestTimestamps2.java +++ b/hbase-prefix-tree/src/test/java/org/apache/hbase/codec/prefixtree/timestamp/data/TestTimestamps3.java +++ b/hbase-prefix-tree/src/test/java/org/apache/hbase/codec/prefixtree/builder/data/TestSortedByteArrays1.java {code} Can the class names be more informative about what they test ? For TimestampCompressorTests.java, KeyValueToolTests.java, PrefixTreeBlockMetaTests.java, etc, normally their class names should start with Test. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, HBASE-4676-prefix-tree-trunk-v1.patch, > HBASE-4676-prefix-tree-trunk-v2.patch, HBASE-4676-prefix-tree-trunk-v3.patch, > HBASE-4676-prefix-tree-trunk-v4.patch, HBASE-4676-prefix-tree-trunk-v5.patch, > HBASE-4676-prefix-tree-trunk-v6.patch, HBASE-4676-prefix-tree-trunk-v7.patch, > HBASE-4676-prefix-tree-trunk-v8.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimizat
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13530796#comment-13530796 ] Hadoop QA commented on HBASE-4676: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12560147/HBASE-4676-prefix-tree-trunk-v8.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 129 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 116 warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 findbugs{color}. The patch appears to introduce 62 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3525//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3525//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3525//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3525//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3525//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3525//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3525//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3525//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3525//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3525//console This message is automatically generated. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, HBASE-4676-prefix-tree-trunk-v1.patch, > HBASE-4676-prefix-tree-trunk-v2.patch, HBASE-4676-prefix-tree-trunk-v3.patch, > HBASE-4676-prefix-tree-trunk-v4.patch, HBASE-4676-prefix-tree-trunk-v5.patch, > HBASE-4676-prefix-tree-trunk-v6.patch, HBASE-4676-prefix-tree-trunk-v7.patch, > HBASE-4676-prefix-tree-trunk-v8.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multi
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13528175#comment-13528175 ] stack commented on HBASE-4676: -- Took a quick look through v8. Need to come back. You probably don't mean this to be included: # The java implementation to use. Java 1.6 required. -# export JAVA_HOME=/usr/java/jdk1.6.0/ +export JAVA_HOME=/hpdev/jdk6 ... or the hbase-site.xml changes. Can leave out on commit. Hmmm... should move the below up into util: hbase-common/src/main/java/org/apache/hadoop/hbase/util/test/RedundantKVGenerator.java Puttting it in a subpackage called 'test' would seem to say it should be over under src/test. Nit. Do we have to add an IOE on write too in CellOutputStream? Prefix-tree depends on hadoop? Should content of ByteUtils just go into utils/Bytes? Ditto content of StringByteUtils. I need to help you get this in. Perhaps we should break out the stuff that belongs in hbase-common? Make this patch smaller? > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, HBASE-4676-prefix-tree-trunk-v1.patch, > HBASE-4676-prefix-tree-trunk-v2.patch, HBASE-4676-prefix-tree-trunk-v3.patch, > HBASE-4676-prefix-tree-trunk-v4.patch, HBASE-4676-prefix-tree-trunk-v5.patch, > HBASE-4676-prefix-tree-trunk-v6.patch, HBASE-4676-prefix-tree-trunk-v7.patch, > HBASE-4676-prefix-tree-trunk-v8.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527742#comment-13527742 ] Hadoop QA commented on HBASE-4676: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12560147/HBASE-4676-prefix-tree-trunk-v8.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 129 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 113 warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 findbugs{color}. The patch appears to introduce 62 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.coprocessor.TestRegionServerCoprocessorExceptionWithAbort Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3470//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3470//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3470//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3470//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3470//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3470//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3470//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3470//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3470//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3470//console This message is automatically generated. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, HBASE-4676-prefix-tree-trunk-v1.patch, > HBASE-4676-prefix-tree-trunk-v2.patch, HBASE-4676-prefix-tree-trunk-v3.patch, > HBASE-4676-prefix-tree-trunk-v4.patch, HBASE-4676-prefix-tree-trunk-v5.patch, > HBASE-4676-prefix-tree-trunk-v6.patch, HBASE-4676-prefix-tree-trunk-v7.patch, > HBASE-4676-prefix-tree-trunk-v8.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column famil
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13497577#comment-13497577 ] Hadoop QA commented on HBASE-4676: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12553580/HBASE-4676-common-and-server-v8.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 24 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 99 warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 findbugs{color}. The patch appears to introduce 27 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.TestCompare org.apache.hadoop.hbase.io.hfile.TestHFileReaderV1 org.apache.hadoop.hbase.io.hfile.TestHFileBlockCompatibility org.apache.hadoop.hbase.io.hfile.TestLruBlockCache org.apache.hadoop.hbase.regionserver.TestStoreFile org.apache.hadoop.hbase.TestHTableDescriptor org.apache.hadoop.hbase.io.TestHeapSize org.apache.hadoop.hbase.TestHColumnDescriptor org.apache.hadoop.hbase.master.TestCatalogJanitor org.apache.hadoop.hbase.io.hfile.TestHFileDataBlockEncoder org.apache.hadoop.hbase.regionserver.TestColumnSeeking org.apache.hadoop.hbase.regionserver.TestRSStatusServlet org.apache.hadoop.hbase.regionserver.TestScanner org.apache.hadoop.hbase.regionserver.TestSplitTransaction org.apache.hadoop.hbase.regionserver.TestHBase7051 org.apache.hadoop.hbase.coprocessor.TestCoprocessorInterface org.apache.hadoop.hbase.TestFSTableDescriptorForceCreation org.apache.hadoop.hbase.io.hfile.TestHFileInlineToRootChunkConversion org.apache.hadoop.hbase.regionserver.TestBlocksScanned org.apache.hadoop.hbase.regionserver.TestKeepDeletes org.apache.hadoop.hbase.filter.TestDependentColumnFilter org.apache.hadoop.hbase.regionserver.TestResettingCounters org.apache.hadoop.hbase.TestSerialization org.apache.hadoop.hbase.coprocessor.TestRegionObserverStacking org.apache.hadoop.hbase.io.TestHalfStoreFileReader org.apache.hadoop.hbase.io.hfile.TestSeekTo org.apache.hadoop.hbase.regionserver.TestScanWithBloomError org.apache.hadoop.hbase.regionserver.wal.TestWALActionsListener org.apache.hadoop.hbase.regionserver.TestRegionSplitPolicy org.apache.hadoop.hbase.io.hfile.TestCachedBlockQueue org.apache.hadoop.hbase.regionserver.TestHRegionInfo org.apache.hadoop.hbase.regionserver.TestCompactSelection org.apache.hadoop.hbase.constraint.TestConstraints org.apache.hadoop.hbase.filter.TestColumnPrefixFilter org.apache.hadoop.hbase.io.hfile.TestHFile org.apache.hadoop.hbase.filter.TestMultipleColumnPrefixFilter org.apache.hadoop.hbase.regionserver.TestMinVersions org.apache.hadoop.hbase.rest.model.TestTableRegionModel org.apache.hadoop.hbase.regionserver.TestWideScanner org.apache.hadoop.hbase.client.TestIntraRowPagination org.apache.hadoop.hbase.io.hfile.TestReseekTo org.apache.hadoop.hbase.filter.TestFilter Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3341//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3341//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3341//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3341//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3341//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apac
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13497561#comment-13497561 ] stack commented on HBASE-4676: -- I'm up for committing the hbase-common and hbase-server changes that are up in RB. Attach the patch here or rather to HBASE-7162, an issue just for these changes. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-common-and-server-v8.patch, HBASE-4676-prefix-tree-trunk-v1.patch, > HBASE-4676-prefix-tree-trunk-v2.patch, HBASE-4676-prefix-tree-trunk-v3.patch, > HBASE-4676-prefix-tree-trunk-v4.patch, HBASE-4676-prefix-tree-trunk-v5.patch, > HBASE-4676-prefix-tree-trunk-v6.patch, HBASE-4676-prefix-tree-trunk-v7.patch, > hbase-prefix-trie-0.1.jar, PrefixTrie_Format_v1.pdf, > PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be slower than delta encoding, > but with a much higher decode speed. I have not yet created a thorough > benchmark for write speed nor sequential read speed. > Though the trie is reaching a point where it is internally very efficient > (probably within half or a quarter of its max read speed) the way that hbase > currently uses it is far from optimal. The KeyValueScanner and
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496917#comment-13496917 ] stack commented on HBASE-4676: -- Looking in v7 patch attached here: Should this be in the patch Matt? b/bin/index-builder-setup.rb Did you mean to include the hbase prefix tree module in this hbase-common patch? i.e. b/hbase-prefix-tree/pom.xml? I've not reviewed the prefixtree stuff. Lets get the hbase-common changes in first (including the removes of classes moved from hbase-server to hbase-common?) Or, if you think that v7 has all addressed up on RB, then I could commit only the bits of the v7 patch that touch hbase-common and hbase-server? > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be slower than delta encoding, > but with a much higher
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496865#comment-13496865 ] Hadoop QA commented on HBASE-4676: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12553437/HBASE-4676-prefix-tree-trunk-v7.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 142 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 109 warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 findbugs{color}. The patch appears to introduce 59 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.coprocessor.TestRegionServerCoprocessorExceptionWithAbort Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3332//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3332//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3332//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3332//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3332//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3332//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3332//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3332//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3332//console This message is automatically generated. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > HBASE-4676-prefix-tree-trunk-v7.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wid
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494579#comment-13494579 ] Hadoop QA commented on HBASE-4676: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12552931/HBASE-4676-prefix-tree-trunk-v6.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 142 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 103 warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 findbugs{color}. The patch appears to introduce 60 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3299//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3299//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3299//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3299//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3299//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3299//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3299//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3299//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3299//console This message is automatically generated. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, > hbase-prefix-trie-0.1.jar, PrefixTrie_Format_v1.pdf, > PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493840#comment-13493840 ] Hadoop QA commented on HBASE-4676: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12552792/HBASE-4676-prefix-tree-trunk-v5.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 142 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 103 warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 findbugs{color}. The patch appears to introduce 58 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.io.hfile.TestForceCacheImportantBlocks org.apache.hadoop.hbase.TestDrainingServer Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3286//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3286//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3286//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3286//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3286//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3286//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3286//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3286//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3286//console This message is automatically generated. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > HBASE-4676-prefix-tree-trunk-v5.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between ro
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493779#comment-13493779 ] Hadoop QA commented on HBASE-4676: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12552778/HBASE-4676-prefix-tree-trunk-v4.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 142 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 103 warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 findbugs{color}. The patch appears to introduce 58 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.io.hfile.TestHFileDataBlockEncoder Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3285//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3285//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3285//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3285//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3285//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3285//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3285//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3285//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3285//console This message is automatically generated. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, > hbase-prefix-trie-0.1.jar, PrefixTrie_Format_v1.pdf, > PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-i
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485872#comment-13485872 ] Matt Corgan commented on HBASE-4676: for the record, this is now on branch prefix-tree2 on github: {code} git clone -b prefix-tree2 https://hotp...@github.com/hotpads/hbase.git hbase-prefix-tree {code} > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be slower than delta encoding, > but with a much higher decode speed. I have not yet created a thorough > benchmark for write speed nor sequential read speed. > Though the trie is reaching a point where it is internally very efficient > (probably within half or a quarter of its max read speed) the way that hbase > currently uses it is far from optimal. The KeyValueScanner and related > classes that iterate through the trie will eventually need to be smarter and > have methods to do things like skipping to the next row of results without > scanning every cell
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485834#comment-13485834 ] Hadoop QA commented on HBASE-4676: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12551154/HBASE-4676-prefix-tree-trunk-v3.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 143 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 102 warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 findbugs{color}. The patch appears to introduce 49 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:green}+1 core tests{color}. The patch passed unit tests in . Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3170//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3170//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3170//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3170//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3170//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3170//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3170//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3170//console This message is automatically generated. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > HBASE-4676-prefix-tree-trunk-v3.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for >
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483864#comment-13483864 ] Ted Yu commented on HBASE-4676: --- I tried to run test suite on the following platform and got the exact same test failure: {code} Linux s0 2.6.38-11-generic #48-Ubuntu SMP Fri Jul 29 19:02:55 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux {code} > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > hbase-prefix-trie-0.1.jar, PrefixTrie_Format_v1.pdf, > PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be slower than delta encoding, > but with a much higher decode speed. I have not yet created a thorough > benchmark for write speed nor sequential read speed. > Though the trie is reaching a point where it is internally very efficient > (probably within half or a quarter of its max read speed) the way that hbase > currently uses it is far from optimal. The KeyValueScanner and related > classes that iterate through the trie will eventually need to be smarter and > have methods to do things like skipping to the next row of results without > scanning every cell in between
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483831#comment-13483831 ] Ted Yu commented on HBASE-4676: --- I ran TestHFileDataBlockEncoder on MacBook 29 times, on Linux a few times. The test didn't fail. Looking at the stack provided by Hadoop QA, looks like the NPE was caused by this code in TimestampEncoder.java: {code} // should always find an exact match return Arrays.binarySearch(sortedUniqueTimestamps, timestamp); {code} Meaning sortedUniqueTimestamps was null. Can you guard against this case ? Thanks > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > hbase-prefix-trie-0.1.jar, PrefixTrie_Format_v1.pdf, > PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be slower than delta encoding, > but with a much higher decode speed. I have not yet created a thorough > benchmark for write speed nor sequential read speed. > Though the trie is reaching a point where it is internally very efficient > (probably within half or a quarter of its max read speed) the way that hbase > currently uses it is far from optimal. The KeyValueSc
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483806#comment-13483806 ] Matt Corgan commented on HBASE-4676: Does anyone know why TestHFileDataBlockEncoder would fail on jenkins when it passes 100% of the time locally for me? I believe the part that's failing has something to do with timestamps. Does jenkins do something different with the clock? > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > hbase-prefix-trie-0.1.jar, PrefixTrie_Format_v1.pdf, > PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be slower than delta encoding, > but with a much higher decode speed. I have not yet created a thorough > benchmark for write speed nor sequential read speed. > Though the trie is reaching a point where it is internally very efficient > (probably within half or a quarter of its max read speed) the way that hbase > currently uses it is far from optimal. The KeyValueScanner and related > classes that iterate through the trie will eventually need to be smarter and > have methods to do things like skipping to the next row of r
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483041#comment-13483041 ] Hadoop QA commented on HBASE-4676: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12550594/HBASE-4676-prefix-tree-trunk-v2.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 143 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 99 warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 findbugs{color}. The patch appears to introduce 49 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.io.hfile.TestHFileDataBlockEncoder Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3134//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3134//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3134//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3134//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3134//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3134//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3134//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3134//console This message is automatically generated. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, > hbase-prefix-trie-0.1.jar, PrefixTrie_Format_v1.pdf, > PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadat
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476011#comment-13476011 ] Matt Corgan commented on HBASE-4676: hmm. Didn't think i changed anything since those tests passed. Will look into it. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be slower than delta encoding, > but with a much higher decode speed. I have not yet created a thorough > benchmark for write speed nor sequential read speed. > Though the trie is reaching a point where it is internally very efficient > (probably within half or a quarter of its max read speed) the way that hbase > currently uses it is far from optimal. The KeyValueScanner and related > classes that iterate through the trie will eventually need to be smarter and > have methods to do things like skipping to the next row of results without > scanning every cell in between. When that is accomplished it will also allow > much faster compactions because the full row key will not have to be compared > as often as it is n
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13475994#comment-13475994 ] Hadoop QA commented on HBASE-4676: -- {color:red}-1 overall{color}. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12549106/HBASE-4676-prefix-tree-trunk-v1.patch against trunk revision . {color:green}+1 @author{color}. The patch does not contain any @author tags. {color:green}+1 tests included{color}. The patch appears to include 139 new or modified tests. {color:green}+1 hadoop2.0{color}. The patch compiles against the hadoop 2.0 profile. {color:red}-1 javadoc{color}. The javadoc tool appears to have generated 99 warning messages. {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings. {color:red}-1 findbugs{color}. The patch appears to introduce 53 new Findbugs (version 1.3.9) warnings. {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings. {color:red}-1 core tests{color}. The patch failed these unit tests: org.apache.hadoop.hbase.io.encoding.TestChangingEncoding org.apache.hadoop.hbase.regionserver.TestStore org.apache.hadoop.hbase.io.encoding.TestDataBlockEncoders Test results: https://builds.apache.org/job/PreCommit-HBASE-Build/3052//testReport/ Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3052//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3052//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3052//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3052//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3052//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html Findbugs warnings: https://builds.apache.org/job/PreCommit-HBASE-Build/3052//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/3052//console This message is automatically generated. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.96.0 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, > HBASE-4676-prefix-tree-trunk-v1.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta dete
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13475934#comment-13475934 ] Matt Corgan commented on HBASE-4676: I have all known bugs worked out. All tests are passing including those internal to the hbase-prefix-tree module, and existing DataBlockEncoding tests in hbase-server, such as TestEncodedSeekers. Split into 3 patches for reviewboard: https://reviews.apache.org/r/7589/ https://reviews.apache.org/r/7591/ https://reviews.apache.org/r/7592/ The prefix-tree branch on my github repo contains the latest from trunk as of this comment. http://github.com/hotpads/hbase/tree/prefix-tree {code} git clone -b prefix-tree https://hotp...@github.com/hotpads/hbase.git hbase-prefix-tree {code} > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.90.6 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be slower than delta encoding, > but with a much higher decode speed. I have not yet created a thorough > benchmark for write speed nor sequential read speed. > Though the trie is reaching a point where it is internally very efficient > (probably within half or a qua
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466624#comment-13466624 ] Ted Yu commented on HBASE-4676: --- [~mikhail]: Can you provide your insight on this feature please ? > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, Performance, regionserver >Affects Versions: 0.90.6 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be slower than delta encoding, > but with a much higher decode speed. I have not yet created a thorough > benchmark for write speed nor sequential read speed. > Though the trie is reaching a point where it is internally very efficient > (probably within half or a quarter of its max read speed) the way that hbase > currently uses it is far from optimal. The KeyValueScanner and related > classes that iterate through the trie will eventually need to be smarter and > have methods to do things like skipping to the next row of results without > scanning every cell in between. When that is accomplished it will also allow > much faster compactions because the full row key will not have to be compared > as often as it is now. > Current code is on github. The trie code is in a separate projec
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457131#comment-13457131 ] Ted Yu commented on HBASE-4676: --- @Matt: Do you want to attach your patch(es) here for Hadoop QA to run test suite ? Thanks > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, performance, regionserver >Affects Versions: 0.90.6 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be slower than delta encoding, > but with a much higher decode speed. I have not yet created a thorough > benchmark for write speed nor sequential read speed. > Though the trie is reaching a point where it is internally very efficient > (probably within half or a quarter of its max read speed) the way that hbase > currently uses it is far from optimal. The KeyValueScanner and related > classes that iterate through the trie will eventually need to be smarter and > have methods to do things like skipping to the next row of results without > scanning every cell in between. When that is accomplished it will also allow > much faster compactions because the full row key will not have to be compared > as often as it is now. > Current code is on github. The trie cod
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447427#comment-13447427 ] Matt Corgan commented on HBASE-4676: ReviewBoard links: hbase-common: https://reviews.apache.org/r/6897/ hbase-prefix-tree: https://reviews.apache.org/r/6898/ excludes tests for now > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, performance, regionserver >Affects Versions: 0.90.6 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be slower than delta encoding, > but with a much higher decode speed. I have not yet created a thorough > benchmark for write speed nor sequential read speed. > Though the trie is reaching a point where it is internally very efficient > (probably within half or a quarter of its max read speed) the way that hbase > currently uses it is far from optimal. The KeyValueScanner and related > classes that iterate through the trie will eventually need to be smarter and > have methods to do things like skipping to the next row of results without > scanning every cell in between. When that is accomplished it will also allow > much faster compactions because the full row key will not have to be compared >
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439090#comment-13439090 ] Matt Corgan commented on HBASE-4676: Barely been able to touch it this whole summer, but I spent a little time in july rebasing to trunk and reorganizing for readability. It's on github on a branch called "prefix-tree" (i renamed from trie since people wince when pronouncing it). https://github.com/hotpads/hbase/tree/prefix-tree What's up there is pretty good and workable, but i still have some things to do and some questions to ask for things like comparators, sequenceIds, ROOT and META handling. Also needs more documentation. I could use some high level feedback if you can get anything out of it without a lot of docs/comments. Maybe look at the things in the hbase-common module to start since that will affect overall hbase? The prefix-trie module is more implementation details that only matter when you enable this specific encoding. btw - current version has an "HCell" interface which i will rename to Cell(?) after seeing the recent discussion about HStore naming. Happy to post on reviewboard as well, but might have to wait till the weekend. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, performance, regionserver >Affects Versions: 0.90.6 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the sam
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439037#comment-13439037 ] stack commented on HBASE-4676: -- You need review on this Matt? > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, performance, regionserver >Affects Versions: 0.90.6 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, hbase-prefix-trie-0.1.jar, > PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by > blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be slower than delta encoding, > but with a much higher decode speed. I have not yet created a thorough > benchmark for write speed nor sequential read speed. > Though the trie is reaching a point where it is internally very efficient > (probably within half or a quarter of its max read speed) the way that hbase > currently uses it is far from optimal. The KeyValueScanner and related > classes that iterate through the trie will eventually need to be smarter and > have methods to do things like skipping to the next row of results without > scanning every cell in between. When that is accomplished it will also allow > much faster compactions because the full row key will not have to be compared > as often as it is now. > Current code is on github. The trie code is in a separate project than the > slightly modified hbase.
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287829#comment-13287829 ] Matt Corgan commented on HBASE-4676: A little more detail... it would help where you have long qualifiers but only a few of them, like if you migrated a narrow relational db table over. If you have wide rows with long qualifiers, then you would want to take advantage of the qualifier trie. Not sure which case you have, but migrating relational style tables over will be pretty common so i wanted to handle that common case well. If you get a chance to do a random read benchmark on it I'd love to hear the results. I've only done a few small benchmarks at the Store level and haven't benchmarked a whole cluster. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, performance, regionserver >Affects Versions: 0.90.6 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, PrefixTrie_Format_v1.pdf, > PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png, > hbase-prefix-trie-0.1.jar > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be slower than delta encoding, > but with a much higher decode speed. I have not yet created a thorough > benchmark for write speed nor sequential read speed. > Though the trie is reaching a point where it is internally very efficient > (probably within half or a quart
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287766#comment-13287766 ] James Taylor commented on HBASE-4676: - Our qualifiers tend to be long, so trie-encoding them helps quite a bit. We could optimize this ourselves by managing our own mapping, but it's great that the trie-encoding does it for us. We haven't seen this impact our scan times. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, performance, regionserver >Affects Versions: 0.90.6 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, PrefixTrie_Format_v1.pdf, > PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png, > hbase-prefix-trie-0.1.jar > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be slower than delta encoding, > but with a much higher decode speed. I have not yet created a thorough > benchmark for write speed nor sequential read speed. > Though the trie is reaching a point where it is internally very efficient > (probably within half or a quarter of its max read speed) the way that hbase > currently uses it is far from optimal. The KeyValueScanner and related > classes that iterate through the trie will eventually need to be smarter and > have methods to do things like skipping to the next row of results without > scanning every cell in between. When that is accomplished it will also
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287637#comment-13287637 ] Matt Corgan commented on HBASE-4676: Great to hear James. I'm working on migrating it to trunk, but wrestling with git and maven, my two worst friends. The last change I've been thinking about making to it before finalizing this version is to add an option controlling whether to trie-encode the qualifiers vs merely de-duping them. There's some read-time expense to decoding the trie-encoded qualifiers, and if you have a narrow table then you may not be saving much memory anyway. So it would be an option to trade a little memory for faster scans/decoding. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, performance, regionserver >Affects Versions: 0.90.6 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, PrefixTrie_Format_v1.pdf, > PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png, > hbase-prefix-trie-0.1.jar > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be slower than delta encoding, > but with a much higher decode speed. I have not yet created a thorough > benchmark for write speed nor sequential read speed. > Though the trie is reaching a point where it is internally very efficient > (probably within half or a quarter of its max read speed) the way that hbase > currentl
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287587#comment-13287587 ] James Taylor commented on HBASE-4676: - This is fantastic, Matt. We're testing your patch on 0.94 in our dev cluster here at Salesforce and are seeing 5-15x compression with no degradation on scan performance. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, performance, regionserver >Affects Versions: 0.90.6 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, PrefixTrie_Format_v1.pdf, > PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png, > hbase-prefix-trie-0.1.jar > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be slower than delta encoding, > but with a much higher decode speed. I have not yet created a thorough > benchmark for write speed nor sequential read speed. > Though the trie is reaching a point where it is internally very efficient > (probably within half or a quarter of its max read speed) the way that hbase > currently uses it is far from optimal. The KeyValueScanner and related > classes that iterate through the trie will eventually need to be smarter and > have methods to do things like skipping to the next row of results without > scanning every cell in between. When that is accomplished it will also allow > much faster compactions because the full row key will
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285830#comment-13285830 ] Matt Corgan commented on HBASE-4676: Jacques - I like the idea , and I think the long-term goal is to auto-tune some of this stuff. For example, if flushes or compactions are backlogged, that maybe a good signal for certain tables to temporarily switch from more compact Trie/GZip encoding/compression to faster-encoding Prefix/Snappy. There are other factors at play though on a table-by-table basis. Some tables might be random-read heavy, in which case you would want to keep the Trie. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, performance, regionserver >Affects Versions: 0.90.6 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, PrefixTrie_Format_v1.pdf, > PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png, > hbase-prefix-trie-0.1.jar > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be slower than delta encoding, > but with a much higher decode speed. I have not yet created a thorough > benchmark for write speed nor sequential read speed. > Though the trie is reaching a point where it is internally very efficient > (probably within half or a quarter of its max read speed) the way that hbase > currently uses it is far from optimal. The KeyValueScanner and related > classe
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285793#comment-13285793 ] Jacques commented on HBASE-4676: Random thought on this. I was talking to Nicolas Spiegelberg and he was talking about how they are exploring changing encoding schemes when data becomes more permanent. Don't quote me on this but it sounded like they were considering using gzip compression on major compactions. I was wondering if something similar made sense here. Basically, use less compression the shorter the likely lifetime of files. Use the Trie compressions on only major compactions. This way the performance difference could be less of a serious issue since you're paying for it a fewer number of times. I glanced around and didn't see any Jiras about a two-tiered approach of data storage formats but it seems like that would be a prerequisite for a hybrid/tiered approach. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, performance, regionserver >Affects Versions: 0.90.6 >Reporter: Matt Corgan >Assignee: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, PrefixTrie_Format_v1.pdf, > PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png, > hbase-prefix-trie-0.1.jar > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be slower than delta encoding, > but with a much higher decode speed. I have not yet created a thor
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13251836#comment-13251836 ] Matt Corgan commented on HBASE-4676: the git url in above comment is wrong. reposting instructions with public https endpoint (looks like i can't edit jira comments) To use with Eclipse: * check out hbase-0.94 from apache subversion to "myworkspace" * apply the HBASE-4676-0.94-v1.patch from HBASE-4676 * cd myworkspace * git clone -b 0.1 https://hotp...@github.com/hotpads/hbase-prefix-trie.git hbase-prefix-trie-0.1 * in eclipse -> new java project * type "hbase-prefix-trie-0.1" and eclipse should find it. continue * on the "Projects" tab, add hbase-0.94. finish * ctrl-shift-R to open SeekBenchmarkMain.java * follow instructions in SeekBenchmarkMain To build a jar after modifying code: * edit package-hbase-prefix-trie.xml in the project root to point to your build directory * right click it -> Run As -> Ant Build > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, performance, regionserver >Affects Versions: 0.90.6 >Reporter: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, PrefixTrie_Format_v1.pdf, > PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png, > hbase-prefix-trie-0.1.jar > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be slower than delta encoding, > but with a much higher decode speed. I have not yet created a thoro
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13251375#comment-13251375 ] Matt Corgan commented on HBASE-4676: Sorry, the packaging is unusual for HBase but I think it's worth a shot at turning this into a module rather than dumping 200+ new .java files into hbase-core. As it stands, the dependencies are very limited. The hbase-prefix-trie project needs to see KeyValue.java and DataBlockEncoder.java (maybe a couple others, but not many). Then, hbase-core instantiates a PrefixTrieDataBlockEncoder through reflection at the bottom of DataBlockEncoding.java, so hbase-core has no compile-time dependency on hbase-prefix-trie. If HBASE-4336 makes headway, then we can extract some of the fundamental hbase classes (KeyValue, DataBlockEncoder) into hbase-common, and I can further limit the prefix-trie to only have a dependency on these fundamental interfaces in the lightweight hbase-common module. HBase-common will tie together the modules. Limiting the dependencies between modules is the key to keeping development agile, otherwise small changes will trigger massive refactorings that even the most seasoned devs won't be able to pull off safely. I added instructions to the git project homepage for checking out the prefix trie and modifying it or using it for profiling. Just a few extra steps beyond applying a patch: https://github.com/hotpads/hbase-prefix-trie. Pasting them here as well: To use with Eclipse: * check out hbase-0.94 from apache subversion to "myworkspace" * apply the HBASE-4676-0.94-v1.patch from HBASE-4676 * cd myworkspace * git clone -b 0.1 g...@github.com:hotpads/hbase-prefix-trie.git hbase-prefix-trie-0.1 * in eclipse -> new java project * type "hbase-prefix-trie-0.1" and eclipse should find it. continue * on the "Projects" tab, add hbase-0.94. finish * ctrl-shift-R to open SeekBenchmarkMain.java * follow instructions in SeekBenchmarkMain To build a jar after modifying code: * edit package-hbase-prefix-trie.xml in the project root to point to your build directory * right click it -> Run As -> Ant Build > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, performance, regionserver >Affects Versions: 0.90.6 >Reporter: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, PrefixTrie_Format_v1.pdf, > PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png, > hbase-prefix-trie-0.1.jar > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is repres
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13251272#comment-13251272 ] Zhihong Yu commented on HBASE-4676: --- Thanks for the update, Matt. PrefixTrieDataBlockEncoder is currently in the jar. If PrefixTrie related java classes are in the patch, that would make understanding your implementation easier. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, performance, regionserver >Affects Versions: 0.90.6 >Reporter: Matt Corgan > Attachments: HBASE-4676-0.94-v1.patch, PrefixTrie_Format_v1.pdf, > PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png, > hbase-prefix-trie-0.1.jar > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be slower than delta encoding, > but with a much higher decode speed. I have not yet created a thorough > benchmark for write speed nor sequential read speed. > Though the trie is reaching a point where it is internally very efficient > (probably within half or a quarter of its max read speed) the way that hbase > currently uses it is far from optimal. The KeyValueScanner and related > classes that iterate through the trie will eventually need to be smarter and > have methods to do things like skipping to the next row of results without > scanning every cell in between. When that is accomplished it will also allow > much faster compactions because the full row key will not have to be c
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238948#comment-13238948 ] Matt Corgan commented on HBASE-4676: {quote}Do we not have this now Matt?{quote} I don't believe so. Right now, HFileReaderV2.blockSeek(..) does a brute force loop through each KV in the block and does a full KV comparison at each KV. Say there are 1000 KVs in the block and 20/row, on average you will have to do 500 full KV comparisons just to get to your key, and even if you are selecting all 20 KVs in the row you still do 500 comparisons on average to get to the start of the row. It's also very important to remember that these are long keys and they are most likely to share a common prefix since they got sorted to the same block, so you're comparators are churning the same prefix bytes over and over. {quote}The slow write and scan speeds would tend to rule it out for anything but a random read workload but when random reading only{quote} Yeah, it's not really targeted at long scans on cold data, but there are some cases in between long cold scans and hot point queries. Some factors: If it's warm data your block cache is now 6x bigger. Then there are many scans that are really sub-block prefix scans to get all the keys in a row (the 20 of 1000 cells i mentioned above). If your scan will have a low hit ratio then the ability to jump between rows inside a block will lessen CPU usage. It performs best when doing lots of random Gets on hot data (in block cache). Possibly a persistent memcached alternative, especially with SSDs. I believe the current system is actually limited by CPU when doing high throughput Gets on data with long keys and short values. The trie's individual request latency may not be much different, but a single server could serve many more parallel requests before maxing out cpu. The bigger the value/key ratio of your data the smaller the difference between trie and anything else. Seems like many people now have big values so I doubt they would see a difference. I'm more trying to enable smarter schemas with compound primary keys. {quote}Regards your working set read testing, did it all fit in memory?{quote} yes, i'm trying to do the purest comparison of cpu usage possible. leaving it up to others to extrapolate the results to what happens with the effectively bigger block cache, more rows/block fetched from disk for equivalent size block, etc. i'm currently just standing up a StoreFile and using it directly. see https://github.com/hotpads/hbase-prefix-trie/blob/hcell-scanners/test/org/apache/hadoop/hbase/cell/pt/test/performance/seek/SeekBenchmarkMain.java. i'll try to factor in network and disk latencies in later tests (did some preliminary tests friday but am on vacation this week). {quote}How did you measure cycles per seek?{quote} simple assumption of 2 billion/second. was just trying to emphasize how many cycles a seek currently takes {quote}What is numBlocks? The total number of blocks the dataset fit in?{quote} number of data blocks in the HFile i fed in > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, performance, regionserver >Affects Versions: 0.90.6 >Reporter: Matt Corgan > Attachments: PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, > SeeksPerSec by blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block.
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238645#comment-13238645 ] stack commented on HBASE-4676: -- bq. The KeyValueScanner and related classes that iterate through the trie will eventually need to be smarter and have methods to do things like skipping to the next row of results without scanning every cell in between. Do we not have this now Matt? As said before, 'So 40-80x faster seeks at medium block size.', is pretty nice improvement. Ditto on 'Seek speed is ~15x NONE seek speed while expanding the effective block cache size by 4x.' What are you thinking these times Matt? The slow write and scan speeds would tend to rule it out for anything but a random read workload but when random reading only, you'll want to use prefixtrie (smile). Do you think it possible to approach prefix compression scan and write speeds? Regards your working set read testing, did it all fit in memory? How did you measure cycles per seek? What is numBlocks? The total number of blocks the dataset fit in? Throughput is better for sure.. what is latency like? Excellent stuff Matt. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io, performance, regionserver >Affects Versions: 0.90.6 >Reporter: Matt Corgan > Attachments: PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, > SeeksPerSec by blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141676#comment-13141676 ] Matt Corgan commented on HBASE-4676: More thorough benchmark shared here: https://docs.google.com/spreadsheet/ccc?key=0Ah5tKh7-sVXYdGkyWm9WenhSRzVfd2U5VC1XNF9tekE cell format: Key[int,int,string,string]Value[VInt] avg raw key bytes: 52 avg raw value bytes: 1 avg cells/row: ~10 The TRIE compressor gives ~6x compression when all the timestamps are the same. For this table we would set the raw block size at 256KB or 1MB which gives encoded sizes of 39KB or 178KB. PREFIX compressor is doing 2x compression on this test because I don't think it gets too involved with qualifiers or timestamps. Encoding MB/s are: NONE: 1000 PREFIX: 275 TRIE: 15 That is purely because of CPU because everything is in my linux page cache. TRIE encoding could be improved quite a bit, but will likely stay slower than the others. Although if the scanners/comparators could be improved down the road, then it may speed up compaction significantly. - Scanning MB/s are: NONE: 3700 PREFIX: 1800 TRIE: 800 Trie is slower for a scan through all cells because of more complex decoding. However, it could be much faster for things like row counting because it can quickly jump from one row to the next without iterating cells in between. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io >Affects Versions: 0.94.0 >Reporter: Matt Corgan > Attachments: SeeksPerSec by blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of op
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140704#comment-13140704 ] Matt Corgan commented on HBASE-4676: Write speed is slower because of cpu usage. There's still tons of room for optimization, but slower write speed will generally be the trade off for higher random read speed. The current encoding algorithms on github throw off a lot of garbage which tends to indicate that they're not using the cpu cache very well. Some data sets would be able to disable gzip/lzo compression and just go with prefix compression, so you save some cpu there. I'm working on a benchmark to compare a few more factors, including compression ratio and sequential read speed. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io >Affects Versions: 0.94.0 >Reporter: Matt Corgan > Attachments: SeeksPerSec by blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be slower than delta encoding, > but with a much higher decode speed. I have not yet created a thorough > benchmark for write speed nor sequential read speed. > Though the trie is reaching a point where it is internally very efficient > (probably within half or a quarter of its max read speed) the way that hbase > currently uses it is far from optimal. The KeyValueScanner and related > classes that iterate through the trie will eventually need to be smarter and >
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13139472#comment-13139472 ] stack commented on HBASE-4676: -- bq. and the TRIE encoding can do ~1.6mm. So 40-80x faster seeks at medium block size. Thats a pretty crazy different Matt. The write speed is slower because its burning cpu doing the 'compression', right? So TRIE takes up 1/4 size. Did you say above how much less a PREFIX compressed block is than a NONE compressed block? > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io >Affects Versions: 0.94.0 >Reporter: Matt Corgan > Attachments: SeeksPerSec by blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > need to be done to optimize the data structures used for encoding and could > probably show a 10x increase. It will still be slower than delta encoding, > but with a much higher decode speed. I have not yet created a thorough > benchmark for write speed nor sequential read speed. > Though the trie is reaching a point where it is internally very efficient > (probably within half or a quarter of its max read speed) the way that hbase > currently uses it is far from optimal. The KeyValueScanner and related > classes that iterate through the trie will eventually need to be smarter and > have methods to do things like skipping to the next row of results without > scanning every cell in between. When that is accomplished it will also allow > much faster compactions because the full row key will not have to be compared >
[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding
[ https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13135600#comment-13135600 ] Matt Corgan commented on HBASE-4676: The attached chart "SeeksPerSec by blockSize.png" is for a table with a 4b key and small value. It is a pretty extreme case that shows the strength of the trie in reading and its weakness in writing. Here is a comparison from a table we use internally for counters: {code} table name: Count5s raw KeyValue blockSize: 256KB encoding savings: ~4x encoded blockSize: ~64KB numReaderThreads: 1 pread: false numCells: 1339210 avgKeyBytes: 85 avgValueBytes: 9 raw data size: 120MB encoded data size: ~40MB (~5x L3 cache size) encoding,sequentialMB/s,seeks/s NONE,886.67,23604 PREFIX,404.3,15543 TRIE,25.63,340613 {code} You can see the trie only encodes at 25MB/s, but i think this could be more like 200MB/s with some work. Seek speed is ~15x NONE seek speed while expanding the effective block cache size by 4x. Another advantage is that the trie should be able to operate on DirectByteBuffers without copying them to the heap first. It would need a separate implementation of the PtArrayBlockSearcher called PtBufferBlockSearcher which would have the same code structure but replace all the bytes[i] (and similar) calls with buffer.get(i) calls. > Prefix Compression - Trie data block encoding > - > > Key: HBASE-4676 > URL: https://issues.apache.org/jira/browse/HBASE-4676 > Project: HBase > Issue Type: New Feature > Components: io >Affects Versions: 0.94.0 >Reporter: Matt Corgan > Attachments: SeeksPerSec by blockSize.png > > > The HBase data block format has room for 2 significant improvements for > applications that have high block cache hit ratios. > First, there is no prefix compression, and the current KeyValue format is > somewhat metadata heavy, so there can be tremendous memory bloat for many > common data layouts, specifically those with long keys and short values. > Second, there is no random access to KeyValues inside data blocks. This > means that every time you double the datablock size, average seek time (or > average cpu consumption) goes up by a factor of 2. The standard 64KB block > size is ~10x slower for random seeks than a 4KB block size, but block sizes > as small as 4KB cause problems elsewhere. Using block sizes of 256KB or 1MB > or more may be more efficient from a disk access and block-cache perspective > in many big-data applications, but doing so is infeasible from a random seek > perspective. > The PrefixTrie block encoding format attempts to solve both of these > problems. Some features: > * trie format for row key encoding completely eliminates duplicate row keys > and encodes similar row keys into a standard trie structure which also saves > a lot of space > * the column family is currently stored once at the beginning of each block. > this could easily be modified to allow multiple family names per block > * all qualifiers in the block are stored in their own trie format which > caters nicely to wide rows. duplicate qualifers between rows are eliminated. > the size of this trie determines the width of the block's qualifier > fixed-width-int > * the minimum timestamp is stored at the beginning of the block, and deltas > are calculated from that. the maximum delta determines the width of the > block's timestamp fixed-width-int > The block is structured with metadata at the beginning, then a section for > the row trie, then the column trie, then the timestamp deltas, and then then > all the values. Most work is done in the row trie, where every leaf node > (corresponding to a row) contains a list of offsets/references corresponding > to the cells in that row. Each cell is fixed-width to enable binary > searching and is represented by [1 byte operationType, X bytes qualifier > offset, X bytes timestamp delta offset]. > If all operation types are the same for a block, there will be zero per-cell > overhead. Same for timestamps. Same for qualifiers when i get a chance. > So, the compression aspect is very strong, but makes a few small sacrifices > on VarInt size to enable faster binary searches in trie fan-out nodes. > A more compressed but slower version might build on this by also applying > further (suffix, etc) compression on the trie nodes at the cost of slower > write speed. Even further compression could be obtained by using all VInts > instead of FInts with a sacrifice on random seek speed (though not huge). > One current drawback is the current write speed. While programmed with good > constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not > programmed with the same level of optimization as the read path. Work will > nee