[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2013-02-07 Thread Matteo Bertozzi (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573548#comment-13573548
 ] 

Matteo Bertozzi commented on HBASE-4676:


Is there a reason why the prefix-tree and cell code is inside
org.apache.hbase instead of the traditional org.apache.hadoop.hbase?

> Prefix Compression - Trie data block encoding
> -
>
> Key: HBASE-4676
> URL: https://issues.apache.org/jira/browse/HBASE-4676
> Project: HBase
>  Issue Type: New Feature
>  Components: io, Performance, regionserver
>Affects Versions: 0.96.0
>Reporter: Matt Corgan
>Assignee: Matt Corgan
>Priority: Critical
> Attachments: 4676-prefix-tree-trunk-v16.patch, 
> 4676-prefix-tree-trunk-v17.patch, HBASE-4676-0.94-v1.patch, 
> HBASE-4676-common-and-server-v8.patch, 
> HBASE-4676-prefix-tree-trunk-v10.patch, 
> HBASE-4676-prefix-tree-trunk-v11.patch, 
> HBASE-4676-prefix-tree-trunk-v12.patch, 
> HBASE-4676-prefix-tree-trunk-v13.patch, 
> HBASE-4676-prefix-tree-trunk-v13.patch, 
> HBASE-4676-prefix-tree-trunk-v14.patch, 
> HBASE-4676-prefix-tree-trunk-v15.patch, 
> HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, 
> HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, 
> HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, 
> HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, 
> HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, 
> PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by 
> blockSize.png
>
>
> The HBase data block format has room for 2 significant improvements for 
> applications that have high block cache hit ratios.  
> First, there is no prefix compression, and the current KeyValue format is 
> somewhat metadata heavy, so there can be tremendous memory bloat for many 
> common data layouts, specifically those with long keys and short values.
> Second, there is no random access to KeyValues inside data blocks.  This 
> means that every time you double the datablock size, average seek time (or 
> average cpu consumption) goes up by a factor of 2.  The standard 64KB block 
> size is ~10x slower for random seeks than a 4KB block size, but block sizes 
> as small as 4KB cause problems elsewhere.  Using block sizes of 256KB or 1MB 
> or more may be more efficient from a disk access and block-cache perspective 
> in many big-data applications, but doing so is infeasible from a random seek 
> perspective.
> The PrefixTrie block encoding format attempts to solve both of these 
> problems.  Some features:
> * trie format for row key encoding completely eliminates duplicate row keys 
> and encodes similar row keys into a standard trie structure which also saves 
> a lot of space
> * the column family is currently stored once at the beginning of each block.  
> this could easily be modified to allow multiple family names per block
> * all qualifiers in the block are stored in their own trie format which 
> caters nicely to wide rows.  duplicate qualifiers between rows are eliminated. 
>  the size of this trie determines the width of the block's qualifier 
> fixed-width-int
> * the minimum timestamp is stored at the beginning of the block, and deltas 
> are calculated from that.  the maximum delta determines the width of the 
> block's timestamp fixed-width-int
> The block is structured with metadata at the beginning, then a section for 
> the row trie, then the column trie, then the timestamp deltas, and then 
> all the values.  Most work is done in the row trie, where every leaf node 
> (corresponding to a row) contains a list of offsets/references corresponding 
> to the cells in that row.  Each cell is fixed-width to enable binary 
> searching and is represented by [1 byte operationType, X bytes qualifier 
> offset, X bytes timestamp delta offset].
> If all operation types are the same for a block, there will be zero per-cell 
> overhead.  Same for timestamps.  Same for qualifiers when I get a chance.  
> So, the compression aspect is very strong, but makes a few small sacrifices 
> on VarInt size to enable faster binary searches in trie fan-out nodes.
> A more compressed but slower version might build on this by also applying 
> further (suffix, etc) compression on the trie nodes at the cost of slower 
> write speed.  Even further compression could be obtained by using all VInts 
> instead of FInts with a sacrifice on random seek speed (though not huge).
> One current drawback is the current write speed.  While programmed with good 
> constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not 
> programmed with the same level of optimization as the read path.  Work will 
> need to be done to optimize the dat
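The row-trie savings described above come from never storing a shared prefix twice. A minimal sketch of that idea, using illustrative names only (this shows plain prefix sharing between sorted neighbors, not the actual PrefixTreeEncoder trie structure):

```java
import java.util.List;

/**
 * Illustrative sketch: for sorted row keys, each key shares a prefix
 * with its predecessor, so storing only the non-shared suffix plus a
 * prefix-length byte removes the duplication that a trie exploits
 * more fully via fan-out nodes.
 */
public class PrefixShareSketch {
    // Length of the common prefix of two strings.
    static int commonPrefixLen(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    public static void main(String[] args) {
        List<String> rowKeys = List.of("user123_a", "user123_b", "user124_a");
        String prev = "";
        int rawBytes = 0, encodedBytes = 0;
        for (String key : rowKeys) {
            int shared = commonPrefixLen(prev, key);
            rawBytes += key.length();
            // Store only the suffix plus a 1-byte shared-prefix length.
            encodedBytes += key.length() - shared + 1;
            prev = key;
        }
        System.out.println("raw=" + rawBytes + " encoded=" + encodedBytes);
    }
}
```

A full trie goes further than this pairwise scheme: similar keys anywhere in the block share structure through branch nodes, which is what eliminates duplicate row keys entirely.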
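The timestamp scheme above (store the block minimum once, then per-cell deltas whose byte width is fixed by the largest delta in the block) can be sketched as follows; the class and method names are hypothetical, chosen for illustration:

```java
import java.util.Arrays;

/**
 * Illustrative sketch of block-relative timestamp encoding: the block
 * stores its minimum timestamp once, and every cell stores only a
 * delta from it.  All deltas in a block use the same byte width,
 * determined by the maximum delta.
 */
public class TimestampDeltaSketch {
    // Smallest number of bytes that can hold the given unsigned value.
    static int bytesNeeded(long maxDelta) {
        int width = 1;
        while (maxDelta >>> (8 * width) != 0) {
            width++;
        }
        return width;
    }

    public static void main(String[] args) {
        long[] timestamps = {1360000000123L, 1360000000456L, 1360000001789L};
        long min = Arrays.stream(timestamps).min().getAsLong();
        long maxDelta = Arrays.stream(timestamps).map(t -> t - min).max().getAsLong();
        int width = bytesNeeded(maxDelta); // every delta in this block uses this width
        System.out.println("minTimestamp=" + min + " maxDelta=" + maxDelta
                + " widthBytes=" + width);
    }
}
```

With deltas this small, each cell spends 1–2 bytes on its timestamp instead of the 8 bytes a raw long would take, which is the per-cell overhead the description says can reach zero when all timestamps are equal.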

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2013-02-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573247#comment-13573247
 ] 

Hudson commented on HBASE-4676:
---

Integrated in HBase-TRUNK #3859 (See 
[https://builds.apache.org/job/HBase-TRUNK/3859/])
HBASE-4676 Prefix Compression - Trie data block encoding (Matt Corgan) 
(Revision 1443289)

 Result = ABORTED
tedyu : 
Files : 
* 
/hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncoding.java
* 
/hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/util/ByteRangeTool.java
* 
/hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/util/Bytes.java
* 
/hbase/trunk/hbase-common/src/main/java/org/apache/hadoop/hbase/util/test/RedundantKVGenerator.java
* 
/hbase/trunk/hbase-common/src/main/java/org/apache/hbase/cell/CellComparator.java
* 
/hbase/trunk/hbase-common/src/main/java/org/apache/hbase/cell/CellOutputStream.java
* /hbase/trunk/hbase-prefix-tree
* /hbase/trunk/hbase-prefix-tree/pom.xml
* /hbase/trunk/hbase-prefix-tree/src
* /hbase/trunk/hbase-prefix-tree/src/main
* /hbase/trunk/hbase-prefix-tree/src/main/java
* /hbase/trunk/hbase-prefix-tree/src/main/java/org
* /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache
* /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase
* /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec
* /hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/PrefixTreeBlockMeta.java
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/PrefixTreeCodec.java
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/PrefixTreeSeeker.java
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/ArraySearcherPool.java
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/DecoderFactory.java
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/PrefixTreeArrayReversibleScanner.java
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/PrefixTreeArrayScanner.java
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/PrefixTreeArraySearcher.java
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/PrefixTreeCell.java
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/column
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/column/ColumnNodeReader.java
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/column/ColumnReader.java
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/row
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/row/RowNodeReader.java
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/timestamp
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/timestamp/MvccVersionDecoder.java
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/timestamp/TimestampDecoder.java
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/EncoderFactory.java
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/EncoderPool.java
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/PrefixTreeEncoder.java
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/ThreadLocalEncoderPool.java
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/column
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/column/ColumnNodeWriter.java
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/column/ColumnSectionWriter.java
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/other
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/other/CellTypeEncoder.java
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/other/LongEncoder.java
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/row
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/row/RowNodeWriter.java
* 
/hbase/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/row

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2013-02-06 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573246#comment-13573246
 ] 

Ted Yu commented on HBASE-4676:
---

https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/396/console seems to 
be close to normal.

HBase-TRUNK-on-Hadoop-2.0.0 #395 and trunk build #3859 were both on ubuntu1.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2013-02-06 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573237#comment-13573237
 ] 

Ted Yu commented on HBASE-4676:
---

https://builds.apache.org/job/HBase-TRUNK/3859/console shows some test failures 
which didn't happen in previous trunk builds.

Will investigate tomorrow morning.

If trunk builds fail consecutively, will consider reverting.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2013-02-06 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573179#comment-13573179
 ] 

Ted Yu commented on HBASE-4676:
---

I locally ran a few of the tests that HBase-TRUNK-on-Hadoop-2.0.0 #395 marked as failed:

mvn clean test -Dhadoop.profile=2.0 -PlocalTests -Dtest=TestFullLogReconstruction
mvn test -Dhadoop.profile=2.0 -PlocalTests -Dtest=TestDelayedRpc
mvn test -Dhadoop.profile=2.0 -PlocalTests -Dtest=TestOfflineMetaRebuildHole,TestMergeTable

They passed.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2013-02-06 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573171#comment-13573171
 ] 

Hudson commented on HBASE-4676:
---

Integrated in HBase-TRUNK-on-Hadoop-2.0.0 #395 (See 
[https://builds.apache.org/job/HBase-TRUNK-on-Hadoop-2.0.0/395/])
HBASE-4676 Prefix Compression - Trie data block encoding (Matt Corgan) 
(Revision 1443289)

 Result = FAILURE
tedyu : 

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2013-02-06 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573044#comment-13573044
 ] 

Ted Yu commented on HBASE-4676:
---

Integrated to trunk (patch v17).

Thanks for the patch, Matt.

Thanks for the review, Stack.

> Prefix Compression - Trie data block encoding
> -
>
> Key: HBASE-4676
> URL: https://issues.apache.org/jira/browse/HBASE-4676
> Project: HBase
>  Issue Type: New Feature
>  Components: io, Performance, regionserver
>Affects Versions: 0.96.0
>Reporter: Matt Corgan
>Assignee: Matt Corgan
>Priority: Critical
> Attachments: 4676-prefix-tree-trunk-v16.patch, 
> 4676-prefix-tree-trunk-v17.patch, HBASE-4676-0.94-v1.patch, 
> HBASE-4676-common-and-server-v8.patch, 
> HBASE-4676-prefix-tree-trunk-v10.patch, 
> HBASE-4676-prefix-tree-trunk-v11.patch, 
> HBASE-4676-prefix-tree-trunk-v12.patch, 
> HBASE-4676-prefix-tree-trunk-v13.patch, 
> HBASE-4676-prefix-tree-trunk-v13.patch, 
> HBASE-4676-prefix-tree-trunk-v14.patch, 
> HBASE-4676-prefix-tree-trunk-v15.patch, 
> HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, 
> HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, 
> HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, 
> HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, 
> HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, 
> PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by 
> blockSize.png
>
>
> The HBase data block format has room for 2 significant improvements for 
> applications that have high block cache hit ratios.  
> First, there is no prefix compression, and the current KeyValue format is 
> somewhat metadata heavy, so there can be tremendous memory bloat for many 
> common data layouts, specifically those with long keys and short values.
> Second, there is no random access to KeyValues inside data blocks.  This 
> means that every time you double the datablock size, average seek time (or 
> average cpu consumption) goes up by a factor of 2.  The standard 64KB block 
> size is ~10x slower for random seeks than a 4KB block size, but block sizes 
> as small as 4KB cause problems elsewhere.  Using block sizes of 256KB or 1MB 
> or more may be more efficient from a disk access and block-cache perspective 
> in many big-data applications, but doing so is infeasible from a random seek 
> perspective.
> The PrefixTrie block encoding format attempts to solve both of these 
> problems.  Some features:
> * trie format for row key encoding completely eliminates duplicate row keys 
> and encodes similar row keys into a standard trie structure which also saves 
> a lot of space
> * the column family is currently stored once at the beginning of each block.  
> this could easily be modified to allow multiple family names per block
> * all qualifiers in the block are stored in their own trie format which 
> caters nicely to wide rows.  duplicate qualifiers between rows are eliminated. 
>  the size of this trie determines the width of the block's qualifier 
> fixed-width-int
> * the minimum timestamp is stored at the beginning of the block, and deltas 
> are calculated from that.  the maximum delta determines the width of the 
> block's timestamp fixed-width-int
> The block is structured with metadata at the beginning, then a section for 
> the row trie, then the column trie, then the timestamp deltas, and then 
> all the values.  Most work is done in the row trie, where every leaf node 
> (corresponding to a row) contains a list of offsets/references corresponding 
> to the cells in that row.  Each cell is fixed-width to enable binary 
> searching and is represented by [1 byte operationType, X bytes qualifier 
> offset, X bytes timestamp delta offset].
> If all operation types are the same for a block, there will be zero per-cell 
> overhead.  Same for timestamps.  Same for qualifiers when I get a chance.  
> So, the compression aspect is very strong, but makes a few small sacrifices 
> on VarInt size to enable faster binary searches in trie fan-out nodes.
> A more compressed but slower version might build on this by also applying 
> further (suffix, etc) compression on the trie nodes at the cost of slower 
> write speed.  Even further compression could be obtained by using all VInts 
> instead of FInts with a sacrifice on random seek speed (though not huge).
> One current drawback is the current write speed.  While programmed with good 
> constructs like TreeMaps, ByteBuffers, binary searches, etc, the write path is not 
> programmed with the same level of optimization as the read path.  Work will 
> need to be done to optimize the data structures used for encoding and could 
> probably show a 10x increase.
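The timestamp handling described above (store the block's minimum timestamp once, then per-cell deltas in a fixed-width int sized by the largest delta) can be sketched as follows. This is an illustrative model only, not the actual HBase prefix-tree implementation; the function names are invented for the example.

```python
# Sketch of the block timestamp scheme described above: one absolute
# minimum timestamp per block, plus fixed-width big-endian deltas whose
# width is determined by the maximum delta in the block.

def fixed_width_bytes(max_value: int) -> int:
    """Smallest byte width that can represent max_value."""
    width = 1
    while max_value >= 1 << (8 * width):
        width += 1
    return width

def encode_timestamps(timestamps):
    """Return (min_ts, width, packed_deltas) for a block of timestamps."""
    min_ts = min(timestamps)
    deltas = [t - min_ts for t in timestamps]
    width = fixed_width_bytes(max(deltas))
    packed = b"".join(d.to_bytes(width, "big") for d in deltas)
    return min_ts, width, packed

def decode_timestamp(min_ts, width, packed, index):
    """Random access: fixed width means the index-th delta can be
    located directly, without scanning earlier cells."""
    start = index * width
    return min_ts + int.from_bytes(packed[start:start + width], "big")
```

The fixed width is what preserves random access (and hence binary search) inside the block, which is the trade-off against variable-length ints mentioned in the description.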

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2013-02-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13573039#comment-13573039
 ] 

Hadoop QA commented on HBASE-4676:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12568325/4676-prefix-tree-trunk-v17.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 105 
new or modified tests.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 lineLengths{color}.  The patch introduces lines longer than 
100

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   
org.apache.hadoop.hbase.coprocessor.example.TestBulkDeleteProtocol

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4360//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4360//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4360//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4360//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4360//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4360//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4360//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4360//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4360//console

This message is automatically generated.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2013-02-06 Thread Matt Corgan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572987#comment-13572987
 ] 

Matt Corgan commented on HBASE-4676:


yep - will do.  i'll create a few follow on issues as well for a benchmark, 
tool for testing on cluster data, pooling decoders, etc


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2013-02-06 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572986#comment-13572986
 ] 

stack commented on HBASE-4676:
--

+1 on integrating.

[~mcorgan] Any chance of adding a release note up here in JIRA?




[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2013-02-06 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572973#comment-13572973
 ] 

Ted Yu commented on HBASE-4676:
---

Will integrate later in the evening if there is no further review comment.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2013-02-06 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572967#comment-13572967
 ] 

Ted Yu commented on HBASE-4676:
---

A diff of the findbugs warnings shows the following:
{code}
tyu$ diff 4348-newPatchFindbugsWarningshbase-server.xml 
4354-newPatchFindbugsWarningshbase-server.xml 
...
1196a1197,1205
>category="STYLE">
> 
>start="72" end="290" sourcefile="ThriftServer.java" 
> sourcepath="org/apache/hadoop/hbase/thrift2/ThriftServer.java"/>
> 
>  name="main" signature="([Ljava/lang/String;)V" isStatic="true">
>start="202" end="290" startBytecode="0" endBytecode="1246" 
> sourcefile="ThriftServer.java" 
> sourcepath="org/apache/hadoop/hbase/thrift2/ThriftServer.java"/>
> 
>  start="284" end="284" startBytecode="479" endBytecode="479" 
> sourcefile="ThriftServer.java" 
> sourcepath="org/apache/hadoop/hbase/thrift2/ThriftServer.java"/>
>   
{code}
The class was not touched by this JIRA. Looks like HBASE-7757 touched that 
class.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2013-02-06 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572700#comment-13572700
 ] 

Ted Yu commented on HBASE-4676:
---

https://builds.apache.org/job/PreCommit-HBASE-Build/4354/artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.xml
 shows there is no findbugs warning.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2013-02-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572697#comment-13572697
 ] 

Hadoop QA commented on HBASE-4676:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12568264/4676-prefix-tree-trunk-v16.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 105 
new or modified tests.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 lineLengths{color}.  The patch introduces lines longer than 
100

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4354//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4354//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4354//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4354//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4354//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4354//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4354//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4354//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4354//console

This message is automatically generated.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2013-02-06 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572589#comment-13572589
 ] 

Ted Yu commented on HBASE-4676:
---

For PrefixTreeArrayScanner, there is one findbug warning. See 
https://builds.apache.org/job/PreCommit-HBASE-Build/4351/artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.xml

I guess the long line warning was due to this line in hbase-prefix-tree/pom.xml:

+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">



[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2013-02-06 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13572269#comment-13572269
 ] 

Hadoop QA commented on HBASE-4676:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12568182/HBASE-4676-prefix-tree-trunk-v15.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 105 
new or modified tests.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 findbugs{color}.  The patch appears to introduce 1 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 lineLengths{color}.  The patch introduces lines longer than 
100

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4351//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4351//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4351//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4351//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4351//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4351//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4351//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4351//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4351//console

This message is automatically generated.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2013-02-04 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570602#comment-13570602
 ] 

Ted Yu commented on HBASE-4676:
---

The 12 findbugs warnings can be found here:
https://builds.apache.org/job/PreCommit-HBASE-Build/4313/artifact/trunk/patchprocess/patchFindbugsWarningshbase-prefix-tree.xml


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2013-02-04 Thread Matt Corgan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570566#comment-13570566
 ] 

Matt Corgan commented on HBASE-4676:


Above test did not fail for me.  

I'll try to nix the findbugs.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2013-02-04 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570405#comment-13570405
 ] 

stack commented on HBASE-4676:
--

[~mcorgan] Findbugs is complaining.  Give the list of complaints a once-over, 
at least the ones in the new module.  If you want to punt for now and do it 
later, that is fine.  I'll just up the findbugs tolerated count for now to get 
the patch in (we are working on getting findbugs overall to zero).  

Does above failed unit test fail for you?



[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2013-02-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570067#comment-13570067
 ] 

Hadoop QA commented on HBASE-4676:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12567811/HBASE-4676-prefix-tree-trunk-v14.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 105 
new or modified tests.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 findbugs{color}.  The patch appears to introduce 12 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 lineLengths{color}.  The patch introduces lines longer than 
100

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   org.apache.hadoop.hbase.util.TestHBaseFsck

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4313//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4313//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4313//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4313//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4313//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4313//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4313//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4313//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4313//console

This message is automatically generated.

> Prefix Compression - Trie data block encoding
> -
>
> Key: HBASE-4676
> URL: https://issues.apache.org/jira/browse/HBASE-4676
> Project: HBase
>  Issue Type: New Feature
>  Components: io, Performance, regionserver
>Affects Versions: 0.96.0
>Reporter: Matt Corgan
>Assignee: Matt Corgan
>Priority: Critical
> Attachments: HBASE-4676-0.94-v1.patch, 
> HBASE-4676-common-and-server-v8.patch, 
> HBASE-4676-prefix-tree-trunk-v10.patch, 
> HBASE-4676-prefix-tree-trunk-v11.patch, 
> HBASE-4676-prefix-tree-trunk-v12.patch, 
> HBASE-4676-prefix-tree-trunk-v13.patch, 
> HBASE-4676-prefix-tree-trunk-v13.patch, 
> HBASE-4676-prefix-tree-trunk-v14.patch, 
> HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, 
> HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, 
> HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, 
> HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, 
> HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, 
> PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by 
> blockSize.png
>
>
> The HBase data block format has room for 2 significant improvements for 
> applications that have high block cache hit ratios.  
> First, there is no prefix compression, and the current KeyValue format is 
> somewhat metadata heavy, so there can be tremendous memory bloat for many 
> common data layouts, specifically those with long keys and short values.
> Second, there is no random access to KeyValues inside data blocks.  This 
> means that every time you double the datablock size, average seek time (or 
> average cpu consumption) goes up by a factor of 2.  The standard 64KB block 
> size is ~10x slower for random seeks than a 4KB block size, but block sizes 
> as small as 4KB cause problems elsewhere.  Using block sizes of 256KB or 1MB 
> or more may be more efficient from a disk access and block-cache perspective 
> in many big-data applications, but doing so is infeasible from a random seek 
> perspective.
> The PrefixTrie block encoding format attempts to solve both of these 
> problems
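
The prefix-compression idea in the description can be illustrated with plain front coding of sorted keys, where each key stores only what differs from its predecessor. This is an illustrative sketch only, not the actual PrefixTree codec (which builds a full trie rather than a linear delta chain):

```java
import java.util.ArrayList;
import java.util.List;

public class FrontCodingDemo {

    /** One encoded entry: chars shared with the previous key, plus the remainder. */
    record Entry(int sharedLen, String suffix) {}

    /** Front-code sorted keys: each key keeps only the part differing from its predecessor. */
    static List<Entry> encode(List<String> sortedKeys) {
        List<Entry> out = new ArrayList<>();
        String prev = "";
        for (String key : sortedKeys) {
            int shared = 0;
            int max = Math.min(prev.length(), key.length());
            while (shared < max && prev.charAt(shared) == key.charAt(shared)) shared++;
            out.add(new Entry(shared, key.substring(shared)));
            prev = key;
        }
        return out;
    }

    /** Reverse the encoding by replaying shared prefixes. */
    static List<String> decode(List<Entry> entries) {
        List<String> out = new ArrayList<>();
        String prev = "";
        for (Entry e : entries) {
            String key = prev.substring(0, e.sharedLen()) + e.suffix();
            out.add(key);
            prev = key;
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> keys = List.of("user123/profile", "user123/settings", "user124/profile");
        List<Entry> encoded = encode(keys);
        // "user123/settings" shares 8 chars ("user123/") with its predecessor.
        System.out.println(encoded.get(1).sharedLen() + ":" + encoded.get(1).suffix());
        System.out.println(decode(encoded).equals(keys));
    }
}
```

Real trie encoding goes further by sharing prefixes across fan-out branches, but even this linear form shows why long, similar row keys compress dramatically.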

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2013-02-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570040#comment-13570040
 ] 

Hadoop QA commented on HBASE-4676:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12567806/HBASE-4676-prefix-tree-trunk-v13.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 105 
new or modified tests.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 findbugs{color}.  The patch appears to introduce 12 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 lineLengths{color}.  The patch introduces lines longer than 100 characters.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4312//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4312//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4312//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4312//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4312//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4312//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4312//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4312//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/4312//console

This message is automatically generated.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2013-02-03 Thread Matt Corgan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570027#comment-13570027
 ] 

Matt Corgan commented on HBASE-4676:


Oh - that was easy.  I have a static final boolean 
PrefixTreeEncoder.MULITPLE_FAMILIES_POSSIBLE that is set to false.  The failing 
test checks correctness with multiple families, so it's a good sign that it 
fails =).  I guess I will comment out the test for now.  Will upload a new 
patch based on latest trunk.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2013-02-03 Thread Matt Corgan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570024#comment-13570024
 ] 

Matt Corgan commented on HBASE-4676:


I am getting the test failures.  Not sure how that happened.  Will look into it.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2013-02-03 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13570015#comment-13570015
 ] 

stack commented on HBASE-4676:
--

Thanks [~mcorgan]

bq. See HFileDataBlockEncoderImpl.newOnDiskDataBlockDecodingContext. It will 
take some work to use it for client/rpc though

As I said above already (only I had forgotten and independently rediscovered 
the same conclusion earlier today), EncodedDataBlock is inappropriate for rpc, 
what with its KV fixation -- and worse, ByteBuffers that are supposed to have 
serialized KVs in them -- and its hfile pollution.  I'm going to go basic and 
just pass a byte array accompanied by a class name to use to decode it; the 
class must implement CellScanner.  (Thinking on it, and you confirm above, 
CellScanner is enough; looking at how we write from client to server it is 
fine... no searching, and ditto on the way out.)  One day we could make use of 
a CellSearcher, but a bunch of at least server-internals code would need 
redoing first.
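
The distinction here, a minimal forward-only CellScanner versus a heavier CellSearcher, can be sketched as a cursor interface plus a trivial list-backed stand-in (a hedged sketch; the names and shape are illustrative, not the eventual hbase-common API):

```java
import java.util.List;

public class CellScannerSketch {

    /** Minimal cursor-style contract for walking cells decoded from a block (illustrative only). */
    interface Scanner<C> {
        C current();       // cell at the cursor, or null when not positioned on one
        boolean advance(); // move forward; false once exhausted
    }

    /** Trivial list-backed scanner standing in for a real block decoder. */
    static <C> Scanner<C> of(List<C> cells) {
        return new Scanner<C>() {
            private int i = -1;
            public C current() { return (i < 0 || i >= cells.size()) ? null : cells.get(i); }
            public boolean advance() { return ++i < cells.size(); }
        };
    }

    public static void main(String[] args) {
        Scanner<String> s = of(List.of("row1/cf:a", "row1/cf:b"));
        while (s.advance()) System.out.println(s.current());
    }
}
```

A searcher would add positioning methods (seek to a given key) on top of this; for an rpc payload that is only ever read front-to-back, the two-method cursor is sufficient.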

On getting this patch in: do you have the test failures I reported above?

Or, let me just run this by jenkins now.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2013-02-03 Thread Matt Corgan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13569930#comment-13569930
 ] 

Matt Corgan commented on HBASE-4676:


Some quick answers for your later questions:
{quote}If I wanted to pass a prefixtree block of Cells on rpc, I see that I 
could write the block with a CellOutputStream/PrefixTreeDecoder but I'm not 
sure I get how I decode.{quote}I think eventually we want to use the same code 
that encodes/decodes to disk.  See 
HFileDataBlockEncoderImpl.newOnDiskDataBlockDecodingContext.  It will take some 
work to use it for client/rpc though.  Code is somewhat confusing now because 
it forks for encoded/non-encoded blocks.

{quote}I suppose I should aim for CellSearcher being the minimal Interface rpc 
block readers should implement? It is pretty heavyweight though compared to 
CellScanner at least. Would CellScanner be too low down?{quote}CellScanner is 
enough.  Current client interface would scan through every cell, explode them 
to KeyValues, and stick the KeyValue[] into the 
org.apache.hadoop.hbase.client.Result object.  Or are you trying to get rid of 
Result now?  I was thinking that could wait till later.

{quote}Can I select a prefixtrie block after this patch is committed and have 
hfiles read and written w/ prefixtree?{quote}Yes, should be fully functional.

{quote}You think we should commit as is? It still needs a bunch of work but I'd 
be up for getting it in as is and having new issues to do fix up.{quote}Yeah, 
that's what i'm thinking.  Some of the next steps are common to all codecs, 
like pooling the decoders.  Future patches would be smaller and more 
understandable if the current code is committed.

Regarding RPC stuff you're working on: In general, I don't think you should use 
Prefix-tree specific code for rpc.  It should use interfaces like 
DataBlockEncoder and HFileDataBlockEncoder so that other encoder 
implementations can be used.  For example, the existing encoders like PREFIX 
and FAST_DIFF are probably better for rpc since rpc does not require random 
access.  Prefix-tree is more expensive to encode.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2013-02-02 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13569707#comment-13569707
 ] 

Ted Yu commented on HBASE-4676:
---

In TestRowEncoder:
{code}
+searcher = new PrefixTreeArraySearcher(blockMetaReader, 
blockMetaReader.getRowTreeDepth(),
{code}
Should we use the following to obtain searcher ?
{code}
+  searcher = DecoderFactory.checkOut(block, true);
{code}


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2013-02-02 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13569703#comment-13569703
 ] 

stack commented on HBASE-4676:
--

Why remove the TODO?

{code}
-//TODO compare byte[]'s in reverse since later bytes more likely to differ
-return 0 == compareStatic(a, b);
+return equalsRow(a, b)
+&& equalsFamily(a, b)
+&& equalsQualifier(a, b)
+&& equalsTimestamp(a, b)
+&& equalsType(a, b);
   }
{code}

Seems to make sense.
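For context, the idea behind the removed TODO — scanning byte arrays from the end — pays off when keys share long common prefixes, since the first differing byte is then near the tail. A minimal sketch (illustrative names, not HBase's actual comparator):

```java
// Equality check that scans backwards: for keys with long shared prefixes,
// a difference is found sooner from the tail than from the head.
public class ReverseEquals {
    public static boolean equalsReverse(byte[] a, byte[] b) {
        if (a.length != b.length) {
            return false;
        }
        for (int i = a.length - 1; i >= 0; i--) { // later bytes more likely to differ
            if (a[i] != b[i]) {
                return false;
            }
        }
        return true;
    }
}
```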

How hard to use pb varints instead of your UVInt?  You go via the tool so 
maybe you could do it therein?  You'd have to up your block version I suppose 
and be able to read the old form?
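For reference, a pb varint is the base-128 encoding protobuf uses: 7 data bits per byte, with the high bit set on every byte except the last. A minimal sketch of that wire format (illustrative, not HBase's or protobuf's actual codec classes):

```java
import java.io.ByteArrayOutputStream;

// Protobuf-style unsigned varint (base-128 / LEB128): 7 data bits per byte,
// continuation bit (0x80) set on all but the final byte.
public class PbVarint {
    public static byte[] encode(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            out.write((int) ((value & 0x7F) | 0x80)); // continuation bit set
            value >>>= 7;
        }
        out.write((int) value); // final byte, high bit clear
        return out.toByteArray();
    }

    public static long decode(byte[] buf) {
        long value = 0;
        int shift = 0;
        for (byte b : buf) {
            value |= (long) (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break;
            }
        }
        return value;
    }
}
```

The trade-off raised in the issue description still applies: variable-width ints shrink the block but cost extra decoding work inside trie fan-out nodes compared with fixed-width ints.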

Why CellScanner in +package org.apache.hbase.codec.prefixtree.scanner?  Why not 
use what is over in hbase-common? (Oh, it's not there yet...)

And CS has this

+  void resetToBeforeFirstEntry();

Did we say strike this?

When I ran tests, I got these failures:


Results :

Tests in error: 
  org.apache.hadoop.hbase.mapreduce.TestMultithreadedTableMapper: Waiting for 
startup of standalone server
  testStopDuringStart(org.apache.hadoop.hbase.master.TestMasterNoCluster): test 
timed out after 3 milliseconds
  org.apache.hadoop.hbase.TestZooKeeper: Shutting down
  org.apache.hadoop.hbase.thrift.TestThriftServer: Shutting down
  testMultiClusters(org.apache.hadoop.hbase.TestHBaseTestingUtility): 
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.access$6(Lorg/apache/hadoop/hbase/client/HConnectionManager$HConnectionImplementation;Lorg/apache/hadoop/hbase/client/HConnectionManager$HConnectionImplementation$MasterProtocolState;)V
  testLocalHBaseCluster(org.apache.hadoop.hbase.TestLocalHBaseCluster): 
Shutting down
  testHDFSRegioninfoMissing(org.apache.hadoop.hbase.util.TestHBaseFsck): 
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = 
ConnectionLoss for /hbase/root-region-server
  
testRebalanceOnRegionServerNumberChange[0](org.apache.hadoop.hbase.TestRegionRebalancing):
 Shutting down
  
testRebalanceOnRegionServerNumberChange[1](org.apache.hadoop.hbase.TestRegionRebalancing):
 Shutting down

Tests run: 1193, Failures: 0, Errors: 9, Skipped: 13

They pass for you?

If I wanted to pass a prefixtree block of Cells on rpc, I see that I could 
write the block with a CellOutputStream/PrefixTreeDecoder but I'm not sure I 
get how I decode.  I don't see a PrefixTreeDecoder (that implements 
CellSearcher).  I suppose I should aim for CellSearcher being the minimal 
Interface rpc block readers should implement?  It is pretty heavyweight though 
compared to CellScanner at least.  Would CellScanner be too low down?  If an 
encoding doesn't support CellSearcher, it should not implement it, or throw 
UnsupportedOperationException when you try to do a previous, etc?  Ok to expect a null 
arg constructor on decoders or will I need to pass in context?  (or all context 
they need is in the block?)
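The layering being discussed could look roughly like this — a minimal sketch with simplified names (these are not the actual hbase-common types): CellScanner as the minimal forward-only contract, CellSearcher extending it with random access, and encodings that cannot seek free to throw UnsupportedOperationException.

```java
import java.util.List;

// Simplified sketch of the scanner/searcher layering under discussion.
public class ScannerLayering {
    interface Cell { byte[] getRow(); }

    interface CellScanner {
        boolean advance();   // move to the next cell; false when exhausted
        Cell current();
    }

    interface CellSearcher extends CellScanner {
        boolean positionAt(byte[] row); // random access; optional for some encodings
    }

    /** Trivial list-backed searcher, here only to exercise the contract. */
    static class ListSearcher implements CellSearcher {
        private final List<byte[]> rows;
        private int pos = -1;
        ListSearcher(List<byte[]> rows) { this.rows = rows; }
        public boolean advance() { return ++pos < rows.size(); }
        public Cell current() {
            final byte[] row = rows.get(pos);
            return () -> row;
        }
        public boolean positionAt(byte[] row) {
            for (int i = 0; i < rows.size(); i++) {
                if (java.util.Arrays.equals(rows.get(i), row)) { pos = i; return true; }
            }
            return false;
        }
    }
}
```

Under this split, an rpc block reader would only need CellScanner; CellSearcher stays a heavier, opt-in interface for seekable encodings like the prefix tree.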

You think we should commit as is?  It still needs a bunch of work but I'd be up 
for getting it in as is and filing new issues for the fixup.  Can I select a 
prefixtrie block after this patch is committed and have hfiles read and written 
w/ prefixtree?
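Assuming the encoding registers under a name like PREFIX_TREE once committed (hypothetical until the patch lands), selecting it per column family from the HBase shell would presumably look like:

```
hbase> create 't1', {NAME => 'f1', DATA_BLOCK_ENCODING => 'PREFIX_TREE'}
hbase> alter 't1', {NAME => 'f1', DATA_BLOCK_ENCODING => 'PREFIX_TREE'}
```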


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2013-01-04 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13544230#comment-13544230
 ] 

Ted Yu commented on HBASE-4676:
---

checkLineLengths() in dev-support/test-patch.sh doesn't exclude pom.xml

+1 on patch v13.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2013-01-04 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13544223#comment-13544223
 ] 

Hadoop QA commented on HBASE-4676:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12563341/HBASE-4676-prefix-tree-trunk-v13.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 105 
new or modified tests.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 lineLengths{color}.  The patch introduces lines longer than 
100

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   org.apache.hadoop.hbase.replication.TestReplication

 {color:red}-1 core zombie tests{color}.  There are 6 zombie test(s): 

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3861//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3861//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3861//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3861//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3861//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3861//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3861//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3861//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3861//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3861//console

This message is automatically generated.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2013-01-04 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543932#comment-13543932
 ] 

Ted Yu commented on HBASE-4676:
---

@Matt:
You may want to take care of the long lines in patch v12.

Thanks


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2013-01-03 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543635#comment-13543635
 ] 

Hadoop QA commented on HBASE-4676:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12563241/HBASE-4676-prefix-tree-trunk-v12.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 105 
new or modified tests.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:green}+1 findbugs{color}.  The patch does not introduce any new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:red}-1 lineLengths{color}.  The patch introduces lines longer than 
100

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3849//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3849//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3849//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3849//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3849//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3849//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3849//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3849//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3849//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3849//console

This message is automatically generated.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-12-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13539102#comment-13539102
 ] 

Hadoop QA commented on HBASE-4676:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12562286/HBASE-4676-prefix-tree-trunk-v11.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 105 
new or modified tests.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 2 
warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 findbugs{color}.  The patch appears to introduce 14 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   org.apache.hadoop.hbase.client.TestMultiParallel
  org.apache.hadoop.hbase.client.TestFromClientSide
  
org.apache.hadoop.hbase.replication.TestReplicationWithCompression

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3680//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3680//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3680//console

This message is automatically generated.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-12-23 Thread Matt Corgan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13539084#comment-13539084
 ] 

Matt Corgan commented on HBASE-4676:


{quote}Since CopyKeyDataBlockEncoder isn't included in the patch, I would 
suggest removing the above enum.{quote}CopyKeyDataBlockEncoder already exists 
in the code (not added by me).  This comment just lets developers know about it.

> Prefix Compression - Trie data block encoding
> -
>
> Key: HBASE-4676
> URL: https://issues.apache.org/jira/browse/HBASE-4676
> Project: HBase
>  Issue Type: New Feature
>  Components: io, Performance, regionserver
>Affects Versions: 0.96.0
>Reporter: Matt Corgan
>Assignee: Matt Corgan
> Attachments: HBASE-4676-0.94-v1.patch, 
> HBASE-4676-common-and-server-v8.patch, 
> HBASE-4676-prefix-tree-trunk-v10.patch, 
> HBASE-4676-prefix-tree-trunk-v11.patch, 
> HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, 
> HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, 
> HBASE-4676-prefix-tree-trunk-v5.patch, HBASE-4676-prefix-tree-trunk-v6.patch, 
> HBASE-4676-prefix-tree-trunk-v7.patch, HBASE-4676-prefix-tree-trunk-v8.patch, 
> HBASE-4676-prefix-tree-trunk-v9.patch, hbase-prefix-trie-0.1.jar, 
> PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by 
> blockSize.png
>
>
> The HBase data block format has room for 2 significant improvements for 
> applications that have high block cache hit ratios.  
> First, there is no prefix compression, and the current KeyValue format is 
> somewhat metadata heavy, so there can be tremendous memory bloat for many 
> common data layouts, specifically those with long keys and short values.
> Second, there is no random access to KeyValues inside data blocks.  This 
> means that every time you double the datablock size, average seek time (or 
> average cpu consumption) goes up by a factor of 2.  The standard 64KB block 
> size is ~10x slower for random seeks than a 4KB block size, but block sizes 
> as small as 4KB cause problems elsewhere.  Using block sizes of 256KB or 1MB 
> or more may be more efficient from a disk access and block-cache perspective 
> in many big-data applications, but doing so is infeasible from a random seek 
> perspective.
> The PrefixTrie block encoding format attempts to solve both of these 
> problems.  Some features:
> * trie format for row key encoding completely eliminates duplicate row keys 
> and encodes similar row keys into a standard trie structure which also saves 
> a lot of space
> * the column family is currently stored once at the beginning of each block.  
> this could easily be modified to allow multiple family names per block
> * all qualifiers in the block are stored in their own trie format which 
> caters nicely to wide rows.  duplicate qualifers between rows are eliminated. 
>  the size of this trie determines the width of the block's qualifier 
> fixed-width-int
> * the minimum timestamp is stored at the beginning of the block, and deltas 
> are calculated from that.  the maximum delta determines the width of the 
> block's timestamp fixed-width-int
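As an illustrative sketch of the timestamp scheme in the last bullet (not the patch's actual encoder; the class and method names here are hypothetical): the block stores its minimum timestamp once, each cell keeps only a delta, and the widest delta in the block decides the fixed-width-int size.

```java
// Hypothetical sketch of the timestamp delta scheme described above:
// store the block's minimum timestamp once and keep only per-cell
// deltas, sized by the fewest whole bytes that hold the widest delta.
public class TimestampDeltaSketch {

    // Fewest whole bytes that can represent maxDelta as an unsigned int.
    static int fixedWidthBytes(long maxDelta) {
        int width = 1;
        while ((maxDelta >>> (8 * width)) != 0) {
            width++;
        }
        return width;
    }

    public static void main(String[] args) {
        long[] timestamps = {1355270400000L, 1355270400123L, 1355270465000L};
        long min = Long.MAX_VALUE;
        for (long t : timestamps) {
            min = Math.min(min, t);
        }
        long maxDelta = 0;
        for (long t : timestamps) {
            maxDelta = Math.max(maxDelta, t - min);
        }
        // The widest delta here is 65000, which fits in 2 bytes, so every
        // cell's timestamp in this block would cost 2 bytes.
        System.out.println(fixedWidthBytes(maxDelta));
    }
}
```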
> The block is structured with metadata at the beginning, then a section for 
> the row trie, then the column trie, then the timestamp deltas, and then 
> all the values.  Most work is done in the row trie, where every leaf node 
> (corresponding to a row) contains a list of offsets/references corresponding 
> to the cells in that row.  Each cell is fixed-width to enable binary 
> searching and is represented by [1 byte operationType, X bytes qualifier 
> offset, X bytes timestamp delta offset].
> If all operation types are the same for a block, there will be zero per-cell 
> overhead.  Same for timestamps.  Same for qualifiers when I get a chance.  
> So, the compression aspect is very strong, but makes a few small sacrifices 
> on VarInt size to enable faster binary searches in trie fan-out nodes.
> A more compressed but slower version might build on this by also applying 
> further (suffix, etc) compression on the trie nodes at the cost of slower 
> write speed.  Even further compression could be obtained by using all VInts 
> instead of FInts with a sacrifice on random seek speed (though not huge).
> One current drawback is the current write speed.  While programmed with good 
> constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not 
> programmed with the same level of optimization as the read path.  Work will 
> need to be done to optimize the data structures used for encoding and could 
> probably show a 10x increase.  It will still be slower than delta encoding, 
> but with a much higher decode speed.  I have not yet created a thorough 
> benchmark for write speed.
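The fixed-width cell entries described above are what make binary search inside a block possible: when every entry occupies exactly the same number of bytes, entry k starts at k * width, so no neighbor needs decoding. A minimal sketch under assumed names (none taken from the patch):

```java
// Hypothetical sketch of searching a sorted section of fixed-width,
// big-endian unsigned int entries, as the block format above enables.
public class FixedWidthSearchSketch {

    // Read a big-endian unsigned fixed-width int at byte offset pos.
    static long readFInt(byte[] buf, int pos, int width) {
        long v = 0;
        for (int i = 0; i < width; i++) {
            v = (v << 8) | (buf[pos + i] & 0xFF);
        }
        return v;
    }

    // Binary search over `count` sorted fixed-width entries; returns the
    // matching entry index, or -1 if the target is absent.
    static int search(byte[] buf, int count, int width, long target) {
        int lo = 0, hi = count - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            long v = readFInt(buf, mid * width, width);
            if (v == target) {
                return mid;
            } else if (v < target) {
                lo = mid + 1;
            } else {
                hi = mid - 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Three 2-byte entries encoding the sorted values 5, 300, 65535.
        byte[] section = {0, 5, 1, 44, (byte) 0xFF, (byte) 0xFF};
        System.out.println(search(section, 3, 2, 300));   // entry index 1
    }
}
```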

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-12-23 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13539021#comment-13539021
 ] 

Ted Yu commented on HBASE-4676:
---

{code}
+  //COPY_KEY is for benchmarking only
+  // COPY_KEY(5, 
createEncoder("org.apache.hadoop.hbase.io.encoding.CopyKeyDataBlockEncoder")),
{code}
Since CopyKeyDataBlockEncoder isn't included in the patch, I would suggest 
removing the above enum.

There were some javadoc warnings spotted by Hadoop QA. You can use the 
following command to find the warnings:
{code}
mvn clean package javadoc:javadoc -DskipTests -DHBasePatchProcess > patchJavadocWarnings.txt
{code}
{code}
[WARNING] Javadoc Warnings
[WARNING] 
/Users/tyu/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/scanner/ReversibleCellScanner.java:42:
 warning - Tag @link: reference not found: CellScannerPosition.BEFORE_FIRST
[WARNING] 
/Users/tyu/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/scanner/ReversibleCellScanner.java:52:
 warning - Tag @link: reference not found: CellScannerPosition.BEFORE_FIRST
[WARNING] 
/Users/tyu/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/PrefixTreeArrayScanner.java:207:
 warning - @return tag has no arguments.
[WARNING] 
/Users/tyu/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/decode/PrefixTreeArrayScanner.java:303:
 warning - @param argument "forwards:" is not a parameter name.
[WARNING] 
/Users/tyu/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/scanner/CellSearcher.java:77:
 warning - Tag @link: can't find KeyValueScanner.reseek() in 
org.apache.hbase.codec.prefixtree.scanner.CellSearcher
[WARNING] 
/Users/tyu/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/encode/column/ColumnNodeWriter.java:85:
 warning - @return tag has no arguments.
[WARNING] 
/Users/tyu/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/PrefixTreeCodec.java:56:
 warning - @ is an unknown tag.
[WARNING] 
/Users/tyu/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/PrefixTreeCodec.java:56:
 warning - @ is an unknown tag.
[WARNING] 
/Users/tyu/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/scanner/ReversibleCellScanner.java:42:
 warning - Tag @link: reference not found: CellScannerPosition.BEFORE_FIRST
[WARNING] 
/Users/tyu/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/codec/prefixtree/scanner/ReversibleCellScanner.java:52:
 warning - Tag @link: reference not found: CellScannerPosition.BEFORE_FIRST
[WARNING] 
/Users/tyu/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/util/byterange/ByteRangeSet.java:38:
 warning - Tag @link: reference not found: ByteRangeHashSet
[WARNING] 
/Users/tyu/trunk/hbase-prefix-tree/src/main/java/org/apache/hbase/util/byterange/ByteRangeSet.java:38:
 warning - Tag @link: reference not found: ByteRangeTreeSet
[INFO]
{code}
The above are warnings generated as described.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-12-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13538991#comment-13538991
 ] 

Hadoop QA commented on HBASE-4676:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12562262/HBASE-4676-prefix-tree-trunk-v10.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 105 
new or modified tests.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 
14 warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 findbugs{color}.  The patch appears to introduce 14 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   org.apache.hadoop.hbase.replication.TestReplication

 {color:red}-1 core zombie tests{color}.  There are zombie tests. See build 
logs for details.

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3677//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3677//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3677//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3677//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3677//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3677//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3677//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3677//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3677//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3677//console

This message is automatically generated.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-12-20 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13537713#comment-13537713
 ] 

Hadoop QA commented on HBASE-4676:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12562043/HBASE-4676-prefix-tree-trunk-v9.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 105 
new or modified tests.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:green}+1 javadoc{color}.  The javadoc tool did not generate any 
warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 findbugs{color}.  The patch appears to introduce 62 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

 {color:red}-1 core zombie tests{color}.  There are zombie tests. See build 
logs for details.

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3648//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3648//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3648//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3648//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3648//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3648//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3648//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3648//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3648//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3648//console

This message is automatically generated.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-12-18 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13535541#comment-13535541
 ] 

Ted Yu commented on HBASE-4676:
---

{code}
+++ 
b/hbase-prefix-tree/src/test/java/org/apache/hbase/codec/prefixtree/timestamp/data/TestTimestamps1.java
+++ 
b/hbase-prefix-tree/src/test/java/org/apache/hbase/codec/prefixtree/timestamp/data/TestTimestamps2.java
+++ 
b/hbase-prefix-tree/src/test/java/org/apache/hbase/codec/prefixtree/timestamp/data/TestTimestamps3.java
+++ 
b/hbase-prefix-tree/src/test/java/org/apache/hbase/codec/prefixtree/builder/data/TestSortedByteArrays1.java
{code}
Can the class names be more informative about what they test?

For TimestampCompressorTests.java, KeyValueToolTests.java, 
PrefixTreeBlockMetaTests.java, etc, normally their class names should start 
with Test.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-12-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13530796#comment-13530796
 ] 

Hadoop QA commented on HBASE-4676:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12560147/HBASE-4676-prefix-tree-trunk-v8.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 129 
new or modified tests.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 
116 warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 findbugs{color}.  The patch appears to introduce 62 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3525//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3525//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3525//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3525//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3525//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3525//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3525//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3525//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3525//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3525//console

This message is automatically generated.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-12-10 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13528175#comment-13528175
 ] 

stack commented on HBASE-4676:
--

Took a quick look through v8.  Need to come back.

You probably don't mean this to be included:

 
 # The java implementation to use.  Java 1.6 required.
-# export JAVA_HOME=/usr/java/jdk1.6.0/
+export JAVA_HOME=/hpdev/jdk6



... or the hbase-site.xml changes.


Can leave out on commit.

Hmmm... should move the below up into util:

hbase-common/src/main/java/org/apache/hadoop/hbase/util/test/RedundantKVGenerator.java

Putting it in a subpackage called 'test' would seem to say it should be over 
under src/test.  Nit.


Do we have to add an IOE on write too in CellOutputStream?

Prefix-tree depends on hadoop?

Should content of ByteUtils just go into utils/Bytes?  Ditto content of 
StringByteUtils.

I need to help you get this in.  Perhaps we should break out the stuff that 
belongs in hbase-common?  Make this patch smaller?


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-12-09 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13527742#comment-13527742
 ] 

Hadoop QA commented on HBASE-4676:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12560147/HBASE-4676-prefix-tree-trunk-v8.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 129 
new or modified tests.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 
113 warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 findbugs{color}.  The patch appears to introduce 62 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   
org.apache.hadoop.hbase.coprocessor.TestRegionServerCoprocessorExceptionWithAbort

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3470//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3470//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3470//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3470//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-protocol.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3470//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3470//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3470//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3470//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3470//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3470//console

This message is automatically generated.

> Prefix Compression - Trie data block encoding
> -
>
> Key: HBASE-4676
> URL: https://issues.apache.org/jira/browse/HBASE-4676
> Project: HBase
>  Issue Type: New Feature
>  Components: io, Performance, regionserver
>Affects Versions: 0.96.0
>Reporter: Matt Corgan
>Assignee: Matt Corgan
> Attachments: HBASE-4676-0.94-v1.patch, 
> HBASE-4676-common-and-server-v8.patch, HBASE-4676-prefix-tree-trunk-v1.patch, 
> HBASE-4676-prefix-tree-trunk-v2.patch, HBASE-4676-prefix-tree-trunk-v3.patch, 
> HBASE-4676-prefix-tree-trunk-v4.patch, HBASE-4676-prefix-tree-trunk-v5.patch, 
> HBASE-4676-prefix-tree-trunk-v6.patch, HBASE-4676-prefix-tree-trunk-v7.patch, 
> HBASE-4676-prefix-tree-trunk-v8.patch, hbase-prefix-trie-0.1.jar, 
> PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by 
> blockSize.png
>
>
> The HBase data block format has room for 2 significant improvements for 
> applications that have high block cache hit ratios.  
> First, there is no prefix compression, and the current KeyValue format is 
> somewhat metadata heavy, so there can be tremendous memory bloat for many 
> common data layouts, specifically those with long keys and short values.
> Second, there is no random access to KeyValues inside data blocks.  This 
> means that every time you double the datablock size, average seek time (or 
> average cpu consumption) goes up by a factor of 2.  The standard 64KB block 
> size is ~10x slower for random seeks than a 4KB block size, but block sizes 
> as small as 4KB cause problems elsewhere.  Using block sizes of 256KB or 1MB 
> or more may be more efficient from a disk access and block-cache perspective 
> in many big-data applications, but doing so is infeasible from a random seek 
> perspective.
> The PrefixTrie block encoding format attempts to solve both of these 
> problems.  Some features:
> * trie format for row key encoding completely eliminates duplicate row keys 
> and encodes similar row keys into a standard trie structure which also saves 
> a lot of space
> * the column famil

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-11-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13497577#comment-13497577
 ] 

Hadoop QA commented on HBASE-4676:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12553580/HBASE-4676-common-and-server-v8.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 24 new 
or modified tests.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 
99 warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 findbugs{color}.  The patch appears to introduce 27 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   org.apache.hadoop.hbase.TestCompare
  org.apache.hadoop.hbase.io.hfile.TestHFileReaderV1
  org.apache.hadoop.hbase.io.hfile.TestHFileBlockCompatibility
  org.apache.hadoop.hbase.io.hfile.TestLruBlockCache
  org.apache.hadoop.hbase.regionserver.TestStoreFile
  org.apache.hadoop.hbase.TestHTableDescriptor
  org.apache.hadoop.hbase.io.TestHeapSize
  org.apache.hadoop.hbase.TestHColumnDescriptor
  org.apache.hadoop.hbase.master.TestCatalogJanitor
  org.apache.hadoop.hbase.io.hfile.TestHFileDataBlockEncoder
  org.apache.hadoop.hbase.regionserver.TestColumnSeeking
  org.apache.hadoop.hbase.regionserver.TestRSStatusServlet
  org.apache.hadoop.hbase.regionserver.TestScanner
  org.apache.hadoop.hbase.regionserver.TestSplitTransaction
  org.apache.hadoop.hbase.regionserver.TestHBase7051
  org.apache.hadoop.hbase.coprocessor.TestCoprocessorInterface
  org.apache.hadoop.hbase.TestFSTableDescriptorForceCreation
  
org.apache.hadoop.hbase.io.hfile.TestHFileInlineToRootChunkConversion
  org.apache.hadoop.hbase.regionserver.TestBlocksScanned
  org.apache.hadoop.hbase.regionserver.TestKeepDeletes
  org.apache.hadoop.hbase.filter.TestDependentColumnFilter
  org.apache.hadoop.hbase.regionserver.TestResettingCounters
  org.apache.hadoop.hbase.TestSerialization
  org.apache.hadoop.hbase.coprocessor.TestRegionObserverStacking
  org.apache.hadoop.hbase.io.TestHalfStoreFileReader
  org.apache.hadoop.hbase.io.hfile.TestSeekTo
  org.apache.hadoop.hbase.regionserver.TestScanWithBloomError
  
org.apache.hadoop.hbase.regionserver.wal.TestWALActionsListener
  org.apache.hadoop.hbase.regionserver.TestRegionSplitPolicy
  org.apache.hadoop.hbase.io.hfile.TestCachedBlockQueue
  org.apache.hadoop.hbase.regionserver.TestHRegionInfo
  org.apache.hadoop.hbase.regionserver.TestCompactSelection
  org.apache.hadoop.hbase.constraint.TestConstraints
  org.apache.hadoop.hbase.filter.TestColumnPrefixFilter
  org.apache.hadoop.hbase.io.hfile.TestHFile
  org.apache.hadoop.hbase.filter.TestMultipleColumnPrefixFilter
  org.apache.hadoop.hbase.regionserver.TestMinVersions
  org.apache.hadoop.hbase.rest.model.TestTableRegionModel
  org.apache.hadoop.hbase.regionserver.TestWideScanner
  org.apache.hadoop.hbase.client.TestIntraRowPagination
  org.apache.hadoop.hbase.io.hfile.TestReseekTo
  org.apache.hadoop.hbase.filter.TestFilter

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3341//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3341//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3341//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3341//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3341//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apac

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-11-14 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13497561#comment-13497561
 ] 

stack commented on HBASE-4676:
--

I'm up for committing the hbase-common and hbase-server changes that are up in 
RB.  Attach the patch here or rather to HBASE-7162, an issue just for these 
changes.

> Prefix Compression - Trie data block encoding
> -
>
> Key: HBASE-4676
> URL: https://issues.apache.org/jira/browse/HBASE-4676
> Project: HBase
>  Issue Type: New Feature
>  Components: io, Performance, regionserver
>Affects Versions: 0.96.0
>Reporter: Matt Corgan
>Assignee: Matt Corgan
> Attachments: HBASE-4676-0.94-v1.patch, 
> HBASE-4676-common-and-server-v8.patch, HBASE-4676-prefix-tree-trunk-v1.patch, 
> HBASE-4676-prefix-tree-trunk-v2.patch, HBASE-4676-prefix-tree-trunk-v3.patch, 
> HBASE-4676-prefix-tree-trunk-v4.patch, HBASE-4676-prefix-tree-trunk-v5.patch, 
> HBASE-4676-prefix-tree-trunk-v6.patch, HBASE-4676-prefix-tree-trunk-v7.patch, 
> hbase-prefix-trie-0.1.jar, PrefixTrie_Format_v1.pdf, 
> PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png
>
>
> The HBase data block format has room for 2 significant improvements for 
> applications that have high block cache hit ratios.  
> First, there is no prefix compression, and the current KeyValue format is 
> somewhat metadata heavy, so there can be tremendous memory bloat for many 
> common data layouts, specifically those with long keys and short values.
> Second, there is no random access to KeyValues inside data blocks.  This 
> means that every time you double the datablock size, average seek time (or 
> average cpu consumption) goes up by a factor of 2.  The standard 64KB block 
> size is ~10x slower for random seeks than a 4KB block size, but block sizes 
> as small as 4KB cause problems elsewhere.  Using block sizes of 256KB or 1MB 
> or more may be more efficient from a disk access and block-cache perspective 
> in many big-data applications, but doing so is infeasible from a random seek 
> perspective.
> The PrefixTrie block encoding format attempts to solve both of these 
> problems.  Some features:
> * trie format for row key encoding completely eliminates duplicate row keys 
> and encodes similar row keys into a standard trie structure which also saves 
> a lot of space
> * the column family is currently stored once at the beginning of each block.  
> this could easily be modified to allow multiple family names per block
> * all qualifiers in the block are stored in their own trie format which 
> caters nicely to wide rows.  duplicate qualifiers between rows are eliminated. 
>  the size of this trie determines the width of the block's qualifier 
> fixed-width-int
> * the minimum timestamp is stored at the beginning of the block, and deltas 
> are calculated from that.  the maximum delta determines the width of the 
> block's timestamp fixed-width-int
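The timestamp bullet above can be illustrated with a small sketch: store the block's minimum timestamp once, store each cell's timestamp as a delta from it, and size the fixed-width delta column by the largest delta in the block. This is invented illustration code, not the patch's actual encoder:

```java
// Hypothetical sketch of per-block timestamp delta encoding.
public class TimestampDeltas {
    // Per-block minimum; each cell then stores (ts - min).
    static long blockMin(long[] timestamps) {
        long min = Long.MAX_VALUE;
        for (long t : timestamps) min = Math.min(min, t);
        return min;
    }

    // Width in bytes of the fixed-width delta column: the smallest width
    // that can hold the largest delta in the block.
    static int deltaWidth(long[] timestamps) {
        long min = blockMin(timestamps);
        long maxDelta = 0;
        for (long t : timestamps) maxDelta = Math.max(maxDelta, t - min);
        int width = 1;
        while ((maxDelta >>> (8 * width)) != 0) width++;
        return width;
    }

    public static void main(String[] args) {
        long[] ts = {1352400000000L, 1352400000250L, 1352400001000L};
        // Deltas are 0, 250, 1000; the max delta 1000 fits in 2 bytes,
        // so every cell pays 2 bytes instead of 8 for a raw long.
        System.out.println(blockMin(ts));   // prints 1352400000000
        System.out.println(deltaWidth(ts)); // prints 2
    }
}
```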
> The block is structured with metadata at the beginning, then a section for 
> the row trie, then the column trie, then the timestamp deltas, and then 
> all the values.  Most work is done in the row trie, where every leaf node 
> (corresponding to a row) contains a list of offsets/references corresponding 
> to the cells in that row.  Each cell is fixed-width to enable binary 
> searching and is represented by [1 byte operationType, X bytes qualifier 
> offset, X bytes timestamp delta offset].
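The fixed-width cell layout just described is what makes binary search inside a row's leaf node possible: record i always starts at i * recordWidth, so no decoding of earlier records is needed. A rough sketch, with the field widths and names assumed for illustration (the real patch derives the widths per block):

```java
import java.nio.ByteBuffer;

// Hypothetical sketch of binary search over fixed-width cell records.
public class FixedWidthCellSearch {
    // Assumed layout per the description: [1 byte op type][2 byte qualifier
    // offset][2 byte timestamp-delta offset] => 5 bytes per cell here.
    static final int RECORD_WIDTH = 5;

    // Binary search the qualifier-offset field (bytes 1-2 of each record),
    // assuming records are sorted by that field within the row.
    static int findByQualifierOffset(ByteBuffer cells, int numCells, int target) {
        int lo = 0, hi = numCells - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int qo = cells.getShort(mid * RECORD_WIDTH + 1) & 0xFFFF;
            if (qo == target) return mid;
            if (qo < target) lo = mid + 1; else hi = mid - 1;
        }
        return -1; // not found
    }

    public static void main(String[] args) {
        ByteBuffer cells = ByteBuffer.allocate(3 * RECORD_WIDTH);
        int[] qualOffsets = {10, 42, 99}; // sorted within the row
        for (int i = 0; i < 3; i++) {
            cells.put(i * RECORD_WIDTH, (byte) 4);                    // op type
            cells.putShort(i * RECORD_WIDTH + 1, (short) qualOffsets[i]);
            cells.putShort(i * RECORD_WIDTH + 3, (short) 0);          // ts delta offset
        }
        System.out.println(findByQualifierOffset(cells, 3, 42)); // prints 1
    }
}
```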
> If all operation types are the same for a block, there will be zero per-cell 
> overhead.  Same for timestamps.  Same for qualifiers when I get a chance.  
> So, the compression aspect is very strong, but makes a few small sacrifices 
> on VarInt size to enable faster binary searches in trie fan-out nodes.
> A more compressed but slower version might build on this by also applying 
> further (suffix, etc) compression on the trie nodes at the cost of slower 
> write speed.  Even further compression could be obtained by using all VInts 
> instead of FInts with a sacrifice on random seek speed (though not huge).
> One current drawback is the current write speed.  While programmed with good 
> constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not 
> programmed with the same level of optimization as the read path.  Work will 
> need to be done to optimize the data structures used for encoding and could 
> probably show a 10x increase.  It will still be slower than delta encoding, 
> but with a much higher decode speed.  I have not yet created a thorough 
> benchmark for write speed nor sequential read speed.
> Though the trie is reaching a point where it is internally very efficient 
> (probably within half or a quarter of its max read speed) the way that hbase 
> currently uses it is far from optimal.  The KeyValueScanner and

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-11-13 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496917#comment-13496917
 ] 

stack commented on HBASE-4676:
--

Looking in v7 patch attached here:

Should this be in the patch Matt?  b/bin/index-builder-setup.rb

Did you mean to include the hbase prefix tree module in this hbase-common 
patch?  i.e.  b/hbase-prefix-tree/pom.xml?  I've not reviewed the prefixtree 
stuff.  Let's get the hbase-common changes in first (including the removes of 
classes moved from hbase-server to hbase-common?)

Or, if you think that v7 has all addressed up on RB, then I could commit only 
the bits of the v7 patch that touch hbase-common and hbase-server?






[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-11-13 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13496865#comment-13496865
 ] 

Hadoop QA commented on HBASE-4676:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12553437/HBASE-4676-prefix-tree-trunk-v7.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 142 
new or modified tests.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 
109 warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 findbugs{color}.  The patch appears to introduce 59 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   
org.apache.hadoop.hbase.coprocessor.TestRegionServerCoprocessorExceptionWithAbort

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3332//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3332//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3332//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3332//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3332//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3332//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3332//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3332//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3332//console

This message is automatically generated.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-11-09 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13494579#comment-13494579
 ] 

Hadoop QA commented on HBASE-4676:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12552931/HBASE-4676-prefix-tree-trunk-v6.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 142 
new or modified tests.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 
103 warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 findbugs{color}.  The patch appears to introduce 60 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3299//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3299//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3299//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3299//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3299//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3299//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3299//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3299//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3299//console

This message is automatically generated.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-11-09 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493840#comment-13493840
 ] 

Hadoop QA commented on HBASE-4676:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12552792/HBASE-4676-prefix-tree-trunk-v5.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 142 
new or modified tests.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 
103 warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 findbugs{color}.  The patch appears to introduce 58 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   
org.apache.hadoop.hbase.io.hfile.TestForceCacheImportantBlocks
  org.apache.hadoop.hbase.TestDrainingServer

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3286//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3286//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3286//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3286//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3286//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3286//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3286//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3286//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3286//console

This message is automatically generated.

> Prefix Compression - Trie data block encoding
> -
>
> Key: HBASE-4676
> URL: https://issues.apache.org/jira/browse/HBASE-4676
> Project: HBase
>  Issue Type: New Feature
>  Components: io, Performance, regionserver
>Affects Versions: 0.96.0
>Reporter: Matt Corgan
>Assignee: Matt Corgan
> Attachments: HBASE-4676-0.94-v1.patch, 
> HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, 
> HBASE-4676-prefix-tree-trunk-v3.patch, HBASE-4676-prefix-tree-trunk-v4.patch, 
> HBASE-4676-prefix-tree-trunk-v5.patch, hbase-prefix-trie-0.1.jar, 
> PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by 
> blockSize.png
>
>
> The HBase data block format has room for 2 significant improvements for 
> applications that have high block cache hit ratios.  
> First, there is no prefix compression, and the current KeyValue format is 
> somewhat metadata heavy, so there can be tremendous memory bloat for many 
> common data layouts, specifically those with long keys and short values.
> Second, there is no random access to KeyValues inside data blocks.  This 
> means that every time you double the datablock size, average seek time (or 
> average cpu consumption) goes up by a factor of 2.  The standard 64KB block 
> size is ~10x slower for random seeks than a 4KB block size, but block sizes 
> as small as 4KB cause problems elsewhere.  Using block sizes of 256KB or 1MB 
> or more may be more efficient from a disk access and block-cache perspective 
> in many big-data applications, but doing so is infeasible from a random seek 
> perspective.
> The PrefixTrie block encoding format attempts to solve both of these 
> problems.  Some features:
> * trie format for row key encoding completely eliminates duplicate row keys 
> and encodes similar row keys into a standard trie structure, which also saves 
> a lot of space
> * the column family is currently stored once at the beginning of each block.  
> this could easily be modified to allow multiple family names per block
> * all qualifiers in the block are stored in their own trie format, which 
> caters nicely to wide rows.  duplicate qualifiers between rows are eliminated. 
>  the size of this trie determines the width of the block's qualifier 
> fixed-width-int
> * the minimum timestamp is stored at the beginning of the block, and deltas 
> are calculated from that.  the maximum delta determines the width of the 
> block's timestamp fixed-width-int
> The block is structured with metadata at the beginning, then a section for 
> the row trie, then the column trie, then the timestamp deltas, and then 
> all the values.  Most work is done in the row trie, where every leaf node 
> (corresponding to a row) contains a list of offsets/references corresponding 
> to the cells in that row.  Each cell is fixed-width to enable binary 
> searching and is represented by [1 byte operationType, X bytes qualifier 
> offset, X bytes timestamp delta offset].
> If all operation types are the same for a block, there will be zero per-cell 
> overhead.  Same for timestamps.  Same for qualifiers when I get a chance.  
> So, the compression aspect is very strong, but makes a few small sacrifices 
> on VarInt size to enable faster binary searches in trie fan-out nodes.
> A more compressed but slower version might build on this by also applying 
> further (suffix, etc.) compression on the trie nodes at the cost of slower 
> write speed.  Even further compression could be obtained by using all VInts 
> instead of FInts with a sacrifice on random seek speed (though not huge).
> One drawback is the current write speed.  While programmed with good 
> constructs like TreeMaps, ByteBuffers, binary searches, etc., it's not 
> programmed with the same level of optimization as the read path.  Work will 
> need to be done to optimize the data structures used for encoding and could 
> probably show a 10x increase.  It will still be slower than delta encoding, 
> but with a much higher decode speed.  I have not yet created a thorough 
> benchmark for write speed nor sequential read speed.
> Though the trie is reaching a point where it is internally very efficient 
> (probably within half or a quarter of its max read speed), the way that HBase 
> currently uses it is far from optimal.  The KeyValueScanner and related 
> classes that iterate through the trie will eventually need to be smarter and 
> have methods to do things like skipping to the next row of results without 
> scanning every cell in between
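The fixed-width ints referenced throughout the issue description (the qualifier-offset and timestamp-delta widths) are sized by the largest value the block must hold. A minimal sketch of that sizing rule, with illustrative names not taken from the patch:

```java
public class FixedWidthSketch {
    // Minimum number of bytes needed to hold value as an unsigned integer;
    // e.g. the block's maximum timestamp delta determines the per-cell
    // timestamp field width.
    static int bytesNeeded(long value) {
        int width = 1;
        while (value > 0xFFL) {   // still needs more than one byte
            value >>>= 8;
            width++;
        }
        return width;
    }

    public static void main(String[] args) {
        System.out.println(bytesNeeded(200L));     // fits in 1 byte
        System.out.println(bytesNeeded(70_000L));  // 0x11170 needs 3 bytes
    }
}
```

Because the width is fixed per block rather than per cell, every cell entry has the same length, which is what makes the binary search over a row's cell list possible.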

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-11-08 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13493779#comment-13493779
 ] 

Hadoop QA commented on HBASE-4676:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12552778/HBASE-4676-prefix-tree-trunk-v4.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 142 
new or modified tests.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 
103 warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 findbugs{color}.  The patch appears to introduce 58 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   
org.apache.hadoop.hbase.io.hfile.TestHFileDataBlockEncoder

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3285//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3285//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3285//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-examples.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3285//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3285//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3285//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3285//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3285//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3285//console

This message is automatically generated.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-10-29 Thread Matt Corgan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485872#comment-13485872
 ] 

Matt Corgan commented on HBASE-4676:


for the record, this is now on branch prefix-tree2 on github:
{code}
git clone -b prefix-tree2 https://hotp...@github.com/hotpads/hbase.git 
hbase-prefix-tree
{code}


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-10-28 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13485834#comment-13485834
 ] 

Hadoop QA commented on HBASE-4676:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12551154/HBASE-4676-prefix-tree-trunk-v3.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 143 
new or modified tests.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 
102 warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 findbugs{color}.  The patch appears to introduce 49 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

{color:green}+1 core tests{color}.  The patch passed unit tests in .

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3170//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3170//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3170//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3170//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3170//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3170//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3170//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3170//console

This message is automatically generated.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-10-24 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483864#comment-13483864
 ] 

Ted Yu commented on HBASE-4676:
---

I tried to run the test suite on the following platform and got the exact same 
test failure:
{code}
Linux s0 2.6.38-11-generic #48-Ubuntu SMP Fri Jul 29 19:02:55 UTC 2011 x86_64 
x86_64 x86_64 GNU/Linux
{code}


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-10-24 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483831#comment-13483831
 ] 

Ted Yu commented on HBASE-4676:
---

I ran TestHFileDataBlockEncoder on MacBook 29 times, on Linux a few times.
The test didn't fail.
Looking at the stack trace provided by Hadoop QA, it looks like the NPE was 
caused by this code in TimestampEncoder.java:
{code}
// should always find an exact match
return Arrays.binarySearch(sortedUniqueTimestamps, timestamp);
{code}
Meaning sortedUniqueTimestamps was null.
Can you guard against this case?

Thanks
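A hedged sketch of the suggested guard: the field and the binary-search call mirror the snippet above, but the surrounding class, constructor, and method names are illustrative, not the actual TimestampEncoder API.

```java
import java.util.Arrays;

class TimestampIndexSketch {
    private final long[] sortedUniqueTimestamps;  // null if nothing was encoded

    TimestampIndexSketch(long[] sortedUniqueTimestamps) {
        this.sortedUniqueTimestamps = sortedUniqueTimestamps;
    }

    int indexOf(long timestamp) {
        // Guard: fail with a clear message instead of letting
        // Arrays.binarySearch throw an opaque NPE on a null array.
        if (sortedUniqueTimestamps == null) {
            throw new IllegalStateException(
                "timestamps were never encoded for this block");
        }
        // should always find an exact match
        return Arrays.binarySearch(sortedUniqueTimestamps, timestamp);
    }
}
```

Failing fast with an `IllegalStateException` makes a "never populated" bug show up at the call site with an explanation, rather than as the bare NullPointerException seen in the QA stack trace.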


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-10-24 Thread Matt Corgan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483806#comment-13483806
 ] 

Matt Corgan commented on HBASE-4676:


Does anyone know why TestHFileDataBlockEncoder would fail on jenkins when it 
passes 100% of the time locally for me?  I believe the part that's failing has 
something to do with timestamps.  Does jenkins do something different with the 
clock?


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-10-24 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13483041#comment-13483041
 ] 

Hadoop QA commented on HBASE-4676:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12550594/HBASE-4676-prefix-tree-trunk-v2.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 143 
new or modified tests.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 
99 warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 findbugs{color}.  The patch appears to introduce 49 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   
org.apache.hadoop.hbase.io.hfile.TestHFileDataBlockEncoder

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3134//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3134//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3134//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3134//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3134//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3134//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3134//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3134//console

This message is automatically generated.

> Prefix Compression - Trie data block encoding
> -
>
> Key: HBASE-4676
> URL: https://issues.apache.org/jira/browse/HBASE-4676
> Project: HBase
>  Issue Type: New Feature
>  Components: io, Performance, regionserver
>Affects Versions: 0.96.0
>Reporter: Matt Corgan
>Assignee: Matt Corgan
> Attachments: HBASE-4676-0.94-v1.patch, 
> HBASE-4676-prefix-tree-trunk-v1.patch, HBASE-4676-prefix-tree-trunk-v2.patch, 
> hbase-prefix-trie-0.1.jar, PrefixTrie_Format_v1.pdf, 
> PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png
>
>
> The HBase data block format has room for 2 significant improvements for 
> applications that have high block cache hit ratios.  
> First, there is no prefix compression, and the current KeyValue format is 
> somewhat metadata heavy, so there can be tremendous memory bloat for many 
> common data layouts, specifically those with long keys and short values.
> Second, there is no random access to KeyValues inside data blocks.  This 
> means that every time you double the datablock size, average seek time (or 
> average cpu consumption) goes up by a factor of 2.  The standard 64KB block 
> size is ~10x slower for random seeks than a 4KB block size, but block sizes 
> as small as 4KB cause problems elsewhere.  Using block sizes of 256KB or 1MB 
> or more may be more efficient from a disk access and block-cache perspective 
> in many big-data applications, but doing so is infeasible from a random seek 
> perspective.
> The PrefixTrie block encoding format attempts to solve both of these 
> problems.  Some features:
> * trie format for row key encoding completely eliminates duplicate row keys 
> and encodes similar row keys into a standard trie structure which also saves 
> a lot of space
> * the column family is currently stored once at the beginning of each block.  
> this could easily be modified to allow multiple family names per block
> * all qualifiers in the block are stored in their own trie format which 
> caters nicely to wide rows.  duplicate qualifiers between rows are eliminated. 
>  the size of this trie determines the width of the block's qualifier 
> fixed-width-int
> * the minimum timestamp is stored at the beginning of the block, and deltas 
> are calculated from that.  the maximum delta determines the width of the 
> block's timestamp fixed-width-int
> The block is structured with metadat
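
The row-trie and fixed-width-int ideas in the description can be sketched in a few lines. The class below is a toy illustration only — the names (PrefixBlockSketch, prefixEncode, timestampDeltaWidth) are invented here and this is not the actual HBASE-4676 on-disk format. It shows shared-prefix elimination over sorted row keys, and how the block's minimum timestamp plus the largest delta determine the width of the timestamp fixed-width-int.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/**
 * Toy sketch of two ideas from the issue description: storing sorted row
 * keys as (shared-prefix-length, suffix) pairs, and encoding timestamps as
 * fixed-width deltas from the block's minimum timestamp.  All names are
 * hypothetical; this is NOT the real prefix-tree encoder.
 */
public class PrefixBlockSketch {

    /** Encode each key as the length of the prefix it shares with the
     *  previous key, plus the remaining suffix.  Shared prefix bytes are
     *  stored once instead of once per row. */
    static List<Map.Entry<Integer, String>> prefixEncode(List<String> sortedKeys) {
        List<Map.Entry<Integer, String>> out = new ArrayList<>();
        String prev = "";
        for (String key : sortedKeys) {
            int shared = 0;
            int max = Math.min(prev.length(), key.length());
            while (shared < max && prev.charAt(shared) == key.charAt(shared)) {
                shared++;
            }
            out.add(Map.entry(shared, key.substring(shared)));
            prev = key;
        }
        return out;
    }

    /** Width in bytes of the fixed-width int needed to hold the largest
     *  timestamp delta in the block (the "timestamp fixed-width-int"). */
    static int timestampDeltaWidth(long[] timestamps) {
        long min = Arrays.stream(timestamps).min().orElse(0L);
        long maxDelta = Arrays.stream(timestamps).map(t -> t - min).max().orElse(0L);
        int width = 1;
        while (maxDelta >>> (8 * width) != 0) {
            width++;
        }
        return width;
    }

    public static void main(String[] args) {
        List<String> rows = List.of("user123/posts", "user123/profile", "user124/posts");
        System.out.println(prefixEncode(rows));

        long[] ts = {1_350_000_000_000L, 1_350_000_050_000L, 1_350_003_600_000L};
        System.out.println(timestampDeltaWidth(ts));
    }
}
```

A real trie additionally merges keys into fan-out nodes for binary search; this flat delta form only illustrates where the space savings come from.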

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-10-15 Thread Matt Corgan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13476011#comment-13476011
 ] 

Matt Corgan commented on HBASE-4676:


Hmm - I didn't think I had changed anything since those tests passed.  Will look 
into it.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-10-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13475994#comment-13475994
 ] 

Hadoop QA commented on HBASE-4676:
--

{color:red}-1 overall{color}.  Here are the results of testing the latest 
attachment 
  
http://issues.apache.org/jira/secure/attachment/12549106/HBASE-4676-prefix-tree-trunk-v1.patch
  against trunk revision .

{color:green}+1 @author{color}.  The patch does not contain any @author 
tags.

{color:green}+1 tests included{color}.  The patch appears to include 139 
new or modified tests.

{color:green}+1 hadoop2.0{color}.  The patch compiles against the hadoop 
2.0 profile.

{color:red}-1 javadoc{color}.  The javadoc tool appears to have generated 
99 warning messages.

{color:green}+1 javac{color}.  The applied patch does not increase the 
total number of javac compiler warnings.

{color:red}-1 findbugs{color}.  The patch appears to introduce 53 new 
Findbugs (version 1.3.9) warnings.

{color:green}+1 release audit{color}.  The applied patch does not increase 
the total number of release audit warnings.

 {color:red}-1 core tests{color}.  The patch failed these unit tests:
   org.apache.hadoop.hbase.io.encoding.TestChangingEncoding
  org.apache.hadoop.hbase.regionserver.TestStore
  org.apache.hadoop.hbase.io.encoding.TestDataBlockEncoders

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3052//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3052//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop2-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3052//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop1-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3052//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-prefix-tree.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3052//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-common.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3052//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-hadoop-compat.html
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3052//artifact/trunk/patchprocess/newPatchFindbugsWarningshbase-server.html
Console output: 
https://builds.apache.org/job/PreCommit-HBASE-Build/3052//console

This message is automatically generated.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-10-14 Thread Matt Corgan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13475934#comment-13475934
 ] 

Matt Corgan commented on HBASE-4676:


I have all known bugs worked out.  All tests are passing including those 
internal to the hbase-prefix-tree module, and existing DataBlockEncoding tests 
in hbase-server, such as TestEncodedSeekers.  

Split into 3 patches for reviewboard:
https://reviews.apache.org/r/7589/
https://reviews.apache.org/r/7591/
https://reviews.apache.org/r/7592/

The prefix-tree branch on my github repo contains the latest from trunk as of 
this comment.
http://github.com/hotpads/hbase/tree/prefix-tree
{code}
git clone -b prefix-tree https://hotp...@github.com/hotpads/hbase.git hbase-prefix-tree
{code}


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-09-30 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13466624#comment-13466624
 ] 

Ted Yu commented on HBASE-4676:
---

[~mikhail]: Can you provide your insight on this feature, please?


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-09-17 Thread Ted Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13457131#comment-13457131
 ] 

Ted Yu commented on HBASE-4676:
---

@Matt:
Do you want to attach your patch(es) here for Hadoop QA to run the test suite?

Thanks


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-09-03 Thread Matt Corgan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13447427#comment-13447427
 ] 

Matt Corgan commented on HBASE-4676:


ReviewBoard links:

hbase-common: https://reviews.apache.org/r/6897/
hbase-prefix-tree: https://reviews.apache.org/r/6898/

Excludes tests for now.

> Prefix Compression - Trie data block encoding
> -
>
> Key: HBASE-4676
> URL: https://issues.apache.org/jira/browse/HBASE-4676
> Project: HBase
>  Issue Type: New Feature
>  Components: io, performance, regionserver
>Affects Versions: 0.90.6
>Reporter: Matt Corgan
>Assignee: Matt Corgan
> Attachments: HBASE-4676-0.94-v1.patch, hbase-prefix-trie-0.1.jar, 
> PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, SeeksPerSec by 
> blockSize.png
>
>
> The HBase data block format has room for 2 significant improvements for 
> applications that have high block cache hit ratios.  
> First, there is no prefix compression, and the current KeyValue format is 
> somewhat metadata heavy, so there can be tremendous memory bloat for many 
> common data layouts, specifically those with long keys and short values.
> Second, there is no random access to KeyValues inside data blocks.  This 
> means that every time you double the datablock size, average seek time (or 
> average cpu consumption) goes up by a factor of 2.  The standard 64KB block 
> size is ~10x slower for random seeks than a 4KB block size, but block sizes 
> as small as 4KB cause problems elsewhere.  Using block sizes of 256KB or 1MB 
> or more may be more efficient from a disk access and block-cache perspective 
> in many big-data applications, but doing so is infeasible from a random seek 
> perspective.
> The PrefixTrie block encoding format attempts to solve both of these 
> problems.  Some features:
> * trie format for row key encoding completely eliminates duplicate row keys 
> and encodes similar row keys into a standard trie structure which also saves 
> a lot of space
> * the column family is currently stored once at the beginning of each block.  
> this could easily be modified to allow multiple family names per block
> * all qualifiers in the block are stored in their own trie format which 
> caters nicely to wide rows.  duplicate qualifers between rows are eliminated. 
>  the size of this trie determines the width of the block's qualifier 
> fixed-width-int
> * the minimum timestamp is stored at the beginning of the block, and deltas 
> are calculated from that.  the maximum delta determines the width of the 
> block's timestamp fixed-width-int
> The block is structured with metadata at the beginning, then a section for 
> the row trie, then the column trie, then the timestamp deltas, and then then 
> all the values.  Most work is done in the row trie, where every leaf node 
> (corresponding to a row) contains a list of offsets/references corresponding 
> to the cells in that row.  Each cell is fixed-width to enable binary 
> searching and is represented by [1 byte operationType, X bytes qualifier 
> offset, X bytes timestamp delta offset].
> If all operation types are the same for a block, there will be zero per-cell 
> overhead.  Same for timestamps.  Same for qualifiers when i get a chance.  
> So, the compression aspect is very strong, but makes a few small sacrifices 
> on VarInt size to enable faster binary searches in trie fan-out nodes.
> A more compressed but slower version might build on this by also applying 
> further (suffix, etc) compression on the trie nodes at the cost of slower 
> write speed.  Even further compression could be obtained by using all VInts 
> instead of FInts with a sacrifice on random seek speed (though not huge).
> One current drawback is the current write speed.  While programmed with good 
> constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not 
> programmed with the same level of optimization as the read path.  Work will 
> need to be done to optimize the data structures used for encoding and could 
> probably show a 10x increase.  It will still be slower than delta encoding, 
> but with a much higher decode speed.  I have not yet created a thorough 
> benchmark for write speed nor sequential read speed.
> Though the trie is reaching a point where it is internally very efficient 
> (probably within half or a quarter of its max read speed), the way that hbase 
> currently uses it is far from optimal.  The KeyValueScanner and related 
> classes that iterate through the trie will eventually need to be smarter and 
> have methods to do things like skipping to the next row of results without 
> scanning every cell in between.  When that is accomplished it will also allow 
> much faster compactions because the full row key will not have to be compared 
> as often as it is now.
> Current code is on github.  The trie code is in a separate project than the 
> slightly modified hbase.
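
The two fixed-width-int ideas quoted above (timestamp-delta width driven by the
largest delta in the block, qualifier-offset width driven by the qualifier-trie
size, and fixed-width cell records that permit binary search) can be sketched in
a few lines. This is an illustrative Python sketch, not the actual HBase
encoder; all names and sample numbers are made up for the example.

```python
def width_in_bytes(max_value):
    """Smallest number of whole bytes that can hold max_value as an unsigned int."""
    width = 1
    while max_value >= (1 << (8 * width)):
        width += 1
    return width

# Timestamps: the block stores its minimum once; each cell keeps only a
# delta, and the largest delta fixes the block's timestamp int width.
timestamps = [1338500000123, 1338500000456, 1338500001789]
min_ts = min(timestamps)
deltas = [t - min_ts for t in timestamps]
ts_width = width_in_bytes(max(deltas))      # max delta is 1666 -> 2 bytes

# Qualifiers: the size of the qualifier trie fixes the offset width the
# same way (a trie under 256 bytes would need only 1-byte offsets, etc.).
qualifier_trie_size = 700                   # hypothetical trie size in bytes
qual_width = width_in_bytes(qualifier_trie_size - 1)   # -> 2 bytes

# Each cell is then a fixed-width record
# [1 byte opType | qual_width bytes offset | ts_width bytes delta],
# so cell i starts at i * cell_width and binary search needs no parsing.
cell_width = 1 + qual_width + ts_width      # 5 bytes per cell here
```

This also shows why per-cell overhead can drop to zero: if every cell in the
block shares the same operation type (or timestamp, or qualifier), that field's
width collapses and nothing needs to be stored per cell.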

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-08-21 Thread Matt Corgan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439090#comment-13439090
 ] 

Matt Corgan commented on HBASE-4676:


Barely been able to touch it this whole summer, but I spent a little time in 
July rebasing to trunk and reorganizing for readability.  It's on github on a 
branch called "prefix-tree" (I renamed it from trie since people wince when 
pronouncing it).  https://github.com/hotpads/hbase/tree/prefix-tree

What's up there is pretty good and workable, but I still have some things to do 
and some questions to ask about things like comparators, sequenceIds, ROOT and 
META handling.  It also needs more documentation.  I could use some high-level 
feedback if you can get anything out of it without a lot of docs/comments.  
Maybe look at the things in the hbase-common module to start, since that will 
affect overall hbase?  The prefix-tree module is more implementation details 
that only matter when you enable this specific encoding.

btw - the current version has an "HCell" interface which I will rename to 
Cell(?) after seeing the recent discussion about HStore naming.

Happy to post on reviewboard as well, but might have to wait till the weekend.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-08-21 Thread stack (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13439037#comment-13439037
 ] 

stack commented on HBASE-4676:
--

You need review on this, Matt?


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-06-01 Thread Matt Corgan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287829#comment-13287829
 ] 

Matt Corgan commented on HBASE-4676:


A little more detail... it would help where you have long qualifiers but only a 
few of them, like if you migrated a narrow relational db table over.  If you 
have wide rows with long qualifiers, then you would want to take advantage of 
the qualifier trie.  Not sure which case you have, but migrating relational 
style tables over will be pretty common, so I wanted to handle that common case 
well.

If you get a chance to do a random read benchmark on it I'd love to hear the 
results.  I've only done a few small benchmarks at the Store level and haven't 
benchmarked a whole cluster.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-06-01 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287766#comment-13287766
 ] 

James Taylor commented on HBASE-4676:
-

Our qualifiers tend to be long, so trie-encoding them helps quite a bit. We 
could optimize this ourselves by managing our own mapping, but it's great that 
the trie-encoding does it for us. We haven't seen this impact our scan times.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-06-01 Thread Matt Corgan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287637#comment-13287637
 ] 

Matt Corgan commented on HBASE-4676:


Great to hear James.  I'm working on migrating it to trunk, but wrestling with 
git and maven, my two worst friends.

The last change I've been thinking about making to it before finalizing this 
version is to add an option controlling whether to trie-encode the qualifiers 
vs merely de-duping them.  There's some read-time expense to decoding the 
trie-encoded qualifiers, and if you have a narrow table then you may not be 
saving much memory anyway.  So it would be an option to trade a little memory 
for faster scans/decoding.
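
The trade-off described above (merely de-duping qualifiers vs fully
trie-encoding them) can be illustrated with rough byte counts. This is a
hedged Python sketch only; the real block format also stores trie node
metadata and offsets, which this ignores, and the sample qualifiers are
made up.

```python
# Two ways to shrink a block's qualifiers:
# (1) de-dup: store each distinct qualifier once, verbatim;
# (2) trie-encode: additionally share common prefixes between them.

def dedup_bytes(qualifiers):
    """Total bytes needed if each distinct qualifier is stored once."""
    return sum(len(q) for q in set(qualifiers))

def shared_prefix_savings(qualifiers):
    """Bytes saved by sharing each distinct qualifier's common prefix with
    its lexicographic predecessor (a lower bound on what a trie shares)."""
    distinct = sorted(set(qualifiers))
    saved = 0
    for prev, cur in zip(distinct, distinct[1:]):
        n = 0
        while n < min(len(prev), len(cur)) and prev[n] == cur[n]:
            n += 1
        saved += n
    return saved

# A wide row repeating three long, similar qualifiers many times.
wide = [b"event_timestamp", b"event_type", b"event_source"] * 100
print(dedup_bytes(wide))             # 37 bytes after de-dup alone
print(shared_prefix_savings(wide))   # 13 more bytes shared by the trie
```

For a narrow table with a handful of dissimilar qualifiers, the extra trie
savings shrink toward zero while the decode cost remains, which is exactly
why an option to de-dup without trie-encoding could be worthwhile.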


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-06-01 Thread James Taylor (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13287587#comment-13287587
 ] 

James Taylor commented on HBASE-4676:
-

This is fantastic, Matt. We're testing your patch on 0.94 in our dev cluster 
here at Salesforce and are seeing 5-15x compression with no degradation on scan 
performance.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-05-30 Thread Matt Corgan (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285830#comment-13285830
 ] 

Matt Corgan commented on HBASE-4676:


Jacques - I like the idea, and I think the long-term goal is to auto-tune some 
of this stuff.  For example, if flushes or compactions are backlogged, that 
may be a good signal for certain tables to temporarily switch from more compact 
Trie/GZip encoding/compression to faster-encoding Prefix/Snappy.  There are 
other factors at play though on a table-by-table basis.  Some tables might be 
random-read heavy, in which case you would want to keep the Trie.
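
The auto-tuning idea above could, in principle, be a tiny per-table policy
like the sketch below. Everything here is hypothetical: the encoding names
and the backlog threshold are illustrative placeholders, not real HBase
settings.

```python
def choose_encoding(compaction_backlog, random_read_heavy):
    """Pick a block encoding/compression pair from simple table signals.

    compaction_backlog: number of queued flushes/compactions (hypothetical).
    random_read_heavy: whether the table's workload is seek-dominated.
    """
    if random_read_heavy:
        # Seek-heavy tables keep the trie even when writes are backlogged.
        return "TRIE_GZIP"
    if compaction_backlog > 10:
        # Writes falling behind: trade compression for encode speed.
        return "PREFIX_SNAPPY"
    return "TRIE_GZIP"

print(choose_encoding(20, False))   # backlogged, scan-heavy -> PREFIX_SNAPPY
```

The point of the comment stands, though: a single signal is not enough, since
the right choice depends on each table's read profile as well as write pressure.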

> Prefix Compression - Trie data block encoding
> -
>
> Key: HBASE-4676
> URL: https://issues.apache.org/jira/browse/HBASE-4676
> Project: HBase
>  Issue Type: New Feature
>  Components: io, performance, regionserver
>Affects Versions: 0.90.6
>Reporter: Matt Corgan
>Assignee: Matt Corgan
> Attachments: HBASE-4676-0.94-v1.patch, PrefixTrie_Format_v1.pdf, 
> PrefixTrie_Performance_v1.pdf, SeeksPerSec by blockSize.png, 
> hbase-prefix-trie-0.1.jar
>
>
> The HBase data block format has room for 2 significant improvements for 
> applications that have high block cache hit ratios.  
> First, there is no prefix compression, and the current KeyValue format is 
> somewhat metadata heavy, so there can be tremendous memory bloat for many 
> common data layouts, specifically those with long keys and short values.
> Second, there is no random access to KeyValues inside data blocks.  This 
> means that every time you double the datablock size, average seek time (or 
> average cpu consumption) goes up by a factor of 2.  The standard 64KB block 
> size is ~10x slower for random seeks than a 4KB block size, but block sizes 
> as small as 4KB cause problems elsewhere.  Using block sizes of 256KB or 1MB 
> or more may be more efficient from a disk access and block-cache perspective 
> in many big-data applications, but doing so is infeasible from a random seek 
> perspective.
> The PrefixTrie block encoding format attempts to solve both of these 
> problems.  Some features:
> * trie format for row key encoding completely eliminates duplicate row keys 
> and encodes similar row keys into a standard trie structure which also saves 
> a lot of space
> * the column family is currently stored once at the beginning of each block.  
> this could easily be modified to allow multiple family names per block
> * all qualifiers in the block are stored in their own trie format which 
> caters nicely to wide rows.  duplicate qualifiers between rows are eliminated. 
>  the size of this trie determines the width of the block's qualifier 
> fixed-width-int
> * the minimum timestamp is stored at the beginning of the block, and deltas 
> are calculated from that.  the maximum delta determines the width of the 
> block's timestamp fixed-width-int
> The block is structured with metadata at the beginning, then a section for 
> the row trie, then the column trie, then the timestamp deltas, and then 
> all the values.  Most work is done in the row trie, where every leaf node 
> (corresponding to a row) contains a list of offsets/references corresponding 
> to the cells in that row.  Each cell is fixed-width to enable binary 
> searching and is represented by [1 byte operationType, X bytes qualifier 
> offset, X bytes timestamp delta offset].
> If all operation types are the same for a block, there will be zero per-cell 
> overhead.  Same for timestamps.  Same for qualifiers when i get a chance.  
> So, the compression aspect is very strong, but makes a few small sacrifices 
> on VarInt size to enable faster binary searches in trie fan-out nodes.
> A more compressed but slower version might build on this by also applying 
> further (suffix, etc) compression on the trie nodes at the cost of slower 
> write speed.  Even further compression could be obtained by using all VInts 
> instead of FInts with a sacrifice on random seek speed (though not huge).
> One current drawback is the current write speed.  While programmed with good 
> constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not 
> programmed with the same level of optimization as the read path.  Work will 
> need to be done to optimize the data structures used for encoding and could 
> probably show a 10x increase.  It will still be slower than delta encoding, 
> but with a much higher decode speed.  I have not yet created a thorough 
> benchmark for write speed nor sequential read speed.
> Though the trie is reaching a point where it is internally very efficient 
> (probably within half or a quarter of its max read speed) the way that hbase 
> currently uses it is far from optimal.  The KeyValueScanner and related 
> classe
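The fixed-width sizing described in the issue (store the block's minimum timestamp once, then per-cell deltas whose byte width is set by the largest delta) can be sketched as below. The class and method names are illustrative only, not from the actual patch:

```java
import java.util.Arrays;

// Sketch of the timestamp encoding described above: the block stores its
// minimum timestamp once, and each cell stores a fixed-width delta from it.
// The byte width is chosen to fit the maximum delta in the block.
public class TimestampDeltaSketch {

    // Bytes needed to represent the largest timestamp delta in the block.
    static int deltaWidth(long[] timestamps) {
        long min = Arrays.stream(timestamps).min().orElse(0);
        long maxDelta = Arrays.stream(timestamps).map(t -> t - min).max().orElse(0);
        int width = 1;
        while (width < 8 && (maxDelta >>> (8 * width)) != 0) {
            width++;
        }
        return width;
    }

    public static void main(String[] args) {
        long[] ts = {1_000_000L, 1_000_200L, 1_070_000L};
        // max delta = 70,000, which exceeds 2^16 and so needs 3 bytes
        System.out.println(deltaWidth(ts)); // prints 3
    }
}
```

If every timestamp in the block is identical, the maximum delta is zero and the per-cell timestamp overhead collapses, which is the "zero per-cell overhead" case the description mentions.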

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-05-30 Thread Jacques (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13285793#comment-13285793
 ] 

Jacques commented on HBASE-4676:


Random thought on this.  I was talking to Nicolas Spiegelberg and he was 
talking about how they are exploring changing encoding schemes when data 
becomes more permanent.  Don't quote me on this but it sounded like they were 
considering using gzip compression on major compactions.  I was wondering if 
something similar made sense here.  Basically, use less compression the shorter 
the likely lifetime of files.  Use the Trie compressions on only major 
compactions.   This way the performance difference could be less of a serious 
issue since you're paying for it a fewer number of times.  I glanced around and 
didn't see any Jiras about a two-tiered approach of data storage formats but it 
seems like that would be a prerequisite for a hybrid/tiered approach.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-04-11 Thread Matt Corgan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13251836#comment-13251836
 ] 

Matt Corgan commented on HBASE-4676:


The git URL in the above comment is wrong.  Reposting the instructions with the 
public https endpoint (looks like I can't edit JIRA comments).

To use with Eclipse:

* check out hbase-0.94 from apache subversion to "myworkspace"
* apply the HBASE-4676-0.94-v1.patch from HBASE-4676
* cd myworkspace
* git clone -b 0.1 https://hotp...@github.com/hotpads/hbase-prefix-trie.git 
hbase-prefix-trie-0.1
* in eclipse -> new java project
* type "hbase-prefix-trie-0.1" and eclipse should find it. continue
* on the "Projects" tab, add hbase-0.94. finish
* ctrl-shift-R to open SeekBenchmarkMain.java
* follow instructions in SeekBenchmarkMain

To build a jar after modifying code:

* edit package-hbase-prefix-trie.xml in the project root to point to your build 
directory
* right click it -> Run As -> Ant Build



[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-04-10 Thread Matt Corgan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13251375#comment-13251375
 ] 

Matt Corgan commented on HBASE-4676:


Sorry, the packaging is unusual for HBase but I think it's worth a shot at 
turning this into a module rather than dumping 200+ new .java files into 
hbase-core.  As it stands, the dependencies are very limited.  The 
hbase-prefix-trie project needs to see KeyValue.java and DataBlockEncoder.java 
(maybe a couple others, but not many).  Then, hbase-core instantiates a 
PrefixTrieDataBlockEncoder through reflection at the bottom of 
DataBlockEncoding.java, so hbase-core has no compile-time dependency on 
hbase-prefix-trie.

If HBASE-4336 makes headway, then we can extract some of the fundamental hbase 
classes (KeyValue, DataBlockEncoder) into hbase-common, and I can further limit 
the prefix-trie to only have a dependency on these fundamental interfaces in 
the lightweight hbase-common module.  HBase-common will tie together the 
modules.  Limiting the dependencies between modules is the key to keeping 
development agile, otherwise small changes will trigger massive refactorings 
that even the most seasoned devs won't be able to pull off safely.

I added instructions to the git project homepage for checking out the prefix 
trie and modifying it or using it for profiling.  Just a few extra steps beyond 
applying a patch: https://github.com/hotpads/hbase-prefix-trie.  Pasting them 
here as well:

To use with Eclipse:

* check out hbase-0.94 from apache subversion to "myworkspace"
* apply the HBASE-4676-0.94-v1.patch from HBASE-4676

* cd myworkspace
* git clone -b 0.1 g...@github.com:hotpads/hbase-prefix-trie.git 
hbase-prefix-trie-0.1
* in eclipse -> new java project
* type "hbase-prefix-trie-0.1" and eclipse should find it.  continue
* on the "Projects" tab, add hbase-0.94.  finish
* ctrl-shift-R to open SeekBenchmarkMain.java
* follow instructions in SeekBenchmarkMain

To build a jar after modifying code:
* edit package-hbase-prefix-trie.xml in the project root to point to your build 
directory
* right click it -> Run As -> Ant Build


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-04-10 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13251272#comment-13251272
 ] 

Zhihong Yu commented on HBASE-4676:
---

Thanks for the update, Matt.

PrefixTrieDataBlockEncoder is currently in the jar.
If the PrefixTrie-related Java classes were included in the patch, that would 
make understanding your implementation easier.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-03-26 Thread Matt Corgan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238948#comment-13238948
 ] 

Matt Corgan commented on HBASE-4676:


{quote}Do we not have this now Matt?{quote}
I don't believe so.  Right now, HFileReaderV2.blockSeek(..) does a brute-force 
loop through each KV in the block and does a full KV comparison at each one.  
Say there are 1000 KVs in the block and 20/row: on average you will have to do 
500 full KV comparisons just to get to your key, and even if you are selecting 
all 20 KVs in the row, you still do 500 comparisons on average to get to the 
start of the row.  It's also very important to remember that these are long 
keys, and they are most likely to share a common prefix since they got sorted 
to the same block, so your comparators are churning the same prefix bytes over 
and over.
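The arithmetic above is just the expected cost of a linear scan compared with what a binary-searchable layout (such as the trie's fixed-width cell entries) would need. A quick illustration, using the same hypothetical 1000-KV block from the comment:

```java
// Expected comparison counts for seeking within a block of n KeyValues:
// linear scan touches ~n/2 entries on average; binary search over
// fixed-width entries needs ~ceil(log2 n) comparisons.
public class SeekCostSketch {

    static double linearAvg(int n) {
        return n / 2.0;
    }

    static double binarySearch(int n) {
        return Math.ceil(Math.log(n) / Math.log(2));
    }

    public static void main(String[] args) {
        int kvsPerBlock = 1000;
        System.out.println(linearAvg(kvsPerBlock) + " vs " + binarySearch(kvsPerBlock));
        // prints 500.0 vs 10.0
    }
}
```

That 500-vs-10 gap, multiplied by full-key comparisons that keep re-reading a shared prefix, is where the CPU savings come from.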

{quote}The slow write and scan speeds would tend to rule it out for anything 
but a random read workload but when random reading only{quote}
Yeah, it's not really targeted at long scans on cold data, but there are some 
cases in between long cold scans and hot point queries.  Some factors: if it's 
warm data, your block cache is now 6x bigger.  Then there are many scans that 
are really sub-block prefix scans to get all the keys in a row (the 20 of 1000 
cells I mentioned above).  If your scan will have a low hit ratio, then the 
ability to jump between rows inside a block will lessen CPU usage.

It performs best when doing lots of random Gets on hot data (in block cache).  
Possibly a persistent memcached alternative, especially with SSDs.  I believe 
the current system is actually limited by CPU when doing high throughput Gets 
on data with long keys and short values.  The trie's individual request latency 
may not be much different, but a single server could serve many more parallel 
requests before maxing out cpu.

The bigger the value/key ratio of your data, the smaller the difference between 
the trie and anything else.  It seems like many people now have big values, so 
I doubt they would see a difference.  I'm more trying to enable smarter schemas 
with compound primary keys.

{quote}Regards your working set read testing, did it all fit in memory?{quote}
Yes, I'm trying to do the purest comparison of CPU usage possible, leaving it 
up to others to extrapolate the results to what happens with the effectively 
bigger block cache, more rows/block fetched from disk for an equivalent-size 
block, etc.  I'm currently just standing up a StoreFile and using it directly; 
see 
https://github.com/hotpads/hbase-prefix-trie/blob/hcell-scanners/test/org/apache/hadoop/hbase/cell/pt/test/performance/seek/SeekBenchmarkMain.java.
  I'll try to factor in network and disk latencies in later tests (did some 
preliminary tests Friday, but I am on vacation this week).

{quote}How did you measure cycles per seek?{quote}
A simple assumption of 2 billion cycles/second; I was just trying to emphasize 
how many cycles a seek currently takes.
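The back-of-the-envelope conversion is just clock rate divided by throughput; the 500,000 seeks/sec figure below is a made-up input for illustration, only the ~2 GHz assumption comes from the comment:

```java
// Rough cycles-per-seek estimate: assumed CPU clock rate divided by
// measured seeks per second, as described in the comment above.
public class CyclesPerSeek {

    static long cyclesPerSeek(long seeksPerSecond) {
        long cpuHz = 2_000_000_000L; // assumed ~2 GHz, per the comment
        return cpuHz / seeksPerSecond;
    }

    public static void main(String[] args) {
        // e.g. a hypothetical 500k seeks/sec works out to 4000 cycles per seek
        System.out.println(cyclesPerSeek(500_000)); // prints 4000
    }
}
```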

{quote}What is numBlocks? The total number of blocks the dataset fit in?{quote}
The number of data blocks in the HFile I fed in.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2012-03-26 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13238645#comment-13238645
 ] 

stack commented on HBASE-4676:
--

bq. The KeyValueScanner and related classes that iterate through the trie will 
eventually need to be smarter and have methods to do things like skipping to 
the next row of results without scanning every cell in between.

Do we not have this now Matt?

As said before, 'So 40-80x faster seeks at medium block size.', is pretty nice 
improvement.  Ditto on 'Seek speed is ~15x NONE seek speed while expanding the 
effective block cache size by 4x.'

What are you thinking these times Matt?  The slow write and scan speeds would 
tend to rule it out for anything but a random read workload but when random 
reading only, you'll want to use prefixtrie (smile).  Do you think it possible 
to approach prefix compression scan and write speeds?

Regards your working set read testing, did it all fit in memory?

How did you measure cycles per seek?

What is numBlocks?  The total number of blocks the dataset fit in?

Throughput is better for sure... what is latency like?

Excellent stuff Matt.
> Prefix Compression - Trie data block encoding
> -
>
> Key: HBASE-4676
> URL: https://issues.apache.org/jira/browse/HBASE-4676
> Project: HBase
>  Issue Type: New Feature
>  Components: io, performance, regionserver
>Affects Versions: 0.90.6
>Reporter: Matt Corgan
> Attachments: PrefixTrie_Format_v1.pdf, PrefixTrie_Performance_v1.pdf, 
> SeeksPerSec by blockSize.png
>
>
> The HBase data block format has room for 2 significant improvements for 
> applications that have high block cache hit ratios.  
> First, there is no prefix compression, and the current KeyValue format is 
> somewhat metadata heavy, so there can be tremendous memory bloat for many 
> common data layouts, specifically those with long keys and short values.
> Second, there is no random access to KeyValues inside data blocks.  This 
> means that every time you double the datablock size, average seek time (or 
> average cpu consumption) goes up by a factor of 2.  The standard 64KB block 
> size is ~10x slower for random seeks than a 4KB block size, but block sizes 
> as small as 4KB cause problems elsewhere.  Using block sizes of 256KB or 1MB 
> or more may be more efficient from a disk access and block-cache perspective 
> in many big-data applications, but doing so is infeasible from a random seek 
> perspective.
> The PrefixTrie block encoding format attempts to solve both of these 
> problems.  Some features:
> * trie format for row key encoding completely eliminates duplicate row keys 
> and encodes similar row keys into a standard trie structure which also saves 
> a lot of space
> * the column family is currently stored once at the beginning of each block.  
> this could easily be modified to allow multiple family names per block
> * all qualifiers in the block are stored in their own trie format which 
> caters nicely to wide rows.  duplicate qualifers between rows are eliminated. 
>  the size of this trie determines the width of the block's qualifier 
> fixed-width-int
> * the minimum timestamp is stored at the beginning of the block, and deltas 
> are calculated from that.  the maximum delta determines the width of the 
> block's timestamp fixed-width-int
> The block is structured with metadata at the beginning, then a section for 
> the row trie, then the column trie, then the timestamp deltas, and then then 
> all the values.  Most work is done in the row trie, where every leaf node 
> (corresponding to a row) contains a list of offsets/references corresponding 
> to the cells in that row.  Each cell is fixed-width to enable binary 
> searching and is represented by [1 byte operationType, X bytes qualifier 
> offset, X bytes timestamp delta offset].
> If all operation types are the same for a block, there will be zero per-cell 
> overhead.  Same for timestamps.  Same for qualifiers when i get a chance.  
> So, the compression aspect is very strong, but makes a few small sacrifices 
> on VarInt size to enable faster binary searches in trie fan-out nodes.
> A more compressed but slower version might build on this by also applying 
> further (suffix, etc) compression on the trie nodes at the cost of slower 
> write speed.  Even further compression could be obtained by using all VInts 
> instead of FInts with a sacrifice on random seek speed (though not huge).
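The VInt-versus-fixed-width trade-off above can be sketched as follows (illustrative only: this is the common 7-bits-per-byte varint, not claimed to match Hadoop's VInt wire format): a varint shrinks small values but must be decoded sequentially, while a fixed-width int permits direct offset arithmetic.

```java
// Illustrative 7-bits-per-byte varint, to show the size/seek trade-off
// discussed above (not necessarily Hadoop's exact VInt encoding).
public class VarIntSketch {

    /** Encoded size in bytes: one byte per 7 bits of magnitude. */
    static int varIntSize(int value) {
        int size = 1;
        while ((value & ~0x7F) != 0) {
            size++;
            value >>>= 7;
        }
        return size;
    }

    static int writeVarInt(byte[] out, int pos, int value) {
        while ((value & ~0x7F) != 0) {
            out[pos++] = (byte) ((value & 0x7F) | 0x80); // continuation bit set
            value >>>= 7;
        }
        out[pos++] = (byte) value;
        return pos; // next free position -- callers must decode sequentially
    }

    public static void main(String[] args) {
        System.out.println(varIntSize(100));     // prints 1 (vs 4 fixed bytes)
        System.out.println(varIntSize(100_000)); // prints 3
    }
}
```

Because each value's width depends on its magnitude, a search over varint-packed entries cannot jump straight to entry i, which is why the fan-out nodes spend a few extra bytes on fixed-width ints.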
> One current drawback is write speed.  While programmed with good 
> constructs like TreeMaps, ByteBuffers, binary searches, etc, it's not 
> programmed with the same level of optimization as the read path.  Work will 
> need to be done to optimize the data structures used for encoding and could 
> probably show a 10x increase.  It will still be slower than delta encoding, 
> but with a much higher decode speed.  I have not yet created a thorough 
> benchmark for write speed nor sequential read speed.
> Though the trie is reaching a point where it is internally very efficient 
> (probably within half or a quarter of its max read speed), the way that HBase 
> currently uses it is far from optimal.  The KeyValueScanner and related 
> classes that iterate through the trie will eventually need to be smarter and 
> have methods to do things like skipping to the next row of results without 
> scanning every cell in between.  When that is accomplished it will also allow 
> much faster compactions because the full row key will not have to be compared

[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2011-11-01 Thread Matt Corgan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13141676#comment-13141676
 ] 

Matt Corgan commented on HBASE-4676:


More thorough benchmark shared here: 
https://docs.google.com/spreadsheet/ccc?key=0Ah5tKh7-sVXYdGkyWm9WenhSRzVfd2U5VC1XNF9tekE

cell format: Key[int,int,string,string]Value[VInt]

avg raw key bytes: 52
avg raw value bytes: 1
avg cells/row: ~10

The TRIE compressor gives ~6x compression when all the timestamps are the same. 
 For this table we would set the raw block size at 256KB or 1MB, which gives 
encoded sizes of 39KB or 178KB.  

The PREFIX compressor is doing ~2x compression on this test, I think because it 
doesn't get very involved with qualifiers or timestamps.


Encoding MB/s are:
NONE: 1000
PREFIX: 275
TRIE: 15

That is purely a CPU effect, since everything is in my Linux page cache.  
TRIE encoding could be improved quite a bit, but will likely stay slower than 
the others.  Although if the scanners/comparators could be improved down the 
road, then it may speed up compaction significantly.

-
Scanning MB/s are:
NONE: 3700
PREFIX: 1800
TRIE: 800

Trie is slower for a scan through all cells because of more complex decoding.  
However, it could be much faster for things like row counting because it can 
quickly jump from one row to the next without iterating cells in between.
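The row-counting point can be illustrated with a toy trie (a sketch under assumed structures; the real PrefixTree nodes are byte-encoded in the block, not heap objects): counting rows walks trie nodes only and never touches the cells hanging off each leaf.

```java
// Toy row trie (heap objects for clarity; the real format is byte-encoded).
public class RowTrieCount {
    static final class Node {
        final java.util.TreeMap<Byte, Node> children = new java.util.TreeMap<>();
        boolean isRow; // marks a complete row key
    }

    static void insert(Node root, byte[] rowKey) {
        Node n = root;
        for (byte b : rowKey) {
            n = n.children.computeIfAbsent(b, k -> new Node());
        }
        n.isRow = true;
    }

    /** Rows are counted from trie structure alone -- no per-cell iteration. */
    static int countRows(Node n) {
        int count = n.isRow ? 1 : 0;
        for (Node child : n.children.values()) {
            count += countRows(child);
        }
        return count;
    }

    public static void main(String[] args) {
        Node root = new Node();
        insert(root, "user1".getBytes());
        insert(root, "user2".getBytes());
        insert(root, "user2".getBytes()); // duplicate row key stored once
        System.out.println(countRows(root)); // prints 2
    }
}
```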


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2011-10-31 Thread Matt Corgan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140704#comment-13140704
 ] 

Matt Corgan commented on HBASE-4676:


Write speed is slower because of CPU usage.  There's still tons of room for 
optimization, but slower write speed will generally be the trade-off for higher 
random read speed.  The current encoding algorithms on GitHub throw off a lot 
of garbage, which tends to indicate that they're not using the CPU cache very 
well.

Some data sets would be able to disable gzip/lzo compression and just go with 
prefix compression, so you save some cpu there.

I'm working on a benchmark to compare a few more factors, including compression 
ratio and sequential read speed.


[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2011-10-29 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13139472#comment-13139472
 ] 

stack commented on HBASE-4676:
--

bq. and the TRIE encoding can do ~1.6mm. So 40-80x faster seeks at medium 
block size.

That's a pretty crazy difference, Matt.

The write speed is slower because it's burning CPU doing the 'compression', 
right?

So TRIE takes up 1/4 size.  Did you say above how much less a PREFIX compressed 
block is than a NONE compressed block?




[jira] [Commented] (HBASE-4676) Prefix Compression - Trie data block encoding

2011-10-25 Thread Matt Corgan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13135600#comment-13135600
 ] 

Matt Corgan commented on HBASE-4676:


The attached chart "SeeksPerSec by blockSize.png" is for a table with a 4-byte key 
and small value.  It is a pretty extreme case that shows the strength of the 
trie in reading and its weakness in writing.  

Here is a comparison from a table we use internally for counters:

{code}
table name: Count5s
raw KeyValue blockSize: 256KB
encoding savings: ~4x
encoded blockSize: ~64KB
numReaderThreads: 1
pread: false
numCells: 1339210
avgKeyBytes: 85
avgValueBytes: 9
raw data size: 120MB
encoded data size: ~40MB (~5x L3 cache size)

encoding,sequentialMB/s,seeks/s
NONE,886.67,23604
PREFIX,404.3,15543
TRIE,25.63,340613
{code}

You can see the trie only encodes at 25MB/s, but I think this could be more 
like 200MB/s with some work.  Seek speed is ~15x the NONE seek speed, while 
expanding the effective block cache size by 4x.

Another advantage is that the trie should be able to operate on 
DirectByteBuffers without copying them to the heap first.  It would need a 
separate implementation of the PtArrayBlockSearcher called 
PtBufferBlockSearcher which would have the same code structure but replace all 
the bytes[i] (and similar) calls with buffer.get(i) calls.
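The array-vs-buffer split described above could be sketched like this (PtArrayBlockSearcher and PtBufferBlockSearcher are the names floated in the comment; the ByteSource interface below is purely hypothetical): the search logic is written once against a minimal byte-access interface, so a direct ByteBuffer never has to be copied onto the heap.

```java
// Hypothetical sketch: one search routine over a minimal byte-access
// interface, backed either by an on-heap byte[] or by a (possibly direct)
// ByteBuffer.  Not the actual PrefixTree class structure.
public class ByteSourceDemo {
    interface ByteSource {
        byte get(int i);
        int length();
    }

    static ByteSource ofArray(byte[] a) {
        return new ByteSource() {
            public byte get(int i) { return a[i]; }
            public int length() { return a.length; }
        };
    }

    static ByteSource ofBuffer(java.nio.ByteBuffer buf) {
        return new ByteSource() {
            public byte get(int i) { return buf.get(i); } // absolute get: no copy
            public int length() { return buf.limit(); }
        };
    }

    /** Same scan regardless of backing storage. */
    static int indexOf(ByteSource s, byte target) {
        for (int i = 0; i < s.length(); i++) {
            if (s.get(i) == target) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] data = {5, 7, 9};
        System.out.println(indexOf(ofArray(data), (byte) 9));                       // prints 2
        System.out.println(indexOf(ofBuffer(java.nio.ByteBuffer.wrap(data)), (byte) 7)); // prints 1
    }
}
```

The per-call indirection has a cost, which is presumably why the comment proposes two parallel searcher implementations with the same code structure rather than one shared abstraction.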
