[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-03-27 Thread Mikhail Bautin (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239920#comment-13239920
 ] 

Mikhail Bautin commented on HBASE-4218:
---

No, DATA_BLOCK_ENCODING=NONE disables encoding, regardless of the value of 
ENCODE_ON_DISK.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D1659.1.patch, D1659.2.patch, D1659.3.patch, D447.1.patch, D447.10.patch, 
> D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, D447.15.patch, 
> D447.16.patch, D447.17.patch, D447.18.patch, D447.19.patch, D447.2.patch, 
> D447.20.patch, D447.21.patch, D447.22.patch, D447.23.patch, D447.24.patch, 
> D447.25.patch, D447.26.patch, D447.3.patch, D447.4.patch, D447.5.patch, 
> D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding-2012-01-25_00_45_29.patch, 
> Delta-encoding-2012-01-25_16_32_14.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-03-27 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239912#comment-13239912
 ] 

Zhihong Yu commented on HBASE-4218:
---

bq. To enable, on the column descriptor set DATA_BLOCK_ENCODING to NONE
Would selection of NONE enable encoding ?

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D1659.1.patch, D1659.2.patch, D1659.3.patch, D447.1.patch, D447.10.patch, 
> D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, D447.15.patch, 
> D447.16.patch, D447.17.patch, D447.18.patch, D447.19.patch, D447.2.patch, 
> D447.20.patch, D447.21.patch, D447.22.patch, D447.23.patch, D447.24.patch, 
> D447.25.patch, D447.26.patch, D447.3.patch, D447.4.patch, D447.5.patch, 
> D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding-2012-01-25_00_45_29.patch, 
> Delta-encoding-2012-01-25_16_32_14.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-02-24 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13216331#comment-13216331
 ] 

Hudson commented on HBASE-4218:
---

Integrated in HBase-TRUNK #2669 (See 
[https://builds.apache.org/job/HBase-TRUNK/2669/])
[jira] [HBASE-5470] Make DataBlockEncodingTool work correctly with no native
compression codecs loaded

Summary:
DataBlockEncodingTool was fixed as part of porting data block encoding
(HBASE-4218) to 89-fb
(https://reviews.facebook.net/rHBASEEIGHTNINEFBBRANCH1245291,
https://reviews.facebook.net/D1659). The bug being fixed here appeared when
using GZ as baseline compression codec but not loading native Hadoop libraries,
in which case the compressor instance would be null.

Test Plan:
Run DataBlockEncoding tool with GZ (no native codecs) and LZO (with native
codecs) as baseline (Hadoop-level) compression codecs

Reviewers: JIRA, Kannan, mcorgan, lhofhansl, todd, stack, tedyu

Reviewed By: tedyu

Differential Revision: https://reviews.facebook.net/D1917 (Revision 1293057)

 Result = SUCCESS
mbautin : 
Files : 
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/DataBlockEncodingTool.java


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D1659.1.patch, D1659.2.patch, D1659.3.patch, D447.1.patch, D447.10.patch, 
> D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, D447.15.patch, 
> D447.16.patch, D447.17.patch, D447.18.patch, D447.19.patch, D447.2.patch, 
> D447.20.patch, D447.21.patch, D447.22.patch, D447.23.patch, D447.24.patch, 
> D447.25.patch, D447.26.patch, D447.3.patch, D447.4.patch, D447.5.patch, 
> D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding-2012-01-25_00_45_29.patch, 
> Delta-encoding-2012-01-25_16_32_14.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atla

[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-02-23 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215404#comment-13215404
 ] 

Hudson commented on HBASE-4218:
---

Integrated in HBase-TRUNK-security #121 (See 
[https://builds.apache.org/job/HBase-TRUNK-security/121/])
[jira] [HBASE-5470] Make DataBlockEncodingTool work correctly with no native
compression codecs loaded

Summary:
DataBlockEncodingTool was fixed as part of porting data block encoding
(HBASE-4218) to 89-fb
(https://reviews.facebook.net/rHBASEEIGHTNINEFBBRANCH1245291,
https://reviews.facebook.net/D1659). The bug being fixed here appeared when
using GZ as baseline compression codec but not loading native Hadoop libraries,
in which case the compressor instance would be null.

Test Plan:
Run DataBlockEncoding tool with GZ (no native codecs) and LZO (with native
codecs) as baseline (Hadoop-level) compression codecs

Reviewers: JIRA, Kannan, mcorgan, lhofhansl, todd, stack, tedyu

Reviewed By: tedyu

Differential Revision: https://reviews.facebook.net/D1917 (Revision 1293057)

 Result = SUCCESS
mbautin : 
Files : 
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/DataBlockEncodingTool.java


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D1659.1.patch, D1659.2.patch, D1659.3.patch, D447.1.patch, D447.10.patch, 
> D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, D447.15.patch, 
> D447.16.patch, D447.17.patch, D447.18.patch, D447.19.patch, D447.2.patch, 
> D447.20.patch, D447.21.patch, D447.22.patch, D447.23.patch, D447.24.patch, 
> D447.25.patch, D447.26.patch, D447.3.patch, D447.4.patch, D447.5.patch, 
> D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding-2012-01-25_00_45_29.patch, 
> Delta-encoding-2012-01-25_16_32_14.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see:

[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-02-16 Thread Phabricator (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13209981#comment-13209981
 ] 

Phabricator commented on HBASE-4218:


mbautin has committed the revision "[jira] [HBASE-4218] [89-fb] Porting HFile 
data block encoding to 89-fb".

REVISION DETAIL
  https://reviews.facebook.net/D1659

COMMIT
  https://reviews.facebook.net/rHBASEEIGHTNINEFBBRANCH1245291


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D1659.1.patch, D1659.2.patch, D1659.3.patch, D447.1.patch, D447.10.patch, 
> D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, D447.15.patch, 
> D447.16.patch, D447.17.patch, D447.18.patch, D447.19.patch, D447.2.patch, 
> D447.20.patch, D447.21.patch, D447.22.patch, D447.23.patch, D447.24.patch, 
> D447.25.patch, D447.26.patch, D447.3.patch, D447.4.patch, D447.5.patch, 
> D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding-2012-01-25_00_45_29.patch, 
> Delta-encoding-2012-01-25_16_32_14.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-02-15 Thread Phabricator (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13208709#comment-13208709
 ] 

Phabricator commented on HBASE-4218:


Kannan has accepted the revision "[jira] [HBASE-4218] [89-fb] Porting HFile 
data block encoding to 89-fb".

  There is one minor merge conflict inside of comments. Otherwise looks great!

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java:557 some merge 
conflicts left over here...

REVISION DETAIL
  https://reviews.facebook.net/D1659


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D1659.1.patch, D1659.2.patch, D447.1.patch, D447.10.patch, D447.11.patch, 
> D447.12.patch, D447.13.patch, D447.14.patch, D447.15.patch, D447.16.patch, 
> D447.17.patch, D447.18.patch, D447.19.patch, D447.2.patch, D447.20.patch, 
> D447.21.patch, D447.22.patch, D447.23.patch, D447.24.patch, D447.25.patch, 
> D447.26.patch, D447.3.patch, D447.4.patch, D447.5.patch, D447.6.patch, 
> D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding-2012-01-25_00_45_29.patch, 
> Delta-encoding-2012-01-25_16_32_14.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-02-09 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13205176#comment-13205176
 ] 

Hadoop QA commented on HBASE-4218:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12514065/D1659.2.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 106 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/935//console

This message is automatically generated.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D1659.1.patch, D1659.2.patch, D447.1.patch, D447.10.patch, D447.11.patch, 
> D447.12.patch, D447.13.patch, D447.14.patch, D447.15.patch, D447.16.patch, 
> D447.17.patch, D447.18.patch, D447.19.patch, D447.2.patch, D447.20.patch, 
> D447.21.patch, D447.22.patch, D447.23.patch, D447.24.patch, D447.25.patch, 
> D447.26.patch, D447.3.patch, D447.4.patch, D447.5.patch, D447.6.patch, 
> D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding-2012-01-25_00_45_29.patch, 
> Delta-encoding-2012-01-25_16_32_14.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-02-08 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13204255#comment-13204255
 ] 

Hadoop QA commented on HBASE-4218:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12513897/D1659.1.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 106 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/927//console

This message is automatically generated.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D1659.1.patch, D447.1.patch, D447.10.patch, D447.11.patch, D447.12.patch, 
> D447.13.patch, D447.14.patch, D447.15.patch, D447.16.patch, D447.17.patch, 
> D447.18.patch, D447.19.patch, D447.2.patch, D447.20.patch, D447.21.patch, 
> D447.22.patch, D447.23.patch, D447.24.patch, D447.25.patch, D447.26.patch, 
> D447.3.patch, D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, 
> D447.8.patch, D447.9.patch, Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding-2012-01-25_00_45_29.patch, 
> Delta-encoding-2012-01-25_16_32_14.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-30 Thread dhruba borthakur (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196244#comment-13196244
 ] 

dhruba borthakur commented on HBASE-4218:
-

TestHFileBlock works for me all the time, let me look at the logs produced by 
HadoopQA.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D447.1.patch, D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, 
> D447.14.patch, D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, 
> D447.19.patch, D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, 
> D447.23.patch, D447.24.patch, D447.25.patch, D447.26.patch, D447.3.patch, 
> D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, 
> D447.9.patch, Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding-2012-01-25_00_45_29.patch, 
> Delta-encoding-2012-01-25_16_32_14.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-30 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13196219#comment-13196219
 ] 

Zhihong Yu commented on HBASE-4218:
---

HFileBlock.readBlockDataInternal() has many if else blocks, making it less 
maintainable.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D447.1.patch, D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, 
> D447.14.patch, D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, 
> D447.19.patch, D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, 
> D447.23.patch, D447.24.patch, D447.25.patch, D447.26.patch, D447.3.patch, 
> D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, 
> D447.9.patch, Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding-2012-01-25_00_45_29.patch, 
> Delta-encoding-2012-01-25_16_32_14.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-26 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194408#comment-13194408
 ] 

Zhihong Yu commented on HBASE-4218:
---

TestHFileBlock was reported as failing by Hadoop QA (@26/Jan/12 02:58) before 
the checkin.

Now the test failure appears in every TRUNK build and every Hadoop QA report.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D447.1.patch, D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, 
> D447.14.patch, D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, 
> D447.19.patch, D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, 
> D447.23.patch, D447.24.patch, D447.25.patch, D447.26.patch, D447.3.patch, 
> D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, 
> D447.9.patch, Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding-2012-01-25_00_45_29.patch, 
> Delta-encoding-2012-01-25_16_32_14.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-25 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193564#comment-13193564
 ] 

Hudson commented on HBASE-4218:
---

Integrated in HBase-TRUNK-security #90 (See 
[https://builds.apache.org/job/HBase-TRUNK-security/90/])
[jira] [HBASE-4218] HFile data block encoding framework and delta encoding
implementation (Jacek Midgal, Mikhail Bautin)

Summary:

Adding a framework that allows to "encode" keys in an HFile data block. We
support two modes of encoding: (1) both on disk and in cache, and (2) in cache
only. This is distinct from compression that is already being done in HBase,
e.g. GZ or LZO. When data block encoding is enabled, we store blocks in cache
in an uncompressed but encoded form. This allows to fit more blocks in cache
and reduce the number of disk reads.

The most common example of data block encoding is delta encoding, where we take
advantage of the fact that HFile keys are sorted and share a lot of common
prefixes, and only store the delta between each pair of consecutive keys.
Initial encoding algorithms implemented are DIFF, FAST_DIFF, and PREFIX.

This is based on the delta encoding patch developed by Jacek Midgal during his
2011 summer internship at Facebook. The original patch is available here:
https://reviews.apache.org/r/2308/diff/.

Test Plan: Unit tests. Distributed load test on a five-node cluster.

Reviewers: JIRA, tedyu, stack, nspiegelberg, Kannan

Reviewed By: Kannan

CC: tedyu, todd, mbautin, stack, Kannan, mcorgan, gqchen

Differential Revision: https://reviews.facebook.net/D447

mbautin : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/HColumnDescriptor.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/HConstants.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/KeyValue.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/HalfStoreFileReader.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/BufferedDataBlockEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/CompressionState.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/CopyKeyDataBlockEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncoding.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/DiffKeyDeltaEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/EncodedDataBlock.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/EncoderBufferTooSmallException.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/FastDiffDeltaEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/PrefixKeyDeltaEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileReader.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileWriter.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCacheKey.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockType.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileDataBlockEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileDataBlockEncoderImpl.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFilePrettyPrinter.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV1.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV1.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/LruBlockCache.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/NoOpDataBlockEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/CompactSplitThread.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/MemStore.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/metrics/SchemaConfigured.java

[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-25 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193554#comment-13193554
 ] 

Zhihong Yu commented on HBASE-4218:
---

I wonder if we need to increase -Xmx for the tests:
{code}
https://builds.apache.org/view/G-L/view/HBase/job/HBase-TRUNK/2647/testReport/org.apache.hadoop.hbase.io.hfile/TestHFileBlock/testPreviousOffset_1_/
{code}
I see OutOfMemoryError.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D447.1.patch, D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, 
> D447.14.patch, D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, 
> D447.19.patch, D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, 
> D447.23.patch, D447.24.patch, D447.25.patch, D447.26.patch, D447.3.patch, 
> D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, 
> D447.9.patch, Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding-2012-01-25_00_45_29.patch, 
> Delta-encoding-2012-01-25_16_32_14.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-25 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193543#comment-13193543
 ] 

Hudson commented on HBASE-4218:
---

Integrated in HBase-TRUNK #2646 (See 
[https://builds.apache.org/job/HBase-TRUNK/2646/])
[jira] [HBASE-4218] HFile data block encoding framework and delta encoding
implementation (Jacek Midgal, Mikhail Bautin)

Summary:

Adding a framework that allows to "encode" keys in an HFile data block. We
support two modes of encoding: (1) both on disk and in cache, and (2) in cache
only. This is distinct from compression that is already being done in HBase,
e.g. GZ or LZO. When data block encoding is enabled, we store blocks in cache
in an uncompressed but encoded form. This allows to fit more blocks in cache
and reduce the number of disk reads.

The most common example of data block encoding is delta encoding, where we take
advantage of the fact that HFile keys are sorted and share a lot of common
prefixes, and only store the delta between each pair of consecutive keys.
Initial encoding algorithms implemented are DIFF, FAST_DIFF, and PREFIX.

This is based on the delta encoding patch developed by Jacek Midgal during his
2011 summer internship at Facebook. The original patch is available here:
https://reviews.apache.org/r/2308/diff/.

Test Plan: Unit tests. Distributed load test on a five-node cluster.

Reviewers: JIRA, tedyu, stack, nspiegelberg, Kannan

Reviewed By: Kannan

CC: tedyu, todd, mbautin, stack, Kannan, mcorgan, gqchen

Differential Revision: https://reviews.facebook.net/D447

mbautin : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/HColumnDescriptor.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/HConstants.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/KeyValue.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/HalfStoreFileReader.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/BufferedDataBlockEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/CompressionState.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/CopyKeyDataBlockEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncoding.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/DiffKeyDeltaEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/EncodedDataBlock.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/EncoderBufferTooSmallException.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/FastDiffDeltaEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/PrefixKeyDeltaEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileReader.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileWriter.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCacheKey.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockType.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileDataBlockEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileDataBlockEncoderImpl.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFilePrettyPrinter.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV1.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV1.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/LruBlockCache.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/NoOpDataBlockEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/CompactSplitThread.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/HRegion.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/MemStore.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreScanner.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/metrics/SchemaConfigured.java
* /hbase/trun

[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-25 Thread Phabricator (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193535#comment-13193535
 ] 

Phabricator commented on HBASE-4218:


mbautin has committed the revision "[jira] [HBASE-4218] HFile data block 
encoding framework and delta encoding implementation".

REVISION DETAIL
  https://reviews.facebook.net/D447

COMMIT
  https://reviews.facebook.net/rHBASE1236031


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D447.1.patch, D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, 
> D447.14.patch, D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, 
> D447.19.patch, D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, 
> D447.23.patch, D447.24.patch, D447.25.patch, D447.26.patch, D447.3.patch, 
> D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, 
> D447.9.patch, Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding-2012-01-25_00_45_29.patch, 
> Delta-encoding-2012-01-25_16_32_14.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-25 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193534#comment-13193534
 ] 

Hadoop QA commented on HBASE-4218:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12511917/Delta-encoding-2012-01-25_16_32_14.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 189 new or modified tests.

-1 javadoc.  The javadoc tool appears to have generated -140 warning 
messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 88 new Findbugs (version 
1.3.9) warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.io.hfile.TestHFileBlock

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/852//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/852//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/852//console

This message is automatically generated.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D447.1.patch, D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, 
> D447.14.patch, D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, 
> D447.19.patch, D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, 
> D447.23.patch, D447.24.patch, D447.25.patch, D447.26.patch, D447.3.patch, 
> D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, 
> D447.9.patch, Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding-2012-01-25_00_45_29.patch, 
> Delta-encoding-2012-01-25_16_32_14.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA ad

[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-25 Thread Phabricator (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193514#comment-13193514
 ] 

Phabricator commented on HBASE-4218:


Kannan has accepted the revision "[jira] [HBASE-4218] HFile data block encoding 
framework and delta encoding implementation".

  excellent!!

REVISION DETAIL
  https://reviews.facebook.net/D447


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D447.1.patch, D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, 
> D447.14.patch, D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, 
> D447.19.patch, D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, 
> D447.23.patch, D447.24.patch, D447.25.patch, D447.26.patch, D447.3.patch, 
> D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, 
> D447.9.patch, Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding-2012-01-25_00_45_29.patch, 
> Delta-encoding-2012-01-25_16_32_14.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-25 Thread Mikhail Bautin (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193148#comment-13193148
 ] 

Mikhail Bautin commented on HBASE-4218:
---

Re-running unit tests that failed on Jenkins:

Running org.apache.hadoop.hbase.client.TestFromClientSide
Tests run: 52, Failures: 0, Errors: 0, Skipped: 3, Time elapsed: 181.919 sec
Running org.apache.hadoop.hbase.client.TestAdmin
Tests run: 35, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 195.194 sec
Running org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat
Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 223.405 sec
Running org.apache.hadoop.hbase.mapreduce.TestImportTsv
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 78.48 sec
Running org.apache.hadoop.hbase.mapreduce.TestTableMapReduce
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 97.561 sec
Running org.apache.hadoop.hbase.mapred.TestTableMapReduce
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 67.289 sec
Running org.apache.hadoop.hbase.io.hfile.TestHFileBlock
Tests run: 16, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 49.362 sec

Results :

Tests run: 122, Failures: 0, Errors: 0, Skipped: 3


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D447.1.patch, D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, 
> D447.14.patch, D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, 
> D447.19.patch, D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, 
> D447.23.patch, D447.24.patch, D447.25.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding-2012-01-25_00_45_29.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-25 Thread Phabricator (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13193130#comment-13193130
 ] 

Phabricator commented on HBASE-4218:


gqchen has commented on the revision "[jira] [HBASE-4218] HFile data block 
encoding framework and delta encoding implementation".

  Looks really good to me!

  I haven't finished reviewing DiffKeyDeltaEncoding (another day or so) and 
might probably have a few minor comments about cosmetic things. But definitely 
no need to wait for that.

INLINE COMMENTS
  
src/main/java/org/apache/hadoop/hbase/io/encoding/FastDiffDeltaEncoder.java:198 
I think the logic is the following:
  1. if the value not the same, copy the whole value.
  2. however, if type is also not the same, take advantage of the fact that 
"type" field is right ahead of "value", and copy both type and value in one 
shot.

  So the code would be like:

  if ((flag & FLAG_SAME_VALUE) == 0) {
  if ((flag & FALG_SAME_TYPE) == 0) {
 valueOffset -= ...
 valueLength += ...
  }
  ByteBufferUtils.copy...
  }

  The headache is if we decide to add one more field between "type" and "value" 
in the future, this code will be silently broken.

REVISION DETAIL
  https://reviews.facebook.net/D447


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D447.1.patch, D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, 
> D447.14.patch, D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, 
> D447.19.patch, D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, 
> D447.23.patch, D447.24.patch, D447.25.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding-2012-01-25_00_45_29.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/j

[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-25 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192959#comment-13192959
 ] 

Hadoop QA commented on HBASE-4218:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12511817/Delta-encoding-2012-01-25_00_45_29.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 189 new or modified tests.

-1 javadoc.  The javadoc tool appears to have generated -140 warning 
messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 161 new Findbugs (version 
1.3.9) warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat
  org.apache.hadoop.hbase.client.TestAdmin
  org.apache.hadoop.hbase.mapred.TestTableMapReduce
  org.apache.hadoop.hbase.client.TestFromClientSide
  org.apache.hadoop.hbase.io.hfile.TestHFileBlock
  org.apache.hadoop.hbase.mapreduce.TestImportTsv

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/851//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/851//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/851//console

This message is automatically generated.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D447.1.patch, D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, 
> D447.14.patch, D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, 
> D447.19.patch, D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, 
> D447.23.patch, D447.24.patch, D447.25.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding-2012-01-25_00_45_29.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal

[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-25 Thread Mikhail Bautin (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192943#comment-13192943
 ] 

Mikhail Bautin commented on HBASE-4218:
---

All unit tests passed (either in parallel on map-reduce, or when I re-ran the 
failed ones locally).

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D447.1.patch, D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, 
> D447.14.patch, D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, 
> D447.19.patch, D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, 
> D447.23.patch, D447.24.patch, D447.25.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding-2012-01-25_00_45_29.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-24 Thread Phabricator (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192901#comment-13192901
 ] 

Phabricator commented on HBASE-4218:


Kannan has commented on the revision "[jira] [HBASE-4218] HFile data block 
encoding framework and delta encoding implementation".

  +1. Looks great.

  Jerry-- can you take a look at the updated diff for the parts you reviewed?

REVISION DETAIL
  https://reviews.facebook.net/D447


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D447.1.patch, D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, 
> D447.14.patch, D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, 
> D447.19.patch, D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, 
> D447.23.patch, D447.24.patch, D447.25.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-24 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192838#comment-13192838
 ] 

Hadoop QA commented on HBASE-4218:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12511791/D447.25.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 139 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/850//console

This message is automatically generated.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D447.1.patch, D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, 
> D447.14.patch, D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, 
> D447.19.patch, D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, 
> D447.23.patch, D447.24.patch, D447.25.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-24 Thread Phabricator (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13192767#comment-13192767
 ] 

Phabricator commented on HBASE-4218:


mbautin has commented on the revision "[jira] [HBASE-4218] HFile data block 
encoding framework and delta encoding implementation".

  Replying to the rest of comments. I will upload another patch shortly.

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncoder.java:130 
Done.
  src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncoder.java:129 
Updated this javadoc.
  src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncoder.java:137 
Done.
  
src/main/java/org/apache/hadoop/hbase/io/encoding/FastDiffDeltaEncoder.java:36 
Done. "128-bit" does not seem to appear anywhere else in the patch.
  
src/main/java/org/apache/hadoop/hbase/io/encoding/FastDiffDeltaEncoder.java:106 
Good catch! Implemented your suggestion.
  
src/main/java/org/apache/hadoop/hbase/io/encoding/FastDiffDeltaEncoder.java:141 
Spent quite a bit of time staring at this code to fully understand it, then 
added some more comments :)

  The difficult part for me was the "else" clause below. It turns out that as 
the column family length and name follow the row, they would be automatically 
included in commonPrefix if the whole row matches, so we don't need to 
special-case them.
  
src/main/java/org/apache/hadoop/hbase/io/encoding/FastDiffDeltaEncoder.java:156-159
 The condition on line 163 is FLAG_SAME_VALUE, not FLAG_SAME_TYPE, so moving 
these lines there would actually change logic. Why exactly are you saying we 
should move these two lines there?
  
src/main/java/org/apache/hadoop/hbase/io/encoding/FastDiffDeltaEncoder.java:305-313
 Yes, it appears that we are not using the qualifierLength field during 
decompression. Moved the state changes from this block to the 
FastDiffCompressionState.decompressFirstKV method.
  
src/main/java/org/apache/hadoop/hbase/io/encoding/PrefixKeyDeltaEncoder.java:45 
Renamed to prevKeyOffset.
  src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCacheKey.java:84 This 
actually fixes a pre-existing bug. The previous heapSize() implementation in 
BlockCacheKey did not take into account the object overhead and the hfileName 
String reference, and there was no unit test for BlockCacheKey, which I've 
added to TestHeapSize.
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java:91 Removed.
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java:760 Actually 
this piece of logic was enabling caching of unencoded blocks. However, as we 
decided that we don't care about doing encoding on disk only but not in cache, 
I am getting rid of this additional logic.
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV1.java:405 
Actually readerV1 is useful, because it saves us a cast from 
AbstractHFileReader to HFileReaderV1. But I have now renamed this field to 
reader to make it look indistinguishable from what happens in the base class.
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java:96 The 
assignment below is an upcast. We use some FSReaderV2-specific methods in the 
constructor. Renamed the local variable to fsBlockReaderV2 and added a comment 
for clarity.
  src/main/java/org/apache/hadoop/hbase/io/hfile/LruBlockCache.java:752 Good 
point. Leaving as is for now—the method name clearly says it is for test only. 
We can optimize this later if necessary.
  src/main/java/org/apache/hadoop/hbase/util/ByteBufferUtils.java:158 Done.
  src/main/java/org/apache/hadoop/hbase/util/ByteBufferUtils.java:192 Done.
  src/main/java/org/apache/hadoop/hbase/util/ByteBufferUtils.java:471 Done.

REVISION DETAIL
  https://reviews.facebook.net/D447


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D447.1.patch, D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, 
> D447.14.patch, D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, 
> D447.19.patch, D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, 
> D447.23.patch, D447.24.patch, D447.3.patch, D447.4.patch, D447.5.patch, 
> D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding.patch-2011

[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-23 Thread Phabricator (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13191807#comment-13191807
 ] 

Phabricator commented on HBASE-4218:


mbautin has commented on the revision "[jira] [HBASE-4218] HFile data block 
encoding framework and delta encoding implementation".

  See responses to some of the comments inline. I will upload a new version of 
the diff a bit later.

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/HColumnDescriptor.java:98 Done.
  src/main/java/org/apache/hadoop/hbase/KeyValue.java:1901 That's correct, good 
catch! Yes, this is a pre-existing bug. Fixed this and added a new test, 
KeyValue.testCreateKeyValueFromKey, to verify this.
  
src/main/java/org/apache/hadoop/hbase/io/encoding/BufferedDataBlockEncoder.java:87
 Renamed the method to copyFromNext and the parameter to nextState. Added usage 
details to the javadoc.
  
src/main/java/org/apache/hadoop/hbase/io/encoding/BufferedDataBlockEncoder.java:129-141
 The reason for this is that the key is reconstructed by pieces, and the value 
is stored as-is in the original encoded buffer, so getValue() just provides a 
reference to a sub-array of the original byte array. I renamed these two seeker 
interface methods to getKeyDeepCopy() and getValueShallowCopy() for clarity.
  
src/main/java/org/apache/hadoop/hbase/io/encoding/BufferedDataBlockEncoder.java:146
 Replaced here and at line 1343 in createKeyOnly:

public KeyValue createKeyOnly(boolean lenAsVal) {
  // KV format:  
  // Rebuild as: <0:4>
  int dataLen = lenAsVal? Bytes.SIZEOF_INT : 0;
  byte [] newBuffer = new byte[getKeyLength() + ROW_OFFSET + dataLen];
  System.arraycopy(this.bytes, this.offset, newBuffer, 0,
  Math.min(newBuffer.length,this.length));

  
src/main/java/org/apache/hadoop/hbase/io/encoding/BufferedDataBlockEncoder.java:206
 Good point. If previous is invalid here, then the contract of this function is 
actually violated, as it cannot go to the previous block. The caller should 
check if the requested key is the first key of the block and load the previous 
block if necessary. I added an exception in case previous.isValid() is not true.
  src/main/java/org/apache/hadoop/hbase/io/encoding/CompressionState.java:65 
Actually, all member fields except prevOffset are overridden. prevOffset is 
manipulated directly by encoders/decoders. Added this to the method's javadoc.
  src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncoder.java:30 
Done.
  src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncoder.java:32 
Done.
  src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncoder.java:79 
All current implementations just wrap a portion of the actual block's buffer, 
which makes sense, because we don't encode the first key. Added this to the 
method's javadoc.

REVISION DETAIL
  https://reviews.facebook.net/D447


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D447.1.patch, D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, 
> D447.14.patch, D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, 
> D447.19.patch, D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, 
> D447.23.patch, D447.24.patch, D447.3.patch, D447.4.patch, D447.5.patch, 
> D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to us

[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-21 Thread Phabricator (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190583#comment-13190583
 ] 

Phabricator commented on HBASE-4218:


Kannan has commented on the revision "[jira] [HBASE-4218] HFile data block 
encoding framework and delta encoding implementation".

  Mikhail -- my final set of comments for this pass of the review. Pretty minor 
comments only.

  Haven't reviewed the test files, and any encoders other than PrefixKey. But I 
think this is pretty much good to go.

  When you upload the updated diff, I'll go ahead and accept.


INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV1.java:405 we are 
using readerV1 in some places, and reader in some other places. Sounds like we 
should get rid of one of them.
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java:96 we have 
fsBlockReader as both as an instance variable and local variable? Can we get 
rid of the local variable?
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java:91 unused.
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java:760 "save the 
unencoded" here should be "save the encoded", correct?

REVISION DETAIL
  https://reviews.facebook.net/D447


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D447.1.patch, D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, 
> D447.14.patch, D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, 
> D447.19.patch, D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, 
> D447.23.patch, D447.24.patch, D447.3.patch, D447.4.patch, D447.5.patch, 
> D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-20 Thread Phabricator (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13190210#comment-13190210
 ] 

Phabricator commented on HBASE-4218:


gqchen has commented on the revision "[jira] [HBASE-4218] HFile data block 
encoding framework and delta encoding implementation".

  a few more comments.

INLINE COMMENTS
  
src/main/java/org/apache/hadoop/hbase/io/encoding/FastDiffDeltaEncoder.java:36 
7-bit encoding?
  src/main/java/org/apache/hadoop/hbase/io/encoding/CompressionState.java:65 It 
seems we assume all member variables in "this" should be reset in this 
function.  Otherwise we will be carrying the  values from two states earlier 
(prev of prev). Can we document this assumption?
  
src/main/java/org/apache/hadoop/hbase/io/encoding/FastDiffDeltaEncoder.java:106 
should we do "min(keyLength, previousState.keyLength) - 
KeyValue.TIMESTAMP_TYPE_SIZE"? If previous key is shorter, we can potentially 
match into the value area of the previous key. Since during seeking we only 
materialize key only, can it be a problem?
  
src/main/java/org/apache/hadoop/hbase/io/encoding/FastDiffDeltaEncoder.java:141 
some comments would be helpful here. This is probably the second time I read 
this part of the code and everytime I have to pause and think the reason behind 
this "if" condition.
  
src/main/java/org/apache/hadoop/hbase/io/encoding/FastDiffDeltaEncoder.java:156-159
 we should move these lines to right above line 164. Otherwise it's too 
confusing.
  
src/main/java/org/apache/hadoop/hbase/io/encoding/FastDiffDeltaEncoder.java:305-313
 It seems state.qualifierLength is not set here. It's probably not being used. 
But maybe we can move these code to a function in CompressionState?

REVISION DETAIL
  https://reviews.facebook.net/D447


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D447.1.patch, D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, 
> D447.14.patch, D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, 
> D447.19.patch, D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, 
> D447.23.patch, D447.24.patch, D447.3.patch, D447.4.patch, D447.5.patch, 
> D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming 

[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-19 Thread Phabricator (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13189623#comment-13189623
 ] 

Phabricator commented on HBASE-4218:


Kannan has commented on the revision "[jira] [HBASE-4218] HFile data block 
encoding framework and delta encoding implementation".

  a few more minor comments. I still have a few more files left to review.

INLINE COMMENTS
  
src/main/java/org/apache/hadoop/hbase/io/encoding/PrefixKeyDeltaEncoder.java:45 
rename: offset -> prevOffset
  src/main/java/org/apache/hadoop/hbase/util/ByteBufferUtils.java:158 we can 
avoid this duplication, and instead call the other load of copyToStream, and 
pass offset = 0.
  src/main/java/org/apache/hadoop/hbase/util/ByteBufferUtils.java:192 length 
name is confusing for 2nd argument. This function can put any long. Rename 
length -> value; and tmpLength -> tmpValue.
  src/main/java/org/apache/hadoop/hbase/util/ByteBufferUtils.java:471 There are 
4 overloads of the readCompressedInt().

  The first 2 work on a 7-bit encoding scheme.

  The next 2 work on a different format of encoding. These overloads are also 
unused other than tests. Let's remove these two overloads.


REVISION DETAIL
  https://reviews.facebook.net/D447


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D447.1.patch, D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, 
> D447.14.patch, D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, 
> D447.19.patch, D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, 
> D447.23.patch, D447.24.patch, D447.3.patch, D447.4.patch, D447.5.patch, 
> D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-18 Thread Phabricator (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188968#comment-13188968
 ] 

Phabricator commented on HBASE-4218:


gqchen has commented on the revision "[jira] [HBASE-4218] HFile data block 
encoding framework and delta encoding implementation".

  Looks really awesome! A few minor comments after going through 
PrefixKeyDeltaEncoder. Will continue on the next two algorithms.

INLINE COMMENTS
  
src/main/java/org/apache/hadoop/hbase/io/encoding/PrefixKeyDeltaEncoder.java:45 
maybe "prevKeyOffset" instead of "offset" is a better name here? It also 
matches "prevKeyLength".
  
src/main/java/org/apache/hadoop/hbase/io/encoding/BufferedDataBlockEncoder.java:87
 Agree with Kannan. probably document it better, or maybe call it 
"copyFromNextState"?
  
src/main/java/org/apache/hadoop/hbase/io/encoding/BufferedDataBlockEncoder.java:129-141
 getKey does a deep copy and getValue does a shallow copy. Just wondering what 
is the motivation.
  
src/main/java/org/apache/hadoop/hbase/io/encoding/BufferedDataBlockEncoder.java:146
 use KeyValue.ROW_OFFSET instead of 2 * Bytes.SIZEOF_INT?
  
src/main/java/org/apache/hadoop/hbase/io/encoding/BufferedDataBlockEncoder.java:206
 should we check if previous is valid here as well?

REVISION DETAIL
  https://reviews.facebook.net/D447


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D447.1.patch, D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, 
> D447.14.patch, D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, 
> D447.19.patch, D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, 
> D447.23.patch, D447.24.patch, D447.3.patch, D447.4.patch, D447.5.patch, 
> D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian

[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-18 Thread Phabricator (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188767#comment-13188767
 ] 

Phabricator commented on HBASE-4218:


tedyu has commented on the revision "[jira] [HBASE-4218] HFile data block 
encoding framework and delta encoding implementation".

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncoder.java:130 
Should read 'array where the key is'

REVISION DETAIL
  https://reviews.facebook.net/D447


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D447.1.patch, D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, 
> D447.14.patch, D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, 
> D447.19.patch, D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, 
> D447.23.patch, D447.24.patch, D447.3.patch, D447.4.patch, D447.5.patch, 
> D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-18 Thread Phabricator (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188741#comment-13188741
 ] 

Phabricator commented on HBASE-4218:


Kannan has commented on the revision "[jira] [HBASE-4218] HFile data block 
encoding framework and delta encoding implementation".

  another partial round of comments.

INLINE COMMENTS
  
src/main/java/org/apache/hadoop/hbase/io/deltaencoder/PrefixKeyDeltaEncoder.java:45
 rename offset to prevKeyOffset for clarity.
  
src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DeltaEncoderAlgorithms.java:115
 has -> have
  
src/main/java/org/apache/hadoop/hbase/io/deltaencoder/DeltaEncoderAlgorithms.java:124
 DeltaEncoder -> DeltaEncoders
  src/main/java/org/apache/hadoop/hbase/HColumnDescriptor.java:93 Another 
related question/clarification needed on the spec of this feature...

  If delta-encoding is on in cache, then is blocksize setting for the CF based 
on the encoded size or the un-encoded size.

  [Personally, think the encoded size should be used for the blocksize. But can 
you clarify either way.]
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java:196 
getDeltaEncodingId() and getDeltaEncodedId() seem to be identical, but for 
their names.
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockDeltaEncoder.java:78 
operate -> operates
  
src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockDeltaEncoder.java:112 
I wasn't clear what "useEncodedScanner" is meant for. Currently, it seems to be 
used HFileReaderV1. Could you clarify the purpose of this...
  src/main/java/org/apache/hadoop/hbase/HColumnDescriptor.java:98 remove " in 
cache"
  src/main/java/org/apache/hadoop/hbase/KeyValue.java:1901 sounds like b.length 
should be l on this line also.

  Is this is a pre-existing bug?
  src/main/java/org/apache/hadoop/hbase/io/hfile/BlockCacheKey.java:84 Why 2 * 
ClassSize.REFERENCE? This change adds one reference to the encoding enum, 
correct?
  src/main/java/org/apache/hadoop/hbase/io/hfile/LruBlockCache.java:752 We can 
use a MutableInteger here to avoid creating lots of Integer objects. But, since 
this is just for a test, not a big deal.
  src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncoder.java:30 
KeyValue -> KeyValues
  src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncoder.java:32 
iterated always -> always iterated
  src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncoder.java:79 
Must the implementation make a deep copy? Or is it legal for the implementation 
to point to have the returned ByteBuffer point to a byte array in the input 
"block"?
  src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncoder.java:129 
confusing comment. "same key" as what?

  Should comment be something like:

  "Seek to specified key in the block."

  ?
  src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncoder.java:137 
blockSeekTo --> seekToKeyInBlock, perhaps?
  
src/main/java/org/apache/hadoop/hbase/io/encoding/BufferedDataBlockEncoder.java:87
 this logic appears to assume that the target ("this") we are copying into was 
positioned at the previous key. No?

REVISION DETAIL
  https://reviews.facebook.net/D447


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D447.1.patch, D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, 
> D447.14.patch, D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, 
> D447.19.patch, D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, 
> D447.23.patch, D447.24.patch, D447.3.patch, D447.4.patch, D447.5.patch, 
> D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compress

[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-17 Thread Matt Corgan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188294#comment-13188294
 ] 

Matt Corgan commented on HBASE-4218:


I used plenty of memory and a warmup run so that for the measured results all 
reads were served out of the OS page cache and HBase block cache.  I was trying 
to measure compression ratio and cpu performance assuming that the data set is 
very hot and cached nearly 100%.  If you're IO bound, then that 50% cpu 
difference shouldn't matter much, like you said.  It strikes me that bringing 
IO into the test is really just testing the effective size of the block cache 
which you can already do by adjusting the block cache size in hbase-site.  CPU 
efficiency difference would get drowned out.

A scenario i have where data is 100% cached is chronological log ("event") data 
(sharded 16 ways) where the last ~2 days fit in memory.  We add different 
secondary index tables to the primary table depending on different reports we 
want to generate.  When scanning those secondary indexes we pull millions of 
rows from the primary table in random order.  The better the compression, the 
more days of log events we can hold in memory, and the better the cpu 
efficiency, the faster we can do the random reads.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D447.1.patch, D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, 
> D447.14.patch, D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, 
> D447.19.patch, D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, 
> D447.23.patch, D447.24.patch, D447.3.patch, D447.4.patch, D447.5.patch, 
> D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-17 Thread Lars Hofhansl (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188282#comment-13188282
 ] 

Lars Hofhansl commented on HBASE-4218:
--

(Just looking through the thread of comments)

@Matt: When you saw the 50% performance reduction, did your workload fit into 
the cache (before compression)? One of the ideas here is that because of the 
compression the cache can hold more KVs, so one would have to measure the 
reduced scan performance intra-block against more frequent block loads from 
HDFS.


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D447.1.patch, D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, 
> D447.14.patch, D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, 
> D447.19.patch, D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, 
> D447.23.patch, D447.24.patch, D447.3.patch, D447.4.patch, D447.5.patch, 
> D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-17 Thread Mikhail Bautin (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188175#comment-13188175
 ] 

Mikhail Bautin commented on HBASE-4218:
---

@Ted: we are in the process of doing a final review internally. It will 
probably be a couple more days -- we will post an update.

Thanks!
--Mikhail


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D447.1.patch, D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, 
> D447.14.patch, D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, 
> D447.19.patch, D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, 
> D447.23.patch, D447.24.patch, D447.3.patch, D447.4.patch, D447.5.patch, 
> D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-17 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188169#comment-13188169
 ] 

Zhihong Yu commented on HBASE-4218:
---

@Kannan, @Mikhail:
Is the latest patch ready to go ?

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D447.1.patch, D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, 
> D447.14.patch, D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, 
> D447.19.patch, D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, 
> D447.23.patch, D447.24.patch, D447.3.patch, D447.4.patch, D447.5.patch, 
> D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-17 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187951#comment-13187951
 ] 

Hadoop QA commented on HBASE-4218:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12510876/Delta-encoding-2012-01-17_11_09_09.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 185 new or modified tests.

-1 javadoc.  The javadoc tool appears to have generated -140 warning 
messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 86 new Findbugs (version 
1.3.9) warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.io.hfile.TestHFileBlock
  org.apache.hadoop.hbase.mapreduce.TestImportTsv
  org.apache.hadoop.hbase.mapred.TestTableMapReduce
  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/793//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/793//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/793//console

This message is automatically generated.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D447.1.patch, D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, 
> D447.14.patch, D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, 
> D447.19.patch, D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, 
> D447.23.patch, D447.24.patch, D447.3.patch, D447.4.patch, D447.5.patch, 
> D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding-2012-01-17_11_09_09.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatica

[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-17 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13187888#comment-13187888
 ] 

Hadoop QA commented on HBASE-4218:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12510870/D447.24.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 137 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/792//console

This message is automatically generated.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D447.1.patch, D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, 
> D447.14.patch, D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, 
> D447.19.patch, D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, 
> D447.23.patch, D447.24.patch, D447.3.patch, D447.4.patch, D447.5.patch, 
> D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-14 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186257#comment-13186257
 ] 

Hadoop QA commented on HBASE-4218:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12510586/4218-2012-01-14.txt
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 137 new or modified tests.

-1 javadoc.  The javadoc tool appears to have generated -140 warning 
messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 86 new Findbugs (version 
1.3.9) warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.io.hfile.TestHFileBlock
  org.apache.hadoop.hbase.mapreduce.TestImportTsv
  org.apache.hadoop.hbase.mapred.TestTableMapReduce
  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/761//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/761//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/761//console

This message is automatically generated.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-2012-01-14.txt, 4218-v16.txt, 4218.txt, 
> D447.1.patch, D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, 
> D447.14.patch, D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, 
> D447.19.patch, D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, 
> D447.23.patch, D447.3.patch, D447.4.patch, D447.5.patch, D447.6.patch, 
> D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your J

[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-13 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13186067#comment-13186067
 ] 

Hadoop QA commented on HBASE-4218:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12510557/D447.23.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 137 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/758//console

This message is automatically generated.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, 4218.txt, D447.1.patch, 
> D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, 
> D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, D447.19.patch, 
> D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, D447.23.patch, 
> D447.3.patch, D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, 
> D447.8.patch, D447.9.patch, Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-13 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185853#comment-13185853
 ] 

Hadoop QA commented on HBASE-4218:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12510523/Delta-encoding.patch-2012-01-13_12_20_07.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 182 new or modified tests.

-1 javadoc.  The javadoc tool appears to have generated -142 warning 
messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 84 new Findbugs (version 
1.3.9) warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat
  org.apache.hadoop.hbase.mapred.TestTableMapReduce
  org.apache.hadoop.hbase.io.hfile.TestHFileBlock
  org.apache.hadoop.hbase.mapreduce.TestImportTsv

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/755//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/755//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/755//console

This message is automatically generated.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, 4218.txt, D447.1.patch, 
> D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, 
> D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, D447.19.patch, 
> D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, D447.3.patch, 
> D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, 
> D447.9.patch, Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA adminis

[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-13 Thread Phabricator (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185851#comment-13185851
 ] 

Phabricator commented on HBASE-4218:


tedyu has commented on the revision "[jira] [HBASE-4218] HFile data block 
encoding framework and delta encoding implementation".

  Amazing progress.

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileDataBlockEncoder.java:73 
encoding is repeated twice.
  src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java:307 Can we 
include dataBlockEncoder.getEncodingInCache() in the exception message ?

REVISION DETAIL
  https://reviews.facebook.net/D447


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, 4218.txt, D447.1.patch, 
> D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, 
> D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, D447.19.patch, 
> D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, D447.3.patch, 
> D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, 
> D447.9.patch, Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-13 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13185834#comment-13185834
 ] 

Zhihong Yu commented on HBASE-4218:
---

PreCommit build #755:
{code}
Running org.apache.hadoop.hbase.io.hfile.TestHFileBlock
Tests run: 16, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 67.298 sec <<< 
FAILURE!
{code}

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, 4218.txt, D447.1.patch, 
> D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, 
> D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, D447.19.patch, 
> D447.2.patch, D447.20.patch, D447.21.patch, D447.22.patch, D447.3.patch, 
> D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, 
> D447.9.patch, Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta-encoding.patch-2012-01-13_12_20_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-07 Thread Mikhail Bautin (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182134#comment-13182134
 ] 

Mikhail Bautin commented on HBASE-4218:
---

@Ted: I was running a load test with LZO compression and PREFIX encoding and 
everything was fine, but then I switched to encoding in cache only and 
compactions started failing. I need to look into this.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, 4218.txt, D447.1.patch, 
> D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, 
> D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, D447.19.patch, 
> D447.2.patch, D447.20.patch, D447.21.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-07 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13182129#comment-13182129
 ] 

Hadoop QA commented on HBASE-4218:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12509806/D447.21.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 133 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/697//console

This message is automatically generated.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, 4218.txt, D447.1.patch, 
> D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, 
> D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, D447.19.patch, 
> D447.2.patch, D447.20.patch, D447.21.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta-encoding.patch-2012-01-07_14_12_48.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-06 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181850#comment-13181850
 ] 

Zhihong Yu commented on HBASE-4218:
---

Test failure seemed to be caused by resource constraint 
(https://builds.apache.org/job/PreCommit-HBASE-Build/681/testReport/org.apache.hadoop.hbase.io.hfile/TestHFileBlock/testPreviousOffset_1_/):
{code}
java.lang.OutOfMemoryError
at java.util.zip.Inflater.init(Native Method)
{code}
TestHFileBlock passed on MacBook (with -d32 JVM arg).
TestSplitLogManager passed too.

@Mikhail:
Has the latest patch passed cluster testing ?

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, 4218.txt, D447.1.patch, 
> D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, 
> D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, D447.19.patch, 
> D447.2.patch, D447.20.patch, D447.3.patch, D447.4.patch, D447.5.patch, 
> D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-05 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181072#comment-13181072
 ] 

Hadoop QA commented on HBASE-4218:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12509652/Delta-encoding.patch-2012-01-05_18_50_47.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 178 new or modified tests.

-1 javadoc.  The javadoc tool appears to have generated -146 warning 
messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 83 new Findbugs (version 
1.3.9) warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.client.TestFromClientSide
  org.apache.hadoop.hbase.io.hfile.TestHFileBlock
  org.apache.hadoop.hbase.mapreduce.TestImportTsv
  org.apache.hadoop.hbase.mapred.TestTableMapReduce
  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat
  org.apache.hadoop.hbase.master.TestSplitLogManager

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/681//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/681//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/681//console

This message is automatically generated.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, 4218.txt, D447.1.patch, 
> D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, 
> D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, D447.19.patch, 
> D447.2.patch, D447.20.patch, D447.3.patch, D447.4.patch, D447.5.patch, 
> D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA ad

[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-05 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181062#comment-13181062
 ] 

Hadoop QA commented on HBASE-4218:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12509651/D447.20.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 133 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/680//console

This message is automatically generated.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, 4218.txt, D447.1.patch, 
> D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, 
> D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, D447.19.patch, 
> D447.2.patch, D447.20.patch, D447.3.patch, D447.4.patch, D447.5.patch, 
> D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta-encoding.patch-2012-01-05_18_50_47.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-05 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13181055#comment-13181055
 ] 

Hadoop QA commented on HBASE-4218:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12509647/Delta-encoding.patch-2012-01-05_16_31_44_copy.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 174 new or modified tests.

-1 javadoc.  The javadoc tool appears to have generated -146 warning 
messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 83 new Findbugs (version 
1.3.9) warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.io.hfile.TestHFileBlock
  org.apache.hadoop.hbase.mapreduce.TestImportTsv
  org.apache.hadoop.hbase.mapred.TestTableMapReduce
  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat
  org.apache.hadoop.hbase.master.TestSplitLogManager

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/679//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/679//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/679//console

This message is automatically generated.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, 4218.txt, D447.1.patch, 
> D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, 
> D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, D447.19.patch, 
> D447.2.patch, D447.3.patch, D447.4.patch, D447.5.patch, D447.6.patch, 
> D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44_copy.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http:/

[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-05 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180998#comment-13180998
 ] 

Hadoop QA commented on HBASE-4218:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12509638/D447.19.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 131 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/678//console

This message is automatically generated.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, 4218.txt, D447.1.patch, 
> D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, 
> D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, D447.19.patch, 
> D447.2.patch, D447.3.patch, D447.4.patch, D447.5.patch, D447.6.patch, 
> D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta-encoding.patch-2012-01-05_16_31_44.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-05 Thread Mikhail Bautin (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180987#comment-13180987
 ] 

Mikhail Bautin commented on HBASE-4218:
---

The failed tests above pass locally:

Running org.apache.hadoop.hbase.replication.TestReplication
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 120.447 sec
Running org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat
Tests run: 9, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 216.844 sec
Running org.apache.hadoop.hbase.mapreduce.TestImportTsv
Tests run: 8, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 79.119 sec
Running org.apache.hadoop.hbase.mapreduce.TestTableMapReduce
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 95.373 sec
Running org.apache.hadoop.hbase.mapred.TestTableMapReduce
Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 67.574 sec

Results :

Tests run: 27, Failures: 0, Errors: 0, Skipped: 0

The patch also works good (so far) in a LoadTestTool 5-node cluster test with 
LZO compression and PREFIX encoding. I have a couple more minor changes to the 
patch, so please don't commit yet.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, 4218.txt, D447.1.patch, 
> D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, 
> D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, D447.2.patch, 
> D447.3.patch, D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, 
> D447.8.patch, D447.9.patch, Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-05 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13180964#comment-13180964
 ] 

Hadoop QA commented on HBASE-4218:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12509627/Delta-encoding.patch-2012-01-05_15_16_43.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 174 new or modified tests.

-1 javadoc.  The javadoc tool appears to have generated -146 warning 
messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 83 new Findbugs (version 
1.3.9) warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.replication.TestReplication
  org.apache.hadoop.hbase.mapreduce.TestImportTsv
  org.apache.hadoop.hbase.mapred.TestTableMapReduce
  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/675//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/675//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/675//console

This message is automatically generated.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, 4218.txt, D447.1.patch, 
> D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, 
> D447.15.patch, D447.16.patch, D447.17.patch, D447.18.patch, D447.2.patch, 
> D447.3.patch, D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, 
> D447.8.patch, D447.9.patch, Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta-encoding.patch-2012-01-05_15_16_43.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-04 Thread Matt Corgan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179973#comment-13179973
 ] 

Matt Corgan commented on HBASE-4218:


{quote}I think it is OK to ignore cached encoded blocks on compaction{quote}The 
circumstance i was worried about is if you are doing many small flushes and 
minor compactions.  The blocks to be compacted could mostly be in cache, and 
you would be ignoring them all.  I guess it doesn't matter if it's just for 
testing, but might give a false impression of performance.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, 4218.txt, D447.1.patch, 
> D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, 
> D447.15.patch, D447.16.patch, D447.17.patch, D447.2.patch, D447.3.patch, 
> D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, 
> D447.9.patch, Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-04 Thread Mikhail Bautin (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179959#comment-13179959
 ] 

Mikhail Bautin commented on HBASE-4218:
---

Actually, I think it is OK to ignore cached encoded blocks on compaction. We 
can get encoded blocks in cache and have a compaction write an unencoded file 
in two cases:
* Encoding is turned on in cache only. In that case we don't want to use 
encoded blocks during compaction at all, because the in-cache-only mode implies 
that we don't trust our encoding algorithms 100% and want to guard against 
possible persistent data corruption.
* Encoding was turned on (either in cache only or everywhere) and it was turned 
off entirely. Since this is not a very frequent case, I think we could probably 
optimize this after the patch is stabilized.


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, 4218.txt, D447.1.patch, 
> D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, 
> D447.15.patch, D447.16.patch, D447.17.patch, D447.2.patch, D447.3.patch, 
> D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, 
> D447.9.patch, Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-04 Thread Mikhail Bautin (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179960#comment-13179960
 ] 

Mikhail Bautin commented on HBASE-4218:
---

Re-reading my previous post, I want to make an addition: we still use cached 
encoded blocks when compacting a fully-encoded column family.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, 4218.txt, D447.1.patch, 
> D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, 
> D447.15.patch, D447.16.patch, D447.17.patch, D447.2.patch, D447.3.patch, 
> D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, 
> D447.9.patch, Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-04 Thread Matt Corgan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179955#comment-13179955
 ] 

Matt Corgan commented on HBASE-4218:


{quote}I think that with an 8K line patch we probably should not try to put 
more complexity into the first version of delta encoding.{quote}Yes, totally 
agreeing here.  It is a work in progress, and so these settings in this patch 
don't have to make perfect sense.  I like the latest 
DATA_BLOCK_ENCODING=NONE(?) and ENCODE_ON_DISK=true defaults.

All other comments look sensible.  Have you covered the case where you have 
encoded blocks in the block cache and are compacting to an unencoded hfile?  
You will want to make sure that you are using (not ignoring) the cached blocks.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, 4218.txt, D447.1.patch, 
> D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, 
> D447.15.patch, D447.16.patch, D447.17.patch, D447.2.patch, D447.3.patch, 
> D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, 
> D447.9.patch, Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-04 Thread Mikhail Bautin (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179945#comment-13179945
 ] 

Mikhail Bautin commented on HBASE-4218:
---

I think that with an 8K line patch we probably should not try to put more 
complexity into the first version of delta encoding. We can always make things 
more complicated later. I like the two-parameter setup: DATA_BLOCK_ENCODING 
sets the encoding type (on-disk and in-cache by default) and ENCODE_ON_DISK 
(true by default) allows to use in-cache-only encoding (when explicitly setting 
ENCODE_ON_DISK=false) and get the benefit of encoding in cache even before we 
are 100% sure that our encoding algorithms and encoded scanners are stable. If 
everyone agrees with that, I will finish the patch by (1) adding a unit test 
for switching data block encoding column family settings; (2) including 
encoding type in the cache key; and (3) simplifying the HFileDataBlockEncoder 
interface, since we assume that the "in-memory format" (used by scanners) is 
always the same as the in-cache format and don't need methods such as 
afterReadFromDiskAndPuttingInCache anymore.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, 4218.txt, D447.1.patch, 
> D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, 
> D447.15.patch, D447.16.patch, D447.17.patch, D447.2.patch, D447.3.patch, 
> D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, 
> D447.9.patch, Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-04 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179813#comment-13179813
 ] 

stack commented on HBASE-4218:
--

/me hearts this issue

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, 4218.txt, D447.1.patch, 
> D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, 
> D447.15.patch, D447.16.patch, D447.17.patch, D447.2.patch, D447.3.patch, 
> D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, 
> D447.9.patch, Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-04 Thread Matt Corgan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179802#comment-13179802
 ] 

Matt Corgan commented on HBASE-4218:


Some food for thought - there is probably more complexity to this down the 
road.  There are always going to be trade-offs between encoding speed, 
compression ratio, scan throughput, and seek latency.  These trade-offs can 
actually be quite huge, like 10x when you start considering things like suffix 
compression.  I can see having different encodings in the same column family 
depending on dynamic performance decisions.  For example, use the most compact 
encoding during major compaction, but use the fastest encoding if memstore 
flushes are backlogged.

We probably can't get it perfect in this first iteration.  Just want to avoid 
shooting ourselves in the foot as much as possible.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, 4218.txt, D447.1.patch, 
> D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, 
> D447.15.patch, D447.16.patch, D447.17.patch, D447.2.patch, D447.3.patch, 
> D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, 
> D447.9.patch, Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-04 Thread Kannan Muthukkaruppan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179763#comment-13179763
 ] 

Kannan Muthukkaruppan commented on HBASE-4218:
--

I also like ENCODE_ON_DISK instead of ENCODE_IN_CACHE_ONLY (with the reverse 
semantics).

I would say let's keep the default for ENCODE_ON_DISK to true though. This is 
more a testing knob in early stages-- where someone will set it to false before 
publishing a new data block encoder for general use. By the time end users try 
this, the code should be robust enough, and the Column Family setting of which 
data block encoding to use should be ideally the only knob they need to think 
about.



> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, 4218.txt, D447.1.patch, 
> D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, 
> D447.15.patch, D447.16.patch, D447.17.patch, D447.2.patch, D447.3.patch, 
> D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, 
> D447.9.patch, Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-04 Thread Lars Hofhansl (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179650#comment-13179650
 ] 

Lars Hofhansl commented on HBASE-4218:
--

One more thought about ENCODED_IN_CACHE_ONLY (and then I'll shut up about 
this)...

If we ever wanted to extend this in the future and allow disk only encoding, 
maybe a better way would be to have ENCODING and ENCODE_ON_DISK. ENCODE_ON_DISK 
(default false) would just be the inverse of what ENCODED_IN_CACHE_ONLY is. 
That way (if we felt so inclined) we can add ENCODE_IN_CACHE later and allow it 
to be false.


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, 4218.txt, D447.1.patch, 
> D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, 
> D447.15.patch, D447.16.patch, D447.17.patch, D447.2.patch, D447.3.patch, 
> D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, 
> D447.9.patch, Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-04 Thread Mikhail Bautin (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179349#comment-13179349
 ] 

Mikhail Bautin commented on HBASE-4218:
---

A brief status update. I am in the process of implementing support for column 
family data block encoding configuration changes. Those changes are coming in 
the next version of the patch that I will post tomorrow. After discussing this 
with Kannan, our solution is:
* Assign an in-cache data block encoding to every HFile reader. This in-cache 
encoding is determined as follows:
** If the HFile is not encoded on disk, the in-cache encoding is set to the 
column family's DATA_BLOCK_ENCODING.
** If the HFile is encoded on disk, the in-cache encoding is set to the HFile 
encoding to avoid the wasted effort of re-encoding blocks for cache.
* When a non-encoded block is loaded from disk, it is encoded using the 
in-cache encoding and put in cache.
* When an encoded block is loaded from disk, its encoding is left as is.
* To reduce the complexity of data block encoding switching, we can include the 
in-cache encoding type in the block cache key. For example, if 
ENCODED_IN_CACHE_ONLY is turned on without encoding on disk, and then the 
encoding is turned off altogether, the cache will be populated with non-encoded 
blocks (since they will have completely different keys) and encoded blocks will 
age out from the cache. While this is suboptimal, the implementation is very 
simple and the common case (when the CF encoding options do not change) is not 
complicated with unnecessary corner cases.


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, 4218.txt, D447.1.patch, 
> D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, 
> D447.15.patch, D447.16.patch, D447.17.patch, D447.2.patch, D447.3.patch, 
> D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, 
> D447.9.patch, Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-04 Thread Phabricator (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179335#comment-13179335
 ] 

Phabricator commented on HBASE-4218:


mcorgan has commented on the revision "[jira] [HBASE-4218] HFile data block 
encoding framework and delta encoding implementation".

  Trying to review this with an eye on schema changes and compactions.

INLINE COMMENTS
  
src/main/java/org/apache/hadoop/hbase/io/hfile/HFileDataBlockEncoderImpl.java:241
 What about the situation where regionserver is running for a while with 
ENCODING_IN_MEMORY=true and block cache gets filled with encoded blocks, and 
then user does schema change to disable encoding altogether.  Now the block 
cache may return an old encoded block.  (Assuming online schema change doesn't 
invalidate all blocks for a table?)

  If i'm understanding that correctly, then it shouldn't be an 
IllegalStateException but should be handled normally.  It should probably 
invalidate the encoded block from the block cache if possible, otherwise it 
will expire normally.  Then it should return null so that HfileReaderV2 knows 
to go to the filesystem to get the block.

REVISION DETAIL
  https://reviews.facebook.net/D447


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, 4218.txt, D447.1.patch, 
> D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, 
> D447.15.patch, D447.16.patch, D447.17.patch, D447.2.patch, D447.3.patch, 
> D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, 
> D447.9.patch, Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-03 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179299#comment-13179299
 ] 

Hadoop QA commented on HBASE-4218:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12509377/4218.txt
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 111 new or modified tests.

-1 javadoc.  The javadoc tool appears to have generated -138 warning 
messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 81 new Findbugs (version 
1.3.9) warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.mapreduce.TestImportTsv
  org.apache.hadoop.hbase.mapred.TestTableMapReduce
  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat
  org.apache.hadoop.hbase.master.TestSplitLogManager

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/662//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/662//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/662//console

This message is automatically generated.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, 4218.txt, D447.1.patch, 
> D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, 
> D447.15.patch, D447.16.patch, D447.17.patch, D447.2.patch, D447.3.patch, 
> D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, 
> D447.9.patch, Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-03 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179270#comment-13179270
 ] 

Hadoop QA commented on HBASE-4218:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12509375/4218.txt
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 111 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/661//console

This message is automatically generated.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, 4218.txt, D447.1.patch, 
> D447.10.patch, D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, 
> D447.15.patch, D447.16.patch, D447.17.patch, D447.2.patch, D447.3.patch, 
> D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, 
> D447.9.patch, Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-03 Thread Lars Hofhansl (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179262#comment-13179262
 ] 

Lars Hofhansl commented on HBASE-4218:
--

Mikhail's explanation absolutely makes sense. In fact now I would even prefer 
to get rid of ENCODE_IN_CACHE_ONLY (am OK with leaving it in too).


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, D447.1.patch, D447.10.patch, 
> D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, D447.15.patch, 
> D447.16.patch, D447.17.patch, D447.2.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-03 Thread Matt Corgan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179260#comment-13179260
 ] 

Matt Corgan commented on HBASE-4218:


It makes sense to me given the background.  Seems like the ENCODE_IN_CACHE_ONLY 
is more of a caution flag that people can fly until they're confident their 
data won't be corrupted.  Probabaly can be removed at some point down the road.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, D447.1.patch, D447.10.patch, 
> D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, D447.15.patch, 
> D447.16.patch, D447.17.patch, D447.2.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-03 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179254#comment-13179254
 ] 

Zhihong Yu commented on HBASE-4218:
---

@Stack, @Matt, @Lars:
Can I assume that you're Okay with the formation in the latest patch ?

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, D447.1.patch, D447.10.patch, 
> D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, D447.15.patch, 
> D447.16.patch, D447.17.patch, D447.2.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-03 Thread Matt Corgan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179172#comment-13179172
 ] 

Matt Corgan commented on HBASE-4218:


oops - missed your comment before replying.

{quote}the real value of in-cache-only encoding for us is that if we can get a 
benefit of data block encoding in production faster without risking data 
corruption{quote}
makes sense to me.  sorry for the full-circle discussion!

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, D447.1.patch, D447.10.patch, 
> D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, D447.15.patch, 
> D447.16.patch, D447.17.patch, D447.2.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-03 Thread Matt Corgan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179166#comment-13179166
 ] 

Matt Corgan commented on HBASE-4218:


{quote}Blocks read from existing HFiles will still be brought into cache using 
their original encoding{quote}
awesome - I was just about to bring that up.  Will be very important for tables 
that go many days between compactions

{quote}Another possible way to simplify things even further could be to get rid 
of the ENCODE_IN_CACHE_ONLY option completely{quote}
I am leaning towards this as well.  It's a cool feature for development and 
testing, but i can't think of a reason to use it in production.  As you 
mentioned, it makes more sense to do encoding during flushes and compactions 
and not during the read path.  Storing unencoded on disk and encoded in memory 
would make sense for workloads where the average block is read less than once, 
but that's pretty uncommon and that scenario is not likely to make good usage 
of the block cache anyway.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, D447.1.patch, D447.10.patch, 
> D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, D447.15.patch, 
> D447.16.patch, D447.17.patch, D447.2.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-03 Thread Mikhail Bautin (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179161#comment-13179161
 ] 

Mikhail Bautin commented on HBASE-4218:
---

Here is another update after discussing this with Jerry. Actually, the real 
value of in-cache-only encoding for us is that if we can get a benefit of data 
block encoding in production faster without risking data corruption, so we 
still want to support that option. This benefit should come from being able to 
put more stuff in cache, and (based on Jacek's experiments, I haven't confirmed 
this myself) from faster encoded scanners. We really need to make sure that we 
don't go through encoding/decoding on compactions when in-cache-only encoding 
is enabled, though.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, D447.1.patch, D447.10.patch, 
> D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, D447.15.patch, 
> D447.16.patch, D447.17.patch, D447.2.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-03 Thread Mikhail Bautin (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179153#comment-13179153
 ] 

Mikhail Bautin commented on HBASE-4218:
---

@Matt: what do you call "the settings UI"? I thought HColumnDescriptor was part 
of the user-visible API, and if we allowed more flexible options there, we 
would have to fully support them everywhere. 

On the performance issue: HBase is IO-bound for most production workloads, so 
if we can fit more data into cache, we should get a performance win. Jacek 
reported that encoded scanners were faster in his experiments, and if they are 
not, we should optimize them or disable prefix compression for that particular 
workload. In a CPU-bound situation, one reason encoded scanners could be slower 
is that the data does not compress well, so delta encoding introduces an 
unnecessary CPU overhead and does not really save any space in cache. For that 
type of workload, using prefix compression probably is not the right thing to 
do.

Could you please share some more details about the workload in your test? Is it 
CPU-bound or IO-bound? Is it similar to your envisioned use case for data block 
encoding? Are you planning to use the PREFIX algorithm or your trie 
implementation? Does the trie algorithm have the same encoded scanner 
performance problem?
 
@Lars, Matt:
"We have all the framework in place" and "features or already working code" are 
relative concepts. The framework still needs to be tweaked to (1) support all 
real use cases people have in mind; and (2) allow to solidify the existing 
implementation and test it really well. Jacek's original patch did not handle 
switching data block encoding settings in the column family, and I am in the 
process of modifying the patch to support that. The more flexibility we allow 
for column family encoding configuration, the more cases we have to test, and 
the more exotic edge cases we get.

A couple more notes on supporting switching data block encoding column family 
settings. Kannan and I discussed this, and we came up with a plan for allowing 
a seamless migration to a new data block encoding. Blocks read from existing 
HFiles will still be brought into cache using their original encoding, and we 
will allow storing a mixture of different data block encodings in the cache. 
The new encoding configuration will only be applied on flushes and compactions. 
This is similar to the seamless HFile format upgrade that we have already done 
successfully.

Another possible way to simplify things even further could be to get rid of the 
ENCODE_IN_CACHE_ONLY option completely. We introduced it for testing, but it 
seems to be causing more trouble than it is worth, and actually slows down 
patch stabilization and testing. Such "test-mode" encoding would require extra 
care to avoid using encoding during compactions, because that could actually 
corrupt on-disk data. I think a better way would be to add more unit tests for 
various edge cases and transitions for simplified configuration options, and do 
more synthetic load testing with those. For dark launch cluster it is always 
possible to take a backup and roll back if a data corruption happens. I still 
need to discuss that option with Kannan and the rest of our team, but please 
let me know what you think.


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, D447.1.patch, D447.10.patch, 
> D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, D447.15.patch, 
> D447.16.patch, D447.17.patch, D447.2.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key len

[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-03 Thread Matt Corgan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179130#comment-13179130
 ] 

Matt Corgan commented on HBASE-4218:


Yes, i think i used the most recent version.  I don't have the code readily 
available, but can check into it tonight.  

My main concern from this morning was that the modified settings hid features 
of already working code (like Lars mentioned) while not really simplifying 
things too much.  I guess the big problem with having the separate ON_DISK and 
IN_MEMORY settings is that a user would have to change both of them 
simultaneously, which is not obvious to a new user.

One option could be to persist the ENCODING_ON_DISK and ENCODING_IN_MEMORY 
separately in the HColumnDescriptor no matter what we put in the settings UI.  
That way we have the ability to change the user facing settings in the future 
without having to go through the painful process of versioning the 
HTableDescriptor (i'm not even sure how that works behind the scenes).  If we 
did that, I think the simplest setting we could expose to the user would just 
be ENCODING, and that would set both of the persistent variables to the same 
thing.

i hate to overthink it - just might be hard to change once it's in place

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, D447.1.patch, D447.10.patch, 
> D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, D447.15.patch, 
> D447.16.patch, D447.17.patch, D447.2.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-03 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179104#comment-13179104
 ] 

Zhihong Yu commented on HBASE-4218:
---

@Matt:
For clarification, did you use recent version of PrefixKeyDeltaEncoder for the 
scan performance evaluation ?

I think it is natural for different encoders to show different scan performance.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, D447.1.patch, D447.10.patch, 
> D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, D447.15.patch, 
> D447.16.patch, D447.17.patch, D447.2.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-03 Thread Lars Hofhansl (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179057#comment-13179057
 ] 

Lars Hofhansl commented on HBASE-4218:
--

+1 on avoid different encoding on disk vs cache.
However, since we have all this framework in place, why not also allow it for 
disk only encoding.
It is in principle different from the current block based compression, as it 
can easily take the shape of KeyValues into account.

Could we have ENCODING, ENCODE_IN_CACHE, and ENCODE_ON_DISK?

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, D447.1.patch, D447.10.patch, 
> D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, D447.15.patch, 
> D447.16.patch, D447.17.patch, D447.2.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-03 Thread Matt Corgan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179054#comment-13179054
 ] 

Matt Corgan commented on HBASE-4218:


Interesting... the testing i've been doing shows the delta algorithms to be 
about half as fast at scanning and seeking than the NONE encoding, which is why 
I was thinking you'd possibly want the opposite setting (encoded on disk, 
decoded in memory).  I'll look at my benchmark again to see if i can figure out 
the discrepancy.

I don't have a strong opinion either way as i'll probably always run with the 
same encoding on disk and in memory.  Was mostly curious.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, D447.1.patch, D447.10.patch, 
> D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, D447.15.patch, 
> D447.16.patch, D447.17.patch, D447.2.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-03 Thread stack (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179045#comment-13179045
 ] 

stack commented on HBASE-4218:
--

bq. the problem with the previous settings was that they were too flexible, and 
allowed for different encodings in cache and in memory.

+1 on removing options if they can make the system seem more complicated.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, D447.1.patch, D447.10.patch, 
> D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, D447.15.patch, 
> D447.16.patch, D447.17.patch, D447.2.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-03 Thread Mikhail Bautin (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179036#comment-13179036
 ] 

Mikhail Bautin commented on HBASE-4218:
---

@Matt, Ted: the problem with the previous settings was that they were too 
flexible, and allowed for different encodings in cache and in memory. We 
definitely don't want to re-encode a block using a different encoding algorithm 
after loading it from disk. After a discussion with Kannan we decided that the 
whole benefit of delta encoding is in encoded scanners and allowing to put more 
data into cache. If we want to use a compression algorithm on disk but not in 
cache, it is possible to implement that using the existing compression 
framework. Furthermore, Jacek found in his experiments that encoded scanners 
were actually faster than scanners on decoded blocks. Please let me know what 
use case you have in mind that would require storing decoded blocks in cache 
and would not allow for efficient scanning over encoded blocks.


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, D447.1.patch, D447.10.patch, 
> D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, D447.15.patch, 
> D447.16.patch, D447.17.patch, D447.2.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-03 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13179028#comment-13179028
 ] 

Zhihong Yu commented on HBASE-4218:
---

Reading JIRA description again, it clearly states the goal for this feature:
bq. It aims to save memory in cache as well as speeding seeks within 
HFileBlocks.

It is also evident in javadoc:
{code}
* @return the data block encoding algorithm used in block cache and
* optionally on disk
*/
public DataBlockEncoding getDataBlockEncoding() {
{code}
Matt's interpretation is reasonable I think.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, D447.1.patch, D447.10.patch, 
> D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, D447.15.patch, 
> D447.16.patch, D447.17.patch, D447.2.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-03 Thread Matt Corgan (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178991#comment-13178991
 ] 

Matt Corgan commented on HBASE-4218:


Mikhail, can you explain the thinking behind the ENCODE_IN_CACHE_ONLY setting, 
as opposed to the previous ENCODING_IN_MEMORY setting?  I can't think of a 
scenario where you'd want to store unencoded values on disk and encode them 
every time you load a block into memory.  (Would that be for better compression 
ratios?)  I'd actually think it more likely to have encoded blocks on disk and 
decode them in memory for faster scans/seeks.

Anyway, I just thought the separate ENCODING_ON_DISK, and ENCODING_IN_MEMORY 
settings were not too complicated, and they had the added benefit of letting 
you encode on disk only.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, D447.1.patch, D447.10.patch, 
> D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, D447.15.patch, 
> D447.16.patch, D447.17.patch, D447.2.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-03 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178978#comment-13178978
 ] 

Hadoop QA commented on HBASE-4218:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12509333/D447.17.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 111 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/654//console

This message is automatically generated.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, D447.1.patch, D447.10.patch, 
> D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, D447.15.patch, 
> D447.16.patch, D447.17.patch, D447.2.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-01 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178290#comment-13178290
 ] 

Hadoop QA commented on HBASE-4218:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12509029/4218-v16.txt
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 104 new or modified tests.

-1 javadoc.  The javadoc tool appears to have generated -138 warning 
messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 82 new Findbugs (version 
1.3.9) warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.io.hfile.TestHFileBlock
  org.apache.hadoop.hbase.mapreduce.TestImportTsv
  org.apache.hadoop.hbase.mapred.TestTableMapReduce
  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/650//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/650//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/650//console

This message is automatically generated.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, 4218-v16.txt, D447.1.patch, D447.10.patch, 
> D447.11.patch, D447.12.patch, D447.13.patch, D447.14.patch, D447.15.patch, 
> D447.16.patch, D447.2.patch, D447.3.patch, D447.4.patch, D447.5.patch, 
> D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-01 Thread Phabricator (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178282#comment-13178282
 ] 

Phabricator commented on HBASE-4218:


tedyu has commented on the revision "[jira] [HBASE-4218] HFile data block 
encoding framework and delta encoding implementation".

  Thanks for the nice work, Mikhail.

INLINE COMMENTS
  
src/main/java/org/apache/hadoop/hbase/io/encoding/BitsetKeyDeltaEncoder.java:48 
Good.
  
src/main/java/org/apache/hadoop/hbase/io/encoding/BufferedDataBlockEncoder.java:238
 Wonderful.
  src/main/java/org/apache/hadoop/hbase/io/hfile/BlockType.java:115 Should read 
'they work exactly the same'

REVISION DETAIL
  https://reviews.facebook.net/D447


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, D447.1.patch, D447.10.patch, D447.11.patch, 
> D447.12.patch, D447.13.patch, D447.14.patch, D447.15.patch, D447.16.patch, 
> D447.2.patch, D447.3.patch, D447.4.patch, D447.5.patch, D447.6.patch, 
> D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2012-01-01 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178280#comment-13178280
 ] 

Hadoop QA commented on HBASE-4218:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12509028/D447.16.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 104 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/649//console

This message is automatically generated.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, D447.1.patch, D447.10.patch, D447.11.patch, 
> D447.12.patch, D447.13.patch, D447.14.patch, D447.15.patch, D447.16.patch, 
> D447.2.patch, D447.3.patch, D447.4.patch, D447.5.patch, D447.6.patch, 
> D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2011-12-28 Thread Mikhail Bautin (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176976#comment-13176976
 ] 

Mikhail Bautin commented on HBASE-4218:
---

Just a quick note from an offline conversation with Kannan: we need to support 
modifying data block encoding column family settings. In the most recent 
version of the patch 
(https://reviews.facebook.net/D447?vs=&id=3237&whitespace=ignore-all) there are 
the following user-facing column family settings:

* DATA_BLOCK_ENCODING - specifies data block encoding type or NONE
* ENCODE_IN_CACHE_ONLY - boolean (false by default). If true, data blocks are 
only encoded in cache but not on disk

We removed the "encoded scanner" flag, and we use encoded scanners by default 
any time we use data block encoding.

Given the above column family settings, we need to unit-test at least the 
following transitions:
# Switching from no data block encoding to a data block encoding everywhere, 
and vice versa
# Switching from no data block encoding to a data block encoding in cache only, 
and vice versa
# Flipping the "in cache only" flag but keeping the data block encoding type 
the same
# Switching from one data block encoding everywhere to another one
# Switching from one data block encoding in cache only to another one
# Switching to a different data block encoding and flipping the "in cache only" 
flag.


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, D447.1.patch, D447.10.patch, D447.11.patch, 
> D447.12.patch, D447.13.patch, D447.14.patch, D447.15.patch, D447.2.patch, 
> D447.3.patch, D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, 
> D447.8.patch, D447.9.patch, Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2011-12-28 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176974#comment-13176974
 ] 

Hadoop QA commented on HBASE-4218:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12508818/D447.15.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 68 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/621//console

This message is automatically generated.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, D447.1.patch, D447.10.patch, D447.11.patch, 
> D447.12.patch, D447.13.patch, D447.14.patch, D447.15.patch, D447.2.patch, 
> D447.3.patch, D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, 
> D447.8.patch, D447.9.patch, Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2011-12-28 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176873#comment-13176873
 ] 

Hadoop QA commented on HBASE-4218:
--

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12508801/D447.14.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 68 new or modified tests.

-1 patch.  The patch command could not apply the patch.

Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/618//console

This message is automatically generated.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, D447.1.patch, D447.10.patch, D447.11.patch, 
> D447.12.patch, D447.13.patch, D447.14.patch, D447.2.patch, D447.3.patch, 
> D447.4.patch, D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, 
> D447.9.patch, Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2011-12-28 Thread Phabricator (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176709#comment-13176709
 ] 

Phabricator commented on HBASE-4218:


mcorgan has commented on the revision "[jira] [HBASE-4218] HFile data block 
encoding (delta encoding)".

  I'm porting the TRIE encoding algorithm over to this new patch, so am able to 
review a little better in eclipse than on review board.  Couple things I've 
noticed so far:

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncodings.java:32 
The enum nested in a class is unusual.  Would a more typical approach be to 
call it DataBlockEncoding (singular) and make that the enum, eliminating the 
nested "Algorithm"?

  So you would have DataBlockEncoding.BITSET, etc.

  This would help elsewhere in the codebase since it will eliminate the 
confusion with the unfortunately named compression "Algorithm" (GZIP, LZO)
  
src/main/java/org/apache/hadoop/hbase/io/encoding/BufferedDataBlockEncoder.java:121
 This method was added before getKeyValueObject(), so I see why it happened 
this way, but this method should probably be called getKeyValueBuffer() or 
getKeyValueByteBuffer(), and the below method should be called getKeyValue()
  
src/main/java/org/apache/hadoop/hbase/io/encoding/BufferedDataBlockEncoder.java:134
 rename to getKeyValue()

REVISION DETAIL
  https://reviews.facebook.net/D447


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, D447.1.patch, D447.10.patch, D447.11.patch, 
> D447.12.patch, D447.13.patch, D447.2.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2011-12-25 Thread Phabricator (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175880#comment-13175880
 ] 

Phabricator commented on HBASE-4218:


tedyu has commented on the revision "[jira] [HBASE-4218] HFile data block 
encoding (delta encoding)".

INLINE COMMENTS
  src/main/java/org/apache/hadoop/hbase/regionserver/Store.java:294 This method 
should be made package private.

REVISION DETAIL
  https://reviews.facebook.net/D447


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, D447.1.patch, D447.10.patch, D447.11.patch, 
> D447.12.patch, D447.13.patch, D447.2.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2011-12-24 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175808#comment-13175808
 ] 

Hudson commented on HBASE-4218:
---

Integrated in HBase-TRUNK-security #48 (See 
[https://builds.apache.org/job/HBase-TRUNK-security/48/])
HBASE-4218 Data Block Encoding of KeyValues - revert, problems were 
uncovered during cluster testing

tedyu : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/HColumnDescriptor.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/KeyValue.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/HalfStoreFileReader.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/BitsetKeyDeltaEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/BufferedDataBlockEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/CompressionState.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/CopyKeyDataBlockEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncodings.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/DiffKeyDeltaEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/EncodedDataBlock.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/EncoderBufferTooSmallException.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/FastDiffDeltaEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/PrefixKeyDeltaEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileReader.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileWriter.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockType.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileDataBlockEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileDataBlockEncoderImpl.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFilePrettyPrinter.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV1.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV1.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/LruBlockCache.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/NoOpDataBlockEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/MemStore.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/metrics/SchemaConfigured.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/util/ByteBufferUtils.java
* /hbase/trunk/src/main/ruby/hbase/admin.rb
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/HBaseTestCase.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/HFilePerformanceEvaluation.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/client/TestFromClientSide.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/TestHalfStoreFileReader.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/TestHeapSize.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/encoding/RedundantKVGenerator.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/encoding/TestBufferedDataBlockEncoder.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/encoding/TestDataBlockEncoders.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/hfile/CacheTestUtils.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/hfile/TestCacheOnWrite.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileBlock.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileBlockIndex.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileDataBlockEncoder.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileWriterV2.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportExport.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/CreateRandomStoreFile.java
* 
/hbase/tru

[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2011-12-24 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175805#comment-13175805
 ] 

Hudson commented on HBASE-4218:
---

Integrated in HBase-TRUNK #2576 (See 
[https://builds.apache.org/job/HBase-TRUNK/2576/])
HBASE-4218 Data Block Encoding of KeyValues - revert, problems were 
uncovered during cluster testing

tedyu : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/HColumnDescriptor.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/KeyValue.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/HalfStoreFileReader.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/BitsetKeyDeltaEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/BufferedDataBlockEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/CompressionState.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/CopyKeyDataBlockEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncodings.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/DiffKeyDeltaEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/EncodedDataBlock.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/EncoderBufferTooSmallException.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/FastDiffDeltaEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/PrefixKeyDeltaEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileReader.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileWriter.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockType.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileDataBlockEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileDataBlockEncoderImpl.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFilePrettyPrinter.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV1.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV1.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/LruBlockCache.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/NoOpDataBlockEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/MemStore.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/metrics/SchemaConfigured.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/util/ByteBufferUtils.java
* /hbase/trunk/src/main/ruby/hbase/admin.rb
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/HBaseTestCase.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/HFilePerformanceEvaluation.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/client/TestFromClientSide.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/TestHalfStoreFileReader.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/TestHeapSize.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/encoding/RedundantKVGenerator.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/encoding/TestBufferedDataBlockEncoder.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/encoding/TestDataBlockEncoders.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/hfile/CacheTestUtils.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/hfile/TestCacheOnWrite.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileBlock.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileBlockIndex.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileDataBlockEncoder.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileWriterV2.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/mapreduce/TestImportExport.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/regionserver/CreateRandomStoreFile.java
* 
/hbase/trunk/src/test/ja

[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2011-12-24 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175794#comment-13175794
 ] 

Hudson commented on HBASE-4218:
---

Integrated in HBase-TRUNK-security #47 (See 
[https://builds.apache.org/job/HBase-TRUNK-security/47/])
HBASE-4218 Data Block Encoding of KeyValues (aka delta encoding / prefix 
compression) - files used for testing
HBASE-4218 Data Block Encoding of KeyValues (aka delta encoding / prefix 
compression) (Jacek, Mikhail)

tedyu : 
Files : 
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/encoding
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/encoding/RedundantKVGenerator.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/encoding/TestBufferedDataBlockEncoder.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/encoding/TestDataBlockEncoders.java

tedyu : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/HColumnDescriptor.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/KeyValue.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/HalfStoreFileReader.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/BitsetKeyDeltaEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/BufferedDataBlockEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/CompressionState.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/CopyKeyDataBlockEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncodings.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/DiffKeyDeltaEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/EncodedDataBlock.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/EncoderBufferTooSmallException.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/FastDiffDeltaEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/PrefixKeyDeltaEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileReader.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileWriter.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockType.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileDataBlockEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileDataBlockEncoderImpl.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFilePrettyPrinter.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV1.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV1.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/LruBlockCache.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/NoOpDataBlockEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/MemStore.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/metrics/SchemaConfigured.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/util/ByteBufferUtils.java
* /hbase/trunk/src/main/ruby/hbase/admin.rb
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/HBaseTestCase.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/HFilePerformanceEvaluation.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/client/TestFromClientSide.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/TestHalfStoreFileReader.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/TestHeapSize.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/hfile/CacheTestUtils.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/hfile/TestCacheOnWrite.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileBlock.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileBlockIndex.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileDataBlockEncoder.java
* 
/hbase/trunk/

[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2011-12-24 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175792#comment-13175792
 ] 

Zhihong Yu commented on HBASE-4218:
---

Patch reverted off TRUNK.

Waiting for the problems uncovered in cluster testing to be fixed.

Also, TestHFileBlock keeps failing.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, D447.1.patch, D447.10.patch, D447.11.patch, 
> D447.12.patch, D447.13.patch, D447.2.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2011-12-24 Thread Mikhail Bautin (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175791#comment-13175791
 ] 

Mikhail Bautin commented on HBASE-4218:
---

@Ted: could you please revert the patch for now? It is not ready yet (sorry for 
not indicating this clearly, I will let you know when it's good to go). Even 
though it passes all unit tests, on Thursday I uncovered bugs in data block 
encoding handling during cluster testing. A simple load test with delta 
encoding turned on fails as soon as the first store file is written out. I am 
not sure if Jacek did this kind of testing during his internship, or if this is 
a new problem that I introduced while iterating on the patch. Furthermore, 
there is a design problem related to changing the encoding algorithm for an 
existing CF: if an encoded block has different encoding than what's configured 
by the CF, an assertion is thrown. These issues should not be that difficult to 
fix, though, and I still think the patch is very close to being finished.


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, D447.1.patch, D447.10.patch, D447.11.patch, 
> D447.12.patch, D447.13.patch, D447.2.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2011-12-24 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175787#comment-13175787
 ] 

Zhihong Yu commented on HBASE-4218:
---

About TRUNK build #2574

java.lang.OutOfMemoryError in:
https://builds.apache.org/view/G-L/view/HBase/job/HBase-TRUNK/lastCompletedBuild/testReport/org.apache.hadoop.hbase.io.hfile/TestHFileBlock/testPreviousOffset_1_/
and
https://builds.apache.org/view/G-L/view/HBase/job/HBase-TRUNK/lastCompletedBuild/testReport/org.apache.hadoop.hbase.io.hfile/TestHFileBlock/testConcurrentReading_1_/

I ran TestHFileBlock on MacBook and didn't reproduce any of the errors.


> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, D447.1.patch, D447.10.patch, D447.11.patch, 
> D447.12.patch, D447.13.patch, D447.2.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2011-12-24 Thread Hudson (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175772#comment-13175772
 ] 

Hudson commented on HBASE-4218:
---

Integrated in HBase-TRUNK #2573 (See 
[https://builds.apache.org/job/HBase-TRUNK/2573/])
HBASE-4218 Data Block Encoding of KeyValues (aka delta encoding / prefix 
compression) - files used for testing
HBASE-4218 Data Block Encoding of KeyValues (aka delta encoding / prefix 
compression) (Jacek, Mikhail)

tedyu : 
Files : 
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/encoding
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/encoding/RedundantKVGenerator.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/encoding/TestBufferedDataBlockEncoder.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/encoding/TestDataBlockEncoders.java

tedyu : 
Files : 
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/HColumnDescriptor.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/KeyValue.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/HalfStoreFileReader.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/BitsetKeyDeltaEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/BufferedDataBlockEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/CompressionState.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/CopyKeyDataBlockEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/DataBlockEncodings.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/DiffKeyDeltaEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/EncodedDataBlock.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/EncoderBufferTooSmallException.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/FastDiffDeltaEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/encoding/PrefixKeyDeltaEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileReader.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/AbstractHFileWriter.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/BlockType.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFile.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlockIndex.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileDataBlockEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileDataBlockEncoderImpl.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFilePrettyPrinter.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV1.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileReaderV2.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV1.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileWriterV2.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/LruBlockCache.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/io/hfile/NoOpDataBlockEncoder.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/mapreduce/LoadIncrementalHFiles.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/MemStore.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/Store.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFile.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/StoreFileScanner.java
* 
/hbase/trunk/src/main/java/org/apache/hadoop/hbase/regionserver/metrics/SchemaConfigured.java
* /hbase/trunk/src/main/java/org/apache/hadoop/hbase/util/ByteBufferUtils.java
* /hbase/trunk/src/main/ruby/hbase/admin.rb
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/HBaseTestCase.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/HFilePerformanceEvaluation.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/client/TestFromClientSide.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/TestHalfStoreFileReader.java
* /hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/TestHeapSize.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/hfile/CacheTestUtils.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/hfile/TestCacheOnWrite.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileBlock.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileBlockIndex.java
* 
/hbase/trunk/src/test/java/org/apache/hadoop/hbase/io/hfile/TestHFileDataBlockEncoder.java
* 
/hbase/trunk/src/test/java/

[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2011-12-24 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175770#comment-13175770
 ] 

Zhihong Yu commented on HBASE-4218:
---

Integrated to TRUNK

Thanks for the awesome work, Jacek.

Thanks for the persistence to finish this feature, Mikhail.

Thanks for the suggestions, Matt. 

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, D447.1.patch, D447.10.patch, D447.11.patch, 
> D447.12.patch, D447.13.patch, D447.2.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2011-12-24 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175748#comment-13175748
 ] 

Zhihong Yu commented on HBASE-4218:
---

Large tests passed as well (TestZooKeeper passed when run standalone).

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, D447.1.patch, D447.10.patch, D447.11.patch, 
> D447.12.patch, D447.13.patch, D447.2.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2011-12-24 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175742#comment-13175742
 ] 

Zhihong Yu commented on HBASE-4218:
---

Small and medium tests passed on Mac:
{code}
Tests run: 551, Failures: 0, Errors: 0, Skipped: 1

[INFO] 
[INFO] BUILD SUCCESS
[INFO] 
[INFO] Total time: 39:54.323s
{code}
Running large tests.

Will integrate if large tests pass.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, D447.1.patch, D447.10.patch, D447.11.patch, 
> D447.12.patch, D447.13.patch, D447.2.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2011-12-23 Thread Zhihong Yu (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175672#comment-13175672
 ] 

Zhihong Yu commented on HBASE-4218:
---

Only 739 tests were executed, due to:
{code}
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 32756 bytes for 
ChunkPool::allocate
# An error report file with more information is saved as:
# 
/home/jenkins/jenkins-slave/workspace/PreCommit-HBASE-Build/trunk/hs_err_pid20773.log
Aborted
{code}

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, D447.1.patch, D447.10.patch, D447.11.patch, 
> D447.12.patch, D447.13.patch, D447.2.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Commented] (HBASE-4218) Data Block Encoding of KeyValues (aka delta encoding / prefix compression)

2011-12-23 Thread Hadoop QA (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/HBASE-4218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13175633#comment-13175633
 ] 

Hadoop QA commented on HBASE-4218:
--

-1 overall.  Here are the results of testing the latest attachment 
  
http://issues.apache.org/jira/secure/attachment/12508576/Data-block-encoding-2011-12-23.patch
  against trunk revision .

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 92 new or modified tests.

-1 javadoc.  The javadoc tool appears to have generated -142 warning 
messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

-1 findbugs.  The patch appears to introduce 81 new Findbugs (version 
1.3.9) warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

 -1 core tests.  The patch failed these unit tests:
   org.apache.hadoop.hbase.mapred.TestTableMapReduce
  org.apache.hadoop.hbase.mapreduce.TestHFileOutputFormat

Test results: 
https://builds.apache.org/job/PreCommit-HBASE-Build/592//testReport/
Findbugs warnings: 
https://builds.apache.org/job/PreCommit-HBASE-Build/592//artifact/trunk/patchprocess/newPatchFindbugsWarnings.html
Console output: https://builds.apache.org/job/PreCommit-HBASE-Build/592//console

This message is automatically generated.

> Data Block Encoding of KeyValues  (aka delta encoding / prefix compression)
> ---
>
> Key: HBASE-4218
> URL: https://issues.apache.org/jira/browse/HBASE-4218
> Project: HBase
>  Issue Type: Improvement
>  Components: io
>Affects Versions: 0.94.0
>Reporter: Jacek Migdal
>Assignee: Mikhail Bautin
>  Labels: compression
> Fix For: 0.94.0
>
> Attachments: 0001-Delta-encoding-fixed-encoded-scanners.patch, 
> 0001-Delta-encoding.patch, D447.1.patch, D447.10.patch, D447.11.patch, 
> D447.12.patch, D447.13.patch, D447.2.patch, D447.3.patch, D447.4.patch, 
> D447.5.patch, D447.6.patch, D447.7.patch, D447.8.patch, D447.9.patch, 
> Data-block-encoding-2011-12-23.patch, 
> Delta-encoding.patch-2011-12-22_11_52_07.patch, 
> Delta_encoding_with_memstore_TS.patch, open-source.diff
>
>
> A compression for keys. Keys are sorted in HFile and they are usually very 
> similar. Because of that, it is possible to design better compression than 
> general purpose algorithms,
> It is an additional step designed to be used in memory. It aims to save 
> memory in cache as well as speeding seeks within HFileBlocks. It should 
> improve performance a lot, if key lengths are larger than value lengths. For 
> example, it makes a lot of sense to use it when value is a counter.
> Initial tests on real data (key length = ~ 90 bytes , value length = 8 bytes) 
> shows that I could achieve decent level of compression:
>  key compression ratio: 92%
>  total compression ratio: 85%
>  LZO on the same data: 85%
>  LZO after delta encoding: 91%
> While having much better performance (20-80% faster decompression ratio than 
> LZO). Moreover, it should allow far more efficient seeking which should 
> improve performance a bit.
> It seems that a simple compression algorithms are good enough. Most of the 
> savings are due to prefix compression, int128 encoding, timestamp diffs and 
> bitfields to avoid duplication. That way, comparisons of compressed data can 
> be much faster than a byte comparator (thanks to prefix compression and 
> bitfields).
> In order to implement it in HBase two important changes in design will be 
> needed:
> -solidify interface to HFileBlock / HFileReader Scanner to provide seeking 
> and iterating; access to uncompressed buffer in HFileBlock will have bad 
> performance
> -extend comparators to support comparison assuming that N first bytes are 
> equal (or some fields are equal)
> Link to a discussion about something similar:
> http://search-hadoop.com/m/5aqGXJEnaD1/hbase+windows&subj=Re+prefix+compression

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




  1   2   >