[ https://issues.apache.org/jira/browse/HBASE-26353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17429504#comment-17429504 ]

Andrew Kyle Purtell commented on HBASE-26353:
---------------------------------------------

Here is the performance test result.

I wrote an integration test that simulates a location data tracking use case. 
It writes 10 million rows. Each row has a 64-bit random row key (not 
important) and one column family with four qualifiers: first name, last name, 
latitude (encoded as an integer with a scale of 3), and longitude (also 
encoded as an integer with a scale of 3). The details aren't really important, 
except to say that the character strings are short, corresponding to typical 
lengths for English first and last names, and that the two 32-bit integer 
values are generated with a zipfian distribution to reduce entropy and allow 
for potentially successful dictionary compression, but they are also short. 
When creating the table the IT specified a block size of 1K, which is perhaps 
not unreasonable for a heavily indexed use case with short values. I could 
have achieved a higher compression ratio if the row keys were sequential 
instead of completely random, but this is not really important.
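
For reference, the data generation looks roughly like the sketch below. This 
is a minimal sketch only, not the actual IT code; the table, family, and 
qualifier names, the name lists, the lack of batching, and the use of Commons 
Math's {{ZipfDistribution}} are placeholder choices on my part.
{noformat}
// Minimal sketch of the test data generator (not the actual IT code).
// Names, name lists, and the Zipf parameters are placeholders.
import java.util.Random;
import org.apache.commons.math3.distribution.ZipfDistribution;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class LocationDataLoader {
  static final byte[] CF = Bytes.toBytes("d");
  static final byte[] FIRST = Bytes.toBytes("fn");
  static final byte[] LAST = Bytes.toBytes("ln");
  static final byte[] LAT = Bytes.toBytes("lat");
  static final byte[] LON = Bytes.toBytes("lon");
  static final String[] FIRST_NAMES = { "James", "Mary", "Robert", "Patricia" };
  static final String[] LAST_NAMES = { "Smith", "Johnson", "Williams", "Brown" };

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Random rand = new Random();
    // Zipfian generators for the scaled lat/lon integers, to reduce entropy.
    // The element counts and exponent are arbitrary placeholders.
    ZipfDistribution latGen = new ZipfDistribution(180_000, 1.2);
    ZipfDistribution lonGen = new ZipfDistribution(360_000, 1.2);
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("mytable"))) {
      for (long i = 0; i < 10_000_000L; i++) {
        byte[] row = Bytes.toBytes(rand.nextLong());   // 64-bit random row key
        Put put = new Put(row);
        put.addColumn(CF, FIRST, Bytes.toBytes(pick(FIRST_NAMES, rand)));
        put.addColumn(CF, LAST, Bytes.toBytes(pick(LAST_NAMES, rand)));
        put.addColumn(CF, LAT, Bytes.toBytes(latGen.sample()));  // lat * 1000
        put.addColumn(CF, LON, Bytes.toBytes(lonGen.sample()));  // lon * 1000
        table.put(put);  // the real test would batch these
      }
    }
  }

  static String pick(String[] names, Random rand) {
    return names[rand.nextInt(names.length)];
  }
}
{noformat}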

I also wrote a simple utility that iterates over an HFile and saves each DATA 
or ENCODED_DATA block as a separate file elsewhere, just the block data. 
These files were used as the training set for {{zstd}}: I extracted 20,000 
blocks and trained a 1 MB dictionary from them. The parameters I used for 
training with {{zstd}} were basic and not especially tuned. I am not an 
expert in this aspect of ZStandard, so I can't estimate how much additional 
gain is possible.
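
The utility boils down to something like the sketch below. HFile block 
reading is an internal API that can change between versions, so treat the 
exact calls as assumptions; the iteration pattern is essentially the one 
{{HFilePrettyPrinter}} uses when dumping block headers.
{noformat}
// Rough sketch: dump the contents of each DATA / ENCODED_DATA block to a
// separate file for dictionary training. HFile internals are IA.Private and
// differ between HBase versions, so treat this as an approximation.
import java.io.FileOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.io.hfile.BlockType;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.io.hfile.HFileBlock;
import org.apache.hadoop.hbase.nio.ByteBuff;

public class HFileBlockDumper {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Path hfilePath = new Path(args[0]);   // path to an HFile
    String outDir = args[1];              // local directory for block files
    FileSystem fs = hfilePath.getFileSystem(conf);
    try (HFile.Reader reader =
        HFile.createReader(fs, hfilePath, new CacheConfig(conf), true, conf)) {
      long offset = reader.getTrailer().getFirstDataBlockOffset();
      long end = reader.getTrailer().getLastDataBlockOffset();
      int n = 0;
      while (offset <= end) {
        HFileBlock block = reader.readBlock(offset, -1, /* cacheBlock */ false,
          /* pread */ true, /* isCompaction */ false,
          /* updateCacheMetrics */ false, null, null);
        if (block.getBlockType() == BlockType.DATA
            || block.getBlockType() == BlockType.ENCODED_DATA) {
          // getBufferWithoutHeader() returns the uncompressed block payload
          // (a ByteBuff here; a ByteBuffer in older releases).
          ByteBuff buf = block.getBufferWithoutHeader();
          byte[] data = new byte[buf.remaining()];
          buf.get(data, 0, data.length);
          try (FileOutputStream out =
              new FileOutputStream(outDir + "/block_" + (n++))) {
            out.write(data);
          }
        }
        offset += block.getOnDiskSizeWithHeader();
      }
    }
  }
}
{noformat}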

The results demonstrate compression speed improvements as expected (a 22-33% 
improvement in compaction time at most of the tested levels), in line with 
the ZStandard documentation. They also demonstrate compression efficiency 
gains, especially in combination with higher levels, and specifying higher 
levels becomes more affordable because of the relative speedup at each level. 
Even this simple case shows meaningful gains, with potential for more when 
the training is done by someone with expert knowledge of ZStandard. It seems 
reasonable to support this feature.

 

*No Dictionary*
||Level||On-Disk Size (bytes)||Space Saved||Compaction Time (sec)||
|none|1,686,075,803|-|-|
|1|767,926,618|54.5%|42|
|3|756,427,617|55.1%|37|
|5|746,302,550|55.7%|48|
|6|744,741,449|55.8%|50|
|7|744,701,778|55.8%|54|
|12|731,150,341|56.6%|115|

 

*With Dictionary*
||Level||On-Disk Size (bytes)||Space Saved||Compaction Time (sec)||
|1|679,408,139|59.7%|28|
|3|652,587,956|61.3%|31|
|5|630,927,508|62.6%|37|
|6|632,251,996|62.5%|39|
|7|625,972,642|62.9%|56|
|12|626,293,580|62.9%|89|
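
For context on what a loadable dictionary means at the library level: 
hbase-compression-zstd is backed by zstd-jni, where a trained dictionary is 
loaded into the compression and decompression contexts roughly as sketched 
below. This is a minimal illustration of the zstd-jni API, not the actual 
codec code; the file names are placeholders.
{noformat}
// Minimal illustration of zstd-jni dictionary usage (not the HBase codec).
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import com.github.luben.zstd.ZstdCompressCtx;
import com.github.luben.zstd.ZstdDecompressCtx;

public class ZstdDictExample {
  public static void main(String[] args) throws Exception {
    byte[] dict = Files.readAllBytes(Paths.get("mytable.dict")); // trained dictionary
    byte[] input = Files.readAllBytes(Paths.get("some_block"));  // one data block

    ZstdCompressCtx cctx = new ZstdCompressCtx();
    cctx.setLevel(6);      // analogous to hbase.io.compress.zstd.level
    cctx.loadDict(dict);   // the same dictionary must be used on both sides
    byte[] compressed = cctx.compress(input);

    ZstdDecompressCtx dctx = new ZstdDecompressCtx();
    dctx.loadDict(dict);   // without the dictionary, decompression fails
    byte[] restored = dctx.decompress(compressed, input.length);

    System.out.println("original=" + input.length
        + " compressed=" + compressed.length
        + " roundtrip ok=" + Arrays.equals(input, restored));
  }
}
{noformat}
The decompression side illustrates the caveat in the issue description: data 
compressed with a dictionary cannot be read back if the dictionary is lost.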

> Support loadable dictionaries in hbase-compression-zstd
> -------------------------------------------------------
>
>                 Key: HBASE-26353
>                 URL: https://issues.apache.org/jira/browse/HBASE-26353
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Andrew Kyle Purtell
>            Assignee: Andrew Kyle Purtell
>            Priority: Minor
>             Fix For: 2.5.0, 3.0.0-alpha-2
>
>
> ZStandard supports initialization of compressors and decompressors with a 
> precomputed dictionary, which can dramatically improve and speed up 
> compression of tables with small values. For more details, please see [The 
> Case For Small Data 
> Compression|https://github.com/facebook/zstd#the-case-for-small-data-compression].
>  
> If a table is going to have a lot of small values and the user can put 
> together a representative set of files that can be used to train a dictionary 
> for compressing those values, a dictionary can be trained with the {{zstd}} 
> command line utility, available in any zstandard package for your favorite OS:
> Training:
> {noformat}
> $ zstd --maxdict=1126400 --train-fastcover=shrink \
>     -o mytable.dict training_files/*
> Trying 82 different sets of parameters
> ...
> k=674                                      
> d=8
> f=20
> steps=40
> split=75
> accel=1
> Save dictionary of size 1126400 into file mytable.dict
> {noformat}
> Deploy the dictionary file to HDFS or S3, etc.
> Create the table:
> {noformat}
> hbase> create "mytable", 
>   ... ,
>   CONFIGURATION => {
>     'hbase.io.compress.zstd.level' => '6',
>     'hbase.io.compress.zstd.dictionary' => true,
>     'hbase.io.compress.zstd.dictionary.file' => 'hdfs://nn/zdicts/mytable.dict'
>   }
> {noformat}
> Now start storing data. Compression results even for small values will be 
> excellent.
> Note: Beware, if the dictionary is lost, the data will not be decompressible.


