[ 
https://issues.apache.org/jira/browse/HBASE-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229921#comment-13229921
 ] 

Li Pi commented on HBASE-4608:
------------------------------

Regarding the other compression approaches: I looked into those.

Plugging into LZMA was the first thing I thought about doing, but its 
performance rules it out.

There are other optimizations we can make, such as modifying the dictionary to 
take frequency into account, assigning the highest-probability entries to 
the lowest numbers, and then using vints rather than 2 bytes for everything. Note 
that we shouldn't expect to beat LZMA, because we neither compress values nor 
compress the SequenceFile overhead. On some workloads, those overheads 
might be substantial - although I haven't checked.
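
As a rough illustration of the frequency idea above, here is a minimal sketch 
and not code from the patch. The class name FrequencyRankedDictionary and its 
methods are hypothetical, and the varint layout is a generic 7-bits-per-byte 
encoding chosen for simplicity, not necessarily Hadoop's vint format. It ranks 
entries by observed count so the most frequent entry gets index 0, which then 
fits in a single byte.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: frequency-ranked dictionary with varint-encoded indexes. */
final class FrequencyRankedDictionary {
    private final Map<String, Integer> indexByEntry = new HashMap<>();

    /** Builds the ranking from observed counts; the most frequent entry gets index 0. */
    FrequencyRankedDictionary(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        // Sort by descending count; break ties by key so the ranking is deterministic.
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        for (int i = 0; i < entries.size(); i++) {
            indexByEntry.put(entries.get(i).getKey(), i);
        }
    }

    /** Returns the rank of an entry, or throws if the entry is unknown. */
    int indexOf(String entry) {
        Integer index = indexByEntry.get(entry);
        if (index == null) {
            throw new IllegalArgumentException("not in dictionary: " + entry);
        }
        return index;
    }

    /** Encodes a non-negative int in 7-bit groups, low-order group first. */
    static byte[] encodeVarint(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("value must be non-negative");
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80); // high bit set: more bytes follow
            value >>>= 7;
        }
        out.write(value); // final byte has the high bit clear
        return out.toByteArray();
    }

    /** Decodes one varint produced by encodeVarint. */
    static int decodeVarint(byte[] bytes) {
        int result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= (b & 0x7F) << shift;
            shift += 7;
        }
        return result;
    }
}
```

With this encoding, indexes 0 through 127 take one byte and indexes up to 
16383 take two, so the scheme only pays off if the frequent entries really do 
land on the low indexes.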

This is actually pretty close to the challenge posed by caching, in that we 
want to keep the entries most likely to be repeated in our dictionary, and 
evict the rest. I used LRU because LRU was simple and, as with caching, pretty 
much anything gives a substantial performance increase over nothing.
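
To make the LRU idea concrete, here is a minimal sketch of a fixed-capacity 
dictionary that reuses the slot of the least recently used entry. It is an 
illustration only and not the code in the attached patches; the class name 
LruDictionarySketch, its methods, and the NOT_FOUND constant are all 
hypothetical. It relies on java.util.LinkedHashMap in access order to track 
recency.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: fixed-capacity dictionary with LRU slot reuse. */
final class LruDictionarySketch {
    static final int NOT_FOUND = -1;

    private final int capacity;
    // accessOrder=true: iteration runs from least to most recently used.
    private final LinkedHashMap<String, Integer> slotByEntry =
            new LinkedHashMap<>(16, 0.75f, true);

    LruDictionarySketch(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        this.capacity = capacity;
    }

    /** Returns the slot for an entry and marks it recently used, or NOT_FOUND. */
    int find(String entry) {
        Integer slot = slotByEntry.get(entry); // get() counts as an access
        return slot == null ? NOT_FOUND : slot;
    }

    /** Adds an entry, evicting the least recently used one if full; returns its slot. */
    int add(String entry) {
        Integer existing = slotByEntry.get(entry);
        if (existing != null) {
            return existing;
        }
        int slot;
        if (slotByEntry.size() < capacity) {
            slot = slotByEntry.size(); // next unused slot
        } else {
            // The first entry in iteration order is the least recently used.
            Map.Entry<String, Integer> eldest = slotByEntry.entrySet().iterator().next();
            slot = eldest.getValue();
            slotByEntry.remove(eldest.getKey());
        }
        slotByEntry.put(entry, slot);
        return slot;
    }
}
```

A writer would call find() first and fall back to add() plus the literal 
bytes on a miss. A reader would have to replay the same sequence of 
operations so that both sides agree on which slot holds which entry.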

I'm pretty happy with cutting the WAL size in half on optimal workloads, though 
as always, it's nice to work towards future performance goals. I have other 
ideas, but they involve changing the HLog substantially in order to be more 
compact. In that case, we might end up abandoning the Hadoop SequenceFile 
format altogether, and this thing becomes a bit more complex.
                
> HLog Compression
> ----------------
>
>                 Key: HBASE-4608
>                 URL: https://issues.apache.org/jira/browse/HBASE-4608
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Li Pi
>            Assignee: stack
>             Fix For: 0.94.0
>
>         Attachments: 4608-v19.txt, 4608-v20.txt, 4608-v22.txt, 4608v1.txt, 
> 4608v13.txt, 4608v13.txt, 4608v14.txt, 4608v15.txt, 4608v16.txt, 4608v17.txt, 
> 4608v18.txt, 4608v23.txt, 4608v24.txt, 4608v25.txt, 4608v27.txt, 4608v29.txt, 
> 4608v30.txt, 4608v5.txt, 4608v6.txt, 4608v7.txt, 4608v8fixed.txt, 
> hbase-4608-v28-delta.txt, hbase-4608-v28.txt, hbase-4608-v28.txt
>
>
> The current bottleneck to HBase write speed is replicating the WAL appends 
> across different datanodes. We can speed up this process by compressing the 
> HLog. Current plan involves using a dictionary to compress table name, region 
> id, cf name, and possibly other bits of repeated data. Also, HLog format may 
> be changed in other ways to produce a smaller HLog.


        

Reply via email to