[ https://issues.apache.org/jira/browse/HBASE-4608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229921#comment-13229921 ]
Li Pi commented on HBASE-4608:
------------------------------
On the other compression options: I looked into those.
Plugging into LZMA was the first thing I thought about doing, but its
performance rules that one out.
There are other optimizations we can make, such as ordering the dictionary by
frequency so that the highest-probability entries get the lowest numbers, then
using vints rather than 2 bytes for everything. Note that we shouldn't expect
to beat LZMA, because we compress neither the values nor the SequenceFile
overhead. On some workloads, those overheads might be substantial, although I
haven't checked.
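A minimal sketch of the frequency-ordering idea, under stated assumptions: it
is not code from the attached patches, the class and method names
(FrequencyDictSketch, buildIndex, writeVInt, encodedSize) are hypothetical, and
the variable-length encoding is a generic base-128 varint rather than Hadoop's
own vint format. With this encoding, indexes 0 through 127 take one byte and
indexes up to 16383 take two, so giving the most frequent entries the lowest
numbers means the common case costs less than a fixed 2 bytes.

```java
import java.io.ByteArrayOutputStream;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; names and encoding are assumptions, not the HBASE-4608 patch. */
class FrequencyDictSketch {

    /** Writes a non-negative int as a base-128 varint: 7 payload bits per byte,
     *  high bit set on every byte except the last. */
    static void writeVInt(ByteArrayOutputStream out, int value) {
        while ((value & ~0x7F) != 0) {
            out.write((value & 0x7F) | 0x80);
            value >>>= 7;
        }
        out.write(value);
    }

    /** Assigns index 0 to the most frequent entry, 1 to the next, and so on.
     *  Ties are broken by the entry text so the result is deterministic. */
    static Map<String, Integer> buildIndex(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        Map<String, Integer> index = new HashMap<>();
        for (int i = 0; i < entries.size(); i++) {
            index.put(entries.get(i).getKey(), i);
        }
        return index;
    }

    /** Number of bytes the varint encoding of the given index occupies. */
    static int encodedSize(int index) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeVInt(out, index);
        return out.size();
    }
}
```

The trade-off is that the frequencies have to be known or estimated before
indexes are assigned, which a purely streaming dictionary does not need.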
This is actually pretty close to the problem posed by caching, in that we want
to keep the entries most likely to be repeated in our dictionary, and evict
the rest. I used LRU because LRU was simple and, as with caching, pretty much
anything results in a substantial performance increase over nothing.
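The LRU eviction described above can be sketched roughly as follows. This is
an illustration only, assuming a dictionary of strings built on
java.util.LinkedHashMap in access order; the class LruDictSketch and its
methods findEntry and addEntry are hypothetical names and do not reproduce the
code in the attached patches. A reader replaying the same sequence of calls
would arrive at the same slot assignments, which is what lets the writer emit
a small slot number in place of the repeated bytes.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical LRU dictionary sketch; not the HBASE-4608 implementation. */
class LruDictSketch {
    static final int NOT_IN_DICTIONARY = -1;

    private final int capacity;
    // accessOrder = true: iteration runs from least to most recently used.
    private final LinkedHashMap<String, Integer> slots =
        new LinkedHashMap<>(16, 0.75f, true);

    LruDictSketch(int capacity) {
        this.capacity = capacity;
    }

    /** Returns the slot of a known entry and marks it as recently used,
     *  or NOT_IN_DICTIONARY if the entry is absent. */
    int findEntry(String entry) {
        Integer slot = slots.get(entry); // get() counts as an access
        return slot == null ? NOT_IN_DICTIONARY : slot;
    }

    /** Adds an entry and returns its slot. When the dictionary is full, the
     *  least recently used entry is evicted and its slot is reused. */
    int addEntry(String entry) {
        Integer existing = slots.get(entry);
        if (existing != null) {
            return existing;
        }
        int slot;
        if (slots.size() < capacity) {
            slot = slots.size();
        } else {
            Iterator<Map.Entry<String, Integer>> it = slots.entrySet().iterator();
            slot = it.next().getValue(); // eldest entry = least recently used
            it.remove();
        }
        slots.put(entry, slot);
        return slot;
    }
}
```

With a capacity of 2, adding "t1" and "cf", looking up "t1", and then adding
"r9" evicts "cf" and hands its slot to "r9", while "t1" keeps slot 0.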
I'm pretty happy with cutting the WAL size in half on optimal workloads, though
as always, it's nice to work towards future performance goals. I have other
ideas, but they involve changing the HLog substantially in order to be more
compact. In that case, we might end up abandoning the Hadoop SequenceFile
format altogether, and this thing becomes a bit more complex.
> HLog Compression
> ----------------
>
> Key: HBASE-4608
> URL: https://issues.apache.org/jira/browse/HBASE-4608
> Project: HBase
> Issue Type: New Feature
> Reporter: Li Pi
> Assignee: stack
> Fix For: 0.94.0
>
> Attachments: 4608-v19.txt, 4608-v20.txt, 4608-v22.txt, 4608v1.txt,
> 4608v13.txt, 4608v13.txt, 4608v14.txt, 4608v15.txt, 4608v16.txt, 4608v17.txt,
> 4608v18.txt, 4608v23.txt, 4608v24.txt, 4608v25.txt, 4608v27.txt, 4608v29.txt,
> 4608v30.txt, 4608v5.txt, 4608v6.txt, 4608v7.txt, 4608v8fixed.txt,
> hbase-4608-v28-delta.txt, hbase-4608-v28.txt, hbase-4608-v28.txt
>
>
> The current bottleneck to HBase write speed is replicating the WAL appends
> across different datanodes. We can speed up this process by compressing the
> HLog. Current plan involves using a dictionary to compress table name, region
> id, cf name, and possibly other bits of repeated data. Also, HLog format may
> be changed in other ways to produce a smaller HLog.