[ 
https://issues.apache.org/jira/browse/CASSANDRA-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014531#comment-13014531
 ] 

Terje Marthinussen commented on CASSANDRA-47:
---------------------------------------------

This is probably not that interesting as a "proper" solution, but I am adding
it here for reference.

I needed to free up space for more data, so I recently threw together a quick
compression hack for supercolumns.

I considered compressing the index blocks as Jonathan suggested, but I could
not decide how safe that would be with respect to other code accessing the
data, and I was short on time, so I looked for something more isolated.

The final decision was to simply compress the serialized columns in a
supercolumn (plus a bit of caching to avoid recompressing every time the
serialized size is requested).
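
To illustrate the idea (this is not the actual patch), here is a minimal
sketch of compressing the serialized subcolumns once and caching the result,
so that repeated size queries do not recompress. The class name
CompressedSubcolumns, its methods, and the 8-byte length framing are
assumptions made for this example; only the java.util.zip calls are real API.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Hypothetical holder for the serialized subcolumns of one supercolumn. */
final class CompressedSubcolumns {
    private final byte[] serialized;  // uncompressed serialized subcolumns
    private byte[] compressedCache;   // filled on first use, reused afterwards

    CompressedSubcolumns(byte[] serialized) {
        this.serialized = serialized.clone();
    }

    /** Compresses on the first call only; later calls return the cached bytes. */
    synchronized byte[] compressed() {
        if (compressedCache == null) {
            Deflater deflater = new Deflater();
            deflater.setInput(serialized);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            deflater.end();
            compressedCache = out.toByteArray();
        }
        return compressedCache;
    }

    /** Size query served from the cache. Assumed framing: two int length prefixes. */
    int serializedSize() {
        return 4 + 4 + compressed().length;
    }

    /** Inverse operation; the caller supplies the original (uncompressed) length. */
    static byte[] decompress(byte[] data, int originalLength) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(data);
            byte[] out = new byte[originalLength];
            int off = 0;
            while (off < originalLength && !inflater.finished()) {
                int n = inflater.inflate(out, off, originalLength - off);
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalStateException("truncated compressed data");
                }
                off += n;
            }
            return out;
        } catch (DataFormatException e) {
            throw new IllegalStateException("corrupt compressed data", e);
        } finally {
            inflater.end();
        }
    }
}
```

The cache trades a little memory per supercolumn for not running the
compressor again each time the size is asked for.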

The data I have has supercolumns with typically 50-60 subcolumns, mostly small
strings or numbers.
In total, the subcolumns make up 600-1200 bytes of data when serialized.

There are usually a fair number of supercolumns per row.

My test data was 447 keys. I tested with the ning lzf jars and the default 
java.util.zip.
This is not necessarily a good test, but I think json2sstable is somewhat
useful for measuring the relative impact of different implementations,
although it is not useful for determining real performance in any way.

In addition, I made a simple dictionary of column names (only applied to
supercolumns), as the column names did not compress very well when looking at
just a single supercolumn at a time.
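
I have not described how the dictionary works, so the following is only one
possible shape for it: each distinct column name is assigned a small integer
id the first time it is seen, and the id is stored in place of the name. The
class ColumnNameDictionary and its methods are hypothetical names for this
sketch, and persisting the dictionary alongside the sstable is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical column name dictionary: name -> small integer id and back. */
final class ColumnNameDictionary {
    private final Map<String, Integer> idsByName = new HashMap<>();
    private final List<String> namesById = new ArrayList<>();

    /** Returns the id for a name, assigning the next free id if it is new. */
    int idFor(String name) {
        Integer id = idsByName.get(name);
        if (id == null) {
            id = namesById.size();
            idsByName.put(name, id);
            namesById.add(name);
        }
        return id;
    }

    /** Reverse lookup, used when reading columns back. */
    String nameFor(int id) {
        return namesById.get(id);
    }

    /** Number of distinct names seen so far. */
    int size() {
        return namesById.size();
    }
}
```

With 50-60 subcolumns per supercolumn and the same names repeating across
supercolumns, each name is stored once in the dictionary instead of once per
supercolumn.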

The results with both the column name dictionary and compression applied:
Standard Cassandra, json2sstable:
real    0m55.148s
user    1m50.023s
sys     0m2.856s
sstable: 190MB

ning.com:
real    1m8.315s
user    2m18.361s
sys     0m4.600s
sstable: 108MB

java.util.zip:
real    1m35.899s
user    2m49.691s
sys     0m2.940s
sstable: 90MB

For reference, compressing the whole sstable file gives the following:
ning.com (command line)
real    0m1.803s
user    0m1.536s
sys     0m0.396s
sstable: 80MB

gzip (command line)
real    0m6.175s
user    0m6.076s
sys     0m0.084s
sstable: 48MB
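
To put the sizes above on a common scale, here is a small sketch that
expresses each result as a percentage of the uncompressed 190MB sstable. The
class and method names are made up for this example; the numbers are the ones
listed above.

```java
/** Hypothetical helper for comparing the sstable sizes reported above. */
final class CompressionSummary {
    /** Returns value as a percentage of baseline. */
    static double percentOf(double value, double baseline) {
        return 100.0 * value / baseline;
    }

    public static void main(String[] args) {
        double baselineMb = 190.0; // standard Cassandra sstable
        String[] labels = {
            "ning lzf, per supercolumn",
            "java.util.zip, per supercolumn",
            "ning lzf, whole file",
            "gzip, whole file"
        };
        double[] sizesMb = {108.0, 90.0, 80.0, 48.0};
        for (int i = 0; i < labels.length; i++) {
            System.out.printf("%-32s %5.1f%% of original size%n",
                    labels[i], percentOf(sizesMb[i], baselineMb));
        }
    }
}
```

Per-supercolumn compression lands at roughly 57% and 47% of the original
size, against roughly 42% and 25% when compressing the whole file.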


I doubt this implementation has much merit for inclusion in a release; I just
added the numbers for reference.
Of course, if requested, I could look into making the patch available
somewhere.

> SSTable compression
> -------------------
>
>                 Key: CASSANDRA-47
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-47
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Priority: Minor
>             Fix For: 0.8
>
>
> We should be able to do SSTable compression which would trade CPU for I/O 
> (almost always a good trade).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira