[ 
https://issues.apache.org/jira/browse/CASSANDRA-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033467#comment-13033467
 ] 

Terje Marthinussen commented on CASSANDRA-47:
---------------------------------------------

Just curious whether any active work is being done, or planned for the near 
future, on compressing larger data blocks, or is it all suspended waiting for a 
new sstable design?

Having played with compression of just supercolumns for a while, I am a bit 
tempted to test out compression of larger blocks of data. At least row level 
compression seems reasonably easy to do.

Some experiences so far which may be useful:
- Compression on sstables may actually help with memory pressure, but with 
my current implementation, non-batched update throughput may drop by 50%. I am 
not 100% sure why, actually.

- Flushing of (compressed) memtables and compactions are clear potential 
bottlenecks.
The obvious troublemaker here is the fact that you keep 

For really high pressure work, I think it would be useful to only compress 
tables once they pass a certain size, to reduce the amount of recompression 
occurring on memtable flushes and when compacting small sstables (which are 
generally not a big disk problem anyway).
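
A minimal sketch of that size threshold, in Python rather than Cassandra's 
Java and not taken from my patch. The threshold value, the one-byte marker and 
the function names are all assumptions made up for illustration:

```python
import zlib

# Hypothetical threshold; no concrete value is proposed in this comment.
COMPRESS_THRESHOLD_BYTES = 64 * 1024

# One-byte markers so a reader can tell raw blocks from compressed ones.
RAW, DEFLATE = b"\x00", b"\x01"


def encode_block(data: bytes, threshold: int = COMPRESS_THRESHOLD_BYTES) -> bytes:
    """Store small blocks raw; compress only blocks at or above the threshold."""
    if len(data) < threshold:
        return RAW + data
    return DEFLATE + zlib.compress(data)


def decode_block(block: bytes) -> bytes:
    """Reverse encode_block by inspecting the leading marker byte."""
    marker, payload = block[:1], block[1:]
    if marker == RAW:
        return payload
    if marker == DEFLATE:
        return zlib.decompress(payload)
    raise ValueError("unknown block marker")
```

Small, frequently rewritten tables then skip the compression cost on every 
flush or compaction, while large ones still get the disk savings.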

This is a bit awkward when doing things like I do in the supercolumns, as I 
believe the supercolumn does not know anything about the data it is part of 
(except that, recently, the deserializer has that info through "inner").

It would anyway probably be cleaner to let the datastructures/methods using the 
SC decide when to compress and noth 


- Working on an SC level, there seems to be some 10-15% extra compression on 
this specific data if column names that are highly repetitive in SCs can be 
extracted into some metadata structure so you only store references to these 
in the column names. That is, the final data goes from about 40% compression 
to 50% compression. 

I don't think the effect of this will be equally big with larger blocks, but I 
suspect there should be some effect.
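
To illustrate the column name extraction, here is a rough Python sketch. It is 
not the actual implementation; the function names and the list-of-tuples 
layout are assumptions chosen to keep the example short:

```python
def build_name_dictionary(supercolumns):
    """Replace repeated column names with integer references.

    supercolumns: list of supercolumns, each a list of (name, value) byte pairs.
    Returns (names, encoded), where names[i] is the name behind reference i.
    """
    names = []   # reference -> column name
    index = {}   # column name -> reference
    encoded = []
    for columns in supercolumns:
        row = []
        for name, value in columns:
            if name not in index:
                index[name] = len(names)
                names.append(name)
            row.append((index[name], value))
        encoded.append(row)
    return names, encoded


def restore_names(names, encoded):
    """Inverse of build_name_dictionary."""
    return [[(names[ref], value) for ref, value in row] for row in encoded]
```

Each distinct name is stored once in the shared structure, and every column 
carries only a small integer. How much this saves on top of a general-purpose 
compressor depends on the data.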

- Total size reduction of the sstables I have in this project is currently in 
the 60-65% range. It is mainly beneficial for those that have supercolumns with 
at least a handful of columns (at least 400-600 bytes of serialized column data 
per SC).


- Reducing the metadata on columns by building a dictionary of timestamps, as 
well as using variable-length name/value length fields (instead of fixed 
short/int), cuts down another 10% in my test (I have only done a very quick 
simulation of this with a "10 minute" hack on the serializer).
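
The general idea can be sketched like this in Python. This is not the hack 
mentioned above; the LEB128-style varint and the column layout are assumptions 
picked for illustration:

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative int using 7 bits per byte, low bits first."""
    if n < 0:
        raise ValueError("varint requires a non-negative integer")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def decode_varint(buf: bytes, pos: int = 0):
    """Decode one varint starting at pos; return (value, next position)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def encode_column(name: bytes, value: bytes, timestamp: int, ts_dict: dict) -> bytes:
    """Hypothetical layout: varint lengths plus a reference into a timestamp dictionary."""
    ts_ref = ts_dict.setdefault(timestamp, len(ts_dict))
    return (encode_varint(len(name)) + name
            + encode_varint(len(value)) + value
            + encode_varint(ts_ref))
```

Short names and values then need a single length byte each, and columns 
written in the same batch share one timestamp entry. The timestamp dictionary 
itself still has to be stored once per block.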

- We may want to look at how we can reuse whole compressed rows on compactions 
if, for instance, the other tables you compact with do not have the same data.
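
A toy Python sketch of what I mean. Real sstables are sorted files, not 
dicts, and the JSON row format, the merge rule (highest timestamp wins) and 
all names here are assumptions made only for the example:

```python
import json
import zlib


def pack(row: dict) -> bytes:
    """Serialize and compress a row of {column: [timestamp, value]}."""
    return zlib.compress(json.dumps(row, sort_keys=True).encode())


def unpack(blob: bytes) -> dict:
    return json.loads(zlib.decompress(blob))


def merge_rows(a: dict, b: dict) -> dict:
    """Keep the entry with the highest timestamp for each column."""
    merged = dict(a)
    for col, (ts, val) in b.items():
        if col not in merged or ts > merged[col][0]:
            merged[col] = [ts, val]
    return merged


def compact(table_a: dict, table_b: dict):
    """Merge two tables of {row key: compressed row}.

    Rows present in only one table are copied as-is, still compressed.
    Returns (merged table, number of rows that had to be recompressed).
    """
    out = {}
    recompressed = 0
    for key in sorted(table_a.keys() | table_b.keys()):
        if key in table_a and key in table_b:
            out[key] = pack(merge_rows(unpack(table_a[key]), unpack(table_b[key])))
            recompressed += 1
        else:
            out[key] = table_a[key] if key in table_a else table_b[key]
    return out, recompressed
```

Only rows that exist in both inputs pay for decompression and recompression.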

- We may want a new cache on the uncompressed disk chunks. In my case, I 
preserve the compressed part of the supercolumn and 

In my supercolumn compression case, I have a cache for the compressed data so I 
can write that back without recompression if not modified. This also makes 
calls to get the serialized size cheaper (don't need to compress both to find 
serialized size and to actually serialize)
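
Roughly, the caching idea looks like the Python sketch below. My real code is 
Java inside the supercolumn classes; the class and method names here are 
hypothetical, and `compress_calls` exists only to make the effect visible:

```python
import zlib


class CachedSuperColumn:
    """Keeps the compressed form around until the contents change."""

    def __init__(self, serialized: bytes):
        self._serialized = serialized
        self._compressed = None   # cached compressed bytes, if still valid
        self.compress_calls = 0   # instrumentation for this sketch only

    @classmethod
    def from_disk(cls, compressed: bytes) -> "CachedSuperColumn":
        """Build from bytes read off disk, preserving them as the cache."""
        sc = cls(zlib.decompress(compressed))
        sc._compressed = compressed
        return sc

    def update(self, serialized: bytes) -> None:
        """Any modification invalidates the cached compressed bytes."""
        self._serialized = serialized
        self._compressed = None

    def compressed(self) -> bytes:
        if self._compressed is None:
            self._compressed = zlib.compress(self._serialized)
            self.compress_calls += 1
        return self._compressed

    def serialized_size(self) -> int:
        """Size on disk; reuses the cache instead of compressing again."""
        return len(self.compressed())
```

Asking for the size and then serializing costs one compression at most, and 
an unmodified supercolumn read from disk costs none.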

If people are interested in adding any of the above to current cassandra, I 
will try to find time to bring some of this up to a quality where it could be 
used by the general public. 

If not, I will wait for new sstables to get a bit more ready and see if I can 
contribute there instead.

> SSTable compression
> -------------------
>
>                 Key: CASSANDRA-47
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-47
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Priority: Minor
>              Labels: compression
>             Fix For: 1.0
>
>
> We should be able to do SSTable compression which would trade CPU for I/O 
> (almost always a good trade).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
