[ https://issues.apache.org/jira/browse/CASSANDRA-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13033467#comment-13033467 ]
Terje Marthinussen commented on CASSANDRA-47:
---------------------------------------------

Just curious whether any work is active or planned in the near future on compressing larger data blocks, or is it all suspended pending a new sstable design? Having played with compression of just supercolumns for a while, I am a bit tempted to test out compression of larger blocks of data. Row-level compression, at least, seems reasonably easy to do.

Some experiences so far which may be useful:

- Compression of sstables can actually help with memory pressure, but with my current implementation, non-batched update throughput may drop 50%. I am not 100% sure why.

- Flushing of (compressed) memtables and compactions are clear potential bottlenecks. The obvious troublemaker here is that you keep recompressing the same data. For really high-pressure workloads, I think it would be useful to compress tables only once they pass a certain size, to reduce the amount of recompression occurring on memtable flushes and on compaction of small sstables (which are generally not a big disk problem anyway). This is a bit awkward when doing things at the supercolumn level as I do, since the supercolumn knows nothing about the data it is part of (except that, recently, the deserializer has that info through "inner"). It would probably be cleaner anyway to let the data structures/methods using the SC decide when to compress, not the SC itself.

- Working at the SC level, there is some 10-15% extra compression on this specific data if column names that are highly repetitive across SCs are extracted into a metadata structure, so that only references to them are stored in the column names. That is, the final data goes from about 40% compression to 50% compression. I don't think the effect will be as big with larger blocks, but I suspect there should be some.

- The total size reduction of the sstables I have in this project is currently in the 60-65% range.
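The column-name dictionary idea above can be illustrated with a minimal sketch (the `dict_encode` helper and the sample columns are hypothetical, not Cassandra code): collect the distinct names once into a shared table, store only small integer references in the columns themselves, and compress the result.

```python
import json
import zlib

def dict_encode(columns):
    # Build a table of distinct column names (hypothetical helper),
    # then replace each name with its index into that table.
    names = sorted({name for name, value in columns})
    index = {name: i for i, name in enumerate(names)}
    encoded = [(index[name], value) for name, value in columns]
    return names, encoded

# Hypothetical sample: column names that repeat across supercolumns.
columns = [("email", "a@x"), ("phone", "123"),
           ("email", "b@y"), ("phone", "456")]
names, encoded = dict_encode(columns)

# Decoding is a straight table lookup, so the transform is lossless.
decoded = [(names[i], value) for i, value in encoded]

# Compare compressed sizes of the raw vs dictionary-encoded form.
raw_size = len(zlib.compress(json.dumps(columns).encode()))
enc_size = len(zlib.compress(json.dumps([names, encoded]).encode()))
```

On real data the win comes from the name table being stored once per block while many columns reference it; the 10-15% figure above is from the author's data set, not this toy example.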
The size reduction is mainly beneficial for supercolumns with at least a handful of columns (at least 400-600 bytes of serialized column data per SC).

- Reducing the per-column metadata by building a dictionary of timestamps, plus variable-length name/value length fields (instead of fixed short/int), cuts another 10% in my test (a very quick "10 minute" hack on the serializer).

- We may want to look at reusing whole compressed rows on compaction when, for instance, the other tables being compacted do not contain the same data.

- We may want a new cache for the uncompressed disk chunks. In my supercolumn compression case, I keep a cache of the compressed data so I can write it back without recompressing if it has not been modified. This also makes calls to get the serialized size cheaper (no need to compress both to find the serialized size and again to actually serialize).

If people are interested in adding any of the above to current Cassandra, I will try to find time to bring some of it up to a quality where it could be used by the general public. If not, I will wait for the new sstables to get a bit more ready and see if I can contribute there instead.

> SSTable compression
> -------------------
>
>                 Key: CASSANDRA-47
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-47
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Priority: Minor
>              Labels: compression
>             Fix For: 1.0
>
> We should be able to do SSTable compression which would trade CPU for I/O (almost always a good trade).

-- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira
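The variable-length length fields mentioned in the comment can be sketched with a standard base-128 varint encoding (7 data bits per byte, high bit as a continuation flag); this is a generic illustration, not the serializer change described above. A short name length such as 12 then takes one byte instead of a fixed two-byte short, and most value lengths take one or two bytes instead of a fixed four-byte int.

```python
def write_varint(n):
    # Encode a non-negative integer 7 bits at a time, setting the
    # high bit on every byte except the last as a continuation flag.
    out = bytearray()
    while True:
        b = n & 0x7F
        n >>= 7
        if n:
            out.append(b | 0x80)
        else:
            out.append(b)
            return bytes(out)

def read_varint(buf, pos=0):
    # Decode a varint starting at pos; return (value, next position).
    shift, result = 0, 0
    while True:
        b = buf[pos]
        pos += 1
        result |= (b & 0x7F) << shift
        if not (b & 0x80):
            return result, pos
        shift += 7
```

Lengths under 128 cost a single byte, and the encoding remains self-delimiting, so fields can still be read back sequentially without fixed-width offsets.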