I propose a simple idea for compression using a compressed string datatype. 

The compressed string datatype could be implemented for column family keys by 
creating a compressed string ordered partitioner. The compressed string ordered 
partitioner works by decompressing the string and then applying an ordered 
partitioner for strings to the decompressed string. The hash based partitioner 
would be used with the compressed string without any modification. The 
compressed string datatype could be implemented for column keys by creating a 
compressed string comparator. A compressed string comparator would work by 
decompressing the string and then applying a string comparator. The compressed 
string datatype could be implemented for column values. The compressed string 
datatype would be an internal datatype for Cassandra. The compressed string 
would be converted to a string before returning the value to a client. I 
suppose you could also have an option of passing the compressed form back to 
the client if the client wanted it that way.

I propose using an adaptive arithmetic coding compressor. This type of 
compression can be done a byte at a time. It will build a compression model 
only on the string that is presented, a byte at a time. See the below papers.

Moffat, Alistair, Radford M. Neal, & Ian H. Witten, (1998), "Arithmetic Coding 
Revisited", ACM Trans. on Info. Systems, Vol. 16, No. 3, pp. 256-294.

Witten, Ian H., Radford M. Neal, & John G. Cleary, (1987), "Arithmetic Coding 
for Data Compression", Communications of the ACM, Vol. 30, No. 6, pp. 520-540.

It has been reported that arithmetic coding based compression applied to text 
can get compression ratios of up to 2.2 bits per character. Assuming you only 
get 4 bits per character because of short strings. This would be a 50% 
compression of text data, including keys and column names. Many applications 
would benefit. It should speed up the overall operation of Cassandra because 
you would be moving significantly less data through the system.

This would provide a compression option that could be implemented without any 
redesign to the internal structure of Cassandra except for the a new 
partitioner class, a new comparator class, a new datatype class, and the 
compression class.
-------------
Sincerely,
David G. Boney
dbon...@semanticartifacts.com
http://www.semanticartifacts.com




Reply via email to