On 31 January 2011 04:41, David G. Boney <dbon...@semanticartifacts.com> wrote:
> I propose a simple idea for compression using a compressed string datatype.
>
> The compressed string datatype could be implemented at three levels. For 
> column family keys, a compressed string ordered partitioner would decompress 
> the string and then apply an ordinary string ordered partitioner to the 
> result; the hash-based partitioner would work on the compressed bytes without 
> any modification. For column keys, a compressed string comparator would 
> decompress the string and then apply a string comparator. The datatype could 
> also be used for column values. It would be an internal datatype for 
> Cassandra, converted back to a plain string before the value is returned to a 
> client, though there could also be an option of passing the compressed form 
> back to the client if the client wanted it that way.
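
Roughly, the comparator part would be something like the following sketch,
with GZIP standing in for the proposed arithmetic coder; the class and helper
names here are illustrative only:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Comparator;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Decompress both operands, then fall back to an ordinary string compare.
public final class CompressedStringComparator implements Comparator<byte[]> {

    @Override
    public int compare(byte[] a, byte[] b) {
        return decompress(a).compareTo(decompress(b));
    }

    // GZIP is used here only as a placeholder codec for the sketch.
    static byte[] compress(String s) {
        try (ByteArrayOutputStream bos = new ByteArrayOutputStream()) {
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(s.getBytes(StandardCharsets.UTF_8));
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String decompress(byte[] compressed) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed));
             ByteArrayOutputStream bos = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                bos.write(buf, 0, n);
            }
            return new String(bos.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}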
>
> I propose using an adaptive arithmetic coding compressor. It encodes a byte 
> at a time and builds its compression model only from the string being 
> encoded. See the papers below.
>
> Moffat, Alistair, Radford M. Neal, & Ian H. Witten, (1998), "Arithmetic 
> Coding Revisited", ACM Trans. on Info. Systems, Vol. 16, No. 3, pp. 256-294.
>
> Witten, Ian H., Radford M. Neal, & John G. Cleary, (1987), "Arithmetic Coding 
> for Data Compression", Communications of the ACM, Vol. 30, No. 6, pp. 520-540.
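
The adaptive model in question would be an order-0 per-byte model along these
lines, built up one byte at a time as the string is seen (a sketch only; the
papers cited above describe the full coder):

// Order-0 adaptive byte model of the kind an adaptive arithmetic coder
// drives: counts start uniform and are bumped per symbol, so the model is
// built only from the string actually being encoded.
public final class AdaptiveByteModel {
    private final int[] freq = new int[256];
    private int total = 256;

    public AdaptiveByteModel() {
        // Start every byte value at count 1 so no symbol has zero probability.
        java.util.Arrays.fill(freq, 1);
    }

    // Cumulative count of all byte values below b; the coder narrows its
    // interval to [low(b), low(b) + freq(b)) out of total().
    public int low(int b) {
        int sum = 0;
        for (int i = 0; i < (b & 0xFF); i++) {
            sum += freq[i];
        }
        return sum;
    }

    public int freq(int b) {
        return freq[b & 0xFF];
    }

    public int total() {
        return total;
    }

    // Adapt after coding a byte: later occurrences of the same byte get a
    // larger share of the interval and therefore fewer output bits.
    public void update(int b) {
        freq[b & 0xFF]++;
        total++;
    }
}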
>
> It has been reported that arithmetic coding based compression of English 
> text can reach roughly 2.2 bits per character. Even assuming you only get 
> 4 bits per character because of short strings, that is a 50% reduction in 
> text data, including keys and column names. Many applications would benefit, 
> and it should speed up the overall operation of Cassandra because 
> significantly less data would move through the system.

I have not read those papers, but how do they and their reported
compression ratios apply to the Unicode characters that Java strings
are stored as?

-Stephen

>
> This would provide a compression option that could be implemented without 
> redesigning the internal structure of Cassandra: it needs only a new 
> partitioner class, a new comparator class, a new datatype class, and the 
> compression class itself.
> -------------
> Sincerely,
> David G. Boney
> dbon...@semanticartifacts.com
> http://www.semanticartifacts.com
