On 31 January 2011 04:41, David G. Boney <dbon...@semanticartifacts.com> wrote: > I propose a simple idea for compression using a compressed string datatype. > > The compressed string datatype could be implemented for column family keys by > creating a compressed string ordered partitioner. The compressed string > ordered partitioner works by decompressing the string and then applying an > ordered partitioner for strings to the decompressed string. The hash based > partitioner would be used with the compressed string without any > modification. The compressed string datatype could be implemented for column > keys by creating a compressed string comparator. A compressed string > comparator would work by decompressing the string and then applying a string > comparator. The compressed string datatype could be implemented for column > values. The compressed string datatype would be an internal datatype for > Cassandra. The compressed string would be converted to a string before > returning the value to a client. I suppose you could also have an option of > passing the compressed form back to the client if the client wanted it that > way. > > I propose using an adaptive arithmetic coding compressor. This type of > compression can be done a byte at a time. It will build a compression model > only on the string that is presented, a byte at a time. See the below papers. > > Moffat, Alistair, Radford M. Neal, & Ian H. Witten, (1998), "Arithmetic > Coding Revisited", ACM Trans. on Info. Systems, Vol. 16, No. 3, pp. 256-294. > > Witten, Ian H., Radford M. Neal, & John G. Cleary, (1987), "Arithmetic Coding > for Data Compression", Communications of the ACM, Vol. 30, No. 6, pp. 520-540. > > It has been reported that arithmetic coding based compression applied to text > can get compression ratios of up to 2.2 bits per character. Assuming you only > get 4 bits per character because of short strings. This would be a 50% > compression of text data, including keys and column names. Many applications > would benefit. It should speed up the overall operation of Cassandra because > you would be moving significantly less data through the system.
I have not read those papers but how do they and their reported compressions apply to the unicode characters that java strings are stored as? -Stephen > > This would provide a compression option that could be implemented without any > redesign to the internal structure of Cassandra except for the a new > partitioner class, a new comparator class, a new datatype class, and the > compression class. > ------------- > Sincerely, > David G. Boney > dbon...@semanticartifacts.com > http://www.semanticartifacts.com > > > > >