I don't see anything inherently wrong with your proposal; it would almost
certainly be beneficial in certain scenarios. We use what could be called
static compression (Golomb-esque encodings) for some data types on our
Cassandra clusters. It's useful for representing things like full-precision
decimals efficiently. It sounds like your solution would be a more generic
mechanism for doing that sort of thing.
However, unless I'm misunderstanding things, I don't think your proposal is
a sufficient solution to compression in Cassandra in general. Specific
algorithms aside, there's simply too much regularity (i.e., opportunity for
compression) in the data structures and metadata that are being written to
disk. For instance: multiple rows in a column family often contain columns
with the same name. And column names in a row often share a common prefix.
This sort of regularity can be exploited by a block-level compression
method, but not by a mechanism that simply re-encodes column values.
Mike
On Mon, Jan 31, 2011 at 10:15 AM, David G. Boney
dbon...@semanticartifacts.com wrote:
In Cassandra, strings are stored as UTF-8. In arithmetic coding
compression, the modeling is separate from the coding. A standard
arrangement is to have an order-0 model (frequencies of individual bytes),
an order-1 model (frequencies of two-byte sequences), and an order-2 model
(frequencies of three-byte sequences). For multibyte Unicode characters,
which almost all fall within two or three bytes in UTF-8, the higher-order
models should capture the statistics. Having said that, there is nothing to
prevent someone from developing a UTF-8-specific model that captures the
statistics better.
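
Roughly, an order-0 model is just a byte-frequency table that the coder
consults and then updates; something like the following sketch (names are
illustrative, not an actual implementation):

    // Illustrative sketch only: an adaptive order-0 model is a
    // byte-frequency table that the arithmetic coder consults and
    // then updates after each symbol.
    public class Order0Model {
        private final int[] freq = new int[256];
        private int total;

        public Order0Model() {
            // All byte values start equally likely, i.e. ~8 bits each.
            java.util.Arrays.fill(freq, 1);
            total = 256;
        }

        // Probability the coder assigns to the next byte.
        public double probability(int b) {
            return (double) freq[b & 0xFF] / total;
        }

        // Adapt to the byte just coded.
        public void update(int b) {
            freq[b & 0xFF]++;
            total++;
        }
    }

An order-1 model would condition on the previous byte (a 256x256 table),
and an order-2 model on the previous two bytes.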
An adaptive arithmetic coding model starts with the probabilities of all
the symbols being equal, which means every symbol initially costs 8 bits to
encode. The model adapts with each symbol it sees, but it takes some data
before compression kicks in. I do not know, and have found no reports in
the literature of, how many bytes it takes on average before compression is
seen; all the papers I have read study large blocks of text.
My assumption is that the majority of strings to compress in Cassandra will
be short strings (less than 32 bytes, e.g. column names) to medium-length
strings (less than 256 bytes, e.g. URLs). My assumption is that most short
strings would not get much compression from adaptive arithmetic coding, but
medium-length to longer strings would.

In an effort to get higher compression on the short strings, static
encoding methods could be used. A static encoding method uses a hard-coded
frequency table for the bytes. The start of the compressed string could
carry a 2-bit code:

  00 - no compression
  01 - static coding, default probability table
  10 - static coding, locale probability table
  11 - adaptive coding

The default static coding would be based on English. For the static locale
probability table, the 2-bit code would need to be followed by a code
indicating the probability table to use for a particular language. The
locale probability tables would be stored in the compressor jar file as a
separate class for each locale supported. (This issue needs more analysis;
what I don't think is effective is to load the probability tables from disk
each time a new compressor is created.)
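
To make the framing concrete, the tag could be something like this (a
sketch; packing the tag into 2 bits rather than a whole byte is an obvious
refinement):

    // Sketch of the 2-bit method tag at the front of each compressed
    // string (shown here as int constants for readability).
    public final class MethodTag {
        public static final int NONE           = 0; // 00: no compression
        public static final int STATIC_DEFAULT = 1; // 01: static, default (English) table
        public static final int STATIC_LOCALE  = 2; // 10: static, locale table follows
        public static final int ADAPTIVE       = 3; // 11: adaptive arithmetic coding
        private MethodTag() {}
    }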
During the encoding phase, both adaptive and static coding of the string
would occur. Compression with a static coding table is generally very fast:
static coding is basically a lookup into a table with 256 entries. It is
the adaptive coding that is more computationally intensive. Furthermore,
you could place a limit on the use of static coding, say strings less than
256 bytes; the cutoff would need to be set empirically. The shorter string
from the two methods is used as the compressed string. There is no
additional burden on the decoding side, since the 2-bit code tells you
which compression encoding was used.
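
In code, the selection step might look roughly like this (the Coder
interface is a hypothetical stand-in for the static and adaptive codecs;
each encoder is assumed to emit its own method tag):

    // Hypothetical selection logic: try the applicable coders and
    // keep the shortest output, falling back to the raw bytes.
    interface Coder {
        byte[] encode(byte[] raw); // output includes the method tag
    }

    static byte[] compress(byte[] raw, Coder staticCoder, Coder adaptiveCoder) {
        byte[] best = adaptiveCoder.encode(raw);
        if (raw.length < 256) { // empirical cutoff for static coding
            byte[] s = staticCoder.encode(raw);
            if (s.length < best.length) best = s;
        }
        // If neither coder beat the original, keep the raw bytes
        // (prefixed with the 00 tag in practice).
        return best.length < raw.length ? best : raw;
    }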
It might be the case that block compressors can achieve greater compression
ratios. From what I have read on the mailing list and in JIRA, it appears
that there is a lot of pain involved in implementing block-based
compressors. This method, a compressed string data type, is presented as a
method that is minimally invasive to the Cassandra architecture. Since
everything is already stored as a byte array, nothing would need to be
changed in the internal data types. Furthermore, no surgery on the internal
tables is needed. The main addition is the development of comparators, for
keys and column headings, that know how to decompress the byte arrays
before comparing them.
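
A sketch of the comparator idea, using a plain java.util.Comparator rather
than Cassandra's actual comparator plumbing (the Codec type is a
hypothetical placeholder for the decompressor):

    import java.nio.charset.Charset;
    import java.util.Comparator;

    // Hypothetical: decompress both operands back to UTF-8, then
    // compare the original strings.
    public class CompressedUTF8Comparator implements Comparator<byte[]> {
        interface Codec { byte[] decompress(byte[] compressed); }

        private static final Charset UTF8 = Charset.forName("UTF-8");
        private final Codec codec;

        public CompressedUTF8Comparator(Codec codec) { this.codec = codec; }

        public int compare(byte[] a, byte[] b) {
            String left  = new String(codec.decompress(a), UTF8);
            String right = new String(codec.decompress(b), UTF8);
            return left.compareTo(right);
        }
    }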
There is also the additional work of writing the codec, but there are a
number of examples in C and Java from which to draw. Moffat's web site would
be a good place to start.
-
Sincerely,
David G. Boney
dbon...@semanticartifacts.com
http://www.semanticartifacts.com