Re: Simple Compression Idea

2011-01-31 Thread Mike Malone
I don't see anything inherently wrong with your proposal; it would almost
definitely be beneficial in certain scenarios. We use what could be called
static compression (golomb-esque encodings) for some data types on our
Cassandra clusters. It's useful for representing things like full precision
decimals efficiently. It sounds like your solution would be a more generic
mechanism for doing that sort of thing.
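
To make "golomb-esque" concrete, here's a rough sketch (illustration only,
not our actual code) of a Rice/Golomb-style code: the quotient goes out in
unary and the remainder in k fixed bits, so small values get short codes:

    // Illustration only: Golomb-Rice encoding of a non-negative integer with
    // parameter M = 2^k, returned as a bit string for readability.
    static String golombRice(long n, int k) {
        long q = n >>> k;                               // quotient
        StringBuilder bits = new StringBuilder();
        for (long i = 0; i < q; i++) bits.append('1');  // unary part
        bits.append('0');                               // unary terminator
        for (int b = k - 1; b >= 0; b--)                // k-bit remainder, MSB first
            bits.append((n >>> b) & 1L);
        return bits.toString();                         // golombRice(9, 2) -> "11001"
    }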

However, unless I'm misunderstanding things, I don't think your proposal is
a sufficient solution to compression in Cassandra in general. Specific
algorithms aside, there's simply too much regularity (i.e., opportunity for
compression) in the data structures and metadata that are being written to
disk. For instance: multiple rows in a column family often contain columns
with the same name. And column names in a row often share a common prefix.
This sort of regularity can be exploited by a block-level compression
method, but not by a mechanism that simply re-encodes column values.
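
As a toy illustration of that kind of regularity (not Cassandra code): front
coding a sorted run of column names stores only the unshared suffixes, which
is exactly the redundancy a per-value re-encoder never gets to see:

    // Toy front coding: each sorted name is stored as (shared prefix length,
    // unshared suffix). "user:name" following "user:email" becomes "5|name".
    static java.util.List<String> frontCode(java.util.List<String> sortedNames) {
        java.util.List<String> out = new java.util.ArrayList<String>();
        String prev = "";
        for (String name : sortedNames) {
            int shared = 0;
            int max = Math.min(prev.length(), name.length());
            while (shared < max && prev.charAt(shared) == name.charAt(shared)) shared++;
            out.add(shared + "|" + name.substring(shared));
            prev = name;
        }
        return out;
    }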

Mike

On Mon, Jan 31, 2011 at 10:15 AM, David G. Boney 
dbon...@semanticartifacts.com wrote:

 In Cassandra, strings are stored as UTF-8. In arithmetic coding
 compression, the modeling is separate from the coding. A standard
 arrangement is to have an order-0 model (frequencies of individual bytes),
 an order-1 model (frequencies of two-byte sequences), and an order-2 model
 (frequencies of three-byte sequences). For multibyte Unicode characters,
 which almost all fall into two- or three-byte encodings, the higher-order
 models should capture the statistics. Having said that, there is nothing to
 prevent someone from developing a UTF-8-specific model that captures the
 statistics better.
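
 A rough sketch of the modeling side only (the arithmetic coder that
 consumes these probabilities is omitted): an adaptive order-1 model keeps
 counts of each byte conditioned on the previous byte, and higher-order
 models just extend the context:

    // Illustration only: adaptive order-1 byte model. Counts start uniform,
    // so every symbol initially costs about 8 bits, and adapt as data is seen.
    final class Order1Model {
        private final int[][] counts = new int[256][256];
        private final int[] totals = new int[256];

        Order1Model() {
            for (int ctx = 0; ctx < 256; ctx++) {
                java.util.Arrays.fill(counts[ctx], 1);  // all symbols equally likely
                totals[ctx] = 256;
            }
        }

        double probability(int prevByte, int nextByte) {
            return (double) counts[prevByte & 0xFF][nextByte & 0xFF]
                    / totals[prevByte & 0xFF];
        }

        void update(int prevByte, int nextByte) {       // adapt after each symbol
            counts[prevByte & 0xFF][nextByte & 0xFF]++;
            totals[prevByte & 0xFF]++;
        }
    }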

 An adaptive arithmetic coding model starts with the probabilities of all
 the symbols being equal. This results in every symbol initially costing
 8 bits. The model adapts with each symbol it sees, but it takes some data
 before compression starts to appear. I do not know, and have not seen
 reports in the literature on, how many bytes it takes on average to start
 seeing compression. All the papers I have seen have studied large blocks
 of text.

 My assumption is that the majority of strings to compress in Cassandra will
 be short strings, less than 32 bytes, to medium-length strings, less than
 256 bytes. An example of short strings might be column names. An example of
 medium-length strings would be URLs. My assumption is that most short
 strings would not get much compression from adaptive arithmetic coding but
 medium-length to longer strings would. In an effort to get higher
 compression on the short strings, static encoding methods could be used. A
 static encoding method uses a hard-coded frequency table for the bytes. The
 start of the compressed string could have a 2-bit code: 00-no compression,
 01-static, default probability table, 10-static, locale probability table,
 11-adaptive coding. The default static coding would be based on English.
 For the static locale probability table, the 2-bit code would need to be
 followed by a code indicating the probability table to use for a particular
 language. The locale probability tables would be stored in the compressor
 jar file as a separate class for each supported locale (this issue needs
 more analysis; what I don't think is effective is to load the probability
 tables from disk each time a new compressor is created). During the
 encoding phase, both adaptive and static coding of the string would occur.
 In general, compression with a static coding table is very fast: static
 coding is basically a lookup into a 256-entry table. It is the adaptive
 coding that is more computationally intensive. Furthermore, you could place
 a limit on the use of static coding, say strings less than 256 bytes; this
 would need to be set empirically. The shorter string from the two methods
 is used as the compressed string. There is no additional burden on the
 decoding side, since you know the type of compression encoding.
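
 A rough sketch of what the encode path could look like (the static and
 adaptive coders are assumed to exist elsewhere and their outputs are passed
 in; spending a whole header byte on the 2-bit code is just for simplicity):

    // Sketch only: pick the shortest of raw, static-coded, and adaptive-coded,
    // and prefix a method code so the decoder knows what to do. The locale
    // case (code 10) would additionally append a table id after the header.
    static byte[] compressString(byte[] utf8, byte[] staticOut, byte[] adaptiveOut) {
        byte method;   // 0 = none, 1 = static default table, 3 = adaptive
        byte[] body;
        if (adaptiveOut.length < staticOut.length && adaptiveOut.length < utf8.length) {
            method = 3; body = adaptiveOut;    // adaptive won
        } else if (staticOut.length < utf8.length) {
            method = 1; body = staticOut;      // static table won
        } else {
            method = 0; body = utf8;           // compression didn't help; store raw
        }
        byte[] out = new byte[body.length + 1];
        out[0] = method;
        System.arraycopy(body, 0, out, 1, body.length);
        return out;
    }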

 It might be the case that block compressors can achieve greater compression
 ratios. From what I have read on the mailing list and JIRA, it appears that
 there is a lot of pain involved in implementing block-based compressors.
 This method, a compressed string data type, is presented as an approach
 that is minimally invasive to the Cassandra architecture. Since everything
 is already stored as a byte array, nothing would need to be changed in the
 internal data types. Furthermore, no surgery on the internal tables is
 needed. The main addition is the development of comparators, for keys and
 column headings, that know how to decompress the byte arrays before
 comparing them. There is also the additional work of writing the codec, but
 there are a number of examples in C and Java from which to draw. Moffat's
 web site would be a good place to start.
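
 A rough sketch of the comparator idea (decompress() is a placeholder for
 whichever codec is chosen, and wiring this into Cassandra's comparator
 machinery is not shown):

    // Sketch only: decompress both byte arrays, then compare the plaintext.
    final class CompressedStringComparator implements java.util.Comparator<byte[]> {
        public int compare(byte[] a, byte[] b) {
            java.nio.charset.Charset utf8 = java.nio.charset.Charset.forName("UTF-8");
            return new String(decompress(a), utf8)
                    .compareTo(new String(decompress(b), utf8));
        }

        private static byte[] decompress(byte[] compressed) {
            // Placeholder: read the method header and run the matching decoder.
            throw new UnsupportedOperationException("codec not shown in this sketch");
        }
    }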

 -
 Sincerely,
 David G. Boney
 dbon...@semanticartifacts.com
 http://www.semanticartifacts.com




Re: Is SuperColumn necessary?

2010-05-07 Thread Mike Malone
On Thu, May 6, 2010 at 5:38 PM, Vijay vijay2...@gmail.com wrote:

 I would rather be interested in a tree-type structure where supercolumns
 have supercolumns in them. You don't need to compare all the columns to
 find a set of columns, and it would also reduce the bytes transferred for
 separators, or at least the string concatenation (or something like that)
 for read and write column name generation. It is more logically stored and
 structured this way, and we can also make caching work better by
 selectively caching the tree (user defined if you will).

 But nothing wrong in supporting both :)


I'm 99% sure we're talking about the same thing and we don't need to support
both. How names/values are separated is pretty irrelevant. It has to happen
somewhere. I agree that it'd be nice if it happened on the server, but doing
it in the client makes it easier to explore ideas.

On Thu, May 6, 2010 at 5:27 PM, philip andrew philip14...@gmail.com wrote:

 Please create a new term if the existing terms are misleading; if it's not
 a file system then it's not good to call it a file system.


While it's seriously bikesheddy, I guess you're right.

Let's call them "thingies" for now, then. So you can have a top-level
"thingy" and it can have an arbitrarily nested tree of sub-thingies. Each
thingy has a thingy type [1]. You can also tell Cassandra if you want a
particular level of thingy to be indexed. At one (or maybe more) levels
you can tell Cassandra you want your thingies to be split onto separate
nodes in your cluster. At one (or maybe more) levels you could also tell
Cassandra that you want your thingies split into separate files [2].

The upshot is, the Cassandra data model would go from being "it's a nested
dictionary, just kidding no it's not!" to being "it's a nested dictionary,
for serious." Again, these are all just ideas... but I think this simplified
data model would allow you to express pretty much any query in a graph of
simple primitives like Predicates, Filters, Aggregations, Transformations,
etc. The indexes would allow you to cheat when evaluating certain types of
queries - if you get a SlicePredicate on an indexed thingy you don't have
to enumerate the entire set of sub-thingies for example.
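
As a toy illustration of that cheat (TreeMap standing in for whatever the
index actually is; names made up here):

    // Toy sketch: an indexed level of sub-thingies kept in sorted order turns
    // a SlicePredicate into a sub-range lookup instead of a full scan.
    static java.util.SortedMap<String, byte[]> sliceSubThingies(
            java.util.TreeMap<String, byte[]> indexedSubThingies,
            String start, String end) {
        return indexedSubThingies.subMap(start, end);  // no full enumeration needed
    }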

So, you'd query your thingies by building out a predicate,
transformations, filters, etc., serializing the graph of primitives, and
sending it over the wire to Cassandra. Cassandra would rebuild the graph and
run it over your dataset.

So instead of:

  Cassandra.get_range_slices(
    keyspace='AwesomeApp',
    column_parent=ColumnParent(column_family='user'),
    slice_predicate=SlicePredicate(column_names=['username', 'dob']),
    range=KeyRange(start_key='a', end_key='m'),
    consistency_level=ONE
  )

You'd do something like:

  Cassandra.query(
    SubThingyTransformer(
      NamePredicate(names=['AwesomeApp']),
      SubThingyTransformer(
        NamePredicate(names=['user']),
        SubThingyTransformer(
          SlicePredicate(start='a', end='m'),
          NamePredicate(names=['username', 'dob'])
        )
      )
    ),
    consistency_level=ONE
  )

Which seems complicated, but it's basically just [(user['username'],
user['dob']) for user in Cassandra['AwesomeApp']['user'].slice('a', 'm')]
and could probably be expressed that way in a client library.

I think batch_mutate is awesome the way it is and should be the only way to
insert/update data. I'd rename it mutate. So our interface becomes:

  Cassandra.query(query, consistency_level)
  Cassandra.mutate(mutation, consistency_level)

Ta-da.

Anyways, I was trying to avoid writing all of this out in prose and to mock
some of it up in code instead. I guess this works too. Either
way, I do think something like this would simplify the codebase, simplify
the data model, simplify the interface, make the entire system more
flexible, and be generally awesome.

Mike

[1] These can be subclasses of Thingy in Java... or maybe they'd implement
IThingy. But either way they'd handle serialization and probably implement
compareTo to define natural ordering. So you'd have classes like
ASCIIThingy, UTF8Thingy, and LongThingy (ahem) - these would replace
comparators.
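
A rough sketch of what that could look like (details invented here, not a
real Cassandra interface):

    // Sketch only: each thingy type serializes itself and defines its own
    // natural ordering, replacing the separate comparator classes.
    interface IThingy extends Comparable<IThingy> {
        byte[] serialize();
        void deserialize(byte[] bytes);
    }

    final class LongThingy implements IThingy {
        private long value;

        public byte[] serialize() {
            return java.nio.ByteBuffer.allocate(8).putLong(value).array();
        }

        public void deserialize(byte[] bytes) {
            value = java.nio.ByteBuffer.wrap(bytes).getLong();
        }

        public int compareTo(IThingy other) {  // assumes both sides are LongThingy
            return Long.valueOf(value).compareTo(((LongThingy) other).value);
        }
    }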

[2] I think there's another simplification here. Splitting into separate
files is really very similar to splitting onto separate nodes. There might
be a way around some of the row size limitations with this sort of concept.
And we may be able to get better utilization of multiple disks by giving
each disk (or data directory) a subset of the node's token range. Caveat:
thought not fully baked.