Thanks to all for valuable insight! Two comments: a) this is not actually time series data, but yes, each item has a timestamp and thus chronological attribution.
b) So, what do you practically recommend? I need to delete half a million to a million entries daily, then insert fresh data. What's the right operating procedure? For some reason I can still select on the index in the CLI; it's the Pycassa module that gives me trouble, but I need it, as this is my platform and we are a Python shop.

Maxim

On 11/13/2011 7:22 PM, Peter Schuller wrote:
Deletions in Cassandra imply the use of tombstones (see http://wiki.apache.org/cassandra/DistributedDeletes), and under some circumstances reads can turn O(n) with respect to the number of columns deleted. It sounds like this is what you're seeing.

For example, suppose you're inserting a range of columns into a row, deleting it, and inserting another non-overlapping subsequent range. Repeat that a bunch of times. In terms of what's stored in Cassandra for the row, you now have:

tomb tomb tomb tomb .... actual data

If you then do something like a slice on that row with the end-points being such that they include all the tombstones, Cassandra essentially has to read through and process all those tombstones. (For the PostgreSQL aware: this is similar to the effect you can get if implementing e.g. a FIFO queue, where MIN(pos) turns O(n) with respect to the number of deleted entries until the last vacuum; this is improved in modern versions.)
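
To make the cost concrete, here is a rough sketch in plain Python of the pattern described above. It is only a toy model, not Cassandra or Pycassa code: the row is a sorted list of cells, and the names build_row and slice_from_start are made up for this illustration. It counts how many cells a slice starting at the beginning of the row has to walk past before it has collected the requested number of live columns.

```python
from typing import List, Tuple

# Toy model only: a cell is (column_name, is_live); is_live=False is a tombstone.
Cell = Tuple[int, bool]


def build_row(cycles: int, batch: int) -> List[Cell]:
    """Insert a batch, delete it, insert the next non-overlapping batch; repeat.

    Only the batch from the final cycle is still live; every earlier batch
    remains in the row as tombstones.
    """
    row: List[Cell] = []
    name = 0
    for cycle in range(cycles):
        is_last = cycle == cycles - 1
        for _ in range(batch):
            row.append((name, is_last))
            name += 1
    return row


def slice_from_start(row: List[Cell], count: int) -> Tuple[List[int], int]:
    """Return (live column names, number of cells scanned) for a slice
    that starts at the beginning of the row and wants `count` live columns."""
    live: List[int] = []
    scanned = 0
    for name, is_live in row:
        scanned += 1
        if is_live:
            live.append(name)
            if len(live) == count:
                break
    return live, scanned


if __name__ == "__main__":
    for cycles in (1, 10, 100):
        row = build_row(cycles, batch=1000)
        live, scanned = slice_from_start(row, count=10)
        print(f"cycles={cycles:3d} live_returned={len(live)} cells_scanned={scanned}")
```

In this model the number of cells scanned grows linearly with the number of deleted columns sitting in front of the live data, even though the slice always returns the same ten columns. A slice whose start point lies after the tombstones would skip them in the model.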