Re: Mass deletion -- slowing down

Maxim Potekhin Sun, 13 Nov 2011 16:55:34 -0800

Thanks to all for valuable insight!

Two comments:
a) this is not actually time series data, but yes, each item has
a timestamp and thus chronological attribution.


b) so, what do you practically recommend? I need to delete
half a million to a million entries daily, then insert fresh data.
What's the right operational procedure?

For some reason I can still select on the index in the CLI; it's
the Pycassa module that gives me trouble. I need Pycassa, though, as
this is my platform and we are a Python shop.

Maxim



On 11/13/2011 7:22 PM, Peter Schuller wrote:

Deletions in Cassandra imply the use of tombstones (see
http://wiki.apache.org/cassandra/DistributedDeletes) and under some
circumstances reads can become O(n) with respect to the number of
columns deleted. It sounds like this is what you're seeing.

For example, suppose you're inserting a range of columns into a row,
deleting it, and inserting another non-overlapping subsequent range.
Repeat that a bunch of times. In terms of what's stored in Cassandra
for the row you now have:

   tomb
   tomb
   tomb
   tomb
   ....
    actual data

If you then do something like a slice on that row with the end-points
being such that they include all the tombstones, Cassandra essentially
has to read through and process all those tombstones (for the
PostgreSQL-aware: this is similar to the effect you can get when
implementing e.g. a FIFO queue, where MIN(pos) turns O(n) with respect
to the number of deleted entries until the last vacuum; this is
improved in modern versions).
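
To make the cost concrete, here is a minimal sketch in plain Python
that models the row described above. It is a toy simulation, not
Cassandra or Pycassa code: the helper names build_row and slice_live
are hypothetical, and the "scanned" counter only approximates the work
a real read would do. It assumes every range except the last has been
deleted and that tombstones have not yet been purged.

```python
def build_row(cycles, batch):
    """Model a row after repeated insert/delete of non-overlapping ranges.

    Each entry is (column_name, is_live). Only the last range is live;
    every earlier range is represented by tombstones (is_live=False).
    """
    row = []
    for c in range(cycles):
        live = (c == cycles - 1)
        for i in range(batch):
            row.append((c * batch + i, live))
    return row


def slice_live(row, start, count):
    """Return up to `count` live columns with name >= start.

    Also returns how many cells inside the slice range were examined.
    Cells before `start` are skipped without being counted, standing in
    for a seek to the start of the slice.
    """
    result, scanned = [], 0
    for name, live in row:
        if name < start:
            continue
        scanned += 1
        if live:
            result.append(name)
            if len(result) == count:
                break
    return result, scanned


row = build_row(cycles=5, batch=100)   # 400 tombstones, then 100 live columns

# Slice starting at the beginning of the row: walks every tombstone first.
print(slice_live(row, start=0, count=10)[1])    # 410

# Slice starting where the live data begins: touches only live columns.
print(slice_live(row, start=400, count=10)[1])  # 10
```

Both slices return the same ten columns, but the first one examines
410 cells and the second only 10. In this model, the difference grows
linearly with the number of deleted columns covered by the slice.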
