Hi all, I am seeing inconsistencies when mixing range tombstones, wide partitions, and reverse iterators. I still have to understand whether this behaviour is expected, hence this message to the mailing list.
The situation is conceptually simple. I am using a table defined as follows:

CREATE TABLE test_cql.test_cf (
  hash blob,
  timeid timeuuid,
  PRIMARY KEY (hash, timeid)
) WITH CLUSTERING ORDER BY (timeid ASC)
  AND compaction = {'class' : 'LeveledCompactionStrategy'};

I then load 2-3 GB from 3 sstables which I know contain a really wide partition (> 512 MB) for hash = x. Next I delete the oldest _half_ of that partition by executing the query below, and restart the node:

DELETE FROM test_cql.test_cf WHERE hash = x AND timeid < y;

If I keep compactions disabled, the following query times out (it takes more than 10 seconds to succeed):

SELECT * FROM test_cql.test_cf WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf ORDER BY timeid ASC;

while the following returns immediately (obviously, because no deleted data is ever read):

SELECT * FROM test_cql.test_cf WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf ORDER BY timeid DESC;

If I force a compaction the problem is gone, but I presume only because the data is rearranged. It seems to me that when reading in ASC order, C* does not make use of the range tombstone until it reads the last sstable (which actually contains the range tombstone and is flushed at node restart), and so it wastes time reading rows that are no longer live. Is this expected? Shouldn't the range tombstone actually help in these cases?

Thanks a lot!
Stefano
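For what it's worth, here is a minimal toy simulation (plain Python, not Cassandra internals) of the behaviour I think I am seeing: rows for one partition are spread over several sstables, and the range tombstone covering timeid < y lives only in the most recently flushed one. A forward (ASC) merge still has to scan every shadowed row before reaching the live ones, while a reverse (DESC) merge hits live rows first and can stop at the first shadowed row. All names here (sstable dicts, read_partition, tombstone_below) are made up for illustration.

```python
def merged_rows(sstables, reverse=False):
    """Merge the sorted sstables; yield (timeid, value) rows in order."""
    return sorted((r for s in sstables for r in s["rows"]), reverse=reverse)

def read_partition(sstables, reverse=False):
    """Return the live rows plus a count of how many rows were scanned.

    A row is live unless some sstable carries a range tombstone
    shadowing every timeid below 'tombstone_below'."""
    tombstone = max(s.get("tombstone_below", -1) for s in sstables)
    scanned, live = 0, []
    for timeid, value in merged_rows(sstables, reverse):
        scanned += 1
        if timeid >= tombstone:
            live.append((timeid, value))
        elif reverse:
            break  # reverse read: everything earlier is shadowed, stop here
        # forward read: shadowed rows are skipped, but only after being read
    return live, scanned

# Three sstables holding the wide partition's rows (timeids 0..999),
# plus the sstable flushed at restart that only holds the tombstone
# (standing in for: DELETE ... WHERE hash = x AND timeid < 500).
sstables = [
    {"rows": [(t, "v") for t in range(0, 1000, 3)]},
    {"rows": [(t, "v") for t in range(1, 1000, 3)]},
    {"rows": [(t, "v") for t in range(2, 1000, 3)]},
    {"rows": [], "tombstone_below": 500},
]

asc_live, asc_scanned = read_partition(sstables)
desc_live, desc_scanned = read_partition(sstables, reverse=True)
# ASC scans all 1000 rows to return 500 live ones;
# DESC stops after the first shadowed row (501 rows scanned).
```

This is of course only the mental model behind my question, not a claim about what the read path actually does.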