Hi all, I am seeing inconsistencies when mixing range tombstones, wide partitions, and reverse iterators. I still have to understand whether this behaviour is expected, hence this message to the mailing list.
The situation is conceptually simple. I am using a table defined as follows:

CREATE TABLE test_cql.test_cf (
  hash blob,
  timeid timeuuid,
  PRIMARY KEY (hash, timeid)
) WITH CLUSTERING ORDER BY (timeid ASC)
  AND compaction = {'class' : 'LeveledCompactionStrategy'};

I then load 2-3 GB from 3 sstables which I know contain a really wide partition (> 512 MB) for hash = x. Next I delete the oldest _half_ of that partition by executing the query below, and restart the node:

DELETE FROM test_cql.test_cf WHERE hash = x AND timeid < y;

If I keep compactions disabled, the following query times out (it takes more than 10 seconds to succeed):

SELECT * FROM test_cql.test_cf WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf ORDER BY timeid ASC;

while the following returns immediately (obviously, because no deleted data is ever read):

SELECT * FROM test_cql.test_cf WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf ORDER BY timeid DESC;

If I force a compaction the problem is gone, but I presume only because the data is rearranged. It seems to me that when reading in ASC order, C* does not make use of the range tombstone until it reads the last sstable (which actually contains the range tombstone and is flushed at node restart), and so it wastes time reading rows that are no longer live. Is this expected? Shouldn't the range tombstone actually help in these cases?

Thanks a lot!
Stefano
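For what it's worth, here is a minimal toy simulation (plain Python, not Cassandra internals) of the behaviour I think I am seeing: rows for one partition are spread over several sstables, and the range tombstone covering timeid < y lives only in the most recently flushed one. A forward (ASC) merge still has to scan every shadowed row before reaching the live ones, while a reverse (DESC) merge hits live rows first and can stop at the first shadowed row. All names here (sstable dicts, read_partition, tombstone_below) are made up for illustration.

```python
def merged_rows(sstables, reverse=False):
    """Merge the sorted sstables; yield (timeid, value) rows in order."""
    return sorted((r for s in sstables for r in s["rows"]), reverse=reverse)

def read_partition(sstables, reverse=False):
    """Return the live rows plus a count of how many rows were scanned.

    A row is live unless some sstable carries a range tombstone
    shadowing every timeid below 'tombstone_below'."""
    tombstone = max(s.get("tombstone_below", -1) for s in sstables)
    scanned, live = 0, []
    for timeid, value in merged_rows(sstables, reverse):
        scanned += 1
        if timeid >= tombstone:
            live.append((timeid, value))
        elif reverse:
            break  # reverse read: everything earlier is shadowed, stop here
        # forward read: shadowed rows are skipped, but only after being read
    return live, scanned

# Three sstables holding the wide partition's rows (timeids 0..999),
# plus the sstable flushed at restart that only holds the tombstone
# (standing in for: DELETE ... WHERE hash = x AND timeid < 500).
sstables = [
    {"rows": [(t, "v") for t in range(0, 1000, 3)]},
    {"rows": [(t, "v") for t in range(1, 1000, 3)]},
    {"rows": [(t, "v") for t in range(2, 1000, 3)]},
    {"rows": [], "tombstone_below": 500},
]

asc_live, asc_scanned = read_partition(sstables)
desc_live, desc_scanned = read_partition(sstables, reverse=True)
# ASC scans all 1000 rows to return 500 live ones;
# DESC stops after the first shadowed row (501 rows scanned).
```

This is of course only the mental model behind my question, not a claim about what the read path actually does.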