This is partly a guess, but Cassandra probably reads the sstables in some sort of sequence, so even if sstable 2 contains the tombstone, it still scans through sstable 1 for data that might have to be returned.
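To make the guess above concrete, here is a toy model in Python. It is not Cassandra code, and every name in it (`read_partition`, `sstable_1`, the integer `timeid` values) is made up for illustration. It assumes the read path merges rows from all sstables in clustering order and only afterwards drops the rows shadowed by the range tombstone, so deleted rows still have to be read one by one.

```python
from heapq import merge

def read_partition(sstables, tombstone, limit, reverse=False):
    """Toy read path: merge rows from every sstable in clustering order,
    drop rows shadowed by the range tombstone, stop after `limit` live rows.
    Returns (live_timeids, rows_scanned)."""
    bound, deleted_at = tombstone  # models: DELETE ... WHERE timeid < bound
    # Each sstable is a list of (timeid, written_at) sorted by timeid ascending;
    # a reverse read walks every sstable from its end.
    sources = [s[::-1] if reverse else s for s in sstables]
    live, scanned = [], 0
    for timeid, written_at in merge(*sources, key=lambda r: r[0], reverse=reverse):
        scanned += 1
        if timeid < bound and written_at <= deleted_at:
            continue  # shadowed by the tombstone, but it was still read
        live.append(timeid)
        if len(live) == limit:
            break
    return live, scanned

# Hypothetical layout: the old rows live in one sstable, the newer rows in
# another, and the range tombstone was written after both.
sstable_1 = [(t, 1) for t in range(0, 50_000)]
sstable_2 = [(t, 1) for t in range(50_000, 100_000)]
tombstone = (50_000, 2)  # deletes the oldest half of the partition

asc_live, asc_scanned = read_partition([sstable_1, sstable_2], tombstone, limit=100)
desc_live, desc_scanned = read_partition([sstable_1, sstable_2], tombstone, limit=100, reverse=True)
print(asc_scanned, desc_scanned)
```

In this model the ASC read has to walk over all the deleted rows before it can return the first live one, while the DESC read fills its limit from live rows and never touches the deleted half. Whether the real read path behaves this way is exactly the open question.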
BR,
Hannu

> On 16 May 2017, at 19:40, Stefano Ortolani <ostef...@gmail.com> wrote:
>
> Little update: also the following query timeouts, which is weird since the
> range tombstone should have been read by then...
>
> SELECT *
> FROM test_cql.test_cf
> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
> AND timeid < the_oldest_deleted_timeid
> ORDER BY timeid DESC;
>
> On Tue, May 16, 2017 at 5:17 PM, Stefano Ortolani <ostef...@gmail.com> wrote:
> Yes, that was my intention but I wanted to cross-check with the ML and the
> devs keeping an eye on it first.
>
> On Tue, May 16, 2017 at 5:10 PM, Hannu Kröger <hkro...@gmail.com> wrote:
> Well,
>
> sstables contain some statistics about the cell timestamps and using that
> information and the tombstone timestamp it might be possible to skip some
> data but I’m not sure that Cassandra currently does that. Maybe it would be
> worth a JIRA ticket and see what the devs think about it. If optimizing this
> case would make sense.
>
> Hannu
>
>> On 16 May 2017, at 18:03, Stefano Ortolani <ostef...@gmail.com> wrote:
>>
>> Hi Hannu,
>>
>> the piece of data in question is older. In my example the tombstone is the
>> newest piece of data.
>> Since a range tombstone has information re the clustering key ranges, and
>> the data is clustering key sorted, I would expect a linear scan not to be
>> necessary.
>>
>> On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger <hkro...@gmail.com> wrote:
>> Well, as mentioned, probably Cassandra doesn’t have logic and data to skip
>> bigger regions of deleted data based on range tombstone. If some piece of
>> data in a partition is newer than the tombstone, then it cannot be skipped.
>> Therefore some partition level statistics of cell ages would need to be kept
>> in the column index for the skipping and that is probably not there.
>>
>> Hannu
>>
>>> On 16 May 2017, at 17:33, Stefano Ortolani <ostef...@gmail.com> wrote:
>>>
>>> That is another way to see the question: are reverse iterators range
>>> tombstone aware? Yes.
>>> That is why I am puzzled by this afore-mentioned behavior.
>>> I would expect them to handle this case more gracefully.
>>>
>>> Cheers,
>>> Stefano
>>>
>>> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth <ni...@bamlabs.com> wrote:
>>> Hannu,
>>>
>>> How can you read a partition in reverse?
>>>
>>> > On May 16, 2017, at 9:20 AM, Hannu Kröger <hkro...@gmail.com> wrote:
>>> >
>>> > Well, I’m guessing that Cassandra doesn't really know if the range
>>> > tombstone is useful for this or not.
>>> >
>>> > In many cases it might be that the partition contains data that is within
>>> > the range of the tombstone but is newer than the tombstone and therefore
>>> > it might be still be returned. Scanning through deleted data can be
>>> > avoided by reading the partition in reverse (if all the deleted data is
>>> > in the beginning of the partition). Eventually you will still end up
>>> > reading a lot of tombstones but you will get a lot of live data first and
>>> > the implicit query limit of 10000 probably is reached before you get to
>>> > the tombstones. Therefore you will get an immediate answer.
>>> >
>>> > Does it make sense?
>>> >
>>> > Hannu
>>> >
>>> >> On 16 May 2017, at 16:33, Stefano Ortolani <ostef...@gmail.com> wrote:
>>> >>
>>> >> Hi all,
>>> >>
>>> >> I am seeing inconsistencies when mixing range tombstones, wide
>>> >> partitions, and reverse iterators.
>>> >> I still have to understand if the behaviour is to be expected hence the
>>> >> message on the mailing list.
>>> >>
>>> >> The situation is conceptually simple. I am using a table defined as
>>> >> follows:
>>> >>
>>> >> CREATE TABLE test_cql.test_cf (
>>> >>     hash blob,
>>> >>     timeid timeuuid,
>>> >>     PRIMARY KEY (hash, timeid)
>>> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
>>> >>   AND compaction = {'class' : 'LeveledCompactionStrategy'};
>>> >>
>>> >> I then proceed by loading 2/3GB from 3 sstables which I know contain a
>>> >> really wide partition (> 512 MB) for `hash = x`. I then delete the
>>> >> oldest _half_ of that partition by executing the query below, and
>>> >> restart the node:
>>> >>
>>> >> DELETE
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = x AND timeid < y;
>>> >>
>>> >> If I keep compactions disabled the following query timeouts (takes more
>>> >> than 10 seconds to
>>> >> succeed):
>>> >>
>>> >> SELECT *
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>> >> ORDER BY timeid ASC;
>>> >>
>>> >> While the following returns immediately (obviously because no deleted
>>> >> data is ever read):
>>> >>
>>> >> SELECT *
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>> >> ORDER BY timeid DESC;
>>> >>
>>> >> If I force a compaction the problem is gone, but I presume just because
>>> >> the data is rearranged.
>>> >>
>>> >> It seems to me that reading by ASC does not make use of the range
>>> >> tombstone until C* reads the
>>> >> last sstables (which actually contains the range tombstone and is
>>> >> flushed at node restart), and it wastes time reading all rows that are
>>> >> actually not live anymore.
>>> >>
>>> >> Is this expected? Should the range tombstone actually help in these
>>> >> cases?
>>> >>
>>> >> Thanks a lot!
>>> >> Stefano