Re: Range deletes, wide partitions, and reverse iterators

Hannu Kröger Tue, 16 May 2017 09:43:46 -0700

This is partly a guess, but Cassandra probably reads the sstables in some sort 
of sequence, so even if sstable 2 contains the tombstone, it still has to scan 
through sstable 1 for data that might need to be returned.
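
To make that guess concrete, here is a toy Python model of a merged read. It is
only a sketch under stated assumptions, not Cassandra's actual read path:
`read_partition`, the row layout `(clustering_key, write_timestamp)`, and the
sample sstables are all made up for illustration. It just shows that if the
reader cannot skip ahead using the tombstone's bounds, an ascending read has to
step over every shadowed row before it returns the first live one, while a
descending read hits live rows immediately.

```python
from heapq import merge


def read_partition(sstables, tombstone, limit, reverse=False):
    """Merge rows from several sstables in clustering order.

    Hypothetical model: each sstable is a list of
    (clustering_key, write_timestamp) tuples sorted by clustering key, and
    the range tombstone is (upper_bound, deleted_at), shadowing every row
    with key < upper_bound written at or before deleted_at.

    Returns (live_keys, rows_scanned).
    """
    upper_bound, deleted_at = tombstone
    # heapq.merge needs every input sorted in the direction of the merge.
    sources = [list(reversed(s)) if reverse else s for s in sstables]
    live, scanned = [], 0
    for key, ts in merge(*sources, reverse=reverse):
        scanned += 1
        if key < upper_bound and ts <= deleted_at:
            continue  # shadowed row: it was read, then thrown away
        live.append(key)
        if len(live) == limit:
            break
    return live, scanned


# Two sstables holding interleaved rows of the same partition (timestamp 1).
sstable_1 = [(k, 1) for k in range(0, 1000, 2)]
sstable_2 = [(k, 1) for k in range(1, 1000, 2)]
# A newer range tombstone (timestamp 2) deleting the oldest half: key < 500.
tombstone = (500, 2)

asc_live, asc_scanned = read_partition([sstable_1, sstable_2], tombstone, 10)
desc_live, desc_scanned = read_partition(
    [sstable_1, sstable_2], tombstone, 10, reverse=True
)
print("ASC :", asc_scanned, "rows scanned for", len(asc_live), "live rows")
print("DESC:", desc_scanned, "rows scanned for", len(desc_live), "live rows")
```

With these sample values the ascending read scans 510 rows to return 10, and
the descending read scans only 10.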


BR,
Hannu

> On 16 May 2017, at 19:40, Stefano Ortolani <ostef...@gmail.com> wrote:
> 
> Little update: the following query also times out, which is weird since the 
> range tombstone should have been read by then...
> 
> SELECT * 
> FROM test_cql.test_cf 
> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf 
> AND timeid < the_oldest_deleted_timeid
> ORDER BY timeid DESC;
> 
> 
> 
> On Tue, May 16, 2017 at 5:17 PM, Stefano Ortolani <ostef...@gmail.com> wrote:
> Yes, that was my intention but I wanted to cross-check with the ML and the 
> devs keeping an eye on it first.
> 
> On Tue, May 16, 2017 at 5:10 PM, Hannu Kröger <hkro...@gmail.com> wrote:
> Well,
> 
> sstables contain some statistics about the cell timestamps, and using that 
> information together with the tombstone timestamp it might be possible to 
> skip some data, but I’m not sure that Cassandra currently does that. Maybe it 
> would be worth opening a JIRA ticket to see what the devs think and whether 
> optimizing this case would make sense.
> 
> Hannu
> 
>> On 16 May 2017, at 18:03, Stefano Ortolani <ostef...@gmail.com> wrote:
>> 
>> Hi Hannu,
>> 
>> the piece of data in question is older. In my example the tombstone is the 
>> newest piece of data.
>> Since a range tombstone has information re the clustering key ranges, and 
>> the data is clustering key sorted, I would expect a linear scan not to be 
>> necessary.
>> 
>> On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger <hkro...@gmail.com> wrote:
>> Well, as mentioned, Cassandra probably doesn’t have the logic and data to 
>> skip bigger regions of deleted data based on a range tombstone. If some 
>> piece of data in a partition is newer than the tombstone, then it cannot be 
>> skipped. Therefore some partition-level statistics of cell ages would need 
>> to be kept in the column index for the skipping, and those are probably not 
>> there.
>> 
>> Hannu 
>> 
>>> On 16 May 2017, at 17:33, Stefano Ortolani <ostef...@gmail.com> wrote:
>>> 
>>> That is another way to see the question: are reverse iterators range 
>>> tombstone aware? Yes.
>>> That is why I am puzzled by the aforementioned behavior. 
>>> I would expect them to handle this case more gracefully.
>>> 
>>> Cheers,
>>> Stefano
>>> 
>>> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth <ni...@bamlabs.com> wrote:
>>> Hannu,
>>> 
>>> How can you read a partition in reverse?
>>> 
>>> Sent from my iPhone
>>> 
>>> > On May 16, 2017, at 9:20 AM, Hannu Kröger <hkro...@gmail.com> wrote:
>>> >
>>> > Well, I’m guessing that Cassandra doesn't really know if the range 
>>> > tombstone is useful for this or not.
>>> >
>>> > In many cases the partition might contain data that is within the range 
>>> > of the tombstone but is newer than the tombstone, and therefore it might 
>>> > still be returned. Scanning through deleted data can be avoided by 
>>> > reading the partition in reverse (if all the deleted data is at the 
>>> > beginning of the partition). Eventually you will still end up reading a 
>>> > lot of tombstones, but you will get a lot of live data first, and the 
>>> > implicit query limit of 10000 is probably reached before you get to the 
>>> > tombstones. Therefore you will get an immediate answer.
>>> >
>>> > Does it make sense?
>>> >
>>> > Hannu
>>> >
>>> >> On 16 May 2017, at 16:33, Stefano Ortolani <ostef...@gmail.com> wrote:
>>> >>
>>> >> Hi all,
>>> >>
>>> >> I am seeing inconsistencies when mixing range tombstones, wide 
>>> >> partitions, and reverse iterators.
>>> >> I still have to understand whether the behaviour is to be expected, 
>>> >> hence the message on the mailing list.
>>> >>
>>> >> The situation is conceptually simple. I am using a table defined as 
>>> >> follows:
>>> >>
>>> >> CREATE TABLE test_cql.test_cf (
>>> >>  hash blob,
>>> >>  timeid timeuuid,
>>> >>  PRIMARY KEY (hash, timeid)
>>> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
>>> >>  AND compaction = {'class' : 'LeveledCompactionStrategy'};
>>> >>
>>> >> I then proceed by loading 2/3GB from 3 sstables which I know contain a 
>>> >> really wide partition (> 512 MB) for `hash = x`. I then delete the 
>>> >> oldest _half_ of that partition by executing the query below, and 
>>> >> restart the node:
>>> >>
>>> >> DELETE
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = x AND timeid < y;
>>> >>
>>> >> If I keep compactions disabled, the following query times out (takes 
>>> >> more than 10 seconds to complete):
>>> >>
>>> >> SELECT *
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>> >> ORDER BY timeid ASC;
>>> >>
>>> >> While the following returns immediately (obviously because no deleted 
>>> >> data is ever read):
>>> >>
>>> >> SELECT *
>>> >> FROM test_cql.test_cf
>>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>>> >> ORDER BY timeid DESC;
>>> >>
>>> >> If I force a compaction the problem is gone, but I presume just because 
>>> >> the data is rearranged.
>>> >>
>>> >> It seems to me that reading by ASC does not make use of the range 
>>> >> tombstone until C* reads the last sstable (which actually contains the 
>>> >> range tombstone and is flushed at node restart), and it wastes time 
>>> >> reading all rows that are actually not live anymore.
>>> >>
>>> >> Is this expected? Should the range tombstone actually help in these 
>>> >> cases?
>>> >>
>>> >> Thanks a lot!
>>> >> Stefano
