>  The "estimated droppable tombstone" value is actually always wrong. Because 
> it's an estimate that does not consider overlaps (and I'm not sure about the 
> fact it considers the gc_grace_seconds either).

It considers the time the tombstone was created and the gc_grace_seconds. It 
doesn't matter whether the tombstone is overlapped; it still needs to be kept 
for gc_grace_seconds before purging, or it can result in data resurrection. 
sstablemetadata cannot reliably or safely know table parameters that are not 
kept in the sstable, so to get an accurate value you have to provide a -g or 
--gc-grace-seconds parameter. I am not sure where the "always wrong" comes in, 
as the quantity of data that is being shadowed is not what it is tracking 
(although it would be more meaningful for single-sstable compactions if it 
did), just when tombstones can be purged.
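
A minimal sketch, assuming a sstablemetadata build that accepts the flag 
mentioned above (the exact spelling of the long form varies between versions), 
run against the sstable from the original message:

# Pass the table's real gc_grace_seconds (864000 s, i.e. 10 days, is the
# default) so "Estimated droppable tombstones" is computed against it.
sstablemetadata -g 864000 mc-5-big-Data.db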

Chris


> On Jan 25, 2019, at 8:11 AM, Alain RODRIGUEZ <arodr...@gmail.com> wrote:
> 
> Hello, 
> 
> I think you might be inserting on top of an existing collection; when you do, 
> Cassandra implicitly creates a range tombstone. Cassandra does not 
> update/delete data in place, it always inserts (data or tombstone). Then 
> eventually compaction merges the data and evicts the tombstones. Thus, when 
> overwriting an entire collection, Cassandra performs a delete first under the hood.
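> 
> For example, a sketch against the nmtest table from your message below: 
> setting the whole map in one statement (as your INSERT with a map literal 
> also does) makes Cassandra write a range tombstone over the old collection 
> before writing the new cells, and that DELETE stays hidden from you:
> 
> UPDATE ks.nmtest SET order_details = {'key': 'value'}
>   WHERE reservation_id = '3' AND order_id = '3';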
> 
> I wrote about this about 2 years ago, in the middle of this (long) 
> article: 
> http://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
> 
> Here is the part that might be of interest in your case:
> 
> "Note: When using collections, range tombstones will be generated by INSERT 
> and UPDATE operations every time you are using an entire collection, and not 
> updating parts of it. Inserting a collection over an existing collection, 
> rather than appending it or updating only an item in it, leads to range 
> tombstones insert followed by the insert of the new values for the 
> collection. This DELETE operation is hidden leading to some weird and 
> frustrating tombstones issues."
> 
> and
> 
> "From the mailing list I found out that James Ravn posted about this topic 
> using list example, but it is true for all the collections, so I won’t go 
> through more details, I just wanted to point this out as it can be 
> surprising, see: 
> http://www.jsravn.com/2015/05/13/cassandra-tombstones-collections.html#lists"
> 
> Thus to specifically answer your questions:
> 
>  Does this tombstone ever get removed?
> 
> Yes, after gc_grace_seconds (a table option) has passed AND if the data that is 
> shadowed by the tombstone is also part of the same compaction (all the 
> previous shards need to be there, if I remember correctly). So yes, but 
> eventually, not immediately nor any time soon (10+ days by default). 
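> 
> For reference, a sketch of where that option lives (assuming Cassandra 3.x, 
> where table options are exposed in system_schema); the default is 864000 
> seconds, i.e. 10 days:
> 
> SELECT gc_grace_seconds FROM system_schema.tables
>   WHERE keyspace_name = 'ks' AND table_name = 'nmtest';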
>  
> Also when I run sstablemetadata on the only sstable, it shows "Estimated 
> droppable tombstones" as 0.5. Similarly, it shows one record, with the epoch 
> time of the insert, under "Estimated tombstone drop times: 1548384720: 1". 
> Does it mean that when I run sstablemetadata on a table having collections, 
> the estimated droppable tombstone ratio and drop times are not true and 
> dependable values, due to collection/list range tombstones?
> 
> I do not remember this precisely, but you can check the code; it's worth 
> having a look. The "estimated droppable tombstone" value is actually always 
> wrong, because it's an estimate that does not consider overlaps (and I'm not 
> sure whether it considers gc_grace_seconds either), but also because the 
> calculation does not count a certain type of tombstone, and the weight of 
> range tombstones compared to tombstone cells makes the count quite 
> inaccurate: 
> http://thelastpickle.com/blog/2018/07/05/undetectable-tombstones-in-apache-cassandra.html
> 
> I think this has evolved since I last looked at it, and I might not remember 
> it well, but this value is definitely not accurate. 
> 
> If you're often re-inserting a collection for a given existing partition, 
> there are probably plenty of tombstones sitting around; that's almost 
> guaranteed.
> 
> Does the tombstone_threshold of compaction depend on the sstablemetadata 
> threshold value? If so, then for tables having collections, this is not a true 
> threshold, right?
> 
> Yes, I believe the tombstone threshold actually uses the "estimated droppable 
> tombstone" value to choose whether or not to trigger a "single-SSTable"/"tombstone" 
> compaction. Yet, in your case, this will not clean the tombstones for the 
> first 10 days at least (the gc_grace_seconds default value). Compactions do not 
> keep triggering because there is a minimum interval defined between two 
> tombstone compactions of an SSTable (1 day by default). This setting is most 
> probably keeping you away from a useless compaction loop, and I would not 
> try to change it. Collection or no collection does not change how the 
> compaction strategy operates.
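> 
> For reference, here is a sketch of the compaction sub-options being discussed, 
> shown with their default values and assuming the default 
> SizeTieredCompactionStrategy (again, I would not change them):
> 
> -- 0.2 is the estimated droppable tombstone ratio that triggers a
> -- single-SSTable compaction; 86400 s (1 day) is the minimum interval
> -- between two such compactions of the same SSTable.
> ALTER TABLE ks.nmtest WITH compaction = {
>     'class': 'SizeTieredCompactionStrategy',
>     'tombstone_threshold': '0.2',
>     'tombstone_compaction_interval': '86400'
> };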
> 
> I faced this in the past. Operationally you can have things working, but it's 
> hard and really pointless (it was in my case at least). I would definitely 
> recommend changing the model to update parts of the map and never rewrite a 
> map.
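> 
> A sketch of what updating parts of the map looks like for your table (the 
> 'status' key is only an illustrative example); none of these mutations 
> generate a range tombstone:
> 
> -- add or overwrite individual entries
> UPDATE ks.nmtest SET order_details = order_details + {'status': 'shipped'}
>   WHERE reservation_id = '3' AND order_id = '3';
> 
> -- change a single entry
> UPDATE ks.nmtest SET order_details['key'] = 'value2'
>   WHERE reservation_id = '3' AND order_id = '3';
> 
> -- remove a single entry (one cell tombstone, not a range tombstone)
> DELETE order_details['key'] FROM ks.nmtest
>   WHERE reservation_id = '3' AND order_id = '3';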
> 
> C*heers,
> -----------------------
> Alain Rodriguez - al...@thelastpickle.com
> France / Spain
> 
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
> On Fri, Jan 25, 2019 at 05:29, Ayub M <hia...@gmail.com> wrote:
> I have created a table with a collection, inserted a record, and took an 
> sstabledump of it, and I see there is a range tombstone for it in the sstable. 
> Does this tombstone ever get removed? Also when I run sstablemetadata on the 
> only sstable, it shows "Estimated droppable tombstones" as 0.5. Similarly, it 
> shows one record, with the epoch time of the insert, under "Estimated 
> tombstone drop times: 1548384720: 1". Does it mean that when I run 
> sstablemetadata on a table having collections, the estimated droppable 
> tombstone ratio and drop times are not true and dependable values, due to 
> collection/list range tombstones?
> 
> CREATE TABLE ks.nmtest (
>     reservation_id text,
>     order_id text,
>     c1 int,
>     order_details map<text, text>,
>     PRIMARY KEY (reservation_id, order_id)
> ) WITH CLUSTERING ORDER BY (order_id ASC);
> 
> user@cqlsh:ks> insert into nmtest (reservation_id , order_id , c1, 
> order_details ) values('3','3',3,{'key':'value'});
> user@cqlsh:ks> select * from nmtest ;
>  reservation_id | order_id | c1 | order_details
> ----------------+----------+----+------------------
>               3 |        3 |  3 | {'key': 'value'}
> (1 rows)
> 
> [root@localhost nmtest-e1302500201d11e983bb693c02c04c62]# sstabledump 
> mc-5-big-Data.db 
> WARN  02:52:19,596 memtable_cleanup_threshold has been deprecated and should 
> be removed from cassandra.yaml
> [
>   {
>     "partition" : {
>       "key" : [ "3" ],
>       "position" : 0
>     },
>     "rows" : [
>       {
>         "type" : "row",
>         "position" : 41,
>         "clustering" : [ "3" ],
>         "liveness_info" : { "tstamp" : "2019-01-25T02:51:13.574409Z" },
>         "cells" : [
>           { "name" : "c1", "value" : 3 },
>           { "name" : "order_details", "deletion_info" : { "marked_deleted" : 
> "2019-01-25T02:51:13.574408Z", "local_delete_time" : "2019-01-25T02:51:13Z" } 
> },
>           { "name" : "order_details", "path" : [ "key" ], "value" : "value" }
>         ]
>       }
>     ]
>   }
> ]
> 
> SSTable: /data/data/ks/nmtest-e1302500201d11e983bb693c02c04c62/mc-5-big
> Partitioner: org.apache.cassandra.dht.Murmur3Partitioner
> Bloom Filter FP chance: 0.010000
> Minimum timestamp: 1548384673574408
> Maximum timestamp: 1548384673574409
> SSTable min local deletion time: 1548384673
> SSTable max local deletion time: 2147483647
> Compressor: org.apache.cassandra.io.compress.LZ4Compressor
> Compression ratio: 1.0714285714285714
> TTL min: 0
> TTL max: 0
> First token: -155496620801056360 (key=3)
> Last token: -155496620801056360 (key=3)
> minClustringValues: [3]
> maxClustringValues: [3]
> Estimated droppable tombstones: 0.5
> SSTable Level: 0
> Repaired at: 0
> Replay positions covered: {CommitLogPosition(segmentId=1548382769966, 
> position=6243201)=CommitLogPosition(segmentId=1548382769966, 
> position=6433666)}
> totalColumnsSet: 2
> totalRows: 1
> Estimated tombstone drop times:
> 1548384720:         1
> 
> Does the tombstone_threshold of compaction depend on the sstablemetadata 
> threshold value? If so, then for tables having collections, this is not a true 
> threshold, right?
