Thanks for a very helpful reply. Will try to refactor the code accordingly.
On Tue, Jan 16, 2018 at 4:36 PM, Alexander Dejanovski <a...@thelastpickle.com> wrote:

> I would not plan on deleting data at the row level, as you'll end up with a
> lot of tombstones eventually (and you won't even notice them).
> It's not healthy to allow that many tombstones to be read, and while your
> latency may fit your SLA now, it may not in the future.
> Tombstones are going to create a lot of heap pressure and eventually
> trigger long GC pauses, which then tend to affect the whole cluster (a slow
> node is worse than a down node).
>
> You should definitely separate data that is TTLed from data that is not into
> different tables so that you can adjust compaction strategies,
> gc_grace_seconds, and read patterns accordingly. I understand that it will
> complicate your code, but it will prevent severe performance issues in
> Cassandra.
>
> Tombstones won't be a problem for repair; they get repaired like
> regular cells. They mostly affect the read path negatively, and they use space
> on disk.
>
> On Tue, Jan 16, 2018 at 2:12 PM Python_Max <python....@gmail.com> wrote:
>
>> Hello.
>>
>> I was planning to remove a row (not a partition).
>>
>> Most of the tombstones appear in the use case of a geographic grid with
>> X:Y as the partition key and an object id (timeuuid) as the clustering key,
>> where objects can be temporary (TTL of about 10 hours) or fully persistent.
>> When I select all objects for a specific X:Y I can even hit the 100k (default)
>> limit for some X:Y. I have raised this limit to 500k since the 99.9p read
>> latency is < 75ms, so I should not (?) care how many tombstones there are
>> while read latency is fine.
>>
>> Splitting entities into temporary and permanent ones and using different
>> compaction strategies is an option, but it will lead to code duplication
>> and 2x read queries.
>>
>> Is my assumption correct that tombstones are not such a big problem as long
>> as read latency and disk usage are okay? Do tombstones affect repair time
>> (using Reaper)?
>>
>> Thanks.
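[Editor's note: the table split Alexander suggests could look roughly like the sketch below. Table names, the 10-hour TTL, and the compaction settings are assumptions based on the use case described in the thread, not a tested schema; gc_grace_seconds in particular must stay longer than your repair interval.]

```cql
-- Permanent objects: no TTL, default size-tiered compaction.
create table objects_permanent (
    grid_cell text,       -- the X:Y grid key
    object_id timeuuid,
    payload   text,
    primary key (grid_cell, object_id)
) with compaction = {'class': 'SizeTieredCompactionStrategy'};

-- Temporary objects: table-level TTL and TWCS so whole SSTables
-- can be dropped once all their data has expired.
create table objects_temporary (
    grid_cell text,
    object_id timeuuid,
    payload   text,
    primary key (grid_cell, object_id)
) with compaction = {'class': 'TimeWindowCompactionStrategy',
                     'compaction_window_unit': 'HOURS',
                     'compaction_window_size': 1}
  and default_time_to_live = 36000     -- 10 hours
  and gc_grace_seconds = 10800;        -- only if repairs run more often than this
```

As noted above, reads then need two queries (one per table), which is the code-duplication cost being weighed in this thread.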
>>
>> On Tue, Jan 16, 2018 at 11:32 AM, Alexander Dejanovski <a...@thelastpickle.com> wrote:
>>
>>> Hi,
>>>
>>> could you be more specific about the deletes you're planning to perform?
>>> This will end up moving your problem somewhere else, as you'll be
>>> generating new tombstones (and if you're planning on deleting rows, be
>>> aware that row-level tombstones aren't reported anywhere in the metrics,
>>> logs, or query traces).
>>> Currently you can delete your data at the partition level, which will
>>> create a single tombstone that shadows all your expired (and non-expired)
>>> data and is very efficient. The read path is optimized for such tombstones,
>>> and the data won't be fully read from disk nor exchanged between replicas.
>>> But that is, of course, only if your use case allows deleting full
>>> partitions.
>>>
>>> We usually model so that we can restrict our reads to live data.
>>> If you're creating time series, your clustering key should include a
>>> timestamp, which you can use to avoid reading expired data. If your TTL is
>>> set to 60 days, you can read only data that is strictly younger than that.
>>> Then you can partition by time ranges and access exclusively partitions
>>> that have no chance of being expired yet.
>>> Those techniques usually work better with TWCS, but the former could
>>> make you hit a lot of SSTables if your partitions can spread over all time
>>> buckets, so only use TWCS if you can restrict individual reads to at most
>>> 4 time windows.
>>>
>>> Cheers,
>>>
>>> On Tue, Jan 16, 2018 at 10:01 AM Python_Max <python....@gmail.com> wrote:
>>>
>>>> Hi.
>>>>
>>>> Thank you very much for the detailed explanation.
>>>> It seems that there is nothing I can do about it except delete records
>>>> by key instead of letting them expire.
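[Editor's note: the time-series modelling Alexander describes (a clustering timestamp plus time-bucketed partitions) might be sketched as follows for a 60-day TTL. The table, the weekly bucket granularity, and the identifiers are hypothetical illustrations, not from the thread.]

```cql
create table readings_by_week (
    source_id text,
    week      text,           -- time bucket in the partition key, e.g. '2018-W03'
    ts        timestamp,      -- clustering timestamp used to skip expired data
    value     text,
    primary key ((source_id, week), ts)
) with compaction = {'class': 'TimeWindowCompactionStrategy',
                     'compaction_window_unit': 'DAYS',
                     'compaction_window_size': 7}
  and default_time_to_live = 5184000;   -- 60 days

-- Read only data that cannot have expired yet: restrict ts to strictly
-- newer than (now - TTL), and only touch recent weekly buckets, so each
-- query stays within a few TWCS windows.
select value from readings_by_week
 where source_id = 'sensor-1'
   and week = '2018-W03'
   and ts > '2018-01-10 00:00:00+0000';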
>>>>
>>>> On Fri, Jan 12, 2018 at 7:30 PM, Alexander Dejanovski <a...@thelastpickle.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> As DuyHai said, different TTLs could theoretically be set for
>>>>> different cells of the same row. And one TTLed cell could be shadowing
>>>>> another cell that has no TTL (say you forgot to set a TTL and set one
>>>>> afterwards by performing an update), or vice versa.
>>>>> One cell could also be missing from a node without Cassandra knowing.
>>>>> So turning an incomplete row that only has expired cells into a row
>>>>> tombstone could lead to wrong results being returned at read time: the
>>>>> row tombstone could potentially shadow a valid live cell on another
>>>>> replica.
>>>>>
>>>>> Cassandra needs to retain each TTLed cell and send it to replicas
>>>>> during reads to cover all possible cases.
>>>>>
>>>>> On Fri, Jan 12, 2018 at 5:28 PM Python_Max <python....@gmail.com> wrote:
>>>>>
>>>>>> Thank you for the response.
>>>>>>
>>>>>> I know about the option of setting a TTL per column or even per item
>>>>>> in a collection. However, in my example the entire row has expired;
>>>>>> shouldn't Cassandra be able to detect this situation and spawn a single
>>>>>> tombstone for the entire row instead of many?
>>>>>> Is there any reason not to do this, other than that no one needs it?
>>>>>> Is this suitable for a feature request or improvement?
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> On Wed, Jan 10, 2018 at 4:52 PM, DuyHai Doan <doanduy...@gmail.com> wrote:
>>>>>>
>>>>>>> "The question is why Cassandra creates a tombstone for every column
>>>>>>> instead of single tombstone per row?"
>>>>>>>
>>>>>>> --> Simply because technically it is possible to set a different TTL
>>>>>>> value on each column of a CQL row
>>>>>>>
>>>>>>> On Wed, Jan 10, 2018 at 2:59 PM, Python_Max <python....@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hello, C* users and experts.
>>>>>>>>
>>>>>>>> I have (one more) question about tombstones.
>>>>>>>>
>>>>>>>> Consider the following example:
>>>>>>>>
>>>>>>>> cqlsh> create keyspace test_ttl with replication = {'class':
>>>>>>>> 'SimpleStrategy', 'replication_factor': '1'}; use test_ttl;
>>>>>>>> cqlsh> create table items(a text, b text, c1 text, c2 text, c3 text,
>>>>>>>> primary key (a, b));
>>>>>>>> cqlsh> insert into items(a,b,c1,c2,c3) values('AAA', 'BBB', 'C111',
>>>>>>>> 'C222', 'C333') using ttl 60;
>>>>>>>> bash$ nodetool flush
>>>>>>>> bash$ sleep 60
>>>>>>>> bash$ nodetool compact test_ttl items
>>>>>>>> bash$ sstabledump mc-2-big-Data.db
>>>>>>>>
>>>>>>>> [
>>>>>>>>   {
>>>>>>>>     "partition" : {
>>>>>>>>       "key" : [ "AAA" ],
>>>>>>>>       "position" : 0
>>>>>>>>     },
>>>>>>>>     "rows" : [
>>>>>>>>       {
>>>>>>>>         "type" : "row",
>>>>>>>>         "position" : 58,
>>>>>>>>         "clustering" : [ "BBB" ],
>>>>>>>>         "liveness_info" : { "tstamp" : "2018-01-10T13:29:25.777Z",
>>>>>>>>           "ttl" : 60, "expires_at" : "2018-01-10T13:30:25Z", "expired" : true },
>>>>>>>>         "cells" : [
>>>>>>>>           { "name" : "c1", "deletion_info" : { "local_delete_time" : "2018-01-10T13:29:25Z" } },
>>>>>>>>           { "name" : "c2", "deletion_info" : { "local_delete_time" : "2018-01-10T13:29:25Z" } },
>>>>>>>>           { "name" : "c3", "deletion_info" : { "local_delete_time" : "2018-01-10T13:29:25Z" } }
>>>>>>>>         ]
>>>>>>>>       }
>>>>>>>>     ]
>>>>>>>>   }
>>>>>>>> ]
>>>>>>>>
>>>>>>>> The question is why Cassandra creates a tombstone for every column
>>>>>>>> instead of a single tombstone per row.
>>>>>>>>
>>>>>>>> In the production environment I have a table with ~30 columns, and it
>>>>>>>> gives me a warning for 30k tombstones and 300 live rows. That is 30
>>>>>>>> times more than it could be.
>>>>>>>> Can this behavior be tuned in some way?
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>> --
>>>>>>>> Best regards,
>>>>>>>> Python_Max.
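[Editor's note: DuyHai's answer, that each cell can carry its own TTL, can be illustrated with a small variation of the items example above; the statements and values below are illustrative, not from the thread. After an update like this, c2 expires on its own schedule while c1 and c3 never do, so a single row-level tombstone could not represent the row's state correctly.]

```cql
insert into items(a, b, c1, c2, c3) values ('AAA', 'BBB', 'C111', 'C222', 'C333');

-- Give only c2 a TTL afterwards; c1 and c3 remain non-expiring cells.
update items using ttl 3600 set c2 = 'C222-new' where a = 'AAA' and b = 'BBB';
```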
>>>>>
>>>>> --
>>>>> -----------------
>>>>> Alexander Dejanovski
>>>>> France
>>>>> @alexanderdeja
>>>>>
>>>>> Consultant
>>>>> Apache Cassandra Consulting
>>>>> http://www.thelastpickle.com

--
Best regards,
Python_Max.