Batches are for atomicity, not performance. I would do single deletes with a prepared statement. An IN clause causes extra work for the coordinator because multiple partitions are being impacted. So, the coordinator has to coordinate all nodes involved in those writes (up to the whole cluster). Availability and performance are compromised for multiple partition operations. I do not allow them.
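For anyone who wants to see what "single deletes with a prepared statement" could look like from the client, here is a minimal sketch. It assumes the `key_value` table and `id` key mentioned further down the thread, and a `session` object that behaves like `cassandra.cluster.Session` from the DataStax Python driver (`prepare` and `execute`). The function name `delete_by_ids` is made up for illustration.

```python
from typing import Iterable

# Assumed schema from the thread: table key_value with partition key id.
DELETE_CQL = "DELETE FROM key_value WHERE id = ?"


def delete_by_ids(session, ids: Iterable) -> int:
    """Issue one single-partition DELETE per id, reusing one prepared statement.

    `session` is assumed to expose prepare() and execute() like the
    DataStax Python driver's Session; nothing else is required of it.
    """
    prepared = session.prepare(DELETE_CQL)  # prepared once, reused for every id
    count = 0
    for partition_id in ids:
        # Each request touches exactly one partition, so no coordinator
        # has to fan a single request out across many replica sets.
        session.execute(prepared, (partition_id,))
        count += 1
    return count
```

Compared with one `IN (...)` delete over many ids, each request here involves only the replicas of a single partition, which is the point made above.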
Also, setting a TTL at insert (or update) is a much better solution than large purge strategies. As someone who spent a month wrangling hundreds of billions of deletes, I am an ardent preacher of TTL at design time.

Sean Durity

From: Attila Wind <attilaw@swf.technology>
Sent: Friday, February 21, 2020 2:52 AM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: IN OPERATOR VS BATCH QUERY

Hi Sergio,

AFAIK you use batches when you want an "all or nothing" approach from Cassandra, that is, turning multiple statements into one atomic operation. One very typical use case is when you have denormalized data in multiple tables (optimized for different queries) but need to modify all of them the same way, as if they were one entity.

This means that if any of your delete statements failed for whatever reason, then all of your delete statements would be rolled back. I think you don't want that overhead here for sure...

We are not there yet with our development, but we will need similar "cleanup" functionality soon. I was also thinking about the IN operator for similar cases, but I am curious whether anyone here has a better idea...
Why does the IN operator blow up the coordinator? I do not entirely get it...

Thanks
Attila

Sergio <lapostadiser...@gmail.com> wrote (on Fri, Feb 21, 2020, 3:44):

The current approach is delete from key_value where id = whatever, and it is performed asynchronously from the client. I was thinking of at least reducing the network round-trips between client and coordinator with the batch approach. :)

In any case, I would test whether it improves things or not. So when do you use batch then?

Best,
Sergio

On Thu, Feb 20, 2020, 6:18 PM Erick Ramirez <erick.rami...@datastax.com> wrote:

Batches aren't really meant for optimisation in the same way as in an RDBMS. If anything, it will just put pressure on the coordinator having to fire off multiple requests to lots of replicas.
The IN operator falls into the same category and I personally wouldn't use it with more than 2 or 3 partitions because then the coordinator will suffer from the same problem. If it were me, I'd just issue single-partition deletes and throttle it to a "reasonable" throughput that your cluster can handle. The word "reasonable" is in quotes because only you can determine that magic number for your cluster through testing.

Cheers!
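The "single-partition deletes, throttled" suggestion could be sketched roughly as below. This is only an illustration under assumptions: `session` is expected to behave like the DataStax Python driver's `Session` (`prepare`, and `execute_async` returning a future with `result()`), the table and key come from the thread, and `throttled_delete`, `max_per_second`, `max_in_flight`, `sleep` and `clock` are hypothetical names. The right values are whatever your own testing says the cluster can handle.

```python
import time
from typing import Iterable

# Assumed schema from the thread: table key_value with partition key id.
DELETE_CQL = "DELETE FROM key_value WHERE id = ?"


def throttled_delete(session, ids: Iterable, max_per_second: float,
                     max_in_flight: int = 32,
                     sleep=time.sleep, clock=time.monotonic) -> int:
    """Send single-partition deletes asynchronously at a capped rate.

    `sleep` and `clock` are injectable only so the pacing can be tested;
    they are not part of any driver API.
    """
    prepared = session.prepare(DELETE_CQL)
    interval = 1.0 / max_per_second   # minimum spacing between sends
    in_flight = []
    sent = 0
    next_send = clock()
    for partition_id in ids:
        wait = next_send - clock()
        if wait > 0:
            sleep(wait)               # pace requests to the target throughput
        in_flight.append(session.execute_async(prepared, (partition_id,)))
        sent += 1
        # Schedule from "now" if we fell behind, so a stall is not
        # followed by a catch-up burst.
        next_send = max(next_send, clock()) + interval
        if len(in_flight) >= max_in_flight:
            for future in in_flight:
                future.result()       # surface errors, bound outstanding work
            in_flight.clear()
    for future in in_flight:
        future.result()
    return sent
```

A call such as `throttled_delete(session, ids, max_per_second=500)` would then be tuned up or down while watching coordinator and replica load.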