First off, I hope this is appropriate here - I couldn't decide whether this was a 
question for Cassandra users or Spark users, so if you think it's in the wrong 
place feel free to redirect me.

I have a system that does a load of data manipulation using Spark. The output 
of this program is effectively the new state that I want my Cassandra table 
to be in, and the final step is to update Cassandra so that it matches this 
state.

At present I'm inserting all rows of my generated state into Cassandra. This 
works for new rows and also for updating existing rows, but of course it 
doesn't delete any rows that were already in Cassandra but are not in my new 
state.
 
The problem I have now is how best to delete these missing rows. Options I have 
considered are:

1. Setting a TTL on inserts roughly equal to my data refresh period (a rough 
sketch follows the list). This would probably be pretty performant, but I 
really don't want to do it because it would mean that all the data in my 
database would disappear if I had issues running my refresh task!

2. Every time I refresh the data, first fetching all primary keys from 
Cassandra and comparing them to the primary keys I have locally, to build a 
list of keys to delete before the insert (again, sketched after the list). 
This seems the most logically correct option, but it means reading a large 
volume of data back out of Cassandra on every refresh.

3. Truncating the entire table before refreshing Cassandra. This has the 
benefit of being very simple in code, but I'm not sure of the performance 
implications, or of what happens if I truncate while a node is offline.
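
For option 1, I'm assuming the TTL could be set on the Spark write itself. A minimal sketch, again assuming the DataStax connector (its WriteConf / TTLOption), with placeholder names and a made-up TTL of roughly two refresh periods:

    import com.datastax.spark.connector._
    import com.datastax.spark.connector.writer.{TTLOption, WriteConf}

    // Every inserted row expires after the TTL unless it is re-written by a
    // later refresh, so rows dropped from the new state eventually vanish
    // on their own.
    newState.saveToCassandra(
      "my_ks", "my_table",
      SomeColumns("pk", "value"),
      writeConf = WriteConf(ttl = TTLOption.constant(2 * 24 * 3600))
    )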
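
For option 2, what I have in mind is something like this: read just the primary key column back into Spark, subtract the keys present in the new state, and delete whatever is left. A rough sketch with placeholder names; deleteFromCassandra assumes a newer connector version (older ones would need manual CQL deletes, e.g. from foreachPartition):

    import com.datastax.spark.connector._
    import org.apache.spark.SparkContext
    import org.apache.spark.rdd.RDD

    def deleteStaleRows(sc: SparkContext, newState: RDD[(String, String)]): Unit = {
      // Pull only the primary key column from Cassandra, not whole rows.
      val existingKeys: RDD[String] =
        sc.cassandraTable("my_ks", "my_table").select("pk").map(_.getString("pk"))

      // Keys that exist in Cassandra but are missing from the new state.
      val staleKeys = existingKeys.subtract(newState.map(_._1))

      // Delete those rows by primary key.
      staleKeys.map(Tuple1(_))
        .deleteFromCassandra("my_ks", "my_table", keyColumns = SomeColumns("pk"))
    }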

For reference, the table is on the order of tens of millions of rows, and for 
any given refresh only a very small fraction (<0.1%) will actually need 
deleting; 99% of the time I'll just be overwriting existing keys.

I'd be grateful if anyone could offer some advice on the best solution here, 
or point out a better approach I haven't thought of.

Thanks,

Chris 
