You have to partition the data in Spark Streaming by the primary key, and then make sure the data is inserted into Cassandra atomically per key, or per set of keys in the partition. You can use the combination of (batch time, partition id) of the RDD inside foreachRDD as a unique id for the data you are inserting. This guards against duplicates from multiple attempts to run the task that inserts into Cassandra. A sketch of that pattern follows below.
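For illustration, here is a minimal sketch of that pattern in Scala with the spark-cassandra-connector. The socket source, keyspace, table, and column names are all assumptions for the example; it also assumes batch_time and partition_id are part of the Cassandra table's primary key, so a re-executed task regenerates the same rows and the writes become idempotent upserts.

    import org.apache.spark.{HashPartitioner, SparkConf, TaskContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import com.datastax.spark.connector._  // spark-cassandra-connector

    object IdempotentWrites {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("IdempotentWrites")
          .set("spark.cassandra.connection.host", "127.0.0.1")  // assumed host
        val ssc = new StreamingContext(conf, Seconds(10))

        // Hypothetical source of comma-separated (primaryKey,value) records.
        val pairs = ssc.socketTextStream("localhost", 9999).map { line =>
          val Array(k, v) = line.split(",", 2)
          (k, v)
        }

        pairs.foreachRDD { (rdd, batchTime) =>
          rdd
            // All records with the same key land in the same partition,
            // so each key is written by exactly one task per batch.
            .partitionBy(new HashPartitioner(rdd.sparkContext.defaultParallelism))
            .mapPartitions { records =>
              // (batch time, partition id) uniquely identifies this slice
              // of the stream; a re-executed task regenerates the same ids,
              // so the Cassandra upserts overwrite instead of duplicating.
              val pid = TaskContext.get.partitionId()
              records.map { case (k, v) => (k, v, batchTime.milliseconds, pid) }
            }
            // Keyspace, table, and column names are assumptions; batch_time
            // and partition_id are assumed clustering columns of the table.
            .saveToCassandra("test_ks", "events",
              SomeColumns("key", "value", "batch_time", "partition_id"))
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }

Note that this trades storage for safety: every (batch, partition) slice is kept as distinct rows, and reads have to pick the latest. If you only need last-write-wins per key, Cassandra's native upsert semantics already prevent duplicate rows for the same primary key.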
See http://spark.apache.org/docs/latest/streaming-programming-guide.html#semantics-of-output-operations

TD

On Sun, Jul 26, 2015 at 11:19 AM, Priya Ch <learnings.chitt...@gmail.com> wrote:

> Hi All,
>
> I have a problem when writing streaming data to Cassandra. Our existing
> product is on an Oracle DB, in which locks are maintained while writing
> data so that duplicates in the DB are avoided.
>
> But since Spark has a parallel processing architecture, if more than one
> thread tries to write the same data, i.e. with the same primary key, is
> there any scope for duplicates to be created? If yes, how can this problem
> be addressed, either from the Spark side or from the Cassandra side?
>
> Thanks,
> Padma Ch