Hi All, can someone share some insights on this?
On Wed, Jul 29, 2015 at 8:29 AM, Priya Ch <learnings.chitt...@gmail.com> wrote:

> Hi TD,
>
> Thanks for the info. My scenario is like this.
>
> I am reading data from a Kafka topic. Let's say Kafka has 3 partitions
> for the topic. In my streaming application, I would configure 3 receivers
> with 1 thread each, so that they receive 3 DStreams (one from each of the
> 3 partitions of the Kafka topic), and I also implement a partitioner. Now
> there is a possibility of receiving messages with the same primary key
> twice or more: once when the message is created, and again whenever any
> field of the same message is updated.
>
> If two messages M1 and M2 with the same primary key are read by 2
> different receivers, then even with the partitioner in Spark they could
> still end up being processed in parallel, as they are in different
> DStreams altogether. How do we address this situation?
>
> Thanks,
> Padma Ch
>
> On Tue, Jul 28, 2015 at 12:12 PM, Tathagata Das <t...@databricks.com> wrote:
>
>> You have to partition the data in Spark Streaming by the primary key,
>> and then make sure you insert the data into Cassandra atomically per key,
>> or per set of keys in the partition. You can use the combination of the
>> batch time and the partition ID of the RDD inside foreachRDD as the
>> unique ID for the data you are inserting. This will guard against
>> multiple attempts to run the task that inserts into Cassandra.
>>
>> See
>> http://spark.apache.org/docs/latest/streaming-programming-guide.html#semantics-of-output-operations
>>
>> TD
>>
>> On Sun, Jul 26, 2015 at 11:19 AM, Priya Ch <learnings.chitt...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> I have a problem when writing streaming data to Cassandra. Our existing
>>> product is on an Oracle DB, in which locks are maintained while writing
>>> data so that duplicates in the DB are avoided.
>>>
>>> But since Spark has a parallel processing architecture, if more than one
>>> thread tries to write the same data, i.e. with the same primary key, is
>>> there any scope to create duplicates? If yes, how do we address this
>>> problem, either from the Spark side or from the Cassandra side?
>>>
>>> Thanks,
>>> Padma Ch
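Here is a minimal sketch of what TD suggests, and of how the cross-receiver
case Padma describes can be handled: union the 3 receiver DStreams, group by
primary key so M1 and M2 always land in the same partition of the same batch,
and derive a unique ID from the batch time and partition ID inside foreachRDD.
It assumes the receiver-based Kafka API from Spark 1.x; the ZooKeeper address,
topic and group names, and the extractPrimaryKey / upsertToCassandra helpers
are hypothetical placeholders you would adapt to your message format and
schema.

    import org.apache.spark.{HashPartitioner, SparkConf, TaskContext}
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    object DedupSketch {
      // Hypothetical helpers: adapt to your message format and table schema.
      def extractPrimaryKey(msg: String): String = msg.split(",")(0)
      def upsertToCassandra(key: String, msg: String, uniqueId: String): Unit = {
        // e.g. an upsert via the spark-cassandra-connector or a plain driver
      }

      def main(args: Array[String]): Unit = {
        val ssc = new StreamingContext(new SparkConf().setAppName("dedup"), Seconds(10))

        // 3 receivers with 1 consumer thread each (receiver-based Kafka API).
        val streams = (1 to 3).map { _ =>
          KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map("my-topic" -> 1))
        }

        // Union the 3 DStreams, then partition by primary key so that two
        // messages with the same key are never processed in parallel, even
        // if they were read by different receivers.
        val byKey = ssc.union(streams)
          .map { case (_, msg) => (extractPrimaryKey(msg), msg) }
          .groupByKey(new HashPartitioner(3))

        byKey.foreachRDD { (rdd, batchTime) =>
          rdd.foreachPartition { records =>
            // (batch time, partition ID) is unique per block of output; a
            // re-run of a failed task produces the same ID, so the write
            // stays idempotent.
            val uniqueId = s"${batchTime.milliseconds}-${TaskContext.get.partitionId}"
            records.foreach { case (key, msgs) =>
              msgs.foreach(m => upsertToCassandra(key, m, uniqueId))
            }
          }
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }

Note that within a batch, grouping by key serializes the handling of each key,
and across batches Cassandra's writes are upserts keyed by the primary key, so
replaying the same (batch time, partition ID) block overwrites rather than
duplicates rows.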