Fwd: Writing streaming data to Cassandra creates duplicates

2015-08-04 Thread Priya Ch
Yes... union would be one solution. I am not doing any aggregation, hence reduceByKey would not be useful. If I use groupByKey, messages with the same key would be obtained in a partition. But groupByKey is a very expensive operation, as it involves a shuffle. My ultimate goal is to write the…

Re: Writing streaming data to Cassandra creates duplicates

2015-08-04 Thread Gerard Maas
(Removing dev from the to: list, as it's not relevant.) It would be good to see some sample data and the Cassandra schema to have a more concrete idea of the problem space. Some thoughts: reduceByKey could still be used to 'pick' one element. An example of arbitrarily choosing the first one: reduceByKey { case…
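
A minimal Scala sketch of that reduceByKey idea (the Message case class and its key field are hypothetical stand-ins for the actual schema):

  import org.apache.spark.streaming.dstream.DStream

  case class Message(key: String, payload: String)

  // Key the stream by the Cassandra primary key, then arbitrarily keep the
  // first element seen for each key before writing.
  def dedup(stream: DStream[Message]): DStream[Message] =
    stream
      .map(m => (m.key, m))
      .reduceByKey { case (first, _) => first }
      .map { case (_, m) => m }

Unlike groupByKey, reduceByKey combines values map-side before the shuffle, so only one element per key per partition crosses the network.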

Re: Writing streaming data to Cassandra creates duplicates

2015-08-04 Thread gaurav sharma
Ideally the 2 messages read from Kafka must differ on at least some parameter, or else they are logically the same. As a solution to your problem, if the message content is the same, you could create a new UUID field, which might play the role of the partition key while inserting the 2 messages into Cassandra. Msg1…
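
A hedged sketch of the UUID idea using the Spark Cassandra connector (keyspace, table, and column names are hypothetical). Note that this keeps both copies as distinct rows rather than deduplicating them:

  import java.util.UUID
  import com.datastax.spark.connector.streaming._  // adds saveToCassandra to DStreams
  import org.apache.spark.streaming.dstream.DStream

  case class Message(key: String, payload: String)
  case class MessageRow(id: UUID, key: String, payload: String)

  // Tag every message with a fresh UUID so two otherwise-identical messages
  // land as separate rows instead of colliding on the same primary key.
  def writeWithUuid(stream: DStream[Message]): Unit =
    stream
      .map(m => MessageRow(UUID.randomUUID(), m.key, m.payload))
      .saveToCassandra("my_keyspace", "messages")  // hypothetical keyspace/table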

Re: Writing streaming data to Cassandra creates duplicates

2015-07-30 Thread Priya Ch
Hi All, Can someone offer insights on this? On Wed, Jul 29, 2015 at 8:29 AM, Priya Ch learnings.chitt...@gmail.com wrote: Hi TD, Thanks for the info. My scenario is like this: I am reading the data from a Kafka topic. Let's say Kafka has 3 partitions for the topic. In my…

Re: Writing streaming data to Cassandra creates duplicates

2015-07-30 Thread Juan Rodríguez Hortalá
Hi, Just my two cents. I understand your problem is that you have messages with the same key in two different DStreams. What I would do is make a union of all the DStreams with StreamingContext.union or several calls to DStream.union, and then I would create a pair…
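
A minimal Scala sketch of the union-then-dedup approach (Message and its key are hypothetical; the individual receiver streams come from the setup described in the next message):

  import org.apache.spark.streaming.StreamingContext
  import org.apache.spark.streaming.dstream.DStream

  case class Message(key: String, payload: String)

  // Merge all receiver DStreams into one, then collapse duplicates by key.
  def unionAndDedup(ssc: StreamingContext,
                    streams: Seq[DStream[Message]]): DStream[Message] =
    ssc.union(streams)
      .map(m => (m.key, m))
      .reduceByKey { case (first, _) => first }
      .map { case (_, m) => m }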

Fwd: Writing streaming data to Cassandra creates duplicates

2015-07-28 Thread Priya Ch
Hi TD, Thanks for the info. My scenario is like this: I am reading the data from a Kafka topic. Let's say Kafka has 3 partitions for the topic. In my streaming application, I would configure 3 receivers with 1 thread each, such that they would receive 3 DStreams (from the 3 partitions of the Kafka…
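
A sketch of that setup with the receiver-based Kafka API (ZooKeeper quorum, group id, and topic name are hypothetical):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.kafka.KafkaUtils
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("kafka-union-sketch")
  val ssc  = new StreamingContext(conf, Seconds(10))

  // One receiver per Kafka partition, each with a single consumer thread.
  val streams = (1 to 3).map { _ =>
    KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("my-topic" -> 1))
  }

  // A single DStream covering all three receivers.
  val unified = ssc.union(streams)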

Re: Writing streaming data to Cassandra creates duplicates

2015-07-28 Thread Tathagata Das
You have to partition the data on the Spark Streaming side by the primary key, and then make sure you insert data into Cassandra atomically per key, or per set of keys in the partition. You can use the combination of the (batch time, partition id) of the RDD inside foreachRDD as the unique id for the…
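
A hedged sketch of this pattern (writeAtomically is a hypothetical stand-in for the actual idempotent Cassandra write; Message and its key are placeholders):

  import org.apache.spark.{HashPartitioner, TaskContext}
  import org.apache.spark.streaming.dstream.DStream

  case class Message(key: String, payload: String)

  def writeIdempotently(stream: DStream[Message]): Unit =
    stream.foreachRDD { (rdd, batchTime) =>
      rdd
        .map(m => (m.key, m))
        // Co-locate all records for a key in a single partition.
        .partitionBy(new HashPartitioner(rdd.sparkContext.defaultParallelism))
        .foreachPartition { records =>
          // (batch time, partition id) uniquely identifies this slice of
          // work, so a replayed task overwrites instead of duplicating.
          val uniqueId = (batchTime.milliseconds, TaskContext.get.partitionId)
          records.foreach { case (key, msg) =>
            writeAtomically(uniqueId, key, msg)  // hypothetical upsert
          }
        }
    }

  // Hypothetical placeholder for a write that is idempotent per (id, key).
  def writeAtomically(id: (Long, Int), key: String, msg: Message): Unit = ???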

Writing streaming data to Cassandra creates duplicates

2015-07-26 Thread Priya Ch
Hi All, I have a problem when writing streaming data to Cassandra. Our existing product is on an Oracle DB, in which, while writing data, locks are maintained such that duplicates in the DB are avoided. But as Spark has a parallel processing architecture, if more than 1 thread is trying to write the same…