Hi TD,

 Thanks for the info. My scenario is as follows.

 I am reading data from a Kafka topic. Let's say the topic has 3
partitions. In my streaming application, I would configure 3 receivers
with 1 thread each, so that they receive 3 DStreams (one per partition
of the Kafka topic), and I also implement a partitioner. Now there is a
possibility of receiving messages with the same primary key twice or more:
once when the message is created, and again whenever any field of that
message is updated.

If two messages M1 and M2 with the same primary key are read by 2 different
receivers, then even with the partitioner in Spark they would still end up
being processed in parallel, as they are in altogether different DStreams.
How do we address this situation?
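
To make the concern concrete, here is a minimal sketch of the routing I
have in mind, assuming the three receiver streams are first merged into one
(Spark Streaming offers a union on DStreams) and only then partitioned by
primary key. This is plain Python, not Spark code; partition_for, route and
the sample messages are hypothetical names used only for illustration.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # assumption: same count as the Kafka partitions


def partition_for(primary_key, num_partitions=NUM_PARTITIONS):
    # Deterministic hash, so the same key always maps to the same partition.
    return zlib.crc32(primary_key.encode("utf-8")) % num_partitions


def route(receiver_batches):
    """Merge the per-receiver batches, then group messages by key partition."""
    partitions = defaultdict(list)
    for batch in receiver_batches:  # the "union" step
        for key, payload in batch:
            partitions[partition_for(key)].append((key, payload))
    return dict(partitions)


# Hypothetical data: M1 and M2 share a primary key but arrive on
# different receivers.
receiver_batches = [
    [("pk-1", "M1 created")],
    [("pk-1", "M2 updated"), ("pk-2", "created")],
    [("pk-3", "created")],
]
routed = route(receiver_batches)
```

With this routing, both messages for pk-1 end up in the same partition, so
a single task would see them together instead of two tasks racing.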

Thanks,
Padma Ch

On Tue, Jul 28, 2015 at 12:12 PM, Tathagata Das <t...@databricks.com> wrote:

> You have to partition the data in Spark Streaming by the primary key,
> and then make sure you insert data into Cassandra atomically per key, or
> per set of keys in the partition. You can use the combination of (batch
> time, partition id) of the RDD inside foreachRDD as the unique id for
> the data you are inserting. This will guard against multiple attempts to
> run the task that inserts into Cassandra.
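
As I understand the suggestion above, the idea can be sketched roughly as
follows. This is plain Python with an in-memory stand-in for the database,
not a Cassandra client or Spark code; FakeStore, write_partition and the
sample values are hypothetical. In a real store, the "already applied"
check and the write would have to happen atomically, which this sketch
does not show.

```python
class FakeStore:
    """In-memory stand-in for a table; illustrative only."""

    def __init__(self):
        self.rows = {}        # primary key -> latest value
        self.applied = set()  # (batch_time, partition_id) ids already written

    def write_partition(self, batch_time, partition_id, records):
        write_id = (batch_time, partition_id)
        if write_id in self.applied:
            return False  # a retried task: skip, the data is already in
        for key, value in records:
            self.rows[key] = value  # upsert by primary key, last one wins
        self.applied.add(write_id)
        return True


store = FakeStore()
records = [("pk-1", "created"), ("pk-1", "updated")]

# The first attempt writes; a second attempt for the same batch and
# partition is recognised by its id and ignored.
first = store.write_partition(1438070000, 0, records)
retry = store.write_partition(1438070000, 0, records)
```

Because the id depends only on the batch time and the partition id, a
re-run of the same task produces the same id and has no further effect.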
>
> See
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#semantics-of-output-operations
>
> TD
>
> On Sun, Jul 26, 2015 at 11:19 AM, Priya Ch <learnings.chitt...@gmail.com>
> wrote:
>
>> Hi All,
>>
>>  I have a problem when writing streaming data to Cassandra. Our existing
>> product is on Oracle DB, in which locks are maintained while writing data
>> so that duplicates in the DB are avoided.
>>
>> But as Spark has a parallel processing architecture, if more than 1 thread
>> is trying to write the same data, i.e. with the same primary key, is there
>> any scope for creating duplicates? If yes, how do we address this problem,
>> either from the Spark side or from the Cassandra side?
>>
>> Thanks,
>> Padma Ch
>>
>
>
