Ideally the two messages read from Kafka should differ on at least some parameter; otherwise they are logically the same.
As a solution to your problem, if the message content is the same, you could create a new UUID field, which would play the role of partition key while inserting the messages into Cassandra:

Msg1 - UUID1, GAURAV, 100
Msg2 - UUID2, PRIYA, 200
Msg3 - UUID3, GAURAV, 100

Now when inserting into Cassandra, three different rows would be created. Note that even though Msg1 and Msg3 have the same content, they are inserted as two separate rows, since they differ on the UUID, which is the partition key in my column family.

Regards,
Gaurav
(please excuse spelling mistakes) Sent from phone
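A minimal sketch of that UUID-as-partition-key idea in Scala, assuming the DataStax spark-cassandra-connector; the Event case class, the my_keyspace.events table, and its column names are placeholders rather than anything agreed in this thread:

import java.util.UUID
import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._
import org.apache.spark.streaming.dstream.DStream

// Hypothetical shape of a message read from Kafka.
case class Event(name: String, amount: Int)

// Tag every incoming message with a fresh UUID and use it as the partition
// key, so two messages with identical content land as two separate rows.
def writeWithUuid(messages: DStream[Event]): Unit = {
  messages
    .map(e => (UUID.randomUUID(), e.name, e.amount))
    .saveToCassandra("my_keyspace", "events", SomeColumns("uuid", "name", "amount"))
}

This of course only makes sense if content-identical messages really should be kept as separate rows; if they should be collapsed instead, see the reduceByKey discussion further down the thread.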
On Aug 4, 2015 4:54 PM, "Gerard Maas" <gerard.m...@gmail.com> wrote:

> (removing dev from the to: as not relevant)
>
> It would be good to see some sample data and the Cassandra schema to have a more concrete idea of the problem space.
>
> Some thoughts: reduceByKey could still be used to 'pick' one element, for example arbitrarily choosing the first one: reduceByKey{case (e1,e2) => e1}
>
> The question to be answered is: what should happen to the multiple values that arrive for one key?
>
> And why are they creating duplicates in Cassandra? If they have the same key, they will result in an overwrite (that's not desirable due to tombstones anyway).
>
> -kr, Gerard.
>
> On Tue, Aug 4, 2015 at 1:03 PM, Priya Ch <learnings.chitt...@gmail.com> wrote:
>
>> Yes... union would be one solution. I am not doing any aggregation, hence reduceByKey would not be useful. If I use groupByKey, messages with the same key would be obtained in a partition. But groupByKey is a very expensive operation as it involves a shuffle. My ultimate goal is to write the messages to Cassandra. If the messages with the same key are handled by different streams, there would be concurrency issues. To resolve this I can union the dstreams and apply a hash partitioner so that it brings all the same keys to a single partition, or do a groupByKey which does the same.
>>
>> As groupByKey is expensive, is there any workaround for this?
>>
>> On Thu, Jul 30, 2015 at 2:33 PM, Juan Rodríguez Hortalá <juan.rodriguez.hort...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Just my two cents. I understand your problem is that you have messages with the same key in two different dstreams. What I would do is make a union of all the dstreams with StreamingContext.union or several calls to DStream.union, create a pair dstream with the primary key as key, and then use groupByKey or reduceByKey (or combineByKey etc.) to combine the messages with the same primary key.
>>>
>>> Hope that helps.
>>>
>>> Greetings,
>>>
>>> Juan
>>>
>>> 2015-07-30 10:50 GMT+02:00 Priya Ch <learnings.chitt...@gmail.com>:
>>>
>>>> Hi All,
>>>>
>>>> Can someone throw some insight on this?
>>>>
>>>> On Wed, Jul 29, 2015 at 8:29 AM, Priya Ch <learnings.chitt...@gmail.com> wrote:
>>>>
>>>>> Hi TD,
>>>>>
>>>>> Thanks for the info. I have a scenario like this.
>>>>>
>>>>> I am reading the data from a Kafka topic. Let's say Kafka has 3 partitions for the topic. In my streaming application, I would configure 3 receivers with 1 thread each such that they receive 3 dstreams (from the 3 partitions of the Kafka topic), and I also implement a partitioner. Now there is a possibility of receiving a message with the same primary key twice or more: once when the message is created, and again whenever there is an update to any field of the same message.
>>>>>
>>>>> If two messages M1 and M2 with the same primary key are read by 2 receivers, then even with the partitioner in Spark they would still end up being processed in parallel, as they are altogether in different dstreams. How do we address this situation?
>>>>>
>>>>> Thanks,
>>>>> Padma Ch
>>>>>
>>>>> On Tue, Jul 28, 2015 at 12:12 PM, Tathagata Das <t...@databricks.com> wrote:
>>>>>
>>>>>> You have to partition the data in Spark Streaming by the primary key, and then make sure you insert the data into Cassandra atomically per key, or per set of keys in the partition. You can use the combination of (batch time, partition id) of the RDD inside foreachRDD as the unique id for the data you are inserting. This will guard against multiple attempts to run the task that inserts into Cassandra.
>>>>>>
>>>>>> See http://spark.apache.org/docs/latest/streaming-programming-guide.html#semantics-of-output-operations
>>>>>>
>>>>>> TD
>>>>>>
>>>>>> On Sun, Jul 26, 2015 at 11:19 AM, Priya Ch <learnings.chitt...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I have a problem when writing streaming data to Cassandra. Our existing product is on an Oracle DB in which, while writing data, locks are maintained such that duplicates in the DB are avoided.
>>>>>>>
>>>>>>> But as Spark has a parallel processing architecture, if more than one thread is trying to write the same data, i.e. with the same primary key, is there any scope to create duplicates? If yes, how do we address this problem, either from the Spark side or from the Cassandra side?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Padma Ch
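For completeness, a minimal sketch of the union-then-reduceByKey de-duplication that Gerard and Juan suggest above, assuming a hypothetical Msg case class whose id field is the primary key (all names are placeholders):

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Hypothetical message type; the field names are placeholders.
case class Msg(id: String, name: String, amount: Int)

// Union the per-receiver dstreams so records with the same primary key meet
// in one place, then arbitrarily keep one record per key within each batch.
def dedup(ssc: StreamingContext, streams: Seq[DStream[Msg]]): DStream[Msg] =
  ssc.union(streams)
    .map(m => (m.id, m))           // key by the primary key
    .reduceByKey((e1, e2) => e1)   // keep the first element seen for a key
    .map(_._2)

Note that this only collapses duplicates arriving within the same batch; duplicates across batches would still need an idempotent write, e.g. along the lines of the (batch time, partition id) bookkeeping TD describes.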