Ideally the two messages read from Kafka should differ on at least some parameter; otherwise they are logically the same.
As a solution to your problem, if the message content is the same, you could create a new UUID field, which would play the role of partition key while inserting the messages into Cassandra:

Msg1 - UUID1, GAURAV, 100
Msg2 - UUID2, PRIYA, 200
Msg3 - UUID3, GAURAV, 100

Now when inserting into Cassandra, three different rows would be created. Note that even though Msg1 and Msg3 have the same content, they are inserted as two separate rows, since they differ on the UUID, which is the partition key in my column family.

Regards,
Gaurav
(please excuse spelling mistakes) Sent from phone
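A minimal sketch of that UUID-as-partition-key idea in Scala, assuming the DataStax spark-cassandra-connector; the Event case class, the my_keyspace.events table, and its column names are placeholders rather than anything agreed in this thread:

import java.util.UUID
import com.datastax.spark.connector.SomeColumns
import com.datastax.spark.connector.streaming._
import org.apache.spark.streaming.dstream.DStream

// Hypothetical shape of a message read from Kafka.
case class Event(name: String, amount: Int)

// Tag every incoming message with a fresh UUID and use it as the partition
// key, so two messages with identical content land as two separate rows.
def writeWithUuid(messages: DStream[Event]): Unit = {
  messages
    .map(e => (UUID.randomUUID(), e.name, e.amount))
    .saveToCassandra("my_keyspace", "events", SomeColumns("uuid", "name", "amount"))
}

This of course only makes sense if content-identical messages really should be kept as separate rows; if they should be collapsed instead, see the reduceByKey discussion further down the thread.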
On Aug 4, 2015 4:54 PM, "Gerard Maas" <gerard.m...@gmail.com> wrote:

> (removing dev from the to: as not relevant)
>
> It would be good to see some sample data and the Cassandra schema to have a more concrete idea of the problem space.
>
> Some thoughts: reduceByKey could still be used to 'pick' one element, for example arbitrarily choosing the first one: reduceByKey{case (e1,e2) => e1}
>
> The question to be answered is: what should happen to the multiple values that arrive for one key?
>
> And why are they creating duplicates in Cassandra? If they have the same key, they will result in an overwrite (that's not desirable due to tombstones anyway).
>
> -kr, Gerard.
>
> On Tue, Aug 4, 2015 at 1:03 PM, Priya Ch <learnings.chitt...@gmail.com> wrote:
>
>> Yes... union would be one solution. I am not doing any aggregation, hence reduceByKey would not be useful. If I use groupByKey, messages with the same key would be obtained in a partition. But groupByKey is a very expensive operation as it involves a shuffle. My ultimate goal is to write the messages to Cassandra. If the messages with the same key are handled by different streams, there would be concurrency issues. To resolve this I can union the dstreams and apply a hash partitioner so that it brings all the same keys to a single partition, or do a groupByKey which does the same.
>>
>> As groupByKey is expensive, is there any workaround for this?
>>
>> On Thu, Jul 30, 2015 at 2:33 PM, Juan Rodríguez Hortalá <juan.rodriguez.hort...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Just my two cents. I understand your problem is that you have messages with the same key in two different dstreams. What I would do is make a union of all the dstreams with StreamingContext.union or several calls to DStream.union, create a pair dstream with the primary key as key, and then use groupByKey or reduceByKey (or combineByKey etc.) to combine the messages with the same primary key.
>>>
>>> Hope that helps.
>>>
>>> Greetings,
>>>
>>> Juan
>>>
>>> 2015-07-30 10:50 GMT+02:00 Priya Ch <learnings.chitt...@gmail.com>:
>>>
>>>> Hi All,
>>>>
>>>> Can someone throw some insight on this?
>>>>
>>>> On Wed, Jul 29, 2015 at 8:29 AM, Priya Ch <learnings.chitt...@gmail.com> wrote:
>>>>
>>>>> Hi TD,
>>>>>
>>>>> Thanks for the info. I have a scenario like this.
>>>>>
>>>>> I am reading the data from a Kafka topic. Let's say Kafka has 3 partitions for the topic. In my streaming application, I would configure 3 receivers with 1 thread each such that they receive 3 dstreams (from the 3 partitions of the Kafka topic), and I also implement a partitioner. Now there is a possibility of receiving a message with the same primary key twice or more: once when the message is created, and again whenever there is an update to any field of the same message.
>>>>>
>>>>> If two messages M1 and M2 with the same primary key are read by 2 receivers, then even with the partitioner in Spark they would still end up being processed in parallel, as they are altogether in different dstreams. How do we address this situation?
>>>>>
>>>>> Thanks,
>>>>> Padma Ch
>>>>>
>>>>> On Tue, Jul 28, 2015 at 12:12 PM, Tathagata Das <t...@databricks.com> wrote:
>>>>>
>>>>>> You have to partition the data in Spark Streaming by the primary key, and then make sure you insert the data into Cassandra atomically per key, or per set of keys in the partition. You can use the combination of (batch time, partition id) of the RDD inside foreachRDD as the unique id for the data you are inserting. This will guard against multiple attempts to run the task that inserts into Cassandra.
>>>>>>
>>>>>> See http://spark.apache.org/docs/latest/streaming-programming-guide.html#semantics-of-output-operations
>>>>>>
>>>>>> TD
>>>>>>
>>>>>> On Sun, Jul 26, 2015 at 11:19 AM, Priya Ch <learnings.chitt...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> I have a problem when writing streaming data to Cassandra. Our existing product is on an Oracle DB in which, while writing data, locks are maintained such that duplicates in the DB are avoided.
>>>>>>>
>>>>>>> But as Spark has a parallel processing architecture, if more than one thread is trying to write the same data, i.e. with the same primary key, is there any scope to create duplicates? If yes, how do we address this problem, either from the Spark side or from the Cassandra side?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Padma Ch
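For completeness, a minimal sketch of the union-then-reduceByKey de-duplication that Gerard and Juan suggest above, assuming a hypothetical Msg case class whose id field is the primary key (all names are placeholders):

import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Hypothetical message type; the field names are placeholders.
case class Msg(id: String, name: String, amount: Int)

// Union the per-receiver dstreams so records with the same primary key meet
// in one place, then arbitrarily keep one record per key within each batch.
def dedup(ssc: StreamingContext, streams: Seq[DStream[Msg]]): DStream[Msg] =
  ssc.union(streams)
    .map(m => (m.id, m))           // key by the primary key
    .reduceByKey((e1, e2) => e1)   // keep the first element seen for a key
    .map(_._2)

Note that this only collapses duplicates arriving within the same batch; duplicates across batches would still need an idempotent write, e.g. along the lines of the (batch time, partition id) bookkeeping TD describes.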