Re: Writing streaming data to cassandra creates duplicates

2015-08-04 Thread Gerard Maas
(removing dev from the to: as not relevant)

It would be good to see some sample data and the Cassandra schema to get a
more concrete idea of the problem space.

Some thoughts: reduceByKey could still be used to 'pick' one element.
Example of arbitrarily choosing the first one: reduceByKey { case (e1, e2) => e1 }
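For concreteness, a minimal sketch of that idea on a pair DStream keyed by the
primary key; the `Message` case class and its fields are assumptions for
illustration, not the actual schema:

```scala
import org.apache.spark.streaming.dstream.DStream

// Hypothetical message type; the real schema may differ.
case class Message(id: String, name: String, amount: Int)

def pickOnePerKey(stream: DStream[Message]): DStream[Message] =
  stream
    .map(m => (m.id, m))                 // key by the primary key
    .reduceByKey { case (e1, e2) => e1 } // arbitrarily keep the first element seen per key
    .map(_._2)
```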

The question to be answered is: what should happen to the multiple values
that arrive for 1 key?

And why are they creating duplicates in Cassandra? If they have the same
key, they will result in an overwrite (which is not desirable anyway, due to
tombstones).

-kr, Gerard.



On Tue, Aug 4, 2015 at 1:03 PM, Priya Ch learnings.chitt...@gmail.com wrote:

 Yes, union would be one solution. I am not doing any aggregation, hence
 reduceByKey would not be useful. If I use groupByKey, messages with the same
 key would be obtained in a partition. But groupByKey is a very expensive
 operation as it involves a shuffle. My ultimate goal is to write the
 messages to Cassandra. If messages with the same key are handled by
 different streams, there would be concurrency issues. To resolve this I can
 union the dstreams and apply a hash partitioner so that all messages with
 the same key end up in a single partition, or do a groupByKey, which does
 the same.

 As groupByKey is expensive, is there any workaround for this?
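A rough sketch of that union-plus-hash-partitioner idea, using DStream.transform
with RDD.partitionBy; the pair type and the number of partitions are
placeholders:

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Assuming receiver dstreams of (primaryKey, message) pairs and an existing StreamingContext.
def colocateByKey(ssc: StreamingContext,
                  streams: Seq[DStream[(String, String)]],
                  numPartitions: Int): DStream[(String, String)] =
  ssc.union(streams)
    // Repartition every batch so that all records with the same key
    // end up in the same partition, and hence in the same task.
    .transform(_.partitionBy(new HashPartitioner(numPartitions)))
```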


Re: Writing streaming data to cassandra creates duplicates

2015-08-04 Thread gaurav sharma
Ideally the two messages read from Kafka must differ on at least some
parameter; otherwise they are logically the same.

As a solution to your problem, if the message content is the same, you could
add a new UUID field, which could play the role of the partition key when
inserting the two messages into Cassandra:

Msg1 - UUID1, GAURAV, 100
Msg2 - UUID2, PRIYA, 200
Msg3 - UUID1, GAURAV, 100

Now when inserting into Cassandra, three different rows would be created.
Please note that even though Msg1 and Msg3 have the same content, they are
inserted as two separate rows in Cassandra, since they differ on the UUID,
which is the partition key in my column family.
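A small sketch of that approach, assuming the DataStax spark-cassandra-connector
is on the classpath and a table such as
my_keyspace.messages (id uuid PRIMARY KEY, name text, amount int) exists; the
keyspace, table, and row type below are placeholders:

```scala
import java.util.UUID

import com.datastax.spark.connector._              // spark-cassandra-connector implicits
import org.apache.spark.streaming.dstream.DStream

// Hypothetical row type mirroring the example messages above.
case class MessageRow(id: UUID, name: String, amount: Int)

def saveEachMessageAsNewRow(stream: DStream[(String, Int)]): Unit =
  stream
    .map { case (name, amount) => MessageRow(UUID.randomUUID(), name, amount) } // fresh UUID per message
    .foreachRDD(_.saveToCassandra("my_keyspace", "messages"))                   // one row per message, keyed by UUID
```

With the UUID as the partition key, two messages with identical content still
land in two distinct rows, as described above.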

Regards,
Gaurav
(please excuse spelling mistakes)
Sent from phone

Re: Writing streaming data to cassandra creates duplicates

2015-07-30 Thread Priya Ch
Hi All,

 Can someone share some insights on this?

On Wed, Jul 29, 2015 at 8:29 AM, Priya Ch learnings.chitt...@gmail.com wrote:

 Hi TD,

  Thanks for the info. My scenario is like this.

  I am reading data from a Kafka topic. Let's say Kafka has 3 partitions
 for the topic. In my streaming application, I would configure 3 receivers
 with 1 thread each, so that they receive 3 dstreams (from the 3 partitions
 of the Kafka topic), and I also implement a partitioner. Now there is a
 possibility of receiving messages with the same primary key twice or more:
 once when the message is created, and again whenever there is an update to
 any field of the same message.

 If two messages M1 and M2 with the same primary key are read by 2
 receivers, then even with the partitioner in Spark they would still end up
 being processed in parallel, since they are in different dstreams
 altogether. How do we address this situation?

 Thanks,
 Padma Ch
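For reference, a sketch of wiring up the three receivers described above and
unioning them, using the receiver-based Kafka API of Spark 1.x
(spark-streaming-kafka's KafkaUtils.createStream); the ZooKeeper quorum, group
id, and topic name are placeholders:

```scala
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils

// Assuming `ssc` already exists and the topic has 3 Kafka partitions.
def unionedKafkaStream(ssc: StreamingContext) = {
  val streams = (1 to 3).map { _ =>
    // One receiver per createStream call, each with 1 consumer thread for the topic.
    KafkaUtils.createStream(ssc, "zkhost:2181", "my-consumer-group", Map("my-topic" -> 1))
  }
  ssc.union(streams)   // a single DStream[(String, String)] of (key, message) pairs
}
```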


Re: Writing streaming data to cassandra creates duplicates

2015-07-30 Thread Juan Rodríguez Hortalá
Hi,

Just my two cents. I understand your problem is that you have messages with
the same key in two different dstreams. What I would do is make a union of
all the dstreams with StreamingContext.union, or several calls to
DStream.union, then create a pair dstream with the primary key as the key,
and then use groupByKey or reduceByKey (or combineByKey, etc.) to combine
the messages with the same primary key.
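A minimal sketch of that combination under an assumed Event type, where the
latest version of a key should win (the type and its field names are
illustrative only):

```scala
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.DStream

// Hypothetical event type; `id` is the primary key, `version` orders updates.
case class Event(id: String, version: Long, payload: String)

def latestPerKey(ssc: StreamingContext,
                 streams: Seq[DStream[Event]]): DStream[(String, Event)] =
  ssc.union(streams)                                              // merge all receiver dstreams
    .map(e => (e.id, e))                                          // pair dstream keyed by primary key
    .reduceByKey((a, b) => if (a.version >= b.version) a else b)  // keep the most recent update per key
```

Each batch then carries at most one record per primary key before it reaches
the Cassandra write.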

Hope that helps.

Greetings,

Juan



Re: Writing streaming data to cassandra creates duplicates

2015-07-28 Thread Tathagata Das
You have to partition the data in Spark Streaming by the primary key, and
then make sure the data is inserted into Cassandra atomically per key, or per
set of keys in the partition. You can use the combination of the (batch
time, partition id) of the RDD inside foreachRDD as the unique id for the
data you are inserting. This will guard against multiple attempts to run the
task that inserts into Cassandra.

See
http://spark.apache.org/docs/latest/streaming-programming-guide.html#semantics-of-output-operations
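A bare-bones sketch of deriving that unique id; the actual Cassandra write is
left as a commented placeholder, and writeAtomically is a hypothetical helper:

```scala
import org.apache.spark.TaskContext
import org.apache.spark.streaming.dstream.DStream

def insertWithBatchPartitionId[T](stream: DStream[T]): Unit =
  stream.foreachRDD { (rdd, batchTime) =>
    rdd.foreachPartition { records =>
      // (batch time, partition id) is deterministic across task retries,
      // so it can serve as an idempotency key for this chunk of output.
      val uniqueId = s"${batchTime.milliseconds}-${TaskContext.get().partitionId()}"
      // writeAtomically(uniqueId, records)  // hypothetical: insert this partition's records
      //                                     // into Cassandra atomically under uniqueId
    }
  }
```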

TD

On Sun, Jul 26, 2015 at 11:19 AM, Priya Ch learnings.chitt...@gmail.com
wrote:

 Hi All,

  I have a problem when writing streaming data to Cassandra. Our existing
 product is on an Oracle DB, in which, while writing data, locks are
 maintained so that duplicates in the DB are avoided.

 But as Spark has a parallel processing architecture, if more than one
 thread is trying to write the same data, i.e. with the same primary key, is
 there any scope for creating duplicates? If yes, how do we address this
 problem, either from the Spark side or from the Cassandra side?

 Thanks,
 Padma Ch