So you need some state between messages in a partition. You can use mapPartitions or foreachPartition, which allow you to write code to process an entire partition.
On Thu, Aug 13, 2015 at 11:48 AM, Priya Ch <learnings.chitt...@gmail.com> wrote: > Hi Philip, > > I have the following requirement - > I read the streams of data from various partitions of kafka topic. And > then I union the dstreams and apply hash partitioner so messages of same > key would go into single partition of an rdd, which is ofcourse handled by > a single thread. This way we trying to resolve concurrency issue. > > Now one of the partitions of the rdd holds messages with same key. Let's > say 1st message in the partition may correspond to ticket issuance and 2nd > message might corresponds to update on the ticket. Now while handling 1st > message there is different logic and 2nd message's logic depends on 1st > message. > Hence using rdd.foreach i am handling different logic for individual > messages. Now bulk rdd.saveToCassandra will now work. > > Hope you got what i am trying to say.. > > On Fri, Aug 14, 2015 at 12:07 AM, Philip Weaver <philip.wea...@gmail.com> > wrote: > >> All you'd need to do is *transform* the rdd before writing it, e.g. >> using the .map function. >> >> >> On Thu, Aug 13, 2015 at 11:30 AM, Priya Ch <learnings.chitt...@gmail.com> >> wrote: >> >>> Hi All, >>> >>> I have a question in writing rdd to cassandra. Instead of writing >>> entire rdd to cassandra, i want to write individual statement into >>> cassandra beacuse there is a need to perform to ETL on each message ( which >>> requires checking with the DB). >>> How could i insert statements individually? Using >>> CassandraConnector.session ?? >>> >>> If so, what is the performance impact of this ? How about using >>> sc.parallelize() for eah message in the rdd and then insert into cassandra ? >>> >>> Thanks, >>> Padma Ch >>> >> >> >