Sending again. Any help is appreciated! Thanks in advance.
On Thu, Nov 10, 2016 at 8:42 AM, Manish Malhotra <manish.malhotra.w...@gmail.com> wrote:

> Hello Spark Devs/Users,
>
> I'm trying to solve a use case with Spark Streaming 1.6.2 where, for every
> batch (say 2 minutes), data needs to go to the same reducer node after
> grouping by key.
> The underlying storage is Cassandra, not HDFS.
>
> This is a map-reduce job, where we are also trying to use the partitions of
> the Cassandra table to batch the data for the same partition.
>
> The requirement for sticky sessions/partitions across batches arises because
> the operations we need to perform must read the data for every key and then
> merge it with the current batch's aggregate values. Currently, since there
> is no stickiness across batches, we have to read for every key, merge, and
> then write back, and the reads are very expensive. With sticky sessions, we
> could avoid the read in every batch and keep a cache of the aggregates up to
> the last batch.
>
> There are a few options I can think of:
>
> 1. Change TaskSchedulerImpl, as it uses Random to identify the node for the
> mapper/reducer before starting the batch/phase.
> Not sure if there is a custom-scheduler way of achieving this?
>
> 2. Can a custom RDD help to find the node for the key --> node mapping?
> There is a getPreferredLocation() method.
> But I'm not sure whether this will be persistent or can vary in some edge
> cases?
>
> Thanks in advance for your help and time!
>
> Regards,
> Manish
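
To make the requirement concrete, here is a minimal sketch in plain Python (no Spark or Cassandra involved) of what "sticky" partitioning plus a per-partition cache would buy. Everything in it is an assumption made for illustration: the names partition_for, PartitionWorker and run_batch are hypothetical, a dict stands in for the Cassandra table, and the aggregate is a simple sum. It only shows that, when a key always maps to the same worker, the store needs to be read once per key rather than once per key per batch.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical number of reducer partitions


def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition id that is stable across batches.

    zlib.crc32 is used because Python's built-in hash() for strings can
    differ between processes.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class PartitionWorker:
    """Stand-in for a reducer that keeps running aggregates in memory."""

    def __init__(self):
        self.cache = {}       # key -> aggregate up to the last batch
        self.store_reads = 0  # counts the expensive reads from the store

    def merge_batch(self, records, store):
        # Aggregate the current batch per key first.
        batch_totals = defaultdict(int)
        for key, value in records:
            batch_totals[key] += value

        for key, total in batch_totals.items():
            if key not in self.cache:
                # Cache miss: pay for one read from the store.
                self.cache[key] = store.get(key, 0)
                self.store_reads += 1
            self.cache[key] += total
            store[key] = self.cache[key]  # write the merged value back


def run_batch(batch, workers, store):
    """Route each record to its sticky partition, then merge per worker."""
    routed = defaultdict(list)
    for key, value in batch:
        routed[partition_for(key, len(workers))].append((key, value))
    for partition_id, records in routed.items():
        workers[partition_id].merge_batch(records, store)
```

Usage, under the same assumptions: create one PartitionWorker per partition, keep the list alive across batches, and call run_batch once per batch. If a worker is lost, its cache is gone and the next batch falls back to reading the store for the affected keys, which is the edge case raised in option 2.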