Thanks! On Tue, Nov 15, 2016 at 1:07 AM Takeshi Yamamuro <linguin....@gmail.com> wrote:
> - dev > > Hi, > > AFAIK, if you use RDDs only, you can control the partition mapping to some > extent > by using a partition key RDD[(key, data)]. > A defined partitioner distributes data into partitions depending on the > key. > As a good example to control partitions, you can see the GraphX code; > > https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L291 > > GraphX holds `PartitionId` in edge RDDs to control the partition where > edge data are. > > // maropu > > > On Tue, Nov 15, 2016 at 5:19 AM, Manish Malhotra < > manish.malhotra.w...@gmail.com> wrote: > > sending again. > any help is appreciated ! > > thanks in advance. > > On Thu, Nov 10, 2016 at 8:42 AM, Manish Malhotra < > manish.malhotra.w...@gmail.com> wrote: > > Hello Spark Devs/Users, > > Im trying to solve the use case with Spark Streaming 1.6.2 where for every > batch ( say 2 mins) data needs to go to the same reducer node after > grouping by key. > The underlying storage is Cassandra and not HDFS. > > This is a map-reduce job, where also trying to use the partitions of the > Cassandra table to batch the data for the same partition. > > The requirement of sticky session/partition across batches is because the > operations which we need to do, needs to read data for every key and then > merge this with the current batch aggregate values. So, currently when > there is no stickyness across batches, we have to read for every key, merge > and then write back. and reads are very expensive. So, if we have sticky > session, we can avoid read in every batch and have a cache of till last > batch aggregates across batches. > > So, there are few options, can think of: > > 1. to change the TaskSchedulerImpl, as its using Random to identify the > node for mapper/reducer before starting the batch/phase. > Not sure if there is a custom scheduler way of achieving it? > > 2. Can custom RDD can help to find the node for the key-->node. > there is a getPreferredLocation() method. > But not sure, whether this will be persistent or can vary for some edge > cases? > > Thanks in advance for you help and time ! > > Regards, > Manish > > > > > > -- > --- > Takeshi Yamamuro >