- dev Hi,
AFAIK, if you use RDDs only, you can control the partition mapping to some extent by using a partition key RDD[(key, data)]. A defined partitioner distributes data into partitions depending on the key. As a good example to control partitions, you can see the GraphX code; https://github.com/apache/spark/blob/master/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala#L291 GraphX holds `PartitionId` in edge RDDs to control the partition where edge data are. // maropu On Tue, Nov 15, 2016 at 5:19 AM, Manish Malhotra < manish.malhotra.w...@gmail.com> wrote: > sending again. > any help is appreciated ! > > thanks in advance. > > On Thu, Nov 10, 2016 at 8:42 AM, Manish Malhotra < > manish.malhotra.w...@gmail.com> wrote: > >> Hello Spark Devs/Users, >> >> Im trying to solve the use case with Spark Streaming 1.6.2 where for >> every batch ( say 2 mins) data needs to go to the same reducer node after >> grouping by key. >> The underlying storage is Cassandra and not HDFS. >> >> This is a map-reduce job, where also trying to use the partitions of the >> Cassandra table to batch the data for the same partition. >> >> The requirement of sticky session/partition across batches is because the >> operations which we need to do, needs to read data for every key and then >> merge this with the current batch aggregate values. So, currently when >> there is no stickyness across batches, we have to read for every key, merge >> and then write back. and reads are very expensive. So, if we have sticky >> session, we can avoid read in every batch and have a cache of till last >> batch aggregates across batches. >> >> So, there are few options, can think of: >> >> 1. to change the TaskSchedulerImpl, as its using Random to identify the >> node for mapper/reducer before starting the batch/phase. >> Not sure if there is a custom scheduler way of achieving it? >> >> 2. Can custom RDD can help to find the node for the key-->node. >> there is a getPreferredLocation() method. >> But not sure, whether this will be persistent or can vary for some edge >> cases? >> >> Thanks in advance for you help and time ! >> >> Regards, >> Manish >> > > -- --- Takeshi Yamamuro