Re: Spark Streaming: question on sticky session across batches?

2016-11-14 Thread Manish Malhotra
Sending again. Any help is appreciated!

Thanks in advance.



Spark Streaming: question on sticky session across batches?

2016-11-10 Thread Manish Malhotra
Hello Spark Devs/Users,

I'm trying to solve a use case with Spark Streaming 1.6.2 where, for every
batch (say 2 minutes), the data for a given key needs to go to the same
reducer node after grouping by key.
The underlying storage is Cassandra, not HDFS.

This is a map-reduce style job, where we are also trying to use the
partitions of the Cassandra table to batch together the data that belongs
to the same partition.
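As an aside, if the DataStax spark-cassandra-connector is an option, I
believe it can already co-locate an RDD with the Cassandra replicas that
own each partition key, which is one way to line Spark partitions up with
Cassandra partitions. A rough sketch (the keyspace, table, host, and key
class below are made-up placeholders):

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("cassandra-locality-sketch")
  .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder host
val sc = new SparkContext(conf)

// Hypothetical class mirroring the table's partition key.
case class UserKey(userId: String)

val batchKeys = sc.parallelize(Seq(UserKey("u1"), UserKey("u2")))

// Moves each element to a Spark node co-located with the Cassandra
// replica that owns its partition key ("ks" / "users" are made up).
val colocated = batchKeys.repartitionByCassandraReplica("ks", "users")

That gives Cassandra-aware placement within a batch; it does not by itself
make placement sticky across batches.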

The requirement for a sticky session/partition across batches comes from
the operations we need to perform: for every key, we have to read its
previously stored value and merge it with the current batch's aggregate
value. Currently, since there is no stickiness across batches, we have to
read for every key, merge, and write back in every batch, and the reads
are very expensive. With a sticky session we could avoid the read in every
batch and instead keep a cache of the aggregates up to the previous batch.
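For what it's worth, Spark Streaming 1.6 already has stateful operators
that keep per-key aggregates between batches, which might let you replace
the per-key read with in-memory state and only write back to Cassandra. A
minimal sketch with a running sum per key (the source, checkpoint dir, and
sink are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("sticky-state-sketch")
val ssc = new StreamingContext(conf, Seconds(120)) // 2-minute batches
ssc.checkpoint("/tmp/checkpoints")                 // required for stateful ops

val events = ssc.socketTextStream("localhost", 9999) // placeholder source
val pairs  = events.map(key => (key, 1L))

// Merge each batch's values into state carried across batches; the state
// RDD is partitioned by key, so a key stays in the same partition.
val totals = pairs.updateStateByKey[Long] { (batch: Seq[Long], state: Option[Long]) =>
  Some(state.getOrElse(0L) + batch.sum)
}

totals.print() // placeholder sink; write the updated aggregates to Cassandra instead

ssc.start()
ssc.awaitTermination()

mapWithState (new in 1.6) is a cheaper variant that only touches keys seen
in the current batch. Note this pins key -> partition; whether a
partition's state stays on the same executor is still subject to locality
scheduling.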

So, there are a few options I can think of:

1. Change TaskSchedulerImpl, since it uses Random to pick the node for a
mapper/reducer before starting the batch/phase.
Not sure if there is a custom-scheduler way of achieving this? (See the
note after this list.)

2. Could a custom RDD help to pin the key --> node mapping?
There is a getPreferredLocations() method.
But I'm not sure whether this placement will be persistent, or can vary in
some edge cases? (See the sketch after this list.)
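On option 1, one thing worth noting: with a fixed Partitioner, the
key -> partition-index mapping is already deterministic across batches;
the scheduler's randomness only affects which executor a given partition
index lands on. A tiny sketch of that distinction:

import org.apache.spark.HashPartitioner

val partitioner = new HashPartitioner(32) // arbitrary partition count
// getPartition is a pure function of the key, so the same key maps to the
// same partition index in every batch, regardless of scheduling.
assert(partitioner.getPartition("user-42") == partitioner.getPartition("user-42"))

So a scheduler change would only need to make the partition-index ->
executor assignment sticky, not the key -> partition mapping.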
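On option 2, here is a minimal sketch of a custom RDD that overrides
getPreferredLocations(), assuming some externally maintained
partition -> hosts table (PinnedLocationRDD and hostsForPartition are
made-up names):

import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

import scala.reflect.ClassTag

// Hypothetical wrapper that pins each parent partition to chosen hosts.
class PinnedLocationRDD[T: ClassTag](
    parent: RDD[T],
    hostsForPartition: Int => Seq[String]) // e.g. 0 -> Seq("node-a")
  extends RDD[T](parent) {

  override protected def getPartitions: Array[Partition] = parent.partitions

  override def compute(split: Partition, context: TaskContext): Iterator[T] =
    parent.iterator(split, context)

  // Advisory hint: the scheduler prefers these hosts for each partition.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    hostsForPartition(split.index)
}

On the persistence question: getPreferredLocations() is a preference, not
a guarantee. Whether tasks actually stay put across batches depends on
delay scheduling (spark.locality.wait) and on the executor set staying
stable, so it can indeed vary in exactly the edge cases you mention (busy
or lost executors).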

Thanks in advance for your help and time!

Regards,
Manish