Re: ReduceByKeyAndWindow does repartitioning twice on recovering from checkpoint

2015-11-16 Thread Tathagata Das
ReduceByKeyAndWindow with an inverse function has a unique characteristic:
it reuses a lot of the intermediate partitioned, partially-reduced data. For
every new batch, it "reduces" the new data and "inverse reduces" the old
out-of-window data. That inverse-reduce needs data from an old batch. Under
normal operation, that old batch is already partitioned and cached, so only
the new data gets partitioned. That's not the case at recovery time: the
to-be-inverse-reduced old batch data has to be re-read from Kafka and
repartitioned again. Hence the two repartitions.
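
A minimal sketch of the pattern under discussion (the pair stream and the
Long counts are illustrative; only the reduce/inverse-reduce structure and
the 2-hour window / 15-minute slide come from this thread):

import org.apache.spark.streaming.Minutes
import org.apache.spark.streaming.dstream.DStream

def windowedCounts(pairs: DStream[(String, Long)]): DStream[(String, Long)] =
  pairs.reduceByKeyAndWindow(
    (a: Long, b: Long) => a + b,  // reduce: fold in the new batch entering the window
    (a: Long, b: Long) => a - b,  // inverse reduce: subtract the batch leaving the window
    Minutes(120),                 // window duration (2 hours)
    Minutes(15)                   // slide interval
  )

Because the inverse-reduce path subtracts an old batch rather than
recomputing the whole window, that old batch must be available - cached
under normal operation, but re-read from Kafka after a recovery.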

On Sun, Nov 15, 2015 at 9:05 AM, kundan kumar  wrote:

> Hi,
>
> I am using the Spark Streaming checkpointing mechanism and reading the
> data from Kafka. The window duration for my application is 2 hours, with a
> sliding interval of 15 minutes.
>
> So, my batches run at the following intervals:
>
>- 09:45
>- 10:00
>- 10:15
>- 10:30
>- and so on
>
> When my job is restarted and recovers from the checkpoint, it does the
> repartitioning step twice for each 15-minute batch until the 2-hour window
> is complete. After that, the repartitioning takes place only once.
>
> For example, when the job recovers at 16:15 it does repartitioning for
> the 16:15 Kafka stream as well as for the 14:15 Kafka stream, and all the
> other intermediate stages are computed for the 16:15 batch. I am using
> reduceByKeyAndWindow with an inverse function. Once the 2-hour window is
> complete, from 18:15 onward repartitioning takes place only once. It seems
> the checkpoint does not store RDDs beyond 2 hours, which is my window
> duration. Because of this, my job takes more time than usual.
>
> Is there a way, or some configuration parameter, that would help avoid
> repartitioning twice?
>
> Attaching screenshots of the repartitioning taking place twice after
> recovery from the checkpoint.
>
> Thanks !!
>
> Kundan
>


ReduceByKeyAndWindow does repartitioning twice on recovering from checkpoint

2015-11-15 Thread kundan kumar
Hi,

I am using the Spark Streaming checkpointing mechanism and reading the
data from Kafka. The window duration for my application is 2 hours, with a
sliding interval of 15 minutes.

So, my batches run at the following intervals:

   - 09:45
   - 10:00
   - 10:15
   - 10:30
   - and so on

When my job is restarted and recovers from the checkpoint, it does the
repartitioning step twice for each 15-minute batch until the 2-hour window
is complete. After that, the repartitioning takes place only once.

For example, when the job recovers at 16:15 it does repartitioning for
the 16:15 Kafka stream as well as for the 14:15 Kafka stream, and all the
other intermediate stages are computed for the 16:15 batch. I am using
reduceByKeyAndWindow with an inverse function. Once the 2-hour window is
complete, from 18:15 onward repartitioning takes place only once. It seems
the checkpoint does not store RDDs beyond 2 hours, which is my window
duration. Because of this, my job takes more time than usual.
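
For context, the setup is roughly the following sketch (the broker list,
topic name, checkpoint path, and counting logic are placeholders, not the
actual job):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val checkpointDir = "hdfs:///checkpoints/my-app"  // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("windowed-job")
  val ssc = new StreamingContext(conf, Minutes(15))  // batch interval = slide = 15 min
  ssc.checkpoint(checkpointDir)

  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc,
    Map("metadata.broker.list" -> "broker:9092"),  // placeholder brokers
    Set("events"))                                 // placeholder topic

  stream
    .map { case (_, value) => (value, 1L) }
    .reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b,   // reduce new data entering the window
      (a: Long, b: Long) => a - b,   // inverse-reduce data leaving the window
      Minutes(120), Minutes(15))
    .print()

  ssc
}

// on restart this recovers the context (and the DStream graph) from the
// checkpoint instead of building a new one
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()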

Is there a way, or some configuration parameter, that would help avoid
repartitioning twice?

Attaching screenshots of the repartitioning taking place twice after
recovery from the checkpoint.

Thanks !!

Kundan
