Hi,

I am using the Spark Streaming checkpointing mechanism and reading data
from Kafka. The window duration for my application is 2 hours, with a
sliding interval of 15 minutes.

So my batches run at the following intervals:

   - 09:45
   - 10:00
   - 10:15
   - 10:30
   - and so on

When my job is restarted and recovers from the checkpoint, it performs the
repartitioning step twice for each 15-minute batch until a full 2-hour
window has elapsed. After that, repartitioning takes place only once per
batch.

For example, when the job recovers at 16:15, it repartitions both the
16:15 Kafka stream and the 14:15 Kafka stream, and all the other
intermediate stages are also computed for the 16:15 batch. I am using
reduceByKeyAndWindow with an inverse function. Once the 2-hour window is
complete, from 18:15 onward repartitioning takes place only once. It seems
the checkpoint does not store RDDs older than 2 hours, which is my window
duration. Because of this, my job takes longer than usual after recovery.

Is there a way, or some configuration parameter, that would help avoid
repartitioning twice?

I am attaching screenshots showing repartitioning taking place twice after
recovery from the checkpoint.

Thanks !!

Kundan
---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org