Hi Tathagata/Cody,

I am facing a challenge in production with DAG behaviour during checkpointing in Spark Streaming:
Step 1 : Read data from Kafka every 15 minutes - call this KafkaStreamRDD (~100 GB of data).
Step 2 : Repartition KafkaStreamRDD from 5 to 100 partitions to parallelise processing - call this RepartitionedKafkaStreamRDD.
Step 3 : On RepartitionedKafkaStreamRDD, run map and reduceByKeyAndWindow over a window of 2 hours - call the result RDD1 (~100 MB of data).

Checkpointing is enabled. (A sketch of this pipeline is appended below the signature for reference.)

If I restart my streaming context, it picks up from the last checkpointed state, re-reads the data for all 8 successfully finished 15-minute batches from Kafka, and re-performs the repartition (Step 2) on all the data of those 8 batches. It then reads the data for the current 15-minute batch and runs map and reduceByKeyAndWindow over the 2-hour window.

Challenges -
1> I cannot checkpoint KafkaStreamRDD or RepartitionedKafkaStreamRDD, as this is huge data (around 800 GB for 2 hours); reading and writing (checkpointing) it every 15 minutes would be very slow.
2> Given that I have checkpointed RDD1 every 15 minutes, that map and reduceByKeyAndWindow run over RDD1 only, and that I have a snapshot of each of the last 8 15-minute batches of RDD1, why does Spark read all the data for the last 8 successfully completed batches from Kafka again (Step 1), perform the repartitioning again (Step 2), and then run map and reduceByKeyAndWindow again over the newly fetched KafkaStreamRDD data of those last 8 15-minute batches?

Because of the challenges above, I am not able to exploit checkpointing when the streaming context is restarted under high load. Please help me understand whether there is something I am missing.

Regards,
Gaurav
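
P.S. For reference, here is a minimal sketch of the pipeline described above, as I understand my own setup. It assumes the spark-streaming-kafka 0.8 direct API; the broker address, topic name, checkpoint directory, and the key extraction inside the map step are placeholders, not my real code.

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object StreamingPipelineSketch {

  val checkpointDir = "hdfs:///checkpoints/kafka-windowed-app"   // placeholder path

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("KafkaWindowedAggregation")
    val ssc = new StreamingContext(conf, Minutes(15))            // 15-minute batches
    ssc.checkpoint(checkpointDir)                                // checkpointing enabled

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")  // placeholder broker
    val topics = Set("events")                                       // placeholder topic

    // Step 1: KafkaStreamRDD, ~100 GB per 15-minute batch
    val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    // Step 2: RepartitionedKafkaStreamRDD, widen parallelism from 5 to 100 partitions
    val repartitioned = kafkaStream.repartition(100)

    // Step 3: RDD1, map + reduceByKeyAndWindow over a 2-hour window sliding every 15 minutes
    val rdd1 = repartitioned
      .map { case (_, value) => (value, 1L) }                    // placeholder key extraction
      .reduceByKeyAndWindow((a: Long, b: Long) => a + b, Minutes(120), Minutes(15))

    rdd1.checkpoint(Minutes(15))          // checkpoint only the small (~100 MB) windowed stream
    rdd1.foreachRDD(rdd => println(s"windowed records: ${rdd.count()}"))   // placeholder output action

    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover from the checkpoint directory if present, otherwise build a fresh context
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}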