Hi Tathagata/Cody,

I am facing a challenge in production with the DAG behaviour during
checkpointing in Spark Streaming -

Step 1: Read data from Kafka every 15 minutes - call this KafkaStreamRDD,
~100 GB of data.

Step 2: Repartition KafkaStreamRdd from 5 to 100 partitions to parallelise
processing - call this RepartitionedKafkaStreamRdd.

Step 3: On RepartitionedKafkaStreamRdd I run map and reduceByKeyAndWindow
over a window of 2 hours - call this RDD1, ~100 MB of data.

Checkpointing is enabled.
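
For context, the job is wired up roughly like the sketch below (broker list,
topic name and the key/value extraction in the map are simplified
placeholders, not the actual production code):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val checkpointDir = "hdfs:///checkpoints/kafka-windowed-agg"  // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("kafka-windowed-agg")
  val ssc = new StreamingContext(conf, Minutes(15))           // 15-minute batches
  ssc.checkpoint(checkpointDir)                               // checkpointing enabled

  // Step 1: direct Kafka stream, ~5 partitions, ~100 GB per 15-minute batch
  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")  // placeholder
  val kafkaStreamRdd = KafkaUtils
    .createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))                        // placeholder topic

  // Step 2: repartition from 5 to 100 partitions to parallelise processing
  val repartitionedKafkaStreamRdd = kafkaStreamRdd.repartition(100)

  // Step 3: map + reduceByKeyAndWindow over a 2-hour window -> RDD1, ~100 MB
  val rdd1 = repartitionedKafkaStreamRdd
    .map { case (key, value) => (key, 1L) }                   // placeholder map logic
    .reduceByKeyAndWindow(_ + _, Minutes(120))

  rdd1.print()                                                // placeholder output action
  ssc
}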

If I restart my streaming context, it picks up from the last checkpointed
state, re-reads the data for all 8 successfully finished 15-minute batches
from Kafka, and re-performs the repartition on all the data of those 8
batches.

It then reads the data for the current 15-minute batch and runs map and
reduceByKeyAndWindow over the 2-hour window.
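
The restart itself uses the standard recovery pattern, continuing the sketch
above (checkpointDir and createContext as defined there):

import org.apache.spark.streaming.StreamingContext

// Recover the StreamingContext from the checkpoint directory if it exists,
// otherwise build a fresh one with createContext.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()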

Challenges -
1> I can't checkpoint KafkaStreamRDD or RepartitionedKafkaStreamRdd, as this
is a huge amount of data (8 x ~100 GB, around 800 GB for 2 hours); reading
and writing (checkpointing) it every 15 minutes would be very slow.

2> I checkpoint the data of RDD1 every 15 minutes, map and
reduceByKeyAndWindow are run over RDD1 only, and I have a snapshot of all of
the last 8 15-minute batches of RDD1. Why, then, does Spark read all the data
for the last 8 successfully completed batches from Kafka again (Step 1),
again perform the repartitioning (Step 2), and then again run map and
reduceByKeyAndWindow over this freshly fetched kafkaStreamRdd data of the
last 8 15-minute batches? (The exact checkpoint call I mean is sketched below.)
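
To be explicit about what I mean by checkpointing RDD1 every 15 minutes,
continuing the sketch above, the call I have in mind is:

// Persist the small windowed stream (RDD1, ~100 MB) to the checkpoint
// directory every batch, i.e. every 15 minutes.
rdd1.checkpoint(Minutes(15))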

Because of the above challenges, I am not able to take advantage of
checkpointing when the streaming context is restarted under high load.

Please help me understand if there is something I am missing.

Regards,
Gaurav
