First of all, if you are running batches of 15 minutes and you don't need
second-level latencies, it might just be easier to run batch jobs in a for
loop - you will have greater control over what is going on.
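Roughly something like this untested sketch, against the Spark 1.x
spark-streaming-kafka (0.8) API - the broker address, topic name, and the
offset helpers are placeholders for whatever bookkeeping you actually use
(ZooKeeper, a database, HDFS):

import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

object KafkaBatchLoop {
  // Hypothetical offset bookkeeping - persist these yourself and commit
  // only after a batch succeeds, so a failed run is simply re-run.
  def loadCommittedOffset(): Long = 0L
  def fetchLatestOffset(): Long = 1000L
  def saveCommittedOffset(off: Long): Unit = ()

  def process(rdd: RDD[(String, String)]): Unit = rdd.foreach(println)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kafka-batch-loop"))
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

    while (true) {
      val from = loadCommittedOffset()
      val until = fetchLatestOffset()
      // One topic/partition shown for brevity
      val ranges = Array(OffsetRange.create("events", 0, from, until))

      // A plain batch RDD over an explicit offset range - no streaming
      // machinery, no checkpoint directory, and you decide when offsets
      // are committed.
      val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
        sc, kafkaParams, ranges)
      process(rdd)
      saveCommittedOffset(until)
      Thread.sleep(15 * 60 * 1000) // next run in 15 minutes
    }
  }
}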
And if you are using reduceByKeyAndWindow without the inverse reduce
function, then Spark has to re-reduce all of the data in the window on
every batch, instead of incrementally subtracting the batch that just slid
out of the window.
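For reference, a minimal sketch of the two forms, assuming a 15-minute
batch interval and a 60-minute window - the source and durations are
placeholders, and note the incremental variant requires a checkpoint
directory:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

object WindowDemo {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("window-demo"), Minutes(15))
    ssc.checkpoint("hdfs:///tmp/window-demo") // required by the incremental form

    // Placeholder source - substitute your Kafka direct stream here
    val pairs = ssc.socketTextStream("localhost", 9999).map(w => (w, 1L))

    // Non-incremental: the entire 60-minute window is re-reduced on
    // every 15-minute slide
    val full = pairs.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b, Minutes(60), Minutes(15))

    // Incremental: the batch leaving the window is subtracted via the
    // inverse function, so only the entering and leaving batches are touched
    val incr = pairs.reduceByKeyAndWindow(
      (a: Long, b: Long) => a + b, (a: Long, b: Long) => a - b,
      Minutes(60), Minutes(15))

    full.print(); incr.print()
    ssc.start(); ssc.awaitTermination()
  }
}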
Where are you calling checkpoint? Metadata checkpointing for a Kafka
direct stream should just be the offsets, not the data.
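The usual recovery pattern looks roughly like this (sketch against the
Spark 1.x API; the checkpoint path and broker are placeholders) - the
important part is that all DStream setup lives inside the creating
function so it can be rebuilt from the checkpoint:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectStreamApp {
  val checkpointDir = "hdfs:///tmp/app-checkpoint" // placeholder path

  def createContext(): StreamingContext = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("direct-stream"), Minutes(15))
    ssc.checkpoint(checkpointDir) // metadata checkpoint: offsets, DAG, config
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("events"))
    stream.map(_._2).count().print() // stand-in for the real pipeline
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On restart this recovers the context - including Kafka offsets -
    // from the checkpoint instead of calling createContext again
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}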
TD can better speak to reduceByKeyAndWindow behavior when restoring from a
checkpoint, but ultimately the only available choice would be to replay the
prior window data.
Hi Tathagata/Cody,
I am facing a challenge in production with DAG behaviour during
checkpointing in Spark Streaming:
Step 1: Read data from Kafka every 15 minutes - call this KafkaStreamRDD
(~100 GB of data).
Step 2: Repartition KafkaStreamRDD from 5 to 100 partitions to parallelise
processing -