Hi all, I'm running Flink on EMR/YARN with 2x m3.xlarge instances and am checkpointing a fairly large RocksDB state to S3.
I've found that when the state size hits 10GB, the checkpoint takes around 6 minutes, according to the Flink dashboard. Originally my checkpoint interval was 5 minutes for the job, but I've found that the YARN container crashes (I guess because the checkpoint time is greater than the checkpoint interval), so have now decreased the checkpoint frequency to every 10 minutes. I was just wondering if anyone has any tips about how to reduce the checkpoint time. Taking 6 minutes to checkpoint ~10GB state means it's uploading at ~30MB/sec. I believe the m3.xlarge instances should have around 125MB/sec network bandwidth each, so I think the bottleneck is S3. Since there are 2 instances, I'm not sure if that means each instance is uploading at 15MB/sec - do the state uploads get shared equally among the instances, assuming the state is split equally between the task managers? If the state upload is split between the instances, perhaps the only way to speed up the checkpoints is to add more instances and task managers, and split the state equally among the task managers? Also just wondering - is there any chance the incremental checkpoints work will be complete any time soon? Thanks, Josh