rkhachatryan commented on a change in pull request #12478: URL: https://github.com/apache/flink/pull/12478#discussion_r436563699
########## File path: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/channel/ChannelStateWriterImpl.java ########## @@ -56,7 +56,7 @@ public class ChannelStateWriterImpl implements ChannelStateWriter { private static final Logger LOG = LoggerFactory.getLogger(ChannelStateWriterImpl.class); - private static final int DEFAULT_MAX_CHECKPOINTS = 5; // currently, only single in-flight checkpoint is supported + private static final int DEFAULT_MAX_CHECKPOINTS = 100; // includes max-concurrent-checkpoints + checkpoints to be aborted (scheduled via mailbox) Review comment: The map is already cleared during the regular checkpoint abortion by the task thread but it happens with some delay. I see only two reasons why its size could significantly exceed `max-concurrent-checkpoints`: 1. bug in abort procedure 2. task thread is stuck while netty thread continues to receive new barriers fast Theoretically, 2nd case is unbounded, but in practice, I didn't observe issues with a value of 5 in `UnalignedITCase` after fixing other issues. I think such a scenario where task thread is not aborting writes would signify some problem and should be investigated. Besides that, aborting async parts of previous checkpoints upon receiving a new barrier won't work with `max-concurrent-checkpoints` > 1. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org