[GitHub] [flink] rkhachatryan commented on a change in pull request #12478: [FLINK-17869][task][checkpointing] Fix race condition when caling ChannelStateWriter.abort

GitBox Mon, 08 Jun 2020 02:22:12 -0700


rkhachatryan commented on a change in pull request #12478:
URL: https://github.com/apache/flink/pull/12478#discussion_r436563699




##########
File path: 
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/channel/ChannelStateWriterImpl.java
##########
@@ -56,7 +56,7 @@
 public class ChannelStateWriterImpl implements ChannelStateWriter {
 
        private static final Logger LOG = 
LoggerFactory.getLogger(ChannelStateWriterImpl.class);
-       private static final int DEFAULT_MAX_CHECKPOINTS = 5; // currently, 
only single in-flight checkpoint is supported
+       private static final int DEFAULT_MAX_CHECKPOINTS = 100; // includes 
max-concurrent-checkpoints + checkpoints to be aborted (scheduled via mailbox)

Review comment:
       The map is already cleared during the regular checkpoint abortion by the 
task thread but it happens with some delay. I see only two reasons why its size 
could significantly exceed `max-concurrent-checkpoints`:
   1. bug in abort procedure
   2. task thread is stuck while netty thread continues to receive new barriers 
fast
   
   Theoretically, 2nd case is unbounded, but in practice, I didn't observe 
issues with a value of 5 in `UnalignedITCase` after fixing other issues. I 
think such a scenario where task thread is not aborting writes would signify 
some problem and should be investigated.
   
   Besides that, aborting async parts of previous checkpoints upon receiving a 
new barrier won't work with `max-concurrent-checkpoints` > 1.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

[GitHub] [flink] rkhachatryan commented on a change in pull request #12478: [FLINK-17869][task][checkpointing] Fix race condition when caling ChannelStateWriter.abort

Reply via email to