Julius Michaelis created FLINK-13808:
----------------------------------------

             Summary: Checkpoints expired by timeout may leak RocksDB files
                 Key: FLINK-13808
                 URL: https://issues.apache.org/jira/browse/FLINK-13808
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Checkpointing
    Affects Versions: 1.8.1, 1.8.0
         Environment: Reproduction is difficult, the only place I can reliably 
reproduce is a 4-node cluster with parallelismĀ +>+ 100.

A semi-reliable way of reproducing the issue is provided by [this 
docker-compose|https://github.com/jcaesar/flink-rocksdb-file-leak].
            Reporter: Julius Michaelis


A RocksDB state backend with HDFS checkpoints, with or without local recovery, 
may leak files inĀ {{io.tmp.dirs}} on checkpoint expiry by timeout.

If the size of a checkpoint crosses what can be transferred during one 
checkpoint timeout, checkpoints will continue to fail forever. If this is 
combined with a quick rollover of SST files (e.g. due to a high density of 
writes), this may quickly exhaust available disk space (or memory, as /tmp is 
the default location).

As a workaround, the jobmanager's REST API can be frequently queried for failed 
checkpoints, and deleting associated files accordingly.

I've tried investing the cause a little bit, but I'm stuck:
 * {{Checkpoint 19 of job ac7efce3457d9d73b0a4f775a6ef46f8 expired before 
completing.}} and similar gets printed, so
 * [{{abortExpired}} is 
invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointCoordinator.java#L549],
 so
 * [{{dispose}} is 
invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/PendingCheckpoint.java#L455],
 so
 * [{{cancelCaller}} is 
invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/PendingCheckpoint.java#L488],
 so
 * [the canceler is 
invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/PendingCheckpoint.java#L497]
 ([through one more 
layer|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L129]),
 so
 * [{{cleanup}} is 
invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L95],
 (possibly [not from 
{{cancel}}|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L84]),
 so
 * [{{cleanupProvidedResources}} is 
invoked|https://github.com/apache/flink/blob/release-1.8.1/flink-runtime/src/main/java/org/apache/flink/runtime/state/AsyncSnapshotCallable.java#L162]
 (this is the indirection that made me give up), so
 * [this trace 
log|https://github.com/apache/flink/blob/release-1.8.1/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/snapshot/RocksIncrementalSnapshotStrategy.java#L372]
 should be printed, but it isn't.

I have some time to further investigate, but I'd appreciate help on finding out 
where in this chain things go wrong.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

Reply via email to