Hi Till,

Thanks for driving this topic. I think this FLIP is very important because it would let us enable local recovery [1] by default.
We previously took a similar approach, using a working directory so that the local state directory is the same as the state backend's local directory, which ensures that local recovery works well. I noticed that this FLIP also wants to keep the working directory the same across process failures so that a restarted process can pick up the old one.

However, I think there might be some problems in a YARN environment. YARN selects all the local directories on the different disks as the 'LOCAL_DIRS', which are used as "io.tmp.dirs" [2]. To reuse the same old working directory, we would always have to select the same directory from all disk candidates for a specific resource. Thus, we might need to store the working directory location persistently. If we instead use a hash or a similar method to compute which directory is always used as the working directory for a specific 'resource id', we might run into problems if one of the disks is temporarily full or broken.

[1] https://issues.apache.org/jira/browse/FLINK-15507
[2] https://github.com/apache/flink/blob/cf1e8c39111378735e4c05a5edb3bd713229bb08/flink-core/src/main/java/org/apache/flink/configuration/CoreOptions.java#L363

Best
Yun Tang

________________________________
From: Till Rohrmann <trohrm...@apache.org>
Sent: Saturday, December 11, 2021 0:54
To: dev <dev@flink.apache.org>
Subject: [DISCUSS] FLIP-198: Working directory for Flink processes

Hi everyone,

I would like to start a discussion about introducing an explicit working directory for Flink processes that can be used to store information [1]. Per default this working directory will reside in the temporary directory of the node Flink runs on. However, if configured to reside on a persistent volume, then this information can be used to recover from process/node failures. Moreover, such a working directory can be used to consolidate some of our other directories Flink creates under /tmp (e.g. blobStorage, RocksDB working directory).

Here is a draft PR that outlines the required changes [2].
Looking forward to your feedback.

[1] https://cwiki.apache.org/confluence/x/ZZiqCw
[2] https://github.com/apache/flink/pull/18083

Cheers,
Till
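
To illustrate the hash-based selection concern raised in the reply at the top of this thread, here is a minimal Java sketch. It is not taken from the FLIP or the draft PR: the class `WorkingDirSelector`, the method `selectByHash`, the directory paths, and the resource id are all hypothetical. It assumes the working directory would be picked by hashing the resource id over the list of currently usable candidate directories, which is only one possible reading of "hash or similar method".

```java
import java.util.ArrayList;
import java.util.List;

public class WorkingDirSelector {

    // Hypothetical helper: maps a resource id onto one of the candidate directories.
    // The result is stable only as long as the candidate list itself does not change.
    public static String selectByHash(String resourceId, List<String> candidates) {
        if (candidates.isEmpty()) {
            throw new IllegalArgumentException("no candidate directories");
        }
        // floorMod keeps the index non-negative even for negative hash codes.
        int index = Math.floorMod(resourceId.hashCode(), candidates.size());
        return candidates.get(index);
    }

    public static void main(String[] args) {
        // Made-up directories standing in for the entries of LOCAL_DIRS.
        List<String> allDisks =
                List.of("/disk1/yarn/local", "/disk2/yarn/local", "/disk3/yarn/local");
        String resourceId = "container_0001"; // made-up resource id

        String before = selectByHash(resourceId, allDisks);

        // Simulate the chosen disk becoming temporarily full or broken:
        // it drops out of the candidate list before the process restarts.
        List<String> healthyDisks = new ArrayList<>(allDisks);
        healthyDisks.remove(before);
        String after = selectByHash(resourceId, healthyDisks);

        System.out.println("before failure: " + before);
        System.out.println("after failure:  " + after);
    }
}
```

In this sketch the restarted process necessarily ends up in a different directory, so the old working directory would not be found even if its disk later recovers. Persisting the chosen location, as suggested in the reply, would avoid depending on the candidate list staying the same.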