Thanks, Till, for creating this FLIP. I believe the feature is really useful for standalone K8s deployments with persistent volumes. For native K8s and Yarn deployments, however, the Flink ResourceManager will create a new TaskManager with a new resource id, so those deployments still could not benefit from this FLIP.
Moreover, I am curious about the clean-up mechanism. Will the working directory be deleted once the Flink job reaches a globally terminal state, or does it need to be deleted externally?

Best,
Yang

Yun Tang <myas...@live.com> wrote on Sat, Dec 11, 2021 at 10:48 PM:

> Hi Till,
>
> Thanks for driving this topic. I think this FLIP is very important because
> it would let us enable local recovery [1] by default.
>
> We previously took a similar approach: we set up the working directory so
> that the local state dir is the same as the state backend's local dir, to
> ensure that local recovery works well.
>
> I noticed that this FLIP also wants to keep the working directory the same
> across process failures, so that a restarted process can pick up the old
> one. However, I think there might be some problems in a YARN environment.
> YARN selects all the local directories on different disks as the
> 'LOCAL_DIRS' that represent "io.tmp.dirs" [2]. To allow reuse of the same
> old working directory, we need to always select the same directory from all
> disk candidates for a specific resource. Thus, we might need to store the
> working directory location persistently. If we use a hash or a similar
> method to calculate which directory is always used as the working directory
> for a specific 'resource id', we might run into problems if one of the
> disks is temporarily full or broken.
>
> [1] https://issues.apache.org/jira/browse/FLINK-15507
> [2]
> https://github.com/apache/flink/blob/cf1e8c39111378735e4c05a5edb3bd713229bb08/flink-core/src/main/java/org/apache/flink/configuration/CoreOptions.java#L363
>
> Best
> Yun Tang
> ________________________________
> From: Till Rohrmann <trohrm...@apache.org>
> Sent: Saturday, December 11, 2021 0:54
> To: dev <dev@flink.apache.org>
> Subject: [DISCUSS] FLIP-198: Working directory for Flink processes
>
> Hi everyone,
>
> I would like to start a discussion about introducing an explicit working
> directory for Flink processes that can be used to store information [1].
> By default, this working directory will reside in the temporary directory
> of the node Flink runs on. However, if configured to reside on a persistent
> volume, this information can be used to recover from process/node
> failures. Moreover, such a working directory can be used to consolidate
> some of the other directories Flink creates under /tmp (e.g. blobStorage,
> RocksDB working directory).
>
> Here is a draft PR that outlines the required changes [2].
>
> Looking forward to your feedback.
>
> [1] https://cwiki.apache.org/confluence/x/ZZiqCw
> [2] https://github.com/apache/flink/pull/18083
>
> Cheers,
> Till
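
To make the hash-based selection that Yun Tang describes concrete, here is a minimal sketch. It is not taken from Flink or from the draft PR: the class `WorkingDirSelector`, the method `select`, the resource id, and the disk paths are all hypothetical names chosen for illustration. The sketch maps a resource id onto one of the candidate directories and shows why the mapping is fragile: as soon as the chosen disk drops out of the candidate list, the same resource id resolves to a different directory.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Flink.
class WorkingDirSelector {

    // Deterministically picks one candidate directory for a resource id.
    // The same id and the same candidate list always yield the same directory.
    static String select(String resourceId, List<String> candidateDirs) {
        if (candidateDirs.isEmpty()) {
            throw new IllegalArgumentException("no candidate directories");
        }
        // floorMod keeps the index non-negative even for negative hash codes.
        int index = Math.floorMod(resourceId.hashCode(), candidateDirs.size());
        return candidateDirs.get(index);
    }

    public static void main(String[] args) {
        // Assumed example paths standing in for YARN's LOCAL_DIRS.
        List<String> allDisks =
                List.of("/data1/yarn/local", "/data2/yarn/local", "/data3/yarn/local");
        String resourceId = "taskmanager-1"; // assumed example id

        String chosen = select(resourceId, allDisks);
        System.out.println("healthy cluster -> " + chosen);

        // Simulate the chosen disk becoming full or broken: it is no longer
        // offered as a candidate, so the old working directory is not found.
        List<String> remaining = new ArrayList<>(allDisks);
        remaining.remove(chosen);
        System.out.println("after disk loss -> " + select(resourceId, remaining));
    }
}
```

Losing a disk other than the chosen one can also shift the mapping, because the modulus changes with the number of candidates. That is the reason for suggesting that the working directory location be stored persistently rather than recomputed.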