Feifan Wang created FLINK-28172:
-----------------------------------

             Summary: Scatter dstl files into separate directories by job id
                 Key: FLINK-28172
                 URL: https://issues.apache.org/jira/browse/FLINK-28172
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / State Backends
    Affects Versions: 1.15.0
            Reporter: Feifan Wang


In the current implementation of {_}FsStateChangelogStorage{_}, dstl files from 
all jobs are put into the same directory (configured via 
{_}dstl.dfs.base-path{_}). This works fine on an object store like S3, but it 
causes problems on a file system like HDFS.

First, there may be an upper limit on the number of files in a single 
directory. Raising this limit can greatly reduce the performance of the 
distributed file system.

Second, dstl file management becomes difficult because the user cannot tell 
which job a dstl file belongs to, especially when retained checkpoints are 
enabled.
h3. Proposal
 # Create a subdirectory named after the job id under the _dstl.dfs.base-path_ 
directory when the job starts.
 # Upload all dstl files of that job to this subdirectory.

(Going a step further, we could even create two levels of subdirectories under 
the _dstl.dfs.base-path_ directory, like _base-path/\{jobId}/dstl_. This way, 
if the user configures the same _dstl.dfs.base-path_ as _state.checkpoints.dir_, 
all files needed for job recovery will be in the same directory and well 
organized.)
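
To make the two layouts concrete, here is a minimal sketch of how the upload directory could be derived from the base path and the job id. It is only an illustration under stated assumptions: the class name {{DstlPathSketch}}, both method names, and the example base path and job id are hypothetical and are not part of Flink or of {_}FsStateChangelogStorage{_}. Paths are handled as plain strings so the example runs without any Flink dependency.

```java
import java.util.Objects;

/** Hypothetical helper, not Flink code: derives job-scoped dstl directories. */
public final class DstlPathSketch {

    private DstlPathSketch() {}

    /** Layout 1: base-path/{jobId} */
    public static String jobDirectory(String basePath, String jobId) {
        Objects.requireNonNull(basePath, "basePath");
        Objects.requireNonNull(jobId, "jobId");
        // Avoid a double slash when the configured base path already ends with one.
        String separator = basePath.endsWith("/") ? "" : "/";
        return basePath + separator + jobId;
    }

    /** Layout 2 (the "step further" variant): base-path/{jobId}/dstl */
    public static String nestedDstlDirectory(String basePath, String jobId) {
        return jobDirectory(basePath, jobId) + "/dstl";
    }

    public static void main(String[] args) {
        // Both values are made up for the example.
        String basePath = "hdfs:///flink/changelog/";
        String jobId = "a1b2c3";

        System.out.println(jobDirectory(basePath, jobId));
        System.out.println(nestedDstlDirectory(basePath, jobId));
    }
}
```

With the second layout, pointing the base path at the same location as the checkpoint directory would keep the dstl files of each job in their own subdirectory of that job's folder, which is the organization described above.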



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
