[ https://issues.apache.org/jira/browse/FLINK-22792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Arvid Heise reassigned FLINK-22792:
-----------------------------------

    Assignee: Tianxin Zhao

> Limit size of already processed files in File Source SplitEnumerator
> --------------------------------------------------------------------
>
>                 Key: FLINK-22792
>                 URL: https://issues.apache.org/jira/browse/FLINK-22792
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / FileSystem
>            Reporter: Tianxin Zhao
>            Assignee: Tianxin Zhao
>            Priority: Major
>
> File Source uses {{ContinuousFileSplitEnumerator}} to discover files
> in the selected file system. A task inside the SplitEnumerator periodically
> lists the given path and creates splits from the files it finds. To avoid
> reprocessing splits, all processed paths are currently recorded in the set
> {{pathsAlreadyProcessed}}. However, this set can grow indefinitely as new
> files are added to the input path, and can eventually cause an
> out-of-memory error.
> (Original PR: [https://github.com/apache/flink/pull/13401])
> This ticket aims to limit the size of {{pathsAlreadyProcessed}} using a
> configurable SLA, such that files older than (watermark - SLA) are
> skipped rather than processed and are also removed from the
> {{pathsAlreadyProcessed}} set. The watermark is determined by the minimum
> modification time of the unprocessed files. The {{pathsAlreadyProcessed}}
> set would be cleaned up during every snapshot.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
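
The pruning scheme described in the ticket could look roughly like the sketch below. It is a standalone illustration in plain Java, not the code from the linked PR and not Flink's API: the class BoundedProcessedPaths, its methods, and the slaMillis parameter are hypothetical names. It also assumes that the watermark never moves backwards and that each processed path is stored together with its modification time, neither of which the ticket states.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of a bounded "already processed" set; not Flink code. */
public class BoundedProcessedPaths {
    // Processed path -> modification time in milliseconds (assumed bookkeeping).
    private final Map<String, Long> pathsAlreadyProcessed = new HashMap<>();
    private final long slaMillis;
    // Long.MIN_VALUE means "no watermark yet".
    private long watermark = Long.MIN_VALUE;

    public BoundedProcessedPaths(long slaMillis) {
        this.slaMillis = slaMillis;
    }

    // Files modified before this cutoff are skipped and may be forgotten.
    private long cutoff() {
        return watermark == Long.MIN_VALUE ? Long.MIN_VALUE : watermark - slaMillis;
    }

    /** Returns the paths from one directory listing that still need processing. */
    public List<String> selectNewPaths(Map<String, Long> listing) {
        long threshold = cutoff();
        List<String> fresh = new ArrayList<>();
        long minModTime = Long.MAX_VALUE;
        for (Map.Entry<String, Long> entry : listing.entrySet()) {
            long modTime = entry.getValue();
            // Skip files that are too old or that were already handed out.
            if (modTime < threshold || pathsAlreadyProcessed.containsKey(entry.getKey())) {
                continue;
            }
            fresh.add(entry.getKey());
            minModTime = Math.min(minModTime, modTime);
        }
        if (!fresh.isEmpty()) {
            // Watermark = minimum modification time of the unprocessed files,
            // kept monotonic here (an assumption of this sketch).
            watermark = Math.max(watermark, minModTime);
            for (String path : fresh) {
                pathsAlreadyProcessed.put(path, listing.get(path));
            }
        }
        return fresh;
    }

    /** Called on every snapshot: forget paths older than (watermark - SLA). */
    public void pruneOnSnapshot() {
        long threshold = cutoff();
        pathsAlreadyProcessed.values().removeIf(modTime -> modTime < threshold);
    }

    public int trackedCount() {
        return pathsAlreadyProcessed.size();
    }
}
```

In this sketch a forgotten path is not picked up again on a later listing, because its modification time lies below the cutoff. That check is what allows the set to shrink without reprocessing old files.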