[ 
https://issues.apache.org/jira/browse/FLINK-22792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arvid Heise reassigned FLINK-22792:
-----------------------------------

    Assignee: Tianxin Zhao

> Limit size of already processed files in File Source SplitEnumerator
> --------------------------------------------------------------------
>
>                 Key: FLINK-22792
>                 URL: https://issues.apache.org/jira/browse/FLINK-22792
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / FileSystem
>            Reporter: Tianxin Zhao
>            Assignee: Tianxin Zhao
>            Priority: Major
>
> File Source makes use of {{ContinuousFileSplitEnumerator}} to discover files 
> in selected file system. Task inside the SplitEnumerator periodically lists 
> given path and creates splits from the path. To avoid splits getting 
> reprocessed, currently all processed paths is recorded in the set 
> {{pathsAlreadyProcessed}}. However, this set could grow indefinitely with new 
> files added to the input path and eventually result in out of memory issue. 
> (Original PR: [https://github.com/apache/flink/pull/13401])
> This ticket aim to limit the size of {{pathsAlreadyProcessed}} in use of a 
> configurable SLA such that files older than some (watermark - SLA) would be 
> ignored to be processed and also cleaned up from the 
> {{pathsAlreadyProcessed}} set. Watermark is decided based on the minimum 
> modification time of unprocessed files. {{pathsAlreadyProcessed}} set would 
> be cleaned up during every snapshot.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to