Martijn Visser created FLINK-25672:
--------------------------------------

             Summary: FileSource enumerator remembers paths of all already 
processed files which can result in large state
                 Key: FLINK-25672
                 URL: https://issues.apache.org/jira/browse/FLINK-25672
             Project: Flink
          Issue Type: Improvement
          Components: Connectors / FileSystem
            Reporter: Martijn Visser


As mentioned in the Filesystem documentation, for unbounded File Sources the 
{{FileEnumerator}} currently remembers the paths of all already processed 
files. This state can in some cases grow rather large. 

We should look into ways to reduce this state. One option is to track already 
processed files in a compressed form, for example by keeping a lower bound on 
modification timestamps instead of every path.
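
To make the timestamp idea concrete, here is a minimal sketch of what such 
tracking could look like. It is not Flink's implementation or API: the class 
{{ProcessedFileTracker}} and its methods are hypothetical names used only for 
illustration. It assumes that files whose modification time is below the lower 
bound can safely be treated as already processed.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Flink. */
public class ProcessedFileTracker {

    // Assumption: any file with modTime < lowerBound counts as already processed.
    private long lowerBound = Long.MIN_VALUE;

    // Only paths with modTime >= lowerBound are remembered individually.
    private final Map<String, Long> recent = new HashMap<>();

    /** Returns true if the file has not been seen and should be processed. */
    public boolean markIfNew(String path, long modTime) {
        if (modTime < lowerBound) {
            return false; // covered by the lower bound, no per-path state needed
        }
        return recent.putIfAbsent(path, modTime) == null;
    }

    /** Raises the lower bound and drops the paths it now covers. */
    public void advanceLowerBound(long newLowerBound) {
        if (newLowerBound <= lowerBound) {
            return; // the bound never moves backwards
        }
        lowerBound = newLowerBound;
        recent.values().removeIf(modTime -> modTime < lowerBound);
    }

    /** Number of paths currently held in state. */
    public int trackedPaths() {
        return recent.size();
    }
}
```

The trade-off in this sketch is that a file which shows up late with a 
modification time below the bound would be skipped, so the bound would have to 
trail the newest seen timestamp by a safety margin.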

When fixed, this should also be reflected in the documentation, as mentioned in 
https://github.com/apache/flink/pull/18288#discussion_r785707311



--
This message was sent by Atlassian Jira
(v8.20.1#820001)