[ https://issues.apache.org/jira/browse/FLINK-25672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17680537#comment-17680537 ]
Martijn Visser commented on FLINK-25672:
----------------------------------------

[~cre...@gmail.com] Thank you

> FileSource enumerator remembers paths of all already processed files which can result in large state
> ----------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-25672
>                 URL: https://issues.apache.org/jira/browse/FLINK-25672
>             Project: Flink
>          Issue Type: Improvement
>          Components: Connectors / FileSystem
>            Reporter: Martijn Visser
>            Priority: Major
>
> As mentioned in the Filesystem documentation, for Unbounded File Sources, the
> {{FileEnumerator}} currently remembers the paths of all already processed files.
> This state can in some cases grow rather large.
> We should look into possibilities to reduce this. We could look into adding a
> compressed form of tracking already processed files (for example, by keeping
> a lower boundary on modification timestamps).
> When fixed, this should also be reflected in the documentation, as mentioned
> in https://github.com/apache/flink/pull/18288#discussion_r785707311
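
The "lower boundary on modification timestamps" idea from the description could look roughly like the sketch below. This is only an illustration under stated assumptions, not Flink's implementation: ProcessedFileTracker, markIfNew, advanceLowerBound and trackedPaths are hypothetical names that do not exist in the FileSource API. It also assumes that no file will ever show up with a modification time older than the boundary. Such a late file would be silently skipped, which is the main trade-off of this approach.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Flink's FileSource API. */
final class ProcessedFileTracker {

    // Files modified strictly before this timestamp are assumed to be already processed.
    private long lowerBound = Long.MIN_VALUE;

    // Paths (with their modification time) seen at or above the lower bound.
    private final Map<String, Long> recent = new HashMap<>();

    /** Returns true if the file has not been seen before, and records it as processed. */
    boolean markIfNew(String path, long modificationTime) {
        if (modificationTime < lowerBound) {
            return false; // covered by the boundary, no per-path state needed
        }
        return recent.putIfAbsent(path, modificationTime) == null;
    }

    /** Raises the lower bound and drops every remembered path that it now covers. */
    void advanceLowerBound(long newLowerBound) {
        if (newLowerBound <= lowerBound) {
            return; // the boundary never moves backwards
        }
        lowerBound = newLowerBound;
        recent.values().removeIf(ts -> ts < newLowerBound);
    }

    /** Number of paths currently held in state. */
    int trackedPaths() {
        return recent.size();
    }
}
```

With this shape, the state is bounded by the number of files modified since the last boundary advance rather than by the total number of files ever processed. How and when the boundary would be advanced (for example, at checkpoints) is left open here.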