[
https://issues.apache.org/jira/browse/FLINK-25672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18078637#comment-18078637
]
David Radley commented on FLINK-25672:
--------------------------------------
[~sizokun] thanks for the details. I do not understand option a: you call it
watermark and offset, but how are watermarks involved? Option b, as proposed, is
_Introduce a new variable LinkedHashSet<Pair<Path, Long>>
alreadyProcessedPathAndTimestamps in
PendingSplitsCheckpoint, and Duration retentionTime in
ContinuousEnumerationSettings._
I can understand this; it is simple and I think it solves the issue. I wonder:
- how you deal with existing data,
- what you mean by _no assumption about mtime semantics backfills work
correctly_.
[[email protected]] is this something you are still interested in? What do you
think of [~sizokun]'s suggestions?
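
To make the questions about option b concrete, here is a minimal sketch of how I read it. It is not the proposed patch and does not use Flink classes: ProcessedFileTracker, shouldProcess, markProcessed and prune are hypothetical names, paths are plain Strings, and a LinkedHashMap stands in for the LinkedHashSet<Pair<Path, Long>>. It assumes the Long is the file's modification time in epoch milliseconds and that files older than retentionTime are skipped, which may not be what is intended given the "no assumption about mtime semantics" remark.

```java
import java.time.Duration;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-in for the processed-file state in option b; not Flink code. */
final class ProcessedFileTracker {

    // Insertion-ordered path -> modification time (epoch millis, assumed).
    private final LinkedHashMap<String, Long> alreadyProcessedPathAndTimestamps =
            new LinkedHashMap<>();

    // Assumed meaning: entries with an older modification time may be forgotten.
    private final Duration retentionTime;

    ProcessedFileTracker(Duration retentionTime) {
        this.retentionTime = retentionTime;
    }

    /** True if the file is inside the retention window and has not been seen yet. */
    boolean shouldProcess(String path, long modificationTimeMillis, long nowMillis) {
        long cutoff = nowMillis - retentionTime.toMillis();
        if (modificationTimeMillis < cutoff) {
            // Assumption: files older than the window are treated as already handled.
            // This is exactly where existing data and backfills become a question.
            return false;
        }
        return !alreadyProcessedPathAndTimestamps.containsKey(path);
    }

    /** Records a processed file; returns false if it was already recorded. */
    boolean markProcessed(String path, long modificationTimeMillis) {
        return alreadyProcessedPathAndTimestamps.putIfAbsent(path, modificationTimeMillis) == null;
    }

    /** Drops entries older than the retention window; returns how many were removed. */
    int prune(long nowMillis) {
        long cutoff = nowMillis - retentionTime.toMillis();
        int removed = 0;
        Iterator<Map.Entry<String, Long>> it =
                alreadyProcessedPathAndTimestamps.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue() < cutoff) {
                it.remove();
                removed++;
            }
        }
        return removed;
    }

    int size() {
        return alreadyProcessedPathAndTimestamps.size();
    }
}
```

Under this reading the state is bounded by the number of files modified within retentionTime, but a file copied in with an old mtime would be silently skipped, hence the two questions above.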
> FileSource enumerator remembers paths of all already processed files which
> can result in large state
> ----------------------------------------------------------------------------------------------------
>
> Key: FLINK-25672
> URL: https://issues.apache.org/jira/browse/FLINK-25672
> Project: Flink
> Issue Type: Improvement
> Components: Connectors / FileSystem
> Reporter: Martijn Visser
> Assignee: Sophia Izokun
> Priority: Major
>
> As mentioned in the Filesystem documentation, for Unbounded File Sources, the
> {{FileEnumerator}} currently remembers paths of all already processed files,
> which is a state that can in some cases grow rather large.
> We should look into possibilities to reduce this. We could look into adding a
> compressed form of tracking already processed files (for example by keeping
> modification timestamps lower boundaries).
> When fixed, this should also be reflected in the documentation, as mentioned
> in https://github.com/apache/flink/pull/18288#discussion_r785707311