[jira] [Commented] (FLINK-25672) FileSource enumerator remembers paths of all already processed files which can result in large state

2023-01-25 Thread Martijn Visser (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17680537#comment-17680537
 ] 

Martijn Visser commented on FLINK-25672:


[~cre...@gmail.com] Thank you

> FileSource enumerator remembers paths of all already processed files which 
> can result in large state
> 
>
> Key: FLINK-25672
> URL: https://issues.apache.org/jira/browse/FLINK-25672
> Project: Flink
> Issue Type: Improvement
> Components: Connectors / FileSystem
> Reporter: Martijn Visser
> Priority: Major
>
> As mentioned in the Filesystem documentation, for Unbounded File Sources, the
> {{FileEnumerator}} currently remembers paths of all already processed files,
> which is a state that can in some cases grow rather large.
> We should look into possibilities to reduce this. We could look into adding a
> compressed form of tracking already processed files (for example by keeping
> lower bounds on modification timestamps).
> When fixed, this should also be reflected in the documentation, as mentioned 
> in https://github.com/apache/flink/pull/18288#discussion_r785707311
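
The "modification timestamps lower boundaries" idea from the description could look roughly like the sketch below. This is only an illustration under stated assumptions, not Flink code: the class `TimestampBoundedFileTracker` and its methods are hypothetical names, a directory scan is modeled as a plain map from path to modification time, and files are assumed never to appear later with a modification time older than one already seen. Instead of every processed path, the tracker keeps one timestamp plus the paths that share exactly that timestamp.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical sketch; this class is not part of Flink. */
final class TimestampBoundedFileTracker {

    // Largest modification timestamp handed out so far. Anything older is
    // assumed (not verified) to have been processed already.
    private long lowerBound = Long.MIN_VALUE;

    // Only paths whose timestamp equals lowerBound are remembered, so that
    // ties at the boundary are not handed out twice.
    private final Set<String> pathsAtBound = new HashSet<>();

    /** Returns the paths of one scan that were not handed out before. */
    List<String> selectUnprocessed(Map<String, Long> listing) {
        List<String> fresh = new ArrayList<>();
        long newBound = lowerBound;
        for (Map.Entry<String, Long> entry : listing.entrySet()) {
            long modified = entry.getValue();
            boolean unseen = modified > lowerBound
                    || (modified == lowerBound
                            && !pathsAtBound.contains(entry.getKey()));
            if (unseen) {
                fresh.add(entry.getKey());
                newBound = Math.max(newBound, modified);
            }
        }
        if (newBound > lowerBound) {
            // The bound moved forward: paths at the old bound can be evicted.
            lowerBound = newBound;
            pathsAtBound.clear();
        }
        for (Map.Entry<String, Long> entry : listing.entrySet()) {
            if (entry.getValue() == lowerBound) {
                pathsAtBound.add(entry.getKey());
            }
        }
        return fresh;
    }

    /** Number of paths currently held in state. */
    int trackedPathCount() {
        return pathsAtBound.size();
    }

    public static void main(String[] args) {
        TimestampBoundedFileTracker tracker = new TimestampBoundedFileTracker();
        Map<String, Long> scan = new TreeMap<>();
        scan.put("part-0", 10L);
        scan.put("part-1", 20L);
        System.out.println(tracker.selectUnprocessed(scan));
        scan.put("part-2", 30L);
        System.out.println(tracker.selectUnprocessed(scan));
        System.out.println(tracker.trackedPathCount());
    }
}
```

The trade-off is that a file that shows up late with a modification time below the bound is silently skipped, so this kind of compression only fits sources where that assumption holds.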



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-25672) FileSource enumerator remembers paths of all already processed files which can result in large state

2023-01-23 Thread Cliff Resnick (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17679874#comment-17679874
 ] 

Cliff Resnick commented on FLINK-25672:
---

From what I can tell from the design, with its stateless factory for the 
Enumerator, fixing this may require a breaking change. But I will be looking at it.



[jira] [Commented] (FLINK-25672) FileSource enumerator remembers paths of all already processed files which can result in large state

2023-01-23 Thread Martijn Visser (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17679870#comment-17679870
 ] 

Martijn Visser commented on FLINK-25672:


[~cre...@gmail.com] Unfortunately not, we are still looking for volunteers.



[jira] [Commented] (FLINK-25672) FileSource enumerator remembers paths of all already processed files which can result in large state

2023-01-23 Thread Cliff Resnick (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17679868#comment-17679868
 ] 

Cliff Resnick commented on FLINK-25672:
---

This issue has turned into a real problem for us with our transactional 
datastream jobs. The problem is exacerbated by the fact that the state is 
not distributed but localized to the job manager. That is rather ugly 
in our HA K8s setup, where the common pool that our JMs run in has a 16 GB 
limit, and we are blowing past that simply with the un-evictable file path 
history.

Is anyone looking into this?