Huyen Levan created FLINK-9940:
----------------------------------

             Summary: File source continuous monitoring mode: S3 files 
sometimes missed
                 Key: FLINK-9940
                 URL: https://issues.apache.org/jira/browse/FLINK-9940
             Project: Flink
          Issue Type: Bug
          Components: Streaming
    Affects Versions: 1.5.1
         Environment: Flink 1.5, EMRFS
            Reporter: Huyen Levan


When using StreamExecutionEnvironment.readFile() in 
FileProcessingMode.PROCESS_CONTINUOUSLY mode to monitor an S3 prefix, the 
directory monitoring process might miss some files if a large number of files 
are added or modified at the same time. The number of missed files depends on 
the monitoring interval.

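Cause: Flink tracks which files it has read by remembering the modification 
time of the file that was added (or modified) last. So when multiple files 
have the same last-modified timestamp, any of them that is not yet visible 
when that timestamp is recorded is treated as already read and is skipped by 
later scans.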
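
The effect of tracking only the latest modification time can be illustrated 
with a small simulation. This is a simplified model, not Flink's actual code: 
the class TimestampOnlyMonitor, its scan() method and the maxModTime field are 
hypothetical names, and a directory listing is represented as a plain map from 
path to modification time.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Simplified model of timestamp-only tracking (hypothetical, not Flink code).
class TimestampOnlyMonitor {
    // Largest modification time seen in any previous scan.
    private long maxModTime = Long.MIN_VALUE;

    // Returns the paths considered new, then advances the remembered timestamp.
    public List<String> scan(Map<String, Long> listing) {
        List<String> fresh = new ArrayList<>();
        long newMax = maxModTime;
        for (Map.Entry<String, Long> e : listing.entrySet()) {
            // Only files strictly newer than the remembered timestamp qualify.
            if (e.getValue() > maxModTime) {
                fresh.add(e.getKey());
                newMax = Math.max(newMax, e.getValue());
            }
        }
        maxModTime = newMax;
        return fresh;
    }

    public static void main(String[] args) {
        TimestampOnlyMonitor monitor = new TimestampOnlyMonitor();
        Map<String, Long> listing = new LinkedHashMap<>();

        listing.put("a.csv", 1000L);
        System.out.println(monitor.scan(listing)); // a.csv is picked up

        // b.csv has the same timestamp but only shows up in the next listing.
        listing.put("b.csv", 1000L);
        System.out.println(monitor.scan(listing)); // b.csv is skipped
    }
}
```

In this model the second scan returns an empty list, because b.csv is not 
strictly newer than the remembered timestamp.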

Suggested solution (thanks to [Fabian 
Hueske|http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/template/NamlServlet.jtp?macro=user_nodes&user=25]):
 a hybrid approach that keeps the names of all files whose modification 
timestamp is larger than the maximum modification time minus an offset. 
Relevant class: 
_org.apache.flink.streaming.api.functions.source.ContinuousFileMonitoringFunction_
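
One way the hybrid approach could look is sketched below. It is an 
illustration under stated assumptions, not a patch for 
ContinuousFileMonitoringFunction: HybridMonitor, scan(), offsetMillis and the 
recent map are hypothetical names, and a directory listing is again modelled 
as a map from path to modification time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of the hybrid approach (hypothetical, not Flink code).
class HybridMonitor {
    private final long offsetMillis;
    private long maxModTime = Long.MIN_VALUE;
    // Path -> modification time of processed files still inside the window.
    private final Map<String, Long> recent = new HashMap<>();

    HybridMonitor(long offsetMillis) {
        this.offsetMillis = offsetMillis;
    }

    // Lower bound of the window; nothing is excluded before the first file.
    private long threshold() {
        return maxModTime == Long.MIN_VALUE ? Long.MIN_VALUE : maxModTime - offsetMillis;
    }

    public List<String> scan(Map<String, Long> listing) {
        List<String> fresh = new ArrayList<>();
        long lowerBound = threshold();
        long newMax = maxModTime;
        for (Map.Entry<String, Long> e : listing.entrySet()) {
            long modTime = e.getValue();
            if (modTime <= lowerBound) {
                continue; // older than the window: assumed already processed
            }
            Long seen = recent.get(e.getKey());
            if (seen != null && seen == modTime) {
                continue; // same file, same version: already processed
            }
            fresh.add(e.getKey());
            recent.put(e.getKey(), modTime);
            newMax = Math.max(newMax, modTime);
        }
        maxModTime = newMax;
        // Forget names that have fallen out of the window to bound the state.
        final long cutoff = threshold();
        recent.values().removeIf(t -> t <= cutoff);
        return fresh;
    }

    public static void main(String[] args) {
        HybridMonitor monitor = new HybridMonitor(5000L);
        Map<String, Long> listing = new LinkedHashMap<>();
        listing.put("a.csv", 1000L);
        System.out.println(monitor.scan(listing)); // a.csv
        listing.put("b.csv", 1000L); // same timestamp, visible only now
        System.out.println(monitor.scan(listing)); // b.csv is still found
    }
}
```

The offset trades state size against tolerance: a file that becomes visible 
with a timestamp older than the maximum minus the offset would still be 
missed in this sketch, so the offset needs to cover the expected listing 
delay.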



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
