Re: S3 file source - continuous monitoring - many files missed

2018-07-30 Thread Averell
Here is my https://github.com/lvhuyen/flink implementation of the change. Three files were updated: StreamExecutionEnvironment.java, StreamExecutionEnvironment.scala, and ContinuousFileMonitoringFunction.java. All thanks to Fabian.
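The real diff lives in the linked fork; purely as a hypothetical sketch of the shape such a change could take (the extra parameter below is invented for illustration, and env and my_input_format are as in the original post at the bottom of this thread), the user-facing API might look like:

    // Hypothetical sketch only; see the linked fork for the actual change.
    // Idea: readFile() accepts an extra offset (ms) that
    // ContinuousFileMonitoringFunction subtracts from the current time before
    // committing the max modification timestamp, so files that surface late
    // in an eventually consistent S3 listing are still picked up later.
    val stream = env.readFile(
      my_input_format,
      "s3://my-bucket/my-folder",              // illustrative path
      FileProcessingMode.PROCESS_CONTINUOUSLY,
      1000L,                                   // monitoring interval (ms)
      500L)                                    // hypothetical consistency offset (ms)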

Re: S3 file source - continuous monitoring - many files missed

2018-07-25 Thread Fabian Hueske
Hi, First of all, the ticket documents a bug (or an improvement or feature suggestion) so that others are aware of the problem and understand its cause. At some point it might be picked up and implemented. In general, there is no guarantee whether or when this happens, but the Flink community is of

Re: S3 file source - continuous monitoring - many files missed

2018-07-25 Thread Averell
Thank you Fabian for the guidance on implementing the fix. I'm not quite clear on the best practice for creating JIRA tickets. I changed its priority to Major since you said it is important. What would happen next with that issue? Someone (anyone) will pick it up and create a fix, then include

Re: S3 file source - continuous monitoring - many files missed

2018-07-25 Thread Fabian Hueske
Hi, Thanks for creating the Jira issue. I'm not sure I would consider this a blocker, but it is certainly an important problem to fix. Anyway, in the original version Flink checkpoints the modification timestamp up to which all files have been read (or at least up to which point it *thinks* to
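To make that concrete: the monitoring source checkpoints just a single Long. A minimal sketch of the mechanism, simplified from what ContinuousFileMonitoringFunction does (class and state names are abbreviated here, not the real ones):

    import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
    import org.apache.flink.runtime.state.{FunctionInitializationContext, FunctionSnapshotContext}
    import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction

    // Sketch: only the max modification timestamp up to which all files are
    // believed to have been read is kept in checkpointed operator state.
    class MonitoringStateSketch extends CheckpointedFunction {
      @transient private var checkpointedState: ListState[java.lang.Long] = _
      private var globalModificationTime: Long = Long.MinValue

      override def initializeState(ctx: FunctionInitializationContext): Unit = {
        checkpointedState = ctx.getOperatorStateStore.getListState(
          new ListStateDescriptor[java.lang.Long](
            "file-monitoring-state", classOf[java.lang.Long]))
        if (ctx.isRestored) {
          // on recovery, resume monitoring from the checkpointed timestamp
          val it = checkpointedState.get().iterator()
          if (it.hasNext) globalModificationTime = it.next()
        }
      }

      override def snapshotState(ctx: FunctionSnapshotContext): Unit = {
        checkpointedState.clear()
        checkpointedState.add(globalModificationTime)
      }
    }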

Re: S3 file source - continuous monitoring - many files missed

2018-07-24 Thread Averell
Hello Fabian, I created the JIRA bug https://issues.apache.org/jira/browse/FLINK-9940. BTW, I have one more question: is it worth checkpointing that list of processed files? Does the current implementation of the file source guarantee exactly-once? Thanks for your support.

Re: S3 file source - continuous monitoring - many files missed

2018-07-24 Thread Averell
Thank you Fabian. I implemented a quick test based on what you suggested (applying an offset to the system time) and did get an improvement: with offset = 500ms the problem was completely gone; with offset = 50ms, I still got around 3-5 files missed out of 10,000. This number might come
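For clarity, the offset trick in sketch form (variable names are illustrative; the 500 ms value is the one from the experiment above):

    // Sketch: never advance the tracked timestamp past (now - offset). Files
    // whose modification time falls in (globalModificationTime, now - offset]
    // are processed now; anything newer is deferred to the next scan, so a
    // file that shows up late in the S3 listing is still seen as "new".
    object OffsetSketch {
      val offsetMs = 500L // 500 ms removed all misses in the test; 50 ms did not
      var globalModificationTime: Long = Long.MinValue

      def isSafeToProcess(modTime: Long): Boolean = {
        val safeUpperBound = System.currentTimeMillis() - offsetMs
        modTime > globalModificationTime && modTime <= safeUpperBound
      }
    }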

Re: S3 file source - continuous monitoring - many files missed

2018-07-24 Thread Fabian Hueske
Hi, The problem is that Flink tracks which files it has read by remembering the modification time of the file that was added (or modified) last. We use the modification time to avoid having to remember the names of all files that were ever consumed, which would be expensive to check and
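In sketch form, the tracking logic amounts to the following (a simplified illustration of the idea, not the exact Flink code):

    // Sketch: a file counts as "new" iff its modification time is strictly
    // greater than the largest modification time seen in any earlier scan.
    var globalModificationTime: Long = Long.MinValue

    def selectNewFiles(listing: Seq[(String, Long)]): Seq[String] = {
      val fresh = listing.filter { case (_, modTime) => modTime > globalModificationTime }
      if (fresh.nonEmpty) globalModificationTime = fresh.map(_._2).max
      fresh.map(_._1)
    }

A file that only becomes visible in the S3 listing after a later-modified file has already advanced globalModificationTime is silently skipped, which matches the failure mode reported in this thread.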

Re: S3 file source - continuous monitoring - many files missed

2018-07-24 Thread Averell
Hello Jörn. Thanks for your help. "Probably the system is putting them into the folder and Flink is triggered before they are consistent." <<< Yes, that is my guess as well. However, if Flink is triggered before they are consistent, either (a) there should be some error messages, or (b) Flink should be

Re: S3 file source - continuous monitoring - many files missed

2018-07-24 Thread Jörn Franke
https://docs.aws.amazon.com/AmazonS3/latest/dev/Introduction.html You will find there a passage on the consistency model. Probably the system is putting them into the folder and Flink is triggered before they are consistent. What happens after Flink puts them on S3? Are they reused by another

Re: S3 file source - continuous monitoring - many files missed

2018-07-24 Thread Averell
Could you please explain in more detail what you mean by "try read-after-write consistency (assuming the files are not modified)"? I guess that the problem I have comes from inconsistency in S3 file listing; otherwise, I would have gotten file-not-found exceptions. My use case is to read output

Re: S3 file source - continuous monitoring - many files missed

2018-07-24 Thread Jörn Franke
Sure, Kinesis is another way. Can you try read-after-write consistency (assuming the files are not modified)? In any case, it looks like you would be better suited with a NoSQL store or Kinesis (I don't know your exact use case, so I can't provide more details).

Re: S3 file source - continuous monitoring - many files missed

2018-07-24 Thread Averell
Just an update: I tried enabling the "EMRFS Consistent View" option, but it didn't help. I'm not sure whether that's what you recommended or something else. Thanks!
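For reference, EMRFS Consistent View is typically switched on with an emrfs-site configuration classification such as the following (standard EMR settings shown for context; the retry values are illustrative, and as noted above this did not resolve the issue):

    {
      "Classification": "emrfs-site",
      "Properties": {
        "fs.s3.consistent": "true",
        "fs.s3.consistent.retryCount": "5",
        "fs.s3.consistent.retryPeriodSeconds": "10"
      }
    }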

Re: S3 file source - continuous monitoring - many files missed

2018-07-24 Thread Averell
Hi Jörn, Thanks. I had missed that EMRFS strong-consistency configuration; I will try it now. We also have a backup solution: using Kinesis instead of S3 (I didn't see Kinesis in your suggestions, but hope it would be all right). "The small size and high rate are not suitable for S3 or HDFS"

Re: S3 file source - continuous monitoring - many files missed

2018-07-24 Thread Jörn Franke
It could be related to S3, which seems to be configured for eventual consistency. Maybe it helps to configure strong consistency. However, I recommend replacing S3 with a NoSQL database (since you are on Amazon, DynamoDB would help, plus DynamoDB Streams; alternatively SNS or SQS). The small size and

S3 file source - continuous monitoring - many files missed

2018-07-23 Thread Averell
Good day everyone, I have a Flink job that has an S3 folder as a source, and we keep putting thousands of small (around 1KB each) gzip files into that folder, at a rate of about 5,000 files per minute. Here is how I created that source in Scala: val my_input_format = new
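The code sample is truncated in the archive; a minimal sketch of such a continuously monitored file source (the paths and the input-format choice are illustrative, not the original code):

    import org.apache.flink.api.java.io.TextInputFormat
    import org.apache.flink.core.fs.Path
    import org.apache.flink.streaming.api.functions.source.FileProcessingMode
    import org.apache.flink.streaming.api.scala._

    val env = StreamExecutionEnvironment.getExecutionEnvironment

    // illustrative input format over the monitored S3 folder; Flink's
    // file input formats decompress .gz files transparently
    val my_input_format = new TextInputFormat(new Path("s3://my-bucket/my-folder"))

    // PROCESS_CONTINUOUSLY re-scans the folder for new files at the given interval
    val myStream: DataStream[String] = env.readFile(
      my_input_format,
      "s3://my-bucket/my-folder",
      FileProcessingMode.PROCESS_CONTINUOUSLY,
      1000L) // monitoring interval in milliseconds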