RE: Using local FS for checkpoint

2017-08-31 Thread prashantnayak
We ran into issues using EFS (which under the covers is an NFS-like filesystem)... details are in this post: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/External-checkpoints-not-getting-cleaned-up-discarded-potentially-causing-high-load-tp14073p14106.html
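As a point of reference, here is a minimal sketch of pointing checkpoints at a plain filesystem URI via Flink's FsStateBackend instead of an EFS/NFS mount. The path, the 60s interval, and the assumption that the directory is reachable from the JobManager and every TaskManager are illustrative placeholders, not details from the original thread.

    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class LocalFsCheckpointExample {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpoint every 60s (interval chosen arbitrarily for illustration).
            env.enableCheckpointing(60_000);

            // Point the state backend at a filesystem directory instead of an
            // EFS/NFS mount; the path is a placeholder and must be visible to
            // the JobManager and all TaskManagers.
            env.setStateBackend(new FsStateBackend("file:///mnt/shared/flink/checkpoints"));
        }
    }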

Re: Flink Mesos Outstanding Offers - trouble launching task managers

2017-08-31 Thread prashantnayak
Hi Eron, No, unfortunately we did not directly resolve it... we are working around it for now by ensuring that our Mesos slaves are set up to correctly provide the JobManager with offers. Prashant

Re: S3 recovery and checkpoint directories exhibit explosive growth

2017-07-26 Thread prashantnayak
Thanks Stefan. +1 on "I am even considering packing this list as a plain text file with the checkpoint, to make this more transparent for users"; that is definitely more Ops-friendly... Thanks, Prashant

Re: S3 recovery and checkpoint directories exhibit explosive growth

2017-07-26 Thread prashantnayak
Thanks Stephan and Stefan. We're looking forward to this patch in 1.3.2; we will use a patched version depending on when 1.3.2 is going to be available. We're also implementing a cron job to remove orphaned/older completedCheckpoint files per your recommendations... one caveat with a job like
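As an illustration of such a cleanup job, a hedged sketch in Java using the AWS SDK for Java (v1) follows. The bucket name, the recovery prefix, and the two-day retention window are assumptions; it only deletes objects whose keys contain "completedCheckpoint" and that are older than the cutoff, and it is only safe for files that are genuinely orphaned.

    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3ClientBuilder;
    import com.amazonaws.services.s3.model.ObjectListing;
    import com.amazonaws.services.s3.model.S3ObjectSummary;

    import java.util.Date;
    import java.util.concurrent.TimeUnit;

    /** Deletes old completedCheckpoint* objects under an assumed recovery prefix. */
    public class CompletedCheckpointCleanup {
        public static void main(String[] args) {
            String bucket = "my-flink-bucket";   // placeholder bucket name
            String prefix = "recovery/";         // placeholder recovery prefix
            Date cutoff = new Date(System.currentTimeMillis() - TimeUnit.DAYS.toMillis(2));

            AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
            ObjectListing listing = s3.listObjects(bucket, prefix);
            while (true) {
                for (S3ObjectSummary obj : listing.getObjectSummaries()) {
                    boolean isCompletedCheckpoint = obj.getKey().contains("completedCheckpoint");
                    if (isCompletedCheckpoint && obj.getLastModified().before(cutoff)) {
                        // Remove only objects older than the retention cutoff.
                        s3.deleteObject(bucket, obj.getKey());
                    }
                }
                if (!listing.isTruncated()) {
                    break;
                }
                listing = s3.listNextBatchOfObjects(listing);
            }
        }
    }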

Re: S3 recovery and checkpoint directories exhibit explosive growth

2017-07-25 Thread prashantnayak
Hi Stephan, Unclear on what you mean by the "trash" option... I thought that was only available for the command-line Hadoop tools and not applicable to the API, which is what Flink uses? If there is a configuration for the Flink/Hadoop connector, please let me know. Also, one additional thing about S3: S3

Re: S3 recovery and checkpoint directories exhibit explosive growth

2017-07-25 Thread prashantnayak
Thanks Stephan. We can confirm that turning off RocksDB incremental checkpointing seems to help and greatly reduces the number of files (from tens of thousands to low thousands). We still see that there is an inflection point where running > 50 jobs causes the appmaster to stop deleting files from
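For reference, a minimal sketch of how the incremental flag is passed to the RocksDB state backend; the S3 URI is a placeholder. Passing false makes each checkpoint self-contained, which matches the drop in file counts described above.

    import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class DisableIncrementalCheckpoints {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // The second constructor argument controls incremental checkpointing;
            // false produces full (self-contained) checkpoints, which keeps the
            // number of files per checkpoint small.
            RocksDBStateBackend backend =
                    new RocksDBStateBackend("s3://my-flink-bucket/checkpoints", false);
            env.setStateBackend(backend);
        }
    }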

Re: S3 recovery and checkpoint directories exhibit explosive growth

2017-07-23 Thread prashantnayak
Hi Xiaogang and Stephan, We're continuing to test and have now set up the cluster to disable incremental RocksDB checkpointing and to increase the checkpoint interval from 30s to 120s (not ideal really :-( ). We'll run it with a large number of jobs and report back if this setup shows
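A small sketch of the corresponding checkpoint-interval change on the job side, assuming the interval is set programmatically via enableCheckpointing; the minimum-pause value is an added assumption, not something from the thread.

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointIntervalTuning {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

            // Checkpoint every 120s instead of 30s, so fewer checkpoint files are
            // produced per unit of time.
            env.enableCheckpointing(120_000);

            // Minimum pause between the end of one checkpoint and the start of
            // the next (value chosen for illustration only).
            env.getCheckpointConfig().setMinPauseBetweenCheckpoints(30_000);
        }
    }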

Re: S3 recovery and checkpoint directories exhibit explosive growth

2017-07-20 Thread prashantnayak
Wanted to add: we took some stack traces and memory dumps... we will post them or send them to you, but the stack traces indicate that the appmaster is spending a lot of time in the AWS S3 library trying to list an S3 directory (recovery?). Thanks, Prashant

Re: S3 recovery and checkpoint directories exhibit explosive growth

2017-07-20 Thread prashantnayak
Hi Xiaogang and Stephan, Thank you for your response. Sorry about the delay in responding (was traveling). We've been trying to figure out what triggers this, but your point about the master not being able to delete files "in time" seems to be correct. We've been testing out two different

Re: S3 recovery and checkpoint directories exhibit explosive growth

2017-07-13 Thread prashantnayak
To add one more data point... it seems like the recovery directory is the bottleneck somehow, so if we delete the recovery directory and restart the JobManager, it comes back and is responsive. Of course, we lose all jobs, since none can be recovered... and that is of course not ideal. So