We ran into issues using EFS (which under the covers is an NFS-like
filesystem)... details are in this post:
http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/External-checkpoints-not-getting-cleaned-up-discarded-potentially-causing-high-load-tp14073p14106.html
Hi Eron
No, unfortunately we did not directly resolve it... we are working around it
for now by ensuring that our Mesos slaves are set up so that their resource
offers can satisfy the JobManager.
Prashant
Thanks Stefan.
+1 on "I am even considering packing this list as a plain text file with the
checkpoint, to make this more transparent for users"
that is def. more Ops friendly...
Thanks
Prashant
Thanks Stephan and Stefan
We're looking forward to this patch in 1.3.2.
We will use a patched version in the meantime, depending on when 1.3.2
becomes available.
We're also implementing a cron job to remove orphaned/older
completedCheckpoint files per your recommendations... one caveat with a job
like
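Roughly, what we have in mind for that job is something along these lines (a
minimal sketch only: the recovery path passed in, the "completedCheckpoint"
name prefix and the 24h age threshold are placeholders, and it goes through
the same Hadoop FileSystem API that Flink itself uses, so an s3:// path
should work):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Cron-driven cleanup sketch: delete completedCheckpoint* files in the
// HA recovery directory that are older than a cutoff.
public class CompletedCheckpointCleanup {
    public static void main(String[] args) throws Exception {
        Path recoveryDir = new Path(args[0]);       // e.g. s3://<bucket>/flink/recovery
        long maxAgeMillis = 24L * 60 * 60 * 1000;   // keep the last 24 hours (placeholder)
        long cutoff = System.currentTimeMillis() - maxAgeMillis;

        FileSystem fs = recoveryDir.getFileSystem(new Configuration());
        for (FileStatus status : fs.listStatus(recoveryDir)) {
            boolean isCompletedCheckpoint =
                status.getPath().getName().startsWith("completedCheckpoint");
            if (isCompletedCheckpoint && status.getModificationTime() < cutoff) {
                fs.delete(status.getPath(), false);  // non-recursive: these are plain files
            }
        }
    }
}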
Hi Stephan
Unclear on what you mean by the "trash" option... I thought that was only
available for the Hadoop command line and not applicable to the API, which
is what Flink uses? If there is a configuration option for the Flink/Hadoop
connector, please let me know.
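For what it's worth, my understanding is that the trash is a client-side
feature of the Hadoop shell: a plain FileSystem.delete() call (which is what
code going through the API ends up doing) removes the file immediately, and
the programmatic equivalent of the shell's trash behaviour is
Trash.moveToAppropriateTrash. A minimal sketch of the distinction (the path
and the fs.trash.interval value are just placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class TrashVsDelete {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setLong("fs.trash.interval", 60);   // trash retention in minutes; 0 disables it

        Path path = new Path(args[0]);           // e.g. one of the completedCheckpoint files
        FileSystem fs = path.getFileSystem(conf);

        // A direct API delete removes the file right away, no trash involved:
        // fs.delete(path, false);

        // The shell's "move to .Trash" behaviour, done programmatically:
        boolean moved = Trash.moveToAppropriateTrash(fs, path, conf);
        System.out.println(moved ? "moved to trash" : "not moved (trash disabled?)");
    }
}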
Also, one additional thing about S3
Thanks Stephan
We can confirm that turning off RocksDB incremental checkpointing seems to
help and greatly reduces the number of files (from tens of thousands to low
thousands).
We still see that there is an inflection point: running > 50 jobs causes
the appmaster to stop deleting files from
Hi Xiaogang and Stephan
We're continuing to test and have now set up the cluster to disable
incremental RocksDB checkpointing and to increase the checkpoint interval
from 30s to 120s (not ideal, really :-( ).
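Concretely, that setup amounts to something like this in the job code
(1.3.x API; the checkpoint URI below is a placeholder, and the second
RocksDBStateBackend constructor argument is the incremental-checkpointing
flag):

import org.apache.flink.contrib.streaming.state.RocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointSetup {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Checkpoint every 120s instead of every 30s.
        env.enableCheckpointing(120_000);

        // false = incremental RocksDB checkpointing disabled
        // ("s3://<bucket>/checkpoints" is a placeholder URI).
        env.setStateBackend(new RocksDBStateBackend("s3://<bucket>/checkpoints", false));

        // ... job graph definition and env.execute() would follow here
    }
}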
We'll run it with a large number of jobs and report back if this setup shows
Wanted to add: we took some stack traces and memory dumps... will post them
or send them to you, but the stack traces indicate that the appmaster is
spending a lot of time in the AWS S3 library trying to list an S3 directory
(recovery?)
Thanks
Prashant
Hi Xiaogang and Stephan
Thank you for your response. Sorry about the delay in responding (was
traveling):
We've been trying to figure out what triggers this, but your points about
the master not being able to delete files "in time" seem to be correct.
We've been testing out two different
To add one more data point... it seems like the recovery directory is the
bottleneck somehow... so if we delete the recovery directory and restart the
JobManager, it comes back and is responsive.
Of course, we lose all jobs, since none can be recovered... which is not
ideal.
So