Re: Stream job failed after increasing number retained checkpoints

2018-01-16 Thread Jose Miguel Tejedor Fernandez
Thanks Piotr and Stefan, the problem was the overhead in the JobManager's heap memory usage when increasing the number of retained checkpoints. It was solved once I reverted that value to one. BR. This is the actual error according to the JobManager log at the time of the OOM: 2018-01-08 22:27:09,293 WARN org.j
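
For reference, the setting in question is the retained-checkpoint count in flink-conf.yaml; a minimal sketch of the relevant entries (paths and values are illustrative, not taken from this thread):

    # flink-conf.yaml
    # Number of completed checkpoints whose metadata the JobManager keeps around.
    # Each retained checkpoint holds metadata on the JobManager heap, so large
    # values increase JobManager memory usage; 1 is the default.
    state.checkpoints.num-retained: 1
    # Target directory for externalized checkpoint metadata (e.g. on HDFS/S3).
    state.checkpoints.dir: hdfs:///flink/checkpoints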

Re: Stream job failed after increasing number retained checkpoints

2018-01-10 Thread Piotr Nowojski
Hi, this TaskManager log suggests that the problem lies on the JobManager side (there is no visible gap in the logs, the GC time reported is accumulated, and 31 seconds accumulated over 963 GC collections is a low value). Could you show the JobManager log itself? Probably it’s the one that’s causing the T
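
To put the quoted GC numbers in perspective: 31 s spread over 963 collections works out to roughly 31 / 963 ≈ 0.032 s, i.e. about 32 ms per collection on average, which points away from long GC pauses on the TaskManager side.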

Re: Stream job failed after increasing number retained checkpoints

2018-01-10 Thread Jose Miguel Tejedor Fernandez
Hi, > I wonder what reason you might have that you ever want such a huge number of retained checkpoints? The Flink jobs running on the EMR cluster require a checkpoint at midnight. (In our use case we need to sync a loaded delta to a third party partner with the streamed data.) The delta load t
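
For readers with a similar use case: a retained (externalized) checkpoint can be used as a restore point much like a savepoint; a sketch, assuming the checkpoint metadata lives on HDFS (the path is hypothetical and its exact layout depends on the Flink version):

    # Resume a job from the metadata of a specific retained checkpoint
    bin/flink run -s hdfs:///flink/checkpoints/<checkpoint-metadata-path> my-job.jar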

Re: Stream job failed after increasing number retained checkpoints

2018-01-10 Thread Stefan Richter
Hi, there is no known limitation in the strict sense, but you might run out of DFS space or JobManager memory if you keep around a huge number of checkpoints. I wonder what reason you might have that you ever want such a huge number of retained checkpoints? Usually keeping one checkpoint should d
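
For context, retained checkpoints are the externalized checkpoints that survive job cancellation; a minimal sketch against the Flink 1.3/1.4 DataStream API, assuming an HDFS state backend path (the path and interval are illustrative):

    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import org.apache.flink.streaming.api.environment.CheckpointConfig.ExternalizedCheckpointCleanup;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointSetup {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();

            // Take a checkpoint every 10 minutes; the state itself goes to DFS,
            // while the JobManager tracks metadata for each retained checkpoint.
            env.enableCheckpointing(10 * 60 * 1000L);
            env.setStateBackend(new FsStateBackend("hdfs:///flink/checkpoints"));

            // Keep checkpoints after the job is cancelled so they can serve as
            // restore points; how many completed checkpoints are kept is governed
            // by state.checkpoints.num-retained in flink-conf.yaml.
            env.getCheckpointConfig().enableExternalizedCheckpoints(
                    ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);

            // ... define sources, transformations and sinks here ...
            // env.execute("checkpointed job");
        }
    }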

Re: Stream job failed after increasing number retained checkpoints

2018-01-09 Thread Piotr Nowojski
Hi, increasing akka’s timeouts is rarely a solution for any problem - it either does not help, or it just masks the issue, making it less visible. But yes, it is possible to bump the limits: https://ci.apache.org/projects/flink/flink-docs-release-1.3/setup/config.html#distributed-coordination-via-akk
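
If someone does want to experiment with those limits, the relevant keys live in flink-conf.yaml; a small sketch with example values (not recommendations):

    # flink-conf.yaml -- distributed coordination via Akka
    # Timeout for ask calls between JobManager and TaskManagers (default 10 s).
    akka.ask.timeout: 60 s
    # Acceptable heartbeat pause before a remote peer is considered unreachable.
    akka.watch.heartbeat.pause: 100 s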