We have a single JobManager in an HA setup. From the logs and metrics,
it appears that before the issue occurred there was a long (15s) GC pause
on the JobManager, which then triggered a leader election. Because there
is only one JobManager, the same one became leader again
after it recover
Hi Micah,
The problem does indeed look similar to FLINK-10255. Could you tell me your
exact cluster setup (HA with standby JobManagers?). Moreover, DEBUG-level
logs from all JobManagers would be helpful for further debugging.
Cheers,
Till
On Tue, Dec 11, 2018 at 10:09 AM Stefan Richter wrote:
Hi,
Thanks for reporting the problem. I think the exception trace does indeed
look very similar to the traces in the discussion for FLINK-10184. I will
pull in Till, who worked on the fix, to hear his opinion. Maybe the current
fix only made the problem less likely to appear but is not yet complete?
Be
Hello,
We've been seeing an issue with several Flink 1.5.4 clusters that looks
like this:
1. Job is cancelled with a savepoint
2. The jar is deleted from our HA blobstore (S3)
3. The jobgraph in ZK is *not* deleted
4. We restart the cluster
5. Startup fails in recovery because the jar is not avai