Re: After job cancel, leftover ZK state prevents job manager startup

2018-12-12 Thread Micah Wylde
We have a single JobManager in an HA setup. From looking at logs and metrics, it appears that before the issue occurred there was a long (15s) GC pause on the JobManager, which then caused a leadership election. Because there is only one JobManager, the same one became leader again after it recover

Re: After job cancel, leftover ZK state prevents job manager startup

2018-12-11 Thread Till Rohrmann
Hi Micah, the problem does indeed look similar to FLINK-10255. Could you tell me your exact cluster setup (HA with standby JobManagers?). Moreover, the logs of all JobManagers on DEBUG level would be helpful for further debugging. Cheers, Till On Tue, Dec 11, 2018 at 10:09 AM Stefan Richter wrote:

Re: After job cancel, leftover ZK state prevents job manager startup

2018-12-11 Thread Stefan Richter
Hi, Thanks for reporting the problem. I think the exception trace does indeed look very similar to the traces in the discussion for FLINK-10184. I will pull in Till, who worked on the fix, to hear his opinion. Maybe the current fix only made the problem less likely to appear but is not yet complete? Be

After job cancel, leftover ZK state prevents job manager startup

2018-12-10 Thread Micah Wylde
Hello, We've been seeing an issue with several Flink 1.5.4 clusters that looks like this:
1. Job is cancelled with a savepoint
2. The jar is deleted from our HA blobstore (S3)
3. The jobgraph in ZK is *not* deleted
4. We restart the cluster
5. Startup fails in recovery because the jar is not avai