Re: stack job on fail over

2019-11-26 Thread Biao Liu
Hi Nick, Yes, reducing the heartbeat timeout is not a perfect solution. It just alleviates the pain a bit. I'm wondering whether my guess is right. Is it caused by heartbeat detection? Does a more graceful way of shutting down help? Thanks, Biao /'bɪ.aʊ/ On Tue, 26 Nov 2019 at 20:22, Nick

Re: stack job on fail over

2019-11-26 Thread Biao Liu
Hi Nick, I guess the reason is that your Flink job manager doesn't detect that the task manager is lost until the heartbeat times out. You could check the job manager log to verify that. Maybe a more graceful way of shutting down the task manager helps, like through "taskmanager.sh stop" or "kill" command without 9
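
For reference, the heartbeat settings discussed in this thread are set in flink-conf.yaml. A minimal sketch, assuming the standard `heartbeat.interval` and `heartbeat.timeout` keys (both in milliseconds); the values shown are illustrative only and are not taken from the thread:

```yaml
# flink-conf.yaml (illustrative values, assumed for this sketch)

# How often heartbeat requests are sent, in milliseconds (Flink default: 10000).
heartbeat.interval: 5000

# How long a missing heartbeat is tolerated before the peer is
# declared lost, in milliseconds (Flink default: 50000).
heartbeat.timeout: 20000
```

A lower timeout shortens the window in which the job manager still treats a killed task manager as alive, at the cost of more false positives during long GC pauses or network hiccups.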

stack job on fail over

2019-11-26 Thread Nick Toker
Hi, I have a standalone cluster with 3 nodes and a RocksDB backend. When one task manager fails (the process is killed), it takes a very long time until the job is totally canceled and a new job is resubmitted. I see that all slots on all nodes are being canceled except the slots of the dead