Hi Nick,
Yes, reducing the heartbeat timeout is not a perfect solution. It just
alleviates the pain a bit.
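For reference, here is a minimal sketch of what lowering the timeout could look like in flink-conf.yaml. The keys heartbeat.interval and heartbeat.timeout are Flink's heartbeat options, with values in milliseconds. The numbers below are illustrative assumptions only, not recommendations for your cluster.

```yaml
# flink-conf.yaml (sketch; values are illustrative assumptions)

# How often heartbeat requests are sent, in milliseconds.
# Keep this well below heartbeat.timeout.
heartbeat.interval: 2000

# How long to wait for a heartbeat before the peer is considered lost,
# in milliseconds. A lower value means a killed task manager is noticed
# sooner.
heartbeat.timeout: 10000
```

The trade-off is that a very low timeout can mark a healthy task manager as lost during a long GC pause or a short network hiccup, so it is worth testing any value under realistic load.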
I'm wondering whether my guess is right. Is the delay caused by heartbeat
detection? Does a more graceful way of shutting down help?
Thanks,
Biao /'bɪ.aʊ/
On Tue, 26 Nov 2019 at 20:22, Nick Toker
Hi Nick,
I guess the reason is that your Flink job manager doesn't detect that the
task manager is lost until the heartbeat times out.
You could check the job manager log to verify that.
Maybe shutting down the task manager more gracefully would help, for example
through "taskmanager.sh stop" or the "kill" command without 9 s
Hi
I have a standalone cluster with 3 nodes and the RocksDB backend.
When one task manager fails (the process is killed),
it takes a very long time until the job is fully canceled and a new job is
resubmitted.
I see that all slots on all nodes are being canceled except for the slots
of the dead
t