Stuck job on failover

Nick Toker Tue, 26 Nov 2019 00:09:28 -0800

Hi,
I have a standalone cluster with 3 nodes and a RocksDB backend.
When one task manager fails (the process is killed), it takes a very long
time until the job is completely canceled and a new job is resubmitted.
I see that the slots on all nodes are being canceled, except for the slots
of the dead task manager. It takes about 30-40 seconds for the job to shut
down completely.
Is there something I can do to reduce this time, or is there a plan for a
fix (and if so, when)?


Regards,
Nick
