Re: stuck job on failover
Hi Nick,

Yes, reducing the heartbeat timeout is not a perfect solution; it just alleviates the pain a bit.

I'm wondering whether my guess is right. Was it caused by heartbeat detection? Did the more graceful way of shutting down help?

Thanks,
Biao /'bɪ.aʊ/

On Tue, 26 Nov 2019 at 20:22, Nick Toker wrote:

> Thanks, it did the trick.
>
> regards,
> nick
>
> On Tue, Nov 26, 2019 at 11:26 AM Biao Liu wrote:
>
>> Hi Nick,
>>
>> I guess the reason is that your Flink job manager doesn't detect that the
>> task manager is lost until the heartbeat times out. You could check the
>> job manager log to verify that.
>>
>> A more graceful way of shutting down the task manager might help, such as
>> "taskmanager.sh stop" or a "kill" command without the -9 signal. Or you
>> could reduce the heartbeat interval and timeout through the configuration
>> options "heartbeat.interval" and "heartbeat.timeout".
>>
>> Thanks,
>> Biao /'bɪ.aʊ/
>>
>> On Tue, 26 Nov 2019 at 16:09, Nick Toker wrote:
>>
>>> Hi,
>>>
>>> I have a standalone cluster with 3 nodes and the RocksDB backend. When
>>> one task manager fails (the process is killed), it takes a very long
>>> time until the job is fully canceled and a new job is resubmitted. I see
>>> that the slots on all nodes are canceled except for the slots of the
>>> dead task manager; it takes about 30-40 seconds for the job to shut down
>>> completely.
>>>
>>> Is there something I can do to reduce this time, or is there a plan for
>>> a fix (if so, when)?
>>>
>>> regards,
>>> nick
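The distinction between a plain "kill" and "kill -9" is the crux of the graceful-shutdown suggestion: a task manager that receives SIGTERM can run its shutdown hooks and deregister from the job manager immediately, while SIGKILL gives it no chance, so the job manager only notices at heartbeat timeout. A toy sketch of the mechanism in plain POSIX shell (nothing Flink-specific; the trap stands in for the task manager's shutdown hook):

```shell
# A process that traps SIGTERM can clean up before exiting -- the same
# mechanism a TaskManager's shutdown hook relies on to deregister.
sh -c "trap 'echo deregistering; exit 0' TERM; sleep 3 & wait" &
pid=$!
sleep 1             # give the child time to install the trap
kill -TERM "$pid"   # graceful: the trap runs, cleanup happens
wait "$pid"
echo "exit=$?"      # prints exit=0
# kill -9 would bypass the trap entirely: no cleanup, status 137.
```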
Re: stuck job on failover
Hi Nick,

I guess the reason is that your Flink job manager doesn't detect that the task manager is lost until the heartbeat times out. You could check the job manager log to verify that.

A more graceful way of shutting down the task manager might help, such as "taskmanager.sh stop" or a "kill" command without the -9 signal. Or you could reduce the heartbeat interval and timeout through the configuration options "heartbeat.interval" and "heartbeat.timeout".

Thanks,
Biao /'bɪ.aʊ/

On Tue, 26 Nov 2019 at 16:09, Nick Toker wrote:

> Hi,
>
> I have a standalone cluster with 3 nodes and the RocksDB backend. When one
> task manager fails (the process is killed), it takes a very long time until
> the job is fully canceled and a new job is resubmitted. I see that the
> slots on all nodes are canceled except for the slots of the dead task
> manager; it takes about 30-40 seconds for the job to shut down completely.
>
> Is there something I can do to reduce this time, or is there a plan for a
> fix (if so, when)?
>
> regards,
> nick
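For reference, a flink-conf.yaml sketch of the tuning mentioned above. The values are illustrative, not recommendations; the defaults (heartbeat.interval: 10000 ms, heartbeat.timeout: 50000 ms) line up with the 30-40 second delay observed.

```yaml
# conf/flink-conf.yaml -- illustrative values, not a recommendation.
# Defaults are heartbeat.interval: 10000 and heartbeat.timeout: 50000 (ms),
# so a SIGKILLed task manager can go unnoticed for up to ~50 s.
heartbeat.interval: 5000    # how often heartbeats are requested, in ms
heartbeat.timeout: 20000    # silence before a peer is declared dead, in ms
```

Note the trade-off: a shorter timeout speeds up failure detection but makes the cluster more likely to declare a healthy but briefly unresponsive task manager (e.g. during a long GC pause) dead.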
stuck job on failover
Hi,

I have a standalone cluster with 3 nodes and the RocksDB backend. When one task manager fails (the process is killed), it takes a very long time until the job is fully canceled and a new job is resubmitted. I see that the slots on all nodes are canceled except for the slots of the dead task manager; it takes about 30-40 seconds for the job to shut down completely.

Is there something I can do to reduce this time, or is there a plan for a fix (if so, when)?

regards,
nick