Re: stuck job on failover

2019-11-26 Thread Biao Liu
Hi Nick,

Yes, reducing the heartbeat timeout is not a perfect solution. It just
alleviates the pain a bit.

I'm wondering whether my guess is right or not. Is the delay caused by
heartbeat detection? Does shutting the task manager down more elegantly help?

Thanks,
Biao /'bɪ.aʊ/



On Tue, 26 Nov 2019 at 20:22, Nick Toker  wrote:

> Thanks,
> it did the trick.
>
>
> regards,
> nick
>
> On Tue, Nov 26, 2019 at 11:26 AM Biao Liu  wrote:
>
>> Hi Nick,
>>
>> I guess the reason is that your Flink job manager doesn't detect that the
>> task manager is lost until the heartbeat times out.
>> You could check the job manager log to verify that.
>>
>> Maybe a more elegant way of shutting down the task manager would help, for
>> example via "taskmanager.sh stop" or the "kill" command without the -9 signal.
>> Or you could reduce the heartbeat interval and timeout through the
>> configuration options "heartbeat.interval" and "heartbeat.timeout".
>>
>> Thanks,
>> Biao /'bɪ.aʊ/
>>
>>
>>
>> On Tue, 26 Nov 2019 at 16:09, Nick Toker  wrote:
>>
>>> Hi,
>>> I have a standalone cluster with 3 nodes and the RocksDB state backend.
>>> When one task manager fails (the process is killed), it takes a very long
>>> time until the job is fully canceled and a new job is resubmitted.
>>> I see that all slots on all nodes are canceled except for the slots of the
>>> dead task manager; it takes about 30-40 seconds for the job to fully shut
>>> down.
>>> Is there something I can do to reduce this time, or is there a plan for a
>>> fix (if so, when)?
>>>
>>> regards,
>>> nick
>>>
>>


Re: stuck job on failover

2019-11-26 Thread Biao Liu
Hi Nick,

I guess the reason is that your Flink job manager doesn't detect that the
task manager is lost until the heartbeat times out.
You could check the job manager log to verify that.

Maybe a more elegant way of shutting down the task manager would help, for
example via "taskmanager.sh stop" or the "kill" command without the -9 signal.
Or you could reduce the heartbeat interval and timeout through the
configuration options "heartbeat.interval" and "heartbeat.timeout".

Thanks,
Biao /'bɪ.aʊ/



On Tue, 26 Nov 2019 at 16:09, Nick Toker  wrote:

> Hi,
> I have a standalone cluster with 3 nodes and the RocksDB state backend.
> When one task manager fails (the process is killed), it takes a very long
> time until the job is fully canceled and a new job is resubmitted.
> I see that all slots on all nodes are canceled except for the slots of the
> dead task manager; it takes about 30-40 seconds for the job to fully shut
> down.
> Is there something I can do to reduce this time, or is there a plan for a
> fix (if so, when)?
>
> regards,
> nick
>


stuck job on failover

2019-11-26 Thread Nick Toker
Hi,
I have a standalone cluster with 3 nodes and the RocksDB state backend.
When one task manager fails (the process is killed), it takes a very long
time until the job is fully canceled and a new job is resubmitted.
I see that all slots on all nodes are canceled except for the slots of the
dead task manager; it takes about 30-40 seconds for the job to fully shut
down.
Is there something I can do to reduce this time, or is there a plan for a
fix (if so, when)?

regards,
nick