Flink failure recovery takes a very long time

trung kien Thu, 06 Sep 2018 03:51:01 -0700

Hi all,

I am testing failure recovery of a Flink job when a JobManager (JM) or
TaskManager (TM) goes down.
Our goal is for the job to restart automatically and return to normal
operation in every case.


However, what I am seeing is very strange, and I hope someone here can
help me understand it.

When a JM or TM goes down, I see the job being restarted, but as soon as
it restarts it starts working on a checkpoint, which usually takes 30+
minutes to finish (under normal conditions a checkpoint takes only 1-2
minutes). As soon as that checkpoint finishes, the job is back to normal.

I'm using Flink 1.4.2, but I see similar behavior on 1.6.0 as well.

Could anyone please help explain this behavior? We really want to reduce
the recovery time, but I can't find any documentation that describes the
recovery process in detail.

Any help is really appreciated.


-- 
Thanks
Kien
