Re: Job Recovery Time on TM Lost

2021-07-12 Thread 刘建刚
Yes, time is the main factor when detecting the TM's liveness. A count-based method would just check at fixed intervals. Gen Luo wrote on Friday, July 9, 2021 at 10:37 AM: > @刘建刚 > Welcome to join the discussion and thanks for sharing your experience. > > I have a minor question. In my experience, network failures in a certain >
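
To make the distinction being debated concrete, here is a minimal, purely illustrative sketch (none of this is Flink code; class names and thresholds are assumptions): a count-based detector declares the TM dead after N failed probes, while a time-based detector declares it dead once no heartbeat has arrived within a configured timeout.

```java
// Illustrative only: contrasts count-based and time-based liveness detection.
// Neither class exists in Flink; they just show the two detection styles.
class CountBasedDetector {
    private final int maxMissedProbes;
    private int missedProbes = 0;

    CountBasedDetector(int maxMissedProbes) {
        this.maxMissedProbes = maxMissedProbes;
    }

    // Called once per probe interval.
    boolean onProbe(boolean probeSucceeded) {
        missedProbes = probeSucceeded ? 0 : missedProbes + 1;
        return missedProbes >= maxMissedProbes; // TM considered dead
    }
}

class TimeBasedDetector {
    private final long timeoutMillis;
    private long lastHeartbeatMillis = System.currentTimeMillis();

    TimeBasedDetector(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    void onHeartbeat() {
        lastHeartbeatMillis = System.currentTimeMillis();
    }

    // Called periodically; true once the TM has been silent longer than the timeout.
    boolean isDead(long nowMillis) {
        return nowMillis - lastHeartbeatMillis > timeoutMillis;
    }
}
```

A time-based threshold can be tuned directly against a measured p99 of transient network outages, which is the argument made in the quoted message below.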

Re: Job Recovery Time on TM Lost

2021-07-09 Thread Till Rohrmann
Gen is right with his explanation of why dead TM discovery can be faster with Flink < 1.12. Concerning flaky TaskManager connections: 2.1 I think the problem is that the receiving TM does not know the container ID of the sending TM. It only knows its address. But this is something one could

Re: Job Recovery Time on TM Lost

2021-07-08 Thread Gen Luo
@刘建刚 Welcome to join the discussion and thanks for sharing your experience. I have a minor question. In my experience, network failures in a certain cluster usually take some time to recover, which can be measured as a p99 to guide configuration. So I suppose it would be better to use time than attempt

Re: Job Recovery Time on TM Lost

2021-07-08 Thread Lu Niu
Thanks everyone! This is a great discussion! 1. Restarting takes 30s when throwing exceptions from application code because the restart delay is 30s in the config. Previously, lots of related configs were set to 30s, which led to the confusion. I redid the test with the config:
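
For context, a hedged sketch of how such a 30s fixed-delay restart strategy can be configured programmatically (the actual values Lu used are not shown in this excerpt, so the numbers below are placeholders; the same effect can be achieved in flink-conf.yaml via `restart-strategy: fixed-delay` and `restart-strategy.fixed-delay.delay`):

```java
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartDelayExample {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // A 30s delay between restart attempts; this delay alone explains a 30s
        // "restart time" when an exception is thrown from user code.
        env.setRestartStrategy(
                RestartStrategies.fixedDelayRestart(
                        Integer.MAX_VALUE,               // placeholder: max restart attempts
                        Time.of(30, TimeUnit.SECONDS))); // delay between attempts
    }
}
```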

Re: Job Recovery Time on TM Lost

2021-07-07 Thread 刘建刚
It is really helpful to find the lost container quickly. In our internal Flink version, we optimize this with task reports and a JobMaster probe. When a task fails because of a connection problem, it reports to the JobMaster. The JobMaster will try to confirm the liveness of the unconnected TaskManager for
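
A rough sketch of that idea as described above, not the code of the internal fork (every type and method name here is hypothetical): when a task reports a lost connection, the JobMaster actively probes the suspect TaskManager and, if the probes fail, fails its tasks immediately instead of waiting for the heartbeat timeout.

```java
// Hypothetical sketch of the "task report + JobMaster probe" approach.
// None of these types exist in Flink; they only illustrate the control flow.
class SuspectTaskManagerProber {
    private final TaskManagerProbe probe;   // hypothetical lightweight RPC ping
    private final int maxProbeAttempts;

    SuspectTaskManagerProber(TaskManagerProbe probe, int maxProbeAttempts) {
        this.probe = probe;
        this.maxProbeAttempts = maxProbeAttempts;
    }

    /** Called when a task reports that its connection to a TM was lost. */
    void onConnectionFailureReported(String suspectTaskManagerId) {
        for (int i = 0; i < maxProbeAttempts; i++) {
            if (probe.ping(suspectTaskManagerId)) {
                return; // TM is alive; treat the report as a transient network problem
            }
        }
        // All probes failed: fail the tasks on the suspect TM right away instead of
        // waiting for the heartbeat timeout to expire.
        probe.failTasksOn(suspectTaskManagerId);
    }

    interface TaskManagerProbe {
        boolean ping(String taskManagerId);
        void failTasksOn(String taskManagerId);
    }
}
```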

Re: Job Recovery Time on TM Lost

2021-07-06 Thread Gen Luo
Yes, I have noticed the PR and commented there with some consideration about the new option. We can discuss further there. On Tue, Jul 6, 2021 at 6:04 PM Till Rohrmann wrote: > This is actually a very good point Gen. There might not be a lot to gain > for us by implementing a fancy algorithm

Re: Job Recovery Time on TM Lost

2021-07-06 Thread Till Rohrmann
This is actually a very good point Gen. There might not be a lot to gain for us by implementing a fancy algorithm for figuring out whether a TM is dead or not based on failed heartbeat RPCs from the JM if the TM <> TM communication does not tolerate failures and directly fails the affected tasks.

Re: Job Recovery Time on TM Lost

2021-07-05 Thread Gen Luo
I know that there are retry strategies in the Akka RPC framework. I was just considering that, since the environment is shared by the JM and TMs, and the connections among TMs (using Netty) are also flaky in unstable environments, which will likewise cause job failures, is it necessary to build a strongly

Re: Job Recovery Time on TM Lost

2021-07-05 Thread Till Rohrmann
I think for RPC communication there are retry strategies used by the underlying Akka ActorSystem. So an RpcEndpoint can reconnect to a remote ActorSystem and resume communication. Moreover, there are also reconciliation protocols in place which reconcile the state between the components because of

Re: Job Recovery Time on TM Lost

2021-07-05 Thread Gen Luo
As far as I know, a TM will report a connection failure once a TM it is connected to is lost. I suppose the JM can trust the report and fail the tasks on the lost TM if it also encounters a connection failure. Of course, this won't work if the lost TM is standalone. But I suppose we can use the same strategy

Re: Job Recovery Time on TM Lost

2021-07-02 Thread Till Rohrmann
Could you share the full logs of the second experiment with us, Lu? I cannot tell off the top of my head why it should take 30s unless you have configured a restart delay of 30s. Let's discuss FLINK-23216 on the JIRA ticket, Gen. I've now implemented FLINK-23209 [1] but it somehow has the

Re: Job Recovery Time on TM Lost

2021-07-02 Thread Gen Luo
Thanks for sharing, Till and Yang. @Lu Sorry, but I don't know how to explain the new test from the log. Let's wait for others' replies. @Till It would be nice if those JIRAs could be fixed. Thanks again for proposing them. In addition, I was tracking an issue where the RM keeps allocating and freeing slots

Re: Job Recovery Time on TM Lost

2021-07-01 Thread Lu Niu
Another side question: shall we add a metric to cover the complete restarting time (phase 1 + phase 2)? The current metric jm.restartingTime only covers phase 1. Thanks! Best, Lu On Thu, Jul 1, 2021 at 12:09 PM Lu Niu wrote: > Thanks Till and Yang for the help! Also thanks Till for a quick fix! > > I did

Re: Job Recovery Time on TM Lost

2021-07-01 Thread Lu Niu
Thanks Till and Yang for the help! Also thanks Till for a quick fix! I did another test yesterday. In this test, I intentionally throw an exception from the source operator: ``` if (runtimeContext.getIndexOfThisSubtask() == 1 && errorFrenquecyInMin > 0 && System.currentTimeMillis() -
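
The quoted snippet is cut off by the archive. Below is a hypothetical reconstruction of such a test source (a sketch, not Lu's actual code): only the quoted condition comes from the email; the class name, the lastFailureTimeMs field, and the emit loop are assumptions made for illustration.

```java
import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

// Hypothetical test source that deliberately fails one subtask so that the
// job's restart/recovery time can be measured.
public class IntentionallyFailingSource extends RichParallelSourceFunction<Long> {

    private final long errorFrenquecyInMin; // name kept from the quoted snippet
    private long lastFailureTimeMs = System.currentTimeMillis(); // assumed field
    private volatile boolean running = true;

    public IntentionallyFailingSource(long errorFrenquecyInMin) {
        this.errorFrenquecyInMin = errorFrenquecyInMin;
    }

    @Override
    public void run(SourceContext<Long> ctx) throws Exception {
        long counter = 0;
        final RuntimeContext runtimeContext = getRuntimeContext();
        while (running) {
            // Fail only subtask 1, at most once every errorFrenquecyInMin minutes.
            if (runtimeContext.getIndexOfThisSubtask() == 1
                    && errorFrenquecyInMin > 0
                    && System.currentTimeMillis() - lastFailureTimeMs
                            >= errorFrenquecyInMin * 60_000L) {
                lastFailureTimeMs = System.currentTimeMillis();
                throw new RuntimeException("Intentional failure for the recovery-time test");
            }
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(counter++);
            }
            Thread.sleep(10);
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}
```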

Re: Job Recovery Time on TM Lost

2021-07-01 Thread Till Rohrmann
A quick addition: I think with FLINK-23202 it should now also be possible to improve the heartbeat mechanism in the general case. We can leverage the unreachability exception thrown when a remote target is no longer reachable to mark a heartbeat target as no longer reachable [1]. This can then be

Re: Job Recovery Time on TM Lost

2021-07-01 Thread Yang Wang
Since you are deploying Flink workloads on Yarn, the Flink ResourceManager should get the container completion event after the heartbeat chain of Yarn NM -> Yarn RM -> Flink RM, which takes 8 seconds by default. And the Flink ResourceManager will release the dead TaskManager container once it has received the completion

Re: Job Recovery Time on TM Lost

2021-07-01 Thread Till Rohrmann
Gen's analysis is correct. Flink currently uses its heartbeat as the primary means to detect dead TaskManagers. This means that Flink will take at least `heartbeat.timeout` before the system recovers. Even if the cancellation happens fast (e.g. by having configured a low
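
The timeout Till refers to is the cluster-level `heartbeat.timeout` option, with `heartbeat.interval` controlling how often heartbeat requests are sent. A minimal sketch of setting them programmatically for a test cluster; the values shown are the documented defaults for Flink 1.x around this time, and any lower value trades faster dead-TM detection for more false positives on a congested network or under long GC pauses.

```java
import org.apache.flink.configuration.Configuration;

public class HeartbeatConfigExample {
    public static void main(String[] args) {
        // Equivalent to setting these keys in flink-conf.yaml.
        Configuration conf = new Configuration();
        conf.setString("heartbeat.interval", "10000"); // ms; how often heartbeat requests are sent
        conf.setString("heartbeat.timeout", "50000");  // ms; lower bound on how long a dead TM goes undetected

        System.out.println(conf);
    }
}
```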

Re: Job Recovery Time on TM Lost

2021-06-30 Thread Lu Niu
Thanks Gen! CCing flink-dev to collect more input. Best, Lu On Wed, Jun 30, 2021 at 12:55 AM Gen Luo wrote: > I'm also wondering about this. > > In my opinion, it's because the JM cannot confirm whether the TM is lost > or whether it's a temporary network problem that will recover soon, since I can see > in