Got it, thank you!
> On Jul 10, 2019, at 2:20 PM, Xintong Song <tonysong...@gmail.com> wrote:
>
> Thanks for the kind offer, Qi.
>
> I think this work should not take much time, so I can take care of it. It's
> just that the community is currently under feature freeze for release 1.9, so
> we need to wait until the release branch has been cut.
>
> Thank you~
> Xintong Song
>
>
> On Wed, Jul 10, 2019 at 1:55 PM qi luo <luoqi...@gmail.com> wrote:
> Thanks Xintong and Haibo, I’ve found the fix in the Blink branch.
>
> We’d also be glad to help contribute this patch to the community version, in
> case you don’t have time.
>
> Regards,
> Qi
>
>> On Jul 10, 2019, at 11:51 AM, Haibo Sun <sunhaib...@163.com> wrote:
>>
>> Hi, Qi
>>
>> Sorry, after talking to Xintong Song offline, I realized that some of what I
>> said before was wrong. Please refer to Xintong's answer.
>>
>> Best,
>> Haibo
>>
>> At 2019-07-10 11:37:16, "Xintong Song" <tonysong...@gmail.com> wrote:
>> Hi Qi,
>>
>> Thanks for reporting this problem. I think you are right. Starting a large
>> number of TMs in the main thread on YARN can take a relatively long time,
>> causing the RM to become unresponsive. In our enterprise version Blink, we
>> actually have a thread pool for starting TMs. I think we should contribute
>> this feature to the community version as well. I created a JIRA ticket
>> <https://issues.apache.org/jira/browse/FLINK-13184> where you can track the
>> progress of this issue.
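>>
>> (To sketch the idea -- note this is a hypothetical illustration rather than
>> the actual Blink code; the executor field, its pool size, and the
>> startTaskExecutorInContainer helper are all made up. It uses
>> java.util.concurrent.ExecutorService / Executors:)
>>
>>   // A dedicated pool so that the blocking NodeManager calls which launch
>>   // TaskExecutors do not stall the ResourceManager's main thread.
>>   private final ExecutorService containerStartExecutor =
>>       Executors.newFixedThreadPool(16);
>>
>>   public void onContainersAllocated(List<Container> containers) {
>>     runAsync(() -> {
>>       for (Container container : containers) {
>>         // Only bookkeeping stays in the main thread; the blocking
>>         // container launch is handed off to the pool.
>>         containerStartExecutor.execute(
>>             () -> startTaskExecutorInContainer(container));
>>       }
>>     });
>>   }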
>>
>> For solving your problem at the moment, I agree with Haibo that configuring a
>> larger registration timeout could be a workaround.
>>
>> Thank you~
>> Xintong Song
>>
>>
>> On Wed, Jul 10, 2019 at 10:37 AM Haibo Sun <sunhaib...@163.com> wrote:
>> Hi, Qi
>>
>> In our experience, there is no problem allocating more than 1000 containers
>> when the registration timeout is set to 5 minutes. Perhaps there are other
>> causes? You can also try increasing the value of
>> `taskmanager.registration.timeout`. As for allocating containers with
>> multiple threads, I personally think it would get very complicated; the more
>> recommended way is to move the waiting work into asynchronous processing, so
>> as to free up the main thread.
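>>
>> (For reference, the timeout can be raised in flink-conf.yaml; 5 minutes is
>> the default, and the 10 minutes below is just an example value:)
>>
>>   taskmanager.registration.timeout: 10 min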
>>
>> Best,
>> Haibo
>>
>>
>> At 2019-07-09 21:05:51, "qi luo" <luoqi...@gmail.com> wrote:
>> Hi guys,
>>
>> We’re using the latest version of Flink's YarnResourceManager, but our job
>> startup occasionally hangs when allocating many YARN containers (e.g. >1000).
>> I checked the related code in YarnResourceManager, as below:
>>
>> [code screenshot attachment omitted]
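>>
>> (Since the screenshot doesn't survive the archive, here is roughly the shape
>> of the code path in question, paraphrased from the Flink 1.8 sources; names
>> and details are approximate, and error handling is elided:)
>>
>>   @Override
>>   public void onContainersAllocated(List<Container> containers) {
>>     runAsync(() -> {
>>       for (Container container : containers) {
>>         // Bookkeeping, then a blocking call to the NodeManager that
>>         // launches the TaskExecutor -- all inside the RM main thread.
>>         ContainerLaunchContext ctx = createTaskExecutorLaunchContext(
>>             container.getResource(),
>>             container.getId().toString(),
>>             container.getNodeId().getHost());
>>         nodeManagerClient.startContainer(container, ctx);
>>       }
>>     });
>>   }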
>>
>> It seems that it handles all allocated containers and starts the TMs in the
>> main thread. Thus, when container allocations are heavy, the RM thread
>> becomes unresponsive (e.g. no response to TM heartbeats; see the TM logs
>> below).
>>
>> Any idea on how to better handle such a case (e.g. multi-threading to handle
>> allocated containers) would be much appreciated. Thanks!
>>
>> Regards,
>> Qi
>>
>>
>> ————————————————————————
>> TM log:
>>
>> 2019-07-09 13:56:59,110 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor
>>   - Connecting to ResourceManager
>>   akka.tcp://flink@xxx/user/resourcemanager(00000000000000000000000000000000).
>> 2019-07-09 14:00:01,138 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor
>>   - Could not resolve ResourceManager address
>>   akka.tcp://flink@xxx/user/resourcemanager, retrying in 10000 ms: Ask timed
>>   out on [ActorSelection[Anchor(akka.tcp://flink@xxx/),
>>   Path(/user/resourcemanager)]] after [182000 ms]. Sender[null] sent message
>>   of type "akka.actor.Identify"..
>> 2019-07-09 14:01:59,137 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor
>>   - Fatal error occurred in TaskExecutor
>>   akka.tcp://flink@xxx/user/taskmanager_0.
>> org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException:
>>   Could not register at the ResourceManager within the specified maximum
>>   registration duration 300000 ms. This indicates a problem with this
>>   instance. Terminating now.
>> at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1023)
>> at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1009)
>> at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
>> at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
>> at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
>> at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>> at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>> at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>> at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>> at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>> at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>
>