Got it, thank you!
> On Jul 10, 2019, at 2:20 PM, Xintong Song <tonysong...@gmail.com> wrote:
>
> Thanks for the kind offer, Qi.
>
> I think this work should not take much time, so I can take care of it. It's
> just that the community is currently under feature freeze for release 1.9, so
> we need to wait until the release branch has been cut.
>
> Thank you~
> Xintong Song
>
>
> On Wed, Jul 10, 2019 at 1:55 PM qi luo <luoqi...@gmail.com> wrote:
> Thanks Xintong and Haibo, I’ve found the fix in the Blink branch.
>
> We’d also be glad to help contribute this patch to the community version, in
> case you don’t have time.
>
> Regards,
> Qi
>
>> On Jul 10, 2019, at 11:51 AM, Haibo Sun <sunhaib...@163.com> wrote:
>>
>> Hi, Qi
>>
>> Sorry, after talking to Xintong Song offline, I realized that some of what I
>> said before was wrong. Please refer to Xintong's answer.
>>
>> Best,
>> Haibo
>>
>> At 2019-07-10 11:37:16, "Xintong Song" <tonysong...@gmail.com> wrote:
>> Hi Qi,
>>
>> Thanks for reporting this problem. I think you are right. Starting a large
>> number of TMs in the main thread on YARN can take a relatively long time,
>> causing the RM to become unresponsive. In our enterprise version Blink, we
>> actually have a thread pool for starting TMs. I think we should contribute
>> this feature to the community version as well. I created a JIRA ticket
>> <https://issues.apache.org/jira/browse/FLINK-13184> where you can track the
>> progress of this issue.
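>>
>> (To sketch the idea -- note this is a hypothetical illustration rather than
>> the actual Blink code; the executor field, its pool size, and the
>> startTaskExecutorInContainer helper are all made up. It uses
>> java.util.concurrent.ExecutorService / Executors:)
>>
>>   // A dedicated pool so that the blocking NodeManager calls which launch
>>   // TaskExecutors do not stall the ResourceManager's main thread.
>>   private final ExecutorService containerStartExecutor =
>>       Executors.newFixedThreadPool(16);
>>
>>   public void onContainersAllocated(List<Container> containers) {
>>     runAsync(() -> {
>>       for (Container container : containers) {
>>         // Only bookkeeping stays in the main thread; the blocking
>>         // container launch is handed off to the pool.
>>         containerStartExecutor.execute(
>>             () -> startTaskExecutorInContainer(container));
>>       }
>>     });
>>   }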
>>
>> For solving your problem at the moment, I agree with Haibo that configuring a
>> larger registration timeout could be a workaround.
>>
>> Thank you~
>> Xintong Song
>>
>>
>> On Wed, Jul 10, 2019 at 10:37 AM Haibo Sun <sunhaib...@163.com> wrote:
>> Hi, Qi
>>
>> In our experience, there is no problem allocating more than 1000 containers
>> when the registration timeout is set to 5 minutes. Perhaps there are other
>> causes? You can also try increasing the value of
>> `taskmanager.registration.timeout`. As for allocating containers with
>> multiple threads, I personally think it would get very complicated; the more
>> recommended way is to move the waiting work into asynchronous processing, so
>> as to free up the main thread.
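>>
>> (For reference, the timeout can be raised in flink-conf.yaml; 5 minutes is
>> the default, and the 10 minutes below is just an example value:)
>>
>>   taskmanager.registration.timeout: 10 min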
>>
>> Best,
>> Haibo
>>
>>
>> At 2019-07-09 21:05:51, "qi luo" <luoqi...@gmail.com> wrote:
>> Hi guys,
>>
>> We’re using the latest version of Flink's YarnResourceManager, but our job
>> startup occasionally hangs when allocating many YARN containers (e.g. >1000).
>> I checked the related code in YarnResourceManager, as below:
>>
>> [code screenshot attachment omitted]
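>>
>> (Since the screenshot doesn't survive the archive, here is roughly the shape
>> of the code path in question, paraphrased from the Flink 1.8 sources; names
>> and details are approximate, and error handling is elided:)
>>
>>   @Override
>>   public void onContainersAllocated(List<Container> containers) {
>>     runAsync(() -> {
>>       for (Container container : containers) {
>>         // Bookkeeping, then a blocking call to the NodeManager that
>>         // launches the TaskExecutor -- all inside the RM main thread.
>>         ContainerLaunchContext ctx = createTaskExecutorLaunchContext(
>>             container.getResource(),
>>             container.getId().toString(),
>>             container.getNodeId().getHost());
>>         nodeManagerClient.startContainer(container, ctx);
>>       }
>>     });
>>   }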
>>
>> It seems that it handles all allocated containers and starts the TMs in the
>> main thread. Thus, when container allocations are heavy, the RM thread
>> becomes unresponsive (e.g. no response to TM heartbeats; see the TM logs
>> below).
>>
>> Any idea on how to better handle such a case (e.g. multi-threading to handle
>> allocated containers) would be much appreciated. Thanks!
>>
>> Regards,
>> Qi
>>
>>
>> ————————————————————————
>> TM log:
>>
>> 2019-07-09 13:56:59,110 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor
>>   - Connecting to ResourceManager
>>   akka.tcp://flink@xxx/user/resourcemanager(00000000000000000000000000000000).
>> 2019-07-09 14:00:01,138 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor
>>   - Could not resolve ResourceManager address
>>   akka.tcp://flink@xxx/user/resourcemanager, retrying in 10000 ms: Ask timed
>>   out on [ActorSelection[Anchor(akka.tcp://flink@xxx/),
>>   Path(/user/resourcemanager)]] after [182000 ms]. Sender[null] sent message
>>   of type "akka.actor.Identify"..
>> 2019-07-09 14:01:59,137 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor
>>   - Fatal error occurred in TaskExecutor
>>   akka.tcp://flink@xxx/user/taskmanager_0.
>> org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException:
>>   Could not register at the ResourceManager within the specified maximum
>>   registration duration 300000 ms. This indicates a problem with this
>>   instance. Terminating now.
>> at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1023)
>> at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1009)
>> at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
>> at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
>> at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
>> at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>> at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>> at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>> at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>> at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>> at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>> at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>> at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>> at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>> at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>>
>