Re: The RMClient's and YarnResourceManagers internal state about the number of pending container requests has diverged

2019-11-05 Thread Till Rohrmann
Hi Regina, I've taken another look at the problem. I think we could improve the situation by reordering the calls we do in YarnResourceManager#onContainersAllocated. I've created a PR [1] for the re-opened issue [2]. Would it be possible for you to verify the fix? What you need to do is to check

Re: The RMClient's and YarnResourceManagers internal state about the number of pending container requests has diverged

2019-11-01 Thread Till Rohrmann
Hi Regina, at the moment the community is working towards the 1.10 release, with a lot of features still to be completed. The intended feature freeze is at the end of November. Due to this, it is quite hard to tell when exactly this problem will be properly fixed, but we'll try our best. Cheers, Till On Thu,

Re: The RMClient's and YarnResourceManagers internal state about the number of pending container requests has diverged

2019-10-30 Thread Till Rohrmann
Hi Regina, sorry for not getting back to you earlier. I've gone through the logs and I couldn't find anything suspicious. What I can see, though, is the following: When you start the cluster, you submit a couple of jobs. This starts at 9:20. In total, 120 slots are required to run these jobs.

Re: The RMClient's and YarnResourceManagers internal state about the number of pending container requests has diverged

2019-10-29 Thread Yang Wang
Hi Chan, If it is a bug, I think it is critical. Could you share the job manager logs with me too? I have some time to analyze and hope to find the root cause. Best, Yang Chan, Regina wrote on Wed, Oct 30, 2019 at 10:55 AM: > Till, were you able to find anything? Do you need more logs?

Re: The RMClient's and YarnResourceManagers internal state about the number of pending container requests has diverged

2019-10-26 Thread Till Rohrmann
Forget my last email. I received the one-time code and could access the logs. Cheers, Till On Sat, Oct 26, 2019 at 6:49 PM Till Rohrmann wrote: > Hi Regina, > > I couldn't access the log files because LockBox asked to create a new > password and now it asks me for the one-time code to confirm

Re: The RMClient's and YarnResourceManagers internal state about the number of pending container requests has diverged

2019-10-26 Thread Till Rohrmann
Hi Regina, I couldn't access the log files because LockBox asked to create a new password and now it asks me for the one-time code to confirm this change. It says that it will send the one-time code to my registered email, which I don't have. Cheers, Till On Fri, Oct 25, 2019 at 10:14 PM Till

Re: The RMClient's and YarnResourceManagers internal state about the number of pending container requests has diverged

2019-10-25 Thread Till Rohrmann
Great, thanks a lot, Regina. I'll check the logs tomorrow. If info level is not enough, then I'll let you know. Cheers, Till On Fri, Oct 25, 2019, 21:20 Chan, Regina wrote: > Till, I added you to this lockbox area where you should be able to > download the logs. You should have also received an

Re: The RMClient's and YarnResourceManagers internal state about the number of pending container requests has diverged

2019-10-25 Thread Till Rohrmann
Could you provide me with the full logs of the cluster entrypoint/JobManager? I'd like to see what's going on there. Cheers, Till On Fri, Oct 25, 2019, 19:10 Chan, Regina wrote: > Till, > > > > We're still seeing a large number of returned containers even with this > heartbeat set to

RE: The RMClient's and YarnResourceManagers internal state about the number of pending container requests has diverged

2019-10-25 Thread Chan, Regina
Till, We're still seeing a large number of returned containers even with this heartbeat set to something higher. Do you have hints as to what's going on? It seems to be bursty in nature. The bursty requests cause the job to fail with the cluster not having enough resources because it's in the

RE: The RMClient's and YarnResourceManagers internal state about the number of pending container requests has diverged

2019-10-23 Thread Chan, Regina
Yeah, thanks for the responses. We're in the process of testing 1.9.1 after we found https://issues.apache.org/jira/browse/FLINK-12342 as the cause of the original issue. FLINK-9455 makes sense as to why it didn't work on legacy mode. From: Till Rohrmann Sent: Wednesday, October 23, 2019 5:32

Re: The RMClient's and YarnResourceManagers internal state about the number of pending container requests has diverged

2019-10-23 Thread Till Rohrmann
Hi Regina, When using the FLIP-6 mode, you can control how long it takes for an idle TaskManager to be released via resourcemanager.taskmanager-timeout. By default it is set to 30s. In the Flink version you are using, 1.6.4, we do not support TaskManagers with multiple slots properly [1]. The
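As a minimal sketch of the suggestion above, the timeout could be raised in flink-conf.yaml. The option name is taken from the message; the value shown here is an illustrative example (not from the thread), and the option is typically specified in milliseconds:

```yaml
# flink-conf.yaml -- illustrative value, not from the thread.
# Keep idle TaskManagers around longer before their YARN containers
# are released (default is 30000 ms = 30s).
resourcemanager.taskmanager-timeout: 300000
```

Raising this trades faster resource release back to YARN against fewer container churn cycles for bursty workloads.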

Re: The RMClient's and YarnResourceManagers internal state about the number of pending container requests has diverged

2019-10-23 Thread Yang Wang
Hi Chan, After FLIP-6, the Flink ResourceManager dynamically allocates resources from YARN on demand. What's your Flink version? On the current code base, if the number of pending container requests in the resource manager is zero, then it will release all the excess containers. Could you please check the "Remaining
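The release behavior described above can be sketched as follows. This is a simplified, illustrative model (class and method names are made up, not Flink's actual code): when YARN hands back a batch of containers, any containers beyond the count of still-pending requests are treated as excess and returned to YARN.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the "release excess containers" logic described
// in the thread. Not Flink's real implementation; names are hypothetical.
public class ExcessContainerSketch {

    // Given the number of still-pending container requests and the batch of
    // containers YARN just allocated, return the containers that are excess
    // and should be handed back to YARN.
    static List<String> excessContainers(int pendingRequests, List<String> allocated) {
        List<String> toRelease = new ArrayList<>();
        // Keep one allocated container per pending request; the rest are excess.
        for (int i = pendingRequests; i < allocated.size(); i++) {
            toRelease.add(allocated.get(i));
        }
        return toRelease;
    }

    public static void main(String[] args) {
        List<String> allocated = List.of("c1", "c2", "c3");
        // With zero pending requests, every allocated container is excess.
        System.out.println(excessContainers(0, allocated)); // prints [c1, c2, c3]
        // With two pending requests, only the third container is excess.
        System.out.println(excessContainers(2, allocated)); // prints [c3]
    }
}
```

This also illustrates the failure mode the thread is about: if the ResourceManager's pending-request count has diverged from the client's, containers that are actually needed get classified as excess and returned.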