[ https://issues.apache.org/jira/browse/FLINK-14074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16932634#comment-16932634 ]
Till Rohrmann commented on FLINK-14074:
---------------------------------------

Thanks a lot for looking into the problem in more depth [~Atlaster].

In the {{YarnResourceManager}} we always check whether a new container needs to be started if a container/task stops or cannot get started via: https://github.com/apache/flink/blob/master/flink-yarn/src/main/java/org/apache/flink/yarn/YarnResourceManager.java#L510.

Since we restart failed Mesos tasks in https://github.com/apache/flink/blob/master/flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/MesosResourceManager.java#L668, the only ways I can think of in which we could "lose" Mesos tasks are a failure at start-up that is not reported, or a problem with the matching of pending slots in https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/slotmanager/SlotManagerImpl.java#L594. To check the latter, it would be interesting to see which slot profile the {{TaskManagers}} are started with and which profile the slots report when they register.

> MesosResourceManager can't create new taskmanagers in Session Cluster Mode.
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-14074
>                 URL: https://issues.apache.org/jira/browse/FLINK-14074
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Mesos
>    Affects Versions: 1.9.0
>         Environment: Flink HA Session cluster 1.9.0 on mesos.
>            Reporter: Alexander Kasyanenko
>            Priority: Major
>
> Hi, I'm trying to launch multiple jobs in a Flink Session Cluster deployed on
> Mesos.
> Flink's version is 1.9.0.
> The very first resource allocation completes successfully and the first
> submitted job launches, but submitting any number of jobs afterwards doesn't
> affect the cluster in any way, and no additional TaskManagers are allocated.
> From the logs I see that MesosResourceManager is requesting slots for the
> newly submitted jobs: {{"o.a.f.m.r.c.MesosResourceManager - Request slot
> with profile ResourceProfile..."}}, but the line {{"Starting a new worker."}}
> appears in the log only as many times as the number of TaskManagers
> allocated for the first job.
> I'm a complete noob in Flink internals, but I took a wild guess at the reason.
> I think that the problem is in this check:
> [https://github.com/apache/flink/blob/release-1.9.0/flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/MesosResourceManager.java#L436]
> It might be that the RM is lazily allocated at the first call by a factory,
> and then a private final field {{slotsPerWorker}} is set. So this check will
> prevent creation of any new worker after the iterator traverses the entire
> collection. My main assumption is that {{slotsPerWorker}} is never modified
> again.
>
> I'm sorry that I didn't do much investigation before reporting, but I'll
> try to do some after the weekend. I plan to build Flink without this check
> and see if it helps. I'll also play around with the tests for this RM. Since
> it's my first time working with Flink internals, I'll be back after a few
> days.
> Any help will be much appreciated.
> Thanks in advance.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
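
To make the profile-matching question from the comment concrete, here is a minimal, self-contained Java sketch of a profile check guarding worker start-up. Every name in it ({{SlotProfileMatchSketch}}, {{Profile}}, the CPU and memory numbers) is a hypothetical stand-in and not Flink's actual classes or logic. It only illustrates how a mismatch between the profile a slot request asks for and the profile the workers are configured with could, if it occurred, produce the reported symptom: the slot request is made, but "Starting a new worker." is never logged for it.

```java
import java.util.Collections;
import java.util.List;

public class SlotProfileMatchSketch {

    /** Hypothetical, simplified stand-in for a slot resource profile. */
    static final class Profile {
        final double cpuCores;
        final int memoryMb;

        Profile(double cpuCores, int memoryMb) {
            this.cpuCores = cpuCores;
            this.memoryMb = memoryMb;
        }

        /** True if this profile offers at least what the required profile asks for. */
        boolean isMatching(Profile required) {
            return cpuCores >= required.cpuCores && memoryMb >= required.memoryMb;
        }
    }

    /**
     * Simplified model of a guard in front of worker start-up: a worker is
     * started only if the slots it would offer can satisfy the request.
     * Returns the slot profiles the new worker would provide, or an empty
     * list if no worker is started.
     */
    static List<Profile> startNewWorker(List<Profile> slotsPerWorker, Profile requested) {
        if (!slotsPerWorker.get(0).isMatching(requested)) {
            // No log line, no worker: the request simply stays unserved.
            return Collections.emptyList();
        }
        System.out.println("Starting a new worker.");
        return slotsPerWorker;
    }

    public static void main(String[] args) {
        // Assumed configuration: two slots per worker, 1 core and 1024 MB each.
        List<Profile> slotsPerWorker = Collections.nCopies(2, new Profile(1.0, 1024));

        // A request the configured slots can satisfy starts a worker.
        List<Profile> served = startNewWorker(slotsPerWorker, new Profile(1.0, 1024));
        System.out.println("slots provided: " + served.size());

        // A request asking for more memory than a slot offers is dropped.
        List<Profile> dropped = startNewWorker(slotsPerWorker, new Profile(1.0, 2048));
        System.out.println("slots provided: " + dropped.size());
    }
}
```

Comparing the profile logged in the "Request slot with profile ..." line with the profile the TaskManagers report on registration would show whether anything like the second case is happening on the real cluster.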