[ 
https://issues.apache.org/jira/browse/FLINK-14074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Kasyanenko updated FLINK-14074:
-----------------------------------------
    Description: 
Hi, I'm trying to launch multiple jobs in a Flink Session Cluster deployed on 
Mesos.
 Flink's version is 1.9.0.

The very first resource allocation completes successfully and the first 
submitted job launches, but submitting any number of jobs afterwards doesn't 
affect the cluster in any way, and no additional TaskManagers are allocated.

From the logs I see that MesosResourceManager is requesting slots for the 
newly submitted jobs: {{"o.a.f.m.r.c.MesosResourceManager - Request slot with 
profile ResourceProfile..."}}, but the line {{"Starting a new worker."}} 
appears in the log only as many times as the number of TaskManagers allocated 
for the first job.

I'm a complete noob in Flink internals, but I took a wild guess at the reason. 
I think that the problem is in this check: 
[https://github.com/apache/flink/blob/release-1.9.0/flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/MesosResourceManager.java#L436]

It might be that the RM is lazily allocated at the first call by a factory, 
and then a private final field {{slotsPerWorker}} is set. So this check would 
prevent the creation of any new worker once the iterator has completely 
iterated over this collection. My main assumption is that {{slotsPerWorker}} 
is never modified again.
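
To make the guess concrete, here is a minimal sketch of the kind of guard I 
have in mind. It is not the actual Flink code: the class {{GuardSketch}}, the 
{{Profile}} type and its matching rule are all hypothetical, and only the field 
name {{slotsPerWorker}} and the log line are taken from the report above. It 
only shows how a check against a field that is set once could reject a request 
without logging anything.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;

// Hypothetical model for illustration only, NOT the MesosResourceManager code.
public class GuardSketch {

    /** Stand-in for a resource profile; only the memory size is modelled. */
    static final class Profile {
        final int memoryMb;

        Profile(int memoryMb) {
            this.memoryMb = memoryMb;
        }

        /** Assumed rule: a request matches if it fits into this profile. */
        boolean isMatching(Profile requested) {
            return requested.memoryMb <= memoryMb;
        }
    }

    /** Set once in the constructor and never modified (the assumption above). */
    private final Collection<Profile> slotsPerWorker;
    private int startedWorkers = 0;

    GuardSketch(Collection<Profile> slotsPerWorker) {
        this.slotsPerWorker = slotsPerWorker;
    }

    Collection<Profile> startNewWorker(Profile requested) {
        // The guard: a non-matching request returns here without any log line.
        if (!slotsPerWorker.iterator().next().isMatching(requested)) {
            return Collections.emptyList();
        }
        startedWorkers++;
        System.out.println("Starting a new worker.");
        return slotsPerWorker;
    }

    int startedWorkers() {
        return startedWorkers;
    }

    public static void main(String[] args) {
        GuardSketch rm = new GuardSketch(
                Arrays.asList(new Profile(1024), new Profile(1024)));
        rm.startNewWorker(new Profile(512));   // passes the guard
        rm.startNewWorker(new Profile(4096));  // rejected silently
        System.out.println("workers started: " + rm.startedWorkers());
    }
}
```

In this sketch the second request leaves no trace in the output, which would 
look the same as what I see in the logs. Whether the real check behaves like 
this is exactly what I still need to verify.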

 

I'm sorry that I didn't do much investigation before reporting, but I'll try 
to do some after the weekend. I plan to build Flink without this check and see 
if it helps. I'll also play around with the tests for this RM. Since it's my 
first time working with Flink internals, I'll be back after a few days.

Any help will be much appreciated.

Thanks in advance.

  was:
Hi, I'm trying to launch multiple jobs in a Flink Session Cluster deployed on 
Mesos.
 Flink's version is 1.9.0.

The very first resource allocation completes successfully and the first 
submitted job launches, but submitting any number of jobs afterwards doesn't 
affect the cluster in any way, and no additional TaskManagers are allocated.

From the logs I see that MesosResourceManager is requesting slots for the 
newly submitted jobs: {{"o.a.f.m.r.c.MesosResourceManager - Request slot with 
profile ResourceProfile..."}}, but the line {{"Starting a new worker."}} 
appears in the log only as many times as the number of TaskManagers allocated 
for the first job.

I'm a complete noob in Flink internals, but I took a wild guess at the reason. 
I think that the problem is in this check: 
[https://github.com/apache/flink/blob/release-1.9.0/flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/MesosResourceManager.java#L436]

It might be that the RM is lazily allocated at the first call by a factory, 
and then a private final field {{slotsPerWorker}} is set. So this check would 
prevent the creation of any new worker once the iterator has completely 
iterated over this collection. My main assumption is that {{slotsPerWorker}} 
is never modified again.

 

I'm sorry that I didn't do much investigation before reporting, but I'll try 
to do some after the weekend. I plan to build Flink without this check and see 
if it helps. I'll also play around with the tests for this RM. Since it's my 
first time working with Flink internals, I'll be back after a few days (it 
will take some time, and the country I'm currently in will have a national 
holiday).

Any help will be much appreciated.

Thanks in advance.


> MesosResourceManager can't create new taskmanagers in Session Cluster Mode.
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-14074
>                 URL: https://issues.apache.org/jira/browse/FLINK-14074
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Mesos
>    Affects Versions: 1.9.0
>         Environment: Flink HA Session cluster 1.9.0 on mesos.
>            Reporter: Alexander Kasyanenko
>            Priority: Major
>
> Hi, I'm trying to launch multiple jobs in a Flink Session Cluster deployed 
> on Mesos.
>  Flink's version is 1.9.0.
> The very first resource allocation completes successfully and the first 
> submitted job launches, but submitting any number of jobs afterwards doesn't 
> affect the cluster in any way, and no additional TaskManagers are allocated.
> From the logs I see that MesosResourceManager is requesting slots for the 
> newly submitted jobs: {{"o.a.f.m.r.c.MesosResourceManager - Request slot 
> with profile ResourceProfile..."}}, but the line {{"Starting a new worker."}} 
> appears in the log only as many times as the number of TaskManagers 
> allocated for the first job.
> I'm a complete noob in Flink internals, but I took a wild guess at the 
> reason. I think that the problem is in this check: 
> [https://github.com/apache/flink/blob/release-1.9.0/flink-mesos/src/main/java/org/apache/flink/mesos/runtime/clusterframework/MesosResourceManager.java#L436]
> It might be that the RM is lazily allocated at the first call by a factory, 
> and then a private final field {{slotsPerWorker}} is set. So this check 
> would prevent the creation of any new worker once the iterator has 
> completely iterated over this collection. My main assumption is that 
> {{slotsPerWorker}} is never modified again.
>  
> I'm sorry that I didn't do much investigation before reporting, but I'll 
> try to do some after the weekend. I plan to build Flink without this check 
> and see if it helps. I'll also play around with the tests for this RM. Since 
> it's my first time working with Flink internals, I'll be back after a few 
> days.
> Any help will be much appreciated.
> Thanks in advance.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
