Hi Robert,

the problem here is that YARN's scheduler (YARN has several: FIFO, CapacityScheduler, ...) is not giving Flink's ApplicationMaster/JobManager all the containers it requests. After increasing the size of the AM/JM container, there is probably not enough memory left to fit the last TaskManager container.

I ran into the same issue when I wanted to run a Flink job on YARN: in theory the containers should have fit, but YARN did not give me all the containers I requested. Back then I asked on the yarn-dev list [1] (there were also some off-list emails), but we could not resolve the issue.
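To illustrate how a single extra MB could tip things over, here is a rough sketch of the arithmetic. It rests on assumptions I cannot verify from here: that the scheduler rounds each request up to a multiple of yarn.scheduler.minimum-allocation-mb (1024 MB by default, as far as I know), that the AM shares a node with one TaskManager, and that each NodeManager offers 57344 MB. That capacity figure and the helper names are made up for the example, so please substitute the values from your yarn-site.xml.

```python
def normalize_mb(requested_mb, min_alloc_mb=1024):
    """Round a memory request up to the next multiple of the minimum allocation.

    Assumption: the scheduler normalizes requests this way; 1024 MB is the
    default yarn.scheduler.minimum-allocation-mb, your cluster may differ.
    """
    multiples = -(-requested_mb // min_alloc_mb)  # integer ceiling division
    return max(1, multiples) * min_alloc_mb


def fits_on_node(jm_mb, tm_mb, node_capacity_mb, min_alloc_mb=1024):
    """Check whether one JM container and one TM container fit on one node."""
    used = normalize_mb(jm_mb, min_alloc_mb) + normalize_mb(tm_mb, min_alloc_mb)
    return used <= node_capacity_mb


# Hypothetical NodeManager capacity (yarn.nodemanager.resource.memory-mb),
# NOT taken from the cluster in this thread.
NODE_CAPACITY_MB = 57344

for jm_mb in (16384, 16385):
    print(jm_mb, normalize_mb(jm_mb), fits_on_node(jm_mb, 40960, NODE_CAPACITY_MB))
```

Under these assumptions the 16385 MB request is normalized to 17408 MB, and together with a 40960 MB TaskManager it no longer fits on the shared node, which would leave exactly one TaskManager container unallocated.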
Can you check the resource manager logs? Maybe there is a log message which explains why the container request of Flink's AM is not fulfilled.

[1] http://search-hadoop.com/m/AsBtCilK5r1pKLjf1&subj=Re+QUESTION+Allocating+a+full+YARN+cluster

On Wed, Sep 30, 2015 at 5:02 PM, Robert Schmidtke <ro.schmid...@gmail.com> wrote:

> It's me again. This is a strange issue, I hope I managed to find the right
> keywords. I got 8 machines, 1 for the JM, the other 7 are TMs with 64G of
> memory each.
>
> When running my job like so:
>
> $FLINK_HOME/bin/flink run -m yarn-cluster -yjm 16384 -ytm 40960 -yn 7 .....
>
> The job completes without any problems. When running it like so:
>
> $FLINK_HOME/bin/flink run -m yarn-cluster -yjm 16385 -ytm 40960 -yn 7 .....
>
> (note the one more M of memory for the JM), the execution stalls,
> continuously reporting:
>
> .....
> TaskManager status (6/7)
> TaskManager status (6/7)
> TaskManager status (6/7)
> .....
>
> I did some poking around, but I couldn't find any direct correlation with
> the code.
>
> The JM log says:
>
> .....
> 16:49:01,893 INFO  org.apache.flink.yarn.ApplicationMaster$ - JVM Options:
> 16:49:01,893 INFO  org.apache.flink.yarn.ApplicationMaster$ - -Xmx12289M
> .....
>
> but then continues to report
>
> .....
> 16:52:59,311 INFO  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1 - The user requested 7 containers, 6 running. 1 containers missing
> 16:52:59,831 INFO  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1 - The user requested 7 containers, 6 running. 1 containers missing
> 16:53:00,351 INFO  org.apache.flink.yarn.ApplicationMaster$$anonfun$2$$anon$1 - The user requested 7 containers, 6 running. 1 containers missing
> .....
>
> forever until I cancel the job.
>
> If you have any ideas I'm happy to try them out. Thanks in advance for any
> hints! Cheers.
>
> Robert
> --
> My GPG Key ID: 336E2680