Hi Till,

Thanks for taking this issue. We are not comfortable sending logs to an email list that is this open, so I'll send the logs to you directly.
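For reference, this is roughly how I'm pulling the logs to send over (a minimal sketch, assuming YARN log aggregation is enabled on our EMR cluster; the application ID below is a placeholder):

```
# Fetch the aggregated container logs for the Flink YARN application.
# application_XXXX_YYYY is a placeholder; the real ID shows up in the YARN
# ResourceManager UI or via `yarn application -list`.
yarn logs -applicationId application_XXXX_YYYY > flink-job.log
```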
Thanks,
Bowen

On Wed, Aug 9, 2017 at 2:46 AM, Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Bowen,
>
> If I'm not mistaken, Flink's current YARN implementation does not
> actively release containers. The `YarnFlinkResourceManager` is started
> with a fixed number of containers that it always tries to acquire. If a
> container dies, it will request a new one.
>
> In case of a failure, all slots should be freed and then be subject to
> the rescheduling of the new tasks. Thus, it is not necessarily the case
> that 12 new slots will be used unless the old slots are no longer
> available (failure of a TM). Therefore, what you are describing sounds
> like a bug. Could you share the logs with us?
>
> Cheers,
> Till
>
> On Wed, Aug 9, 2017 at 9:32 AM, Bowen Li <bowen...@offerupnow.com> wrote:
>
>> Hi guys,
>> I was running a Flink job (parallelism 12) on an EMR cluster with 48
>> YARN slots. When the job started, I could see from the Flink UI that
>> the job took 12 slots, leaving 36 slots available.
>>
>> I would expect that when the job fails, it restarts from the latest
>> checkpoint by taking another 12 slots and freeing the original 12
>> slots. *Well, I observed that the job took new slots but never freed
>> the original ones. The job ended up being killed by YARN because there
>> were no available slots left.*
>>
>> Here's the command I used to run the Flink job:
>>
>> ```
>> flink run -m yarn-cluster -yn 6 -ys 8 -ytm 40000 xxx.jar
>> ```
>>
>> Does anyone know what's going wrong?
>>
>> Thanks,
>> Bowen
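P.S. For anyone following along, here's the slot math behind the command above as I understand it (a sketch based on the Flink YARN CLI flags as of 1.3; worth double-checking against the docs for your version):

```
# -yn 6      -> request 6 YARN containers, i.e. 6 TaskManagers
# -ys 8      -> 8 task slots per TaskManager
# -ytm 40000 -> 40000 MB of memory per TaskManager container
# Total task slots: 6 TaskManagers x 8 slots = 48
flink run -m yarn-cluster -yn 6 -ys 8 -ytm 40000 xxx.jar
```

If the old 12 slots really are leaked on every restart, the arithmetic matches what I saw: 12 (original) + 12 + 12 + 12 = 48 slots in use after the third restart, so the next restart finds no free slots and the job dies.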