Hi Till,
    Thanks for taking this issue.

    We are not comfortable sending logs to an email list that is this open.
I'll send the logs to you directly.

Thanks,
Bowen


On Wed, Aug 9, 2017 at 2:46 AM, Till Rohrmann <trohrm...@apache.org> wrote:

> Hi Bowen,
>
> if I'm not mistaken, then Flink's current YARN implementation does not
> actively release containers. The `YarnFlinkResourceManager` is started
> with a fixed number of containers it always tries to acquire. If a
> container dies, it will request a new one.
>
> In case of a failure, all slots should be freed and then become available
> for rescheduling the new tasks. Thus, it is not necessarily the case that
> 12 new slots will be used, unless the old slots are no longer available
> (e.g. after a TM failure). Therefore, what you are describing sounds like
> a bug.
> Could you share the logs with us?
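> If it helps, once the application has finished, the aggregated YARN logs
> can usually be pulled with the `yarn` CLI (the application id below is a
> placeholder):
>
> ```
> # fetch the aggregated container logs for the Flink application
> yarn logs -applicationId application_XXXXXXXXXXXXX_XXXX
> ```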
>
> Cheers,
> Till
>
> On Wed, Aug 9, 2017 at 9:32 AM, Bowen Li <bowen...@offerupnow.com> wrote:
>
>> Hi guys,
>>     I was running a Flink job (parallelism of 12) on an EMR cluster with
>> 48 YARN slots. When the job started, I could see from the Flink UI that
>> it took 12 slots, leaving 36 slots available.
>>
>>     I would expect that when the job fails, it would restart from the
>> latest checkpoint, taking another 12 slots and freeing the original 12. 
>> *However, I observed that the job took new slots but never freed the
>> original ones. The Flink job ended up killed by YARN because there were
>> no available slots left.*
>>
>>      Here's the command I used to run the Flink job:
>>
>>      ```
>>      flink run -m yarn-cluster -yn 6 -ys 8 -ytm 40000  xxx.jar
>>      ```
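>>
>>      For reference, my reading of the flags (which is how the cluster
>>      ends up with the 6 x 8 = 48 slots mentioned above):
>>
>>      ```
>>      # -yn 6      -> request 6 YARN containers (one TaskManager each)
>>      # -ys 8      -> 8 task slots per TaskManager (6 x 8 = 48 slots)
>>      # -ytm 40000 -> 40000 MB of memory per TaskManager container
>>      flink run -m yarn-cluster -yn 6 -ys 8 -ytm 40000  xxx.jar
>>      ```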
>>
>>      Does anyone know what's going wrong?
>>
>> Thanks,
>> Bowen
>>
>
>
