Thanks for this proposal, Guowei. +1 for it. Concerning the default size, maybe we can run some experiments and see how the system behaves with different pool sizes.
Cheers,
Till

On Fri, Mar 5, 2021 at 2:45 PM Stephan Ewen <se...@apache.org> wrote:

> Thanks Guowei, for the proposal.
>
> As discussed offline already, I think this sounds good.
>
> One thought is that 16m sounds very small for a default read buffer pool.
> How risky do you think it is to increase this to 32m or 64m?
>
> Best,
> Stephan
>
> On Fri, Mar 5, 2021 at 4:33 AM Guowei Ma <guowei....@gmail.com> wrote:
>
>> Hi, all
>>
>> In Flink 1.12 we introduced the TM merge shuffle. However, the
>> out-of-the-box experience of using the TM merge shuffle is not very good.
>> The main reason is that the default configuration tends to make users
>> encounter OOMs [1]. We therefore propose to introduce a managed memory
>> pool for the TM merge shuffle to avoid the problem.
>>
>> Goals
>>
>> 1. Do not affect streaming and pipelined-shuffle-only batch setups.
>> 2. Do not mix memory with different life cycles in the same pool, e.g.,
>> write buffers needed by running tasks and read buffers needed even after
>> tasks have finished.
>> 3. Users can use the TM merge shuffle with the default memory
>> configuration. (Further tuning may be needed for performance
>> optimization, but jobs should not fail with the default configuration.)
>>
>> Proposal
>>
>> 1. Introduce a configuration option
>> `taskmanager.memory.network.batch-read` to specify the size of this
>> memory pool. The default value is 16m.
>> 2. Allocate the pool lazily, i.e., the memory pool is allocated when the
>> TM merge shuffle is used for the first time.
>> 3. The pool size is not added to the TM's total memory size, but is
>> considered part of `taskmanager.memory.framework.off-heap.size`. We need
>> to check that the pool size is not larger than the framework off-heap
>> size if the TM merge shuffle is enabled.
>>
>> With this default configuration, allocating the memory pool is almost
>> impossible to fail. Currently the default framework off-heap memory is
>> 128m, which is mainly used by Netty.
>> But after we introduced zero copy, its usage has been reduced; see the
>> detailed data in [2].
>>
>> Known Limitations
>>
>> Usability of increasing the memory pool size
>>
>> In addition to increasing `taskmanager.memory.network.batch-read`, the
>> user may also need to adjust
>> `taskmanager.memory.framework.off-heap.size` at the same time. This
>> means that if the user forgets the latter, the check performed when
>> allocating the memory pool is likely to fail.
>>
>> So in the following two situations, we will still prompt the user to
>> increase `taskmanager.memory.framework.off-heap.size`:
>>
>> 1. `taskmanager.memory.network.batch-read` is bigger than
>> `taskmanager.memory.framework.off-heap.size`.
>> 2. Allocating the pool encounters an OOM.
>>
>> An alternative is to have the system adjust the framework off-heap size
>> automatically when the user adjusts the pool size. But we are not
>> entirely sure about this, given its implicitness and the risk of
>> complicating the memory configuration.
>>
>> Potential memory waste
>>
>> In the first step, the memory pool will not be released once allocated.
>> This means that, even if there is no subsequent batch job, the pooled
>> memory cannot be used by other consumers.
>>
>> We are not releasing the pool in the first step due to the concern that
>> frequently allocating/deallocating the entire pool may increase GC
>> pressure. Investigating how to dynamically release the pool when it is
>> no longer needed is considered a future follow-up.
>>
>> Looking forward to your feedback.
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-20740
>> [2] https://github.com/apache/flink/pull/7368
>>
>> Best,
>> Guowei
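
As a rough illustration of the mechanics in the proposal (lazy allocation on first use, and the check that the pool fits inside the framework off-heap budget), a sketch could look like the following. All class, method, and field names here are hypothetical, for illustration only, and are not Flink's actual internals:

```java
// Sketch of the proposed batch-read memory pool: checked against the framework
// off-heap budget at construction, allocated lazily on first use.
// Names are hypothetical illustrations, not Flink's real classes.
public class BatchReadMemoryPool {

    // Defaults mirroring the proposal: 16m pool, 128m framework off-heap.
    static final long DEFAULT_POOL_BYTES = 16L * 1024 * 1024;
    static final long DEFAULT_FRAMEWORK_OFF_HEAP_BYTES = 128L * 1024 * 1024;

    private final long poolBytes;
    private byte[][] segments; // null until first use (lazy allocation)

    public BatchReadMemoryPool(long poolBytes, long frameworkOffHeapBytes) {
        // Proposal item 3: the pool is accounted as part of
        // taskmanager.memory.framework.off-heap.size, so it must not exceed it.
        if (poolBytes > frameworkOffHeapBytes) {
            throw new IllegalArgumentException(
                "taskmanager.memory.network.batch-read (" + poolBytes
                    + " bytes) must not exceed"
                    + " taskmanager.memory.framework.off-heap.size ("
                    + frameworkOffHeapBytes + " bytes)");
        }
        this.poolBytes = poolBytes;
    }

    // Proposal item 2: allocate the pool only when the TM merge shuffle
    // actually uses it for the first time.
    public synchronized byte[][] getOrAllocate() {
        if (segments == null) {
            int segmentSize = 32 * 1024; // hypothetical segment size
            int numSegments = (int) (poolBytes / segmentSize);
            segments = new byte[numSegments][];
            for (int i = 0; i < numSegments; i++) {
                segments[i] = new byte[segmentSize];
            }
        }
        return segments;
    }

    public boolean isAllocated() {
        return segments != null;
    }
}
```

With the proposed defaults, construction succeeds (16m fits in 128m) and nothing is allocated until `getOrAllocate()` is first called; a pool configured larger than the framework off-heap size fails fast at the check rather than with an OOM later.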