Re: [DISCUSS] Allow streaming operators to use managed memory

Xintong Song Mon, 04 Jan 2021 20:13:05 -0800

+1 for allowing streaming operators to use managed memory.

As for the consumer names, I'm afraid using `DATAPROC` for both streaming
ops and state backends will not work. Currently, RocksDB state backend uses
a shared piece of memory for all the states within that slot. It's not the
operator's decision how much memory it uses for the states.


I would suggest the following. (IIUC, the same as what Jark proposed)
* `OPERATOR` for both streaming and batch operators
* `STATE_BACKEND` for state backends
* `PYTHON` for python processes
* `DATAPROC` as a legacy key for state backend or batch operators if
`STATE_BACKEND` or `OPERATOR` are not specified.
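
To make the legacy fallback concrete, here is a minimal sketch of how the
lookup could behave. It is only an illustration, not Flink code: the class
`ConsumerWeights` and its `parse` / `resolve` methods are hypothetical, and the
`NAME:weight,NAME:weight` value format is assumed.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of Flink.
public class ConsumerWeights {

    // Parses a value such as "DATAPROC:70,PYTHON:30" (format assumed).
    static Map<String, Integer> parse(String value) {
        Map<String, Integer> weights = new HashMap<>();
        for (String entry : value.split(",")) {
            if (entry.trim().isEmpty()) {
                continue; // tolerate empty input and trailing commas
            }
            String[] kv = entry.trim().split(":");
            if (kv.length != 2) {
                throw new IllegalArgumentException("Malformed entry: " + entry);
            }
            weights.put(kv[0].trim(), Integer.parseInt(kv[1].trim()));
        }
        return weights;
    }

    // Returns the weight for a consumer. OPERATOR and STATE_BACKEND fall
    // back to the legacy DATAPROC key when they are not set explicitly.
    static Optional<Integer> resolve(Map<String, Integer> weights, String consumer) {
        Integer weight = weights.get(consumer);
        boolean hasLegacyFallback =
                consumer.equals("OPERATOR") || consumer.equals("STATE_BACKEND");
        if (weight == null && hasLegacyFallback) {
            weight = weights.get("DATAPROC");
        }
        return Optional.ofNullable(weight);
    }

    public static void main(String[] args) {
        // An old-style configuration that only knows the legacy key.
        Map<String, Integer> legacy = parse("DATAPROC:70,PYTHON:30");
        System.out.println("OPERATOR      -> " + resolve(legacy, "OPERATOR"));
        System.out.println("STATE_BACKEND -> " + resolve(legacy, "STATE_BACKEND"));
        System.out.println("PYTHON        -> " + resolve(legacy, "PYTHON"));
    }
}
```

With this shape, existing configurations that only set `DATAPROC` keep working,
while new configurations can weight `OPERATOR` and `STATE_BACKEND` separately.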

Thank you~

Xintong Song



On Tue, Jan 5, 2021 at 11:23 AM Jark Wu <[email protected]> wrote:

> Hi Aljoscha,
>
> I think we may need to divide `DATAPROC` into `OPERATOR` and
> `STATE_BACKEND`, because they have different scope (slot vs. operator).
> But @Xintong Song <[email protected]> may have more insights on it.
>
> Best,
> Jark
>
>
> On Mon, 4 Jan 2021 at 20:44, Aljoscha Krettek <[email protected]> wrote:
>
>> I agree, we should allow streaming operators to use managed memory for
>> other use cases.
>>
>> Do you think we need an additional "consumer" setting or that they would
>> just use `DATAPROC` and decide by themselves what to use the memory for?
>>
>> Best,
>> Aljoscha
>>
>> On 2020/12/22 17:14, Jark Wu wrote:
>> >Hi all,
>> >
>> >I found that currently the managed memory can only be used in 3 workloads
>> >[1]:
>> >- state backends for streaming jobs
>> >- sorting, hash tables for batch jobs
>> >- python UDFs
>> >
>> >And the configuration option `taskmanager.memory.managed.consumer-weights`
>> >only allows values: PYTHON and DATAPROC (state in streaming or algorithms
>> >in batch).
>> >I'm confused why it doesn't allow streaming operators to use managed memory
>> >for purposes other than state backends.
>> >
>> >The background is that we are planning to use some batch algorithms
>> >(sorting & bytes hash table) to improve the performance of streaming SQL
>> >operators, especially for the mini-batch operators.
>> >Currently, the mini-batch operators buffer input records and
>> >accumulators on the heap (i.e. in a Java HashMap), which is inefficient
>> >and carries a risk of full GC and OOM.
>> >With the managed memory, we can fully use the memory to buffer more data
>> >without worrying about OOM and improve the performance a lot.
>> >
>> >What do you think about allowing streaming operators to use managed memory
>> >and exposing it in the configuration?
>> >
>> >Best,
>> >Jark
>> >
>> >[1]:
>> >
>> https://ci.apache.org/projects/flink/flink-docs-master/deployment/memory/mem_setup_tm.html#managed-memory
>>
>
