[DISCUSS] Allow streaming operators to use managed memory

Jark Wu Tue, 22 Dec 2020 01:15:10 -0800

Hi all,

I found that managed memory can currently be used by only three kinds of
workloads [1]:
- state backends for streaming jobs
- sorting, hash tables for batch jobs
- python UDFs


The configuration option `taskmanager.memory.managed.consumer-weights`
accepts only two consumer types: PYTHON and DATAPROC (state backends in
streaming, or algorithms in batch).
I don't understand why streaming operators are not allowed to use managed
memory for purposes other than state backends.
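
For reference, this is roughly how the option looks in flink-conf.yaml today.
The weights shown are what I believe to be the 1.12 defaults, so please treat
the numbers as an assumption rather than a recommendation:

```yaml
# Relative weights for splitting managed memory between consumer types.
# Only DATAPROC and PYTHON are accepted; there is no entry for
# non-state usage by streaming operators.
taskmanager.memory.managed.consumer-weights: DATAPROC:70,PYTHON:30
```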

The background is that we are planning to use some batch algorithms
(sorting and the bytes hash table) to improve the performance of streaming
SQL operators, especially the mini-batch operators.
Currently, the mini-batch operators buffer input records and accumulators
on the heap (i.e. in a Java HashMap), which is inefficient and carries the
risk of full GCs and OOM errors.
With managed memory, an operator could buffer more data within a fixed
memory budget without risking OOM, which should improve performance
noticeably.
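
To make the idea concrete, here is a minimal sketch of a mini-batch buffer
that flushes when a fixed byte budget is exhausted instead of growing without
bound. It is not Flink code: the class `BudgetedMiniBatchBuffer`, its methods,
and the crude size estimate are all hypothetical, and the plain long budget
only stands in for a managed memory reservation. A real implementation would
store the data in memory segments rather than in a HashMap.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not Flink code: a mini-batch buffer bounded by a byte budget. */
public class BudgetedMiniBatchBuffer {
    // Stand-in for a managed memory reservation (assumption for illustration).
    private final long budgetBytes;
    // Per-key accumulators of the current mini-batch (here: a running sum).
    private final Map<String, Long> sums = new HashMap<>();
    // Batches that have been flushed; a real operator would emit them downstream.
    private final List<Map<String, Long>> flushed = new ArrayList<>();
    private long usedBytes = 0;

    public BudgetedMiniBatchBuffer(long budgetBytes) {
        this.budgetBytes = budgetBytes;
    }

    /** Accumulates a record; flushes the current batch first if a new key would not fit. */
    public void add(String key, long value) {
        if (!sums.containsKey(key)) {
            long cost = estimateBytes(key);
            if (usedBytes + cost > budgetBytes && !sums.isEmpty()) {
                flush();
            }
            usedBytes += cost;
        }
        sums.merge(key, value, Long::sum);
    }

    /** Moves the current batch to the flushed list and resets the budget usage. */
    public void flush() {
        if (!sums.isEmpty()) {
            flushed.add(new HashMap<>(sums));
            sums.clear();
            usedBytes = 0;
        }
    }

    public List<Map<String, Long>> flushedBatches() {
        return flushed;
    }

    // Crude size estimate: two bytes per key character plus one long accumulator.
    private static long estimateBytes(String key) {
        return 2L * key.length() + Long.BYTES;
    }
}
```

The point of the sketch is only the flush trigger: the batch size is governed
by a memory budget rather than by whatever the heap happens to tolerate.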

What do you think about allowing streaming operators to use managed memory
and exposing this in the configuration?

Best,
Jark

[1]:
https://ci.apache.org/projects/flink/flink-docs-master/deployment/memory/mem_setup_tm.html#managed-memory
