+1 for allowing streaming operators to use managed memory. As for the consumer names, I'm afraid using `DATAPROC` for both streaming ops and state backends will not work. Currently, RocksDB state backend uses a shared piece of memory for all the states within that slot. It's not the operator's decision how much memory it uses for the states.
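To make the scope issue concrete, here is a rough sketch of how a slot's managed memory could be split by consumer weight. This is not Flink's actual code: the class `ManagedMemoryShares`, the method `split`, and the 70/30 weights are made up for illustration. It only assumes that memory is divided in proportion to the weights of the consumer keys present in the slot.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only; not part of Flink.
class ManagedMemoryShares {

    // Hypothetical helper: splits one slot's managed memory among the
    // consumer keys present in that slot, in proportion to their weights.
    static Map<String, Long> split(long slotManagedBytes,
                                   Map<String, Integer> weights,
                                   List<String> consumersInSlot) {
        int total = 0;
        for (String consumer : consumersInSlot) {
            total += weights.getOrDefault(consumer, 0);
        }
        Map<String, Long> shares = new LinkedHashMap<>();
        for (String consumer : consumersInSlot) {
            int weight = weights.getOrDefault(consumer, 0);
            shares.put(consumer, total == 0 ? 0L : slotManagedBytes * weight / total);
        }
        return shares;
    }

    public static void main(String[] args) {
        // Example weights, assumed for this sketch.
        Map<String, Integer> weights = Map.of("DATAPROC", 70, "PYTHON", 30);

        // Both keys present: memory is split 70/30.
        System.out.println(split(1000L, weights, List.of("DATAPROC", "PYTHON")));

        // State backend and streaming operators both under DATAPROC:
        // there is only one share, and nothing says how to divide it.
        System.out.println(split(1000L, weights, List.of("DATAPROC")));
    }
}
```

With a single `DATAPROC` key, the state backend (slot scope) and the operators (operator scope) would land in the same share, and the weights give no way to say how much of it each side gets.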
I would suggest the following (IIUC, the same as what Jark proposed):

* `OPERATOR` for both streaming and batch operators
* `STATE_BACKEND` for state backends
* `PYTHON` for Python processes
* `DATAPROC` as a legacy key for state backends or batch operators if
  `STATE_BACKEND` or `OPERATOR` is not specified

Thank you~

Xintong Song

On Tue, Jan 5, 2021 at 11:23 AM Jark Wu <imj...@gmail.com> wrote:

> Hi Aljoscha,
>
> I think we may need to divide `DATAPROC` into `OPERATOR` and
> `STATE_BACKEND`, because they have different scope (slot vs. operator).
> But @Xintong Song <tonysong...@gmail.com> may have more insights on it.
>
> Best,
> Jark
>
> On Mon, 4 Jan 2021 at 20:44, Aljoscha Krettek <aljos...@apache.org> wrote:
>
>> I agree, we should allow streaming operators to use managed memory for
>> other use cases.
>>
>> Do you think we need an additional "consumer" setting or that they would
>> just use `DATAPROC` and decide by themselves what to use the memory for?
>>
>> Best,
>> Aljoscha
>>
>> On 2020/12/22 17:14, Jark Wu wrote:
>> >Hi all,
>> >
>> >I found that currently the managed memory can only be used in 3
>> >workloads [1]:
>> >- state backends for streaming jobs
>> >- sorting, hash tables for batch jobs
>> >- python UDFs
>> >
>> >And the configuration option
>> >`taskmanager.memory.managed.consumer-weights` only allows values:
>> >PYTHON and DATAPROC (state in streaming or algorithms in batch).
>> >I'm confused why it doesn't allow streaming operators to use managed
>> >memory for purposes other than state backends.
>> >
>> >The background is that we are planning to use some batch algorithms
>> >(sorting & bytes hash table) to improve the performance of streaming
>> >SQL operators, especially for the mini-batch operators.
>> >Currently, the mini-batch operators are buffering input records and
>> >accumulators in heap (i.e. Java HashMap), which is not efficient, and
>> >there are potential risks of full GC and OOM.
>> >With the managed memory, we can fully use the memory to buffer more
>> >data without worrying about OOM and improve the performance a lot.
>> >
>> >What do you think about allowing streaming operators to use managed
>> >memory and exposing it in configuration?
>> >
>> >Best,
>> >Jark
>> >
>> >[1]:
>> >https://ci.apache.org/projects/flink/flink-docs-master/deployment/memory/mem_setup_tm.html#managed-memory