Concerning isolation, I think ideally we want everything to be isolated
between jobs running in the same cluster (i.e., between slots in the same
TM). Unfortunately, this is impractical.
- Heap / off-heap memory is allocated and deallocated directly through the
JVM / OS. Flink does not have a good way to cap its usage per slot.
- Network memory does not have the property that managed memory has, namely
that a job can adapt to whatever amount of managed memory it is given
(above a very small minimum), as illustrated in the sketch below. We are
trying to move network memory in that direction, and once that is achieved
it can be isolated as well.
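
To make the "adapts to any given amount" property concrete, here is a purely
hypothetical sketch (these are not Flink's actual interfaces, just an
illustration of the contract): a managed-memory consumer sizes itself to
whatever budget it is granted, while a network-style consumer has a
workload-dependent demand (e.g., buffers per channel) and therefore cannot be
capped at an arbitrary per-slot amount today.

// Hypothetical illustration only; these are not Flink's actual interfaces.
public class AdaptVsFixedDemand {

    interface MemoryConsumer {
        void open(long grantedBytes);
    }

    // Managed-memory-style consumer: adapts to any grant above a tiny minimum,
    // so a hard per-slot cap is always respected.
    static class AdaptiveConsumer implements MemoryConsumer {
        static final long MIN_BYTES = 1L << 20; // assumed 1 MiB floor

        @Override
        public void open(long grantedBytes) {
            if (grantedBytes < MIN_BYTES) {
                throw new IllegalArgumentException("grant below minimum");
            }
            long sortBuffer = grantedBytes / 2;          // sized as a share of the grant
            long hashTable = grantedBytes - sortBuffer;  // ditto; never exceeds the grant
        }
    }

    // Network-memory-style consumer: its demand is dictated by the workload
    // (channels x buffers per channel), so an arbitrary cap may simply be too small.
    static class FixedDemandConsumer implements MemoryConsumer {
        final int channels;
        final long bytesPerChannel;

        FixedDemandConsumer(int channels, long bytesPerChannel) {
            this.channels = channels;
            this.bytesPerChannel = bytesPerChannel;
        }

        @Override
        public void open(long grantedBytes) {
            long required = (long) channels * bytesPerChannel;
            if (grantedBytes < required) {
                throw new IllegalStateException("not enough network buffers"); // cannot adapt
            }
        }
    }

    public static void main(String[] args) {
        long slotBudget = 64L << 20; // an arbitrary per-slot cap of 64 MiB
        new AdaptiveConsumer().open(slotBudget);                  // always fine
        new FixedDemandConsumer(2000, 64 << 10).open(slotBudget); // fails: needs ~125 MiB
    }
}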
I think not being able to isolate every kind of memory does not mean we
should give up isolation for all kinds of memory. And I believe "managed
memory is isolated and the others are not" is much easier for users to
understand than "part of the managed memory is isolated and the rest is
not".

By waste, I meant reserving a certain amount of memory that is only used by
certain use cases which do not always exist. This is exactly what we wanted
to avoid with managed memory in FLIP-49 [1]. We used to have managed memory
that was only used by batch operators, and a containerized cut-off memory
(something similar to framework.off-heap) for the RocksDB state backend.
The problem was that, if the user did not change the configuration when
switching between streaming and batch jobs, some memory (managed or
cut-off) was always wasted. Similarly, introducing a shared managed memory
zone means reserving one more dedicated part of memory that can go to waste
in many cases. That is probably a necessary price for the new feature, but
let's not break the concept / properties of managed memory for it.
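
To make the waste argument concrete, a back-of-the-envelope sketch with
made-up numbers (the option name is taken from the proposal, the sizes are
not):

// Illustrative arithmetic only; sizes are invented, not from the proposal.
public class SharedZoneWaste {
    public static void main(String[] args) {
        long managedTotal = 1024L << 20;  // assume 1024 MiB of managed memory per TM
        double sharedFraction = 0.25;     // hypothetical taskmanager.memory.managed.shared-fraction

        long sharedZone = (long) (managedTotal * sharedFraction); // reserved for cross-slot sharing
        long exclusive = managedTotal - sharedZone;               // split evenly across slots

        // A TM that happens to run only batch jobs (exclusive managed memory
        // consumers, no RocksDB) leaves the whole shared zone idle:
        System.out.printf("shared zone reserved: %d MiB (idle here)%n", sharedZone >> 20);
        System.out.printf("exclusive managed memory for slots: %d MiB%n", exclusive >> 20);
    }
}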

In your proposal, the fraction for the shared managed memory is 0 by
default. That means that to enable RocksDB memory sharing, users need to
manually increase the fraction anyway. Thus, whether the memory-sharing
RocksDB uses managed memory or unmanaged off-heap memory does not make a
significant difference for users of the new feature. I'd think of this as
"extra operational overhead for users of a certain new feature" vs.
"significant learning cost and potential behavior change for pretty much
all users". I'd be fine with having some shortcuts to simplify the
configuration on the user side for this new feature, but not with invading
the managed memory.
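
For what it's worth, the sharing mechanism itself does not depend on where
the memory is accounted. A minimal RocksJava sketch (paths and sizes are
made up, and this is not the proposed Flink implementation) of several
RocksDB instances plugging into one TM-wide cache and write buffer manager:

import org.rocksdb.BlockBasedTableConfig;
import org.rocksdb.Cache;
import org.rocksdb.LRUCache;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;
import org.rocksdb.WriteBufferManager;

// Minimal RocksJava sketch, not the proposed Flink code: one TM-wide LRUCache
// and WriteBufferManager bound the block cache + memtable memory of every
// RocksDB instance opened here, regardless of which slot / job it belongs to.
public class SharedRocksDbMemorySketch {

    public static void main(String[] args) throws RocksDBException {
        RocksDB.loadLibrary();

        long totalBytes = 512L << 20;                 // assumed TM-wide budget: 512 MiB
        Cache sharedCache = new LRUCache(totalBytes); // one block cache for the whole TM
        WriteBufferManager sharedWbm =
                new WriteBufferManager(totalBytes / 2, sharedCache); // memtables charged to the cache

        // Two instances standing in for state backends of different slots / jobs.
        RocksDB db1 = open("/tmp/rocksdb-slot-1", sharedCache, sharedWbm);
        RocksDB db2 = open("/tmp/rocksdb-slot-2", sharedCache, sharedWbm);

        db1.close();
        db2.close();
        // The shared objects must outlive every instance; closing one DB must
        // not release them (the lifecycle concern raised later in the thread).
        sharedWbm.close();
        sharedCache.close();
    }

    private static RocksDB open(String path, Cache cache, WriteBufferManager wbm)
            throws RocksDBException {
        BlockBasedTableConfig tableConfig = new BlockBasedTableConfig().setBlockCache(cache);
        Options options = new Options()
                .setCreateIfMissing(true)
                .setWriteBufferManager(wbm)
                .setTableFormatConfig(tableConfig);
        return RocksDB.open(options, path);
    }
}

Whether that budget is carved out of managed memory or out of
framework-off-heap / jvm-overhead is purely an accounting decision.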

Best,

Xintong


[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors

On Tue, Nov 15, 2022 at 5:46 PM Khachatryan Roman <
khachatryan.ro...@gmail.com> wrote:

> Thanks for your reply Xintong Song,
>
> Could you please elaborate on the following:
>
> > The proposed changes break several good properties that we designed for
> > managed memory.
> > 1. It's isolated across slots
> Just to clarify, any way to manage the memory efficiently while capping its
> usage
> will break the isolation. It's just a matter of whether it's managed memory
> or not.
> Do you see any reasons why unmanaged memory can be shared, and managed
> memory can not?
>
> > 2. It should never be wasted (unless there's nothing in the job that needs
> > managed memory)
> If I understand correctly, managed memory can already be wasted because it
> is divided evenly between slots, regardless of whether its consumers exist
> in a particular slot.
> And in general, even if every slot has RocksDB / python, equal consumption
> is not guaranteed.
> So this property would rather be fixed by the current proposal.
>
> > In addition, it further complicates configuration / computation logics of
> > managed memory.
> I think having multiple options overriding each other increases the
> complexity for the user. As for the computation, I think it's desirable to
> let Flink do it rather than users.
>
> Both approaches need some help from the TM for:
> - storing the shared resources (a static field in a class might be too
> dangerous: if the backend is loaded by the user class loader, memory will
> leak silently);
> - reading the configuration.
>
> Regards,
> Roman
>
>
> On Sun, Nov 13, 2022 at 11:24 AM Xintong Song <tonysong...@gmail.com>
> wrote:
>
> > I like the idea of sharing RocksDB memory across slots. However, I'm quite
> > concerned by the currently proposed approach.
> >
> > The proposed changes break several good properties that we designed for
> > managed memory.
> > 1. It's isolated across slots
> > 2. It should never be wasted (unless there's nothing in the job that needs
> > managed memory)
> > In addition, it further complicates the configuration / computation logic
> > of managed memory.
> >
> > As an alternative, I'd suggest introducing a variant of RocksDBStateBackend
> > that shares memory across slots and does not use managed memory. This
> > basically means the shared memory is not considered part of managed memory.
> > Users of this new feature would need to configure how much memory the
> > variant state backend should use, and probably also a larger
> > framework-off-heap / jvm-overhead memory. The latter might require a bit of
> > extra user effort compared to the current approach, but would avoid
> > significant complexity in the managed memory configuration and calculation
> > logic, which affects broader users.
> >
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Sat, Nov 12, 2022 at 1:21 AM Roman Khachatryan <ro...@apache.org>
> > wrote:
> >
> > > Hi John, Yun,
> > >
> > > Thank you for your feedback
> > >
> > > @John
> > >
> > > > It seems like operators would either choose isolation for the cluster's
> > > > jobs or they would want to share the memory between jobs.
> > > > I'm not sure I see the motivation to reserve only part of the memory for
> > > > sharing and allowing jobs to choose whether they will share or be
> > > > isolated.
> > >
> > > I see two related questions here:
> > >
> > > 1) Whether to allow mixed workloads within the same cluster.
> > > I agree that most likely all the jobs will have the same "sharing"
> > > requirement.
> > > So we can drop "state.backend.memory.share-scope" from the proposal.
> > >
> > > 2) Whether to allow different memory consumers to use shared or exclusive
> > > memory.
> > > Currently, only RocksDB is proposed to use shared memory. For python, it's
> > > non-trivial because it is job-specific.
> > > So we have to partition managed memory into shared/exclusive and therefore
> > > can NOT replace "taskmanager.memory.managed.shared-fraction" with some
> > > boolean flag.
> > >
> > > I think your question was about (1), just wanted to clarify why the
> > > shared-fraction is needed.
> > >
> > > @Yun
> > >
> > > > I am just curious whether this could really bring benefits to our users
> > > > with such complex configuration logic.
> > > I agree, and configuration complexity seems a common concern.
> > > I hope that removing "state.backend.memory.share-scope" (as proposed
> > > above) reduces the complexity.
> > > Please share any ideas of how to simplify it further.
> > >
> > > > Could you share some real experimental results?
> > > I did an experiment to verify that the approach is feasible,
> > > i.e. multiple jobs can share the same memory/block cache.
> > > But I guess that's not what you mean here? Do you have any experiments in
> > > mind?
> > >
> > > > BTW, as discussed before, I am not sure whether different lifecycles of
> > > > RocksDB state backends would affect the memory usage of the block cache
> > > > & write buffer manager in RocksDB.
> > > > Currently, all instances start and are destroyed nearly simultaneously;
> > > > this would change after we introduce this feature, with jobs scheduled
> > > > at different times.
> > > IIUC, the concern is that closing a RocksDB instance might close the
> > > BlockCache.
> > > I checked that manually and it seems to work as expected.
> > > And I think that behavior would contradict the sharing concept, as
> > > described in the documentation [1].
> > >
> > > [1]
> > > https://github.com/facebook/rocksdb/wiki/Block-Cache
> > >
> > > Regards,
> > > Roman
> > >
> > >
> > > On Wed, Nov 9, 2022 at 3:50 AM Yanfei Lei <fredia...@gmail.com> wrote:
> > >
> > > > Hi Roman,
> > > > Thanks for the proposal, this allows the state backend to make better
> > > > use of memory.
> > > >
> > > > After reading the ticket, I'm curious about some points:
> > > >
> > > > 1. Is the shared memory only for the state backend? If both
> > > > "taskmanager.memory.managed.shared-fraction: >0" and
> > > > "state.backend.rocksdb.memory.managed: false" are set at the same time,
> > > > will the shared memory be wasted?
> > > > 2. It's said that "Jobs 4 and 5 will use the same 750Mb of unmanaged
> > > > memory and will compete with each other" in the example; how is the
> > > > size of the unmanaged part calculated?
> > > > 3. For fine-grained resource management, the control of cpuCores and
> > > > taskHeapMemory can still work, right? And I am a little worried that
> > > > too many memory-related configuration options are complicated for users
> > > > to understand.
> > > >
> > > > Regards,
> > > > Yanfei
> > > >
> > > > On Tue, Nov 8, 2022 at 11:22 PM Roman Khachatryan <ro...@apache.org> wrote:
> > > >
> > > > > Hi everyone,
> > > > >
> > > > > I'd like to discuss sharing RocksDB memory across slots, as proposed
> > > > > in FLINK-29928 [1].
> > > > >
> > > > > Since 1.10 / FLINK-7289 [2], it is possible to:
> > > > > - share these objects (the RocksDB block cache and write buffer
> > > > > manager) among RocksDB instances of the same slot
> > > > > - bound the total memory usage by all RocksDB instances of a TM
> > > > >
> > > > > However, the memory is divided between the slots equally (unless using
> > > > > fine-grained resource control). This is sub-optimal if some slots
> > > > > contain more memory-intensive tasks than others.
> > > > > Using fine-grained resource control is also often not an option
> > > > > because the workload might not be known in advance.
> > > > >
> > > > > The proposal is to widen the scope of memory sharing to the TM, so
> > > > > that memory can be shared across all RocksDB instances of that TM.
> > > > > That would reduce the overall memory consumption at the expense of
> > > > > resource isolation.
> > > > >
> > > > > Please see FLINK-29928 [1] for more details.
> > > > >
> > > > > Looking forward to feedback on that proposal.
> > > > >
> > > > > [1]
> > > > > https://issues.apache.org/jira/browse/FLINK-29928
> > > > > [2]
> > > > > https://issues.apache.org/jira/browse/FLINK-7289
> > > > >
> > > > > Regards,
> > > > > Roman
> > > > >
> > > >
> > >
> >
>
