Agreed that the documentation regarding managed memory could be improved.

Just to clarify, breaking isolation is only one of the concerns. My biggest
concern is the extra complexity in the managed memory calculations. Many
users have reached out to me, online and offline, asking about the managed
memory calculation. Even with my help, it's not easy for all of them to
understand, which gives me the impression that the current calculation
logic is already quite complicated. That's why I'd be hesitant to
complicate it further. Treating the shared RocksDB memory as something
independent from managed memory would help in that sense.

> - there's one more memory type, in addition to 8 existing types [2]
> - its size needs to be calculated manually

Not necessarily. We may consider it part of the framework.off-heap memory.
And we could still have some automatic calculation, e.g., allowing the shared
memory to use up to a certain fraction of the framework.off-heap memory.
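
Just to illustrate what I mean by "a certain fraction", here is a minimal
sketch of such a sizing rule. Both the fraction and the helper are
hypothetical (neither exists in Flink today); the only real concept
referenced is the framework off-heap pool itself:

// Hypothetical sizing rule: the shared RocksDB memory is taken as a fraction
// of the framework off-heap memory, instead of being carved out of managed
// memory. None of this is an existing Flink option.
public class SharedRocksDbBudgetSketch {

    static long sharedRocksDbBudgetBytes(long frameworkOffHeapBytes, double sharedFraction) {
        // frameworkOffHeapBytes would correspond to
        // taskmanager.memory.framework.off-heap.size; sharedFraction would be
        // a new, hypothetical option (0 meaning "no sharing").
        return (long) (frameworkOffHeapBytes * sharedFraction);
    }

    public static void main(String[] args) {
        long frameworkOffHeap = 512L * 1024 * 1024; // e.g. 512 MiB of framework off-heap
        double fraction = 0.5;                      // hypothetical fraction
        // Prints 268435456, i.e. 256 MiB available for the shared RocksDB cache.
        System.out.println(sharedRocksDbBudgetBytes(frameworkOffHeap, fraction));
    }
}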

> - flink code needs to take this into account and "adjust" the weights
We already have Memory/FsStateBackend, which do not use managed memory. To
exclude a state backend from the managed memory calculations, you just need
to return `false` from `StateBackend#useManagedMemory`. That's why I suggest
a variant of the RocksDB state backend, which can reuse most of the original
RocksDB state backend code.
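
To make that concrete, here is a minimal sketch of what such a variant could
look like. The class name is made up, and it assumes the existing
EmbeddedRocksDBStateBackend can simply be extended and the method overridden;
the only real API referenced is StateBackend#useManagedMemory:

import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;

/**
 * Hypothetical RocksDB state backend variant that opts out of managed memory,
 * so that a TM-wide shared block cache / write buffer manager could be
 * accounted against framework off-heap memory instead.
 */
public class SharedMemoryRocksDBStateBackend extends EmbeddedRocksDBStateBackend {

    @Override
    public boolean useManagedMemory() {
        // Excludes this backend from the managed memory calculations,
        // just like Memory/FsStateBackend.
        return false;
    }
}

The actual sharing of the cache across slots would of course still need the
TM-level support discussed in this thread; the sketch only shows how such a
variant would stay out of the managed memory accounting.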

Best,

Xintong



On Thu, Nov 17, 2022 at 4:20 AM Roman Khachatryan <ro...@apache.org> wrote:

> > I think not being able to isolate all kinds of memory does not mean we
> > should give up the isolation on all kinds of memory. And I believe
> "managed
> > memory is isolated and others are not" is much easier for the users to
> > understand compared to "part of the managed memory is isolated and others
> > are not".
>
> It looks like the users cannot infer that managed memory has the isolation
> property: neither the documentation [1] nor the FLIP mentions it. I guess
> this is because isolation is not its most important property (see below).
>
> An explicit option would be self-documenting and would let the users know
> what memory is shared and what isn't.
>
> From my perspective, the most important property is the shared budget,
> which makes it possible to avoid:
> 1) OOM errors when there are too many consumers (i.e. tasks)
> 2) manual calculation of memory for each type of consumer
>
> Both of these properties are desirable for shared and non-shared memory,
> as is not wasting memory when there is no consumer, as you described.
>
> OTOH, using *unmanaged* shared memory for RocksDB implies:
> - there's one more memory type, in addition to 8 existing types [2]
> - its size needs to be calculated manually
> - flink code needs to take this into account and "adjust" the weights
>
> Having said that, I'd be fine with either approach as both seem to have
> pros and cons.
>
> What do you think?
>
> [1]
>
> https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/memory/mem_setup_tm/#managed-memory
> [2]
>
> https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/memory/mem_setup_tm/#detailed-memory-model
>
> Regards,
> Roman
>
>
> On Wed, Nov 16, 2022 at 4:01 AM Xintong Song <tonysong...@gmail.com>
> wrote:
>
> > Concerning isolation, I think ideally we want everything to be isolated
> > between jobs running in the same cluster (i.e., slots in the same TM).
> > Unfortunately, this is impractical.
> > - Heap / Off-heap memory are directly allocated / deallocated through
> > JVM / OS. Flink does not have a good way to cap their usages per slot.
> > - Network memory does not have the good property of managed memory, that
> > a job can adapt to any given amount of managed memory (with a very small
> > min limitation). We are trying to improve network memory towards that
> > direction, and once achieved it can be isolated as well.
> > I think not being able to isolate all kinds of memory does not mean we
> > should give up the isolation on all kinds of memory. And I believe
> > "managed memory is isolated and others are not" is much easier for the
> > users to understand compared to "part of the managed memory is isolated
> > and others are not".
> >
> > By waste, I meant reserving a certain amount of memory that is only used
> > by certain use cases that do not always exist. This is exactly what we
> > want to avoid with managed memory in FLIP-49 [1]. We used to have managed
> > memory only used for batch operators, and a containerized-cut-off memory
> > (something similar to framework.off-heap) for rocksdb state backend. The
> > problem was that, if the user does not change the configuration when
> > switching between streaming / batch jobs, there would always be some
> > memory (managed or cut-off) wasted. Similarly, introducing a shared
> > managed memory zone means reserving one more dedicated part of memory
> > that can get wasted in many cases. This is probably a necessary price for
> > this new feature, but let's not break the concept / properties of managed
> > memory for it.
> >
> > In your proposal, the fraction for the shared managed memory is by
> > default 0. That means to enable the rocksdb memory sharing, users need to
> > manually increase the fraction anyway. Thus, having the memory sharing
> > rocksdb use managed memory or off-heap memory does not make a significant
> > difference for the new feature users. I'd think of this as "extra
> > operational overhead for users of a certain new feature" vs. "significant
> > learning cost and potential behavior change for pretty much all users".
> > I'd be fine with having some shortcuts to simplify the configuration on
> > the user side for this new feature, but not to invade the managed memory.
> >
> > Best,
> >
> > Xintong
> >
> >
> > [1]
> >
> >
> > https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> >
> > On Tue, Nov 15, 2022 at 5:46 PM Khachatryan Roman <
> > khachatryan.ro...@gmail.com> wrote:
> >
> > > Thanks for your reply, Xintong Song,
> > >
> > > Could you please elaborate on the following:
> > >
> > > > The proposed changes break several good properties that we designed
> > > > for managed memory.
> > > > 1. It's isolated across slots
> > > Just to clarify, any way to manage the memory efficiently while capping
> > > its usage will break the isolation. It's just a matter of whether it's
> > > managed memory or not.
> > > Do you see any reasons why unmanaged memory can be shared, and managed
> > > memory can not?
> > >
> > > > 2. It should never be wasted (unless there's nothing in the job that
> > > > needs managed memory)
> > > If I understand correctly, the managed memory can already be wasted
> > > because it is divided evenly between slots, regardless of the existence
> > > of its consumers in a particular slot.
> > > And in general, even if every slot has RocksDB / python, it's not
> > > guaranteed equal consumption.
> > > So this property would rather be fixed in the current proposal.
> > >
> > > > In addition, it further complicates configuration / computation
> > > > logics of managed memory.
> > > I think having multiple options overriding each other increases the
> > > complexity for the user. As for the computation, I think it's desirable
> > > to let Flink do it rather than users.
> > >
> > > Both approaches need some help from TM for:
> > > - storing the shared resources (static field in a class might be too
> > > dangerous because if the backend is loaded by the user-class-loader
> > > then memory will leak silently).
> > > - reading the configuration
> > >
> > > Regards,
> > > Roman
> > >
> > >
> > > On Sun, Nov 13, 2022 at 11:24 AM Xintong Song <tonysong...@gmail.com>
> > > wrote:
> > >
> > > > I like the idea of sharing RocksDB memory across slots. However, I'm
> > > > quite concerned by the current proposed approach.
> > > >
> > > > The proposed changes break several good properties that we designed
> > > > for managed memory.
> > > > 1. It's isolated across slots
> > > > 2. It should never be wasted (unless there's nothing in the job that
> > > > needs managed memory)
> > > > In addition, it further complicates configuration / computation
> > > > logics of managed memory.
> > > >
> > > > As an alternative, I'd suggest introducing a variant of
> > > > RocksDBStateBackend, that shares memory across slots and does not use
> > > > managed memory. This basically means the shared memory is not
> > > > considered as part of managed memory. For users of this new feature,
> > > > they would need to configure how much memory the variant state
> > > > backend should use, and probably also a larger framework-off-heap /
> > > > jvm-overhead memory. The latter might require a bit extra user effort
> > > > compared to the current approach, but would avoid significant
> > > > complexity in the managed memory configuration and calculation logics
> > > > which affects broader users.
> > > >
> > > >
> > > > Best,
> > > >
> > > > Xintong
> > > >
> > > >
> > > >
> > > > On Sat, Nov 12, 2022 at 1:21 AM Roman Khachatryan <ro...@apache.org>
> > > > wrote:
> > > >
> > > > > Hi John, Yun,
> > > > >
> > > > > Thank you for your feedback
> > > > >
> > > > > @John
> > > > >
> > > > > > It seems like operators would either choose isolation for the
> > > cluster’s
> > > > > jobs
> > > > > > or they would want to share the memory between jobs.
> > > > > > I’m not sure I see the motivation to reserve only part of the
> > memory
> > > > for
> > > > > sharing
> > > > > > and allowing jobs to choose whether they will share or be
> isolated.
> > > > >
> > > > > I see two related questions here:
> > > > >
> > > > > 1) Whether to allow mixed workloads within the same cluster.
> > > > > I agree that most likely all the jobs will have the same "sharing"
> > > > > requirement.
> > > > > So we can drop "state.backend.memory.share-scope" from the
> proposal.
> > > > >
> > > > > 2) Whether to allow different memory consumers to use shared or
> > > exclusive
> > > > > memory.
> > > > > Currently, only RocksDB is proposed to use shared memory. For
> python,
> > > > it's
> > > > > non-trivial because it is job-specific.
> > > > > So we have to partition managed memory into shared/exclusive and
> > > > therefore
> > > > > can NOT replace "taskmanager.memory.managed.shared-fraction" with
> > some
> > > > > boolean flag.
> > > > >
> > > > > I think your question was about (1), just wanted to clarify why the
> > > > > shared-fraction is needed.
> > > > >
> > > > > @Yun
> > > > >
> > > > > > I am just curious whether this could really bring benefits to our
> > > users
> > > > > with such complex configuration logic.
> > > > > I agree, and configuration complexity seems a common concern.
> > > > > I hope that removing "state.backend.memory.share-scope" (as
> proposed
> > > > above)
> > > > > reduces the complexity.
> > > > > Please share any ideas of how to simplify it further.
> > > > >
> > > > > > Could you share some real experimental results?
> > > > > I did an experiment to verify that the approach is feasible,
> > > > > i.e. multiple jobs can share the same memory/block cache.
> > > > > But I guess that's not what you mean here? Do you have any
> > experiments
> > > in
> > > > > mind?
> > > > >
> > > > > > BTW, as talked before, I am not sure whether different lifecycles
> > of
> > > > > RocksDB state-backends
> > > > > > would affect the memory usage of block cache & write buffer
> manager
> > > in
> > > > > RocksDB.
> > > > > > Currently, all instances would start and destroy nearly
> > > simultaneously,
> > > > > > this would change after we introduce this feature with jobs
> running
> > > at
> > > > > different scheduler times.
> > > > > IIUC, the concern is that closing a RocksDB instance might close
> the
> > > > > BlockCache.
> > > > > I checked that manually and it seems to work as expected.
> > > > > And I think that would contradict the sharing concept, as described
> > in
> > > > the
> > > > > documentation [1].
> > > > >
> > > > > [1]
> > > > > https://github.com/facebook/rocksdb/wiki/Block-Cache
> > > > >
> > > > > Regards,
> > > > > Roman
> > > > >
> > > > >
> > > > > On Wed, Nov 9, 2022 at 3:50 AM Yanfei Lei <fredia...@gmail.com>
> > wrote:
> > > > >
> > > > > > Hi Roman,
> > > > > > Thanks for the proposal, this allows State Backend to make better
> > use
> > > > of
> > > > > > memory.
> > > > > >
> > > > > > After reading the ticket, I'm curious about some points:
> > > > > >
> > > > > > 1. Is shared-memory only for the state backend? If both
> > > > > > "taskmanager.memory.managed.shared-fraction: >0" and
> > > > > > "state.backend.rocksdb.memory.managed: false" are set at the same
> > > time,
> > > > > > will the shared-memory be wasted?
> > > > > > 2. It's said that "Jobs 4 and 5 will use the same 750Mb of
> > unmanaged
> > > > > memory
> > > > > > and will compete with each other" in the example, how is the
> memory
> > > > size
> > > > > of
> > > > > > unmanaged part calculated?
> > > > > > 3. For fine-grained-resource-management, the control
> > > > > > of cpuCores, taskHeapMemory can still work, right?  And I am a
> > little
> > > > > > worried that too many memory-about configuration options are
> > > > complicated
> > > > > > for users to understand.
> > > > > >
> > > > > > Regards,
> > > > > > Yanfei
> > > > > >
> > > > > > Roman Khachatryan <ro...@apache.org> 于2022年11月8日周二 23:22写道:
> > > > > >
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > I'd like to discuss sharing RocksDB memory across slots as
> > proposed
> > > > in
> > > > > > > FLINK-29928 [1].
> > > > > > >
> > > > > > > Since 1.10 / FLINK-7289 [2], it is possible to:
> > > > > > > - share these objects among RocksDB instances of the same slot
> > > > > > > - bound the total memory usage by all RocksDB instances of a TM
> > > > > > >
> > > > > > > However, the memory is divided between the slots equally
> (unless
> > > > using
> > > > > > > fine-grained resource control). This is sub-optimal if some
> slots
> > > > > contain
> > > > > > > more memory intensive tasks than the others.
> > > > > > > Using fine-grained resource control is also often not an option
> > > > because
> > > > > > the
> > > > > > > workload might not be known in advance.
> > > > > > >
> > > > > > > The proposal is to widen the scope of sharing memory to TM, so
> > that
> > > > it
> > > > > > can
> > > > > > > be shared across all RocksDB instances of that TM. That would
> > > reduce
> > > > > the
> > > > > > > overall memory consumption in exchange for resource isolation.
> > > > > > >
> > > > > > > Please see FLINK-29928 [1] for more details.
> > > > > > >
> > > > > > > Looking forward to feedback on that proposal.
> > > > > > >
> > > > > > > [1]
> > > > > > > https://issues.apache.org/jira/browse/FLINK-29928
> > > > > > > [2]
> > > > > > > https://issues.apache.org/jira/browse/FLINK-7289
> > > > > > >
> > > > > > > Regards,
> > > > > > > Roman
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
