Re: [DISCUSS] Allow sharing (RocksDB) memory between slots

2022-11-17 Thread Xintong Song
Thanks, Roman~


Best,

Xintong



On Thu, Nov 17, 2022 at 10:56 PM Khachatryan Roman <
khachatryan.ro...@gmail.com> wrote:

> I agree, the current calculation logic is already complicated.
> I just think that not using managed memory complicates the memory model
> even further.
>
> But as I mentioned earlier, both approaches have their pros and cons, so
> I'll update the proposal to use unmanaged memory.
>
> Thanks!
>
> Regards,
> Roman
>
>
> On Thu, Nov 17, 2022 at 3:25 AM Xintong Song 
> wrote:
>
> > Agreed that the documentation regarding managed memory could be improved.
> >
> > Just to clarify, breaking isolation is just one of the concerns. I think
> > my biggest concern is the extra complexity in managed memory calculations.
> > Many users have reached out to me, online or offline, asking about the
> > managed memory calculation. Even with my help, it's not easy for all
> > users to understand, which gives me the impression that the current
> > calculation logic is already quite complicated. That's why I'd be
> > hesitant to further complicate it. Treating the shared rocksdb memory as
> > something independent of managed memory would help in that sense.
> >
> > - there's one more memory type, in addition to 8 existing types [2]
> > >
> > - its size needs to be calculated manually
> >
> > Not necessarily. We may consider it as part of the framework.off-heap
> > memory. And we can still have some automatic calculations, e.g., allowing
> > using up to a certain fraction of the off-heap framework memory.
> >
> > - flink code needs to take this into account and "adjust" the weights
> > >
> > We already have Memory/FsStateBackend that does not use managed memory.
> > To exclude the state backend from the managed memory calculations, you
> > just need to return `false` for `StateBackend#useManagedMemory`. That's
> > why I suggest a variant of the rocksdb state backend, where you can reuse
> > most of the original rocksdb state backend code.
> >
> > Best,
> >
> > Xintong
> >
> >
> >
> > On Thu, Nov 17, 2022 at 4:20 AM Roman Khachatryan 
> > wrote:
> >
> > > > I think not being able to isolate all kinds of memory does not mean
> we
> > > > should give up the isolation on all kinds of memory. And I believe
> > > "managed
> > > > memory is isolated and others are not" is much easier for the users
> to
> > > > understand compared to "part of the managed memory is isolated and
> > others
> > > > are not".
> > >
> > > It looks like users cannot infer that managed memory has the isolation
> > > property: neither the documentation [1] nor the FLIP mentions it. I
> > > guess this is because isolation is not its most important property
> > > (see below).
> > >
> > > An explicit option would be self-documenting and would let users know
> > > what memory is shared and what isn't.
> > >
> > > From my perspective, the most important property is the shared budget,
> > > which avoids:
> > > 1) OOM errors when there are too many consumers (i.e. tasks)
> > > 2) manual calculation of memory for each type of consumer
> > >
> > > Both of these properties are desirable for shared and non-shared memory,
> > > as is not wasting memory when there is no consumer, as you described.
> > >
> > > OTOH, using *unmanaged* shared memory for RocksDB implies:
> > > - there's one more memory type, in addition to the 8 existing types [2]
> > > - its size needs to be calculated manually
> > > - flink code needs to take this into account and "adjust" the weights
> > >
> > > Having said that, I'd be fine with either approach as both seem to have
> > > pros and cons.
> > >
> > > What do you think?
> > >
> > > [1]
> > >
> > >
> >
> https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/memory/mem_setup_tm/#managed-memory
> > > [2]
> > >
> > >
> >
> https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/memory/mem_setup_tm/#detailed-memory-model
> > >
> > > Regards,
> > > Roman
> > >
> > >
> > > On Wed, Nov 16, 2022 at 4:01 AM Xintong Song 
> > > wrote:
> > >
> > > > Concerning isolation, I think ideally we want everything to be
> isolated
> > > > between jobs running in the same cluster (i.e., slots in the same
> TM).
> > > > Unfortunately, this is impractical.
> > > > - Heap / Off-heap memory are directly allocated / deallocated through
> > > JVM /
> > > > OS. Flink does not have a good way to cap their usages per slot.
> > > > - Network memory does not have the good property of managed memory,
> > that
> > > a
> > > > job can adapt to any given amount of managed memory (with a very
> small
> > > min
> > > > limitation). We are trying to improve network memory towards that
> > > > direction, and once achieved it can be isolated as well.
> > > > I think not being able to isolate all kinds of memory does not mean
> we
> > > > should give up the isolation on all kinds of memory. And I believe
> > > "managed
> > > > memory is isolated and others are not" is much easier for 

Re: [DISCUSS] Allow sharing (RocksDB) memory between slots

2022-11-17 Thread Khachatryan Roman
I agree, the current calculation logic is already complicated.
I just think that not using managed memory complicates the memory model
even further.

But as I mentioned earlier, both approaches have their pros and cons, so
I'll update the proposal to use unmanaged memory.

Thanks!

Regards,
Roman


On Thu, Nov 17, 2022 at 3:25 AM Xintong Song  wrote:

> Agreed that the documentation regarding managed memory could be improved.
>
> Just to clarify, breaking isolation is just one of the concerns. I think my
> biggest concern is the extra complexity in managed memory calculations.
> Many users have reached out to me, online or offline, asking about the
> managed memory calculation. Even with my help, it's not easy for all users
> to understand, which gives me the impression that the current calculation
> logic is already quite complicated. That's why I'd be hesitant to further
> complicate it. Treating the shared rocksdb memory as something
> independent of managed memory would help in that sense.
>
> - there's one more memory type, in addition to 8 existing types [2]
> >
> - its size needs to be calculated manually
>
> Not necessarily. We may consider it as part of the framework.off-heap
> memory. And we can still have some automatic calculations, e.g., allowing
> using up to a certain fraction of the off-heap framework memory.
>
> - flink code needs to take this into account and "adjust" the weights
> >
> We already have Memory/FsStateBackend that does not use managed memory. To
> exclude the state backend from the managed memory calculations, you just
> need to return `false` for `StateBackend#useManagedMemory`. That's why I
> suggest a variant of the rocksdb state backend, where you can reuse most of
> the original rocksdb state backend code.
>
> Best,
>
> Xintong
>
>
>
> On Thu, Nov 17, 2022 at 4:20 AM Roman Khachatryan 
> wrote:
>
> > > I think not being able to isolate all kinds of memory does not mean we
> > > should give up the isolation on all kinds of memory. And I believe
> > "managed
> > > memory is isolated and others are not" is much easier for the users to
> > > understand compared to "part of the managed memory is isolated and
> others
> > > are not".
> >
> > It looks like users cannot infer that managed memory has the isolation
> > property: neither the documentation [1] nor the FLIP mentions it. I
> > guess this is because isolation is not its most important property
> > (see below).
> >
> > An explicit option would be self-documenting and would let users know
> > what memory is shared and what isn't.
> >
> > From my perspective, the most important property is the shared budget,
> > which avoids:
> > 1) OOM errors when there are too many consumers (i.e. tasks)
> > 2) manual calculation of memory for each type of consumer
> >
> > Both of these properties are desirable for shared and non-shared memory,
> > as is not wasting memory when there is no consumer, as you described.
> >
> > OTOH, using *unmanaged* shared memory for RocksDB implies:
> > - there's one more memory type, in addition to the 8 existing types [2]
> > - its size needs to be calculated manually
> > - flink code needs to take this into account and "adjust" the weights
> >
> > Having said that, I'd be fine with either approach as both seem to have
> > pros and cons.
> >
> > What do you think?
> >
> > [1]
> >
> >
> https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/memory/mem_setup_tm/#managed-memory
> > [2]
> >
> >
> https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/memory/mem_setup_tm/#detailed-memory-model
> >
> > Regards,
> > Roman
> >
> >
> > On Wed, Nov 16, 2022 at 4:01 AM Xintong Song 
> > wrote:
> >
> > > Concerning isolation, I think ideally we want everything to be isolated
> > > between jobs running in the same cluster (i.e., slots in the same TM).
> > > Unfortunately, this is impractical.
> > > - Heap / Off-heap memory are directly allocated / deallocated through
> > JVM /
> > > OS. Flink does not have a good way to cap their usages per slot.
> > > - Network memory does not have the good property of managed memory,
> that
> > a
> > > job can adapt to any given amount of managed memory (with a very small
> > min
> > > limitation). We are trying to improve network memory towards that
> > > direction, and once achieved it can be isolated as well.
> > > I think not being able to isolate all kinds of memory does not mean we
> > > should give up the isolation on all kinds of memory. And I believe
> > "managed
> > > memory is isolated and others are not" is much easier for the users to
> > > understand compared to "part of the managed memory is isolated and
> others
> > > are not".
> > >
> > > By waste, I meant reserving a certain amount of memory that is only
> used
> > by
> > > certain use cases that do not always exist. This is exactly what we
> want
> > to
> > > avoid with managed memory in FLIP-49 [1]. We used to have managed
> memory
> > > only used 

Re: [DISCUSS] Allow sharing (RocksDB) memory between slots

2022-11-16 Thread Xintong Song
Agreed that the documentation regarding managed memory could be improved.

Just to clarify, breaking isolation is just one of the concerns. I think my
biggest concern is the extra complexity in managed memory calculations.
Many users have reached out to me, online or offline, asking about the
managed memory calculation. Even with my help, it's not easy for all users
to understand, which gives me the impression that the current calculation
logic is already quite complicated. That's why I'd be hesitant to further
complicate it. Treating the shared rocksdb memory as something
independent of managed memory would help in that sense.

- there's one more memory type, in addition to 8 existing types [2]
>
- its size needs to be calculated manually

Not necessarily. We may consider it as part of the framework.off-heap
memory. And we can still have some automatic calculations, e.g., allowing
it to use up to a certain fraction of the framework off-heap memory.

- flink code needs to take this into account and "adjust" the weights
>
We already have Memory/FsStateBackend that does not use managed memory. To
exclude the state backend from the managed memory calculations, you just
need to return `false` for `StateBackend#useManagedMemory`. That's why I
suggest a variant of the rocksdb state backend, where you can reuse most of
the original rocksdb state backend code.
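
For illustration, such a variant could be as thin as the following sketch
(the class name is made up, and I'm assuming the existing backend can simply
be extended):

  import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;

  /**
   * Sketch of a rocksdb state backend variant that opts out of managed memory.
   * The shared cache / write buffer manager would then be sized independently,
   * e.g. from framework off-heap memory, instead of the slot's managed memory.
   */
  public class TMSharedRocksDBStateBackend extends EmbeddedRocksDBStateBackend {

      @Override
      public boolean useManagedMemory() {
          // Excludes this backend from the managed memory calculations,
          // the same way Memory/FsStateBackend behave today.
          return false;
      }
  }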

Best,

Xintong



On Thu, Nov 17, 2022 at 4:20 AM Roman Khachatryan  wrote:

> > I think not being able to isolate all kinds of memory does not mean we
> > should give up the isolation on all kinds of memory. And I believe
> "managed
> > memory is isolated and others are not" is much easier for the users to
> > understand compared to "part of the managed memory is isolated and others
> > are not".
>
> It looks like users cannot infer that managed memory has the isolation
> property: neither the documentation [1] nor the FLIP mentions it. I
> guess this is because isolation is not its most important property
> (see below).
>
> An explicit option would be self-documenting and would let users know
> what memory is shared and what isn't.
>
> From my perspective, the most important property is the shared budget,
> which avoids:
> 1) OOM errors when there are too many consumers (i.e. tasks)
> 2) manual calculation of memory for each type of consumer
>
> Both of these properties are desirable for shared and non-shared memory,
> as is not wasting memory when there is no consumer, as you described.
>
> OTOH, using *unmanaged* shared memory for RocksDB implies:
> - there's one more memory type, in addition to the 8 existing types [2]
> - its size needs to be calculated manually
> - flink code needs to take this into account and "adjust" the weights
>
> Having said that, I'd be fine with either approach as both seem to have
> pros and cons.
>
> What do you think?
>
> [1]
>
> https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/memory/mem_setup_tm/#managed-memory
> [2]
>
> https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/memory/mem_setup_tm/#detailed-memory-model
>
> Regards,
> Roman
>
>
> On Wed, Nov 16, 2022 at 4:01 AM Xintong Song 
> wrote:
>
> > Concerning isolation, I think ideally we want everything to be isolated
> > between jobs running in the same cluster (i.e., slots in the same TM).
> > Unfortunately, this is impractical.
> > - Heap / Off-heap memory are directly allocated / deallocated through
> JVM /
> > OS. Flink does not have a good way to cap their usages per slot.
> > - Network memory does not have the good property of managed memory, that
> a
> > job can adapt to any given amount of managed memory (with a very small
> min
> > limitation). We are trying to improve network memory towards that
> > direction, and once achieved it can be isolated as well.
> > I think not being able to isolate all kinds of memory does not mean we
> > should give up the isolation on all kinds of memory. And I believe
> "managed
> > memory is isolated and others are not" is much easier for the users to
> > understand compared to "part of the managed memory is isolated and others
> > are not".
> >
> > By waste, I meant reserving a certain amount of memory that is only used
> by
> > certain use cases that do not always exist. This is exactly what we want
> to
> > avoid with managed memory in FLIP-49 [1]. We used to have managed memory
> > only used for batch operators, and a containerized-cut-off memory
> > (something similar to framework.off-heap) for rocksdb state backend. The
> > problem was that, if the user does not change the configuration when
> > switching between streaming / batch jobs, there would always be some
> memory
> > (managed or cut-off) wasted. Similarly, introducing a shared managed
> memory
> > zone means reserving one more dedicated part of memory that can get
> wasted
> > in many cases. This is probably a necessary price for this new feature,
> but
> > let's not break the concept / properties of 

Re: [DISCUSS] Allow sharing (RocksDB) memory between slots

2022-11-16 Thread Roman Khachatryan
> I think not being able to isolate all kinds of memory does not mean we
> should give up the isolation on all kinds of memory. And I believe
"managed
> memory is isolated and others are not" is much easier for the users to
> understand compared to "part of the managed memory is isolated and others
> are not".

It looks like users cannot infer that managed memory has the isolation
property: neither the documentation [1] nor the FLIP mentions it. I guess
this is because isolation is not its most important property (see below).

An explicit option would be self-documenting and would let users know
what memory is shared and what isn't.

From my perspective, the most important property is the shared budget,
which avoids:
1) OOM errors when there are too many consumers (i.e. tasks)
2) manual calculation of memory for each type of consumer

Both of these properties are desirable for shared and non-shared memory,
as is not wasting memory when there is no consumer, as you described.

OTOH, using *unmanaged* shared memory for RocksDB implies:
- there's one more memory type, in addition to the 8 existing types [2]
- its size needs to be calculated manually
- flink code needs to take this into account and "adjust" the weights
(see the example below)
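
To make the last point concrete: with managed memory, RocksDB's share of the
budget is derived from the consumer weights, e.g. (values are only
illustrative):

  taskmanager.memory.managed.consumer-weights: OPERATOR:70,STATE_BACKEND:70,PYTHON:30

With an unmanaged shared pool, Flink would have to know that RocksDB no
longer draws from this budget and effectively drop the STATE_BACKEND weight
from the calculation.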

Having said that, I'd be fine with either approach as both seem to have
pros and cons.

What do you think?

[1]
https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/memory/mem_setup_tm/#managed-memory
[2]
https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/memory/mem_setup_tm/#detailed-memory-model

Regards,
Roman


On Wed, Nov 16, 2022 at 4:01 AM Xintong Song  wrote:

> Concerning isolation, I think ideally we want everything to be isolated
> between jobs running in the same cluster (i.e., slots in the same TM).
> Unfortunately, this is impractical.
> - Heap / Off-heap memory are directly allocated / deallocated through JVM /
> OS. Flink does not have a good way to cap their usages per slot.
> - Network memory does not have the good property of managed memory, that a
> job can adapt to any given amount of managed memory (with a very small min
> limitation). We are trying to improve network memory towards that
> direction, and once achieved it can be isolated as well.
> I think not being able to isolate all kinds of memory does not mean we
> should give up the isolation on all kinds of memory. And I believe "managed
> memory is isolated and others are not" is much easier for the users to
> understand compared to "part of the managed memory is isolated and others
> are not".
>
> By waste, I meant reserving a certain amount of memory that is only used by
> certain use cases that do not always exist. This is exactly what we want to
> avoid with managed memory in FLIP-49 [1]. We used to have managed memory
> only used for batch operators, and a containerized-cut-off memory
> (something similar to framework.off-heap) for rocksdb state backend. The
> problem was that, if the user does not change the configuration when
> switching between streaming / batch jobs, there would always be some memory
> (managed or cut-off) wasted. Similarly, introducing a shared managed memory
> zone means reserving one more dedicated part of memory that can get wasted
> in many cases. This is probably a necessary price for this new feature, but
> let's not break the concept / properties of managed memory for it.
>
> In your proposal, the fraction for the shared managed memory is by default
> 0. That means to enable the rocksdb memory sharing, users need to manually
> increase the fraction anyway. Thus, having the memory sharing rocksdb use
> managed memory or off-heap memory does not make a significant difference
> for the new feature users. I'd think of this as "extra operational overhead
> for users of a certain new feature" vs. "significant learning cost and
> potential behavior change for pretty much all users". I'd be fine with
> having some shortcuts to simplify the configuration on the user side for
> this new feature, but not to invade the managed memory.
>
> Best,
>
> Xintong
>
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
>
> On Tue, Nov 15, 2022 at 5:46 PM Khachatryan Roman <
> khachatryan.ro...@gmail.com> wrote:
>
> > Thanks for your reply, Xintong Song,
> >
> > Could you please elaborate on the following:
> >
> > > The proposed changes break several good properties that we designed for
> > > managed memory.
> > > 1. It's isolated across slots
> > Just to clarify, any way to manage the memory efficiently while capping
> its
> > usage
> > will break the isolation. It's just a matter of whether it's managed
> memory
> > or not.
> > Do you see any reasons why unmanaged memory can be shared, and managed
> > memory can not?
> >
> > > 2. It should never be wasted (unless there's nothing in the job that
> > needs
> > > managed memory)
> > If I understand correctly, the managed memory can 

Re: [DISCUSS] Allow sharing (RocksDB) memory between slots

2022-11-15 Thread Xintong Song
Concerning isolation, I think ideally we want everything to be isolated
between jobs running in the same cluster (i.e., slots in the same TM).
Unfortunately, this is impractical.
- Heap / Off-heap memory are directly allocated / deallocated through JVM /
OS. Flink does not have a good way to cap their usages per slot.
- Network memory does not have the good property of managed memory, that a
job can adapt to any given amount of managed memory (with a very small min
limitation). We are trying to improve network memory towards that
direction, and once achieved it can be isolated as well.
I think not being able to isolate all kinds of memory does not mean we
should give up the isolation on all kinds of memory. And I believe "managed
memory is isolated and others are not" is much easier for the users to
understand compared to "part of the managed memory is isolated and others
are not".

By waste, I meant reserving a certain amount of memory that is only used by
certain use cases that do not always exist. This is exactly what we want to
avoid with managed memory in FLIP-49 [1]. We used to have managed memory
only used for batch operators, and a containerized-cut-off memory
(something similar to framework.off-heap) for rocksdb state backend. The
problem was that, if the user does not change the configuration when
switching between streaming / batch jobs, there would always be some memory
(managed or cut-off) wasted. Similarly, introducing a shared managed memory
zone means reserving one more dedicated part of memory that can get wasted
in many cases. This is probably a necessary price for this new feature, but
let's not break the concept / properties of managed memory for it.

In your proposal, the fraction for the shared managed memory is by default
0. That means to enable the rocksdb memory sharing, users need to manually
increase the fraction anyway. Thus, having the memory sharing rocksdb use
managed memory or off-heap memory does not make a significant difference
for the new feature users. I'd think of this as "extra operational overhead
for users of a certain new feature" vs. "significant learning cost and
potential behavior change for pretty much all users". I'd be fine with
having some shortcuts to simplify the configuration on the user side for
this new feature, but not to invade the managed memory.

Best,

Xintong


[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors

On Tue, Nov 15, 2022 at 5:46 PM Khachatryan Roman <
khachatryan.ro...@gmail.com> wrote:

> Thanks for your reply, Xintong Song,
>
> Could you please elaborate on the following:
>
> > The proposed changes break several good properties that we designed for
> > managed memory.
> > 1. It's isolated across slots
> Just to clarify, any way to manage the memory efficiently while capping its
> usage
> will break the isolation. It's just a matter of whether it's managed memory
> or not.
> Do you see any reasons why unmanaged memory can be shared, and managed
> memory can not?
>
> > 2. It should never be wasted (unless there's nothing in the job that
> needs
> > managed memory)
> If I understand correctly, the managed memory can already be wasted because
> it is divided evenly between slots, regardless of the existence of its
> consumers in a particular slot.
> And in general, even if every slot has RocksDB / python, equal consumption
> is not guaranteed.
> So, if anything, the current proposal would improve this property.
>
> > In addition, it further complicates configuration / computation logics of
> > managed memory.
> I think having multiple options overriding each other increases the
> complexity for the user. As for the computation, I think it's desirable to
> let Flink do it rather than users.
>
> Both approaches need some help from TM for:
> - storing the shared resources (static field in a class might be too
> dangerous because if the backend is loaded by the user-class-loader then
> memory will leak silently).
> - reading the configuration
>
> Regards,
> Roman
>
>
> On Sun, Nov 13, 2022 at 11:24 AM Xintong Song 
> wrote:
>
> > I like the idea of sharing RocksDB memory across slots. However, I'm
> quite
> > concerned by the current proposed approach.
> >
> > The proposed changes break several good properties that we designed for
> > managed memory.
> > 1. It's isolated across slots
> > 2. It should never be wasted (unless there's nothing in the job that
> needs
> > managed memory)
> > In addition, it further complicates configuration / computation logics of
> > managed memory.
> >
> > As an alternative, I'd suggest introducing a variant of
> > RocksDBStateBackend, that shares memory across slots and does not use
> > managed memory. This basically means the shared memory is not considered
> as
> > part of managed memory. For users of this new feature, they would need to
> > configure how much memory the variant state backend should use, and
> > probably also a larger 

Re: [DISCUSS] Allow sharing (RocksDB) memory between slots

2022-11-15 Thread Khachatryan Roman
Thanks for your reply, Xintong Song,

Could you please elaborate on the following:

> The proposed changes break several good properties that we designed for
> managed memory.
> 1. It's isolated across slots
Just to clarify, any way to manage the memory efficiently while capping its
usage
will break the isolation. It's just a matter of whether it's managed memory
or not.
Do you see any reasons why unmanaged memory can be shared, and managed
memory can not?

> 2. It should never be wasted (unless there's nothing in the job that needs
> managed memory)
If I understand correctly, the managed memory can already be wasted because
it is divided evenly between slots, regardless of the existence of its
consumers in a particular slot.
And in general, even if every slot has RocksDB / python, equal consumption
is not guaranteed.
So, if anything, the current proposal would improve this property.

> In addition, it further complicates configuration / computation logics of
> managed memory.
I think having multiple options overriding each other increases the
complexity for the user. As for the computation, I think it's desirable to
let Flink do it rather than users.

Both approaches need some help from TM for:
- storing the shared resources (a static field in a class might be too
dangerous because if the backend is loaded by the user-class-loader then
memory will leak silently); see the sketch below
- reading the configuration
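
Regarding the first point, a minimal sketch of what a TM-owned registry could
look like (all names here are hypothetical, not existing Flink classes):

  import java.util.HashMap;
  import java.util.Map;
  import java.util.function.LongFunction;

  /**
   * Hypothetical TM-level registry for memory resources shared across slots.
   * It would live in the TM (loaded by the system class loader), so releasing
   * a job's user class loader cannot silently leak the shared cache.
   */
  public final class SharedResourceRegistry {

      private final Map<String, AutoCloseable> resources = new HashMap<>();

      /** Returns the resource registered under the key, creating it once. */
      public synchronized AutoCloseable getOrCreate(
              String key, long size, LongFunction<AutoCloseable> factory) {
          return resources.computeIfAbsent(key, k -> factory.apply(size));
      }
  }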

Regards,
Roman


On Sun, Nov 13, 2022 at 11:24 AM Xintong Song  wrote:

> I like the idea of sharing RocksDB memory across slots. However, I'm quite
> concerned by the current proposed approach.
>
> The proposed changes break several good properties that we designed for
> managed memory.
> 1. It's isolated across slots
> 2. It should never be wasted (unless there's nothing in the job that needs
> managed memory)
> In addition, it further complicates configuration / computation logics of
> managed memory.
>
> As an alternative, I'd suggest introducing a variant of
> RocksDBStateBackend, that shares memory across slots and does not use
> managed memory. This basically means the shared memory is not considered as
> part of managed memory. For users of this new feature, they would need to
> configure how much memory the variant state backend should use, and
> probably also a larger framework-off-heap / jvm-overhead memory. The latter
> might require a bit extra user effort compared to the current approach, but
> would avoid significant complexity in the managed memory configuration and
> calculation logic, which affects broader users.
>
>
> Best,
>
> Xintong
>
>
>
> On Sat, Nov 12, 2022 at 1:21 AM Roman Khachatryan 
> wrote:
>
> > Hi John, Yun,
> >
> > Thank you for your feedback
> >
> > @John
> >
> > > It seems like operators would either choose isolation for the cluster’s
> > jobs
> > > or they would want to share the memory between jobs.
> > > I’m not sure I see the motivation to reserve only part of the memory
> for
> > sharing
> > > and allowing jobs to choose whether they will share or be isolated.
> >
> > I see two related questions here:
> >
> > 1) Whether to allow mixed workloads within the same cluster.
> > I agree that most likely all the jobs will have the same "sharing"
> > requirement.
> > So we can drop "state.backend.memory.share-scope" from the proposal.
> >
> > 2) Whether to allow different memory consumers to use shared or exclusive
> > memory.
> > Currently, only RocksDB is proposed to use shared memory. For python,
> it's
> > non-trivial because it is job-specific.
> > So we have to partition managed memory into shared/exclusive and
> therefore
> > can NOT replace "taskmanager.memory.managed.shared-fraction" with some
> > boolean flag.
> >
> > I think your question was about (1), just wanted to clarify why the
> > shared-fraction is needed.
> >
> > @Yun
> >
> > > I am just curious whether this could really bring benefits to our users
> > with such complex configuration logic.
> > I agree, and configuration complexity seems a common concern.
> > I hope that removing "state.backend.memory.share-scope" (as proposed
> above)
> > reduces the complexity.
> > Please share any ideas of how to simplify it further.
> >
> > > Could you share some real experimental results?
> > I did an experiment to verify that the approach is feasible,
> > i.e. multilple jobs can share the same memory/block cache.
> > But I guess that's not what you mean here? Do you have any experiments in
> > mind?
> >
> > > BTW, as talked before, I am not sure whether different lifecycles of
> > RocksDB state-backends
> > > would affect the memory usage of block cache & write buffer manager in
> > RocksDB.
> > > Currently, all instances would start and destroy nearly simultaneously,
> > > this would change after we introduce this feature with jobs running at
> > different scheduler times.
> > IIUC, the concern is that closing a RocksDB instance might close the
> > BlockCache.
> > I checked that manually and it seems to work as expected.
> > And 

Re: [DISCUSS] Allow sharing (RocksDB) memory between slots

2022-11-13 Thread Xintong Song
I like the idea of sharing RocksDB memory across slots. However, I'm quite
concerned by the current proposed approach.

The proposed changes break several good properties that we designed for
managed memory.
1. It's isolated across slots
2. It should never be wasted (unless there's nothing in the job that needs
managed memory)
In addition, it further complicates configuration / computation logics of
managed memory.

As an alternative, I'd suggest introducing a variant of
RocksDBStateBackend, that shares memory across slots and does not use
managed memory. This basically means the shared memory is not considered as
part of managed memory. For users of this new feature, they would need to
configure how much memory the variant state backend should use, and
probably also a larger framework-off-heap / jvm-overhead memory. The latter
might require a bit extra user effort compared to the current approach, but
would avoid significant complexity in the managed memory configuration and
calculation logic, which affects broader users.
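
Roughly, the configuration for users of such a variant might look like the
following (the shared-size key is purely illustrative, since the new backend
and its options don't exist yet; the other key is an existing option):

  # enlarge the existing framework off-heap memory to host the shared
  # block cache / write buffer manager (default is 128m)
  taskmanager.memory.framework.off-heap.size: 896m

  # hypothetical option of the new state backend variant: how much memory
  # the TM-wide shared RocksDB cache may use
  state.backend.rocksdb.memory.shared-size: 768m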


Best,

Xintong



On Sat, Nov 12, 2022 at 1:21 AM Roman Khachatryan  wrote:

> Hi John, Yun,
>
> Thank you for your feedback
>
> @John
>
> > It seems like operators would either choose isolation for the cluster’s
> jobs
> > or they would want to share the memory between jobs.
> > I’m not sure I see the motivation to reserve only part of the memory for
> sharing
> > and allowing jobs to choose whether they will share or be isolated.
>
> I see two related questions here:
>
> 1) Whether to allow mixed workloads within the same cluster.
> I agree that most likely all the jobs will have the same "sharing"
> requirement.
> So we can drop "state.backend.memory.share-scope" from the proposal.
>
> 2) Whether to allow different memory consumers to use shared or exclusive
> memory.
> Currently, only RocksDB is proposed to use shared memory. For python, it's
> non-trivial because it is job-specific.
> So we have to partition managed memory into shared/exclusive and therefore
> can NOT replace "taskmanager.memory.managed.shared-fraction" with some
> boolean flag.
>
> I think your question was about (1), just wanted to clarify why the
> shared-fraction is needed.
>
> @Yun
>
> > I am just curious whether this could really bring benefits to our users
> with such complex configuration logic.
> I agree, and configuration complexity seems a common concern.
> I hope that removing "state.backend.memory.share-scope" (as proposed above)
> reduces the complexity.
> Please share any ideas of how to simplify it further.
>
> > Could you share some real experimental results?
> I did an experiment to verify that the approach is feasible,
> i.e. multiple jobs can share the same memory/block cache.
> But I guess that's not what you mean here? Do you have any experiments in
> mind?
>
> > BTW, as talked before, I am not sure whether different lifecycles of
> RocksDB state-backends
> > would affect the memory usage of block cache & write buffer manager in
> RocksDB.
> > Currently, all instances would start and destroy nearly simultaneously,
> > this would change after we introduce this feature with jobs running at
> different scheduler times.
> IIUC, the concern is that closing a RocksDB instance might close the
> BlockCache.
> I checked that manually and it seems to work as expected.
> And I think closing the cache in that case would contradict the sharing
> concept, as described in the documentation [1].
>
> [1]
> https://github.com/facebook/rocksdb/wiki/Block-Cache
>
> Regards,
> Roman
>
>
> On Wed, Nov 9, 2022 at 3:50 AM Yanfei Lei  wrote:
>
> > Hi Roman,
> > Thanks for the proposal, this allows State Backend to make better use of
> > memory.
> >
> > After reading the ticket, I'm curious about some points:
> >
> > 1. Is shared-memory only for the state backend? If both
> > "taskmanager.memory.managed.shared-fraction: >0" and
> > "state.backend.rocksdb.memory.managed: false" are set at the same time,
> > will the shared-memory be wasted?
> > 2. It's said that "Jobs 4 and 5 will use the same 750Mb of unmanaged
> memory
> > and will compete with each other" in the example, how is the memory size
> of
> > unmanaged part calculated?
> > 3. For fine-grained-resource-management, the control
> > of cpuCores, taskHeapMemory can still work, right?  And I am a little
> > worried that too many memory-related configuration options are complicated
> > for users to understand.
> >
> > Regards,
> > Yanfei
> >
> > Roman Khachatryan  wrote on Tue, Nov 8, 2022 at 23:22:
> >
> > > Hi everyone,
> > >
> > > I'd like to discuss sharing RocksDB memory across slots as proposed in
> > > FLINK-29928 [1].
> > >
> > > Since 1.10 / FLINK-7289 [2], it is possible to:
> > > - share these objects among RocksDB instances of the same slot
> > > - bound the total memory usage by all RocksDB instances of a TM
> > >
> > > However, the memory is divided between the slots equally (unless using
> > > fine-grained resource control). This is sub-optimal if some slots
> contain
> > > more memory 

Re: [DISCUSS] Allow sharing (RocksDB) memory between slots

2022-11-11 Thread Roman Khachatryan
Hi John, Yun,

Thank you for your feedback

@John

> It seems like operators would either choose isolation for the cluster’s
jobs
> or they would want to share the memory between jobs.
> I’m not sure I see the motivation to reserve only part of the memory for
sharing
> and allowing jobs to choose whether they will share or be isolated.

I see two related questions here:

1) Whether to allow mixed workloads within the same cluster.
I agree that most likely all the jobs will have the same "sharing"
requirement.
So we can drop "state.backend.memory.share-scope" from the proposal.

2) Whether to allow different memory consumers to use shared or exclusive
memory.
Currently, only RocksDB is proposed to use shared memory. For python, it's
non-trivial because it is job-specific.
So we have to partition managed memory into shared/exclusive and therefore
can NOT replace "taskmanager.memory.managed.shared-fraction" with some
boolean flag.

I think your question was about (1), just wanted to clarify why the
shared-fraction is needed.

@Yun

> I am just curious whether this could really bring benefits to our users
with such complex configuration logic.
I agree, and configuration complexity seems a common concern.
I hope that removing "state.backend.memory.share-scope" (as proposed above)
reduces the complexity.
Please share any ideas of how to simplify it further.

> Could you share some real experimental results?
I did an experiment to verify that the approach is feasible,
i.e. multiple jobs can share the same memory/block cache.
But I guess that's not what you mean here? Do you have any experiments in
mind?

> BTW, as talked before, I am not sure whether different lifecycles of
RocksDB state-backends
> would affect the memory usage of block cache & write buffer manager in
RocksDB.
> Currently, all instances would start and destroy nearly simultaneously,
> this would change after we introduce this feature with jobs running at
different scheduler times.
IIUC, the concern is that closing a RocksDB instance might close the
BlockCache.
I checked that manually and it seems to work as expected.
And I think closing the cache in that case would contradict the sharing
concept, as described in the documentation [1].

[1]
https://github.com/facebook/rocksdb/wiki/Block-Cache
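
For illustration, the check boils down to something like this with the plain
RocksDB Java API (a standalone sketch, paths and sizes are arbitrary):

  import org.rocksdb.BlockBasedTableConfig;
  import org.rocksdb.Cache;
  import org.rocksdb.LRUCache;
  import org.rocksdb.Options;
  import org.rocksdb.RocksDB;
  import org.rocksdb.RocksDBException;

  public class SharedBlockCacheCheck {

      public static void main(String[] args) throws RocksDBException {
          RocksDB.loadLibrary();
          // One block cache instance shared by two independent DB instances.
          Cache sharedCache = new LRUCache(64 * 1024 * 1024);

          RocksDB db1 = RocksDB.open(newOptions(sharedCache), "/tmp/db1");
          RocksDB db2 = RocksDB.open(newOptions(sharedCache), "/tmp/db2");

          db1.put("k".getBytes(), "v".getBytes());
          // Closing the first instance should not invalidate the shared cache,
          db1.close();
          // so the second instance keeps working against the same cache.
          db2.put("k".getBytes(), "v".getBytes());
          db2.close();
      }

      private static Options newOptions(Cache cache) {
          return new Options()
                  .setCreateIfMissing(true)
                  .setTableFormatConfig(new BlockBasedTableConfig().setBlockCache(cache));
      }
  }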

Regards,
Roman


On Wed, Nov 9, 2022 at 3:50 AM Yanfei Lei  wrote:

> Hi Roman,
> Thanks for the proposal, this allows State Backend to make better use of
> memory.
>
> After reading the ticket, I'm curious about some points:
>
> 1. Is shared-memory only for the state backend? If both
> "taskmanager.memory.managed.shared-fraction: >0" and
> "state.backend.rocksdb.memory.managed: false" are set at the same time,
> will the shared-memory be wasted?
> 2. It's said that "Jobs 4 and 5 will use the same 750Mb of unmanaged memory
> and will compete with each other" in the example, how is the memory size of
> unmanaged part calculated?
> 3. For fine-grained-resource-management, the control
> of cpuCores, taskHeapMemory can still work, right?  And I am a little
> worried that too many memory-related configuration options are complicated
> for users to understand.
>
> Regards,
> Yanfei
>
> Roman Khachatryan  wrote on Tue, Nov 8, 2022 at 23:22:
>
> > Hi everyone,
> >
> > I'd like to discuss sharing RocksDB memory across slots as proposed in
> > FLINK-29928 [1].
> >
> > Since 1.10 / FLINK-7289 [2], it is possible to:
> > - share these objects among RocksDB instances of the same slot
> > - bound the total memory usage by all RocksDB instances of a TM
> >
> > However, the memory is divided between the slots equally (unless using
> > fine-grained resource control). This is sub-optimal if some slots contain
> > more memory intensive tasks than the others.
> > Using fine-grained resource control is also often not an option because
> the
> > workload might not be known in advance.
> >
> > The proposal is to widen the scope of sharing memory to TM, so that it
> can
> > be shared across all RocksDB instances of that TM. That would reduce the
> > overall memory consumption in exchange for resource isolation.
> >
> > Please see FLINK-29928 [1] for more details.
> >
> > Looking forward to feedback on that proposal.
> >
> > [1]
> > https://issues.apache.org/jira/browse/FLINK-29928
> > [2]
> > https://issues.apache.org/jira/browse/FLINK-7289
> >
> > Regards,
> > Roman
> >
>


Re: [DISCUSS] Allow sharing (RocksDB) memory between slots

2022-11-09 Thread Yun Tang
Hi Roman,

Thanks for the proposal.

I am just curious whether this could really bring benefits to our users with 
such complex configuration logic. Could you share some real experimental 
results?

BTW, as discussed before, I am not sure whether the different lifecycles of
RocksDB state-backends would affect the memory usage of the block cache &
write buffer manager in RocksDB. Currently, all instances start and are
destroyed nearly simultaneously; this would change after we introduce this
feature, with jobs being scheduled at different times.

Best,
Yun Tang

From: John Roesler 
Sent: Thursday, November 10, 2022 10:15
To: dev@flink.apache.org 
Subject: Re: [DISCUSS] Allow sharing (RocksDB) memory between slots

Hi Roman,

Thanks for the proposal! This will make scheduling a lot more flexible for our 
use case.

Just a quick follow-up question about the number of new configs we’re adding 
here. It seems like the proposed configs provide a lot of flexibility, but at 
the expense of added complexity.

It seems like operators would either choose isolation for the cluster’s jobs or 
they would want to share the memory between jobs. I’m not sure I see the 
motivation to reserve only part of the memory for sharing and allowing jobs to 
choose whether they will share or be isolated.

I’m new to Flink, though, and have no operational experience. Is there a use case
you have in mind for this kind of split configuration?

Thanks,
John

On Wed, Nov 9, 2022, at 08:17, Roman Khachatryan wrote:
> Hi Yanfei,
>
> Thanks, good questions
>
>> 1. Is shared-memory only for the state backend? If both
>> "taskmanager.memory.managed.shared-fraction: >0" and
>> "state.backend.rocksdb.memory.managed: false" are set at the same time,
>> will the shared-memory be wasted?
> Yes, shared memory is only for the state backend currently;
> If no job uses it then it will be wasted.
> Session cluster can not validate this configuration
> because the job configuration is not known in advance.
>
>> 2. It's said that "Jobs 4 and 5 will use the same 750Mb of unmanaged
> memory
>> and will compete with each other" in the example, how is the memory size
> of
>> unmanaged part calculated?
> It's calculated the same way as managed memory size currently,
> i.e. taskmanager.memory.managed.size *
> taskmanager.memory.managed.shared-fraction
> Separate parameters for unmanaged memory would be more clear.
> However, I doubt that this configuration would ever be used (I listed it
> just for completeness).
> So I'm not sure whether adding them would be justified.
> WDYT?
>
>> 3. For fine-grained-resource-management, the control
>> of cpuCores, taskHeapMemory can still work, right?
> Yes, for other resources fine-grained-resource-management should work.
>
>> And I am a little
>> worried that too many memory-about configuration options are complicated
>> for users to understand.
> I'm also worried about having too many options, but I don't see any better
> alternative.
> The existing users definitely shouldn't be affected,
> so there must be at least a feature toggle ("shared-fraction").
> "share-scope" could potentially be replaced by some inference logic,
> but having it explicit seems less error-prone.
>
> Regards,
> Roman
>
>
> On Wed, Nov 9, 2022 at 3:50 AM Yanfei Lei  wrote:
>
>> Hi Roman,
>> Thanks for the proposal, this allows State Backend to make better use of
>> memory.
>>
>> After reading the ticket, I'm curious about some points:
>>
>> 1. Is shared-memory only for the state backend? If both
>> "taskmanager.memory.managed.shared-fraction: >0" and
>> "state.backend.rocksdb.memory.managed: false" are set at the same time,
>> will the shared-memory be wasted?
>> 2. It's said that "Jobs 4 and 5 will use the same 750Mb of unmanaged memory
>> and will compete with each other" in the example, how is the memory size of
>> unmanaged part calculated?
>> 3. For fine-grained-resource-management, the control
>> of cpuCores, taskHeapMemory can still work, right?  And I am a little
>> worried that too many memory-related configuration options are complicated
>> for users to understand.
>>
>> Regards,
>> Yanfei
>>
>> Roman Khachatryan  wrote on Tue, Nov 8, 2022 at 23:22:
>>
>> > Hi everyone,
>> >
>> > I'd like to discuss sharing RocksDB memory across slots as proposed in
>> > FLINK-29928 [1].
>> >
>> > Since 1.10 / FLINK-7289 [2], it is possible to:
>> > - share these objects among RocksDB instances of the same slot
>> > - bound the total memory usage by all RocksDB instances o

Re: [DISCUSS] Allow sharing (RocksDB) memory between slots

2022-11-09 Thread John Roesler
Hi Roman,

Thanks for the proposal! This will make scheduling a lot more flexible for our 
use case.

Just a quick follow-up question about the number of new configs we’re adding 
here. It seems like the proposed configs provide a lot of flexibility, but at 
the expense of added complexity.

It seems like operators would either choose isolation for the cluster’s jobs or 
they would want to share the memory between jobs. I’m not sure I see the 
motivation to reserve only part of the memory for sharing and allowing jobs to 
choose whether they will share or be isolated.

I’m new to Flink, though, and have no operational experience. Is there a use case
you have in mind for this kind of split configuration?

Thanks,
John

On Wed, Nov 9, 2022, at 08:17, Roman Khachatryan wrote:
> Hi Yanfei,
>
> Thanks, good questions
>
>> 1. Is shared-memory only for the state backend? If both
>> "taskmanager.memory.managed.shared-fraction: >0" and
>> "state.backend.rocksdb.memory.managed: false" are set at the same time,
>> will the shared-memory be wasted?
> Yes, shared memory is only for the state backend currently;
> If no job uses it then it will be wasted.
> Session cluster can not validate this configuration
> because the job configuration is not known in advance.
>
>> 2. It's said that "Jobs 4 and 5 will use the same 750Mb of unmanaged
> memory
>> and will compete with each other" in the example, how is the memory size
> of
>> unmanaged part calculated?
> It's calculated the same way as managed memory size currently,
> i.e. taskmanager.memory.managed.size *
> taskmanager.memory.managed.shared-fraction
> Separate parameters for unmanaged memory would be more clear.
> However, I doubt that this configuration would ever be used (I listed it
> just for completeness).
> So I'm not sure whether adding them would be justified.
> WDYT?
>
>> 3. For fine-grained-resource-management, the control
>> of cpuCores, taskHeapMemory can still work, right?
> Yes, for other resources fine-grained-resource-management should work.
>
>> And I am a little
>> worried that too many memory-about configuration options are complicated
>> for users to understand.
> I'm also worried about having too many options, but I don't see any better
> alternative.
> The existing users definitely shouldn't be affected,
> so there must be at least a feature toggle ("shared-fraction").
> "share-scope" could potentially be replaced by some inference logic,
> but having it explicit seems less error-prone.
>
> Regards,
> Roman
>
>
> On Wed, Nov 9, 2022 at 3:50 AM Yanfei Lei  wrote:
>
>> Hi Roman,
>> Thanks for the proposal, this allows State Backend to make better use of
>> memory.
>>
>> After reading the ticket, I'm curious about some points:
>>
>> 1. Is shared-memory only for the state backend? If both
>> "taskmanager.memory.managed.shared-fraction: >0" and
>> "state.backend.rocksdb.memory.managed: false" are set at the same time,
>> will the shared-memory be wasted?
>> 2. It's said that "Jobs 4 and 5 will use the same 750Mb of unmanaged memory
>> and will compete with each other" in the example, how is the memory size of
>> unmanaged part calculated?
>> 3. For fine-grained-resource-management, the control
>> of cpuCores, taskHeapMemory can still work, right?  And I am a little
>> worried that too many memory-related configuration options are complicated
>> for users to understand.
>>
>> Regards,
>> Yanfei
>>
>> Roman Khachatryan  wrote on Tue, Nov 8, 2022 at 23:22:
>>
>> > Hi everyone,
>> >
>> > I'd like to discuss sharing RocksDB memory across slots as proposed in
>> > FLINK-29928 [1].
>> >
>> > Since 1.10 / FLINK-7289 [2], it is possible to:
>> > - share these objects among RocksDB instances of the same slot
>> > - bound the total memory usage by all RocksDB instances of a TM
>> >
>> > However, the memory is divided between the slots equally (unless using
>> > fine-grained resource control). This is sub-optimal if some slots contain
>> > more memory intensive tasks than the others.
>> > Using fine-grained resource control is also often not an option because
>> the
>> > workload might not be known in advance.
>> >
>> > The proposal is to widen the scope of sharing memory to TM, so that it
>> can
>> > be shared across all RocksDB instances of that TM. That would reduce the
>> > overall memory consumption in exchange for resource isolation.
>> >
>> > Please see FLINK-29928 [1] for more details.
>> >
>> > Looking forward to feedback on that proposal.
>> >
>> > [1]
>> > https://issues.apache.org/jira/browse/FLINK-29928
>> > [2]
>> > https://issues.apache.org/jira/browse/FLINK-7289
>> >
>> > Regards,
>> > Roman
>> >
>>


Re: [DISCUSS] Allow sharing (RocksDB) memory between slots

2022-11-09 Thread Roman Khachatryan
Hi Yanfei,

Thanks, good questions

> 1. Is shared-memory only for the state backend? If both
> "taskmanager.memory.managed.shared-fraction: >0" and
> "state.backend.rocksdb.memory.managed: false" are set at the same time,
> will the shared-memory be wasted?
Yes, shared memory is only for the state backend currently;
If no job uses it then it will be wasted.
Session cluster can not validate this configuration
because the job configuration is not known in advance.

> 2. It's said that "Jobs 4 and 5 will use the same 750Mb of unmanaged
memory
> and will compete with each other" in the example, how is the memory size
of
> unmanaged part calculated?
It's calculated the same way as managed memory size currently,
i.e. taskmanager.memory.managed.size *
taskmanager.memory.managed.shared-fraction
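For example (the numbers are made up, chosen only to arrive at the 750Mb
mentioned in the example):

  taskmanager.memory.managed.size: 1000m
  taskmanager.memory.managed.shared-fraction: 0.75
  # => shared part = 1000m * 0.75 = 750m; the remaining 250m stays
  #    exclusive and is divided between the slots as before
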
Separate parameters for unmanaged memory would be more clear.
However, I doubt that this configuration would ever be used (I listed it
just for completeness).
So I'm not sure whether adding them would be justified.
WDYT?

> 3. For fine-grained-resource-management, the control
> of cpuCores, taskHeapMemory can still work, right?
Yes, for other resources fine-grained-resource-management should work.

> And I am a little
> worried that too many memory-about configuration options are complicated
> for users to understand.
I'm also worried about having too many options, but I don't see any better
alternative.
The existing users definitely shouldn't be affected,
so there must be at least a feature toggle ("shared-fraction").
"share-scope" could potentially be replaced by some inference logic,
but having it explicit seems less error-prone.

Regards,
Roman


On Wed, Nov 9, 2022 at 3:50 AM Yanfei Lei  wrote:

> Hi Roman,
> Thanks for the proposal, this allows State Backend to make better use of
> memory.
>
> After reading the ticket, I'm curious about some points:
>
> 1. Is shared-memory only for the state backend? If both
> "taskmanager.memory.managed.shared-fraction: >0" and
> "state.backend.rocksdb.memory.managed: false" are set at the same time,
> will the shared-memory be wasted?
> 2. It's said that "Jobs 4 and 5 will use the same 750Mb of unmanaged memory
> and will compete with each other" in the example, how is the memory size of
> unmanaged part calculated?
> 3. For fine-grained-resource-management, the control
> of cpuCores, taskHeapMemory can still work, right?  And I am a little
> worried that too many memory-related configuration options are complicated
> for users to understand.
>
> Regards,
> Yanfei
>
> Roman Khachatryan  wrote on Tue, Nov 8, 2022 at 23:22:
>
> > Hi everyone,
> >
> > I'd like to discuss sharing RocksDB memory across slots as proposed in
> > FLINK-29928 [1].
> >
> > Since 1.10 / FLINK-7289 [2], it is possible to:
> > - share these objects among RocksDB instances of the same slot
> > - bound the total memory usage by all RocksDB instances of a TM
> >
> > However, the memory is divided between the slots equally (unless using
> > fine-grained resource control). This is sub-optimal if some slots contain
> > more memory intensive tasks than the others.
> > Using fine-grained resource control is also often not an option because
> the
> > workload might not be known in advance.
> >
> > The proposal is to widen the scope of sharing memory to TM, so that it
> can
> > be shared across all RocksDB instances of that TM. That would reduce the
> > overall memory consumption in exchange for resource isolation.
> >
> > Please see FLINK-29928 [1] for more details.
> >
> > Looking forward to feedback on that proposal.
> >
> > [1]
> > https://issues.apache.org/jira/browse/FLINK-29928
> > [2]
> > https://issues.apache.org/jira/browse/FLINK-7289
> >
> > Regards,
> > Roman
> >
>


Re: [DISCUSS] Allow sharing (RocksDB) memory between slots

2022-11-08 Thread Yanfei Lei
Hi Roman,
Thanks for the proposal, this allows State Backend to make better use of
memory.

After reading the ticket, I'm curious about some points:

1. Is shared-memory only for the state backend? If both
"taskmanager.memory.managed.shared-fraction: >0" and
"state.backend.rocksdb.memory.managed: false" are set at the same time,
will the shared-memory be wasted?
2. It's said that "Jobs 4 and 5 will use the same 750Mb of unmanaged memory
and will compete with each other" in the example, how is the memory size of
unmanaged part calculated?
3. For fine-grained-resource-management, the control
of cpuCores, taskHeapMemory can still work, right?  And I am a little
worried that too many memory-related configuration options are complicated
for users to understand.

Regards,
Yanfei

Roman Khachatryan  wrote on Tue, Nov 8, 2022 at 23:22:

> Hi everyone,
>
> I'd like to discuss sharing RocksDB memory across slots as proposed in
> FLINK-29928 [1].
>
> Since 1.10 / FLINK-7289 [2], it is possible to:
> - share these objects among RocksDB instances of the same slot
> - bound the total memory usage by all RocksDB instances of a TM
>
> However, the memory is divided between the slots equally (unless using
> fine-grained resource control). This is sub-optimal if some slots contain
> more memory intensive tasks than the others.
> Using fine-grained resource control is also often not an option because the
> workload might not be known in advance.
>
> The proposal is to widen the scope of sharing memory to TM, so that it can
> be shared across all RocksDB instances of that TM. That would reduce the
> overall memory consumption in exchange for resource isolation.
>
> Please see FLINK-29928 [1] for more details.
>
> Looking forward to feedback on that proposal.
>
> [1]
> https://issues.apache.org/jira/browse/FLINK-29928
> [2]
> https://issues.apache.org/jira/browse/FLINK-7289
>
> Regards,
> Roman
>