My main concern with option 2 (manually release memory) is that segfaults
in the JVM set off all sorts of alarms on the user end. So we need to
guarantee that this never happens.

The trickiness is in tasks that use data structures / algorithms with
additional threads, like hash table spill/read and sorting threads. We need
to ensure that these threads shut down cleanly before we can exit the task.
I am not sure that we have that guaranteed already; that's why option 1.1
seemed simpler to me.
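
To make the hazard concrete, here is a minimal standalone sketch of the
failure mode (not Flink code; it grabs sun.misc.Unsafe reflectively, which
works on Java 8, and the class/thread names are purely illustrative):

    import sun.misc.Unsafe;
    import java.lang.reflect.Field;

    public class UseAfterFreeSketch {

        // Obtain the Unsafe instance reflectively (Java 8 style).
        private static Unsafe unsafe() throws Exception {
            Field f = Unsafe.class.getDeclaredField("theUnsafe");
            f.setAccessible(true);
            return (Unsafe) f.get(null);
        }

        public static void main(String[] args) throws Exception {
            final Unsafe u = unsafe();
            // Stand-in for an off-heap MemorySegment owned by a task.
            final long address = u.allocateMemory(32 * 1024 * 1024);

            // Stand-in for a spill/sort thread that keeps writing to it.
            Thread spillThread = new Thread(() -> {
                for (int i = 0; i < 10_000_000; i++) {
                    u.putLong(address + (i % 1024) * 8L, i);
                }
            });
            spillThread.start();

            Thread.sleep(10);
            // The main thread frees the segment while the spill thread still
            // runs: the writes above become use-after-free and can crash the
            // whole JVM with a segfault instead of throwing an exception.
            u.freeMemory(address);
            spillThread.join();
        }
    }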

On Mon, Aug 19, 2019 at 3:42 PM Xintong Song <tonysong...@gmail.com> wrote:

> Thanks for the comments, Stephan. Summarizing it this way really makes
> things easier to understand.
>
> I'm in favor of option 2, at least for the moment. I think it is not that
> difficult to keep the memory manager segfault safe, as long as we always
> de-allocate the memory segment when it is released from the memory
> consumers. A segfault can only happen if a memory consumer continues using
> the buffer of a memory segment after releasing it, in which case we do want
> the job to fail so that we detect the memory leak early.
>
> For option 1.2, I don't think this is a good idea. Not only because the
> assumption (regular GC is enough to clean direct buffers) may not always be
> true, but also because it makes it harder to find problems in cases of
> memory overuse. E.g., say the user configured some direct memory for the
> user libraries. If the libraries actually use more direct memory than
> configured, and that memory cannot be cleaned by GC because it is still in
> use, this may lead to overuse of the total container memory. In that case,
> if the usage doesn't reach the JVM default max direct memory limit, we
> cannot get a direct memory OOM and it becomes super hard to understand
> which part of the configuration needs to be updated.
>
> For option 1.1, it has a similar problem to 1.2 if the excess direct
> memory does not reach the max direct memory limit specified by the
> dedicated parameter. I think it is slightly better than 1.2, only because
> we can tune the parameter.
>
> Thank you~
>
> Xintong Song
>
>
>
> On Mon, Aug 19, 2019 at 2:53 PM Stephan Ewen <se...@apache.org> wrote:
>
> > About the "-XX:MaxDirectMemorySize" discussion, maybe let me summarize
> it a
> > bit differently:
> >
> > We have the following two options:
> >
> > (1) We let MemorySegments be de-allocated by the GC. That makes it
> segfault
> > safe. But then we need a way to trigger GC in case de-allocation and
> > re-allocation of a bunch of segments happens quickly, which is often the
> > case during batch scheduling or task restart.
> >   - The "-XX:MaxDirectMemorySize" (option 1.1) is one way to do this
> >   - Another way could be to have a dedicated bookkeeping in the
> > MemoryManager (option 1.2), so that this is a number independent of the
> > "-XX:MaxDirectMemorySize" parameter.
> >
> > (2) We manually allocate and de-allocate the memory for the
> MemorySegments
> > (option 2). That way we need not worry about triggering GC by some
> > threshold or bookkeeping, but it is harder to prevent segfaults. We need
> to
> > be very careful about when we release the memory segments (only in the
> > cleanup phase of the main thread).
> >
> > If we go with option 1.1, we probably need to set
> > "-XX:MaxDirectMemorySize" to "off_heap_managed_memory + direct_memory"
> and
> > have "direct_memory" as a separate reserved memory pool. Because if we
> just
> > set "-XX:MaxDirectMemorySize" to "off_heap_managed_memory +
> jvm_overhead",
> > then there will be times when that entire memory is allocated by direct
> > buffers and we have nothing left for the JVM overhead. So we either need
> a
> > way to compensate for that (again some safety margin cutoff value) or we
> > will exceed container memory.
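> >
> > Just to make that arithmetic concrete (all numbers made up for
> > illustration): with off_heap_managed_memory = 1024 MB and a reserved
> > direct_memory pool of 64 MB, we would set -XX:MaxDirectMemorySize to
> > 1024 MB + 64 MB = 1088 MB, while the JVM overhead stays outside that
> > limit as its own separate budget.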
> >
> > If we go with option 1.2, we need to be aware that it takes elaborate
> logic
> > to push recycling of direct buffers without always triggering a full GC.
> >
> >
> > My first guess is that the options will be easiest to do in the following
> > order:
> >
> >   - Option 1.1 with a dedicated direct_memory parameter, as discussed
> > above. We would need to find a way to set the direct_memory parameter by
> > default. We could start with 64 MB and see how it goes in practice. One
> > danger I see is that setting this too low can cause a bunch of additional
> > GCs compared to before (we need to watch this carefully).
> >
> >   - Option 2. It is actually quite simple to implement; we could try how
> > segfault safe we are at the moment.
> >
> >   - Option 1.2: We would not touch the "-XX:MaxDirectMemorySize"
> parameter
> > at all and assume that all the direct memory allocations that the JVM and
> > Netty do are infrequent enough to be cleaned up fast enough through
> regular
> > GC. I am not sure if that is a valid assumption, though.
> >
> > Best,
> > Stephan
> >
> >
> >
> > On Fri, Aug 16, 2019 at 2:16 PM Xintong Song <tonysong...@gmail.com>
> > wrote:
> >
> > > Thanks for sharing your opinion Till.
> > >
> > > I'm also in favor of alternative 2. I was wondering whether we can
> avoid
> > > using Unsafe.allocate() for off-heap managed memory and network memory
> > with
> > > alternative 3. But after giving it a second thought, I think even for
> > > alternative 3 using direct memory for off-heap managed memory could
> cause
> > > problems.
> > >
> > > Hi Yang,
> > >
> > > Regarding your concern, I think what is proposed in this FLIP is to
> > > have both off-heap managed memory and network memory allocated through
> > > Unsafe.allocate(), which means they are practically native memory and
> > > not limited by the JVM max direct memory. The only parts of memory
> > > limited by the JVM max direct memory are task off-heap memory and JVM
> > > overhead, which is exactly what alternative 2 suggests setting the JVM
> > > max direct memory to.
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > >
> > >
> > > On Fri, Aug 16, 2019 at 1:48 PM Till Rohrmann <trohrm...@apache.org>
> > > wrote:
> > >
> > > > Thanks for the clarification Xintong. I understand the two
> alternatives
> > > > now.
> > > >
> > > > I would be in favour of option 2 because it makes things explicit. If
> > we
> > > > don't limit the direct memory, I fear that we might end up in a
> similar
> > > > situation as we are currently in: The user might see that her process
> > > gets
> > > > killed by the OS and does not know why this is the case.
> Consequently,
> > > she
> > > > tries to decrease the process memory size (similar to increasing the
> > > cutoff
> > > > ratio) in order to accommodate for the extra direct memory. Even
> worse,
> > > she
> > > > tries to decrease memory budgets which are not fully used and hence
> > won't
> > > > change the overall memory consumption.
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Fri, Aug 16, 2019 at 11:01 AM Xintong Song <tonysong...@gmail.com
> >
> > > > wrote:
> > > >
> > > > > Let me explain this with a concrete example Till.
> > > > >
> > > > > Let's say we have the following scenario.
> > > > >
> > > > > Total Process Memory: 1GB
> > > > > JVM Direct Memory (Task Off-Heap Memory + JVM Overhead): 200MB
> > > > > Other Memory (JVM Heap Memory, JVM Metaspace, Off-Heap Managed Memory
> > > > > and Network Memory): 800MB
> > > > >
> > > > >
> > > > > For alternative 2, we set -XX:MaxDirectMemorySize to 200MB.
> > > > > For alternative 3, we set -XX:MaxDirectMemorySize to a very large
> > > > > value, let's say 1TB.
> > > > >
> > > > > If the actual direct memory usage of Task Off-Heap Memory and JVM
> > > > > Overhead does not exceed 200MB, then alternative 2 and alternative 3
> > > > > should have the same utility. Setting a larger -XX:MaxDirectMemorySize
> > > > > will not reduce the sizes of the other memory pools.
> > > > >
> > > > > If the actual direct memory usage of Task Off-Heap Memory and JVM
> > > > > Overhead potentially exceeds 200MB, then
> > > > >
> > > > >    - Alternative 2 suffers from frequent OOMs. To avoid that, the
> > > > >    only thing the user can do is to modify the configuration and
> > > > >    increase JVM Direct Memory (Task Off-Heap Memory + JVM Overhead).
> > > > >    Let's say the user increases JVM Direct Memory to 250MB; this
> > > > >    reduces the total size of the other memory pools to 750MB, given
> > > > >    that the total process memory remains 1GB.
> > > > >    - For alternative 3, there is no chance of a direct OOM. There are
> > > > >    chances of exceeding the total process memory limit, but given
> > > > >    that the process may not use up all the reserved native memory
> > > > >    (Off-Heap Managed Memory, Network Memory, JVM Metaspace), if the
> > > > >    actual direct memory usage is slightly above yet very close to
> > > > >    200MB, the user probably does not need to change the
> > > > >    configurations.
> > > > >
> > > > > Therefore, I think from the user's perspective, a feasible
> > > > > configuration for alternative 2 may lead to lower resource
> > > > > utilization compared to alternative 3.
> > > > >
> > > > > Thank you~
> > > > >
> > > > > Xintong Song
> > > > >
> > > > >
> > > > >
> > > > > On Fri, Aug 16, 2019 at 10:28 AM Till Rohrmann <
> trohrm...@apache.org
> > >
> > > > > wrote:
> > > > >
> > > > > > I guess you have to help me understand the difference between
> > > > > > alternatives 2 and 3 w.r.t. memory under-utilization, Xintong.
> > > > > >
> > > > > > - Alternative 2: set XX:MaxDirectMemorySize to Task Off-Heap
> Memory
> > > and
> > > > > JVM
> > > > > > Overhead. Then there is the risk that this size is too low
> > resulting
> > > > in a
> > > > > > lot of garbage collection and potentially an OOM.
> > > > > > - Alternative 3: set XX:MaxDirectMemorySize to something larger
> > than
> > > > > > alternative 2. This would of course reduce the sizes of the other
> > > > memory
> > > > > > types.
> > > > > >
> > > > > > How would alternative 2 now result in an under utilization of
> > memory
> > > > > > compared to alternative 3? If alternative 3 strictly sets a
> higher
> > > max
> > > > > > direct memory size and we use only little, then I would expect
> that
> > > > > > alternative 3 results in memory under utilization.
> > > > > >
> > > > > > Cheers,
> > > > > > Till
> > > > > >
> > > > > > On Tue, Aug 13, 2019 at 4:19 PM Yang Wang <danrtsey...@gmail.com
> >
> > > > wrote:
> > > > > >
> > > > > > > Hi Xintong, Till,
> > > > > > >
> > > > > > >
> > > > > > > > Native and Direct Memory
> > > > > > >
> > > > > > > My point is about setting a very large max direct memory size
> > > > > > > when we do not differentiate direct and native memory. If the
> > > > > > > direct memory, including user direct memory and framework direct
> > > > > > > memory, could be calculated correctly, then I am in favor of
> > > > > > > setting the direct memory to a fixed value.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > Memory Calculation
> > > > > > >
> > > > > > > I agree with Xintong. For Yarn and K8s, we need to check the
> > > > > > > memory configurations on the client to avoid submitting
> > > > > > > successfully and then failing in the Flink master.
> > > > > > >
> > > > > > >
> > > > > > > Best,
> > > > > > >
> > > > > > > Yang
> > > > > > >
> > > > > > > Xintong Song <tonysong...@gmail.com> wrote on Tue, Aug 13, 2019
> > > > > > > at 22:07:
> > > > > > >
> > > > > > > > Thanks for replying, Till.
> > > > > > > >
> > > > > > > > About MemorySegment, I think you are right that we should not
> > > > include
> > > > > > > this
> > > > > > > > issue in the scope of this FLIP. This FLIP should concentrate
> > on
> > > > how
> > > > > to
> > > > > > > > configure memory pools for TaskExecutors, with minimal
> > > > > > > > involvement in how memory consumers use it.
> > > > > > > >
> > > > > > > > About direct memory, I think alternative 3 may not have the
> > > > > > > > same over-reservation issue that alternative 2 does, but at the
> > > > > > > > cost of a risk of overusing memory at the container level,
> > > > > > > > which is not good. My point is that both "Task Off-Heap Memory"
> > > > > > > > and "JVM Overhead" are not easy to configure. For alternative
> > > > > > > > 2, users might configure them higher than what is actually
> > > > > > > > needed, just to avoid getting a direct OOM. For alternative 3,
> > > > > > > > users do not get direct OOMs, so they may not configure the two
> > > > > > > > options aggressively high. But the consequence is a risk that
> > > > > > > > the overall container memory usage exceeds the budget.
> > > > > > > >
> > > > > > > > Thank you~
> > > > > > > >
> > > > > > > > Xintong Song
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > On Tue, Aug 13, 2019 at 9:39 AM Till Rohrmann <
> > > > trohrm...@apache.org>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Thanks for proposing this FLIP Xintong.
> > > > > > > > >
> > > > > > > > > All in all I think it already looks quite good. Concerning
> > the
> > > > > first
> > > > > > > open
> > > > > > > > > question about allocating memory segments, I was wondering
> > > > whether
> > > > > > this
> > > > > > > > is
> > > > > > > > > strictly necessary to do in the context of this FLIP or
> > whether
> > > > > this
> > > > > > > > could
> > > > > > > > > be done as a follow up? Without knowing all details, I
> would
> > be
> > > > > > > concerned
> > > > > > > > > that we would widen the scope of this FLIP too much because
> > we
> > > > > would
> > > > > > > have
> > > > > > > > > to touch all the existing call sites of the MemoryManager
> > where
> > > > we
> > > > > > > > allocate
> > > > > > > > > memory segments (this should mainly be batch operators).
> The
> > > > > addition
> > > > > > > of
> > > > > > > > > the memory reservation call to the MemoryManager should not
> > be
> > > > > > affected
> > > > > > > > by
> > > > > > > > > this and I would hope that this is the only point of
> > > interaction
> > > > a
> > > > > > > > > streaming job would have with the MemoryManager.
> > > > > > > > >
> > > > > > > > > Concerning the second open question about setting or not
> > > setting
> > > > a
> > > > > > max
> > > > > > > > > direct memory limit, I would also be interested why Yang
> Wang
> > > > > thinks
> > > > > > > > > leaving it open would be best. My concern about this would
> be
> > > > that
> > > > > we
> > > > > > > > would
> > > > > > > > > be in a similar situation as we are now with the
> > > > > RocksDBStateBackend.
> > > > > > > If
> > > > > > > > > the different memory pools are not clearly separated and
> can
> > > > spill
> > > > > > over
> > > > > > > > to
> > > > > > > > > a different pool, then it is quite hard to understand what
> > > > exactly
> > > > > > > > causes a
> > > > > > > > > process to get killed for using too much memory. This could
> > > > > > > > > then easily lead to a similar situation to what we have with
> > > > > > > > > the cutoff-ratio. So why not set a sane default value for max
> > > > > > > > > direct memory and give the user an option to increase it if
> > > > > > > > > he runs into an OOM?
> > > > > > > > >
> > > > > > > > > @Xintong, how would alternative 2 lead to lower memory
> > > > utilization
> > > > > > than
> > > > > > > > > alternative 3 where we set the direct memory to a higher
> > value?
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Till
> > > > > > > > >
> > > > > > > > > On Fri, Aug 9, 2019 at 9:12 AM Xintong Song <
> > > > tonysong...@gmail.com
> > > > > >
> > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Thanks for the feedback, Yang.
> > > > > > > > > >
> > > > > > > > > > Regarding your comments:
> > > > > > > > > >
> > > > > > > > > > *Native and Direct Memory*
> > > > > > > > > > I think setting a very large max direct memory size
> > > definitely
> > > > > has
> > > > > > > some
> > > > > > > > > > good sides. E.g., we do not worry about direct OOM, and
> we
> > > > don't
> > > > > > even
> > > > > > > > > need
> > > > > > > > > > to allocate managed / network memory with
> > Unsafe.allocate() .
> > > > > > > > > > However, there are also some down sides of doing this.
> > > > > > > > > >
> > > > > > > > > >    - One thing I can think of is that if a task executor
> > > > > > > > > >    container is killed due to overusing memory, it could be
> > > > > > > > > >    hard for us to know which part of the memory is overused.
> > > > > > > > > >    - Another downside is that the JVM never triggers GC due
> > > > > > > > > >    to reaching the max direct memory limit, because the
> > > > > > > > > >    limit is too high to be reached. That means we kind of
> > > > > > > > > >    rely on heap memory to trigger GC and release direct
> > > > > > > > > >    memory. That could be a problem in cases where we have
> > > > > > > > > >    more direct memory usage but not enough heap activity to
> > > > > > > > > >    trigger the GC.
> > > > > > > > > >
> > > > > > > > > > Maybe you can share your reasons for preferring to set a
> > > > > > > > > > very large value, if there is anything else I overlooked.
> > > > > > > > > >
> > > > > > > > > > *Memory Calculation*
> > > > > > > > > > If there is any conflict between multiple configuration
> > > > > > > > > > values that the user explicitly specified, I think we
> > > > > > > > > > should throw an error.
> > > > > > > > > > I think doing the checking on the client side is a good
> > > > > > > > > > idea, so that on Yarn / K8s we can discover the problem
> > > > > > > > > > before submitting the Flink cluster, which is always a good
> > > > > > > > > > thing.
> > > > > > > > > > But we cannot rely only on the client side checking,
> > > > > > > > > > because for standalone clusters TaskManagers on different
> > > > > > > > > > machines may have different configurations and the client
> > > > > > > > > > does not see that.
> > > > > > > > > > What do you think?
> > > > > > > > > >
> > > > > > > > > > Thank you~
> > > > > > > > > >
> > > > > > > > > > Xintong Song
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Thu, Aug 8, 2019 at 5:09 PM Yang Wang <
> > > > danrtsey...@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi xintong,
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Thanks for your detailed proposal. After all the memory
> > > > > > > > > > > configuration options are introduced, we will have more
> > > > > > > > > > > powerful control over the Flink memory usage. I just have
> > > > > > > > > > > a few questions about it.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >    - Native and Direct Memory
> > > > > > > > > > >
> > > > > > > > > > > We do not differentiate user direct memory and native
> > > > > > > > > > > memory. They are all included in task off-heap memory,
> > > > > > > > > > > right? So I don't think we could set the
> > > > > > > > > > > -XX:MaxDirectMemorySize properly. I prefer leaving it at
> > > > > > > > > > > a very large value.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > >    - Memory Calculation
> > > > > > > > > > >
> > > > > > > > > > > If the sum of the fine-grained memory (network memory,
> > > > > > > > > > > managed memory, etc.) is larger than the total process
> > > > > > > > > > > memory, how do we deal with this situation? Do we need to
> > > > > > > > > > > check the memory configuration on the client?
> > > > > > > > > > >
> > > > > > > > > > > Xintong Song <tonysong...@gmail.com> wrote on Wed, Aug 7,
> > > > > > > > > > > 2019 at 10:14 PM:
> > > > > > > > > > >
> > > > > > > > > > > > Hi everyone,
> > > > > > > > > > > >
> > > > > > > > > > > > We would like to start a discussion thread on
> "FLIP-49:
> > > > > Unified
> > > > > > > > > Memory
> > > > > > > > > > > > Configuration for TaskExecutors"[1], where we
> describe
> > > how
> > > > to
> > > > > > > > improve
> > > > > > > > > > > > TaskExecutor memory configurations. The FLIP document
> > is
> > > > > mostly
> > > > > > > > based
> > > > > > > > > > on
> > > > > > > > > > > an
> > > > > > > > > > > > early design "Memory Management and Configuration
> > > > > Reloaded"[2]
> > > > > > by
> > > > > > > > > > > Stephan,
> > > > > > > > > > > > with updates from follow-up discussions both online
> and
> > > > > > offline.
> > > > > > > > > > > >
> > > > > > > > > > > > This FLIP addresses several shortcomings of current
> > > (Flink
> > > > > 1.9)
> > > > > > > > > > > > TaskExecutor memory configuration.
> > > > > > > > > > > >
> > > > > > > > > > > >    - Different configuration for Streaming and Batch.
> > > > > > > > > > > >    - Complex and difficult configuration of RocksDB
> in
> > > > > > Streaming.
> > > > > > > > > > > >    - Complicated, uncertain and hard to understand.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Key changes to solve the problems can be summarized
> as
> > > > > follows.
> > > > > > > > > > > >
> > > > > > > > > > > >    - Extend memory manager to also account for memory
> > > usage
> > > > > by
> > > > > > > > state
> > > > > > > > > > > >    backends.
> > > > > > > > > > > >    - Modify how TaskExecutor memory is partitioned into
> > > > > > > > > > > >    individual memory reservations and pools.
> > > > > > > > > > > >    - Simplify memory configuration options and
> > > > > > > > > > > >    calculation logic.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Please find more details in the FLIP wiki document
> [1].
> > > > > > > > > > > >
> > > > > > > > > > > > (Please note that the early design doc [2] is out of
> > > sync,
> > > > > and
> > > > > > it
> > > > > > > > is
> > > > > > > > > > > > appreciated to have the discussion in this mailing
> list
> > > > > > thread.)
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Looking forward to your feedback.
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > Thank you~
> > > > > > > > > > > >
> > > > > > > > > > > > Xintong Song
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > [1]
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> > > > > > > > > > > >
> > > > > > > > > > > > [2]
> > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
> https://docs.google.com/document/d/1o4KvyyXsQMGUastfPin3ZWeUXWsJgoL7piqp1fFYJvA/edit?usp=sharing
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >