Let me explain this with a concrete example, Till.

Let's say we have the following scenario.

Total Process Memory: 1GB
JVM Direct Memory (Task Off-Heap Memory + JVM Overhead): 200MB
Other Memory (JVM Heap Memory, JVM Metaspace, Off-Heap Managed Memory and
Network Memory): 800MB


For alternative 2, we set -XX:MaxDirectMemorySize to 200MB.
For alternative 3, we set -XX:MaxDirectMemorySize to a very large value,
let's say 1TB.
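
To make this concrete, here is a minimal sketch (not actual Flink code;
the 150MB / 50MB split of the 200MB direct budget is only an assumption
for this example) of how the flag value would be derived in each case:

    // Minimal sketch, not actual Flink code. The 150MB / 50MB split of the
    // 200MB direct budget is an assumption made only for this example.
    public class MaxDirectMemoryExample {
        private static final long MB = 1024 * 1024;

        public static void main(String[] args) {
            long taskOffHeap = 150 * MB; // assumed Task Off-Heap Memory
            long jvmOverhead = 50 * MB;  // assumed JVM Overhead

            // Alternative 2: cap direct memory at exactly the reserved budget.
            long alt2Limit = taskOffHeap + jvmOverhead;  // 200MB

            // Alternative 3: effectively no cap, e.g. 1TB.
            long alt3Limit = 1024L * 1024 * MB;          // 1TB

            System.out.println("alternative 2: -XX:MaxDirectMemorySize=" + alt2Limit / MB + "m");
            System.out.println("alternative 3: -XX:MaxDirectMemorySize=" + alt3Limit / MB + "m");
        }
    }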

If the actual direct memory usage of Task Off-Heap Memory and JVM Overhead
does not exceed 200MB, then alternative 2 and alternative 3 should have the
same utility. Setting a larger -XX:MaxDirectMemorySize will not reduce the
sizes of the other memory pools.

If the actual direct memory usage of Task Off-Heap Memory and JVM
Overhead can potentially exceed 200MB, then

   - Alternative 2 suffers from frequent OOMs. To avoid that, the only thing
   the user can do is to modify the configuration and increase JVM Direct Memory
   (Task Off-Heap Memory + JVM Overhead). Let's say the user increases JVM
   Direct Memory to 250MB; this will reduce the total size of the other memory
   pools to 750MB, given that the total process memory remains 1GB.
   - For alternative 3, there is no chance of a direct OOM. There are chances
   of exceeding the total process memory limit, but given that the process may
   not use up all the reserved native memory (Off-Heap Managed Memory, Network
   Memory, JVM Metaspace), if the actual direct memory usage is slightly above
   yet very close to 200MB, the user probably does not need to change the
   configuration (see the sketch after this list).
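
To put rough numbers on the two cases above, here is a small sketch of the
arithmetic (all values in MB; the 210MB actual direct memory usage is an
assumed value for illustration, not something from the FLIP):

    // Sketch of the arithmetic behind the two bullets above. All values are in
    // MB; the 210MB "actual usage" is an assumption made only for illustration.
    public class Alternative2Vs3 {
        public static void main(String[] args) {
            long totalProcess    = 1000; // the "1GB" total process memory
            long directBudget    = 200;  // Task Off-Heap Memory + JVM Overhead
            long increasedBudget = 250;  // what the user raises it to for alternative 2
            long actualDirect    = 210;  // assumed actual direct memory usage

            // Alternative 2: the limit must cover the actual usage, so the user
            // raises it, which shrinks the other pools and leaves part of the
            // enlarged budget unused.
            long otherPoolsAlt2   = totalProcess - increasedBudget;  // 750
            long unusedDirectAlt2 = increasedBudget - actualDirect;  // 40, reserved but idle

            // Alternative 3: no direct limit is hit and the other pools keep their
            // 800MB; the 10MB overshoot can often be absorbed by reserved but
            // unused native memory.
            long otherPoolsAlt3 = totalProcess - directBudget;       // 800

            System.out.println("alternative 2 other pools: " + otherPoolsAlt2
                    + "MB (unused direct budget: " + unusedDirectAlt2 + "MB)");
            System.out.println("alternative 3 other pools: " + otherPoolsAlt3 + "MB");
        }
    }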

Therefore, I think that from the user's perspective, a feasible configuration
for alternative 2 may lead to lower resource utilization compared to
alternative 3.

Thank you~

Xintong Song



On Fri, Aug 16, 2019 at 10:28 AM Till Rohrmann <trohrm...@apache.org> wrote:

> I guess you have to help me understand the difference between alternatives 2
> and 3 w.r.t. memory under-utilization, Xintong.
>
> - Alternative 2: set -XX:MaxDirectMemorySize to Task Off-Heap Memory and JVM
> Overhead. Then there is the risk that this size is too low, resulting in a
> lot of garbage collection and potentially an OOM.
> - Alternative 3: set -XX:MaxDirectMemorySize to something larger than in
> alternative 2. This would of course reduce the sizes of the other memory
> types.
>
> How would alternative 2 now result in an under-utilization of memory
> compared to alternative 3? If alternative 3 strictly sets a higher max
> direct memory size and we only use a little of it, then I would expect that
> alternative 3 results in memory under-utilization.
>
> Cheers,
> Till
>
> On Tue, Aug 13, 2019 at 4:19 PM Yang Wang <danrtsey...@gmail.com> wrote:
>
> > Hi Xintong, Till,
> >
> >
> > > Native and Direct Memory
> >
> > My point is about setting a very large max direct memory size when we do not
> > differentiate direct and native memory. If the direct memory, including user
> > direct memory and framework direct memory, could be calculated correctly,
> > then I am in favor of setting the direct memory to a fixed value.
> >
> >
> >
> > > Memory Calculation
> >
> > I agree with Xintong. For Yarn and K8s, we need to check the memory
> > configurations on the client side to avoid the submission succeeding and then
> > failing in the Flink master.
> >
> >
> > Best,
> >
> > Yang
> >
> > Xintong Song <tonysong...@gmail.com> wrote on Tue, Aug 13, 2019 at 22:07:
> >
> > > Thanks for replying, Till.
> > >
> > > About MemorySegment, I think you are right that we should not include this
> > > issue in the scope of this FLIP. This FLIP should concentrate on how to
> > > configure memory pools for TaskExecutors, with minimal involvement in how
> > > memory consumers use it.
> > >
> > > About direct memory, I think alternative 3 may not have the same
> > > over-reservation issue that alternative 2 does, but at the cost of risking
> > > overuse of memory at the container level, which is not good. My point is that
> > > both "Task Off-Heap Memory" and "JVM Overhead" are not easy to configure. For
> > > alternative 2, users might configure them higher than what is actually
> > > needed, just to avoid getting a direct OOM. For alternative 3, users do not
> > > get a direct OOM, so they may not configure the two options aggressively
> > > high. But the consequence is a risk that the overall container memory usage
> > > exceeds the budget.
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > >
> > >
> > > On Tue, Aug 13, 2019 at 9:39 AM Till Rohrmann <trohrm...@apache.org>
> > > wrote:
> > >
> > > > Thanks for proposing this FLIP Xintong.
> > > >
> > > > All in all I think it already looks quite good. Concerning the first
> > open
> > > > question about allocating memory segments, I was wondering whether
> this
> > > is
> > > > strictly necessary to do in the context of this FLIP or whether this
> > > could
> > > > be done as a follow up? Without knowing all details, I would be
> > concerned
> > > > that we would widen the scope of this FLIP too much because we would
> > have
> > > > to touch all the existing call sites of the MemoryManager where we
> > > allocate
> > > > memory segments (this should mainly be batch operators). The addition
> > of
> > > > the memory reservation call to the MemoryManager should not be
> affected
> > > by
> > > > this and I would hope that this is the only point of interaction a
> > > > streaming job would have with the MemoryManager.
> > > >
> > > > Concerning the second open question about setting or not setting a
> max
> > > > direct memory limit, I would also be interested why Yang Wang thinks
> > > > leaving it open would be best. My concern about this would be that we
> > > would
> > > > be in a similar situation as we are now with the RocksDBStateBackend.
> > If
> > > > the different memory pools are not clearly separated and can spill
> over
> > > to
> > > > a different pool, then it is quite hard to understand what exactly
> > > causes a
> > > > process to get killed for using too much memory. This could then easily
> > > > lead to a similar situation to what we have with the cutoff-ratio. So why
> > > > not set a sane default value for max direct memory and give the user an
> > > > option to increase it if he runs into an OOM.
> > > >
> > > > @Xintong, how would alternative 2 lead to lower memory utilization
> than
> > > > alternative 3 where we set the direct memory to a higher value?
> > > >
> > > > Cheers,
> > > > Till
> > > >
> > > > On Fri, Aug 9, 2019 at 9:12 AM Xintong Song <tonysong...@gmail.com>
> > > wrote:
> > > >
> > > > > Thanks for the feedback, Yang.
> > > > >
> > > > > Regarding your comments:
> > > > >
> > > > > *Native and Direct Memory*
> > > > > I think setting a very large max direct memory size definitely has
> > some
> > > > > good sides. E.g., we do not worry about direct OOM, and we don't
> even
> > > > need
> > > > > to allocate managed / network memory with Unsafe.allocate() .
> > > > > However, there are also some down sides of doing this.
> > > > >
> > > > >    - One thing I can think of is that if a task executor container is
> > > > >    killed due to overusing memory, it could be hard for us to know which
> > > > >    part of the memory is overused.
> > > > >    - Another downside is that the JVM never triggers GC due to reaching
> > > > >    the max direct memory limit, because the limit is too high to be
> > > > >    reached. That means we kind of rely on heap memory to trigger GC and
> > > > >    release direct memory. That could be a problem in cases where we have
> > > > >    more direct memory usage but not enough heap activity to trigger the GC.
> > > > >
> > > > > Maybe you can share your reasons for preferring to set a very large
> > > > > value, in case there is anything else I overlooked.
> > > > >
> > > > > *Memory Calculation*
> > > > > If there is any conflict between multiple configurations that the user
> > > > > explicitly specified, I think we should throw an error.
> > > > > I think doing the checking on the client side is a good idea, so that on
> > > > > Yarn / K8s we can discover the problem before submitting the Flink
> > > > > cluster, which is always a good thing.
> > > > > But we cannot rely only on the client-side checking, because for
> > > > > standalone clusters, TaskManagers on different machines may have
> > > > > different configurations that the client does not see.
> > > > > What do you think?
> > > > >
> > > > > Thank you~
> > > > >
> > > > > Xintong Song
> > > > >
> > > > >
> > > > >
> > > > > On Thu, Aug 8, 2019 at 5:09 PM Yang Wang <danrtsey...@gmail.com>
> > > wrote:
> > > > >
> > > > > > Hi Xintong,
> > > > > >
> > > > > >
> > > > > > Thanks for your detailed proposal. After all the memory configurations
> > > > > > are introduced, we will have more powerful control over the Flink
> > > > > > memory usage. I just have a few questions about it.
> > > > > >
> > > > > >
> > > > > >
> > > > > >    - Native and Direct Memory
> > > > > >
> > > > > > We do not differentiate user direct memory and native memory. They are
> > > > > > all included in task off-heap memory, right? So I don't think we can set
> > > > > > the -XX:MaxDirectMemorySize properly. I prefer leaving it at a very
> > > > > > large value.
> > > > > >
> > > > > >
> > > > > >
> > > > > >    - Memory Calculation
> > > > > >
> > > > > > If the sum of the fine-grained memory pools (network memory, managed
> > > > > > memory, etc.) is larger than the total process memory, how do we deal
> > > > > > with this situation? Do we need to check the memory configuration on
> > > > > > the client side?
> > > > > >
> > > > > > Xintong Song <tonysong...@gmail.com> wrote on Wed, Aug 7, 2019 at 10:14 PM:
> > > > > >
> > > > > > > Hi everyone,
> > > > > > >
> > > > > > > We would like to start a discussion thread on "FLIP-49: Unified
> > > > Memory
> > > > > > > Configuration for TaskExecutors"[1], where we describe how to
> > > improve
> > > > > > > TaskExecutor memory configurations. The FLIP document is mostly
> > > based
> > > > > on
> > > > > > an
> > > > > > > early design "Memory Management and Configuration Reloaded"[2]
> by
> > > > > > Stephan,
> > > > > > > with updates from follow-up discussions both online and
> offline.
> > > > > > >
> > > > > > > This FLIP addresses several shortcomings of current (Flink 1.9)
> > > > > > > TaskExecutor memory configuration.
> > > > > > >
> > > > > > >    - Different configuration for Streaming and Batch.
> > > > > > >    - Complex and difficult configuration of RocksDB in
> Streaming.
> > > > > > >    - Complicated, uncertain and hard to understand.
> > > > > > >
> > > > > > >
> > > > > > > Key changes to solve the problems can be summarized as follows.
> > > > > > >
> > > > > > >    - Extend memory manager to also account for memory usage by
> > > state
> > > > > > >    backends.
> > > > > > >    - Modify how TaskExecutor memory is partitioned into individual
> > > > > > >    memory reservations and pools.
> > > > > > >    - Simplify memory configuration options and calculation logic.
> > > > > > >
> > > > > > >
> > > > > > > Please find more details in the FLIP wiki document [1].
> > > > > > >
> > > > > > > (Please note that the early design doc [2] is out of sync, and it
> > > > > > > would be appreciated to have the discussion in this mailing list
> > > > > > > thread.)
> > > > > > >
> > > > > > >
> > > > > > > Looking forward to your feedback.
> > > > > > >
> > > > > > >
> > > > > > > Thank you~
> > > > > > >
> > > > > > > Xintong Song
> > > > > > >
> > > > > > >
> > > > > > > [1]
> > > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-49%3A+Unified+Memory+Configuration+for+TaskExecutors
> > > > > > >
> > > > > > > [2]
> > > > > > > https://docs.google.com/document/d/1o4KvyyXsQMGUastfPin3ZWeUXWsJgoL7piqp1fFYJvA/edit?usp=sharing
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
