Thanks for the insight Roman, and also for the GC tips.  There are two
reasons why I wanted to see this memory released.  First, as a way to
confirm my understanding of Flink's memory segment handling.  Second, I run
a single standalone cluster for both streaming and batch jobs, and the
cluster was being killed by the OS OOM killer (i.e. the Java runtime
process was killed, not a JVM OutOfMemoryError).

For the second part, I did some napkin calculations and tuned down the
number of TMs on the host.  This seems to help a bit: before, subsequent
batch jobs were being scheduled on fresh TMs that had not yet allocated
their memory, so as more TMs did work, more memory was used but never
released, and eventually the OS OOM killer stepped in.
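
As a rough illustration of the napkin math (these numbers are hypothetical,
not my actual setup):

    4 TMs x (1 GB heap + ~0.5 GB JVM overhead) ~= 6 GB  -> tight on an 8 GB host
    2 TMs x (1 GB heap + ~0.5 GB JVM overhead) ~= 3 GB  -> comfortable headroom

Once every TM has run at least one batch job, the cluster sits at that
steady-state footprint permanently, so the headroom has to account for it.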

My direction now (thanks to all I learned and the input in this thread) is
to

a) Not run streaming and batch jobs on the same cluster.  Their memory
models are different enough that this is not a good idea, and I certainly
don't want a streaming job to be impacted by a running batch job.

b) Move the batch jobs to a job cluster setup running in K8s.  I have had a
lot of trouble getting this to run stably due to K8s issues, but I think I
am very close now.
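
For reference, the rough shape of what I'm trying is a per-job container
based on the standalone job entry point, something like this (the job class
name is just a placeholder, not my real job):

    # JobManager container command (sketch, not my exact manifest)
    standalone-job.sh start-foreground --job-classname com.example.MyBatchJob
    # TaskManager container command
    taskmanager.sh start-foreground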

Thanks again

Tim

On Mon, Oct 14, 2019, 3:08 AM Roman Grebennikov <g...@dfdx.me> wrote:

> Forced GC does not mean that the JVM will even try to release the freed
> memory back to the operating system. This depends heavily on the JVM and
> garbage collector used for your Flink setup, but most probably it's JVM 8
> with the ParallelGC collector.
>
> ParallelGC is known to be not very aggressive about releasing free heap
> memory back to the OS. I see several possible solutions here:
> 1. Ask yourself whether you really need to release any memory back. Is
> there a logical reason behind it? The next time you submit the job, the
> memory is going to be reused.
> 2. You can switch to G1GC and use JVM args like "-XX:MaxHeapFreeRatio
> -XX:MinHeapFreeRatio" to make it more aggressive about releasing memory
> (see the sketch after this list).
> 3. You can use unofficial JVM builds from RedHat with the ShenandoahGC
> backport, which is also able to do the job:
> https://builds.shipilev.net/openjdk-shenandoah-jdk8/
> 4. Flink 1.10 will (hopefully) be able to run on JVM 11, where G1 is much
> more aggressive about releasing memory:
> https://bugs.openjdk.java.net/browse/JDK-8146436
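>
> For option 2, set via env.java.opts in flink-conf.yaml it could look
> something like this (the ratio values are only a starting point to tune,
> not a recommendation):
>
>   env.java.opts: "-XX:+UseG1GC -XX:MinHeapFreeRatio=20 -XX:MaxHeapFreeRatio=40"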
>
> Roman Grebennikov | g...@dfdx.me
>
>
> On Sat, Oct 12, 2019, at 08:38, Timothy Victor wrote:
>
> This part about the GC not cleaning up after the job finishes makes
> sense.  However, I observed that even after I run "jcmd <pid> GC.run" on
> the task manager process ID, the memory is still not released.  This is
> what concerns me.
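>
> For reference, this is roughly how I'm checking it, where <pid> is the TM
> process ID:
>
>   jcmd <pid> GC.run    # request a full GC
>   jmap -heap <pid>     # heap usage/capacity as the JVM sees it
>   ps -o rss= -p <pid>  # resident memory as the OS sees it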
>
> Tim
>
>
> On Sat, Oct 12, 2019, 2:53 AM Xintong Song <tonysong...@gmail.com> wrote:
>
> Generally yes, with one slight difference.
>
> Once the job is done, the buffers are released by the Flink task manager
> (because pre-allocation is configured to be disabled), but the
> corresponding memory may not be released by the JVM (because no GC has
> cleaned it up yet). So it's not the task manager that keeps the buffer
> around for the next batch job. When the next batch job runs, the task
> executor allocates new buffers, which reuse the memory of the previous
> buffers that the JVM hasn't released.
>
> Thank you~
>
> Xintong Song
>
>
>
>
> On Sat, Oct 12, 2019 at 7:28 AM Timothy Victor <vict...@gmail.com> wrote:
>
> Thanks Xintong!  In my case both of those parameters are set to false
> (the default).  I think I am sort of following what's happening here.
>
> I have one TM with the heap size set to 1GB.  When the cluster is started,
> the TM doesn't use that 1GB (no allocations).  Once the first batch job is
> submitted, I can see the memory go up by roughly 1GB.  I presume this is
> when the TM allocates its 1GB on the heap, and if I read correctly this is
> essentially a large byte buffer that is tenured so that it is never GCed.
> Flink serializes any POJOs into this byte buffer, essentially to
> circumvent GC for performance.  Once the job is done, this byte buffer
> remains on the heap, and the task manager keeps it to use for the next
> batch job.  This is why I never see the memory go down after a batch job
> completes.
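>
> (For context, the heap size here is just the usual flink-conf.yaml
> setting, e.g.:
>
>   taskmanager.heap.size: 1024m
>
> with the off-heap and preallocate flags left at their defaults.)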
>
> Does this make sense?  Please let me know what you think.
>
> Thanks
>
> Tim
>
> On Thu, Oct 10, 2019, 11:16 PM Xintong Song <tonysong...@gmail.com> wrote:
>
> I think it depends on your configuration.
> - Are you using on-heap or off-heap managed memory? (configured by
> 'taskmanager.memory.off-heap', false by default)
>
> - Is managed memory pre-allocated? (configured by
> 'taskmanager.memory.preallocate', false by default)
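>
> For reference, in flink-conf.yaml these would look like the following
> (the values shown are the defaults):
>
>   taskmanager.memory.off-heap: false
>   taskmanager.memory.preallocate: false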
>
>
> If managed memory is pre-allocated, then the allocated memory segments
> will never be released. If it's not pre-allocated, memory segments should
> be released when the task is finished, but the actual memory will not be
> de-allocated until the next GC. Since the job is finished, there may not
> be enough heap activity to trigger a GC. If on-heap memory is used, you
> may not be able to observe the decrease in TM memory usage, because the
> JVM heap size does not scale down. Only if off-heap memory is used might
> you be able to observe the decrease in TM memory usage after a GC, but not
> from a jmap dump, because jmap reports heap memory usage only.
>
>
> Besides, I don't think you need to worry about whether memory is released
> after one job is finished. Sometimes Flink/the JVM does not release memory
> after jobs/tasks finish, so that it can be reused directly by other
> jobs/tasks, reducing allocation/deallocation overhead and improving
> performance.
>
>
> Thank you~
>
> Xintong Song
>
>
>
>
> On Thu, Oct 10, 2019 at 7:55 PM Timothy Victor <vict...@gmail.com> wrote:
>
> After a batch job finishes in a Flink standalone cluster, I notice that
> the memory isn't freed up.  I understand Flink uses its own memory manager
> and just allocates a large tenured byte array that is not GC'ed.  But does
> the memory used in this byte array get released when the batch job is
> done?
>
> The scenario I am facing is that I am running a series of scheduled batch
> jobs on a standalone cluster with 1 TM and 1 slot.  I notice that after a
> job is complete, the memory used in the TM isn't freed up.  I can confirm
> this by taking a jmap dump.
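>
> (Roughly, the dump I'm looking at is taken with something like:
>
>   jmap -dump:live,format=b,file=tm-heap.hprof <pid>
>
> where <pid> is the TM process.)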
>
> Has anyone else run into this issue?   This is on 1.9.
>
> Thanks
>
> Tim
>
>
>
