Hi Tamir,

> The histogram has been taken from the Task Manager using the jcmd tool.

From that histogram, I guess there is no classloader leak.

> A simple batch job with a single operation. The memory bumps to ~600MB
(after a single execution); once the job is finished the memory is never freed.

It could just be new code paths and hence newly loaded classes. A single
execution does not tell us much. Memory that keeps increasing across
multiple (or dozens of) runs and does not decrease afterwards could be a
symptom of a leak.

You could use the following steps to verify whether there are issues in your
task managers (a small diagnostic sketch follows the list):
* Run the job N times; the more the better.
* Wait until all jobs have finished or stopped.
* Trigger GC manually a dozen times.
* Take a class histogram and check whether there are any
“ChildFirstClassLoader” instances.
* If there are roughly N “ChildFirstClassLoader” instances in the histogram,
then we can be fairly sure there is a class loader leak.
* If there are no (or only a few) “ChildFirstClassLoader” instances but the
memory is still above a threshold, say ~600MB or more, it could be another
kind of leak.
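
If it helps, below is a minimal in-process sketch (my own assumption about
how you could instrument this, not Flink code) that logs loaded/unloaded
class counts and Metaspace usage around a manual GC. The same numbers can
also be read externally with `jcmd <pid> GC.run`, `jcmd <pid>
GC.class_histogram` and `jcmd <pid> VM.metaspace`.

import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class ClassLoadDiagnostics {

    public static void main(String[] args) {
        logStats("before GC");
        // System.gc() is only a hint, so ask for collection several times.
        for (int i = 0; i < 12; i++) {
            System.gc();
        }
        logStats("after GC");
    }

    static void logStats(String label) {
        ClassLoadingMXBean cl = ManagementFactory.getClassLoadingMXBean();
        System.out.printf("[%s] loaded=%d totalLoaded=%d unloaded=%d%n",
                label, cl.getLoadedClassCount(),
                cl.getTotalLoadedClassCount(), cl.getUnloadedClassCount());
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // The Metaspace pool is where class metadata lives.
            if ("Metaspace".equals(pool.getName())) {
                System.out.printf("[%s] Metaspace used=%d bytes%n",
                        label, pool.getUsage().getUsed());
            }
        }
    }
}

If the loaded-class count and Metaspace usage stay roughly constant across
runs after GC, class unloading is working; if they grow with N, that points
to leaked class loaders.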


In any leak case, a heap dump, as @Chesnay said, would be more helpful,
since it can tell us which objects/classes/threads keep memory from being freed.


Besides this, I saw an attachment “task-manager-thrad-print.txt” in the
initial mail. When and where did you capture it? On the Task Manager? Was
there any job still running?


Best,
Kezhu Wang

On March 1, 2021 at 18:38:55, Tamir Sagi (tamir.s...@niceactimize.com)
wrote:

Hey Kezhu,

The histogram has been taken from the Task Manager using the jcmd tool.

> By means of batch job, do you mean that you compile the job graph from the
DataSet API on the client side and then submit it through the RestClient? I
am not familiar with the DataSet API; usually, there is no
`ChildFirstClassLoader` creation on the client side for job graph building.
Could you sketch pseudo-code for this, or did you create
`ChildFirstClassLoader` yourself?

Yes, we have a batch app. We read a file from S3 using the hadoop-s3
plugin, then map that data into a DataSet and just print it.
Then we have a Flink client application which saves the batch app jar.
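
In outline it is roughly the sketch below (simplified for this mail; the
bucket path and names are placeholders, the actual code is in the attached
files):

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.util.Collector;

public class BatchJobSketch {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Read a file from S3 through the flink-s3-fs-hadoop plugin
        // ("my-bucket/input.txt" is a placeholder path).
        DataSet<String> lines = env.readTextFile("s3://my-bucket/input.txt");

        // Stand-in for the custom RichFlatMapFunction
        // (FlatMapXSightMsgProcessor in the attached file).
        lines.flatMap(new RichFlatMapFunction<String, String>() {
            @Override
            public void flatMap(String value, Collector<String> out) {
                out.collect(value);
            }
        }).print();
    }
}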

Attached the following files:

   1. batch-source-code.java - main function
   2. FlatMapXSightMsgProcessor.java - custom RichFlatMapFunction
   3. flink-job-submit.txt - The code to submit the job


I've noticed 2 behaviors:

   1. Idle - Once the Task Manager application boots up, the memory
   consumption gradually grows from ~360MB to ~430MB (within a few minutes).
   I see logs where many classes are loaded into the JVM and never get
   released. (This might be normal behavior.)
   2. Batch Job Execution - A simple batch job with a single operation. The
   memory bumps to ~600MB (after a single execution); once the job is
   finished the memory is never freed. I executed GC several times (manually
   + programmatically) and it did not help (although some classes were
   unloaded). The memory keeps growing as more batch jobs are executed.

Attached are Task Manager logs from yesterday after a single batch
execution. (The memory grew to 612MB and was never freed.)

   1. taskmgr.txt - Task manager logs (2021-02-28T16:06:05,983 is the timestamp
   when the job was submitted)
   2. gc-class-historgram.txt
   3. thread-print.txt
   4. vm-class-loader-stats.txt
   5. vm-class-loaders.txt
   6. heap_info.txt


The same behavior has been observed in the Flink client application. Once
the batch job is executed, the memory increases gradually and does not get
cleaned up afterwards. (We observed many ChildFirstClassLoader instances.)


Thank you
Tamir.

------------------------------
*From:* Kezhu Wang <kez...@gmail.com>
*Sent:* Sunday, February 28, 2021 6:57 PM
*To:* Tamir Sagi <tamir.s...@niceactimize.com>
*Subject:* Re: Suspected classloader leak in Flink 1.11.1


*EXTERNAL EMAIL*


Hi Tamir,

The histogram has no instance of `ChildFirstClassLoader`.

> we are running Flink on a session cluster (version 1.11.1) on Kubernetes,
submitting batch jobs with the Flink client from a Spring Boot application
(using RestClusterClient).

> By analyzing the memory of the client Java application with profiling
tools, we saw that there are many instances of Flink's
ChildFirstClassLoader (perhaps as many as the number of jobs that were run),
and therefore many instances of the same class, each from a different
instance of the class loader (as shown in the attached screenshot). The
same applies to the Flink task manager memory.

By means of batch job, do you mean that you compile the job graph from the
DataSet API on the client side and then submit it through the RestClient? I
am not familiar with the DataSet API; usually, there is no
`ChildFirstClassLoader` creation on the client side for job graph building.
Could you sketch pseudo-code for this, or did you create
`ChildFirstClassLoader` yourself?
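
For example, is it roughly like the sketch below? (The class names, paths
and configuration values here are my guesses, not your code.)

import java.io.File;

import org.apache.flink.api.common.JobID;
import org.apache.flink.client.deployment.StandaloneClusterId;
import org.apache.flink.client.program.PackagedProgram;
import org.apache.flink.client.program.PackagedProgramUtils;
import org.apache.flink.client.program.rest.RestClusterClient;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.RestOptions;
import org.apache.flink.runtime.jobgraph.JobGraph;

public class SubmitSketch {

    public static void main(String[] args) throws Exception {
        Configuration config = new Configuration();
        config.setString(RestOptions.ADDRESS, "flink-session-cluster"); // placeholder host
        config.setInteger(RestOptions.PORT, 8081);

        // Compile the job graph on the client side from the batch app jar
        // (jar path and entry point are placeholders).
        PackagedProgram program = PackagedProgram.newBuilder()
                .setJarFile(new File("/path/to/batch-app.jar"))
                .setEntryPointClassName("com.example.BatchJob")
                .build();
        JobGraph jobGraph =
                PackagedProgramUtils.createJobGraph(program, config, 1, false);

        // Submit through the session cluster's REST endpoint.
        try (RestClusterClient<StandaloneClusterId> client =
                new RestClusterClient<>(config, StandaloneClusterId.getInstance())) {
            JobID jobId = client.submitJob(jobGraph).get();
            System.out.println("Submitted job " + jobId);
        }
    }
}

This is only a guess at the flow, to make the question concrete.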


> In addition, we have tried calling GC manually, but it did not change
much.

It might take several runs to collect a class loader instance.


Best,
Kezhu Wang


On February 28, 2021 at 23:27:38, Tamir Sagi (tamir.s...@niceactimize.com)
wrote:

Hey Kezhu,
Thanks for the fast response.

I read that link a few days ago. Today I ran a simple batch job with a
single operation (using the hadoop s3 plugin), but the same behavior was
observed.

Attached is the GC.class_histogram output (not filtered).


Tamir.



------------------------------
*From:* Kezhu Wang <kez...@gmail.com>
*Sent:* Sunday, February 28, 2021 4:46 PM
*To:* user@flink.apache.org <user@flink.apache.org>; Tamir Sagi <
tamir.s...@niceactimize.com>
*Subject:* Re: Suspected classloader leak in Flink 1.11.1


*EXTERNAL EMAIL*


Hi Tamir,

You could check
https://ci.apache.org/projects/flink/flink-docs-stable/ops/debugging/debugging_classloading.html#unloading-of-dynamically-loaded-classes-in-user-code
for known class loading issues.

Besides this, I think GC.class_histogram (even filtered) could help us list
suspected objects.


Best,
Kezhu Wang


On February 28, 2021 at 21:25:07, Tamir Sagi (tamir.s...@niceactimize.com)
wrote:


Hey all,

We are encountering memory issues on a Flink client and task managers,
which I would like to raise here.

we are running Flink on a session cluster (version 1.11.1) on Kubernetes,
submitting batch jobs with the Flink client from a Spring Boot application
(using RestClusterClient).

When jobs are submitted and run, one after another, we see that the
metaspace memory (with a max size of 1GB) keeps increasing, as well as a
linear increase in the heap memory (though it is a more moderate increase).
We do see GC working on the heap and releasing some resources.

By analyzing the memory of the client Java application with profiling
tools, we saw that there are many instances of Flink's
ChildFirstClassLoader (perhaps as many as the number of jobs that were run),
and therefore many instances of the same class, each from a different
instance of the class loader (as shown in the attached screenshot). The
same applies to the Flink task manager memory.

We would expect to see one instance of the class loader. Therefore, we
suspect that the reason for the increase is class loaders not being cleaned
up.

Does anyone have insights into this issue, or ideas on how to proceed with
the investigation?


*Flink Client application (VisualVM)*







[image: VisualVM heap view showing many org.apache.flink.util.ChildFirstClassLoader
instances, each retaining its own copies of
com.fasterxml.jackson.databind.PropertyMetadata (~120 bytes retained per object)]

We have used different GCs, but with the same results.


*Task Manager*


Total size: 4GB
Metaspace: 1GB
Off-heap: 512MB


Screenshot from the Task Manager: 612MB is occupied and not being released.

We used the jcmd tool and attached 3 files:


   1. Threads print
   2. VM.metaspace output
   3. VM.classloader

In addition, we have tried calling GC manually, but it did not change much.

Thank you




Confidentiality: This communication and any attachments are intended for
the above-named persons only and may be confidential and/or legally
privileged. Any opinions expressed in this communication are not
necessarily those of NICE Actimize. If this communication has come to you
in error you must take no action based on it, nor must you copy or show it
to anyone; please delete/destroy and inform the sender by e-mail
immediately.
Monitoring: NICE Actimize may monitor incoming and outgoing e-mails.
Viruses: Although we have taken steps toward ensuring that this e-mail and
attachments are free from any virus, we advise that in keeping with good
computing practice the recipient should ensure they are actually virus free.

