Hi Tamir,

> The histogram has been taken from the Task Manager using the jcmd tool.
From that histogram, I guess there is no classloader leaking.

> A simple batch job with a single operation. The memory bumps to ~600MB (after a single execution). Once the job is finished the memory is never freed.

It could be just new code paths and hence new classes. A single execution does not tell us much. Multiple or dozens of runs, with memory continuously increasing across them and not decreasing afterwards, could be a symptom of leaking. You could use the following steps to verify whether there are issues in your task managers:
* Run the job N times, the more the better.
* Wait until all jobs have finished or stopped.
* Trigger GC manually a dozen times.
* Take a class histogram and check whether there are any “ChildFirstClassLoader” instances.
* If there are roughly N “ChildFirstClassLoader” instances in the histogram, then we can be pretty sure there is a class loader leak.
* If there are no (or only a few) “ChildFirstClassLoader” instances but memory is still higher than a threshold, say ~600MB or more, it could be another kind of leak.

In either leaking case, a heap dump, as @Chesnay said, would be more helpful since it can tell us which object/class/thread keeps memory from being freed.

Besides this, I saw an attachment “task-manager-thrad-print.txt” in the initial mail. When and where did you capture it? On the Task Manager? Was any job still running?

Best,
Kezhu Wang

On March 1, 2021 at 18:38:55, Tamir Sagi (tamir.s...@niceactimize.com) wrote:

Hey Kezhu,

The histogram has been taken from the Task Manager using the jcmd tool.

> By “batch job”, do you mean that you compile the job graph from the DataSet API on the client side and then submit it through RestClient? I am not familiar with the DataSet API; usually, there is no `ChildFirstClassLoader` creation on the client side for job graph building. Could you sketch pseudo code for this, or did you create a `ChildFirstClassLoader` yourself?

Yes, we have a batch app. We read a file from S3 using the hadoop-s3-plugin, map that data into a DataSet, and then just print it. Then we have a Flink Client application which saves the batch app jar. Attached the following files:
1. batch-source-code.java - main function
2. FlatMapXSightMsgProcessor.java - custom RichFlatMapFunction
3. flink-job-submit.txt - the code to submit the job

I've noticed 2 behaviors:
1. Idle - Once the Task Manager application boots up, the memory consumption gradually grows from ~360MB to ~430MB (within a few minutes). I see logs where many classes are loaded into the JVM and never get released. (Might be normal behavior.)
2. Batch job execution - A simple batch job with a single operation. The memory bumps to ~600MB (after a single execution). Once the job is finished the memory is never freed. I executed GC several times (manually + programmatically) and it did not help (although some classes were unloaded). The memory keeps growing as more batch jobs are executed.

Attached Task Manager logs from yesterday after a single batch execution (memory grew to 612MB and was never freed):
1. taskmgr.txt - Task Manager logs (2021-02-28T16:06:05,983 is the timestamp when the job was submitted)
2. gc-class-historgram.txt
3. thread-print.txt
4. vm-class-loader-stats.txt
5. vm-class-loaders.txt
6. heap_info.txt

The same behavior has been observed in the Flink Client application. Once a batch job is executed, the memory increases gradually and does not get cleaned up afterwards. (We observed many ChildFirstClassLoader instances.)

Thank you,
Tamir.
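A minimal sketch of the general shape of such a DataSet job; the S3 path below is a placeholder, and the pass-through flat map only stands in for the attached FlatMapXSightMsgProcessor:

import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.util.Collector;

public class S3BatchJobSketch {

    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Read a text file from S3; the s3:// scheme is served by the flink-s3-fs-hadoop plugin.
        DataSet<String> lines = env.readTextFile("s3://some-bucket/input/messages.txt"); // placeholder path

        // Stand-in for the custom RichFlatMapFunction; here it simply forwards each line.
        DataSet<String> mapped = lines.flatMap(new RichFlatMapFunction<String, String>() {
            @Override
            public void flatMap(String line, Collector<String> out) {
                out.collect(line);
            }
        });

        // print() collects the result to the client and implicitly executes the job.
        mapped.print();
    }
}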
------------------------------
*From:* Kezhu Wang <kez...@gmail.com>
*Sent:* Sunday, February 28, 2021 6:57 PM
*To:* Tamir Sagi <tamir.s...@niceactimize.com>
*Subject:* Re: Suspected classloader leak in Flink 1.11.1

*EXTERNAL EMAIL*

Hi Tamir,

The histogram has no instance of `ChildFirstClassLoader`.

> We are running Flink on a session cluster (version 1.11.1) on Kubernetes, submitting batch jobs with a Flink client in a Spring Boot application (using RestClusterClient).
> By analyzing the memory of the client Java application with profiling tools, we saw that there are many instances of Flink's ChildFirstClassLoader (perhaps as many as the number of jobs which were running), and therefore many instances of the same class, each from a different instance of the class loader (as shown in the attached screenshot). The same applies to the Flink task manager memory.

By “batch job”, do you mean that you compile the job graph from the DataSet API on the client side and then submit it through RestClient? I am not familiar with the DataSet API; usually, there is no `ChildFirstClassLoader` creation on the client side for job graph building. Could you sketch pseudo code for this, or did you create a `ChildFirstClassLoader` yourself?

> In addition, we have tried calling GC manually, but it did not change much.

It might take several runs to collect a class loader instance.

Best,
Kezhu Wang

On February 28, 2021 at 23:27:38, Tamir Sagi (tamir.s...@niceactimize.com) wrote:

Hey Kezhu,

Thanks for the fast response. I read that link a few days ago. Today I ran a simple batch job with a single operation (using the hadoop s3 plugin), but the same behavior was observed. Attached the GC.class_histogram (not filtered).

Tamir.

------------------------------
*From:* Kezhu Wang <kez...@gmail.com>
*Sent:* Sunday, February 28, 2021 4:46 PM
*To:* user@flink.apache.org <user@flink.apache.org>; Tamir Sagi <tamir.s...@niceactimize.com>
*Subject:* Re: Suspected classloader leak in Flink 1.11.1

*EXTERNAL EMAIL*

Hi Tamir,

You could check https://ci.apache.org/projects/flink/flink-docs-stable/ops/debugging/debugging_classloading.html#unloading-of-dynamically-loaded-classes-in-user-code for known class loading issues. Besides this, I think GC.class_histogram (even filtered) could help us list suspected objects.

Best,
Kezhu Wang

On February 28, 2021 at 21:25:07, Tamir Sagi (tamir.s...@niceactimize.com) wrote:

Hey all,

We are encountering memory issues on a Flink client and task managers, which I would like to raise here.

We are running Flink on a session cluster (version 1.11.1) on Kubernetes, submitting batch jobs with a Flink client in a Spring Boot application (using RestClusterClient).

When jobs are submitted and run, one after another, we see that the metaspace memory (with a max size of 1GB) keeps increasing, along with a linear (though more moderate) increase in the heap memory. We do see GC working on the heap and releasing some resources.

By analyzing the memory of the client Java application with profiling tools, we saw that there are many instances of Flink's ChildFirstClassLoader (perhaps as many as the number of jobs which were running), and therefore many instances of the same class, each from a different instance of the class loader (as shown in the attached screenshot). The same applies to the Flink task manager memory. We would expect to see one instance of the class loader. Therefore, we suspect that the reason for the increase is class loaders not being cleaned up.

Does anyone have insights about this issue, or ideas how to proceed with the investigation?
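For reference, a minimal sketch of what client-side job-graph building and submission through RestClusterClient typically looks like on Flink 1.11; the REST endpoint, jar path and entry-point class below are placeholders, not the code from the attached flink-job-submit.txt:

import java.io.File;

import org.apache.flink.api.common.JobID;
import org.apache.flink.client.deployment.StandaloneClusterId;
import org.apache.flink.client.program.PackagedProgram;
import org.apache.flink.client.program.PackagedProgramUtils;
import org.apache.flink.client.program.rest.RestClusterClient;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.RestOptions;
import org.apache.flink.runtime.jobgraph.JobGraph;

public class JobSubmitSketch {

    public static void main(String[] args) throws Exception {
        Configuration config = new Configuration();
        config.setString(RestOptions.ADDRESS, "flink-jobmanager"); // placeholder REST endpoint
        config.setInteger(RestOptions.PORT, 8081);

        // Build the job graph on the client from the packaged batch jar.
        // PackagedProgram loads the jar's entry point through a user-code class loader.
        PackagedProgram program = PackagedProgram.newBuilder()
                .setJarFile(new File("/opt/jobs/batch-app.jar"))   // placeholder path
                .setEntryPointClassName("com.example.BatchJob")    // placeholder class
                .build();
        JobGraph jobGraph = PackagedProgramUtils.createJobGraph(program, config, 1, false);

        // Submit the pre-built job graph to the session cluster over REST.
        try (RestClusterClient<StandaloneClusterId> client =
                     new RestClusterClient<>(config, StandaloneClusterId.getInstance())) {
            JobID jobId = client.submitJob(jobGraph).get();
            System.out.println("Submitted job " + jobId);
        }
    }
}

If the user-code class loader created for each submission stays reachable, repeated submissions can show up as many ChildFirstClassLoader instances in a client heap profile, which would match the observation above.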
*Flink Client application (VisualVm)*

[image: VisualVM class view showing many copies of com.fasterxml.jackson.databind.PropertyMetadata, each loaded by a different org.apache.flink.util.ChildFirstClassLoader instance, with a retained size of ~120 bytes per entry]

We have used different GCs but got the same results.

*Task Manager*
Total size: 4GB
Metaspace: 1GB
Off-heap: 512MB

Screenshot from the Task Manager: 612MB are occupied and not being released.

We used the jcmd tool and attached 3 files:
1. Threads print
2. VM.metaspace output
3. VM.classloader

In addition, we have tried calling GC manually, but it did not change much.

Thank you
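For completeness, a small sketch of checking the JVM memory pools programmatically after forcing GC, roughly the programmatic counterpart of `jcmd <pid> GC.run` followed by `jcmd <pid> GC.heap_info`; this is a generic JMX-based check, not the exact procedure used above:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class MemoryPoolCheck {

    public static void main(String[] args) throws InterruptedException {
        // Ask the JVM to collect a few times; System.gc() is only a hint,
        // so several invocations with small pauses make unloading more likely.
        for (int i = 0; i < 5; i++) {
            System.gc();
            Thread.sleep(500);
        }

        // Print the used size of every memory pool, including Metaspace,
        // so values can be compared before and after batch job runs.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            long usedMb = pool.getUsage().getUsed() / (1024 * 1024);
            System.out.printf("%-30s %6d MB%n", pool.getName(), usedMb);
        }
    }
}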