I'm having an issue where off-heap memory grows unchecked until I get OOM exceptions. I was hoping that upgrading to 1.4 would solve this, since the child-first classloader is supposed to resolve issues with Avro classes being cached in a different classloader (which prevents those classloaders from being garbage collected).
However, after upgrading, we are still seeing an off-heap memory leak. I think I may have isolated the issue to the JmxReporter class used for collecting Kafka metrics.
Here are the details of what I'm seeing:
Our cluster is running in Kubernetes, using the latest flink:1.4 Docker image. We are using the default classloading order (child first).
If I resubmit my job repeatedly, the ClassLoaders from the previous job submissions don't get cleaned up, and the non-heap memory slowly grows until the task manager runs out of memory.
I can see all of the uncollected classloaders if I run "sudo -u flink jmap -clstats <proc_id>" (output below). This list of dead classloaders continues to grow every time I kill and resubmit a Flink job. In all, it lists 3200 dead class loaders; I'm only including the ones that show more than 2K of used memory.
finding class loader instances ..done.
computing per loader stat ..done.
please wait.. computing liveness.liveness analysis may be inaccurate ...
class_loader        classes  bytes     parent_loader       alive?  type
0x00000000807302a0  7522     12213076  0x00000000804c58c0  dead    sun/misc/Launcher$AppClassLoader@0x000000010000f070
0x000000008eb00000  3699     6021535   0x00000000807302a0  dead    org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98
0x0000000094200190  3693     6016807   0x00000000807302a0  dead    org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98
0x000000009e7bc6c8  3696     6001081   0x00000000807302a0  dead    org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98
0x00000000a9d80008  3584     5530412   0x00000000807302a0  dead    org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98
0x00000000f4103650  3581     5527354   0x00000000807302a0  dead    org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98
0x00000000901801f8  3581     5527354   0x00000000807302a0  dead    org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98
0x00000000942637c0  3231     5121176   0x00000000807302a0  dead    org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98
0x0000000096c2ec00  3231     5119662   0x00000000807302a0  dead    org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98
0x000000008f600000  3225     5116241   0x00000000807302a0  dead    org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98
0x0000000092700d48  3228     5112270   0x00000000807302a0  dead    org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98
<bootstrap>         2548     4424440   null                live    <internal>
0x0000000096b77190  2234     3634602   0x00000000807302a0  dead    org/apache/flink/runtime/execution/librarycache/FlinkUserCodeClassLoaders$ChildFirstClassLoader@0x00000001005cdc98
Next I took a heap dump:
sudo -u flink jmap -dump:format=b,file=/tmp/HeapDump.hprof <procpid>
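(Side note: since exec'ing into the pod isn't always convenient, the same dump can also be triggered from inside the JVM. This is just a rough sketch using the HotSpot-specific HotSpotDiagnosticMXBean, not something we actually run in the job:)

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {

    // Writes an hprof file at 'path'; dumpHeap fails if the file already exists.
    public static void dump(String path, boolean liveObjectsOnly) throws Exception {
        HotSpotDiagnosticMXBean diag = ManagementFactory.newPlatformMXBeanProxy(
                ManagementFactory.getPlatformMBeanServer(),
                "com.sun.management:type=HotSpotDiagnostic",
                HotSpotDiagnosticMXBean.class);
        diag.dumpHeap(path, liveObjectsOnly);
    }

    public static void main(String[] args) throws Exception {
        dump("/tmp/HeapDump.hprof", true);   // true = live (reachable) objects only
    }
}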
Then, using Eclipse Memory Analyzer, I followed the steps from this blog post:
http://java.jiderhamn.se/2011/12/11/classloader-leaks-i-how-to-find-classloader-leaks-with-eclipse-memory-analyser-mat/
The result of looking for strong references to classes in a dead classloader is this tree:
Class Name | Shallow Heap | Retained Heap
-------------------------------------------------------------------------------------------------------------------------------------------
org.apache.flink.runtime.execution.librarycache.FlinkUserCodeClassLoaders$ChildFirstClassLoader @ 0x94200190 | 88 | 616,992
'- <classloader> class org.apache.kafka.common.metrics.JmxReporter$KafkaMbean @ 0x94250cb0 | 0 | 0
   '- <class> org.apache.kafka.common.metrics.JmxReporter$KafkaMbean @ 0xbae537e8 | 24 | 328
      '- object com.sun.jmx.mbeanserver.NamedObject @ 0xbace01e0 | 24 | 24
         '- value java.util.HashMap$Node @ 0xbace0110 | 32 | 232
            '- [247] java.util.HashMap$Node[512] @ 0xfa0d08c0 | 2,064 | 120,104
               '- table java.util.HashMap @ 0x806e9f08 | 48 | 120,152
                  '- value java.util.HashMap$Node @ 0x806e9ee8 | 32 | 120,184
                     '- [8] java.util.HashMap$Node[16] @ 0x80502da0 | 80 | 134,944
                        '- table java.util.HashMap @ 0x80502d70 | 48 | 134,992
                           '- domainTb com.sun.jmx.mbeanserver.Repository @ 0x80502d50 | 32 | 135,200
-------------------------------------------------------------------------------------------------------------------------------------------
This suggests to me that the JMX MBean registry is created by the bootstrap or app classloader, and that we then register MBeans with it for each running job. When the job is cancelled, its classloader can't be cleaned up because the registry (the MBean server's Repository, owned by a parent classloader) still holds a reference to the KafkaMbean object.
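To illustrate the pattern I mean (hypothetical names, not the actual Kafka code): a standard MBean registered with the platform MBeanServer stays strongly referenced by the server's internal Repository, and through it so do its class and its classloader, until unregisterMBean() is called.

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Illustration only; "Counter" stands in for JmxReporter$KafkaMbean.
public class MBeanPinningSketch {

    public interface CounterMBean {            // standard MBean interface
        long getCount();
    }

    public static class Counter implements CounterMBean {
        @Override
        public long getCount() { return 42L; }
    }

    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName("example.metrics:type=Counter,id=job-1");

        // Registration stores the Counter instance in the MBean server's
        // Repository (the com.sun.jmx.mbeanserver.Repository in the tree above).
        // If Counter were loaded by a per-job classloader, that loader would now
        // be strongly reachable from outside the job.
        server.registerMBean(new Counter(), name);

        // Only this breaks the reference chain:
        server.unregisterMBean(name);
    }
}

In our case the KafkaMbean class is loaded by the per-job ChildFirstClassLoader, and the dead classloaders in the dump above suggest those MBeans are never unregistered when the job goes away.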
Is there anything that can be done about this in the job code?
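One mitigation I'm considering trying in the job code (untested sketch; it assumes the leaked MBeans sit under the kafka.consumer / kafka.producer JMX domains, which is what Kafka's JmxReporter appears to use by default) is to unregister them explicitly when the task closes, e.g. from a RichFunction's close():

import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public final class KafkaMBeanCleanup {

    private KafkaMBeanCleanup() {}

    // Best effort: unregister every MBean in the given JMX domains so the
    // user-code classloader that loaded their classes can be collected.
    public static void unregisterKafkaMBeans() {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        for (String pattern : new String[] {"kafka.consumer:*", "kafka.producer:*"}) {
            try {
                for (ObjectName name : server.queryNames(new ObjectName(pattern), null)) {
                    try {
                        server.unregisterMBean(name);
                    } catch (Exception e) {
                        // Already unregistered or owned by another task; ignore.
                    }
                }
            } catch (Exception e) {
                // Malformed pattern or query failure; nothing more we can do.
            }
        }
    }
}

The obvious problem is that this is a blunt instrument: as written it would also tear down metrics for Kafka sources/sinks belonging to other jobs still running on the same task manager, so the ObjectName pattern would need to be narrowed (e.g. by client-id) before it's safe. I'd still prefer a proper fix on the Flink/Kafka side if one exists.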