Another big potential candidate is the fact that the JDBC libs I use in my job
are put into the Flink lib folder instead of into the fat jar...tomorrow I'll
try to see if the metaspace is getting cleared correctly after that change.
Unfortunately our jobs were written before the child-first / parent-first
classloading refactoring, and at that time that was the way to go...but now it
can cause this kind of problem when using the child-first policy.
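One classic way a JDBC driver pins the user-code classloader is through java.sql.DriverManager, which lives in the parent classloader and keeps strong references to every registered driver. A minimal sketch of the usual cleanup (class and method names are mine, not from this thread; in a Flink job this would typically run in a close/teardown hook):

```java
import java.sql.Driver;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Enumeration;

public class DriverCleanup {

    // Deregister every JDBC driver that was loaded by the same classloader
    // as this class; drivers left registered in the (parent-loaded)
    // DriverManager keep the user-code classloader reachable forever.
    public static int deregisterOwnDrivers() throws SQLException {
        int count = 0;
        ClassLoader own = DriverCleanup.class.getClassLoader();
        Enumeration<Driver> drivers = DriverManager.getDrivers();
        while (drivers.hasMoreElements()) {
            Driver d = drivers.nextElement();
            if (d.getClass().getClassLoader() == own) {
                DriverManager.deregisterDriver(d);
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws SQLException {
        // Register a dummy driver so the demo has something to clean up.
        DriverManager.registerDriver(new Driver() {
            public java.sql.Connection connect(String url, java.util.Properties info) { return null; }
            public boolean acceptsURL(String url) { return false; }
            public java.sql.DriverPropertyInfo[] getPropertyInfo(String url, java.util.Properties info) {
                return new java.sql.DriverPropertyInfo[0];
            }
            public int getMajorVersion() { return 1; }
            public int getMinorVersion() { return 0; }
            public boolean jdbcCompliant() { return false; }
            public java.util.logging.Logger getParentLogger() { return null; }
        });
        System.out.println("deregistered " + deregisterOwnDrivers() + " driver(s)");
    }
}
```

Note that DriverManager.getDrivers() only returns drivers visible to the caller's classloader, which is why the loader comparison works from inside user code.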

On Mon, Nov 16, 2020 at 8:44 PM Flavio Pompermaier <pomperma...@okkam.it>
wrote:

> Thank you Kye for your insights...in my mind, if the job runs without
> problems one or more times, then the heap size, and thus the metaspace size,
> is big enough and I should not need to increase it (on the same data of course).
> So I'll try to understand who is leaking what...the advice to avoid
> dynamic class loading is just a workaround to me. There's something wrong
> going on, and tomorrow I'll try to understand the root cause of the
> problem using -XX:NativeMemoryTracking=summary as you suggested.
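Besides native-memory tracking from outside, metaspace usage can also be polled from inside the JVM via the standard management beans; a small sketch (my own, not from the thread) that could be logged between job submissions to see whether class metadata is ever released:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class MetaspaceProbe {
    public static void main(String[] args) {
        // On HotSpot, "Metaspace" (and "Compressed Class Space") show up as
        // memory pools; steadily growing "used" across resubmissions of the
        // same job is the symptom discussed in this thread.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getName().contains("Metaspace")) {
                System.out.println(pool.getName()
                        + ": used=" + pool.getUsage().getUsed()
                        + " committed=" + pool.getUsage().getCommitted());
            }
        }
    }
}
```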
>
> I'll keep you up to date with my findings..
>
> Best,
> Flavio
>
> On Mon, Nov 16, 2020 at 8:22 PM Kye Bae <kye....@capitalone.com> wrote:
>
>> Hello!
>>
>> The JVM metaspace is where all the classes (not class instances or
>> objects) get loaded. jmap -histo shows you the heap space usage
>> info, not the metaspace.
>>
>> You could inspect what is happening in the metaspace by using jcmd (e.g.,
>> jcmd JPID VM.native_memory summary) after restarting the cluster with
>> -XX:NativeMemoryTracking=summary
>>
>> As the error message suggests, you may need to increase
>> taskmanager.memory.jvm-metaspace.size,
>> but you need to be slightly careful when specifying the memory parameters
>> in flink-conf.yaml in Flink 1.10 (they have an issue with a confusing error
>> message).
>>
>> Another thing to keep in mind is that you may want to avoid using
>> dynamic classloading (
>> https://ci.apache.org/projects/flink/flink-docs-release-1.10/monitoring/debugging_classloading.html#avoiding-dynamic-classloading-for-user-code):
>> when a job repeatedly fails and is resubmitted due to some temporary issue,
>> it will load the same class files into the metaspace again each time and
>> could exceed whatever limit you set.
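The growth from repeated (re)submissions can be observed with the standard class-loading MXBean; a small illustrative sketch (mine, not from the thread):

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;

public class ClassLoadStats {
    public static void main(String[] args) {
        // With dynamic classloading, every (re)submission defines a fresh
        // copy of the user classes: a total-loaded count that keeps growing
        // while the unloaded count stays flat is the signature of this leak.
        ClassLoadingMXBean bean = ManagementFactory.getClassLoadingMXBean();
        System.out.println("currently loaded: " + bean.getLoadedClassCount());
        System.out.println("total loaded:     " + bean.getTotalLoadedClassCount());
        System.out.println("unloaded:         " + bean.getUnloadedClassCount());
    }
}
```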
>>
>> -K
>>
>> On Mon, Nov 16, 2020 at 12:39 PM Jan Lukavský <je...@seznam.cz> wrote:
>>
>>> The exclusions should not have any impact on that, because what determines
>>> which classloader will load which class is not the presence of a particular
>>> class in a specific jar, but the configuration of parent-first-patterns [1].
>>>
>>> If you don't use any Flink-internal imports, it still might be the
>>> case that a class in one of the packages defined by the
>>> parent-first patterns holds a reference to your user-code classes, which
>>> would cause the leak. I'd recommend inspecting the heap dump after several
>>> restarts of the application and looking for references to Class objects from
>>> the root set.
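A hypothetical flink-conf.yaml fragment for the patterns Jan refers to (the package name in the last line is made up for illustration):

```yaml
# flink-conf.yaml
classloader.resolve-order: child-first
# Packages matching these patterns are always loaded by the parent
# (TaskManager) classloader; static state in such classes survives job
# restarts even though the user-code classloader is thrown away.
classloader.parent-first-patterns.additional: "com.example.shared."
```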
>>>
>>> Jan
>>>
>>> [1]
>>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/config.html#class-loading
>>> On 11/16/20 5:34 PM, Flavio Pompermaier wrote:
>>>
>>> I've tried to remove all possible imports of classes not contained in
>>> the fat jar, but I still face the same problem.
>>> I've also tried to reduce as much as possible the excludes in the shade
>>> section of the Maven plugin (I took the one at [1]), so now I exclude only a
>>> few dependencies. Could it be that I should include org.slf4j:* since I use
>>> a static import of it?
>>>
>>> <artifactSet>
>>>     <excludes>
>>>       <exclude>com.google.code.findbugs:jsr305</exclude>
>>>       <exclude>org.slf4j:*</exclude>
>>>       <exclude>log4j:*</exclude>
>>>     </excludes>
>>> </artifactSet>
>>>
>>> [1]
>>> https://ci.apache.org/projects/flink/flink-docs-master/dev/project-configuration.html#appendix-template-for-building-a-jar-with-dependencies
>>>
>>> On Mon, Nov 16, 2020 at 3:29 PM Jan Lukavský <je...@seznam.cz> wrote:
>>>
>>>> Yes, that could definitely cause this. You should probably avoid using
>>>> these flink-internal shaded classes and ship your own versions (not 
>>>> shaded).
>>>>
>>>> Best,
>>>>
>>>>  Jan
>>>> On 11/16/20 3:22 PM, Flavio Pompermaier wrote:
>>>>
>>>> Thank you Jan for your valuable feedback.
>>>> Could it be that I should not import shaded Jackson classes in my
>>>> user code?
>>>> For example import
>>>> org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.ObjectMapper?
>>>>
>>>> Best,
>>>> Flavio
>>>>
>>>> On Mon, Nov 16, 2020 at 3:15 PM Jan Lukavský <je...@seznam.cz> wrote:
>>>>
>>>>> Hi Flavio,
>>>>>
>>>>> when I encountered a problem quite similar to the one you describe, it
>>>>> was related to static storage located in a class that was loaded
>>>>> "parent-first". In my case it was in java.lang.ClassValue, but it
>>>>> might (and probably will) be different in your case. The problem is that if
>>>>> user code registers something in some (static) storage located in a class
>>>>> loaded with the parent (TaskManager) classloader, then its associated classes
>>>>> will never be GC'd and the Metaspace will grow. A good starting point would be
>>>>> not to focus on the biggest consumers of heap (in general), but to look at
>>>>> where the 15k objects of type Class are referenced from. That might help
>>>>> you figure this out. I'm not sure if there is something that can be done in
>>>>> general to prevent this type of leak; that would probably be a question for
>>>>> the dev@ mailing list.
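The leak pattern Jan describes can be modeled in miniature: a static map stands in for the parent-loaded storage, and a WeakReference shows the entry is not collectible while it stays registered. This toy (all names are mine) only illustrates reachability, not actual class unloading:

```java
import java.lang.ref.WeakReference;
import java.util.HashMap;
import java.util.Map;

public class StaticLeakDemo {

    // Stand-in for static storage inside a parent-first-loaded class: entries
    // put here are reachable from a GC root, so the classes of the stored
    // objects (and their classloader) cannot be unloaded.
    static final Map<String, Object> PARENT_REGISTRY = new HashMap<>();

    public static void main(String[] args) {
        Object userObject = new Object() {}; // pretend this comes from user code
        WeakReference<Object> ref = new WeakReference<>(userObject);

        PARENT_REGISTRY.put("hook", userObject);
        userObject = null;
        System.gc();
        // Guaranteed non-null: the static map still strongly references it.
        System.out.println("reachable while registered: " + (ref.get() != null));

        PARENT_REGISTRY.remove("hook");
        System.gc();
        // Only now is it eligible for collection (System.gc() is just a hint).
        System.out.println("possibly collected after removal: " + (ref.get() == null));
    }
}
```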
>>>>>
>>>>> Best,
>>>>>
>>>>>  Jan
>>>>> On 11/16/20 2:27 PM, Flavio Pompermaier wrote:
>>>>>
>>>>> Hello everybody,
>>>>> I was writing this email when a similar thread appeared on this
>>>>> mailing list.
>>>>> The difference is that the other problem seems to be related
>>>>> to Flink 1.10 on YARN and does not output anything helpful for debugging
>>>>> the cause of the problem.
>>>>>
>>>>> Indeed, in my use case I use Flink 1.11.0 on a standalone
>>>>> session cluster (the job is submitted to the cluster using the CLI client).
>>>>> The problem arises when I submit the same job about 20 times (this
>>>>> number unfortunately is not deterministic and can change a little bit). The
>>>>> error reported by the Task Executor is related to the ever-growing
>>>>> Metaspace...the error message is pretty detailed [1].
>>>>>
>>>>> I found the same issue in some previous threads on this mailing list
>>>>> and I've tried to figure out the cause of the problem. The issue is that,
>>>>> looking at the allocated objects, I don't really get an idea of the source
>>>>> of the problem, because the types of objects consuming the memory are
>>>>> general-purpose ones (i.e. byte arrays, integers and strings)...these are my
>>>>> "top" memory consumers according to the output of jmap -histo <PID>:
>>>>>
>>>>> At run 0:
>>>>>
>>>>>  num     #instances         #bytes  class name (module)
>>>>> -------------------------------------------------------
>>>>>    1:         46238       13224056  [B (java.base@11.0.9.1)
>>>>>    2:          3736        6536672  [I (java.base@11.0.9.1)
>>>>>    3:         38081         913944  java.lang.String (
>>>>> java.base@11.0.9.1)
>>>>>    4:            26         852384
>>>>>  [Lakka.dispatch.forkjoin.ForkJoinTask;
>>>>>    5:          7146         844984  java.lang.Class (
>>>>> java.base@11.0.9.1)
>>>>>
>>>>> At run 1:
>>>>>
>>>>>    1:         77.608       25.317.496  [B (java.base@11.0.9.1)
>>>>>    2:          7.004        9.088.360  [I (java.base@11.0.9.1)
>>>>>    3:         15.814        1.887.256  java.lang.Class (
>>>>> java.base@11.0.9.1)
>>>>>    4:         67.381        1.617.144  java.lang.String (
>>>>> java.base@11.0.9.1)
>>>>>    5:          3.906        1.422.960  [Ljava.util.HashMap$Node; (
>>>>> java.base@11.0.9.1)
>>>>>
>>>>> At run 6:
>>>>>
>>>>>    1:         81.408       25.375.400  [B (java.base@11.0.9.1)
>>>>>    2:         12.479        7.249.392  [I (java.base@11.0.9.1)
>>>>>    3:         29.090        3.496.168  java.lang.Class (
>>>>> java.base@11.0.9.1)
>>>>>    4:          4.347        2.813.416  [Ljava.util.HashMap$Node; (
>>>>> java.base@11.0.9.1)
>>>>>    5:         71.584        1.718.016  java.lang.String (
>>>>> java.base@11.0.9.1)
>>>>>
>>>>> At run 8:
>>>>>
>>>>>    1:        985.979      127.193.256  [B (java.base@11.0.9.1)
>>>>>    2:         35.400       13.702.112  [I (java.base@11.0.9.1)
>>>>>    3:        260.387        6.249.288  java.lang.String (
>>>>> java.base@11.0.9.1)
>>>>>    4:        148.836        5.953.440  java.util.HashMap$KeyIterator (
>>>>> java.base@11.0.9.1)
>>>>>    5:         17.641        5.222.344  [Ljava.util.HashMap$Node; (
>>>>> java.base@11.0.9.1)
>>>>>
>>>>> Thanks in advance for any help,
>>>>> Flavio
>>>>>
>>>>> [1]
>>>>> --------------------------------------------------------------------------------------------------
>>>>> java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory
>>>>> error has occurred. This can mean two things: either the job requires a
>>>>> larger size of JVM metaspace to load classes or there is a class loading
>>>>> leak. In the first case 'taskmanager.memory.jvm-metaspace.size'
>>>>> configuration option should be increased. If the error persists (usually 
>>>>> in
>>>>> cluster after several job (re-)submissions) then there is probably a class
>>>>> loading leak in user code or some of its dependencies which has to be
>>>>> investigated and fixed. The task executor has to be shutdown...
>>>>>         at java.lang.ClassLoader.defineClass1(Native Method) ~[?:?]
>>>>>         at java.lang.ClassLoader.defineClass(ClassLoader.java:1017)
>>>>> ~[?:?]
>>>>>         at
>>>>> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:174)
>>>>> ~[?:?]
>>>>>         at
>>>>> java.net.URLClassLoader.defineClass(URLClassLoader.java:550) ~[?:?]
>>>>>         at java.net.URLClassLoader$1.run(URLClassLoader.java:458)
>>>>> ~[?:?]
>>>>>         at java.net.URLClassLoader$1.run(URLClassLoader.java:452)
>>>>> ~[?:?]
>>>>>         at java.security.AccessController.doPrivileged(Native Method)
>>>>> ~[?:?]
>>>>>         at java.net.URLClassLoader.findClass(URLClassLoader.java:451)
>>>>> ~[?:?]
>>>>>         at
>>>>> org.apache.flink.util.ChildFirstClassLoader.loadClassWithoutExceptionHandling(ChildFirstClassLoader.java:71)
>>>>> ~[flink-dist_2.12-1.11.0.jar:1.11.0]
>>>>>         at
>>>>> org.apache.flink.util.FlinkUserCodeClassLoader.loadClass(FlinkUserCodeClassLoader.java:48)
>>>>> [flink-dist_2.12-1.11.0.jar:1.11.0]
>>>>>         at java.lang.ClassLoader.loadClass(ClassLoader.java:522) [?:?]
>>>>>
