[ https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17040146#comment-17040146 ]

Thomas Wozniakowski commented on FLINK-16142:
---------------------------------------------

[~kevin.cyj] it would be quite difficult for me to put arbitrary JARs in the 
lib folder, as we're running entirely on the official Docker containers and 
the app is specifically set up to run in Docker (it reads everything from 
environment variables, etc.). I'm going to try to get the main method of the 
JAR to print all live threads to the logs at the point where the job starts; 
hopefully that will give some insight.
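
For the thread dump, something along these lines should work from the job's 
main method (a minimal sketch using only the standard JDK API; logging to 
stdout here is just a placeholder for however the logs are collected):

{code:java}
import java.util.Map;

public class ThreadDumpAtStartup {
    public static void main(String[] args) {
        // Dump every live thread before the job graph is built, so the set
        // of threads can be compared across repeated submissions.
        for (Map.Entry<Thread, StackTraceElement[]> entry
                : Thread.getAllStackTraces().entrySet()) {
            Thread t = entry.getKey();
            System.out.printf("Live thread: %s (daemon=%b, state=%s)%n",
                    t.getName(), t.isDaemon(), t.getState());
        }
        // ... then build and execute the Flink job as usual ...
    }
}
{code}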

I can post the code of our custom SQS sink, but it's really only about 4 lines.
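
For reference, the shape of it is essentially this (a minimal sketch, not 
our exact code; the queue URL handling and client construction here are 
placeholders):

{code:java}
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class SqsSink extends RichSinkFunction<String> {
    private final String queueUrl; // assumed to come from an env var
    private transient AmazonSQS sqs;

    public SqsSink(String queueUrl) {
        this.queueUrl = queueUrl;
    }

    @Override
    public void open(Configuration parameters) {
        // Credentials and region are picked up from the environment.
        sqs = AmazonSQSClientBuilder.defaultClient();
    }

    @Override
    public void invoke(String value, Context context) {
        sqs.sendMessage(queueUrl, value);
    }
}
{code}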

[~pnowojski] attaching a memory profiler poses a similar problem: the official 
Flink Docker images don't expose a port for attaching to the JVM, so I don't 
think that will be possible. I'm trying to work out whether there are any other 
useful metrics I could monitor via the REST endpoints from outside the container 
to help debug this.
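
One idea: Metaspace counts towards the JVM's non-heap memory, and the 
taskmanager metrics endpoint exposes {{Status.JVM.Memory.NonHeap.Used}}, so a 
crude poller running outside the container might show growth across repeated 
submissions. A rough sketch (host, port, and the taskmanager id are 
placeholders for our actual deployment):

{code:java}
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class NonHeapPoller {
    public static void main(String[] args) throws Exception {
        // args[0]: a taskmanager id, as listed by GET /taskmanagers
        String url = "http://localhost:8081/taskmanagers/" + args[0]
                + "/metrics?get=Status.JVM.Memory.NonHeap.Used";
        while (true) {
            HttpURLConnection conn =
                    (HttpURLConnection) new URL(url).openConnection();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                // Response is a small JSON array: [{"id":"...","value":"..."}]
                System.out.println(System.currentTimeMillis() + " " + in.readLine());
            } finally {
                conn.disconnect();
            }
            Thread.sleep(10_000);
        }
    }
}
{code}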

> Memory Leak causes Metaspace OOM error on repeated job submission
> -----------------------------------------------------------------
>
>                 Key: FLINK-16142
>                 URL: https://issues.apache.org/jira/browse/FLINK-16142
>             Project: Flink
>          Issue Type: Bug
>          Components: Client / Job Submission
>    Affects Versions: 1.10.0
>            Reporter: Thomas Wozniakowski
>            Priority: Blocker
>             Fix For: 1.10.1, 1.11.0
>
>
> Hi Guys,
> We've just tried deploying 1.10.0 as it has lots of shiny stuff that fits our 
> use-case exactly (RocksDB state backend running in a containerised cluster). 
> Unfortunately, it seems like there is a memory leak somewhere in the job 
> submission logic. We are getting this error:
> {code:java}
> 2020-02-18 10:22:10,020 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - OPERATOR_NAME switched from RUNNING to FAILED.
> java.lang.OutOfMemoryError: Metaspace
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> at org.apache.flink.kinesis.shaded.com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27)
> at org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:398)
> at org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.<clinit>(AwsSdkMetrics.java:359)
> at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.requestMetricCollector(AmazonWebServiceClient.java:728)
> at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRMCEnabledAtClientOrSdkLevel(AmazonWebServiceClient.java:660)
> at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRequestMetricsEnabled(AmazonWebServiceClient.java:652)
> at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:611)
> at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:606)
> at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1534)
> at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)
> at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.listShards(KinesisProxy.java:439)
> at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardsOfStream(KinesisProxy.java:389)
> at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardList(KinesisProxy.java:279)
> at org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.discoverNewShardsToSubscribe(KinesisDataFetcher.java:686)
> at org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.run(FlinkKinesisConsumer.java:287)
> at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
> at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63)
> {code}
> (The only change in the above text is the OPERATOR_NAME placeholder, where I 
> removed some internal specifics of our system.)
> This will reliably happen on a fresh cluster after submitting and cancelling 
> our job 3 times.
> We are using the presto-s3 plugin, the CEP library and the Kinesis connector.
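> Looking at the trace, one suspicious pattern: the shaded AWS SDK's 
> {{AwsSdkMetrics}} registers an MBean from its static initializer, and an 
> MBean registered with the JVM-global platform MBeanServer holds a strong 
> reference to the registered object, and transitively to the classloader that 
> loaded it. If that loader is the per-job {{ChildFirstClassLoader}}, it can 
> never be collected, so Metaspace grows with every submission. A minimal 
> stand-alone illustration of that mechanism (not Flink or AWS SDK code, just 
> the shape of the problem):
> {code:java}
> import java.lang.management.ManagementFactory;
> import javax.management.ObjectName;
> 
> // LeakDemoMBean.java -- the management interface must be public.
> public interface LeakDemoMBean {}
> 
> // LeakDemo.java -- registering an MBean from a static initializer pins
> // this instance (and its classloader) in the platform MBeanServer until
> // it is explicitly unregistered.
> public class LeakDemo implements LeakDemoMBean {
>     static {
>         try {
>             ManagementFactory.getPlatformMBeanServer().registerMBean(
>                     new LeakDemo(), new ObjectName("demo:type=LeakDemo"));
>         } catch (Exception e) {
>             throw new ExceptionInInitializerError(e);
>         }
>     }
> }
> {code}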
> Please let me know what other diagnostics would be useful.
> Tom


