Thanks for your responses.

1. There were no job restarts prior to the metaspace OOM.
2. I tried increasing the CPU request and still encountered the problem. Any configuration change I make to the job manager, whether in flink-conf.yaml or in the pod's CPU/memory request, results in this problem.
On Tue, Sep 22, 2020 at 12:04 AM Xintong Song <tonysong...@gmail.com> wrote:

> Thanks for the input, Brian.
>
> This looks like what we are looking for. The issue is fixed in 1.10.3, which also matches the fact that this problem occurred in 1.10.2.
>
> Maybe Claude can further confirm it.
>
> Thank you~
> Xintong Song
>
> On Tue, Sep 22, 2020 at 10:57 AM Zhou, Brian <b.z...@dell.com> wrote:
>
>> Hi Xintong and Claude,
>>
>> In our internal tests we also encountered these two issues and spent much time debugging them. There are two points I need to confirm to see whether we share the same problem:
>>
>> 1. Your job is using the default restart strategy, which restarts every second.
>> 2. Your CPU resource on the jobmanager might be small.
>>
>> Here are some findings I want to share.
>>
>> ## Metaspace OOM
>> Due to https://issues.apache.org/jira/browse/FLINK-15467, when the job restarts, some threads from the source function can hang, so the class loader cannot be closed. Each restart then loads new classes, the metaspace keeps expanding, and finally the OOM happens.
>>
>> ## Leader retrieving
>> Constant restarts can be heavy for the jobmanager; if the JM's CPU resources are not enough, the thread for leader retrieving may get stuck.
>>
>> Best Regards,
>> Brian
>>
>> *From:* Xintong Song <tonysong...@gmail.com>
>> *Sent:* Tuesday, September 22, 2020 10:16
>> *To:* Claude M; user
>> *Subject:* Re: metaspace out-of-memory & error while retrieving the leader gateway
>>
>> ## Metaspace OOM
>> As the error message already suggests, the metaspace OOM you encountered is likely caused by a class loading leak. I think you are on the right track trying to look into the heap dump and find out where the leak comes from. IIUC, after removing the ZK folder, you are now able to run Flink with the heap dump options.
>>
>> The problem did not occur in previous versions because Flink only started setting a metaspace limit in the 1.10 release. The class loading leak might have been there all along but was never discovered, which could lead to unpredictable stability and performance issues. That is why Flink updated its memory model and explicitly set the metaspace limit in the 1.10 release.
>>
>> ## Leader retrieving
>> The command looks good to me. If this problem happened only once, it may be unrelated to adding the options. If it does not block you from getting the heap dump, we can look into it later.
>>
>> Thank you~
>> Xintong Song
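A minimal sketch of the leak pattern Brian describes above, for anyone following along. This is illustrative only: the class name and the blocking poll() are hypothetical, not the actual code behind FLINK-15467. The point is that a source whose worker thread never exits keeps the user-code class loader reachable, so every restart loads a duplicate set of job classes into metaspace.

import org.apache.flink.streaming.api.functions.source.SourceFunction;

// Illustrative source. If cancel() did not stop the worker thread,
// the thread would keep running after a job restart, pinning the
// user-code class loader and growing metaspace on every restart.
public class PollingSource implements SourceFunction<String> {

    private volatile boolean running = true;
    private transient volatile Thread worker;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        worker = Thread.currentThread();
        while (running) {
            String record = poll(); // hypothetical blocking read
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect(record);
            }
        }
    }

    @Override
    public void cancel() {
        running = false;
        // Interrupt the blocking read so the thread can actually exit;
        // a thread left hanging here is exactly what leaks the class loader.
        Thread t = worker;
        if (t != null) {
            t.interrupt();
        }
    }

    private String poll() throws InterruptedException {
        Thread.sleep(100); // stand-in for a blocking external read
        return "record";
    }
}

Interrupting the worker in cancel() lets the thread exit, so the class loader can be unloaded on the next restart.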
>> On Mon, Sep 21, 2020 at 9:37 PM Claude M <claudemur...@gmail.com> wrote:
>>
>> Hi Xintong,
>>
>> Thanks for your reply. Here is the command output w/ the java.opts:
>>
>> /usr/local/openjdk-8/bin/java -Xms768m -Xmx768m -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml -classpath /opt/flink/lib/flink-metrics-datadog-statsd-2.11-0.1.jar:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.7.5-10.0.jar:/opt/flink/lib/flink-table-blink_2.11-1.10.2.jar:/opt/flink/lib/flink-table_2.11-1.10.2.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.15.jar:/opt/flink/lib/flink-dist_2.11-1.10.2.jar::/etc/hadoop/conf: org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint --configDir /opt/flink/conf --executionMode cluster
>>
>> To answer your questions:
>>
>> - Correct, in order for the pod to start up, I have to remove the flink app folder from zookeeper. I only have to delete it once, after applying the java.opts arguments. It doesn't make sense, though, that I should have to do this just from adding a parameter.
>> - I'm using the standalone deployment.
>> - I'm using job cluster mode.
>>
>> A higher-priority issue I'm trying to solve is the metaspace out-of-memory that is occurring in the task managers. This was not happening before I upgraded to Flink 1.10.2. Even after increasing the memory, I'm still encountering the problem. That is when I added the java.opts argument, to see if I could get more information about the problem, and that is when I ran into the second issue w/ the job manager pod not starting up.
>>
>> Thanks
>>
>> On Sun, Sep 20, 2020 at 10:23 PM Xintong Song <tonysong...@gmail.com> wrote:
>>
>> Hi Claude,
>>
>> IIUC, in your case the leader retrieving problem is triggered by adding the `java.opts`? Could you try to find and post the complete command for launching the JVM process? You can try logging into the pod and executing `ps -ef | grep <PID>`.
>>
>> A few more questions:
>>
>> - What do you mean by "resolve this"? Does the jobmanager pod get stuck there and recover when you remove the folder from ZK? Do you have to do the removal every time you deploy to Kubernetes?
>>   > The only way I can resolve this is to delete the folder from zookeeper which I shouldn't have to do.
>> - Which of Flink's Kubernetes deployments are you using, standalone or native Kubernetes?
>> - Which cluster mode are you using: job cluster, session cluster, or application mode?
>>
>> Thank you~
>> Xintong Song
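An aside for anyone reproducing this: once the JVM PID is known from `ps -ef`, a heap dump and class loader statistics can also be pulled on demand rather than waiting for the next OOM. A rough sketch, assuming a JDK-based image that ships the standard tools, with a hypothetical pod name and the common in-container PID of 1:

# Hypothetical pod name and PID; adjust to the actual deployment.
kubectl exec flink-taskmanager-0 -- jcmd 1 GC.heap_dump /opt/flink/log/tm.hprof
# JDK 8 jmap can print per-class-loader statistics, useful for spotting duplicated loaders.
kubectl exec flink-taskmanager-0 -- jmap -clstats 1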
>> On Sat, Sep 19, 2020 at 1:22 AM Claude M <claudemur...@gmail.com> wrote:
>>
>> Hello,
>>
>> I upgraded from Flink 1.7.2 to 1.10.2. One of the jobs running on the task managers is periodically crashing w/ the following error:
>>
>> java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak. In the first case 'taskmanager.memory.jvm-metaspace.size' configuration option should be increased. If the error persists (usually in cluster after several job (re-)submissions) then there is probably a class loading leak which has to be investigated and fixed. The task executor has to be shutdown.
>>
>> I found this issue regarding it: https://issues.apache.org/jira/browse/FLINK-16406
>>
>> I tried increasing taskmanager.memory.jvm-metaspace.size to 256M and then 512M and was still having the problem.
>>
>> I then added the following to flink-conf.yaml to try to get more information about the error:
>>
>> env.java.opts: -XX:+UseG1GC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log
>>
>> When I deployed the change, which is in a Kubernetes cluster, the jobmanager pod fails to start up and the following message shows repeatedly:
>>
>> 2020-09-18 17:03:46,255 WARN org.apache.flink.runtime.webmonitor.retriever.impl.RpcGatewayRetriever - Error while retrieving the leader gateway. Retrying to connect to akka.tcp://flink@flink-jobmanager:50010/user/dispatcher.
>>
>> The only way I can resolve this is to delete the folder from zookeeper, which I shouldn't have to do.
>>
>> Any ideas on these issues?
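For readers skimming the thread, the two flink-conf.yaml settings under discussion look roughly like this. The sizes are illustrative and need tuning per job, and the heap dump path must be writable by the Flink process:

# Raise the metaspace ceiling; Flink 1.10 enforces a default limit (see FLINK-16406 on sizing).
taskmanager.memory.jvm-metaspace.size: 512m
# Take a heap dump when the OOM fires, for class loading leak analysis.
env.java.opts: -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log

If the error persists after raising the limit, as reported in this thread, the dump is the artifact to inspect for leaked class loaders.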