metaspace out-of-memory & error while retrieving the leader gateway

2020-09-18 Thread Claude M
Hello, I upgraded from Flink 1.7.2 to 1.10.2. One of the jobs running on the task managers is periodically crashing w/ the following error: java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JV

Re: metaspace out-of-memory & error while retrieving the leader gateway

2020-09-20 Thread Xintong Song
Hi Claude, IIUC, in your case the leader retrieving problem is triggered by adding the `java.opts`? Then could you try to find and post the complete command for launching the JVM process? You can try logging into the pod and executing `ps -ef | grep `. A few more questions: - What do you mean by "resol

Re: metaspace out-of-memory & error while retrieving the leader gateway

2020-09-21 Thread Xintong Song
## Metaspace OOM
As the error message already suggested, the metaspace OOM you encountered is likely caused by a class loading leak. I think you are on the right track trying to look into the heap dump and find out where the leak comes from. IIUC, after removing the ZK folder, you are now able

RE: metaspace out-of-memory & error while retrieving the leader gateway

2020-09-21 Thread Zhou, Brian
To: Claude M; user
Subject: Re: metaspace out-of-memory & error while retrieving the leader gateway
## Metaspace OOM
As the error message already suggested, the metaspace OOM you encountered is likely caused by a class loading leak. I think you are on the right track trying to look into

Re: metaspace out-of-memory & error while retrieving the leader gateway

2020-09-21 Thread Xintong Song
> leader retrieving may be stuck.
>
> Best Regards,
> Brian
>
> *From:* Xintong Song
> *Sent:* Tuesday, September 22, 2020 10:16
> *To:* Claude M; user
> *Subject:* Re: metaspace out-of-memory & error while retrieving the leader gateway

Re: metaspace out-of-memory & error while retrieving the leader gateway

2020-09-22 Thread Claude M
>> starts would load new classes, then expand the metaspace, and finally OOM happens.
>>
>> ## Leader retrieving
>>
>> Constant restarts may be heavy for jobmanager, if JM CPU resources are not enough, the thread for leader retrieving may be stuck.
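The leak described in the quoted text (threads left hanging after a restart keep the old class loader alive) can be illustrated with a minimal, dependency-free sketch. It does not use the Flink API: `PollingSource`, `cancel()` and `stopsAfterCancel()` are hypothetical names chosen for illustration, and the sketch only assumes that a source runs its own worker loop. The point is that the loop honors both a stop flag and an interrupt, so no thread outlives the job and pins its classes.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical, Flink-free sketch: a source-like worker that terminates
// promptly on cancel, so it cannot keep a job's class loader reachable.
class CancellableSourceSketch {

    static final class PollingSource implements Runnable {
        // volatile so the worker thread sees the update made by cancel()
        private volatile boolean running = true;

        @Override
        public void run() {
            while (running) {
                try {
                    // Stand-in for a blocking poll on an external system.
                    TimeUnit.MILLISECONDS.sleep(50);
                } catch (InterruptedException e) {
                    // Restore the interrupt flag and leave the loop.
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }

        void cancel() {
            running = false;
        }
    }

    // Starts a worker, cancels it, and reports whether the thread ended.
    public static boolean stopsAfterCancel() {
        PollingSource source = new PollingSource();
        Thread worker = new Thread(source, "source-worker");
        worker.start();
        source.cancel();      // cooperative stop
        worker.interrupt();   // wake the worker if it is blocked
        try {
            worker.join(2000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        System.out.println("worker stopped: " + stopsAfterCancel());
    }
}
```

A worker loop that ignores both signals would still be alive after the job restarts, which is the situation the quoted message attributes the metaspace growth to.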

Re: metaspace out-of-memory & error while retrieving the leader gateway

2020-09-23 Thread Claude M
>>> anging, cause the class loader cannot close. New restarts would load new classes, then expand the metaspace, and finally OOM happens.
>>>
>>> ## Leader retrieving
>>>
>>> Constant restarts may be heavy for jobmanager, if JM CPU

Re: metaspace out-of-memory & error while retrieving the leader gateway

2020-09-23 Thread Claude M
>>>> https://issues.apache.org/jira/browse/FLINK-15467 , when we have some job restarts, there will be some threads from the sourceFunction hanging, cause the class loader cannot close. New restarts would load new classes, then expand the metaspace, and fi

Re: metaspace out-of-memory & error while retrieving the leader gateway

2020-09-23 Thread Xintong Song
>>>>> restart.
>>>>> 2. Your CPU resource on jobmanager might be small
>>>>>
>>>>> Here is some findings I want to share.
>>>>>
>>>>> ## Metaspace OOM

Re: metaspace out-of-memory & error while retrieving the leader gateway

2020-09-24 Thread Claude M
>>>>>> 1. Your job is using default restart strategy, which is per-second restart.
>>>>>> 2. Your CPU resource on jobmanager might be small
>>>>>>
>>>>>> Her

Re: metaspace out-of-memory & error while retrieving the leader gateway

2020-09-24 Thread Xintong Song
>>>>>>> In our internal tests, we also encounter these two issues and we spent much time debugging them. There are two points I need to confirm if we share the same problem.