Hi,

I suspect that https://issues.apache.org/jira/browse/FLINK-12070 is the source of the problems.
There seems to be a hidden configuration option that avoids using memory-mapped files:

taskmanager.network.bounded-blocking-subpartition-type: file

Could you test whether it helps?

Piotrek

> On 21 Nov 2019, at 15:22, Hailu, Andreas <andreas.ha...@gs.com> wrote:
>
> Hi Zhijiang,
>
> I looked into the container logs for the failure, and didn’t see any specific
> OutOfMemory errors before it was killed. I ran the application using the same
> config this morning on 1.6.4, and it went through successfully. I took a
> snapshot of the memory usage from the dashboard and can send it to you if you
> like for reference.
>
> What stands out to me as suspicious is that on 1.9.1, the application is
> using nearly 6GB of Mapped memory before it dies, while 1.6.4 uses 0
> throughout its runtime and succeeds. The JVM heap memory itself never exceeds
> its capacity, peaking at 6.65GB, so it sounds like the problem lies somewhere
> in the changes around mapped memory.
>
> // ah
>
> From: Zhijiang <wangzhijiang...@aliyun.com>
> Sent: Wednesday, November 20, 2019 11:32 PM
> To: Hailu, Andreas [Engineering] <andreas.ha...@ny.email.gs.com>;
> user@flink.apache.org
> Subject: Re: CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1?
>
> Hi Andreas,
>
> You are running a batch job, so there should be no native memory used by the
> RocksDB state backend. So I guess either heap memory or direct memory is
> overused. Managed heap memory is mainly used by batch operators, and direct
> memory is used by the network shuffle. Can you check whether any logs
> indicate HeapOutOfMemory or DirectOutOfMemory before the container was killed?
> If the used memory exceeds the JVM configuration, it should throw that error.
> Then we can further narrow down the scope. I cannot recall the memory-related
> changes to managed memory or the network stack, especially since they span
> several releases.
>
> Best,
> Zhijiang
>
> ------------------------------------------------------------------
> From: Hailu, Andreas <andreas.ha...@gs.com>
> Send Time: 2019 Nov. 21 (Thu.) 01:03
> To: user@flink.apache.org
> Subject: RE: CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1?
>
> Going through the release notes today - we tried fiddling with the
> taskmanager.memory.fraction option, going as low as 0.1 with unfortunately no
> success. It still leads to the container running beyond physical memory
> limits.
>
> // ah
>
> From: Hailu, Andreas [Engineering]
> Sent: Tuesday, November 19, 2019 6:01 PM
> To: 'user@flink.apache.org' <user@flink.apache.org>
> Subject: CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1?
>
> Hi,
>
> We’re in the middle of testing the upgrade of our data processing flows from
> Flink 1.6.4 to 1.9.1. We’re seeing that flows which were running just fine on
> 1.6.4 now fail on 1.9.1 with the same application resources and input data
> size. It seems that there have been some changes around how the data is
> sorted prior to being fed to the CoGroup operator - this is the error that we
> encounter:
>
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
>     at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
>     at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:259)
>     ... 15 more
> Caused by: java.lang.Exception: The data preparation for task 'CoGroup (Dataset | Merge | NONE)' , caused an error: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Lost connection to task manager 'd73996-213.dc.gs.com/10.47.226.218:46003'. This indicates that the remote task manager was lost.
>     at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:480)
>     at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:369)
>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
>     ... 1 more
> Caused by: java.lang.RuntimeException: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Lost connection to task manager 'd73996-213.dc.gs.com/10.47.226.218:46003'. This indicates that the remote task manager was lost.
>     at org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:650)
>     at org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1109)
>     at org.apache.flink.runtime.operators.CoGroupDriver.prepare(CoGroupDriver.java:102)
>     at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:474)
>
> I drilled further down into the YARN app logs, and I found that the container
> was running out of physical memory:
>
> 2019-11-19 12:49:23,068 INFO org.apache.flink.yarn.YarnResourceManager - Closing TaskExecutor connection container_e42_1574076744505_9444_01_000004 because: Container [pid=42774,containerID=container_e42_1574076744505_9444_01_000004] is running beyond physical memory limits. Current usage: 12.0 GB of 12 GB physical memory used; 13.9 GB of 25.2 GB virtual memory used. Killing container.
>
> This is what raises my suspicions, as this resourcing configuration worked
> just fine on 1.6.4.
>
> I’m working on getting heap dumps of these applications myself to try to get
> a better understanding of what’s causing the blowup in required physical
> memory, but it would be helpful if anyone knew what relevant changes have
> been made between these versions, or where else I could look. There are some
> features in 1.9 that we’d like to use in our flows, and not having this
> sorted out, no pun intended, is inhibiting us from doing so.
>
> Best,
> Andreas
>
>
> Your Personal Data: We may collect and process information about you that may
> be subject to data protection laws. For more information about how we use and
> disclose your personal data, how we protect your information, our legal basis
> to use your information, your rights and who you can contact, please refer
> to: www.gs.com/privacy-notices
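
For anyone trying the suggestion at the top of this thread, here is a minimal sketch of where the setting would go. It assumes the cluster reads its settings from conf/flink-conf.yaml and that the TaskManagers (or the YARN session) are restarted afterwards so the value is picked up. The key and value are exactly the ones quoted above. The option is described as hidden, so its behaviour on other Flink versions is not confirmed here.

```yaml
# conf/flink-conf.yaml (assumed location of the cluster configuration)
# Key and value taken verbatim from the reply at the top of this thread.
# Intended effect, per that reply: avoid memory-mapped files for blocking
# (batch) subpartitions. Not verified on any version other than the one
# discussed in this thread.
taskmanager.network.bounded-blocking-subpartition-type: file
```

If the job then stays within the 12 GB container limit and the Mapped memory figure on the dashboard drops back towards 0, that would support the FLINK-12070 suspicion.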