Re: CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1?

Piotr Nowojski Thu, 21 Nov 2019 07:15:12 -0800

Hi,

I suspect that FLINK-12070 is the source of the problem:
https://issues.apache.org/jira/browse/FLINK-12070


There seems to be a hidden configuration option that avoids using memory-mapped 
files:

taskmanager.network.bounded-blocking-subpartition-type: file

Could you test whether it helps?
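
The memory figures Andreas reports below look consistent with this suspicion. Here is a rough back-of-the-envelope check, not a measurement. It assumes that resident memory-mapped pages count toward the container's physical memory on top of the JVM heap, and that the heap peak would be about the same without mapped files. Neither assumption is confirmed in this thread, and the function names are only illustrative.

```python
# Figures quoted in this thread, in GB. Treat them as approximate.
CONTAINER_LIMIT_GB = 12.0  # YARN physical memory limit from the container log
HEAP_PEAK_GB = 6.65        # peak JVM heap reported from the dashboard
MAPPED_GB = 6.0            # "nearly 6GB" of mapped memory reported on 1.9.1


def estimated_footprint_gb(heap_gb: float, mapped_gb: float) -> float:
    """Estimate physical memory as heap plus mapped memory.

    Assumption: mapped pages are resident and are counted against the
    container in addition to the heap. Other JVM overhead is ignored.
    """
    return heap_gb + mapped_gb


def exceeds_limit(heap_gb: float, mapped_gb: float, limit_gb: float) -> bool:
    """Return True if the estimated footprint is above the container limit."""
    return estimated_footprint_gb(heap_gb, mapped_gb) > limit_gb


if __name__ == "__main__":
    # Compare the reported mapped memory against a hypothetical run with none.
    for label, mapped in (("no mapped memory", 0.0), ("with mapped memory", MAPPED_GB)):
        total = estimated_footprint_gb(HEAP_PEAK_GB, mapped)
        verdict = "over" if exceeds_limit(HEAP_PEAK_GB, mapped, CONTAINER_LIMIT_GB) else "within"
        print(f"{label}: ~{total:.2f} GB, {verdict} the {CONTAINER_LIMIT_GB:.0f} GB limit")
```

Under these assumptions the heap alone fits comfortably in the container, while heap plus mapped memory lands just above the limit, which would match the container being killed without any OutOfMemory error in the logs.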

Piotrek

> On 21 Nov 2019, at 15:22, Hailu, Andreas <andreas.ha...@gs.com> wrote:
> 
> Hi Zhijiang,
>  
> I looked into the container logs for the failure, and didn’t see any specific 
> OutOfMemory errors before it was killed. I ran the application using the same 
> config this morning on 1.6.4, and it went through successfully. I took a 
> snapshot of the memory usage from the dashboard and can send it to you if you 
> like for reference.
>  
> What stands out to me as suspicious is that on 1.9.1, the application is 
> using nearly 6GB of Mapped memory before it dies, while 1.6.4 uses 0 
> throughout its runtime and succeeds. The JVM heap memory itself never exceeds 
> its capacity, peaking at 6.65GB, so it sounds like the problem lies somewhere 
> in the changes around mapped memory.
>  
> // ah
>  
> From: Zhijiang <wangzhijiang...@aliyun.com> 
> Sent: Wednesday, November 20, 2019 11:32 PM
> To: Hailu, Andreas [Engineering] <andreas.ha...@ny.email.gs.com>; 
> user@flink.apache.org
> Subject: Re: CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1?
>  
> Hi Andreas,
>  
> You are running a batch job, so there should be no native memory used by the 
> RocksDB state backend. So I guess that either heap memory or direct memory is 
> overused. The heap managed memory is mainly used by batch operators, and 
> direct memory is used by the network shuffle. Can you check whether there 
> are any logs indicating HeapOutOfMemory or DirectOutOfMemory before the 
> container was killed? If the used memory exceeds the JVM configuration, it 
> should throw that error. Then we can narrow down the scope further. I cannot 
> recall the memory-related changes to managed memory or the network stack, 
> especially since this spans several releases.
>  
> Best,
> Zhijiang
>  
> ------------------------------------------------------------------
> From:Hailu, Andreas <andreas.ha...@gs.com <mailto:andreas.ha...@gs.com>>
> Send Time:2019 Nov. 21 (Thu.) 01:03
> To:user@flink.apache.org <user@flink.apache.org 
> <mailto:user@flink.apache.org>>
> Subject:RE: CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1?
>  
> Going through the release notes today, we tried fiddling with the 
> taskmanager.memory.fraction option, going as low as 0.1, unfortunately with no 
> success. It still leads to the container running beyond physical memory 
> limits.
>  
> // ah
>  
> From: Hailu, Andreas [Engineering] 
> Sent: Tuesday, November 19, 2019 6:01 PM
> To: 'user@flink.apache.org' <user@flink.apache.org 
> <mailto:user@flink.apache.org>>
> Subject: CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1?
>  
> Hi,
>  
> We’re in the middle of testing the upgrade of our data processing flows from 
> Flink 1.6.4 to 1.9.1. We’re seeing that flows which were running just fine on 
> 1.6.4 now fail on 1.9.1 with the same application resources and input data 
> size. It seems that there have been some changes around how the data is 
> sorted prior to being fed to the CoGroup operator - this is the error that we 
> encounter:
>  
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Job 
> execution failed.
> at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
> at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:259)
> ... 15 more
> Caused by: java.lang.Exception: The data preparation for task 'CoGroup 
> (Dataset | Merge | NONE)' , caused an error: Error obtaining the sorted 
> input: Thread 'SortMerger Reading Thread' terminated due to an exception: 
> Lost connection to task manager 'd73996-213.dc.gs.com/10.47.226.218:46003'. 
> This indicates that the remote task manager was lost.
> at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:480)
> at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:369)
> at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
> ... 1 more
> Caused by: java.lang.RuntimeException: Error obtaining the sorted input: 
> Thread 'SortMerger Reading Thread' terminated due to an exception: Lost 
> connection to task manager 'd73996-213.dc.gs.com/10.47.226.218:46003'. This 
> indicates that the remote task manager was lost.
> at 
> org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:650)
> at org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1109)
> at 
> org.apache.flink.runtime.operators.CoGroupDriver.prepare(CoGroupDriver.java:102)
> at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:474)
>  
> I drilled further down into the YARN app logs, and I found that the container 
> was running out of physical memory:
>  
> 2019-11-19 12:49:23,068 INFO  org.apache.flink.yarn.YarnResourceManager       
>               - Closing TaskExecutor connection 
> container_e42_1574076744505_9444_01_000004 because: Container 
> [pid=42774,containerID=container_e42_1574076744505_9444_01_000004] is running 
> beyond physical memory limits. Current usage: 12.0 GB of 12 GB physical 
> memory used; 13.9 GB of 25.2 GB virtual memory used. Killing container.
>  
> This is what raises my suspicions, as this resource configuration worked just 
> fine on 1.6.4.
>  
> I’m working on getting heap dumps of these applications to get a better 
> understanding of what’s causing the blowup in required physical memory, but 
> it would be helpful if anyone knew what relevant changes have been made 
> between these versions or where else I could look. There are some features 
> in 1.9 that we’d like to use in our flows, and not having this sorted out, 
> no pun intended, is keeping us from doing so.
>  
> Best,
> Andreas
>  
> 
> Your Personal Data: We may collect and process information about you that may 
> be subject to data protection laws. For more information about how we use and 
> disclose your personal data, how we protect your information, our legal basis 
> to use your information, your rights and who you can contact, please refer 
> to: www.gs.com/privacy-notices <http://www.gs.com/privacy-notices>
>  
