[jira] [Comment Edited] (FLINK-14123) Change taskmanager.memory.fraction default value to 0.6

2019-10-15 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952028#comment-16952028
 ] 

Xintong Song edited comment on FLINK-14123 at 10/16/19 1:59 AM:


I don't know neither. I think you can try ask in the dev mailing list.


was (Author: xintongsong):
I don't know either. I think you can try ask in the dev mailing list.

> Change taskmanager.memory.fraction default value to 0.6
> ---
>
> Key: FLINK-14123
> URL: https://issues.apache.org/jira/browse/FLINK-14123
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration
>Affects Versions: 1.9.0
>Reporter: liupengcheng
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, we are testing flink batch task, such as terasort, however, it 
> started only awhile then it failed due to OOM. 
>  
> {code:java}
> org.apache.flink.client.program.ProgramInvocationException: Job failed. 
> (JobID: a807e1d635bd4471ceea4282477f8850)
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:262)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)
>   at 
> org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
>   at 
> org.apache.flink.api.scala.ExecutionEnvironment.execute(ExecutionEnvironment.scala:539)
>   at 
> com.github.ehiggs.spark.terasort.FlinkTeraSort$.main(FlinkTeraSort.scala:89)
>   at 
> com.github.ehiggs.spark.terasort.FlinkTeraSort.main(FlinkTeraSort.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:604)
>   at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:466)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:274)
>   at 
> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:746)
>   at 
> org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:273)
>   at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:205)
>   at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1007)
>   at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1080)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
>   at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>   at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1080)
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Job 
> execution failed.
>   at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:259)
>   ... 23 more
> Caused by: java.lang.RuntimeException: Error obtaining the sorted input: 
> Thread 'SortMerger Reading Thread' terminated due to an exception: GC 
> overhead limit exceeded
>   at 
> org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:650)
>   at 
> org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1109)
>   at org.apache.flink.runtime.operators.NoOpDriver.run(NoOpDriver.java:82)
>   at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:504)
>   at 
> org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:369)
>   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Thread 'SortMerger Reading Thread' terminated 
> due to an exception: GC overhead limit exceeded
>   at 
> org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:831)
> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
>   at 
> 

[jira] [Comment Edited] (FLINK-14123) Change taskmanager.memory.fraction default value to 0.6

2019-10-15 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16952028#comment-16952028
 ] 

Xintong Song edited comment on FLINK-14123 at 10/16/19 1:59 AM:


I don't know either. I think you can try ask in the dev mailing list.


was (Author: xintongsong):
I don't no either. I think you can try ask in the dev mailing list.

> Change taskmanager.memory.fraction default value to 0.6
> ---
>
> Key: FLINK-14123
> URL: https://issues.apache.org/jira/browse/FLINK-14123
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration
>Affects Versions: 1.9.0
>Reporter: liupengcheng
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, we are testing flink batch task, such as terasort, however, it 
> started only awhile then it failed due to OOM. 
>  
> {code:java}
> org.apache.flink.client.program.ProgramInvocationException: Job failed. 
> (JobID: a807e1d635bd4471ceea4282477f8850)
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:262)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)
>   at 
> org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
>   at 
> org.apache.flink.api.scala.ExecutionEnvironment.execute(ExecutionEnvironment.scala:539)
>   at 
> com.github.ehiggs.spark.terasort.FlinkTeraSort$.main(FlinkTeraSort.scala:89)
>   at 
> com.github.ehiggs.spark.terasort.FlinkTeraSort.main(FlinkTeraSort.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:604)
>   at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:466)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:274)
>   at 
> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:746)
>   at 
> org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:273)
>   at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:205)
>   at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1007)
>   at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1080)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1886)
>   at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>   at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1080)
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Job 
> execution failed.
>   at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:259)
>   ... 23 more
> Caused by: java.lang.RuntimeException: Error obtaining the sorted input: 
> Thread 'SortMerger Reading Thread' terminated due to an exception: GC 
> overhead limit exceeded
>   at 
> org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:650)
>   at 
> org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1109)
>   at org.apache.flink.runtime.operators.NoOpDriver.run(NoOpDriver.java:82)
>   at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:504)
>   at 
> org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:369)
>   at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
>   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: java.io.IOException: Thread 'SortMerger Reading Thread' terminated 
> due to an exception: GC overhead limit exceeded
>   at 
> org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:831)
> Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
>   at 
> 

[jira] [Comment Edited] (FLINK-14123) Change taskmanager.memory.fraction default value to 0.6

2019-10-01 Thread Xintong Song (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-14123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16942428#comment-16942428
 ] 

Xintong Song edited comment on FLINK-14123 at 10/2/19 1:59 AM:
---

[~StephanEwen], thanks for pulling me in.

Actually, there is a discussion regarding whether we should be backwards 
compatible on 'taskmanager.memory.fraction', in this 
[PR|[https://github.com/apache/flink/pull/9760]]. The problem is that, the 
definition of managed memory fraction becomes different.
 * In current codebase, 'taskmanager.memory.fraction' = managed memory / (total 
flink memory - network memory).
 * According to FLIP-49, the new config option 
'taskmanager.memory.managed.fraction' = managed memory / total flink memory.

If we support backwards compatibility on key 'taskmanager.memory.fraction', 
then it could be confusing for the user that the same configured fraction 
actually leads to different managed memory size. Maybe not supporting the 
legacy fraction could be a good way to draw attention of users on the 
definition change.

If we decide not supporting the legacy fraction, then there would be no 
necessary on changing the default value. If we decide to be backwards 
compatible on this, then it makes more sense to decrease the default value 
because of the definition changes. Actually, the default value of the new 
fraction according to FLIP-49 is 0.5.


was (Author: xintongsong):
[~StephanEwen],

I'm not sure about this.

 

Actually, there is a discussion regarding whether we should be backwards 
compatible on 'taskmanager.memory.fraction', in this 
[PR|[https://github.com/apache/flink/pull/9760]]. The problem is that, the 
definition of managed memory fraction becomes different.
 * In current codebase, 'taskmanager.memory.fraction' = managed memory / (total 
flink memory - network memory).
 * According to FLIP-49, the new config option 
'taskmanager.memory.managed.fraction' = managed memory / total flink memory.

If we support backwards compatibility on key 'taskmanager.memory.fraction', 
then it could be confusing for the user that the same configured fraction 
actually leads to different managed memory size. Maybe not supporting the 
legacy fraction could be a good way to draw attention of users on the 
definition change.

 

If we decide not supporting the legacy fraction, then there would be no 
necessary on changing the default value. If we decide to be backwards 
compatible on this, then it makes more sense to decrease the default value 
because of the definition changes. Actually, the default value of the new 
fraction according to FLIP-49 is 0.5.

> Change taskmanager.memory.fraction default value to 0.6
> ---
>
> Key: FLINK-14123
> URL: https://issues.apache.org/jira/browse/FLINK-14123
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Configuration
>Affects Versions: 1.9.0
>Reporter: liupengcheng
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Currently, we are testing flink batch task, such as terasort, however, it 
> started only awhile then it failed due to OOM. 
>  
> {code:java}
> org.apache.flink.client.program.ProgramInvocationException: Job failed. 
> (JobID: a807e1d635bd4471ceea4282477f8850)
>   at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:262)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:326)
>   at 
> org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:62)
>   at 
> org.apache.flink.api.scala.ExecutionEnvironment.execute(ExecutionEnvironment.scala:539)
>   at 
> com.github.ehiggs.spark.terasort.FlinkTeraSort$.main(FlinkTeraSort.scala:89)
>   at 
> com.github.ehiggs.spark.terasort.FlinkTeraSort.main(FlinkTeraSort.scala)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:604)
>   at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:466)
>   at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:274)
>   at 
> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:746)
>   at 
> org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:273)
>   at