[jira] [Comment Edited] (FLINK-13489) Heavy deployment end-to-end test fails on Travis with TM heartbeat timeout

Till Rohrmann (JIRA) Mon, 05 Aug 2019 02:51:17 -0700


    [ 
https://issues.apache.org/jira/browse/FLINK-13489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16899343#comment-16899343
 ]


Till Rohrmann edited comment on FLINK-13489 at 8/5/19 9:50 AM:
---------------------------------------------------------------

[~StephanEwen] I run the test for many times, but only encountered the akka 
timeout problem only once, and nerve encountered the heartbeat timeout problem. 
But unfortunately, I did not get the JM/TM log of that failure. Latter, I 
modified the test script to print gc and JM/TM log out and run the test for 
many times, but the timeout problem did not occur. I noticed the gc time is a 
little long, many 2, 3, 4 seconds (these are for successfully finished job). I 
guess the previous failure may result by GC. Another problem is that the Travis 
test platform seems not stable, the test time varies.

As for containerized.heap-cutoff-min, it is because it was used for memory 
calculation. If the default value (600) is used, the standalone cluster can not 
start up. I agree with you that this config option should not be considered by 
standalone mode, but it seems reusing the same code (I think it also should be 
fixed). The flowing is the exception stack:

{code}
 2019-08-01 18:42:29,289 ERROR 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Could not start cluster 
entrypoint StandaloneSessionClusterEntrypoint.
 org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to 
initialize the cluster entrypoint StandaloneSessionClusterEntrypoint.
 at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:182)
 at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:501)
 at 
org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint.main(StandaloneSessionClusterEntrypoint.java:65)
 Caused by: org.apache.flink.util.FlinkException: Could not create the 
DispatcherResourceManagerComponent.
 at 
org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:259)
 at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:210)
 at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:164)
 at 
org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
 at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:163)
 ... 2 more
 Caused by: java.lang.IllegalArgumentException: The configuration value 
'containerized.heap-cutoff-min'='600' is larger than the total container memory 
512
 at 
org.apache.flink.runtime.clusterframework.ContaineredTaskManagerParameters.calculateCutoffMB(ContaineredTaskManagerParameters.java:133)
 at 
org.apache.flink.runtime.util.ResourceManagerUtil.getResourceManagerConfiguration(ResourceManagerUtil.java:34)
 at 
org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:171)
 ... 6 more
{code}


was (Author: kevin.cyj):
[~StephanEwen] I run the test for many times, but only encountered the akka 
timeout problem only once, and nerve encountered the heartbeat timeout problem. 
But unfortunately, I did not get the JM/TM log of that failure. Latter, I 
modified the test script to print gc and JM/TM log out and run the test for 
many times, but the timeout problem did not occur. I noticed the gc time is a 
little long, many 2, 3, 4 seconds (these are for successfully finished job). I 
guess the previous failure may result by GC. Another problem is that the Travis 
test platform seems not stable, the test time varies.

As for containerized.heap-cutoff-min, it is because it was used for memory 
calculation. If the default value (600) is used, the standalone cluster can not 
start up. I agree with you that this config option should not be considered by 
standalone mode, but it seems reusing the same code (I think it also should be 
fixed). The flowing is the exception stack:


 2019-08-01 18:42:29,289 ERROR 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Could not start cluster 
entrypoint StandaloneSessionClusterEntrypoint.
 org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to 
initialize the cluster entrypoint StandaloneSessionClusterEntrypoint.
 at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:182)
 at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:501)
 at 
org.apache.flink.runtime.entrypoint.StandaloneSessionClusterEntrypoint.main(StandaloneSessionClusterEntrypoint.java:65)
 Caused by: org.apache.flink.util.FlinkException: Could not create the 
DispatcherResourceManagerComponent.
 at 
org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:259)
 at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:210)
 at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:164)
 at 
org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
 at 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:163)
 ... 2 more
 Caused by: java.lang.IllegalArgumentException: The configuration value 
'containerized.heap-cutoff-min'='600' is larger than the total container memory 
512
 at 
org.apache.flink.runtime.clusterframework.ContaineredTaskManagerParameters.calculateCutoffMB(ContaineredTaskManagerParameters.java:133)
 at 
org.apache.flink.runtime.util.ResourceManagerUtil.getResourceManagerConfiguration(ResourceManagerUtil.java:34)
 at 
org.apache.flink.runtime.entrypoint.component.AbstractDispatcherResourceManagerComponentFactory.create(AbstractDispatcherResourceManagerComponentFactory.java:171)
 ... 6 more

> Heavy deployment end-to-end test fails on Travis with TM heartbeat timeout
> --------------------------------------------------------------------------
>
>                 Key: FLINK-13489
>                 URL: https://issues.apache.org/jira/browse/FLINK-13489
>             Project: Flink
>          Issue Type: Bug
>          Components: Test Infrastructure
>            Reporter: Tzu-Li (Gordon) Tai
>            Assignee: Yingjie Cao
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.9.0
>
>
> https://api.travis-ci.org/v3/job/564925128/log.txt
> {code}
> ------------------------------------------------------------
>  The program finished with the following exception:
> org.apache.flink.client.program.ProgramInvocationException: Job failed. 
> (JobID: 1b4f1807cc749628cfc1bdf04647527a)
>       at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:250)
>       at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:338)
>       at 
> org.apache.flink.streaming.api.environment.StreamContextEnvironment.execute(StreamContextEnvironment.java:60)
>       at 
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1507)
>       at 
> org.apache.flink.deployment.HeavyDeploymentStressTestProgram.main(HeavyDeploymentStressTestProgram.java:70)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:498)
>       at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:576)
>       at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:438)
>       at 
> org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:274)
>       at 
> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:746)
>       at 
> org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:273)
>       at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:205)
>       at 
> org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1010)
>       at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1083)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at javax.security.auth.Subject.doAs(Subject.java:422)
>       at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
>       at 
> org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>       at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1083)
> Caused by: org.apache.flink.runtime.client.JobExecutionException: Job 
> execution failed.
>       at 
> org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
>       at 
> org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:247)
>       ... 21 more
> Caused by: java.util.concurrent.TimeoutException: Heartbeat of TaskManager 
> with id ea456d6a590eca7598c19c4d35e56db9 timed out.
>       at 
> org.apache.flink.runtime.jobmaster.JobMaster$TaskManagerHeartbeatListener.notifyHeartbeatTimeout(JobMaster.java:1149)
>       at 
> org.apache.flink.runtime.heartbeat.HeartbeatManagerImpl$HeartbeatMonitor.run(HeartbeatManagerImpl.java:318)
>       at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>       at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>       at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:397)
>       at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:190)
>       at 
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:74)
>       at 
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:152)
>       at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
>       at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
>       at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
>       at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
>       at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
>       at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>       at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>       at akka.actor.Actor$class.aroundReceive(Actor.scala:517)
>       at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:225)
>       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:592)
>       at akka.actor.ActorCell.invoke(ActorCell.scala:561)
>       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:258)
>       at akka.dispatch.Mailbox.run(Mailbox.scala:225)
>       at akka.dispatch.Mailbox.exec(Mailbox.scala:235)
>       at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>       at 
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>       at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>       at 
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)

[jira] [Comment Edited] (FLINK-13489) Heavy deployment end-to-end test fails on Travis with TM heartbeat timeout

Reply via email to