Ufuk Celebi created FLINK-2969:
----------------------------------
Summary: FlinkYarnSessionCli with recovery enabled fails when killing TaskManager
Key: FLINK-2969
URL: https://issues.apache.org/jira/browse/FLINK-2969
Project: Flink
Issue Type: Bug
Components: Distributed Runtime, YARN Client
Affects Versions: 0.10
Reporter: Ufuk Celebi
I'm running a YARN session with 2 physical nodes and 5 containers
(ApplicationMaster and 4 TaskManagers). There is no Flink program submitted to
the cluster.
When I run a sequence of failure operations (killing the ApplicationMaster and
TaskManager containers), I sometimes get the following exception after killing
a TaskManager:
{code}
15:31:20,721 WARN  org.apache.flink.client.FlinkYarnSessionCli - Exception while running the interactive command line interface
java.lang.RuntimeException: Unable to get Cluster status from Application Client
    at org.apache.flink.yarn.FlinkYarnCluster.getClusterStatus(FlinkYarnCluster.java:307)
    at org.apache.flink.client.FlinkYarnSessionCli.runInteractiveCli(FlinkYarnSessionCli.java:296)
    at org.apache.flink.client.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:455)
    at org.apache.flink.client.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:351)
Caused by: akka.pattern.AskTimeoutException: Recipient[Actor[akka://flink/user/applicationClient#-607831833]] had already been terminated.
    at akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:132)
    at akka.pattern.AskableActorRef$.$qmark$extension(AskSupport.scala:144)
    at akka.pattern.AskSupport$class.ask(AskSupport.scala:75)
    at akka.pattern.package$.ask(package.scala:43)
    at akka.pattern.Patterns$.ask(Patterns.scala:47)
    at akka.pattern.Patterns.ask(Patterns.scala)
    at org.apache.flink.yarn.FlinkYarnCluster.getClusterStatus(FlinkYarnCluster.java:302)
    ... 3 more
{code}
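For illustration only: the trace above shows the failed status query propagating as a RuntimeException out of the interactive CLI loop. A minimal sketch of how a caller could treat such a failure as "status unavailable" instead of aborting might look like the following. The class StatusPollSketch and the method tryGetStatus are hypothetical names, not part of Flink, and the Callable merely stands in for a call such as FlinkYarnCluster.getClusterStatus. This is not a proposed fix, only a way to picture the failure mode.

```java
import java.util.Optional;
import java.util.concurrent.Callable;

// Hypothetical sketch, not Flink code: wraps a status query so that a
// failure is reported as "no status" rather than thrown to the caller.
public class StatusPollSketch {

    // Runs the query; returns Optional.empty() if it throws or yields null.
    static <T> Optional<T> tryGetStatus(Callable<T> query) {
        try {
            return Optional.ofNullable(query.call());
        } catch (Exception e) {
            // In the report, this would be the RuntimeException that wraps
            // akka.pattern.AskTimeoutException.
            System.err.println("Status unavailable: " + e.getMessage());
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Stand-in for a query against an already terminated actor.
        Callable<String> failing = () -> {
            throw new RuntimeException(
                "Unable to get Cluster status from Application Client");
        };
        // Stand-in for a query that succeeds.
        Callable<String> healthy = () -> "4 TaskManagers";

        System.out.println(tryGetStatus(failing).orElse("unknown"));
        System.out.println(tryGetStatus(healthy).orElse("unknown"));
    }
}
```

Whether swallowing the failure is appropriate depends on why the applicationClient actor terminated, which is what still needs to be investigated.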
I would like to investigate this for the 0.10.1/1.0 release and not block the
current RC.