[ https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16423787#comment-16423787 ]

dhiraj prajapati commented on FLINK-9120:
-----------------------------------------

Hi [~till.rohrmann]/[~sihuazhou], please find the steps to reproduce the issue below.

Here is what I did:
1) Started a cluster with the JM on machine A, one TM on machine B, and one TM
on machine C.
2) Submitted a job to the cluster. Everything works fine up to this point.
3) Forcefully killed the TM on machine C. The web UI shows the job failing,
then restarting, and finally running again on its own, with the TM on machine B
handling everything. This is the expected behavior.
4) Started the TM on machine C again and waited for a sufficient amount of
time. At this point both TMs are up.
5) Killed the TM on machine B. At this point the job fails. Shouldn't the job
be taken over by the running TM on machine C?
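
As a rough sanity check of the expectation in step 5, the slot arithmetic can
be sketched as below. This is only an illustration, not Flink code: the
function names are hypothetical, and it assumes the job runs at the
parallelism.default of 2 with default slot sharing (so it needs as many slots
as its parallelism) and that each TM offers taskmanager.numberOfTaskSlots = 2
slots, as stated in the issue description.

```python
# Illustrative slot arithmetic only; these helpers are hypothetical and are
# not part of any Flink API.

SLOTS_PER_TM = 2      # taskmanager.numberOfTaskSlots from the issue description
JOB_PARALLELISM = 2   # parallelism.default from the issue description


def available_slots(live_task_managers: int, slots_per_tm: int = SLOTS_PER_TM) -> int:
    """Total slots offered by the TMs that are currently alive."""
    return live_task_managers * slots_per_tm


def can_schedule(live_task_managers: int, parallelism: int = JOB_PARALLELISM) -> bool:
    """Assumes default slot sharing: the job needs `parallelism` slots."""
    return available_slots(live_task_managers) >= parallelism


# Number of live TMs after each step of the reproduction.
steps = [
    ("1-2) both TMs up", 2),
    ("3) TM on C killed", 1),
    ("4) TM on C restarted", 2),
    ("5) TM on B killed", 1),
]

for label, live in steps:
    print(f"{label}: slots={available_slots(live)}, schedulable={can_schedule(live)}")
```

By this count a single live TM offers enough slots at every step, so slot
capacity alone does not seem to explain the failure in step 5.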

> Task Manager Fault Tolerance issue
> ----------------------------------
>
>                 Key: FLINK-9120
>                 URL: https://issues.apache.org/jira/browse/FLINK-9120
>             Project: Flink
>          Issue Type: Bug
>          Components: Cluster Management, Configuration, Core
>    Affects Versions: 1.4.2
>            Reporter: dhiraj prajapati
>            Priority: Critical
>         Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log
>
>
> Hi, 
> I have set up a Flink 1.4 cluster with 1 job manager and two task managers. 
> The configs taskmanager.numberOfTaskSlots and parallelism.default were set 
> to 2 on each node. I submitted a job to this cluster and it runs fine. To 
> test fault tolerance, I killed one task manager. I was expecting the job to 
> run fine because one of the 2 task managers was still up and running. 
> However, the job failed. Am I missing something? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
