[jira] [Reopened] (FLINK-9120) Task Manager Fault Tolerance issue
[ https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dhiraj prajapati reopened FLINK-9120: - As suggested by [~sihuazhou], reopening the issue till [~till.rohrmann] comments. Please note, that it did not work with RestartStrategies.fixedDelayRestart(3, 5000) but worked with RestartStrategies.fixedDelayRestart(20, 5000) > Task Manager Fault Tolerance issue > -- > > Key: FLINK-9120 > URL: https://issues.apache.org/jira/browse/FLINK-9120 > Project: Flink > Issue Type: Bug > Components: Cluster Management, Configuration, Core >Affects Versions: 1.4.2 >Reporter: dhiraj prajapati >Priority: Critical > Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log, > flink-dhiraj.prajapati-client-ip-10-14-25-115.log, > flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, > flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, > flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log, > flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log > > > HI, > I have set up a flink 1.4 cluster with 1 job manager and two task managers. > The configs taskmanager.numberOfTaskSlots and parallelism.default were set > to 2 on each node. I submitted a job to this cluster and it runs fine. To > test fault tolerance, I killed one task manager. I was expecting the job to > run fine because one of the 2 task managers was still up and running. > However, the job failed. Am I missing something? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Closed] (FLINK-9120) Task Manager Fault Tolerance issue
[ https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dhiraj prajapati closed FLINK-9120. --- Resolution: Invalid > Task Manager Fault Tolerance issue > -- > > Key: FLINK-9120 > URL: https://issues.apache.org/jira/browse/FLINK-9120 > Project: Flink > Issue Type: Bug > Components: Cluster Management, Configuration, Core >Affects Versions: 1.4.2 >Reporter: dhiraj prajapati >Priority: Critical > Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log, > flink-dhiraj.prajapati-client-ip-10-14-25-115.log, > flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, > flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, > flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log, > flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log > > > HI, > I have set up a flink 1.4 cluster with 1 job manager and two task managers. > The configs taskmanager.numberOfTaskSlots and parallelism.default were set > to 2 on each node. I submitted a job to this cluster and it runs fine. To > test fault tolerance, I killed one task manager. I was expecting the job to > run fine because one of the 2 task managers was still up and running. > However, the job failed. Am I missing something? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-9120) Task Manager Fault Tolerance issue
[ https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425256#comment-16425256 ] dhiraj prajapati commented on FLINK-9120: - HI [~till.rohrmann]/[~sihuazhou], it works perfectly fine after increasing the no. of retries. Thanks a lot for your time and effort. I will close this issue. > Task Manager Fault Tolerance issue > -- > > Key: FLINK-9120 > URL: https://issues.apache.org/jira/browse/FLINK-9120 > Project: Flink > Issue Type: Bug > Components: Cluster Management, Configuration, Core >Affects Versions: 1.4.2 >Reporter: dhiraj prajapati >Priority: Critical > Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log, > flink-dhiraj.prajapati-client-ip-10-14-25-115.log, > flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, > flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, > flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log, > flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log > > > HI, > I have set up a flink 1.4 cluster with 1 job manager and two task managers. > The configs taskmanager.numberOfTaskSlots and parallelism.default were set > to 2 on each node. I submitted a job to this cluster and it runs fine. To > test fault tolerance, I killed one task manager. I was expecting the job to > run fine because one of the 2 task managers was still up and running. > However, the job failed. Am I missing something? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-9120) Task Manager Fault Tolerance issue
[ https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423787#comment-16423787 ] dhiraj prajapati commented on FLINK-9120: - Hi [~till.rohrmann]/[~sihuazhou], find the steps to reproduce the issue below What I did was: 1) Run cluster with JM on machine A, one TM on machine B and one TM on machine C 2) Submit a job to the cluster. Works fine till now. 3) Forcefully kill the TM on machine C. The web UI shows job failing and then restarting and finally the job is up on its own. TM on machine B handles everything. This is perfect. 4) Now I start the TM on machine C and wait for sufficient time. At this point both TMs are up. 5) Now kill the TM on machine B. At this point the job fails. Shouldn't the job be handled by the running TM on machine C? > Task Manager Fault Tolerance issue > -- > > Key: FLINK-9120 > URL: https://issues.apache.org/jira/browse/FLINK-9120 > Project: Flink > Issue Type: Bug > Components: Cluster Management, Configuration, Core >Affects Versions: 1.4.2 >Reporter: dhiraj prajapati >Priority: Critical > Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log, > flink-dhiraj.prajapati-client-ip-10-14-25-115.log, > flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, > flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, > flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log, > flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log > > > HI, > I have set up a flink 1.4 cluster with 1 job manager and two task managers. > The configs taskmanager.numberOfTaskSlots and parallelism.default were set > to 2 on each node. I submitted a job to this cluster and it runs fine. To > test fault tolerance, I killed one task manager. I was expecting the job to > run fine because one of the 2 task managers was still up and running. > However, the job failed. Am I missing something? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-9120) Task Manager Fault Tolerance issue
[ https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423753#comment-16423753 ] dhiraj prajapati commented on FLINK-9120: - Hi [~till.rohrmann], with MemoryStateBackend, the state should be accessible to all TMs as long as the JM is up and running, right? Even then with MemoryStateBackend, the TM fault tolerance behaviour is not consistent. Sometimes it works and some times it doesn't. Hi [~sihuazhou], can you please elaborate on " TM doesn't unregister from JM properly in standalone model" ? If one of the TMs gets terminated due to machne crash or any other reason, it will obviously not be able to unregister from JM. But the other TM should pick up the job and the job shouldn't fail right? > Task Manager Fault Tolerance issue > -- > > Key: FLINK-9120 > URL: https://issues.apache.org/jira/browse/FLINK-9120 > Project: Flink > Issue Type: Bug > Components: Cluster Management, Configuration, Core >Affects Versions: 1.4.2 >Reporter: dhiraj prajapati >Priority: Critical > Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log, > flink-dhiraj.prajapati-client-ip-10-14-25-115.log, > flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, > flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, > flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log, > flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log > > > HI, > I have set up a flink 1.4 cluster with 1 job manager and two task managers. > The configs taskmanager.numberOfTaskSlots and parallelism.default were set > to 2 on each node. I submitted a job to this cluster and it runs fine. To > test fault tolerance, I killed one task manager. I was expecting the job to > run fine because one of the 2 task managers was still up and running. > However, the job failed. Am I missing something? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (FLINK-9120) Task Manager Fault Tolerance issue
[ https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423732#comment-16423732 ] dhiraj prajapati edited comment on FLINK-9120 at 4/3/18 9:29 AM: - Hi [~sihuazhou], I tried restarting the cluster again and submitted the job again with MemoryStateBackend and it seems to be working fine with a single TM now. Strange that it does not work always. Thanks [~sihuazhou] for pointing out the issue with TM instance specific file path location of state backend. was (Author: dhirajpraj): Hi [~sihuazhou], I tried restarting the cluster again and submitted the job again with MemoryStateBackend and it seems to be working fine with a single TM now. Strange that it does not work always. > Task Manager Fault Tolerance issue > -- > > Key: FLINK-9120 > URL: https://issues.apache.org/jira/browse/FLINK-9120 > Project: Flink > Issue Type: Bug > Components: Cluster Management, Configuration, Core >Affects Versions: 1.4.2 >Reporter: dhiraj prajapati >Priority: Critical > Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log, > flink-dhiraj.prajapati-client-ip-10-14-25-115.log, > flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, > flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, > flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log, > flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log > > > HI, > I have set up a flink 1.4 cluster with 1 job manager and two task managers. > The configs taskmanager.numberOfTaskSlots and parallelism.default were set > to 2 on each node. I submitted a job to this cluster and it runs fine. To > test fault tolerance, I killed one task manager. I was expecting the job to > run fine because one of the 2 task managers was still up and running. > However, the job failed. Am I missing something? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-9120) Task Manager Fault Tolerance issue
[ https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423732#comment-16423732 ] dhiraj prajapati commented on FLINK-9120: - Hi [~sihuazhou], I tried restarting the cluster again and submitted the job again with MemoryStateBackend and it seems to be working fine with a single TM now. Strange that it does not work always. > Task Manager Fault Tolerance issue > -- > > Key: FLINK-9120 > URL: https://issues.apache.org/jira/browse/FLINK-9120 > Project: Flink > Issue Type: Bug > Components: Cluster Management, Configuration, Core >Affects Versions: 1.4.2 >Reporter: dhiraj prajapati >Priority: Critical > Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log, > flink-dhiraj.prajapati-client-ip-10-14-25-115.log, > flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, > flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, > flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log, > flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log > > > HI, > I have set up a flink 1.4 cluster with 1 job manager and two task managers. > The configs taskmanager.numberOfTaskSlots and parallelism.default were set > to 2 on each node. I submitted a job to this cluster and it runs fine. To > test fault tolerance, I killed one task manager. I was expecting the job to > run fine because one of the 2 task managers was still up and running. > However, the job failed. Am I missing something? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-9120) Task Manager Fault Tolerance issue
[ https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423638#comment-16423638 ] dhiraj prajapati commented on FLINK-9120: - Hi [~sihuazhou], As mentioned in the problem statement, I have purposely killed the TM to test fault tolerance. Without terminating TM, the job runs fine. > Task Manager Fault Tolerance issue > -- > > Key: FLINK-9120 > URL: https://issues.apache.org/jira/browse/FLINK-9120 > Project: Flink > Issue Type: Bug > Components: Cluster Management, Configuration, Core >Affects Versions: 1.4.2 >Reporter: dhiraj prajapati >Priority: Critical > Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log, > flink-dhiraj.prajapati-client-ip-10-14-25-115.log, > flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, > flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, > flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log, > flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log > > > HI, > I have set up a flink 1.4 cluster with 1 job manager and two task managers. > The configs taskmanager.numberOfTaskSlots and parallelism.default were set > to 2 on each node. I submitted a job to this cluster and it runs fine. To > test fault tolerance, I killed one task manager. I was expecting the job to > run fine because one of the 2 task managers was still up and running. > However, the job failed. Am I missing something? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-9120) Task Manager Fault Tolerance issue
[ https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423609#comment-16423609 ] dhiraj prajapati commented on FLINK-9120: - HI [~sihuazhou], the task managers were on different machines. However, even if I use MemoryStateBackend, the job fails when one of the task managers is terminated. I am attaching the new logs with MemoryStateBackend. [^flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log] [^flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log] [^flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log] ^^[^flink-dhiraj.prajapati-client-ip-10-14-25-115.log]^^ > Task Manager Fault Tolerance issue > -- > > Key: FLINK-9120 > URL: https://issues.apache.org/jira/browse/FLINK-9120 > Project: Flink > Issue Type: Bug > Components: Cluster Management, Configuration, Core >Affects Versions: 1.4.2 >Reporter: dhiraj prajapati >Priority: Critical > Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log, > flink-dhiraj.prajapati-client-ip-10-14-25-115.log, > flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, > flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, > flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log, > flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log > > > HI, > I have set up a flink 1.4 cluster with 1 job manager and two task managers. > The configs taskmanager.numberOfTaskSlots and parallelism.default were set > to 2 on each node. I submitted a job to this cluster and it runs fine. To > test fault tolerance, I killed one task manager. I was expecting the job to > run fine because one of the 2 task managers was still up and running. > However, the job failed. Am I missing something? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-9120) Task Manager Fault Tolerance issue
[ https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dhiraj prajapati updated FLINK-9120: Attachment: flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log > Task Manager Fault Tolerance issue > -- > > Key: FLINK-9120 > URL: https://issues.apache.org/jira/browse/FLINK-9120 > Project: Flink > Issue Type: Bug > Components: Cluster Management, Configuration, Core >Affects Versions: 1.4.2 >Reporter: dhiraj prajapati >Priority: Critical > Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log, > flink-dhiraj.prajapati-client-ip-10-14-25-115.log, > flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, > flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, > flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log, > flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log > > > HI, > I have set up a flink 1.4 cluster with 1 job manager and two task managers. > The configs taskmanager.numberOfTaskSlots and parallelism.default were set > to 2 on each node. I submitted a job to this cluster and it runs fine. To > test fault tolerance, I killed one task manager. I was expecting the job to > run fine because one of the 2 task managers was still up and running. > However, the job failed. Am I missing something? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-9120) Task Manager Fault Tolerance issue
[ https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dhiraj prajapati updated FLINK-9120: Attachment: flink-dhiraj.prajapati-client-ip-10-14-25-115.log > Task Manager Fault Tolerance issue > -- > > Key: FLINK-9120 > URL: https://issues.apache.org/jira/browse/FLINK-9120 > Project: Flink > Issue Type: Bug > Components: Cluster Management, Configuration, Core >Affects Versions: 1.4.2 >Reporter: dhiraj prajapati >Priority: Critical > Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log, > flink-dhiraj.prajapati-client-ip-10-14-25-115.log, > flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, > flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, > flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log, > flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log > > > HI, > I have set up a flink 1.4 cluster with 1 job manager and two task managers. > The configs taskmanager.numberOfTaskSlots and parallelism.default were set > to 2 on each node. I submitted a job to this cluster and it runs fine. To > test fault tolerance, I killed one task manager. I was expecting the job to > run fine because one of the 2 task managers was still up and running. > However, the job failed. Am I missing something? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-9120) Task Manager Fault Tolerance issue
[ https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dhiraj prajapati updated FLINK-9120: Attachment: flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log > Task Manager Fault Tolerance issue > -- > > Key: FLINK-9120 > URL: https://issues.apache.org/jira/browse/FLINK-9120 > Project: Flink > Issue Type: Bug > Components: Cluster Management, Configuration, Core >Affects Versions: 1.4.2 >Reporter: dhiraj prajapati >Priority: Critical > Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log, > flink-dhiraj.prajapati-client-ip-10-14-25-115.log, > flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, > flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, > flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log, > flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log > > > HI, > I have set up a flink 1.4 cluster with 1 job manager and two task managers. > The configs taskmanager.numberOfTaskSlots and parallelism.default were set > to 2 on each node. I submitted a job to this cluster and it runs fine. To > test fault tolerance, I killed one task manager. I was expecting the job to > run fine because one of the 2 task managers was still up and running. > However, the job failed. Am I missing something? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (FLINK-9120) Task Manager Fault Tolerance issue
[ https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423508#comment-16423508 ] dhiraj prajapati commented on FLINK-9120: - Please find the logs attached [^flink-dhiraj.prajapati-client-ip-10-14-25-115.log] [^flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log] [^flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log] > Task Manager Fault Tolerance issue > -- > > Key: FLINK-9120 > URL: https://issues.apache.org/jira/browse/FLINK-9120 > Project: Flink > Issue Type: Bug > Components: Cluster Management, Configuration, Core >Affects Versions: 1.4.2 >Reporter: dhiraj prajapati >Priority: Critical > Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log, > flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, > flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log > > > HI, > I have set up a flink 1.4 cluster with 1 job manager and two task managers. > The configs taskmanager.numberOfTaskSlots and parallelism.default were set > to 2 on each node. I submitted a job to this cluster and it runs fine. To > test fault tolerance, I killed one task manager. I was expecting the job to > run fine because one of the 2 task managers was still up and running. > However, the job failed. Am I missing something? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-9120) Task Manager Fault Tolerance issue
[ https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dhiraj prajapati updated FLINK-9120: Attachment: flink-dhiraj.prajapati-client-ip-10-14-25-115.log flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log > Task Manager Fault Tolerance issue > -- > > Key: FLINK-9120 > URL: https://issues.apache.org/jira/browse/FLINK-9120 > Project: Flink > Issue Type: Bug > Components: Cluster Management, Configuration, Core >Affects Versions: 1.4.2 >Reporter: dhiraj prajapati >Priority: Critical > Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log, > flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, > flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log > > > HI, > I have set up a flink 1.4 cluster with 1 job manager and two task managers. > The configs taskmanager.numberOfTaskSlots and parallelism.default were set > to 2 on each node. I submitted a job to this cluster and it runs fine. To > test fault tolerance, I killed one task manager. I was expecting the job to > run fine because one of the 2 task managers was still up and running. > However, the job failed. Am I missing something? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-9120) Task Manager Fault Tolerance issue
dhiraj prajapati created FLINK-9120: --- Summary: Task Manager Fault Tolerance issue Key: FLINK-9120 URL: https://issues.apache.org/jira/browse/FLINK-9120 Project: Flink Issue Type: Bug Components: Cluster Management, Configuration, Core Affects Versions: 1.4.2 Reporter: dhiraj prajapati HI, I have set up a flink 1.4 cluster with 1 job manager and two task managers. The configs taskmanager.numberOfTaskSlots and parallelism.default were set to 2 on each node. I submitted a job to this cluster and it runs fine. To test fault tolerance, I killed one task manager. I was expecting the job to run fine because one of the 2 task managers was still up and running. However, the job failed. Am I missing something? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (FLINK-9054) IllegalStateException: Buffer pool is destroyed
[ https://issues.apache.org/jira/browse/FLINK-9054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] dhiraj prajapati updated FLINK-9054: Fix Version/s: (was: 1.5.0) > IllegalStateException: Buffer pool is destroyed > --- > > Key: FLINK-9054 > URL: https://issues.apache.org/jira/browse/FLINK-9054 > Project: Flink > Issue Type: Bug > Components: Cluster Management, Configuration, Core >Affects Versions: 1.4.2 >Reporter: dhiraj prajapati >Priority: Blocker > Attachments: flink-conf.yaml > > > Hi, > I have a flink cluster running on 2 machines, say A and B. > Job manager is running on A. There are 2 TaksManagers, one on each node. > So effectively, A has a job manager and a task manager, while B has a task > manager. > When I submit a job to the cluster, I see below exception and the job fails: > 2018-03-22 17:16:52,205 WARN > org.apache.flink.streaming.api.operators.AbstractStreamOperator - Error while > emitting latency marker. > org.apache.flink.streaming.runtime.tasks.ExceptionInChainedOperatorException: > Could not forward element to next operator > at > org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitLatencyMarker(OperatorChain.java:489) > at > org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.emitLatencyMarker(AbstractStreamOperator.java:824) > at > org.apache.flink.streaming.api.operators.StreamSource$LatencyMarksEmitter$1.onProcessingTime(StreamSource.java:150) > at > org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService$RepeatedTriggerTask.run(SystemProcessingTimeService.java:294) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > Caused by: > org.apache.flink.streaming.runtime.tasks.ExceptionInChainedOperatorException: > Could not forward element to next operator > at > org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitLatencyMarker(OperatorChain.java:489) > at > org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.emitLatencyMarker(AbstractStreamOperator.java:824) > at > org.apache.flink.streaming.api.operators.AbstractStreamOperator.reportOrForwardLatencyMarker(AbstractStreamOperator.java:679) > at > org.apache.flink.streaming.api.operators.AbstractStreamOperator.processLatencyMarker(AbstractStreamOperator.java:662) > at > org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitLatencyMarker(OperatorChain.java:486) > ... 10 more > Caused by: java.lang.RuntimeException: Buffer pool is destroyed. > at > org.apache.flink.streaming.runtime.io.RecordWriterOutput.emitLatencyMarker(RecordWriterOutput.java:141) > at > org.apache.flink.streaming.runtime.tasks.OperatorChain$BroadcastingOutputCollector.emitLatencyMarker(OperatorChain.java:604) > at > org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.emitLatencyMarker(AbstractStreamOperator.java:824) > at > org.apache.flink.streaming.api.operators.AbstractStreamOperator.reportOrForwardLatencyMarker(AbstractStreamOperator.java:679) > at > org.apache.flink.streaming.api.operators.AbstractStreamOperator.processLatencyMarker(AbstractStreamOperator.java:662) > at > org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitLatencyMarker(OperatorChain.java:486) > ... 14 more > Caused by: java.lang.IllegalStateException: Buffer pool is destroyed. > at > org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBuffer(LocalBufferPool.java:203) > at > org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBlocking(LocalBufferPool.java:191) > at > org.apache.flink.runtime.io.network.api.writer.RecordWriter.sendToTarget(RecordWriter.java:132) > at > org.apache.flink.runtime.io.network.api.writer.RecordWriter.randomEmit(RecordWriter.java:107) > at > org.apache.flink.streaming.runtime.io.StreamRecordWriter.randomEmit(StreamRecordWriter.java:102) > at > org.apache.flink.streaming.runtime.io.RecordWriterOutput.emitLatencyMarker(RecordWriterOutput.java:138) > ... 19 more > > The exception does not come when I run only one JobManager (only on machine > B). > > I am attaching flink-conf.yaml -- This message was sent by Atlassian JIRA
[jira] [Created] (FLINK-9054) IllegalStateException: Buffer pool is destroyed
dhiraj prajapati created FLINK-9054: --- Summary: IllegalStateException: Buffer pool is destroyed Key: FLINK-9054 URL: https://issues.apache.org/jira/browse/FLINK-9054 Project: Flink Issue Type: Bug Components: Cluster Management, Configuration, Core Affects Versions: 1.4.2 Reporter: dhiraj prajapati Attachments: flink-conf.yaml Hi, I have a flink cluster running on 2 machines, say A and B. Job manager is running on A. There are 2 TaksManagers, one on each node. So effectively, A has a job manager and a task manager, while B has a task manager. When I submit a job to the cluster, I see below exception and the job fails: 2018-03-22 17:16:52,205 WARN org.apache.flink.streaming.api.operators.AbstractStreamOperator - Error while emitting latency marker. org.apache.flink.streaming.runtime.tasks.ExceptionInChainedOperatorException: Could not forward element to next operator at org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitLatencyMarker(OperatorChain.java:489) at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.emitLatencyMarker(AbstractStreamOperator.java:824) at org.apache.flink.streaming.api.operators.StreamSource$LatencyMarksEmitter$1.onProcessingTime(StreamSource.java:150) at org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService$RepeatedTriggerTask.run(SystemProcessingTimeService.java:294) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: org.apache.flink.streaming.runtime.tasks.ExceptionInChainedOperatorException: Could not forward element to next operator at org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitLatencyMarker(OperatorChain.java:489) at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.emitLatencyMarker(AbstractStreamOperator.java:824) at org.apache.flink.streaming.api.operators.AbstractStreamOperator.reportOrForwardLatencyMarker(AbstractStreamOperator.java:679) at org.apache.flink.streaming.api.operators.AbstractStreamOperator.processLatencyMarker(AbstractStreamOperator.java:662) at org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitLatencyMarker(OperatorChain.java:486) ... 10 more Caused by: java.lang.RuntimeException: Buffer pool is destroyed. at org.apache.flink.streaming.runtime.io.RecordWriterOutput.emitLatencyMarker(RecordWriterOutput.java:141) at org.apache.flink.streaming.runtime.tasks.OperatorChain$BroadcastingOutputCollector.emitLatencyMarker(OperatorChain.java:604) at org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.emitLatencyMarker(AbstractStreamOperator.java:824) at org.apache.flink.streaming.api.operators.AbstractStreamOperator.reportOrForwardLatencyMarker(AbstractStreamOperator.java:679) at org.apache.flink.streaming.api.operators.AbstractStreamOperator.processLatencyMarker(AbstractStreamOperator.java:662) at org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitLatencyMarker(OperatorChain.java:486) ... 14 more Caused by: java.lang.IllegalStateException: Buffer pool is destroyed. at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBuffer(LocalBufferPool.java:203) at org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBlocking(LocalBufferPool.java:191) at org.apache.flink.runtime.io.network.api.writer.RecordWriter.sendToTarget(RecordWriter.java:132) at org.apache.flink.runtime.io.network.api.writer.RecordWriter.randomEmit(RecordWriter.java:107) at org.apache.flink.streaming.runtime.io.StreamRecordWriter.randomEmit(StreamRecordWriter.java:102) at org.apache.flink.streaming.runtime.io.RecordWriterOutput.emitLatencyMarker(RecordWriterOutput.java:138) ... 19 more The exception does not come when I run only one JobManager (only on machine B). I am attaching flink-conf.yaml -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (FLINK-8947) Timeout handler issue
dhiraj prajapati created FLINK-8947: --- Summary: Timeout handler issue Key: FLINK-8947 URL: https://issues.apache.org/jira/browse/FLINK-8947 Project: Flink Issue Type: Bug Components: CEP Affects Versions: 1.4.2 Reporter: dhiraj prajapati The issue is same as FLINK-5753 I am using Event time and have used watermark interval. Still, I have observed the timeout executes only after the next event if: first event appears, second event not appear in the stream and *no new events appear in a stream*, timeout handler is not executed. Expected result: timeout handler should be executed in case if there are no new events in a stream My code snippet: DataStream dataStream = env.socketTextStream("localhost", 1212); dataStream.getExecutionConfig().setAutoWatermarkInterval(100L); dataStream = dataStream.assignTimestampsAndWatermarks(new BoundedOutOfOrdernessTimestampExtractor( Time.seconds(0)) { private static final long serialVersionUID = 4969170359023055566L; @Override public long extractTimestamp(JSONObject event) { return System.currentTimeMillis(); } }); -- This message was sent by Atlassian JIRA (v7.6.3#76005)