[jira] [Reopened] (FLINK-9120) Task Manager Fault Tolerance issue

2018-04-04 Thread dhiraj prajapati (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dhiraj prajapati reopened FLINK-9120:
-

As suggested by [~sihuazhou], reopening the issue till [~till.rohrmann] 
comments.

Please note, that it did not work with RestartStrategies.fixedDelayRestart(3, 
5000) but worked with RestartStrategies.fixedDelayRestart(20, 5000)

> Task Manager Fault Tolerance issue
> --
>
> Key: FLINK-9120
> URL: https://issues.apache.org/jira/browse/FLINK-9120
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, Configuration, Core
>Affects Versions: 1.4.2
>Reporter: dhiraj prajapati
>Priority: Critical
> Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log
>
>
> HI, 
> I have set up a flink 1.4 cluster with 1 job manager and two task managers. 
> The configs taskmanager.numberOfTaskSlots and parallelism.default were set 
> to 2 on each node. I submitted a job to this cluster and it runs fine. To 
> test fault tolerance, I killed one task manager. I was expecting the job to 
> run fine because one of the 2 task managers was still up and running. 
> However, the job failed. Am I missing something? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (FLINK-9120) Task Manager Fault Tolerance issue

2018-04-04 Thread dhiraj prajapati (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dhiraj prajapati closed FLINK-9120.
---
Resolution: Invalid

> Task Manager Fault Tolerance issue
> --
>
> Key: FLINK-9120
> URL: https://issues.apache.org/jira/browse/FLINK-9120
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, Configuration, Core
>Affects Versions: 1.4.2
>Reporter: dhiraj prajapati
>Priority: Critical
> Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log
>
>
> HI, 
> I have set up a flink 1.4 cluster with 1 job manager and two task managers. 
> The configs taskmanager.numberOfTaskSlots and parallelism.default were set 
> to 2 on each node. I submitted a job to this cluster and it runs fine. To 
> test fault tolerance, I killed one task manager. I was expecting the job to 
> run fine because one of the 2 task managers was still up and running. 
> However, the job failed. Am I missing something? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9120) Task Manager Fault Tolerance issue

2018-04-04 Thread dhiraj prajapati (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16425256#comment-16425256
 ] 

dhiraj prajapati commented on FLINK-9120:
-

HI [~till.rohrmann]/[~sihuazhou], it works perfectly fine after increasing the 
no. of retries. Thanks a lot for your time and effort. I will close this issue.

> Task Manager Fault Tolerance issue
> --
>
> Key: FLINK-9120
> URL: https://issues.apache.org/jira/browse/FLINK-9120
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, Configuration, Core
>Affects Versions: 1.4.2
>Reporter: dhiraj prajapati
>Priority: Critical
> Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log
>
>
> HI, 
> I have set up a flink 1.4 cluster with 1 job manager and two task managers. 
> The configs taskmanager.numberOfTaskSlots and parallelism.default were set 
> to 2 on each node. I submitted a job to this cluster and it runs fine. To 
> test fault tolerance, I killed one task manager. I was expecting the job to 
> run fine because one of the 2 task managers was still up and running. 
> However, the job failed. Am I missing something? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9120) Task Manager Fault Tolerance issue

2018-04-03 Thread dhiraj prajapati (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423787#comment-16423787
 ] 

dhiraj prajapati commented on FLINK-9120:
-

Hi [~till.rohrmann]/[~sihuazhou], find the steps to reproduce the issue below

What I did was: 
1) Run cluster with JM on machine A, one TM on machine B and one TM on 
machine C 
2) Submit a job to the cluster. Works fine till now. 
3) Forcefully kill the TM on machine C. The web UI shows job failing and 
then restarting and finally the job is up on its own. TM on machine B handles 
everything. This is perfect. 
4) Now I start the TM on machine C and wait for sufficient time. At this point 
both TMs are up.
5) Now kill the TM on machine B. At this point the job fails. Shouldn't the 
job be handled by the running TM on machine C? 

> Task Manager Fault Tolerance issue
> --
>
> Key: FLINK-9120
> URL: https://issues.apache.org/jira/browse/FLINK-9120
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, Configuration, Core
>Affects Versions: 1.4.2
>Reporter: dhiraj prajapati
>Priority: Critical
> Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log
>
>
> HI, 
> I have set up a flink 1.4 cluster with 1 job manager and two task managers. 
> The configs taskmanager.numberOfTaskSlots and parallelism.default were set 
> to 2 on each node. I submitted a job to this cluster and it runs fine. To 
> test fault tolerance, I killed one task manager. I was expecting the job to 
> run fine because one of the 2 task managers was still up and running. 
> However, the job failed. Am I missing something? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9120) Task Manager Fault Tolerance issue

2018-04-03 Thread dhiraj prajapati (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423753#comment-16423753
 ] 

dhiraj prajapati commented on FLINK-9120:
-

Hi [~till.rohrmann], with MemoryStateBackend, the state should be accessible to 
all TMs as long as the JM is up and running, right? Even then with 
MemoryStateBackend, the TM fault tolerance behaviour is not consistent. 
Sometimes it works and some times it doesn't.

 

Hi [~sihuazhou], can you please elaborate on " TM doesn't unregister from JM 
properly in standalone model" ? If one of the TMs gets terminated due to machne 
crash or any other reason, it will obviously not be able to unregister from JM. 
But the other TM should pick up the job and the job shouldn't fail right?

> Task Manager Fault Tolerance issue
> --
>
> Key: FLINK-9120
> URL: https://issues.apache.org/jira/browse/FLINK-9120
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, Configuration, Core
>Affects Versions: 1.4.2
>Reporter: dhiraj prajapati
>Priority: Critical
> Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log
>
>
> HI, 
> I have set up a flink 1.4 cluster with 1 job manager and two task managers. 
> The configs taskmanager.numberOfTaskSlots and parallelism.default were set 
> to 2 on each node. I submitted a job to this cluster and it runs fine. To 
> test fault tolerance, I killed one task manager. I was expecting the job to 
> run fine because one of the 2 task managers was still up and running. 
> However, the job failed. Am I missing something? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (FLINK-9120) Task Manager Fault Tolerance issue

2018-04-03 Thread dhiraj prajapati (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423732#comment-16423732
 ] 

dhiraj prajapati edited comment on FLINK-9120 at 4/3/18 9:29 AM:
-

Hi [~sihuazhou], I tried restarting the cluster again and submitted the job 
again with MemoryStateBackend and it seems to be working fine with a single TM 
now. Strange that it does not work always.

Thanks [~sihuazhou] for pointing out the issue with TM instance specific file 
path location of state backend.


was (Author: dhirajpraj):
Hi [~sihuazhou], I tried restarting the cluster again and submitted the job 
again with MemoryStateBackend and it seems to be working fine with a single TM 
now. Strange that it does not work always.

> Task Manager Fault Tolerance issue
> --
>
> Key: FLINK-9120
> URL: https://issues.apache.org/jira/browse/FLINK-9120
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, Configuration, Core
>Affects Versions: 1.4.2
>Reporter: dhiraj prajapati
>Priority: Critical
> Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log
>
>
> HI, 
> I have set up a flink 1.4 cluster with 1 job manager and two task managers. 
> The configs taskmanager.numberOfTaskSlots and parallelism.default were set 
> to 2 on each node. I submitted a job to this cluster and it runs fine. To 
> test fault tolerance, I killed one task manager. I was expecting the job to 
> run fine because one of the 2 task managers was still up and running. 
> However, the job failed. Am I missing something? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9120) Task Manager Fault Tolerance issue

2018-04-03 Thread dhiraj prajapati (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423732#comment-16423732
 ] 

dhiraj prajapati commented on FLINK-9120:
-

Hi [~sihuazhou], I tried restarting the cluster again and submitted the job 
again with MemoryStateBackend and it seems to be working fine with a single TM 
now. Strange that it does not work always.

> Task Manager Fault Tolerance issue
> --
>
> Key: FLINK-9120
> URL: https://issues.apache.org/jira/browse/FLINK-9120
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, Configuration, Core
>Affects Versions: 1.4.2
>Reporter: dhiraj prajapati
>Priority: Critical
> Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log
>
>
> HI, 
> I have set up a flink 1.4 cluster with 1 job manager and two task managers. 
> The configs taskmanager.numberOfTaskSlots and parallelism.default were set 
> to 2 on each node. I submitted a job to this cluster and it runs fine. To 
> test fault tolerance, I killed one task manager. I was expecting the job to 
> run fine because one of the 2 task managers was still up and running. 
> However, the job failed. Am I missing something? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9120) Task Manager Fault Tolerance issue

2018-04-03 Thread dhiraj prajapati (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423638#comment-16423638
 ] 

dhiraj prajapati commented on FLINK-9120:
-

Hi [~sihuazhou], As mentioned in the problem statement, I have purposely killed 
the TM to test fault tolerance. Without terminating TM, the job runs fine.

> Task Manager Fault Tolerance issue
> --
>
> Key: FLINK-9120
> URL: https://issues.apache.org/jira/browse/FLINK-9120
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, Configuration, Core
>Affects Versions: 1.4.2
>Reporter: dhiraj prajapati
>Priority: Critical
> Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log
>
>
> HI, 
> I have set up a flink 1.4 cluster with 1 job manager and two task managers. 
> The configs taskmanager.numberOfTaskSlots and parallelism.default were set 
> to 2 on each node. I submitted a job to this cluster and it runs fine. To 
> test fault tolerance, I killed one task manager. I was expecting the job to 
> run fine because one of the 2 task managers was still up and running. 
> However, the job failed. Am I missing something? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9120) Task Manager Fault Tolerance issue

2018-04-03 Thread dhiraj prajapati (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423609#comment-16423609
 ] 

dhiraj prajapati commented on FLINK-9120:
-

HI [~sihuazhou], the task managers were on different machines. However, even if 
I use MemoryStateBackend, the job fails when one of the task managers is 
terminated. I am attaching the new logs with MemoryStateBackend.

[^flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log]

[^flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log] 
[^flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log]

^^[^flink-dhiraj.prajapati-client-ip-10-14-25-115.log]^^

> Task Manager Fault Tolerance issue
> --
>
> Key: FLINK-9120
> URL: https://issues.apache.org/jira/browse/FLINK-9120
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, Configuration, Core
>Affects Versions: 1.4.2
>Reporter: dhiraj prajapati
>Priority: Critical
> Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log
>
>
> HI, 
> I have set up a flink 1.4 cluster with 1 job manager and two task managers. 
> The configs taskmanager.numberOfTaskSlots and parallelism.default were set 
> to 2 on each node. I submitted a job to this cluster and it runs fine. To 
> test fault tolerance, I killed one task manager. I was expecting the job to 
> run fine because one of the 2 task managers was still up and running. 
> However, the job failed. Am I missing something? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9120) Task Manager Fault Tolerance issue

2018-04-03 Thread dhiraj prajapati (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dhiraj prajapati updated FLINK-9120:

Attachment: flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log

> Task Manager Fault Tolerance issue
> --
>
> Key: FLINK-9120
> URL: https://issues.apache.org/jira/browse/FLINK-9120
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, Configuration, Core
>Affects Versions: 1.4.2
>Reporter: dhiraj prajapati
>Priority: Critical
> Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log
>
>
> HI, 
> I have set up a flink 1.4 cluster with 1 job manager and two task managers. 
> The configs taskmanager.numberOfTaskSlots and parallelism.default were set 
> to 2 on each node. I submitted a job to this cluster and it runs fine. To 
> test fault tolerance, I killed one task manager. I was expecting the job to 
> run fine because one of the 2 task managers was still up and running. 
> However, the job failed. Am I missing something? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9120) Task Manager Fault Tolerance issue

2018-04-03 Thread dhiraj prajapati (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dhiraj prajapati updated FLINK-9120:

Attachment: flink-dhiraj.prajapati-client-ip-10-14-25-115.log

> Task Manager Fault Tolerance issue
> --
>
> Key: FLINK-9120
> URL: https://issues.apache.org/jira/browse/FLINK-9120
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, Configuration, Core
>Affects Versions: 1.4.2
>Reporter: dhiraj prajapati
>Priority: Critical
> Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log
>
>
> HI, 
> I have set up a flink 1.4 cluster with 1 job manager and two task managers. 
> The configs taskmanager.numberOfTaskSlots and parallelism.default were set 
> to 2 on each node. I submitted a job to this cluster and it runs fine. To 
> test fault tolerance, I killed one task manager. I was expecting the job to 
> run fine because one of the 2 task managers was still up and running. 
> However, the job failed. Am I missing something? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9120) Task Manager Fault Tolerance issue

2018-04-03 Thread dhiraj prajapati (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dhiraj prajapati updated FLINK-9120:

Attachment: flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log

> Task Manager Fault Tolerance issue
> --
>
> Key: FLINK-9120
> URL: https://issues.apache.org/jira/browse/FLINK-9120
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, Configuration, Core
>Affects Versions: 1.4.2
>Reporter: dhiraj prajapati
>Priority: Critical
> Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log
>
>
> HI, 
> I have set up a flink 1.4 cluster with 1 job manager and two task managers. 
> The configs taskmanager.numberOfTaskSlots and parallelism.default were set 
> to 2 on each node. I submitted a job to this cluster and it runs fine. To 
> test fault tolerance, I killed one task manager. I was expecting the job to 
> run fine because one of the 2 task managers was still up and running. 
> However, the job failed. Am I missing something? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-9120) Task Manager Fault Tolerance issue

2018-04-02 Thread dhiraj prajapati (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16423508#comment-16423508
 ] 

dhiraj prajapati commented on FLINK-9120:
-

Please find the logs attached

[^flink-dhiraj.prajapati-client-ip-10-14-25-115.log]

[^flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log]

[^flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log]

> Task Manager Fault Tolerance issue
> --
>
> Key: FLINK-9120
> URL: https://issues.apache.org/jira/browse/FLINK-9120
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, Configuration, Core
>Affects Versions: 1.4.2
>Reporter: dhiraj prajapati
>Priority: Critical
> Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log
>
>
> HI, 
> I have set up a flink 1.4 cluster with 1 job manager and two task managers. 
> The configs taskmanager.numberOfTaskSlots and parallelism.default were set 
> to 2 on each node. I submitted a job to this cluster and it runs fine. To 
> test fault tolerance, I killed one task manager. I was expecting the job to 
> run fine because one of the 2 task managers was still up and running. 
> However, the job failed. Am I missing something? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9120) Task Manager Fault Tolerance issue

2018-04-02 Thread dhiraj prajapati (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dhiraj prajapati updated FLINK-9120:

Attachment: flink-dhiraj.prajapati-client-ip-10-14-25-115.log
flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log
flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log

> Task Manager Fault Tolerance issue
> --
>
> Key: FLINK-9120
> URL: https://issues.apache.org/jira/browse/FLINK-9120
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, Configuration, Core
>Affects Versions: 1.4.2
>Reporter: dhiraj prajapati
>Priority: Critical
> Attachments: flink-dhiraj.prajapati-client-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-jobmanager-5-ip-10-14-25-115.log, 
> flink-dhiraj.prajapati-taskmanager-5-ip-10-14-25-116.log
>
>
> HI, 
> I have set up a flink 1.4 cluster with 1 job manager and two task managers. 
> The configs taskmanager.numberOfTaskSlots and parallelism.default were set 
> to 2 on each node. I submitted a job to this cluster and it runs fine. To 
> test fault tolerance, I killed one task manager. I was expecting the job to 
> run fine because one of the 2 task managers was still up and running. 
> However, the job failed. Am I missing something? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-9120) Task Manager Fault Tolerance issue

2018-04-02 Thread dhiraj prajapati (JIRA)
dhiraj prajapati created FLINK-9120:
---

 Summary: Task Manager Fault Tolerance issue
 Key: FLINK-9120
 URL: https://issues.apache.org/jira/browse/FLINK-9120
 Project: Flink
  Issue Type: Bug
  Components: Cluster Management, Configuration, Core
Affects Versions: 1.4.2
Reporter: dhiraj prajapati


HI, 
I have set up a flink 1.4 cluster with 1 job manager and two task managers. 
The configs taskmanager.numberOfTaskSlots and parallelism.default were set 
to 2 on each node. I submitted a job to this cluster and it runs fine. To 
test fault tolerance, I killed one task manager. I was expecting the job to 
run fine because one of the 2 task managers was still up and running. 
However, the job failed. Am I missing something? 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (FLINK-9054) IllegalStateException: Buffer pool is destroyed

2018-03-28 Thread dhiraj prajapati (JIRA)

 [ 
https://issues.apache.org/jira/browse/FLINK-9054?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

dhiraj prajapati updated FLINK-9054:

Fix Version/s: (was: 1.5.0)

> IllegalStateException: Buffer pool is destroyed
> ---
>
> Key: FLINK-9054
> URL: https://issues.apache.org/jira/browse/FLINK-9054
> Project: Flink
>  Issue Type: Bug
>  Components: Cluster Management, Configuration, Core
>Affects Versions: 1.4.2
>Reporter: dhiraj prajapati
>Priority: Blocker
> Attachments: flink-conf.yaml
>
>
> Hi,
> I have a flink cluster running on 2 machines, say A and B.
> Job manager is running on A. There are 2 TaksManagers, one on each node.
> So effectively, A has a job manager and a task manager, while B has a task 
> manager.
> When I submit a job to the cluster, I see below exception and the job fails:
> 2018-03-22 17:16:52,205 WARN 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator - Error while 
> emitting latency marker.
> org.apache.flink.streaming.runtime.tasks.ExceptionInChainedOperatorException: 
> Could not forward element to next operator
>  at 
> org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitLatencyMarker(OperatorChain.java:489)
>  at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.emitLatencyMarker(AbstractStreamOperator.java:824)
>  at 
> org.apache.flink.streaming.api.operators.StreamSource$LatencyMarksEmitter$1.onProcessingTime(StreamSource.java:150)
>  at 
> org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService$RepeatedTriggerTask.run(SystemProcessingTimeService.java:294)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>  at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>  at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>  at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>  at java.lang.Thread.run(Thread.java:745)
> Caused by: 
> org.apache.flink.streaming.runtime.tasks.ExceptionInChainedOperatorException: 
> Could not forward element to next operator
>  at 
> org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitLatencyMarker(OperatorChain.java:489)
>  at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.emitLatencyMarker(AbstractStreamOperator.java:824)
>  at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.reportOrForwardLatencyMarker(AbstractStreamOperator.java:679)
>  at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.processLatencyMarker(AbstractStreamOperator.java:662)
>  at 
> org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitLatencyMarker(OperatorChain.java:486)
>  ... 10 more
> Caused by: java.lang.RuntimeException: Buffer pool is destroyed.
>  at 
> org.apache.flink.streaming.runtime.io.RecordWriterOutput.emitLatencyMarker(RecordWriterOutput.java:141)
>  at 
> org.apache.flink.streaming.runtime.tasks.OperatorChain$BroadcastingOutputCollector.emitLatencyMarker(OperatorChain.java:604)
>  at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.emitLatencyMarker(AbstractStreamOperator.java:824)
>  at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.reportOrForwardLatencyMarker(AbstractStreamOperator.java:679)
>  at 
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.processLatencyMarker(AbstractStreamOperator.java:662)
>  at 
> org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitLatencyMarker(OperatorChain.java:486)
>  ... 14 more
> Caused by: java.lang.IllegalStateException: Buffer pool is destroyed.
>  at 
> org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBuffer(LocalBufferPool.java:203)
>  at 
> org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBlocking(LocalBufferPool.java:191)
>  at 
> org.apache.flink.runtime.io.network.api.writer.RecordWriter.sendToTarget(RecordWriter.java:132)
>  at 
> org.apache.flink.runtime.io.network.api.writer.RecordWriter.randomEmit(RecordWriter.java:107)
>  at 
> org.apache.flink.streaming.runtime.io.StreamRecordWriter.randomEmit(StreamRecordWriter.java:102)
>  at 
> org.apache.flink.streaming.runtime.io.RecordWriterOutput.emitLatencyMarker(RecordWriterOutput.java:138)
>  ... 19 more
>  
> The exception does not come when I run only one JobManager (only on machine 
> B).
>  
> I am attaching flink-conf.yaml



--
This message was sent by Atlassian JIRA

[jira] [Created] (FLINK-9054) IllegalStateException: Buffer pool is destroyed

2018-03-22 Thread dhiraj prajapati (JIRA)
dhiraj prajapati created FLINK-9054:
---

 Summary: IllegalStateException: Buffer pool is destroyed
 Key: FLINK-9054
 URL: https://issues.apache.org/jira/browse/FLINK-9054
 Project: Flink
  Issue Type: Bug
  Components: Cluster Management, Configuration, Core
Affects Versions: 1.4.2
Reporter: dhiraj prajapati
 Attachments: flink-conf.yaml

Hi,

I have a flink cluster running on 2 machines, say A and B.

Job manager is running on A. There are 2 TaksManagers, one on each node.

So effectively, A has a job manager and a task manager, while B has a task 
manager.

When I submit a job to the cluster, I see below exception and the job fails:

2018-03-22 17:16:52,205 WARN 
org.apache.flink.streaming.api.operators.AbstractStreamOperator - Error while 
emitting latency marker.
org.apache.flink.streaming.runtime.tasks.ExceptionInChainedOperatorException: 
Could not forward element to next operator
 at 
org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitLatencyMarker(OperatorChain.java:489)
 at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.emitLatencyMarker(AbstractStreamOperator.java:824)
 at 
org.apache.flink.streaming.api.operators.StreamSource$LatencyMarksEmitter$1.onProcessingTime(StreamSource.java:150)
 at 
org.apache.flink.streaming.runtime.tasks.SystemProcessingTimeService$RepeatedTriggerTask.run(SystemProcessingTimeService.java:294)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
 at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
 at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
 at 
java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
 at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
 at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
 at java.lang.Thread.run(Thread.java:745)
Caused by: 
org.apache.flink.streaming.runtime.tasks.ExceptionInChainedOperatorException: 
Could not forward element to next operator
 at 
org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitLatencyMarker(OperatorChain.java:489)
 at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.emitLatencyMarker(AbstractStreamOperator.java:824)
 at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator.reportOrForwardLatencyMarker(AbstractStreamOperator.java:679)
 at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator.processLatencyMarker(AbstractStreamOperator.java:662)
 at 
org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitLatencyMarker(OperatorChain.java:486)
 ... 10 more
Caused by: java.lang.RuntimeException: Buffer pool is destroyed.
 at 
org.apache.flink.streaming.runtime.io.RecordWriterOutput.emitLatencyMarker(RecordWriterOutput.java:141)
 at 
org.apache.flink.streaming.runtime.tasks.OperatorChain$BroadcastingOutputCollector.emitLatencyMarker(OperatorChain.java:604)
 at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator$CountingOutput.emitLatencyMarker(AbstractStreamOperator.java:824)
 at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator.reportOrForwardLatencyMarker(AbstractStreamOperator.java:679)
 at 
org.apache.flink.streaming.api.operators.AbstractStreamOperator.processLatencyMarker(AbstractStreamOperator.java:662)
 at 
org.apache.flink.streaming.runtime.tasks.OperatorChain$ChainingOutput.emitLatencyMarker(OperatorChain.java:486)
 ... 14 more
Caused by: java.lang.IllegalStateException: Buffer pool is destroyed.
 at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBuffer(LocalBufferPool.java:203)
 at 
org.apache.flink.runtime.io.network.buffer.LocalBufferPool.requestBufferBlocking(LocalBufferPool.java:191)
 at 
org.apache.flink.runtime.io.network.api.writer.RecordWriter.sendToTarget(RecordWriter.java:132)
 at 
org.apache.flink.runtime.io.network.api.writer.RecordWriter.randomEmit(RecordWriter.java:107)
 at 
org.apache.flink.streaming.runtime.io.StreamRecordWriter.randomEmit(StreamRecordWriter.java:102)
 at 
org.apache.flink.streaming.runtime.io.RecordWriterOutput.emitLatencyMarker(RecordWriterOutput.java:138)
 ... 19 more

 

The exception does not come when I run only one JobManager (only on machine B).

 

I am attaching flink-conf.yaml



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (FLINK-8947) Timeout handler issue

2018-03-15 Thread dhiraj prajapati (JIRA)
dhiraj prajapati created FLINK-8947:
---

 Summary: Timeout handler issue
 Key: FLINK-8947
 URL: https://issues.apache.org/jira/browse/FLINK-8947
 Project: Flink
  Issue Type: Bug
  Components: CEP
Affects Versions: 1.4.2
Reporter: dhiraj prajapati


The issue is same as FLINK-5753

I am using Event time and have used watermark interval. Still, I have observed 
the timeout executes only after the next event

if: first event appears, second event not appear in the stream 
and *no new events appear in a stream*, timeout handler is not executed.

Expected result: timeout handler should be executed in case if there are no new 
events in a stream

 

My code snippet:

DataStream dataStream = env.socketTextStream("localhost", 1212);

dataStream.getExecutionConfig().setAutoWatermarkInterval(100L);

dataStream = dataStream.assignTimestampsAndWatermarks(new 
BoundedOutOfOrdernessTimestampExtractor(
 Time.seconds(0)) {

private static final long serialVersionUID = 4969170359023055566L;

@Override
 public long extractTimestamp(JSONObject event) {
 return System.currentTimeMillis();
 }
 });



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)