[jira] [Commented] (FLINK-8035) Unable to submit job when HA is enabled

2019-03-29 Thread Till Rohrmann (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-8035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16804866#comment-16804866
 ] 

Till Rohrmann commented on FLINK-8035:
--

Ping [~longtimer].

> Unable to submit job when HA is enabled
> ---
>
> Key: FLINK-8035
> URL: https://issues.apache.org/jira/browse/FLINK-8035
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.4.0
> Environment: Mac OS X
> Reporter: Robert Metzger
> Priority: Critical
>
> Steps to reproduce:
> - Get Flink 1.4 (f5a0b4bdfb)
> - Get ZK (3.3.6 in this case)
> - Put the following flink-conf.yaml:
> {code}
> high-availability: zookeeper
> high-availability.storageDir: file:///tmp/flink-ha
> high-availability.zookeeper.quorum: localhost:2181
> high-availability.zookeeper.path.cluster-id: /my-namespace
> {code}
> - Start Flink, submit a job (any streaming example will do)
> The job submission will time out. On the JobManager, it seems that the job 
> submission gets stuck while sending a request to ZooKeeper.
> In the JM UI, the job will sit there in status "CREATED".
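As a minimal sketch of reading the configuration quoted above: the helpers below (`parse_flink_conf` and `zk_quorum` are hypothetical names, not part of Flink) parse flat `key: value` lines and extract the ZooKeeper quorum that the JobManager would connect to. The parsing is a simplification and is not Flink's own configuration loader.

```python
def parse_flink_conf(text):
    """Parse flat 'key: value' lines (simplified; not Flink's own loader)."""
    conf = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        # Split on the first colon-space so values such as file:///... survive.
        key, sep, value = line.partition(": ")
        if sep:
            conf[key.strip()] = value.strip()
    return conf


def zk_quorum(conf):
    """Return the configured ZooKeeper quorum as (host, port) tuples."""
    pairs = []
    for entry in conf.get("high-availability.zookeeper.quorum", "").split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.rpartition(":")
        if not sep:
            # Assumption: fall back to ZooKeeper's default client port.
            host, port = entry, "2181"
        pairs.append((host, int(port)))
    return pairs


# The configuration from the issue description.
ISSUE_CONF = """\
high-availability: zookeeper
high-availability.storageDir: file:///tmp/flink-ha
high-availability.zookeeper.quorum: localhost:2181
high-availability.zookeeper.path.cluster-id: /my-namespace
"""

conf = parse_flink_conf(ISSUE_CONF)
print(conf["high-availability"], zk_quorum(conf))
```

The resulting host and port pairs identify which ZooKeeper servers to inspect when a submission hangs.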



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (FLINK-8035) Unable to submit job when HA is enabled

2018-09-28 Thread Till Rohrmann (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-8035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16632090#comment-16632090
 ] 

Till Rohrmann commented on FLINK-8035:
--

The {{jobmanager.log}} would be a good start.
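One possible way to triage a {{jobmanager.log}} before attaching it is to scan it for the ZooKeeper client symptoms that were reported in the 2017-11-08 comment on this issue. The sketch below is only illustrative: `scan_log` and the `SYMPTOMS` table are hypothetical names, the first and third patterns are copied from that comment's log excerpt, and matching the Curator `LOST` state in addition to `SUSPENDED` is an assumption.

```python
import re

# Patterns based on the log excerpt from the 2017-11-08 comment on this issue.
# Matching "LOST" as well as "SUSPENDED" is an assumption.
SYMPTOMS = {
    "zk_reconnect": re.compile(r"closing socket connection and attempting reconnect"),
    "xid_out_of_order": re.compile(r"Xid out of order"),
    "curator_state_change": re.compile(r"State change: (SUSPENDED|LOST)"),
}


def scan_log(lines):
    """Return (line_number, symptom_name) for every matching log line."""
    hits = []
    for number, line in enumerate(lines, start=1):
        for name, pattern in SYMPTOMS.items():
            if pattern.search(line):
                hits.append((number, name))
    return hits


# Abbreviated sample lines in the shape of the excerpt from this issue.
SAMPLE = [
    "WARN  org.apache.zookeeper.ClientCnxn - Session 0x15f9c132b170016 for server "
    "localhost/0:0:0:0:0:0:0:1:2181, unexpected error, closing socket connection "
    "and attempting reconnect",
    "java.io.IOException: Xid out of order. Got Xid 56 with err 0 expected Xid 55",
    "INFO  ConnectionStateManager - State change: SUSPENDED",
]

for number, name in scan_log(SAMPLE):
    print(number, name)
```

To use it on a real file, pass the open file object (an iterable of lines) to `scan_log`. An empty result does not rule out the problem, since the later comments report a timeout without any preceding log entries.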



[jira] [Commented] (FLINK-8035) Unable to submit job when HA is enabled

2018-09-20 Thread Jason Kania (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-8035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622792#comment-16622792
 ] 

Jason Kania commented on FLINK-8035:


I would need to know which specific logs you would like to see and with what 
log configuration. The problem I saw was that no logs indicated any issue, even 
with trace logging enabled. I would simply get a message at the point of 
timeout, with no preceding log entries indicating that any messages had been 
sent.



[jira] [Commented] (FLINK-8035) Unable to submit job when HA is enabled

2018-09-20 Thread Till Rohrmann (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-8035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622751#comment-16622751
 ] 

Till Rohrmann commented on FLINK-8035:
--

In order to better understand your problem, getting access to the debug logs of 
all components would be very helpful [~longtimer].



[jira] [Commented] (FLINK-8035) Unable to submit job when HA is enabled

2018-09-14 Thread Jason Kania (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-8035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16615275#comment-16615275
 ] 

Jason Kania commented on FLINK-8035:


I encountered this issue in 1.5.3 and subsequently had to roll back to 1.4.2. I 
had cleaned out the ZooKeeper data, but the issue remained. I was unable to 
trace it to a known cause. I suspect a swallowed error in ZooKeeper 
communication, because the code that performs the actual low-level send for the 
job list was not executed, or at least the breakpoint was never hit.



[jira] [Commented] (FLINK-8035) Unable to submit job when HA is enabled

2018-09-14 Thread Till Rohrmann (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-8035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16615160#comment-16615160
 ] 

Till Rohrmann commented on FLINK-8035:
--

Which version of Flink are you using [~longtimer]?



[jira] [Commented] (FLINK-8035) Unable to submit job when HA is enabled

2018-09-05 Thread Jason Kania (JIRA)


[ 
https://issues.apache.org/jira/browse/FLINK-8035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605052#comment-16605052
 ] 

Jason Kania commented on FLINK-8035:


This issue affects more than just submission. Many of the Flink command-line 
calls also time out because of it. In my case, I am unable to upgrade 
ZooKeeper because of other components, so I have had to abandon HA mode.



[jira] [Commented] (FLINK-8035) Unable to submit job when HA is enabled

2017-11-08 Thread Robert Metzger (JIRA)

[ 
https://issues.apache.org/jira/browse/FLINK-8035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244604#comment-16244604
 ] 

Robert Metzger commented on FLINK-8035:
---

The problem actually also occurred with Flink 1.3.2, but there, the error 
reporting is better:
{code}
2017-11-08 20:29:47,375 WARN  org.apache.zookeeper.ClientCnxn - Session 0x15f9c132b170016 for server localhost/0:0:0:0:0:0:0:1:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Xid out of order. Got Xid 56 with err 0 expected Xid 55 for a packet with details: clientPath:null serverPath:null finished:false header:: 55,14  replyHeader:: 0,0,-4  request:: org.apache.zookeeper.MultiTransactionRecord@7677f7ec response:: org.apache.zookeeper.MultiResponse@0
	at org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:798)
	at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:94)
	at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2017-11-08 20:29:47,480 INFO  org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
{code}

Upgrading the ZK server to 3.4.9 resolved the problem for 1.3.2.
I still think the error handling in 1.4.0 needs to improve (the job switching 
to failed? plus an exception being logged). It would also be good to find out 
why ZK 3.3.6 didn't work.
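A possible pre-flight check, offered as a sketch rather than a confirmed diagnosis: the failing request in the log above is a `MultiTransactionRecord`, and ZooKeeper's multi-operation API first shipped in 3.4.0, so a 3.3.x server plausibly cannot answer it. That link is an inference and is not established in this issue. The helper names below are hypothetical, and the `srvr` query assumes the server answers four-letter commands with a line of the form `Zookeeper version: x.y.z-...`.

```python
import re
import socket

# ZooKeeper's multi() API first shipped in 3.4.0.
MULTI_MIN_VERSION = (3, 4, 0)


def parse_zk_version(text):
    """Extract the first x.y.z version number from a string."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version found in {text!r}")
    return tuple(int(part) for part in match.groups())


def supports_multi(version):
    """True if the given (major, minor, patch) is at least 3.4.0."""
    return version >= MULTI_MIN_VERSION


def query_server_version(host="localhost", port=2181, timeout=5.0):
    """Ask a server for its version via the 'srvr' four-letter command.

    Assumption: the command is enabled and the reply starts with the
    version line; newer servers may restrict four-letter commands.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"srvr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return parse_zk_version(b"".join(chunks).decode("utf-8", "replace"))


# The two server versions mentioned in this issue.
for label in ("3.3.6", "3.4.9"):
    print(label, supports_multi(parse_zk_version(label)))
```

Calling `supports_multi(query_server_version())` against the quorum address from flink-conf.yaml would flag an old server before HA mode is enabled. The network call is kept out of the example run so the block stays self-contained.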




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)