[jira] [Commented] (FLINK-8035) Unable to submit job when HA is enabled
[ https://issues.apache.org/jira/browse/FLINK-8035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16804866#comment-16804866 ]

Till Rohrmann commented on FLINK-8035:
--

Ping [~longtimer].

> Unable to submit job when HA is enabled
> ---
>
> Key: FLINK-8035
> URL: https://issues.apache.org/jira/browse/FLINK-8035
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.4.0
> Environment: Mac OS X
> Reporter: Robert Metzger
> Priority: Critical
>
> Steps to reproduce:
> - Get Flink 1.4 (f5a0b4bdfb)
> - Get ZK (3.3.6 in this case)
> - Put the following flink-conf.yaml:
> {code}
> high-availability: zookeeper
> high-availability.storageDir: file:///tmp/flink-ha
> high-availability.zookeeper.quorum: localhost:2181
> high-availability.zookeeper.path.cluster-id: /my-namespace
> {code}
> - Start Flink, submit a job (any streaming example will do)
> The job submission will time out. On the JobManager, it seems that the job
> submission gets stuck when trying to submit something to Zookeeper.
> In the JM UI, the job will sit there in status "CREATED"

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
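Before starting Flink with the flink-conf.yaml from the reproduction steps, it can help to confirm that the HA keys are actually being read as intended. The following is a minimal sketch, not part of Flink: it assumes the flat `key: value` layout shown in the issue description, and the function names `parse_flink_conf` and `missing_ha_keys` as well as the choice of "required" keys in `HA_KEYS` are hypothetical, made up for this illustration.

```python
# Hypothetical helper, not a Flink API: reads a flat "key: value" flink-conf.yaml.
# Assumption: these three keys are the ones needed for ZooKeeper HA in this setup.
HA_KEYS = (
    "high-availability",
    "high-availability.storageDir",
    "high-availability.zookeeper.quorum",
)


def parse_flink_conf(text):
    """Return a dict of the key/value pairs found in the config text."""
    conf = {}
    for raw in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        # Split on the first ": " only, so values such as
        # "file:///tmp/flink-ha" or "localhost:2181" stay intact.
        key, sep, value = line.partition(": ")
        if not sep:
            continue
        conf[key.strip()] = value.strip()
    return conf


def missing_ha_keys(conf):
    """List the assumed-required HA keys that are absent or empty."""
    return [key for key in HA_KEYS if not conf.get(key)]
```

Calling `missing_ha_keys(parse_flink_conf(open("conf/flink-conf.yaml").read()))` should return an empty list for the configuration in this issue; the file path is an assumption about the local install.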
[jira] [Commented] (FLINK-8035) Unable to submit job when HA is enabled
[ https://issues.apache.org/jira/browse/FLINK-8035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16632090#comment-16632090 ]

Till Rohrmann commented on FLINK-8035:
--

The {{jobmanager.log}} would be a good start.
[jira] [Commented] (FLINK-8035) Unable to submit job when HA is enabled
[ https://issues.apache.org/jira/browse/FLINK-8035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622792#comment-16622792 ]

Jason Kania commented on FLINK-8035:
--

I would need to know the specific logs you would like to see and the log configuration. The problem that I saw was that there were no logs indicating any issue, even with trace enabled. I would simply get a message at the point of timeout, without any preceding logs indicating that any messages had been sent.
[jira] [Commented] (FLINK-8035) Unable to submit job when HA is enabled
[ https://issues.apache.org/jira/browse/FLINK-8035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16622751#comment-16622751 ]

Till Rohrmann commented on FLINK-8035:
--

In order to better understand your problem, getting access to the debug logs of all components would be very helpful [~longtimer].
[jira] [Commented] (FLINK-8035) Unable to submit job when HA is enabled
[ https://issues.apache.org/jira/browse/FLINK-8035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16615275#comment-16615275 ]

Jason Kania commented on FLINK-8035:
--

I encountered this issue in 1.5.3 and subsequently had to roll back to 1.4.2. I had cleaned out the ZooKeeper data, but the issue remained. I was unable to trace it to a known cause. I suspect a swallowed error in the ZooKeeper communication, because the code that performs the actual low-level send for the job list was not executed, or at least the breakpoint was never hit.
[jira] [Commented] (FLINK-8035) Unable to submit job when HA is enabled
[ https://issues.apache.org/jira/browse/FLINK-8035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16615160#comment-16615160 ]

Till Rohrmann commented on FLINK-8035:
--

Which version of Flink are you using [~longtimer]?
[jira] [Commented] (FLINK-8035) Unable to submit job when HA is enabled
[ https://issues.apache.org/jira/browse/FLINK-8035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16605052#comment-16605052 ]

Jason Kania commented on FLINK-8035:
--

This issue affects more than just submission. Many of the Flink command-line calls also time out because of it. In my case, I am unable to upgrade ZooKeeper because of other components, so I have had to abandon HA mode.
[jira] [Commented] (FLINK-8035) Unable to submit job when HA is enabled
[ https://issues.apache.org/jira/browse/FLINK-8035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16244604#comment-16244604 ]

Robert Metzger commented on FLINK-8035:
---

The problem actually also occurred with Flink 1.3.2, but there the error reporting is better:
{code}
2017-11-08 20:29:47,375 WARN  org.apache.zookeeper.ClientCnxn - Session 0x15f9c132b170016 for server localhost/0:0:0:0:0:0:0:1:2181, unexpected error, closing socket connection and attempting reconnect
java.io.IOException: Xid out of order. Got Xid 56 with err 0 expected Xid 55 for a packet with details: clientPath:null serverPath:null finished:false header:: 55,14 replyHeader:: 0,0,-4 request:: org.apache.zookeeper.MultiTransactionRecord@7677f7ec response:: org.apache.zookeeper.MultiResponse@0
	at org.apache.zookeeper.ClientCnxn$SendThread.readResponse(ClientCnxn.java:798)
	at org.apache.zookeeper.ClientCnxnSocketNIO.doIO(ClientCnxnSocketNIO.java:94)
	at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:366)
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1081)
2017-11-08 20:29:47,480 INFO  org.apache.flink.shaded.org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
{code}
Upgrading the ZK server to 3.4.9 resolved the problem for 1.3.2.

I still think the error handling in 1.4.0 needs to improve (should the job switch to failed? Plus, an exception should be logged).

It would also be good to find out why ZK 3.3.6 didn't work.
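Since upgrading the ZK server to 3.4.9 resolved the problem for 1.3.2, a pre-flight check of the server version may save some debugging time. The sketch below is an illustration only, under these assumptions: the server answers ZooKeeper's `srvr` four-letter command on its client port (newer servers may restrict four-letter commands via a whitelist), the reply starts with a `Zookeeper version: x.y.z` line, and 3.4.9 is used as the threshold simply because it is the version reported to work in this thread. The function names are hypothetical.

```python
import re
import socket

# Assumption: the "srvr" reply contains a line such as
# "Zookeeper version: 3.4.9-1757313, built on 08/23/2016 06:50 GMT".
VERSION_RE = re.compile(r"Zookeeper version:\s*(\d+)\.(\d+)\.(\d+)")


def parse_zk_version(srvr_output):
    """Extract (major, minor, patch) from 'srvr' output, or None if absent."""
    match = VERSION_RE.search(srvr_output)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def query_srvr(host="localhost", port=2181, timeout=5.0):
    """Send the 'srvr' four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"srvr")
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def at_least(version, minimum=(3, 4, 9)):
    """True if a parsed version tuple is at or above the chosen minimum.

    The default minimum is an assumption taken from this thread, not a
    documented Flink requirement.
    """
    return version is not None and version >= minimum
```

With a ZooKeeper server listening on `localhost:2181`, `at_least(parse_zk_version(query_srvr()))` would be expected to return `False` for the 3.3.6 server from the reproduction steps and `True` after the upgrade to 3.4.9.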