[ 
https://issues.apache.org/jira/browse/FLINK-13878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gary Yao updated FLINK-13878:
-----------------------------
    Description: 
We are running Flink 1.8.1 in HA mode with ZooKeeper 3.4.11.

When we run list, submit, or cancel from the CLI on a job manager host, the 
operation fails with a TimeoutException. Here is an example:

{noformat}
$ ./bin/flink list
Waiting for response... 
------------------------------------------------------------
The program finished with the following exception:
org.apache.flink.util.FlinkException: Failed to retrieve job list.
at org.apache.flink.client.cli.CliFrontend.listJobs(CliFrontend.java:448)
at org.apache.flink.client.cli.CliFrontend.lambda$list$0(CliFrontend.java:430)
at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:985)
at org.apache.flink.client.cli.CliFrontend.list(CliFrontend.java:427)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1053)
at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1126)
at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1126)
Caused by: java.util.concurrent.TimeoutException
at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:943)
at org.apache.flink.runtime.concurrent.DirectExecutorService.execute(DirectExecutorService.java:211)
at org.apache.flink.runtime.concurrent.FutureUtils.lambda$orTimeout$11(FutureUtils.java:361)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
{noformat}
 

When the same command is run with strace, we see that the CLI only manages to 
connect out to zookeeper and hangs on that connection:
{noformat}
$ strace -f -e connect ./bin/flink list
...
[pid 29445] connect(27, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("gateway ip")}, 16) = 0
...
[pid 29445] connect(27, {sa_family=AF_INET, sin_port=htons(2181), sin_addr=inet_addr("zookeeper ip")}, 16) = -1 EINPROGRESS (Operation now in progress)
strace: Process 29448 attached
Waiting for response...
...
Exception
{noformat}
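
For context on the strace output above: on Linux, `EINPROGRESS` is the normal immediate result of a non-blocking `connect()` while the TCP handshake is still in flight, so that line alone does not show a failed connection. The following is a minimal Python sketch of the same pattern, not something taken from Flink or ZooKeeper. The helper name `connect_nonblocking` and its `timeout` parameter are illustrative assumptions.

```python
import errno
import select
import socket


def connect_nonblocking(host, port, timeout=5.0):
    """Start a non-blocking connect and wait for it to finish.

    Returns (initial_code, connected): the errno value returned
    immediately by the connect call, and whether the handshake
    completed successfully within `timeout` seconds.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    try:
        # connect_ex returns an errno value instead of raising.
        # On Linux this is usually EINPROGRESS for a non-blocking socket.
        initial = sock.connect_ex((host, port))
        if initial not in (0, errno.EINPROGRESS):
            return initial, False
        # The socket becomes writable once the handshake finishes or fails.
        _, writable, _ = select.select([], [sock], [], timeout)
        if not writable:
            return initial, False
        # SO_ERROR holds the final outcome of the asynchronous connect.
        err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
        return initial, err == 0
    finally:
        sock.close()
```

Calling `connect_nonblocking(zk_host, 2181)` from the job manager host, with `zk_host` standing in for the real ZooKeeper address, reports both the immediate return code and whether the connection was eventually established.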

 

The CLI is able to successfully connect to zookeeper and even appears to submit 
commands. This is verified by logs from zookeeper:

{noformat}
INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@215] - Accepted socket connection from /JOB_MANAGER_IP:52074
DEBUG [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@880] - Session establishment request from client /JOB_MANAGER_IP:52074 client's lastZxid is 0x0
INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@938] - Client attempting to establish new session at /JOB_MANAGER_IP:52074
DEBUG [FollowerRequestProcessor:3:CommitProcessor@174] - Processing request:: sessionid:0x30037a90dce0001 type:createSession cxid:0x0 zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a
DEBUG [QuorumPeer[myid=3]/0.0.0.0:2181:CommitProcessor@164] - Committing request:: sessionid:0x30037a90dce0001 type:createSession cxid:0x0 zxid:0x10200000109 txntype:-10 reqpath:n/a
DEBUG [CommitProcessor:3:FinalRequestProcessor@89] - Processing request:: sessionid:0x30037a90dce0001 type:createSession cxid:0x0 zxid:0x10200000109 txntype:-10 reqpath:n/a
DEBUG [CommitProcessor:3:FinalRequestProcessor@161] - sessionid:0x30037a90dce0001 type:createSession cxid:0x0 zxid:0x10200000109 txntype:-10 reqpath:n/a
INFO [CommitProcessor:3:ZooKeeperServer@683] - Established session 0x30037a90dce0001 with negotiated timeout 40000 for client /JOB_MANAGER_IP:52074
DEBUG [FollowerRequestProcessor:3:CommitProcessor@174] - Processing request:: sessionid:0x30037a90dce0001 type:exists cxid:0x1 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink
DEBUG [CommitProcessor:3:FinalRequestProcessor@89] - Processing request:: sessionid:0x30037a90dce0001 type:exists cxid:0x1 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink
DEBUG [CommitProcessor:3:FinalRequestProcessor@161] - sessionid:0x30037a90dce0001 type:exists cxid:0x1 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink
DEBUG [FollowerRequestProcessor:3:CommitProcessor@174] - Processing request:: sessionid:0x30037a90dce0001 type:exists cxid:0x2 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/CLUSTER_NAME
DEBUG [CommitProcessor:3:FinalRequestProcessor@89] - Processing request:: sessionid:0x30037a90dce0001 type:exists cxid:0x2 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/CLUSTER_NAME
DEBUG [CommitProcessor:3:FinalRequestProcessor@161] - sessionid:0x30037a90dce0001 type:exists cxid:0x2 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/CLUSTER_NAME
DEBUG [FollowerRequestProcessor:3:CommitProcessor@174] - Processing request:: sessionid:0x30037a90dce0001 type:exists cxid:0x3 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink
DEBUG [CommitProcessor:3:FinalRequestProcessor@89] - Processing request:: sessionid:0x30037a90dce0001 type:exists cxid:0x3 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink
DEBUG [CommitProcessor:3:FinalRequestProcessor@161] - sessionid:0x30037a90dce0001 type:exists cxid:0x3 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink
DEBUG [FollowerRequestProcessor:3:CommitProcessor@174] - Processing request:: sessionid:0x30037a90dce0001 type:exists cxid:0x4 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/CLUSTER_NAME
DEBUG [CommitProcessor:3:FinalRequestProcessor@89] - Processing request:: sessionid:0x30037a90dce0001 type:exists cxid:0x4 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/CLUSTER_NAME
DEBUG [CommitProcessor:3:FinalRequestProcessor@161] - sessionid:0x30037a90dce0001 type:exists cxid:0x4 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/CLUSTER_NAME
DEBUG [FollowerRequestProcessor:3:CommitProcessor@174] - Processing request:: sessionid:0x30037a90dce0001 type:exists cxid:0x5 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/CLUSTER_NAME/leader
DEBUG [CommitProcessor:3:FinalRequestProcessor@89] - Processing request:: sessionid:0x30037a90dce0001 type:exists cxid:0x5 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/CLUSTER_NAME/leader
DEBUG [CommitProcessor:3:FinalRequestProcessor@161] - sessionid:0x30037a90dce0001 type:exists cxid:0x5 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/CLUSTER_NAME/leader
DEBUG [FollowerRequestProcessor:3:CommitProcessor@174] - Processing request:: sessionid:0x30037a90dce0001 type:exists cxid:0x6 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/CLUSTER_NAME/leader/rest_server_lock
DEBUG [FollowerRequestProcessor:3:CommitProcessor@174] - Processing request:: sessionid:0x30037a90dce0001 type:exists cxid:0x7 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink
DEBUG [CommitProcessor:3:FinalRequestProcessor@89] - Processing request:: sessionid:0x30037a90dce0001 type:exists cxid:0x6 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/CLUSTER_NAME/leader/rest_server_lock
DEBUG [CommitProcessor:3:FinalRequestProcessor@161] - sessionid:0x30037a90dce0001 type:exists cxid:0x6 zxid:0xfffffffffffffffe txntype:unknown reqpath:/flink/CLUSTER_NAME/leader/rest_server_lock
{noformat}
 

Based on those logs, I can only speculate that the CLI has difficulty 
obtaining the rest server lock, but I don't really know.
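
One way to test that speculation against the ZooKeeper log is to pair each request logged by the `FollowerRequestProcessor` thread with its `FinalRequestProcessor` entry and list the requests that never reach the final stage. The Python sketch below does this under stated assumptions: each log entry is on a single unwrapped line, all lines belong to one session (so `cxid` values are unique), and the function name `unfinished_requests` is illustrative. `SAMPLE` holds three of the log lines from this report.

```python
import re

CXID = re.compile(r"cxid:(0x[0-9a-fA-F]+)")
PATH = re.compile(r"reqpath:(\S+)")

# Three entries copied from the ZooKeeper log in this report.
SAMPLE = [
    "DEBUG [FollowerRequestProcessor:3:CommitProcessor@174] - Processing request:: "
    "sessionid:0x30037a90dce0001 type:exists cxid:0x6 zxid:0xfffffffffffffffe "
    "txntype:unknown reqpath:/flink/CLUSTER_NAME/leader/rest_server_lock",
    "DEBUG [FollowerRequestProcessor:3:CommitProcessor@174] - Processing request:: "
    "sessionid:0x30037a90dce0001 type:exists cxid:0x7 zxid:0xfffffffffffffffe "
    "txntype:unknown reqpath:/flink",
    "DEBUG [CommitProcessor:3:FinalRequestProcessor@89] - Processing request:: "
    "sessionid:0x30037a90dce0001 type:exists cxid:0x6 zxid:0xfffffffffffffffe "
    "txntype:unknown reqpath:/flink/CLUSTER_NAME/leader/rest_server_lock",
]


def unfinished_requests(log_lines):
    """Return {cxid: reqpath} for requests with no FinalRequestProcessor entry.

    Assumes one log entry per line and a single client session.
    """
    received = {}
    finished = set()
    for line in log_lines:
        cxid = CXID.search(line)
        if cxid is None:
            continue  # not a request line (e.g. connection accepted)
        if "FinalRequestProcessor@" in line:
            finished.add(cxid.group(1))
        elif "FollowerRequestProcessor:" in line:
            path = PATH.search(line)
            received[cxid.group(1)] = path.group(1) if path else "?"
    return {c: p for c, p in received.items() if c not in finished}


print(unfinished_requests(SAMPLE))
```

For the log shown in this report, the only request without a final entry is `cxid:0x7` on `/flink`, while the `rest_server_lock` request (`cxid:0x6`) does complete. Since the excerpt may simply have been cut off at that point, this is a hint about where to look rather than evidence of a hang.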

In case it matters, we recently upgraded to Flink 1.8.

Here are a few similar reports we have found online, none of which appears 
to have reached a conclusive resolution:

[https://stackoverflow.com/questions/53716795/flink-1-6-timeoutexception-while-submitting-the-job]

[https://mail-archives.apache.org/mod_mbox/flink-user/201812.mbox/%3ca80ea545-cd42-a5a3-4321-d4ebfc79e...@apache.org%3E]

[https://mail-archives.apache.org/mod_mbox/flink-user/201809.mbox/%3ccac2r296tesd00qra9drd+atfngtrj0gcrxnpn-bvvcuvd4-...@mail.gmail.com%3e]

 


> Flink CLI fails (TimeoutException) for unknown reason
> -----------------------------------------------------
>
>                 Key: FLINK-13878
>                 URL: https://issues.apache.org/jira/browse/FLINK-13878
>             Project: Flink
>          Issue Type: Bug
>          Components: Command Line Client, Runtime / Coordination
>            Reporter: Mitch Wasson
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
