On 06-Sep-2018, at 10:45 AM, Gary Yao <g...@data-artisans.com> wrote:
Hi Jason,
From the stack trace it seems that you are using the 1.4.0 client to list
jobs on a 1.5.x cluster. This will not work. You have to use the 1.5.x client.
Best,
Gary
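A mismatch like this can be caught before running CLI commands by comparing the major.minor parts of the two version strings. The sketch below is only illustrative: the helper name `same_minor_release` is hypothetical, and the rule that client and cluster must share a major.minor release is an assumption drawn from the advice above, not a documented Flink compatibility guarantee. How the two version strings are obtained is left open.

```python
def same_minor_release(client_version: str, cluster_version: str) -> bool:
    """Return True when both versions share the same major.minor prefix.

    Hypothetical helper: assumes dotted numeric versions such as "1.5.3".
    """
    def major_minor(version: str) -> tuple:
        parts = version.strip().split(".")
        return int(parts[0]), int(parts[1])

    return major_minor(client_version) == major_minor(cluster_version)


if __name__ == "__main__":
    # Versions taken from this thread: a 1.4.0 client against a 1.5.3 cluster.
    print(same_minor_release("1.4.0", "1.5.3"))
    # A client from the same minor release line as the cluster.
    print(same_minor_release("1.5.0", "1.5.3"))
```

With the versions from this thread, the first call reports a mismatch and the second does not.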
On Wed, Sep 5, 2018 at 5:35 PM, Jason Kania <jason.ka...@ymail.com> wrote:
Hello,
Thanks for the response. I had already tried setting the log
level to DEBUG in log4j-cli.properties, logback-console.xml, and
log4j-console.properties, but no additional relevant information
comes out. On the server, all that comes out are ZooKeeper ping
responses:
2018-09-05 15:16:56,786 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x3659b60bcb50076 after 1ms
The client log shows only the following (but we are not using
Hadoop):
2018-09-05 15:19:53,339 WARN org.apache.flink.client.cli.CliFrontend - Could not load CLI class org.apache.flink.yarn.cli.FlinkYarnSessionCli.
java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:264)
at org.apache.flink.client.cli.CliFrontend.loadCustomCommandLine(CliFrontend.java:1208)
at org.apache.flink.client.cli.CliFrontend.loadCustomCommandLines(CliFrontend.java:1164)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1090)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 5 more
and
2018-09-05 15:19:53,881 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed
despite ZooKeeper being configured as 'open' and the latest logs
showing data being read from ZooKeeper:
2018-09-05 15:19:54,274 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Reading reply sessionid:0x265a12437df0074, packet:: clientPath:null serverPath:null finished:false header:: 1,3 replyHeader:: 1,47244656277,0 request:: '/flink,F response:: s{47244656196,47244656196,1536110417531,1536110417531,0,1,0,0,0,1,47244656197}
Much like the basic log output, the detailed trace shows no
additional information, just a gap after waiting for the response:
2018-09-05 15:19:54,313 INFO org.apache.flink.client.cli.CliFrontend - Waiting for response...
2018-09-05 15:20:07,635 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x265a12437df0074 after 1ms
2018-09-05 15:20:20,976 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x265a12437df0074 after 1ms
2018-09-05 15:20:24,311 INFO org.apache.flink.runtime.rest.RestClient - Shutting down rest endpoint.
2018-09-05 15:20:24,317 INFO org.apache.flink.runtime.rest.RestClient - Rest endpoint shutdown complete.
2018-09-05 15:20:24,318 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
2018-09-05 15:20:24,320 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
2018-09-05 15:20:24,320 DEBUG org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl - Closing
2018-09-05 15:20:24,321 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl - backgroundOperationsLoop exiting
2018-09-05 15:20:24,322 DEBUG org.apache.flink.shaded.curator.org.apache.curator.CuratorZookeeperClient - Closing
2018-09-05 15:20:24,322 DEBUG org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Closing
2018-09-05 15:20:24,323 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Closing session: 0x265a12437df0074
2018-09-05 15:20:24,323 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Closing client for session: 0x265a12437df0074
2018-09-05 15:20:24,329 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Reading reply sessionid:0x265a12437df0074, packet:: clientPath:null serverPath:null finished:false header:: 11,-11 replyHeader:: 11,47244656278,0 request:: null response:: null
2018-09-05 15:20:24,329 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Disconnecting client for session: 0x265a12437df0074
2018-09-05 15:20:24,330 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Session: 0x265a12437df0074 closed
2018-09-05 15:20:24,330 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x265a12437df0074
2018-09-05 15:20:24,330 ERROR org.apache.flink.client.cli.CliFrontend - Error while running the command.
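The length of that gap can be read off the two client-log timestamps quoted above: "Waiting for response..." at 15:19:54,313 and the REST client starting to shut down at 15:20:24,311. The following is a minimal sketch of the arithmetic; the helper name `seconds_between` is hypothetical, and the timestamp format is assumed from the log lines in this message. It says nothing about which timeout setting is responsible.

```python
from datetime import datetime

# Timestamp layout assumed from the log excerpts above, e.g. "2018-09-05 15:19:54,313".
LOG_TIME_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def seconds_between(start: str, end: str) -> float:
    """Return the elapsed seconds between two log timestamps."""
    t0 = datetime.strptime(start, LOG_TIME_FORMAT)
    t1 = datetime.strptime(end, LOG_TIME_FORMAT)
    return (t1 - t0).total_seconds()


if __name__ == "__main__":
    # "Waiting for response..." versus "Shutting down rest endpoint."
    gap = seconds_between("2018-09-05 15:19:54,313", "2018-09-05 15:20:24,311")
    print(f"{gap:.3f} s")
```

For these two timestamps the gap comes out just under 30 seconds.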
On Wednesday, September 5, 2018, 3:41:29 a.m. EDT, Chesnay
Schepler <ches...@apache.org> wrote:
Please enable DEBUG logging for the client and TRACE logging for
the cluster.
For the client, look for log messages starting with "Sending
request of"; these contain the host and port that the client
sends requests to. Verify that these are correct and
match the host/port that you use when accessing the web UI.
For the server, look for log messages starting with "Received
request", so we can figure out whether the request at least arrives.
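Once the extra logging is enabled, the two kinds of messages can be pulled out of a log file with a simple filter. This is a rough sketch, not part of Flink: the helper `find_marked_lines` is hypothetical, the log file path is supplied by the caller, and a plain substring match is assumed to be sufficient because each message is preceded by a timestamp and logger name.

```python
import sys
from typing import Iterable, List

# Message prefixes quoted in the advice above.
CLIENT_MARKER = "Sending request of"
SERVER_MARKER = "Received request"


def find_marked_lines(lines: Iterable[str], marker: str) -> List[str]:
    """Return every line that contains the marker, without trailing newlines."""
    return [line.rstrip("\n") for line in lines if marker in line]


if __name__ == "__main__":
    # Usage (hypothetical file names): python find_requests.py <log-file> client|server
    log_path, side = sys.argv[1], sys.argv[2]
    marker = CLIENT_MARKER if side == "client" else SERVER_MARKER
    with open(log_path, encoding="utf-8", errors="replace") as handle:
        for match in find_marked_lines(handle, marker):
            print(match)
```

If the client side prints matches but the server side prints none, the request is probably not reaching the cluster at the host and port shown in the client lines.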
On 05.09.2018 00:53, Jason Kania wrote:
I have upgraded from Flink 1.4.0 to Flink 1.5.3 with a three
node cluster configured with HA. Now I am encountering an issue
where the flink command line operations time out. The exception
generated is not very helpful because it only indicates a timeout and
not the reason or what it was trying to do:
>./flink list -f
Waiting for response...
------------------------------------------------------------
The program finished with the following exception:
org.apache.flink.util.FlinkException: Failed to retrieve job list.
at org.apache.flink.client.cli.CliFrontend.listJobs(CliFrontend.java:433)
at org.apache.flink.client.cli.CliFrontend.lambda$list$0(CliFrontend.java:416)
at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:960)
at org.apache.flink.client.cli.CliFrontend.list(CliFrontend.java:413)
at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1028)
at org.apache.flink.client.cli.CliFrontend.lambda$main$9(CliFrontend.java:1101)
at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1101)
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:793)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
... 10 more
Caused by: java.util.concurrent.TimeoutException
The web interface shows the 2 job managers and 3 task managers
that are talking with one another.
I have looked at the zookeeper data and it is all present.
I have tried running the command on multiple nodes and they all
give the same error.
I looked for a verbose or debug option for the commands but
found nothing.
Suggestions on this?
Thanks,
Jason