Re: flink list and flink run commands timeout

Chesnay Schepler Mon, 03 Dec 2018 03:52:30 -0800

Based on the stacktrace, the client is not running in legacy mode; please check the client's flink-conf.yaml.
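
As far as I know, Flink 1.5 and 1.6 read the execution mode from the `mode` key in flink-conf.yaml (`legacy` or `new`, with `new` as the default when the key is absent). The `RestClusterClient` frames in the trace below are what suggest the client is using the REST-based, non-legacy path. A minimal sketch of a check follows, assuming the flat `key: value` layout of the file; the function name `read_flink_mode` and the sample text are made up for illustration and are not part of Flink:

```python
def read_flink_mode(conf_text, default="new"):
    """Return the value of the `mode` key from flink-conf.yaml text.

    Assumes the flat `key: value` layout of the file and falls back to
    `default` when the key is absent or commented out.
    """
    mode = default
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        if key.strip() == "mode":
            mode = value.strip()
    return mode


# Hypothetical file contents, only to exercise the function.
sample = """
jobmanager.rpc.address: localhost
# mode: new
mode: legacy
"""
print(read_flink_mode(sample))
```

Run this against the flink-conf.yaml that the client machine actually loads; a cluster-side file with `mode: legacy` does not help if the client's copy lacks the key.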


On 03.12.2018 12:10, Aneesha Kaushal wrote:

Hello,

I am facing the same timeout exception with the flink run and flink list commands when trying to deploy jobs on Flink 1.6 in "legacy" mode.

We are planning to run in legacy mode because, after upgrading from Flink 1.3 to Flink 1.6, the Flink job was not getting distributed across task managers.

In "new" mode jobs are working fine.

Any suggestions?

org.apache.flink.client.program.ProgramInvocationException: Could not submit job (JobID: d6686e184897e8799d71008488ccf80e)
        at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:249)
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:486)
        at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:432)
        at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:804)
        at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:280)
        at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:215)
        at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1044)
        at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1120)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
        at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
        at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1120)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
        at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:379)
        at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
        at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
        at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
        at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:793)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
        at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
        at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
        at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
        at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
        ... 15 more
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
        ... 13 more
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
        at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
        at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
        ... 10 more
Caused by: java.util.concurrent.TimeoutException

Thanks,
Aneesha Kaushal

On 06-Sep-2018, at 10:45 AM, Gary Yao <g...@data-artisans.com> wrote:


Hi Jason,

From the stacktrace it seems that you are using the 1.4.0 client to list jobs on a 1.5.x cluster. This will not work. You have to use the 1.5.x client.

Best,
Gary

On Wed, Sep 5, 2018 at 5:35 PM, Jason Kania <jason.ka...@ymail.com> wrote:


    Hello,

    Thanks for the response. I had already tried setting the log
    level to debug in log4j-cli.properties, logback-console.xml, and
    log4j-console.properties but no additional relevant information
    comes out. On the server, all that comes out are zookeeper ping
    responses:

    2018-09-05 15:16:56,786 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x3659b60bcb50076 after 1ms

    The client log indicates only the following (but we are not using
    hadoop):

    2018-09-05 15:19:53,339 WARN org.apache.flink.client.cli.CliFrontend - Could not load CLI class org.apache.flink.yarn.cli.FlinkYarnSessionCli.
    java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
            at java.lang.Class.forName0(Native Method)
            at java.lang.Class.forName(Class.java:264)
            at org.apache.flink.client.cli.CliFrontend.loadCustomCommandLine(CliFrontend.java:1208)
            at org.apache.flink.client.cli.CliFrontend.loadCustomCommandLines(CliFrontend.java:1164)
            at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1090)
    Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
            at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
            at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
            at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
            ... 5 more


    and

    2018-09-05 15:19:53,881 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Authentication failed



    despite the zookeeper being configured as 'open' and latest logs
    showing data being read from zookeeper.

    2018-09-05 15:19:54,274 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Reading reply sessionid:0x265a12437df0074, packet:: clientPath:null serverPath:null finished:false header:: 1,3 replyHeader:: 1,47244656277,0 request:: '/flink,F response:: s{47244656196,47244656196,1536110417531,1536110417531,0,1,0,0,0,1,47244656197}


    Much like the basic log output, the detailed trace shows no
    additional information, just a gap after waiting for the response:

    2018-09-05 15:19:54,313 INFO org.apache.flink.client.cli.CliFrontend - Waiting for response...
    2018-09-05 15:20:07,635 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x265a12437df0074 after 1ms
    2018-09-05 15:20:20,976 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Got ping response for sessionid: 0x265a12437df0074 after 1ms
    2018-09-05 15:20:24,311 INFO org.apache.flink.runtime.rest.RestClient - Shutting down rest endpoint.
    2018-09-05 15:20:24,317 INFO org.apache.flink.runtime.rest.RestClient - Rest endpoint shutdown complete.
    2018-09-05 15:20:24,318 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
    2018-09-05 15:20:24,320 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
    2018-09-05 15:20:24,320 DEBUG org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl - Closing
    2018-09-05 15:20:24,321 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl - backgroundOperationsLoop exiting
    2018-09-05 15:20:24,322 DEBUG org.apache.flink.shaded.curator.org.apache.curator.CuratorZookeeperClient - Closing
    2018-09-05 15:20:24,322 DEBUG org.apache.flink.shaded.curator.org.apache.curator.ConnectionState - Closing
    2018-09-05 15:20:24,323 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Closing session: 0x265a12437df0074
    2018-09-05 15:20:24,323 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Closing client for session: 0x265a12437df0074
    2018-09-05 15:20:24,329 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Reading reply sessionid:0x265a12437df0074, packet:: clientPath:null serverPath:null finished:false header:: 11,-11 replyHeader:: 11,47244656278,0 request:: null response:: null
    2018-09-05 15:20:24,329 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - Disconnecting client for session: 0x265a12437df0074
    2018-09-05 15:20:24,330 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Session: 0x265a12437df0074 closed
    2018-09-05 15:20:24,330 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x265a12437df0074
    2018-09-05 15:20:24,330 ERROR org.apache.flink.client.cli.CliFrontend - Error while running the command.




    On Wednesday, September 5, 2018, 3:41:29 a.m. EDT, Chesnay
    Schepler <ches...@apache.org> wrote:


    Please enable DEBUG logging for the client and TRACE logging for
    the cluster.

    For the client, look for log messages starting with "Sending
    request of"; these contain the host and port that the client
    sends requests to. Verify that these are correct and match the
    host/port that you use when accessing the web UI.

    For the server, look for log messages starting with "Received
    request", so we can figure out whether the request at least arrives.
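
The log check described above can be sketched as a small filter. This is only an illustration: the helper name `find_request_lines` and the sample log text are made up, and the match is a plain substring test because real log lines carry a timestamp and logger name before the message.

```python
def find_request_lines(log_text, marker):
    """Return the log lines that contain `marker`.

    Typical markers here are "Sending request of" (client log) and
    "Received request" (cluster log).
    """
    return [line for line in log_text.splitlines() if marker in line]


# Placeholder log text, not real Flink output.
client_log = "\n".join([
    "<timestamp> INFO <logger> - Waiting for response...",
    "<timestamp> DEBUG <logger> - Sending request of <details with host:port>",
])

for line in find_request_lines(client_log, "Sending request of"):
    print(line)
```

If the filter returns nothing for the client log, DEBUG logging is probably not active for the client; if the cluster log has no "Received request" lines, the request likely never arrived.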

    On 05.09.2018 00:53, Jason Kania wrote:

    I have upgraded from Flink 1.4.0 to Flink 1.5.3 with a three
    node cluster configured with HA. Now I am encountering an issue
    where the flink command line operations time out. The exception
    generated is not very helpful because it only indicates a timeout
    and not the reason or what it was trying to do:

    >./flink list -f
    Waiting for response...

    ------------------------------------------------------------
     The program finished with the following exception:
    org.apache.flink.util.FlinkException: Failed to retrieve job list.
      at org.apache.flink.client.cli.CliFrontend.listJobs(CliFrontend.java:433)
      at org.apache.flink.client.cli.CliFrontend.lambda$list$0(CliFrontend.java:416)
      at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:960)
      at org.apache.flink.client.cli.CliFrontend.list(CliFrontend.java:413)
      at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1028)
      at org.apache.flink.client.cli.CliFrontend.lambda$main$9(CliFrontend.java:1101)
      at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
      at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1101)
    Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
      at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
      at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
      at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
      at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
      at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
      at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:793)
      at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
      at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
      at java.lang.Thread.run(Thread.java:748)
    Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
      at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
      at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
      at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
      at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
      ... 10 more
    Caused by: java.util.concurrent.TimeoutException

    The web interface shows the 2 job managers and 3 task managers
    that are talking with one another.

    I have looked at the zookeeper data and it is all present.

    I have tried running the command on multiple nodes and they all
    give the same error.

    I looked for a verbose or debug option for the commands but
    found nothing.

    Suggestions on this?

    Thanks,

    Jason
