Re: flink list and flink run commands timeout

Aneesha Kaushal Mon, 03 Dec 2018 03:11:08 -0800

Hello, 

I am hitting the same timeout exception with the flink run and flink list 
commands when I try to deploy jobs on Flink 1.6 in "legacy" mode.

We are planning to run in legacy mode because, after upgrading from Flink 1.3 
to Flink 1.6, our Flink job was no longer distributed across the task managers.

In "new" mode, jobs work fine.

Any suggestions? 

org.apache.flink.client.program.ProgramInvocationException: Could not submit job (JobID: d6686e184897e8799d71008488ccf80e)
        at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:249)
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:486)
        at org.apache.flink.client.program.DetachedEnvironment.finalizeExecute(DetachedEnvironment.java:77)
        at org.apache.flink.client.program.ClusterClient.run(ClusterClient.java:432)
        at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:804)
        at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:280)
        at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:215)
        at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1044)
        at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1120)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
        at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
        at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1120)
Caused by: org.apache.flink.runtime.client.JobSubmissionException: Failed to submit JobGraph.
        at org.apache.flink.client.program.rest.RestClusterClient.lambda$submitJob$8(RestClusterClient.java:379)
        at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:870)
        at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:852)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
        at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
        at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
        at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
        at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:793)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.concurrent.CompletionException: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
        at java.util.concurrent.CompletableFuture.encodeRelay(CompletableFuture.java:326)
        at java.util.concurrent.CompletableFuture.completeRelay(CompletableFuture.java:338)
        at java.util.concurrent.CompletableFuture.uniRelay(CompletableFuture.java:911)
        at java.util.concurrent.CompletableFuture$UniRelay.tryFire(CompletableFuture.java:899)
        ... 15 more
Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
        ... 13 more
Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
        at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
        at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
        at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
        at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
        ... 10 more
Caused by: java.util.concurrent.TimeoutException

Thanks,
Aneesha Kaushal



> On 06-Sep-2018, at 10:45 AM, Gary Yao <g...@data-artisans.com> wrote:
> 
> Hi Jason,
> 
> From the stacktrace it seems that you are using the 1.4.0 client to list jobs
> on a 1.5.x cluster. This will not work. You have to use the 1.5.x client.
> 
> Best,
> Gary
> 
> On Wed, Sep 5, 2018 at 5:35 PM, Jason Kania <jason.ka...@ymail.com> wrote:
> Hello,
> 
> Thanks for the response. I had already set the log level to DEBUG in 
> log4j-cli.properties, logback-console.xml, and log4j-console.properties, 
> but no additional relevant information appears. On the server, the only 
> output is ZooKeeper ping responses:
> 
> 2018-09-05 15:16:56,786 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Got ping response for sessionid: 0x3659b60bcb50076 after 1ms
> 
> The client log shows only the following (note that we are not using Hadoop):
> 
> 2018-09-05 15:19:53,339 WARN  org.apache.flink.client.cli.CliFrontend  - Could not load CLI class org.apache.flink.yarn.cli.FlinkYarnSessionCli.
> java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration
>         at java.lang.Class.forName0(Native Method)
>         at java.lang.Class.forName(Class.java:264)
>         at org.apache.flink.client.cli.CliFrontend.loadCustomCommandLine(CliFrontend.java:1208)
>         at org.apache.flink.client.cli.CliFrontend.loadCustomCommandLines(CliFrontend.java:1164)
>         at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1090)
> Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.conf.Configuration
>         at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
>         at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
>         at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
>         ... 5 more
> 
> 
> and 
> 
> 2018-09-05 15:19:53,881 ERROR org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  - Authentication failed
> 
> 
> despite ZooKeeper being configured as 'open' and the latest logs showing 
> data being read from ZooKeeper.
> 
> 2018-09-05 15:19:54,274 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Reading reply sessionid:0x265a12437df0074, packet:: clientPath:null serverPath:null finished:false header:: 1,3  replyHeader:: 1,47244656277,0  request:: '/flink,F  response:: s{47244656196,47244656196,1536110417531,1536110417531,0,1,0,0,0,1,47244656197}
> 
> 
> Much like the basic log output, the detailed trace shows no additional 
> information, just a gap after waiting for the response:
> 
> 2018-09-05 15:19:54,313 INFO  org.apache.flink.client.cli.CliFrontend  - Waiting for response...
> 2018-09-05 15:20:07,635 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Got ping response for sessionid: 0x265a12437df0074 after 1ms
> 2018-09-05 15:20:20,976 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Got ping response for sessionid: 0x265a12437df0074 after 1ms
> 2018-09-05 15:20:24,311 INFO  org.apache.flink.runtime.rest.RestClient  - Shutting down rest endpoint.
> 2018-09-05 15:20:24,317 INFO  org.apache.flink.runtime.rest.RestClient  - Rest endpoint shutdown complete.
> 2018-09-05 15:20:24,318 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/rest_server_lock.
> 2018-09-05 15:20:24,320 INFO  org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService  - Stopping ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
> 2018-09-05 15:20:24,320 DEBUG org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl  - Closing
> 2018-09-05 15:20:24,321 INFO  org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl  - backgroundOperationsLoop exiting
> 2018-09-05 15:20:24,322 DEBUG org.apache.flink.shaded.curator.org.apache.curator.CuratorZookeeperClient  - Closing
> 2018-09-05 15:20:24,322 DEBUG org.apache.flink.shaded.curator.org.apache.curator.ConnectionState  - Closing
> 2018-09-05 15:20:24,323 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Closing session: 0x265a12437df0074
> 2018-09-05 15:20:24,323 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Closing client for session: 0x265a12437df0074
> 2018-09-05 15:20:24,329 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Reading reply sessionid:0x265a12437df0074, packet:: clientPath:null serverPath:null finished:false header:: 11,-11  replyHeader:: 11,47244656278,0  request:: null response:: null
> 2018-09-05 15:20:24,329 DEBUG org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - Disconnecting client for session: 0x265a12437df0074
> 2018-09-05 15:20:24,330 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper  - Session: 0x265a12437df0074 closed
> 2018-09-05 15:20:24,330 INFO  org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn  - EventThread shut down for session: 0x265a12437df0074
> 2018-09-05 15:20:24,330 ERROR org.apache.flink.client.cli.CliFrontend  - Error while running the command.
> 
> 
> 
> 
> On Wednesday, September 5, 2018, 3:41:29 a.m. EDT, Chesnay Schepler 
> <ches...@apache.org> wrote:
> 
> 
> Please enable DEBUG logging for the client and TRACE logging for the cluster.
> 
> For the client, look for log messages starting with "Sending request of". 
> These contain the host and port that the client sends requests to. Verify 
> that they are correct and match the host/port that you use when accessing 
> the web UI.
> 
> For the server, look for log messages starting with "Received request", so we 
> can figure out whether the request at least arrives.
> 
> On 05.09.2018 00:53, Jason Kania wrote:
>> I have upgraded from Flink 1.4.0 to Flink 1.5.3 with a three-node cluster 
>> configured with HA. Now I am encountering an issue where the flink command 
>> line operations time out. The exception is unhelpful because it only 
>> indicates a timeout, not the reason or what the client was trying to do:
>> 
>> >./flink list -f
>> Waiting for response...
>> 
>> ------------------------------------------------------------
>>  The program finished with the following exception:
>> org.apache.flink.util.FlinkException: Failed to retrieve job list.
>>         at org.apache.flink.client.cli.CliFrontend.listJobs(CliFrontend.java:433)
>>         at org.apache.flink.client.cli.CliFrontend.lambda$list$0(CliFrontend.java:416)
>>         at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:960)
>>         at org.apache.flink.client.cli.CliFrontend.list(CliFrontend.java:413)
>>         at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1028)
>>         at org.apache.flink.client.cli.CliFrontend.lambda$main$9(CliFrontend.java:1101)
>>         at org.apache.flink.runtime.security.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30)
>>         at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1101)
>> Caused by: org.apache.flink.runtime.concurrent.FutureUtils$RetryException: Could not complete the operation. Exception is not retryable.
>>         at org.apache.flink.runtime.concurrent.FutureUtils.lambda$retryOperationWithDelay$5(FutureUtils.java:213)
>>         at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760)
>>         at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736)
>>         at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474)
>>         at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1977)
>>         at org.apache.flink.runtime.concurrent.FutureUtils$Timeout.run(FutureUtils.java:793)
>>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>         at java.lang.Thread.run(Thread.java:748)
>> Caused by: java.util.concurrent.CompletionException: java.util.concurrent.TimeoutException
>>         at java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>>         at java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>>         at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:593)
>>         at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577)
>>         ... 10 more
>> Caused by: java.util.concurrent.TimeoutException
>> 
>> The web interface shows the 2 job managers and 3 task managers that are 
>> talking with one another.
>> 
>> I have looked at the zookeeper data and it is all present.
>> 
>> I have tried running the command on multiple nodes and they all give the 
>> same error.
>> 
>> I looked for a verbose or debug option for the commands but found nothing.
>> 
>> Suggestions on this?
>> 
>> Thanks,
>> 
>> Jason
> 
> 
>
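
Chesnay's suggestion in the quoted thread (look in the client log for messages
starting with "Sending request of" and in the server log for messages starting
with "Received request") can be sketched as a small Python helper. This is only
a minimal sketch and not something posted in the thread: the function names are
hypothetical, the log file locations are left to the caller, and the two marker
strings are copied from the message above, so their exact wording may differ
between Flink versions.

```python
from typing import Iterable, List

# Marker strings as quoted in the thread (assumption: wording may vary by Flink version).
CLIENT_MARKER = "Sending request of"
SERVER_MARKER = "Received request"


def find_marked_lines(lines: Iterable[str], marker: str) -> List[str]:
    """Return every line that contains the marker, with trailing whitespace removed.

    A substring match is used rather than a prefix match, because each log
    line begins with a timestamp, level and logger name before the message.
    """
    return [line.rstrip() for line in lines if marker in line]


def scan_log(path: str, marker: str) -> List[str]:
    """Read a log file (hypothetical path supplied by the caller) and return matching lines."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return find_marked_lines(handle, marker)
```

Calling scan_log with the client log and CLIENT_MARKER should list the host and
port the client actually contacts; calling it with a JobManager log and
SERVER_MARKER shows whether any request arrived at all. An empty result on the
server side while the client side has entries would suggest the request never
reaches the cluster.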
