Hi Harshith,

In the jobmanager.sh script, the 2nd argument is assigned to the HOST variable [1]. How are you invoking jobmanager.sh? Prior to 1.5, the script expected an execution mode (local or cluster) but this is no longer the case [2].
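To illustrate the point above, here is a minimal sketch under stated assumptions: `build_args` is a hypothetical stand-in for the script's argument handling, not the actual jobmanager.sh code. It only models the behaviour described here, where the second positional argument is treated as the host, so a leftover pre-1.5 execution mode such as `cluster` ends up being passed on as `--host cluster`.

```shell
#!/usr/bin/env bash
# Hypothetical, simplified model of the argument handling described above.
# This is NOT the real jobmanager.sh; build_args is an illustrative name.
build_args() {
    local startstop="$1"   # first argument: the action, e.g. "start"
    local host="$2"        # second argument: interpreted as the host name
    local args=()

    # Only forward a host when one was actually given.
    if [ -n "$host" ]; then
        args+=("--host" "$host")
    fi

    # Print the resulting argument list on one line.
    echo "${args[*]}"
}

# Pre-1.5 style call that still passes an execution mode:
build_args start cluster

# Call without the legacy execution-mode argument:
build_args start
```

If the sketch reflects your setup, dropping the execution-mode argument from whatever wrapper invokes jobmanager.sh should have the same effect as commenting out the `--host` lines, without modifying the script itself.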
Best,
Gary

[1] https://github.com/apache/flink/blob/c6878aca6c5aeee46581b4d6744b31049db9de95/flink-dist/src/main/flink-bin/bin/jobmanager.sh
[2] https://github.com/apache/flink/commit/d61664ca64bcb82c4e8ddf03a2ed38fe8edafa98

On Fri, Mar 15, 2019 at 3:36 AM Kumar Bolar, Harshith <hk...@arity.com> wrote:

> Hi Gary,
>
> An update. I noticed the line “--host cluster” in the program arguments section of the job manager logs. So I commented out the following section in jobmanager.sh, and the task manager is now able to connect to the job manager without issues.
>
>     if [ ! -z $HOST ]; then
>         args+=("--host")
>         args+=("${HOST}")
>     fi
>
> Task manager logs after commenting out those lines:
>
> 2019-03-14 22:31:02,863 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.taskexecutor.TaskExecutor at akka://flink/user/taskmanager_0 .
> 2019-03-14 22:31:02,875 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
> 2019-03-14 22:31:02,876 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Start job leader service.
> 2019-03-14 22:31:02,877 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-12d5905f-d694-46f6-9359-3a636188b008
> 2019-03-14 22:31:02,884 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Connecting to ResourceManager akka.tcp://fl...@flink0-1.flink1.us-east-1.high.ue1.non.aws.cloud.arity.com:28945/user/resourcemanager(8583b335fd08a30a89585b7af07e4213).
> 2019-03-14 22:31:03,109 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Resolved ResourceManager address, beginning registration
> 2019-03-14 22:31:03,110 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Registration at ResourceManager attempt 1 (timeout=100ms)
> 2019-03-14 22:31:03,228 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Registration at ResourceManager attempt 2 (timeout=200ms)
> 2019-03-14 22:31:03,266 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Successful registration at resource manager akka.tcp://fl...@flink0-1.flink1.us-east-1.abc.com:28945/user/resourcemanager under registration id 170ee6a00f80ee02ead0e88710093d77.
>
> Thanks,
> Harshith
>
> From: Harshith Kumar Bolar <hk...@arity.com>
> Date: Friday, 15 March 2019 at 7:38 AM
> To: Gary Yao <g...@ververica.com>
> Cc: user <user@flink.apache.org>
> Subject: Re: [External] Re: Re: Flink 1.7.2: Task Manager not able to connect to Job Manager
>
> Hi Gary,
>
> Here are the full job manager and task manager logs. In the job manager logs, I see it says “starting StandaloneSessionClusterEntrypoint”, whereas in Flink 1.4.2, it used to say “starting JobManager”. Is this correct?
>
> Job manager logs: https://paste.ubuntu.com/p/DCVzsQdpHq/
> Task Manager logs: https://paste.ubuntu.com/p/wbvYFZxdT8/
>
> Thanks,
> Harshith
>
> From: Gary Yao <g...@ververica.com>
> Date: Thursday, 14 March 2019 at 10:11 PM
> To: Harshith Kumar Bolar <hk...@arity.com>
> Cc: user <user@flink.apache.org>
> Subject: [External] Re: Re: Flink 1.7.2: Task Manager not able to connect to Job Manager
>
> Hi Harshith,
>
> The truncated log is not enough. Can you share the complete logs? If that's not possible, I'd like to see the beginning of the log files where the cluster configuration is logged.
>
> The TaskManager tries to connect to the leader that is advertised in ZooKeeper.
> In your case the "cluster" hostname is advertised, which hints at a problem in your Flink configuration.
>
> Best,
> Gary
>
> On Thu, Mar 14, 2019 at 4:54 PM Kumar Bolar, Harshith <hk...@arity.com> wrote:
>
> Hi Gary,
>
> I’ve attached the relevant portions of the JM and TM logs.
>
> Job Manager Logs:
>
> 2019-03-14 11:38:28,257 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.state.ConnectionStateManager - State change: CONNECTED
> 2019-03-14 11:38:28,309 INFO org.apache.flink.runtime.webmonitor.WebMonitorUtils - Determined location of main cluster component log file: /opt/flink-1.7.2/log/flink-root-standalonesession-4-flink0-1.flink1.us-east-1.log
> 2019-03-14 11:38:28,309 INFO org.apache.flink.runtime.webmonitor.WebMonitorUtils - Determined location of main cluster component stdout file: /opt/flink-1.7.2/log/flink-root-standalonesession-4-flink0-1.flink1.us-east-1.out
> 2019-03-14 11:38:28,527 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Rest endpoint listening at cluster:8080
> 2019-03-14 11:38:28,527 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/rest_server_lock'}.
> 2019-03-14 11:38:28,574 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Web frontend listening at http://cluster:8080.
> 2019-03-14 11:38:28,613 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.StandaloneResourceManager at akka://flink/user/resourcemanager .
> 2019-03-14 11:38:28,674 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.dispatcher.StandaloneDispatcher at akka://flink/user/dispatcher .
> 2019-03-14 11:38:28,691 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/resource_manager_lock'}.
> 2019-03-14 11:38:28,694 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
> 2019-03-14 11:38:28,698 INFO org.apache.flink.runtime.leaderelection.ZooKeeperLeaderElectionService - Starting ZooKeeperLeaderElectionService ZooKeeperLeaderElectionService{leaderPath='/leader/dispatcher_lock'}.
> 2019-03-14 11:38:28,700 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/dispatcher_lock.
> 2019-03-14 11:38:28,818 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@cluster:22671] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@cluster:22671]] Caused by: [cluster]
> 2019-03-14 11:39:09,010 INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - http://cluster:8080 was granted leadership with leaderSessionID=bbe408fc-ef93-4328-abeb-85323db7aef7
> 2019-03-14 11:39:09,010 INFO org.apache.flink.runtime.resourcemanager.StandaloneResourceManager - ResourceManager akka.tcp://flink@cluster:31794/user/resourcemanager was granted leadership with fencing token ae4c0d30d0d65a0c41565360667e48fb
> 2019-03-14 11:39:09,011 INFO org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager - Starting the SlotManager.
> 2019-03-14 11:39:09,012 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Dispatcher akka.tcp://flink@cluster:31794/user/dispatcher was granted leadership with fencing token c852ada2-5fd4-4ff8-80ab-c2cdd85a75d9
> 2019-03-14 11:39:09,017 INFO org.apache.flink.runtime.dispatcher.StandaloneDispatcher - Recovering all persisted jobs.
>
> Task Manager Logs:
>
> 2019-03-14 11:42:35,790 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager uses directory /tmp/flink-io-a7bc246d-bae4-489f-9c9c-f6a25d3c4b8f for spill files.
> 2019-03-14 11:42:35,820 INFO org.apache.flink.runtime.taskexecutor.TaskManagerConfiguration - Messages have a max timeout of 10000 ms
> 2019-03-14 11:42:35,839 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Starting RPC endpoint for org.apache.flink.runtime.taskexecutor.TaskExecutor at akka://flink/user/taskmanager_0 .
> 2019-03-14 11:42:35,853 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Starting ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
> 2019-03-14 11:42:35,854 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Start job leader service.
> 2019-03-14 11:42:35,855 INFO org.apache.flink.runtime.filecache.FileCache - User file cache uses directory /tmp/flink-dist-cache-a7f67948-ab57-4cd9-b2a6-0361b53ecd26
> 2019-03-14 11:42:35,871 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Connecting to ResourceManager akka.tcp://flink@cluster:31794/user/resourcemanager(ae4c0d30d0d65a0c41565360667e48fb).
> 2019-03-14 11:42:35,963 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@cluster:31794] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@cluster:31794]] Caused by: [cluster: Name or service not known]
> 2019-03-14 11:42:35,964 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Could not resolve ResourceManager address akka.tcp://flink@cluster:31794/user/resourcemanager, retrying in 10000 ms: Could not connect to rpc endpoint under address akka.tcp://flink@cluster:31794/user/resourcemanager..
> 2019-03-14 11:47:35,895 ERROR org.apache.flink.runtime.taskexecutor.TaskExecutor - Fatal error occurred in TaskExecutor akka.tcp://fl...@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
> org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
>     at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
>     at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
>     at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>     at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2019-03-14 11:47:35,897 ERROR org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Fatal error occurred while executing the TaskManager. Shutting it down...
> org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 300000 ms. This indicates a problem with this instance. Terminating now.
>     at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1037)
>     at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$3(TaskExecutor.java:1023)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:332)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:158)
>     at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.onReceive(AkkaRpcActor.java:142)
>     at akka.actor.UntypedActor$$anonfun$receive$1.applyOrElse(UntypedActor.scala:165)
>     at akka.actor.Actor$class.aroundReceive(Actor.scala:502)
>     at akka.actor.UntypedActor.aroundReceive(UntypedActor.scala:95)
>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:526)
>     at akka.actor.ActorCell.invoke(ActorCell.scala:495)
>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:257)
>     at akka.dispatch.Mailbox.run(Mailbox.scala:224)
>     at akka.dispatch.Mailbox.exec(Mailbox.scala:234)
>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 2019-03-14 11:47:35,904 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Stopping TaskExecutor akka.tcp://fl...@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
> 2019-03-14 11:47:35,904 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/resource_manager_lock.
> 2019-03-14 11:47:35,904 INFO org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager - Shutting down TaskExecutorLocalStateStoresManager.
> 2019-03-14 11:47:35,908 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager removed spill file directory /tmp/flink-io-a7bc246d-bae4-489f-9c9c-f6a25d3c4b8f
> 2019-03-14 11:47:35,908 INFO org.apache.flink.runtime.io.network.NetworkEnvironment - Shutting down the network environment and its components.
> 2019-03-14 11:47:35,914 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful shutdown (took 5 ms).
> 2019-03-14 11:47:35,917 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful shutdown (took 2 ms).
> 2019-03-14 11:47:35,925 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Stop job leader service.
> 2019-03-14 11:47:35,931 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Stopped TaskExecutor akka.tcp://fl...@flink1-1.flink1.us-east-1.com:24623/user/taskmanager_0.
> 2019-03-14 11:47:35,931 INFO org.apache.flink.runtime.blob.PermanentBlobCache - Shutting down BLOB cache
> 2019-03-14 11:47:35,933 INFO org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB cache
> 2019-03-14 11:47:35,943 INFO org.apache.flink.shaded.curator.org.apache.curator.framework.imps.CuratorFrameworkImpl - backgroundOperationsLoop exiting
> 2019-03-14 11:47:35,950 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ZooKeeper - Session: 0x26977a24c4e0018 closed
> 2019-03-14 11:47:35,950 INFO org.apache.flink.shaded.zookeeper.org.apache.zookeeper.ClientCnxn - EventThread shut down for session: 0x26977a24c4e0018
> 2019-03-14 11:47:35,950 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Stopping Akka RPC service.
> 2019-03-14 11:47:35,952 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
> 2019-03-14 11:47:35,952 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
> 2019-03-14 11:47:35,959 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
> 2019-03-14 11:47:35,966 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
> 2019-03-14 11:47:35,983 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut down.
> 2019-03-14 11:47:35,984 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remoting shut down.
> 2019-03-14 11:47:35,992 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService - Stopped Akka RPC service.
>
> From: Gary Yao <g...@ververica.com>
> Date: Thursday, 14 March 2019 at 9:06 PM
> To: Harshith Kumar Bolar <hk...@arity.com>
> Cc: user <user@flink.apache.org>
> Subject: [External] Re: Flink 1.7.2: Task Manager not able to connect to Job Manager
>
> Hi Harshith,
>
> Can you share JM and TM logs?
>
> Best,
> Gary
>
> On Thu, Mar 14, 2019 at 3:42 PM Kumar Bolar, Harshith <hk...@arity.com> wrote:
>
> Hi all,
>
> I'm trying to upgrade our Flink cluster from 1.4.2 to 1.7.2.
>
> When I bring up the cluster, the task managers refuse to connect to the job managers with the following error.
>
> 2019-03-14 10:34:41,551 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@cluster:22671] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@cluster:22671]] Caused by: [cluster: Name or service not known]
>
> Now, this works correctly if I add the following line to the /etc/hosts file.
>
>     x.x.x.x job-manager-address.com cluster
>
> Why is Flink 1.7.2 connecting to the JM using "cluster" in the address? Flink 1.4.2 used to have the job manager's address instead of the word "cluster".
>
> Thanks,
> Harshith