Hi,

Could you run jstack on the JM and share the stack? From the current symptoms, it looks like the JobMaster is stuck during initialization.

Best,
Lijie

On Wed, Jul 13, 2022 at 09:56, ynz...@163.com <ynz...@163.com> wrote:

> Yes, 192.168.10.227:35961 is the TM address.
> By "repeatedly initializing" I mean that on the Overview page of the Flink web UI, the status of the corresponding job in the Running Job List stays INITIALIZING.
> There are no TM logs. I have not yet figured out why it exited; the TM page of the Flink web UI shows no information the whole time.
> Below is the list of log files; I did not find anything useful in them:
> directory.info : Total file length is 7201 bytes.
> jobmanager.err : Total file length is 588 bytes.
> jobmanager.log : Total file length is 82894 bytes.
> jobmanager.out : Total file length is 0 bytes.
> launch_container.sh : Total file length is 21758 bytes.
> prelaunch.err : Total file length is 0 bytes.
> prelaunch.out : Total file length is 100 bytes.
>
>
>
> best,
> ynz...@163.com
>
> From: Weihua Hu
> Date: 2022-07-12 23:18
> To: user-zh
> Subject: Re: Re: flink-hudi-hive
> From this log alone I cannot see a continuous failover. Which task do you mean by "the task repeatedly initializes"?
> I do see some akka connection exceptions, so the corresponding TM may have exited abnormally. Could you confirm whether 192.168.10.227:35961 is the TaskManager address, and why it exited?
>
> Best,
> Weihua
>
>
> On Tue, Jul 12, 2022 at 9:37 AM ynz...@163.com <ynz...@163.com> wrote:
>
> > Here are all the JobManager logs:
> > 2022-07-12 09:33:02,280 INFO org.apache.flink.configuration.GlobalConfiguration [] - Loading configuration property: execution.shutdown-on-attached-exit, false
> > 2022-07-12 09:33:02,280 INFO org.apache.flink.configuration.GlobalConfiguration [] - Loading configuration property: pipeline.jars, file:/home/dataxc/opt/flink-1.14.4/opt/flink-python_2.11-1.14.4.jar
> > 2022-07-12 09:33:02,280 INFO org.apache.flink.configuration.GlobalConfiguration [] - Loading configuration property: execution.checkpointing.min-pause, 8min
> > 2022-07-12 09:33:02,280 INFO org.apache.flink.configuration.GlobalConfiguration [] - Loading configuration property: restart-strategy, failure-rate
> > 2022-07-12 09:33:02,280 INFO org.apache.flink.configuration.GlobalConfiguration [] - Loading configuration property: jobmanager.memory.jvm-metaspace.size, 128m
> > 2022-07-12 09:33:02,280 INFO org.apache.flink.configuration.GlobalConfiguration [] - Loading configuration property: state.checkpoints.dir, hdfs:///flink/checkpoints
> > 2022-07-12 09:33:02,382 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with java.net.ConnectException: Connection refused: n103/192.168.10.227:35961
> > 2022-07-12 09:33:02,383 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@n103:35961] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@n103:35961]] Caused by: [java.net.ConnectException: Connection refused: n103/192.168.10.227:35961]
> > 2022-07-12 09:33:02,399 INFO org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager at akka://flink/user/rpc/resourcemanager_1 .
> > 2022-07-12 09:33:02,405 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Starting the resource manager.
> > 2022-07-12 09:33:02,479 INFO org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider [] - Failing over to rm2
> > 2022-07-12 09:33:02,509 INFO org.apache.flink.yarn.YarnResourceManagerDriver [] - Recovered 0 containers from previous attempts ([]).
> > 2022-07-12 09:33:02,509 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Recovered 0 workers from previous attempt.
> > 2022-07-12 09:33:02,514 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with java.net.ConnectException: Connection refused: n103/192.168.10.227:35961
> > 2022-07-12 09:33:02,515 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@n103:35961] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@n103:35961]] Caused by: [java.net.ConnectException: Connection refused: n103/192.168.10.227:35961]
> > 2022-07-12 09:33:02,528 INFO org.apache.hadoop.conf.Configuration [] - resource-types.xml not found
> > 2022-07-12 09:33:02,528 INFO org.apache.hadoop.yarn.util.resource.ResourceUtils [] - Unable to find 'resource-types.xml'.
> > 2022-07-12 09:33:02,538 INFO org.apache.flink.runtime.externalresource.ExternalResourceUtils [] - Enabled external resources: []
> > 2022-07-12 09:33:02,541 INFO org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl [] - Upper bound of the thread pool size is 500
> > 2022-07-12 09:33:02,584 WARN akka.remote.transport.netty.NettyTransport [] - Remote connection to [null] failed with java.net.ConnectException: Connection refused: n103/192.168.10.227:35961
> > 2022-07-12 09:33:02,585 WARN akka.remote.ReliableDeliverySupervisor [] - Association with remote system [akka.tcp://flink@n103:35961] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@n103:35961]] Caused by: [java.net.ConnectException: Connection refused: n103/192.168.10.227:35961]
> >
> >
> >
> > best,
> > ynz...@163.com
> >
> > From: Weihua Hu
> > Date: 2022-07-11 19:46
> > To: user-zh
> > Subject: Re: flink-hudi-hive
> > Hi,
> > By "the task repeatedly initializes", do you mean it keeps failing over? The reason for a job failover is recorded in JobManager.log; search for the keyword "to FAILED".
> >
> > Best,
> > Weihua
> >
> >
> > On Mon, Jul 11, 2022 at 2:46 PM ynz...@163.com <ynz...@163.com> wrote:
> >
> > > Hi,
> > > I am using Flink to write data into Hudi and sync it to Hive. After submitting the job to YARN, I see in the Flink web UI that the job keeps initializing and the task managers show no information. The logs contain no clear error either.
> > > When I remove the sync_hive-related configuration from the code and change nothing else, data is written to Hudi normally.
> > > I am using hudi-0.11.1, flink-1.14.4, hadoop-3.3.1, hive-3.1.3.
> > >
> > >
> > >
> > > best,
> > > ynz...@163.com
> > >
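
For anyone following the same steps, the two log checks suggested in this thread (searching jobmanager.log for "to FAILED", and seeing which address the akka "Connection refused" warnings point at) could be scripted roughly as below. This is only a sketch: the function name scan_jobmanager_log and the regular expression are illustrative assumptions, not part of Flink, and the matching relies on the message formats quoted in this thread.

```python
import re
from collections import Counter

# Assumption: the refused endpoint follows "Connection refused: " as in the
# warnings quoted in this thread, e.g. "n103/192.168.10.227:35961".
REFUSED = re.compile(r"Connection refused: (\S+)")


def scan_jobmanager_log(lines):
    """Return (lines containing 'to FAILED', Counter of refused endpoints)."""
    failed = []
    refused = Counter()
    for line in lines:
        if "to FAILED" in line:
            failed.append(line.rstrip("\n"))
        match = REFUSED.search(line)
        if match:
            # Drop a trailing "]" left over from the "Caused by: [...]" form.
            refused[match.group(1).rstrip("]")] += 1
    return failed, refused


# Example usage (hypothetical local copy of the log file):
# with open("jobmanager.log", encoding="utf-8") as f:
#     failed, refused = scan_jobmanager_log(f)
# print(failed)
# print(refused.most_common())
```

An empty first result together with a single dominant endpoint in the second would match what was reported here: no recorded failover, and repeated refused connections to one address.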