Hi,

Could you take a jstack dump of the JM? From the symptoms so far, it looks like the JobMaster is getting stuck during initialization.
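
Once the dump is saved to a file (for example with `jstack <JM pid> > jm.jstack`), it could be filtered roughly like this. This is only a minimal sketch: the file name and the helper `find_threads` are hypothetical and not part of Flink, and it assumes the usual HotSpot dump layout with thread entries separated by blank lines.

```python
import re

# Header line of a thread entry in a HotSpot jstack dump, e.g.
# "main" #1 prio=5 os_prio=0 tid=... nid=... runnable [...]
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')
THREAD_STATE = re.compile(r'java\.lang\.Thread\.State:\s+(?P<state>\w+)')


def find_threads(dump_text, keyword):
    """Return (name, state, frames) for each thread whose frames mention keyword."""
    matches = []
    # Assumption: thread entries are separated by blank lines.
    for block in re.split(r'\n\s*\n', dump_text):
        lines = block.strip().splitlines()
        if not lines:
            continue
        header = THREAD_HEADER.match(lines[0])
        if header is None:
            continue  # not a thread entry (banner, summary lines, ...)
        state_match = THREAD_STATE.search(block)
        state = state_match.group('state') if state_match else 'UNKNOWN'
        # Stack frames are the lines that start with "at " once stripped.
        frames = [ln.strip() for ln in lines if ln.strip().startswith('at ')]
        if any(keyword in frame for frame in frames):
            matches.append((header.group('name'), state, frames))
    return matches
```

Calling `find_threads(open("jm.jstack").read(), "JobMaster")` would then list the threads that have a JobMaster frame on their stack; the keyword is a guess, and something like "hive" or "hudi" may be more telling here.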

Best,
Lijie

On Wed, Jul 13, 2022 at 9:56 AM ynz...@163.com <ynz...@163.com> wrote:

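> Yes, 192.168.10.227:35961 is the TM address.
> By "repeatedly initializing" I mean that on the Overview page of the Flink web UI, the job's status in the Running Job List stays at INITIALIZING.
> There are no TM logs, and I have not yet figured out why it exited; the TM page of the Flink web UI shows no information at all the whole time.
> Below is the list of log files; I did not find anything useful in them: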
> directory.info : Total file length is 7201 bytes.
> jobmanager.err : Total file length is 588 bytes.
> jobmanager.log : Total file length is 82894 bytes.
> jobmanager.out : Total file length is 0 bytes.
> launch_container.sh : Total file length is 21758 bytes.
> prelaunch.err : Total file length is 0 bytes.
> prelaunch.out : Total file length is 100 bytes.
>
>
>
> best,
> ynz...@163.com
>
> From: Weihua Hu
> Date: 2022-07-12 23:18
> To: user-zh
> Subject: Re: Re: flink-hudi-hive
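> From this log alone I cannot see a continuous failover. Which task do you mean by "repeatedly initializing"?
> I do see some akka connection errors, so the corresponding TM may have exited abnormally. Could you confirm whether 192.168.10.227:35961 is the TaskManager address, and why it exited?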
>
> Best,
> Weihua
>
>
> On Tue, Jul 12, 2022 at 9:37 AM ynz...@163.com <ynz...@163.com> wrote:
>
> > Here are all of the JobManager logs:
> > 2022-07-12 09:33:02,280 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: execution.shutdown-on-attached-exit, false
> > 2022-07-12 09:33:02,280 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: pipeline.jars, file:/home/dataxc/opt/flink-1.14.4/opt/flink-python_2.11-1.14.4.jar
> > 2022-07-12 09:33:02,280 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: execution.checkpointing.min-pause, 8min
> > 2022-07-12 09:33:02,280 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: restart-strategy, failure-rate
> > 2022-07-12 09:33:02,280 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: jobmanager.memory.jvm-metaspace.size, 128m
> > 2022-07-12 09:33:02,280 INFO  org.apache.flink.configuration.GlobalConfiguration           [] - Loading configuration property: state.checkpoints.dir, hdfs:///flink/checkpoints
> > 2022-07-12 09:33:02,382 WARN  akka.remote.transport.netty.NettyTransport                  [] - Remote connection to [null] failed with java.net.ConnectException: Connection refused: n103/192.168.10.227:35961
> > 2022-07-12 09:33:02,383 WARN  akka.remote.ReliableDeliverySupervisor                  [] - Association with remote system [akka.tcp://flink@n103:35961] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@n103:35961]] Caused by: [java.net.ConnectException: Connection refused: n103/192.168.10.227:35961]
> > 2022-07-12 09:33:02,399 INFO  org.apache.flink.runtime.rpc.akka.AkkaRpcService             [] - Starting RPC endpoint for org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager at akka://flink/user/rpc/resourcemanager_1 .
> > 2022-07-12 09:33:02,405 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Starting the resource manager.
> > 2022-07-12 09:33:02,479 INFO  org.apache.hadoop.yarn.client.ConfiguredRMFailoverProxyProvider [] - Failing over to rm2
> > 2022-07-12 09:33:02,509 INFO  org.apache.flink.yarn.YarnResourceManagerDriver              [] - Recovered 0 containers from previous attempts ([]).
> > 2022-07-12 09:33:02,509 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Recovered 0 workers from previous attempt.
> > 2022-07-12 09:33:02,514 WARN  akka.remote.transport.netty.NettyTransport                  [] - Remote connection to [null] failed with java.net.ConnectException: Connection refused: n103/192.168.10.227:35961
> > 2022-07-12 09:33:02,515 WARN  akka.remote.ReliableDeliverySupervisor                  [] - Association with remote system [akka.tcp://flink@n103:35961] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@n103:35961]] Caused by: [java.net.ConnectException: Connection refused: n103/192.168.10.227:35961]
> > 2022-07-12 09:33:02,528 INFO  org.apache.hadoop.conf.Configuration                  [] - resource-types.xml not found
> > 2022-07-12 09:33:02,528 INFO  org.apache.hadoop.yarn.util.resource.ResourceUtils           [] - Unable to find 'resource-types.xml'.
> > 2022-07-12 09:33:02,538 INFO  org.apache.flink.runtime.externalresource.ExternalResourceUtils [] - Enabled external resources: []
> > 2022-07-12 09:33:02,541 INFO  org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl [] - Upper bound of the thread pool size is 500
> > 2022-07-12 09:33:02,584 WARN  akka.remote.transport.netty.NettyTransport                  [] - Remote connection to [null] failed with java.net.ConnectException: Connection refused: n103/192.168.10.227:35961
> > 2022-07-12 09:33:02,585 WARN  akka.remote.ReliableDeliverySupervisor                  [] - Association with remote system [akka.tcp://flink@n103:35961] has failed, address is now gated for [50] ms. Reason: [Association failed with [akka.tcp://flink@n103:35961]] Caused by: [java.net.ConnectException: Connection refused: n103/192.168.10.227:35961]
> >
> >
> >
> > best,
> > ynz...@163.com
> >
> > From: Weihua Hu
> > Date: 2022-07-11 19:46
> > To: user-zh
> > Subject: Re: flink-hudi-hive
> > Hi,
> > Does "repeatedly initializing" mean the job keeps failing over? The reason for a job failover can be found in jobmanager.log; search for the keyword "to FAILED".
> >
> > Best,
> > Weihua
> >
> >
> > On Mon, Jul 11, 2022 at 2:46 PM ynz...@163.com <ynz...@163.com> wrote:
> >
> > > Hi,
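> > >     I am using Flink to write data to Hudi and sync it to Hive. After submitting the job to YARN, the Flink web UI shows the job repeatedly initializing, and the Task Managers page shows no information. The logs contain no clear error message either.
> > >     When I remove the sync_hive-related settings from the code and leave all other settings unchanged, the data is written to Hudi normally.
> > >     I am using hudi-0.11.1, flink-1.14.4, hadoop-3.3.1, and hive-3.1.3.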
> > >
> > >
> > >
> > > best,
> > > ynz...@163.com
> > >
> >
>
