Re: Understanding RocksDBStateBackend in Flink on Yarn on AWS EMR

2024-03-26 Thread Yang Wang
…storage solution like HDFS or S3. Generally, EMR-based Hadoop NN runs on port 8020. You may find the NN IP details from the EMR service. Hope this helps. -A On Thu, Mar 21, 2024 at 10:54 PM Sachin Mittal…

Re: Understanding RocksDBStateBackend in Flink on Yarn on AWS EMR

2024-03-22 Thread Sachin Mittal
Hadoop NN runs on port 8020. You may find the NN IP details from the EMR service. Hope this helps. -A On Thu, Mar 21, 2024 at 10:54 PM Sachin Mittal wrote: > Hi, > We are using AWS EMR where we can submit our flink jobs to a long-running…

Re: Understanding RocksDBStateBackend in Flink on Yarn on AWS EMR

2024-03-22 Thread Asimansu Bera
-A On Thu, Mar 21, 2024 at 10:54 PM Sachin Mittal wrote: > Hi, > We are using AWS EMR where we can submit our flink jobs to a long-running > flink cluster on Yarn. > We wanted to configure RocksDBStateBackend as our state backend to store > our checkpoints. > So we have…

Understanding RocksDBStateBackend in Flink on Yarn on AWS EMR

2024-03-21 Thread Sachin Mittal
Hi, We are using AWS EMR where we can submit our flink jobs to a long-running flink cluster on Yarn. We wanted to configure RocksDBStateBackend as our state backend to store our checkpoints. So we have configured the following properties in our flink-conf.yaml: state.backend.type: rocksdb
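The setup this thread describes can be sketched as a flink-conf.yaml fragment. This is a minimal illustration, not the poster's full configuration; the checkpoint URI and host are assumed placeholders (the replies note that EMR's HDFS NameNode typically listens on port 8020):

```yaml
# Minimal sketch: RocksDB state backend with checkpoints on a durable
# filesystem such as HDFS or S3 (the URI below is an assumed example).
state.backend.type: rocksdb
state.checkpoints.dir: hdfs://<emr-nn-host>:8020/flink/checkpoints
# Incremental checkpoints are optional but common with RocksDB.
state.backend.incremental: true
```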

Re: Uneven TM Distribution of Flink on YARN

2023-09-06 Thread Lu Niu
…Zhanghao >>> *Sent:* Tuesday, August 29, 2023 12:14:53 PM >>> *To:* Lu Niu ; Weihua Hu >>> *Cc:* Kenan Kılıçtepe ; user < >>> user@flink.apache.org> >>> *Subject:* Re: Uneven TM Distribution of Flink on YARN >>> >>> CCing @Weihua Hu , wh…

Re: Uneven TM Distribution of Flink on YARN

2023-09-06 Thread Lu Niu
*From:* Lu Niu > *Date:* Thursday, September 7, 2023 at 12:17 AM > *To:* Geng Biao > *Cc:* Chen Zhanghao , Weihua Hu < > huweihua@gmail.com>, Kenan Kılıçtepe , user < > user@flink.apache.org> > *Subject:* Re: Uneven TM Distribution of Flink on YARN > > Hi, Thanks fo…

Re: Uneven TM Distribution of Flink on YARN

2023-09-06 Thread Biao Geng
duling strategy, the final distribution of apps after some time is different. Best, Biao Geng From: Lu Niu Date: Thursday, September 7, 2023 at 12:17 AM To: Geng Biao Cc: Chen Zhanghao , Weihua Hu , Kenan Kılıçtepe , user Subject: Re: Uneven TM Distribution of Flink on YARN Hi, Thanks for all you

Re: Uneven TM Distribution of Flink on YARN

2023-09-06 Thread Lu Niu
M >> *To:* Lu Niu ; Weihua Hu >> *Cc:* Kenan Kılıçtepe ; user >> *Subject:* Re: Uneven TM Distribution of Flink on YARN >> >> CCing @Weihua Hu , who is an expert on this. Do >> you have any ideas on the phenomenon here? >> >> Best, >>

Re: Uneven TM Distribution of Flink on YARN

2023-08-30 Thread Lu Niu
<https://aka.ms/o0ukef> > -- > *From:* Chen Zhanghao > *Sent:* Tuesday, August 29, 2023 12:14:53 PM > *To:* Lu Niu ; Weihua Hu > *Cc:* Kenan Kılıçtepe ; user > *Subject:* Re: Uneven TM Distribution of Flink on YARN > > CCing @Weihua Hu , who is an expert on this. Do…

Re: Uneven TM Distribution of Flink on YARN

2023-08-29 Thread Geng Biao
…user *Subject:* Re: Uneven TM Distribution of Flink on YARN CCing @Weihua Hu<mailto:huweihua@gmail.com>, who is an expert on this. Do you have any ideas on the phenomenon here? Best, Zhanghao Chen From: Lu Niu Sent: Tuesday, August 29, 2023 12:11:35 PM To: C…

Re: Uneven TM Distribution of Flink on YARN

2023-08-28 Thread Chen Zhanghao
…Subject: Re: Uneven TM Distribution of Flink on YARN Thanks for your reply. The interesting fact is that we also manage Spark on YARN; however, only the Flink cluster is having the issue. I am wondering whether there is a difference in the implementation on the Flink side. Best Lu On Mon, Aug 28, 2023 at 8…

Re: Uneven TM Distribution of Flink on YARN

2023-08-28 Thread Lu Niu
…affects Standalone-mode Flink > clusters, and does not take effect on a Flink cluster on YARN. > > Best, > Zhanghao Chen > -- > *From:* Lu Niu > *Sent:* August 29, 2023 4:30 > *To:* Kenan Kılıçtepe > *Cc:* user > *Subject:* Re: Uneven TM Distribution of Flink…

Re: Uneven TM Distribution of Flink on YARN

2023-08-28 Thread Chen Zhanghao
mode Flink clusters, and does not take effect on a Flink cluster on YARN. Best, Zhanghao Chen From: Lu Niu Sent: August 29, 2023 4:30 To: Kenan Kılıçtepe Cc: user Subject: Re: Uneven TM Distribution of Flink on YARN Thanks for the reply. We've already set cluster.evenly…

Re: Uneven TM Distribution of Flink on YARN

2023-08-28 Thread Lu Niu
Thanks for the reply. We've already set cluster.evenly-spread-out-slots = true Best Lu On Mon, Aug 28, 2023 at 1:23 PM Kenan Kılıçtepe wrote: > Have you checked config param cluster.evenly-spread-out-slots ? > > > On Mon, Aug 28, 2023 at 10:31 PM Lu Niu wrote: > >> Hi, Flink users >> >> We

Re: Uneven TM Distribution of Flink on YARN

2023-08-28 Thread Kenan Kılıçtepe
Have you checked config param cluster.evenly-spread-out-slots ? On Mon, Aug 28, 2023 at 10:31 PM Lu Niu wrote: > Hi, Flink users > > We have recently observed that the allocation of Flink TaskManagers in our > YARN cluster is not evenly distributed. We would like to hear your thoughts > on
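The option suggested above is set in flink-conf.yaml. A minimal sketch, with the caveat a later reply in this thread raises (in the Flink version discussed, the option only takes effect on Standalone clusters, not on Flink on YARN):

```yaml
# Sketch: ask the scheduler to spread slots evenly across TaskManagers.
# Per the thread, this does not take effect on a Flink-on-YARN cluster
# in the version discussed (Flink 1.15.1).
cluster.evenly-spread-out-slots: true
```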

Uneven TM Distribution of Flink on YARN

2023-08-28 Thread Lu Niu
Hi, Flink users We have recently observed that the allocation of Flink TaskManagers in our YARN cluster is not evenly distributed. We would like to hear your thoughts on this matter. 1. Our setup includes Flink version 1.15.1 and Hadoop 2.10.0. 2. The uneven distribution is that out of a

flink 1.17.1: flink on yarn submission cannot find the configuration file

2023-08-01 Thread guanyq
/opt/flink/flink-1.17.1/bin/flink run-application -t yarn-application -yjm 1024m -ytm 1024m ./xx-1.0.jar ./config.properties The configuration file is specified in the submit command above, so why does it look for the config file inside the container? file /home/yarn/nm/usercache/root/appcache/application_1690773368385_0092/container_e183_1690773368385_0092_01_01/./config.properties does…

Re: flink on yarn: RocksDB memory overuse

2023-06-07 Thread Hangxiang Yu
Hi, the memory used by RocksDB is currently not strictly bounded; see this ticket: https://issues.apache.org/jira/browse/FLINK-15532 To locate the memory usage, start with the coarse metrics: https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#rocksdb-native-metrics To further pin down the detailed memory usage of RocksDB inside a single instance, you may need malloc…

Re: Web UI don't show up In Flink on Yarn (Flink 1.17)

2023-05-25 Thread Weihua Hu
…and this info came from the yarn resourcemanager > > Get Outlook for iOS <https://aka.ms/o0ukef> > -- > *From:* tan yao > *Sent:* Thursday, May 25, 2023 8:14:45 PM > *To:* Weihua Hu > *Cc:* user > *Subject:* Re: Web UI don't show up In Flin…

Re: Web UI don't show up In Flink on Yarn (Flink 1.17)

2023-05-25 Thread tan yao
and this info came from the yarn resourcemanager Get Outlook for iOS<https://aka.ms/o0ukef> From: tan yao Sent: Thursday, May 25, 2023 8:14:45 PM To: Weihua Hu Cc: user Subject: Re: Web UI don't show up In Flink on Yarn (Flink 1.17) yes i have tried ip direct…

Re: Web UI don't show up In Flink on Yarn (Flink 1.17)

2023-05-25 Thread Weihua Hu
Hi, Are there any reported exceptions? Did you try using curl to query the REST API, such as "curl http://{ip:port}/overview"? Best, Weihua On Thu, May 25, 2023 at 8:49 AM tan yao wrote: > Hi all, > I find a strange thing with flink 1.17 deployed on yarn (CDH 6.x), flink > web ui can not show…

Web UI don't show up In Flink on Yarn (Flink 1.17)

2023-05-24 Thread tan yao
Hi all, I find a strange thing with flink 1.17 deployed on yarn (CDH 6.x): the flink web ui cannot be reached from the yarn web "ApplicationMaster" link, even when typing the jobmanager IP directly into the browser. When I run the wordcount application from the flink 1.17 examples and click the yarn web "ApplicationMaster" link…

Re:Re: Re: Re: Re: flink on yarn: abnormal power outage question

2023-03-13 Thread guanyq
I simulated a power outage yesterday. The timestamps of the 10 HA files are staggered, one every 5 seconds. Not all chk-xxx are corrupted; some can be used for startup, which I also tried. The current situation: after the YARN cluster restarts from the outage, it first loops over the 10 HA files to start the application; the HA files record the chk metadata. 1. If all HA files are corrupted, the flink application cannot be brought up even if a chk is intact. What we want now is for HDFS to keep at least one usable set of HA files plus the corresponding chk. With a chk every 5 seconds and 10 retained, corruption can still prevent startup. 5s * 10 = 50 seconds; we would also like to know how long a retention span guarantees one uncorrupted HA/chk pair. On 2023-03-14…

Re: Re: Re: Re: flink on yarn: abnormal power outage question

2023-03-13 Thread Guojun Li
Hi, please confirm whether the last modification times of these HA files are identical or staggered. Also, have you tried restoring from a specified chk-? Does that work? Best, Guojun On Fri, Mar 10, 2023 at 11:56 AM guanyq wrote: > the flink ha path is /tmp/flink/ha/ > the flink chk path is /tmp/flink/checkpoint > > I am not sure yet whether the HA files are corrupted or all chks are; that needs to be verified by simulation. > > It does try to recover from the 10 chks; the log shows it >

Re:Re: Re: flink on yarn: question about YARN retrying the flink job

2023-03-13 Thread guanyq
Understood, many thanks. On 2023-03-13 16:57:18, "Weihua Hu" wrote: >The image is not visible; you can upload it to an image host and paste the link on the mailing list. > >YARN relaunching the AM is also governed by "yarn.application-attempt-failures-validity-interval"[1]: >it only exits after the configured number of failures within that interval. > >[1]

Re: Re: flink on yarn: question about YARN retrying the flink job

2023-03-13 Thread Weihua Hu
The image is not visible; you can upload it to an image host and paste the link on the mailing list. YARN relaunching the AM is also governed by "yarn.application-attempt-failures-validity-interval"[1]: it only exits after the configured number of failures within that interval. [1] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#yarn-application-attempt-failures-validity-interval Best, Weihua On Mon, Mar 13, 2023 at…

Re:Re: flink on yarn: question about YARN retrying the flink job

2023-03-13 Thread guanyq
The image is attached, but in practice it exceeded 10 attempts. On 2023-03-13 15:39:39, "Weihua Hu" wrote: >Hi, > >The image is not visible > >With this configuration, YARN should only launch the JobManager 10 times. > >Best, >Weihua > >On Mon, Mar 13, 2023 at 3:32 PM guanyq wrote: >> flink 1.10, configured as follows >> yarn.application-attempts = 10 (YARN attempts to start the flink job 10 times) >>

Re: flink on yarn: question about YARN retrying the flink job

2023-03-13 Thread Weihua Hu
Hi, the image is not visible. With this configuration, YARN should only launch the JobManager 10 times. Best, Weihua On Mon, Mar 13, 2023 at 3:32 PM guanyq wrote: > flink 1.10, configured as follows > yarn.application-attempts = 10 (YARN attempts to start the flink job 10 times) > My understanding is that YARN tries to start the flink job 10 times and fails if it cannot come up, but the YARN application page shows 11 attempts, as in the image below >

flink on yarn: question about YARN retrying the flink job

2023-03-13 Thread guanyq
flink 1.10, configured as follows: yarn.application-attempts = 10 (YARN attempts to start the flink job 10 times). My understanding is that YARN tries to start the flink job 10 times and fails if it cannot come up, but the YARN application page shows 11 attempts, as in the image below. Doesn't each sequence number in appattempt_1678102326043_0006_000409 represent one attempt?
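The interplay the replies describe can be sketched as a flink-conf.yaml fragment; the interval value below is an assumed example, not taken from the thread:

```yaml
# Sketch: YARN relaunches the AM up to the attempt limit, but attempts
# only count toward the limit when they fail within the validity
# interval, so the total observed attempts can exceed the limit.
yarn.application-attempts: 10
yarn.application-attempt-failures-validity-interval: 600000  # ms (assumed value)
```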

Re:Re: Re: Re: flink on yarn: abnormal power outage question

2023-03-09 Thread guanyq
The flink ha path is /tmp/flink/ha/ and the flink chk path is /tmp/flink/checkpoint. I am not sure yet whether the HA files are corrupted or all chks are; that needs to be verified by simulation. It does try to recover from the 10 chks; the log shows: 2023-03-07 18:37:43,703 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Recovering checkpoints from ZooKeeper.

Re: Re: Re: flink on yarn: abnormal power outage question

2023-03-09 Thread Weihua Hu
Hi, in general a power outage of the YARN cluster alone should not affect already-completed historical checkpoints (the last checkpoint may fail while writing to HDFS). Do you have more detailed JobManager logs? First confirm how many completedCheckpoints Flink found during recovery and which cp it finally tried to restore from. You can also try what Yanfei suggested: restore from a historical cp as a savepoint. Best, Weihua On Fri, Mar 10, 2023 at 10:38 AM guanyq wrote: > incremental chk is not enabled >

Re:Re: Re: flink on yarn: abnormal power outage question

2023-03-09 Thread guanyq
Incremental chk is not enabled. The file corruption was seen in the startup log: startup was attempted from the 10 chks, but all failed because of the corrupted block below. Error log: java.io.IOException: Got error, status message opReadBlock BP-1003103929-192.168.200.11-1668473836936:blk_1301252639_227512278 received exception org.apache.hadoop.hdfs.server.datanode.CorruptMetaHeaderException: The meta file length…

Re: Re: flink on yarn: abnormal power outage question

2023-03-09 Thread Yanfei Lei
Hi, you can look for historical chk-x folders under the configured checkpoint dir; if they exist, you can try manually specifying a chk for restart [1]. > The job keeps 10 checkpoints at 5-second intervals; after a sudden power outage, why are all checkpoints corrupted? Is incremental checkpointing enabled for this job? With incremental checkpoints, if a file of an earlier checkpoint is corrupted, later checkpoints built incrementally on top of it are affected as well. >

Re:Re: flink on yarn: abnormal power outage question

2023-03-09 Thread guanyq
We are also considering savepoints to handle the power-outage problem, but I still have a question: the job keeps 10 checkpoints at 5-second intervals; after a sudden power outage, why were all of them corrupted? It is strange; perhaps none of the 10 checkpoints were flushed to disk. I would like to ask about the checkpoint flushing mechanism, which should relate to HDFS writes: the flink checkpoint succeeds, yet HDFS has not persisted it. On 2023-03-10 08:47:11, "Shammon FY" wrote: >Hi > >I think Flink…

Re: flink on yarn: abnormal power outage question

2023-03-09 Thread Shammon FY
Hi, when a Flink job fails to recover, the job itself can hardly determine that the failure was caused by something like corrupted checkpoint file blocks. If you have taken a savepoint for the job, you can try restoring from that savepoint. Best, Shammon On Thu, Mar 9, 2023 at 10:06 PM guanyq wrote: > Background > 1. flink HA is configured > 2. the number of retained checkpoints is 10 > 3. the yarn cluster is configured for job recovery > Question > After the yarn cluster restarts from a power outage, when recovering the flink job, if the latest checkpoint has blocks corrupted by the outage, will it try to start from another checkpoint

flink on yarn: abnormal power outage question

2023-03-09 Thread guanyq
Background: 1. flink HA is configured 2. the number of retained checkpoints is 10 3. the yarn cluster is configured for job recovery. Question: after the yarn cluster restarts from a power outage, when recovering the flink job, if the latest checkpoint has blocks corrupted by the outage, will it try to start from another checkpoint?
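A later reply in this thread does the arithmetic on the retention window: 10 retained checkpoints at a 5-second interval cover only about 50 seconds of history, so an outage that corrupts recent HDFS writes can invalidate every retained checkpoint. A trivial sketch of that calculation:

```shell
# Retained-checkpoint window = interval * number of retained checkpoints.
interval_s=5   # checkpoint interval from the thread
retained=10    # retained checkpoint count from the thread
echo $(( interval_s * retained ))  # prints 50
```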

Re: Flink on yarn: "TaskManager with id ... is no longer reachable" after running for a while

2023-02-16 Thread Shammon FY
Hi, the TM heartbeat becoming unreachable above usually means the TM exited; check the exit reason. For the checkpoint timeout below, check for backpressure, look at the checkpoint execution time, and consider increasing the checkpoint timeout. Best, Shammon On Thu, Feb 16, 2023 at 10:34 AM lxk wrote: > Hi, you can dump the memory and analyze it > > On 2023-02-16 10:05:19, "Fei Han" wrote: > >@all > >Hello! My Flink…

Flink on yarn: "TaskManager with id ... is no longer reachable" after running for a while

2023-02-15 Thread Fei Han
@all Hello! My Flink version is 1.14.5 and the CDC version is 2.2.1. After running on yarn for a while, the following error appears: org.apache.flink.runtime.jobmaster.JobMasterException: TaskManager with id container_e506_1673750933366_49579_01_02(hdp-server-010.yigongpin.com:8041) is no longer reachable. at…

Re: Deploy Flink on YARN or Kubernetes.

2022-12-20 Thread Biao Geng
observations of myself and hope it can provide more information for the discussion: 1. stability: Flink on YARN module and Hadoop ecosystem have developed for a longer period of time than Flink on K8S and K8S ecosystem. The codebase of Flink on YARN module is more stable and it could be easier to get relevant

Re: Deploy Flink on YARN or Kubernetes.

2022-12-20 Thread Márton Balassi
Hi Ruibin, Given that you are starting fresh I would recommend going with Kubernetes and specifically checking out the Flink Kubernetes Operator. [1] I worked with Yarn for years before I transitioned to Kubernetes a year ago and I am pleased that we made the jump. To address your point on a…

Deploy Flink on YARN or Kubernetes.

2022-12-18 Thread Ruibin Xing
Hi all, We are currently setting up a new Flink cluster and are trying to decide on the best deployment method. As far as we know, Flink supports two resource providers: YARN and Kubernetes. We are having difficulty evaluating the pros and cons of each provider, particularly in terms of

Re: flink on yarn: job crashes and restarts repeatedly

2022-07-25 Thread Weihua Hu
Check whether the JobManager was OOM-killed due to insufficient memory; if you have more logs, please paste them. Best, Weihua On Mon, Jul 18, 2022 at 8:41 PM SmileSmile wrote: > hi, all > In this scenario (flink on yarn, parallelism 3000, the job containing multiple agg operations), recovery from checkpoint > or savepoint reliably fails and the job restarts repeatedly > the JM reports org.apache.flink.runtime.entrypoint.Clust…

Re: flink on yarn job always restart

2022-07-18 Thread SmileSmile
for a component like JM that doesn't run business logic (job parallelism is 3000, with multiple agg operations and sinks) Replied Message | From | Geng Biao | | Date | 07/18/2022 23:31 | | To | SmileSmile | | Cc | user | | Subject | Re: flink on yarn job always restart | The log shows

Re: flink on yarn job always restart

2022-07-18 Thread Geng Biao
. Cluster entrypoint is the driver to launch the flink cluster on YARN, not JM or TM process. The zk HA is for JM(i.e. starting a new JM when previous JM fails) and TM is managed by JM which, IIUC, does not directly interact with zk. It is possible that JM will be restarted repeated (check details

Re: flink on yarn job always restart

2022-07-18 Thread SmileSmile
it received SIGNAL 15 2. is it because of some configuration? (e.g. a deploy timeout causing a kill?) Replied Message | From | Geng Biao | | Date | 07/18/2022 22:36 | | To | SmileSmile、user | | Cc | | | Subject | Re: flink on yarn job always restart | Hi, One possible direction is to check your

Re: flink on yarn job always restart

2022-07-18 Thread Geng Biao
not the root cause. Best, Biao Geng From: SmileSmile Date: Monday, July 18, 2022 at 8:46 PM To: user Subject: flink on yarn job always restart hi all we meet a situation, parallelism 3000,the job contains multiple agg operation,the job recover from checkpoint or savepoint must be unrecoverable

Re: flink on yarn job always restart

2022-07-18 Thread SmileSmile
/2022 21:19 | | To | SmileSmile、user | | Cc | | | Subject | Re: flink on yarn job always restart | Hi, could you provide the whole JM log? Best, Zhanghao Chen From: SmileSmile Sent: Monday, July 18, 2022 20:46 To: user Subject: flink on yarn job always restart hi all we meet a situation

Re: flink on yarn job always restart

2022-07-18 Thread Zhanghao Chen
Hi, could you provide the whole JM log? Best, Zhanghao Chen From: SmileSmile Sent: Monday, July 18, 2022 20:46 To: user Subject: flink on yarn job always restart hi all we meet a situation, parallelism 3000,the job contains multiple agg operation,the job

flink on yarn job always restart

2022-07-18 Thread SmileSmile
hi all we have a situation: parallelism 3000, the job contains multiple agg operations; recovery from checkpoint or savepoint reliably fails and the job restarts repeatedly. JM error log: org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - RECEIVED SIGNAL 15: SIGTERM.

flink on yarn: job crashes and restarts repeatedly

2022-07-18 Thread SmileSmile
hi, all In this scenario (flink on yarn, parallelism 3000, the job containing multiple agg operations), recovery from checkpoint or savepoint reliably fails and the job restarts repeatedly. The JM reports org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested. Any good troubleshooting ideas?

Re: Flink on yarn: how to get a springboot bean when parallelism > 1?

2022-04-22 Thread tison
@duwenwen I am curious what your operator does: if you just need to initialize globally exactly once, use an operator with parallelism=1. With parallelism=n you would have to rely on an external system to guarantee once-only initialization. Best, tison. Paul Lam wrote on Fri, Apr 22, 2022 at 18:16: > Sounds like starting springboot inside Flink? A very interesting architecture, somewhat similar to statefun. Can you describe the background? > > Also attach the flink deployment mode and version, so that people can judge where the problem is. > > Best, >

Re: Flink on yarn: how to get a springboot bean when parallelism > 1?

2022-04-22 Thread Paul Lam
Sounds like starting springboot inside Flink? A very interesting architecture, somewhat similar to statefun. Can you describe the background? Also attach the flink deployment mode and version, so that people can judge where the problem is. Best, Paul Lam > On Apr 22, 2022 at 16:30, duwenwen wrote: > > Hello: > First, thank you very much for reading my email. I am new to writing code; I ran into some problems using the flink framework and hope to get your answers. >

Flink on yarn: how to get a springboot bean when parallelism > 1?

2022-04-22 Thread duwenwen
Hello: First, thank you very much for reading my email. I am new to writing code; I ran into some problems using the flink framework and hope to get your answers. Due to requirements, I need to combine springboot with flink; I obtain the springboot context in the open method to fetch beans. With parallelism set to 1, it deploys to the cluster and runs normally, but when parallelism > 1, the springboot environment is initialized multiple times and errors occur at runtime. When parallelism > 1…

Re: flink on yarn: exception when stopping a job

2022-03-08 Thread Jiangang Liu
The exception message is clear: during the savepoint, some task was not in the running state. Check whether your job failed over. QiZhu Chan wrote on Tue, Mar 8, 2022 at 17:37: > Hi, > > Community experts, could you help look at what caused the error below? Normally the client log should return a savepoint path, but the following exception appeared instead; meanwhile the job has been stopped, and a savepoint file produced by the current job is found on HDFS > >

Re: flink on yarn: AM attempt fails after the HDFS_DELEGATION_TOKEN is cleared

2022-02-10 Thread xieyi
/hadoop-yarn-site/src/site/markdown/YarnApplicationSecurity.md#securing-long-lived-yarn-services I would like to know how flink on yarn solves hadoop delegation token expiry; the official docs do not seem clear enough. We currently hit the following production failure: flink 1.12 on yarn, with the yarn nodemanagers deployed in containers that occasionally crash and restart. Once a flink job has run for more than 7 days, if the nodemanager hosting the job's JM (AM) restarts, the AM will…

flink on yarn: AM attempt fails after the HDFS_DELEGATION_TOKEN is cleared

2022-02-10 Thread xieyi
Hello teachers: A question: since the hadoop delegation token expires and is cleared after the max lifetime (7 days by default), yarn mentions three strategies for long-running jobs: https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-site/src/site/markdown/YarnApplicationSecurity.md#securing-long-lived-yarn-services I would like to know how flink on yarn…

Re:Re: flink on yarn: accessing multiple HDFS clusters

2021-12-06 Thread casel.chen
How should it be configured with two OSS or S3 buckets (each bucket having its own accessKey/secret)? For example, writing data to bucketA but checkpointing to bucketB. On 2021-12-06 18:59:46, "Yang Wang" wrote: >You can try shipping a local hadoop conf and setting the HADOOP_CONF_DIR environment variable > >-yt /path/of/my-hadoop-conf >-yD

Re: flink on yarn: accessing multiple HDFS clusters

2021-12-06 Thread Yang Wang
You can try shipping a local hadoop conf and setting the HADOOP_CONF_DIR environment variable: -yt /path/of/my-hadoop-conf -yD containerized.master.env.HADOOP_CONF_DIR='$PWD/my-hadoop-conf' -yD containerized.taskmanager.env.HADOOP_CONF_DIR='$PWD/my-hadoop-conf' Best, Yang chenqizhu wrote on Tue, Nov 30, 2021: > all, hello: >
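Written out as a full submit command, the suggestion above looks roughly like this; the conf path and jar name are assumed placeholders, not from the thread:

```shell
# Sketch: ship a local hadoop conf directory with the job and point
# both the JobManager and TaskManager containers at it.
flink run -m yarn-cluster \
  -yt /path/of/my-hadoop-conf \
  -yD containerized.master.env.HADOOP_CONF_DIR='$PWD/my-hadoop-conf' \
  -yD containerized.taskmanager.env.HADOOP_CONF_DIR='$PWD/my-hadoop-conf' \
  my-job.jar
```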

flink on yarn: accessing multiple HDFS clusters

2021-11-29 Thread chenqizhu
all, hello: flink 1.13 supports configuring hadoop properties in flink-conf.yaml via flink.hadoop.*. We want to write checkpoints to an HDFS cluster with SSDs (call it cluster B) to speed up checkpoint writes, but that HDFS cluster is not the flink client's local default HDFS (the default is cluster A). So in flink-conf.yaml we configured the nameservices of both clusters, similar to HDFS federation, to reach both HDFS clusters. The concrete configuration: flink.hadoop.dfs.nameservices:

Re: flink on yarn pre_job submission fails, but session mode succeeds

2021-11-04 Thread JasonLee
hi … jar … Flink … Best JasonLee On Nov 4, 2021 at 18:41, <2572805...@qq.com.INVALID> wrote: yarn error log: org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to initialize the cluster entrypoint YarnJobClusterEntrypoint. at

Re: flink on yarn pre_job submission fails, but session mode succeeds

2021-11-04 Thread 刘建刚
That cannot be determined from the information above; you can check the detailed log via the link inside: http://ark1.analysys.xyz:8088/cluster/app/application_1635998548270_0028 陈卓宇 <2572805...@qq.com.invalid> wrote on Thu, Nov 4, 2021 at 18:29: > yarn error log: > org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to > initialize the cluster entrypoint

flink on yarn pre_job submission fails, but session mode succeeds

2021-11-04 Thread 陈卓宇
yarn error log: org.apache.flink.runtime.entrypoint.ClusterEntrypointException: Failed to initialize the cluster entrypoint YarnJobClusterEntrypoint. at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:200) at

flink 1.13.1 … yarn-application … mysql … hive … 16G+ … Taskmanager …

2021-11-04 Thread Asahi Lee
hi! … flink sql … mysql … hive … yarn-application … 16G …

Re: How are logs and checkpoints of Flink on yarn monitored in production?

2021-08-31 Thread JasonLee
Hi, you can refer to these two articles: https://mp.weixin.qq.com/s/2S4M8p-rBRinIRxmZrZq5Q https://mp.weixin.qq.com/s/44SXmCAUOqSWhQrNiZftoQ Best JasonLee On Aug 31, 2021 at 13:23, guanyq wrote: flink on yarn starts many tasks in the cluster; in production, how do you monitor the task logs and checkpoints? Please advise.

How are logs and checkpoints of Flink on yarn monitored in production?

2021-08-30 Thread guanyq
flink on yarn starts many tasks in the cluster; in production, how do you monitor the task logs and checkpoints? Please advise.

RE: Upgrading from Flink on YARN 1.9 to 1.11

2021-08-20 Thread Hailu, Andreas [Engineering]
Hi David, I was able to get this working using your suggestion: 1) Deploy a Flink YARN Session Cluster, noting the host + port of the session's Job Manager. 2) Submit a Flink job using the session's details, i.e. submitting the Flink job with the '-m host:port' option. Thanks for clearing

Re: Upgrading from Flink on YARN 1.9 to 1.11

2021-08-17 Thread David Morávek
re quite brief – would you be able to > have a look at see if you can see if there’s something we’re doing that’s > clearly wrong? > > > > Something I did notice is that with the upgrade, our submissions are now > using the introduction of this ContextEnvironment#executeAsync

Re: Flink On Yarn: the Flink program fails to start in HA deployment mode

2021-08-17 Thread 周瑞
Hello, my version is 1.13.1 --Original-- From: "Yang Wang" … https://issues.apache.org/jira/browse/FLINK-19212 Best, Yang 周瑞

Re: Flink On Yarn: the Flink program fails to start in HA deployment mode

2021-08-17 Thread Yang Wang
From the error this looks like a known issue [1], already fixed in 1.11.2. [1]. https://issues.apache.org/jira/browse/FLINK-19212 Best, Yang 周瑞 wrote on Tue, Aug 17, 2021 at 11:04: > Hello: The Flink program is deployed on Yarn and started in Application Mode; it starts normally without HA, > but after configuring HA, startup fails. Please help find the cause. > > The HA configuration: > high-availability: zookeeper high-availability.storageDir:

Flink On Yarn: the Flink program fails to start in HA deployment mode

2021-08-16 Thread 周瑞
Hello: The Flink program is deployed on Yarn and started in Application Mode; it starts normally without HA, but after configuring HA, startup fails. Please help find the cause. The HA configuration: high-availability: zookeeper high-availability.storageDir: hdfs://mycluster/flink/ha high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181 high-availability.zookeeper.path.root:

RE: Upgrading from Flink on YARN 1.9 to 1.11

2021-08-16 Thread Hailu, Andreas [Engineering]
ávek Sent: Monday, August 16, 2021 6:28 AM To: Hailu, Andreas [Engineering] Cc: user@flink.apache.org Subject: Re: Upgrading from Flink on YARN 1.9 to 1.11 Hi Andreas, Per-job and session deployment modes should not be affected by this FLIP. Application mode is just a new deployment mode (where

Re: Upgrading from Flink on YARN 1.9 to 1.11

2021-08-16 Thread David Morávek
Hi Andreas, Per-job and session deployment modes should not be affected by this FLIP. Application mode is just a new deployment mode (where the job driver runs embedded within the JM) that co-exists with these two. From the information you've provided, I'd say your actual problem is this exception: ```

Upgrading from Flink on YARN 1.9 to 1.11

2021-08-13 Thread Hailu, Andreas [Engineering]
Hello folks! We're looking to upgrade from 1.9 to 1.11. Our Flink applications run on YARN and each have their own clusters, with each application having multiple jobs submitted. Our current submission command looks like this: $ run -m yarn-cluster --class com.class.name.Here -p 2 -yqu

flink on yarn error

2021-07-30 Thread wangjingen
Could someone help look at this problem? The RMClient's and YarnResourceManager's internal state about the number of pending container requests for resource has diverged. Number of client's pending container requests 1 != Number of RM's pending container requests 0;

flink on yarn … log4j …

2021-07-22 Thread comsir
hi all … flink … log4j … log4j …

Flink on yarn-cluster mode: job submission error

2021-06-08 Thread maker_d...@foxmail.com
I submit jobs on a CDH cluster using Flink on yarn-cluster mode; the submission fails with a cannot-deploy, jar-not-found error. The jar in question is one I do not use, but it does exist in flink's lib, and I have already added the lib directory to the environment variables: export HADOOP_CLASSPATH=/opt/cloudera/parcels/FLINK/lib/flink/lib The program finished with the following exception: org.apache.flink.client.program.ProgramInvocationException: The main

Re: flink on yarn log cleanup

2021-06-07 Thread 王刚
You can configure the cleanup policy in the client's log4j.properties or logback.xml; first confirm which slf4j implementation is in use. Original message From: zjfpla...@hotmail.com To: user-zh Sent: Monday, June 7, 2021 12:17 Subject: flink on yarn log cleanup Hello, Some questions: does flink on yarn mode have a log cleanup mechanism? Does it follow the log4j/logback/log4j2 cleanup policies, or is it configured on yarn? These are always-running streaming jobs, not one-shot batch jobs. zjfpla

flink on yarn log cleanup

2021-06-06 Thread zjfpla...@hotmail.com
Hello, Some questions: does flink on yarn mode have a log cleanup mechanism? Does it follow the log4j/logback/log4j2 cleanup policies, or is it configured on yarn? These are always-running streaming jobs, not one-shot batch jobs. zjfpla...@hotmail.com

Re: Re: flink on yarn: a YARN resource-manager switchover restarts the flink application, which does not recover from the last checkpoint

2021-05-31 Thread Yang Wang
HA records the last successful checkpoint counter and address in ZK; without HA enabled, it restores from the specified savepoint. Best, Yang 刘建刚 wrote on Fri, May 28, 2021 at 18:51: > Then the master failover lost the snapshot info; HA should solve this problem. > > 董建 <62...@163.com> wrote on Fri, May 28, 2021 at 18:24: > > Reliably reproducible > > Checkpoints are generated normally; confirmed in both the web UI and the HDFS directory. > > Our jobmanager has no HA; not sure whether that is the cause? > >

Re: Re: flink on yarn: a YARN resource-manager switchover restarts the flink application, which does not recover from the last checkpoint

2021-05-28 Thread 刘建刚
Then the master failover lost the snapshot info; HA should solve this problem. 董建 <62...@163.com> wrote on Fri, May 28, 2021 at 18:24: > Reliably reproducible > Checkpoints are generated normally; confirmed in both the web UI and the HDFS directory. > Our jobmanager has no HA; not sure whether that is the cause? > The log shows recovery from the -s that was specified; when -s is not specified, the restart does not use the latest checkpoint file either. > This problem has troubled me for a long time with no good lead; next I will set up HA and try again. >>

Re:Re: flink on yarn: a YARN resource-manager switchover restarts the flink application, which does not recover from the last checkpoint

2021-05-28 Thread 董建
Reliably reproducible. Checkpoints are generated normally; confirmed in both the web UI and the HDFS directory. Our jobmanager has no HA; not sure whether that is the cause? The log shows recovery from the -s that was specified; when -s is not specified, the restart does not use the latest checkpoint file either. This problem has troubled me for a long time with no good lead; next I will set up HA and try again. >> org.apache.flink.configuration.GlobalConfiguration [] - Loading >> configuration property:

Re: flink on yarn: a YARN resource-manager switchover restarts the flink application, which does not recover from the last checkpoint

2021-05-28 Thread 刘建刚
This behavior is not expected. Can it be reproduced reliably with the following steps? 1. Restore from a savepoint; 2. The job starts taking periodic savepoints; 3. The job fails over. If so, you may need to check whether the checkpoint files exist and whether zookeeper was updated. If the problem remains, the logs are needed to investigate. 董建 <62...@163.com> wrote on Fri, May 28, 2021 at 17:37: > My symptom is as follows > > 1. flink version flink-1.12.2; start command below. -s is specified because the job was cancelled and is being restarted here. > > flink run -d -s >

flink on yarn: a YARN resource-manager switchover restarts the flink application, which does not recover from the last checkpoint

2021-05-28 Thread 董建
My symptom is as follows: 1. flink version flink-1.12.2; start command below. -s is specified because the job was cancelled and is being restarted here. flink run -d -s hdfs:///user/flink/checkpoints/default/f9b85edbc6ca779b6e60414f3e3964f2/chk-100 -t yarn-per-job -m yarn-cluser -D yarn.application.name= /tmp/flink-1.0-SNAPSHOT.jar -c com.test.myStream --profile

flink on yarn: restart fails after updating a file

2021-04-30 Thread zjfpla...@hotmail.com
After stopping the flink job, I updated a related configuration file (the keytab), and then got the error: Resource hdfs://nameservice1/user/hbase/.flink/${appid}/hbase.keytab changed on src filesystem (expected , was ) zjfpla...@hotmail.com

flink on yarn kerberos authentication issue

2021-04-30 Thread zjfpla...@hotmail.com
Hello, Two issues: 1. In CDH, with kerberos already managed by CM: after changing the kerberos configuration in CM, /var/kerberos/krb5kdc/kdc.conf and /etc/krb5.conf both stay unchanged; the config seems to live somewhere else. Does anyone know where? 2. Running flink 1.8 on cdh5.14 yarn, after a day it reports GSS initiate failed{caused by GSSException:No valid credentials

Problems starting flink on a yarn cluster

2021-04-21 Thread tanggen...@163.com
Hello, I ran into some problems submitting flink jobs to a yarn cluster and hope you can help. I deployed a three-node hadoop cluster with two 4c/24G worker nodes; yarn-site configures 8 vcores and 20G of usable memory per node, 16 vcores / 40G in total. I have submitted two jobs, each allocated 3 vcores and 6G of memory, consuming 6 vcores / 12G in total, which the hadoop web UI also reflects, as in the image below. But when I submit a third job, it fails with no obvious error log, even though the cluster clearly has sufficient resources. I do not know where the problem is; please advise. Appendix 1 (console output): The program

flink on yarn startup error

2021-04-18 Thread Bruce Zhang
flink on yarn per-job mode submission fails; the command is bin/flink run -m yarn-cluster -d -yjm 1024 -ytm 4096 /home/XX.jar. Yarn has sufficient resources and other programs submit fine; only this program fails on submission. But changing the command to bin/flink run -m yarn-cluster -yjm 1024 -ytm 4096 /home/testjar/XX.jar succeeds, i.e. dropping the -d parameter, but that is session mode and also affects other running programs. Error: 2021-04-19 10:08:13,116

Re: flink on yarn: connection refused with multiple TaskManagers

2021-04-12 Thread haihua
hi, did you solve this problem? Could you share any leads? -- Sent from: http://apache-flink.147419.n8.nabble.com/

Re:Re: flink on yarn session mode fails to communicate with yarn (job mode succeeds)

2021-03-22 Thread 刘乘九
Thanks a lot! I tried it, but it did not solve the issue. Both parameters are configured, and at startup they display consistently with the config. The comment above says they seem to only take effect in Standalone mode; the strange part is that per-job works smoothly while session mode cannot connect. By the way, my version is 1.11.2; please take another look when you have time. On 2021-03-23 09:28:20, "wxpcc" wrote: >For the first question, you can try configuring jobmanager.rpc.address and jobmanager.rpc.port in flink.conf >The second question I am not sure about > >-- >Sent from:

Re: flink on yarn session mode fails to communicate with yarn (job mode succeeds)

2021-03-22 Thread wxpcc
For the first question, you can try configuring jobmanager.rpc.address and jobmanager.rpc.port in flink.conf. The second question I am not sure about. -- Sent from: http://apache-flink.147419.n8.nabble.com/

flink on yarn session mode fails to communicate with yarn (job mode succeeds)

2021-03-22 Thread 刘乘九
A question for the experts: I have always submitted jobs in job mode, which works smoothly. Recently a need arose that better suits session mode, but when submitting per the forum tutorial, it keeps failing to connect to the resource manager. The startup logs show the two modes connect to different resource managers: one uses the correct port, the other keeps requesting a local port. session mode startup log: job mode startup log: Questions: 1. How do I configure the resource manager port for session mode? 2. job

Flink on yarn per-job HA

2021-03-19 Thread Ink????
Hello … flink 1.12 flink on yarn per-job HA … HA …

Re: Flink On Yarn Per Job: job submission failure

2021-02-24 Thread Robin Zhang
Hi 凌战, check whether the hadoop environment variables are set correctly; you can refer to the docs https://ci.apache.org/projects/flink/flink-docs-release-1.12/deployment/resource-providers/yarn.html#preparation Best, Robin 凌战 wrote > hi, community > With the user set to hdfs on the API side, after the job is scheduled for execution, the /user/hdfs/.flink/application-id directory contains the relevant jars, such as > -rw-r--r-- 3 hdfs supergroup

Flink On Yarn Per Job: job submission failure

2021-02-23 Thread 凌战
hi, community: With the user set to hdfs on the API side, after the job is scheduled for execution, the /user/hdfs/.flink/application-id directory contains the relevant jars, such as -rw-r--r-- 3 hdfs supergroup 9402 2021-02-24 11:02 /user/hdfs/.flink/application_1610671284452_0257/WordCount.jar -rw-r--r-- 3 hdfs supergroup 1602 2021-02-24 11:09

Re: Submitting a jar to a Flink on Yarn cluster via the local API fails with Error: Could not find or load main class org.apache.flink.yarn.entrypoint.YarnJobClusterEntrypoint

2021-02-23 Thread Smile
Hi, the class org.apache.flink.yarn.entrypoint.YarnJobClusterEntrypoint should live in the flink-yarn module and is bundled into flink-dist as a dependency when the lib package is built. Why did you add both flink-dist_2.11-1.10.1.jar and flink-yarn_2.11-1.11.1.jar? Won't they conflict? Smile -- Sent from: http://apache-flink.147419.n8.nabble.com/

Re: Re: Submitting a jar to a Flink on Yarn cluster via the local API fails with Error: Could not find or load main class org.apache.flink.yarn.entrypoint.YarnJobClusterEntrypoint

2021-02-23 Thread Smile@LETTers
Hi, the class org.apache.flink.yarn.entrypoint.YarnJobClusterEntrypoint should live in the flink-yarn module and is bundled into flink-dist as a dependency when the lib package is built. Why did you add both flink-dist_2.11-1.10.1.jar and flink-yarn_2.11-1.11.1.jar? Won't they conflict? Smile On 2021-02-23 19:27:43, "凌战" wrote: >The jars added above did not display; to supplement: besides the user jar, the dependency jars added are

Re: Submitting a jar to a Flink on Yarn cluster via the local API fails with Error: Could not find or load main class org.apache.flink.yarn.entrypoint.YarnJobClusterEntrypoint

2021-02-23 Thread 凌战
The jars added above did not display; to supplement: besides the user jar, the dependency jars added are flink-dist_2.11-1.10.1.jar flink-queryable-state-runtime_2.11-1.10.1.jar flink-shaded-hadoop-2-uber-2.7.5-10.0.jar flink-table-blink_2.11-1.10.1.jar flink-table_2.11-1.10.1.jar flink-yarn_2.11-1.11.1.jar | 凌战 | m18340872...@163.com | Signature customized by NetEase Mail Master
