Re: [flink native k8s] HA configuration: taskmanager pod keeps restarting
I can't find any TM logs, because the pod dies before the TM has even started.
I'll check whether this is the cause. So far I have indeed not added the -Dkubernetes.taskmanager.service-account option.
Is -Dkubernetes.taskmanager.service-account meant to be added when starting the session cluster with ./bin/kubernetes-session.sh?

On 2022/8/31, 4:10 PM, "Yang Wang" wrote:

> My guess is that you have not set a service account for the TM, so the TM
> has no permission to retrieve the leader from the K8s ConfigMap and
> therefore cannot register with the RM and JM.
>
> -Dkubernetes.taskmanager.service-account=wuzhiheng \
>
> Best,
> Yang
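For reference, Flink options of this kind can be passed as additional -D arguments to the same ./bin/kubernetes-session.sh call that starts the session cluster (they can also be set in the Flink configuration file). The sketch below reuses the values from the original post and adds the TaskManager service account; the long -Dkubernetes.secrets option is left out for brevity. The `session_cmd` function and the `DRY_RUN` variable are hypothetical helpers for illustration only; by default the script just prints the command instead of running it.

```shell
#!/usr/bin/env bash
# Sketch: session start command from the original post plus the
# TaskManager service account suggested in the thread.
# session_cmd and DRY_RUN are illustrative names, not part of Flink.
set -eu

session_cmd() {
  # The only new line compared with the original post is
  # -Dkubernetes.taskmanager.service-account=wuzhiheng
  set -- ./bin/kubernetes-session.sh \
    -Dkubernetes.pod-template-file=./conf/pod-template.yaml \
    -Dkubernetes.cluster-id=realtime-monitor \
    -Dkubernetes.jobmanager.service-account=wuzhiheng \
    -Dkubernetes.taskmanager.service-account=wuzhiheng \
    -Dkubernetes.namespace=monitor \
    -Dtaskmanager.numberOfTaskSlots=6 \
    -Dtaskmanager.memory.process.size=8192m \
    -Djobmanager.memory.process.size=2048m

  if [ "${DRY_RUN:-1}" = "1" ]; then
    # Print one argument per line so the command can be reviewed first.
    printf '%s\n' "$@"
  else
    "$@"
  fi
}

session_cmd
```

Running the script with DRY_RUN=0 from the Flink distribution directory would execute the command instead of printing it.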
Re: [flink native k8s] HA configuration: taskmanager pod keeps restarting
My guess is that you have not set a service account for the TM, so the TM has no permission to retrieve the leader from the K8s ConfigMap and therefore cannot register with the RM and JM.

-Dkubernetes.taskmanager.service-account=wuzhiheng \

Best,
Yang

Xuyang wrote on Tue, Aug 30, 2022 at 23:22:

> Hi, could you paste the TM logs? Judging from the WARN logs, it looks like
> the TM never manages to start.
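One way to test this guess before restarting the cluster is to ask the API server whether a given service account may read ConfigMaps in the namespace. The sketch below only prints the `kubectl auth can-i` commands so they can be reviewed and then run by hand. The namespace and the account name wuzhiheng come from the original post; "default" is the account a pod normally runs under when none is configured. The `can_i_cmds` function is a hypothetical helper, and the choice of verbs (get, list, watch) is an assumption about what leader retrieval needs, not something confirmed in this thread.

```shell
#!/usr/bin/env bash
# Sketch: print permission checks for ConfigMap access per service account.
# can_i_cmds is an illustrative helper; the verb list is an assumption.
set -eu

NAMESPACE=monitor   # namespace from the original post

can_i_cmds() {
  local sa="$1" verb
  for verb in get list watch; do
    echo "kubectl auth can-i ${verb} configmaps -n ${NAMESPACE} --as=system:serviceaccount:${NAMESPACE}:${sa}"
  done
}

# Compare the account a TM pod falls back to with the one used for the JM.
for sa in default wuzhiheng; do
  can_i_cmds "$sa"
done
```

Each printed command answers "yes" or "no" when run against the cluster. A "no" for the default account and a "yes" for wuzhiheng would be consistent with the explanation above.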
Re: [flink native k8s] HA configuration: taskmanager pod keeps restarting
Hi, could you paste the TM logs? Judging from the WARN logs, it looks like the TM never manages to start.

At 2022-08-30 03:45:43, "Wu,Zhiheng" wrote:

> [Problem description]
> After enabling the HA configuration, the taskmanager pods keep cycling
> through create-stop-create, and the job cannot be started.
> [...]
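Because the TM pods in this thread fail within a few seconds of being created, their logs are easy to miss. The sketch below prints a few kubectl commands that might help capture them while a pod is still alive. The namespace and cluster id come from the original post. The label selector (app=<cluster-id>, component=taskmanager) is an assumption about how the TaskManager pods are labelled and should be checked with `kubectl get pods --show-labels` first. The `tm_debug_cmds` function is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: print commands for catching logs of short-lived TaskManager pods.
# tm_debug_cmds is an illustrative helper; SELECTOR is an assumed label set.
set -eu

NAMESPACE=monitor              # namespace from the original post
CLUSTER_ID=realtime-monitor    # kubernetes.cluster-id from the original post
SELECTOR="app=${CLUSTER_ID},component=taskmanager"

tm_debug_cmds() {
  cat <<EOF
kubectl -n ${NAMESPACE} get pods -l ${SELECTOR} --watch
kubectl -n ${NAMESPACE} logs -f -l ${SELECTOR} --all-containers --prefix
kubectl -n ${NAMESPACE} get events --sort-by=.lastTimestamp
EOF
}

tm_debug_cmds
```

The logs command only attaches to pods that exist when it starts, so it may need to be re-run as new pods are created. The events listing can still show why a container exited after the pod itself is gone.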
[flink native k8s] HA configuration: taskmanager pod keeps restarting
[Problem description]
After enabling the HA configuration, the taskmanager pods keep cycling through create-stop-create, and the job cannot be started.

1. Job configuration and startup procedure

a) Edit the conf/flink.yaml configuration file and add the HA configuration

    kubernetes.cluster-id: realtime-monitor
    high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
    high-availability.storageDir: file:///opt/flink/checkpoint/recovery/monitor  // This is an NFS path, mounted into the pod via a PVC

b) First create a stateless deployment with the following command to set up a session cluster

    ./bin/kubernetes-session.sh \
    -Dkubernetes.secrets=cdn-res-bd-keystore:/opt/flink/kafka/res/keystore/bd,cdn-res-bd-truststore:/opt/flink/kafka/res/truststore/bd,cdn-res-bj-keystore://opt/flink/kafka/res/keystore/bj,cdn-res-bj-truststore:/opt/flink/kafka/res/truststore/bj \
    -Dkubernetes.pod-template-file=./conf/pod-template.yaml \
    -Dkubernetes.cluster-id=realtime-monitor \
    -Dkubernetes.jobmanager.service-account=wuzhiheng \
    -Dkubernetes.namespace=monitor \
    -Dtaskmanager.numberOfTaskSlots=6 \
    -Dtaskmanager.memory.process.size=8192m \
    -Djobmanager.memory.process.size=2048m

c) Finally, submit a jar job through the web UI; the jobmanager then prints the following log

    2022-08-29 23:49:04,150 INFO org.apache.flink.kubernetes.KubernetesResourceManagerDriver [] - Pod realtime-monitor-taskmanager-1-13 is created.
    2022-08-29 23:49:04,152 INFO org.apache.flink.kubernetes.KubernetesResourceManagerDriver [] - Pod realtime-monitor-taskmanager-1-12 is created.
    2022-08-29 23:49:04,161 INFO org.apache.flink.kubernetes.KubernetesResourceManagerDriver [] - Received new TaskManager pod: realtime-monitor-taskmanager-1-12
    2022-08-29 23:49:04,162 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requested worker realtime-monitor-taskmanager-1-12 with resource spec WorkerResourceSpec {cpuCores=6.0, taskHeapSize=6.005gb (6447819631 bytes), taskOffHeapSize=0 bytes, networkMemSize=711.680mb (746250577 bytes), managedMemSize=0 bytes, numSlots=6}.
    2022-08-29 23:49:04,162 INFO org.apache.flink.kubernetes.KubernetesResourceManagerDriver [] - Received new TaskManager pod: realtime-monitor-taskmanager-1-13
    2022-08-29 23:49:04,162 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requested worker realtime-monitor-taskmanager-1-13 with resource spec WorkerResourceSpec {cpuCores=6.0, taskHeapSize=6.005gb (6447819631 bytes), taskOffHeapSize=0 bytes, networkMemSize=711.680mb (746250577 bytes), managedMemSize=0 bytes, numSlots=6}.
    2022-08-29 23:49:07,176 WARN org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Reaching max start worker failure rate: 12 events detected in the recent interval, reaching the threshold 10.00.
    2022-08-29 23:49:07,176 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Will not retry creating worker in 3000 ms.
    2022-08-29 23:49:07,176 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker realtime-monitor-taskmanager-1-12 with resource spec WorkerResourceSpec {cpuCores=6.0, taskHeapSize=6.005gb (6447819631 bytes), taskOffHeapSize=0 bytes, networkMemSize=711.680mb (746250577 bytes), managedMemSize=0 bytes, numSlots=6} was requested in current attempt and has not registered. Current pending count after removing: 1.
    2022-08-29 23:49:07,176 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker realtime-monitor-taskmanager-1-12 is terminated. Diagnostics: Pod terminated, container termination statuses: [flink-main-container(exitCode=1, reason=Error, message=null)], pod status: Failed(reason=null, message=null)
    2022-08-29 23:49:07,176 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requesting new worker with resource spec WorkerResourceSpec {cpuCores=6.0, taskHeapSize=6.005gb (6447819631 bytes), taskOffHeapSize=0 bytes, networkMemSize=711.680mb (746250577 bytes), managedMemSize=0 bytes, numSlots=6}, current pending count: 2.
    2022-08-29 23:49:07,514 WARN org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Reaching max start worker failure rate: 13 events detected in the recent interval, reaching the threshold 10.00.
    2022-08-29 23:49:07,514 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker realtime-monitor-taskmanager-1-13 with resource spec WorkerResourceSpec {cpuCores=6.0, taskHeapSize=6.005gb (6447819631 bytes), taskOffHeapSize=0 bytes, networkMemSize=711.680mb (746250577 bytes), managedMemSize=0 bytes, numSlots=6} was requested in current attempt and has not registered. Current pending count after removing: 1.
    2022-08-29 23:49:07,514 INFO org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker realtime-monitor-taskmanager-1-13 is terminated. Diagnostics: Pod terminated, container termination statuses: [flink-main-container(exitCode=1, reason=Error, message=null)], pod status: Failed(reason=null, message=null)
    2022-08-29 23:49:07,515 INFO