Re: [flink native k8s] With HA enabled, the taskmanager pod keeps restarting

2022-08-31 Post by Wu,Zhiheng
I can't find any TM logs, because the pod dies before the TM even starts.
Let me check whether that is the cause. We have indeed not added the -Dkubernetes.taskmanager.service-account parameter.
Is -Dkubernetes.taskmanager.service-account supposed to be passed to ./bin/kubernetes-session.sh when starting the session cluster?
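When a pod exits before the process can write logs anywhere persistent, the container's stdout/stderr is usually still retrievable from Kubernetes for a while. A minimal sketch (pod name and namespace below are taken from this thread; adjust to your deployment):

```shell
# List TaskManager pods in the namespace used in this thread
kubectl -n monitor get pods -l component=taskmanager

# Logs of the current (possibly crash-looping) container
kubectl -n monitor logs realtime-monitor-taskmanager-1-12

# Logs of the previous, already-terminated container instance
kubectl -n monitor logs --previous realtime-monitor-taskmanager-1-12

# Events and termination reason (exit code, OOMKilled, RBAC errors, etc.)
kubectl -n monitor describe pod realtime-monitor-taskmanager-1-12
```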


Re: [flink native k8s] With HA enabled, the taskmanager pod keeps restarting

2022-08-31 Post by Yang Wang
My guess is that you have not set a service account for the TM, so the TM has no permission to fetch the leader from the K8s ConfigMap and therefore cannot register with the RM/JM.

-Dkubernetes.taskmanager.service-account=wuzhiheng \


Best,
Yang
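For reference, a hedged sketch of how such a service account with ConfigMap access could be created (names follow the thread: namespace `monitor`, account `wuzhiheng`; the exact RBAC verbs required depend on your Flink version — the Flink native Kubernetes docs list the precise permissions):

```shell
# Create the service account in the namespace used by the session cluster
kubectl -n monitor create serviceaccount wuzhiheng

# Grant it access to ConfigMaps, which Kubernetes HA uses for leader
# election and leader retrieval; the built-in "edit" ClusterRole covers
# get/list/watch/create/update/delete on ConfigMaps
kubectl -n monitor create rolebinding flink-ha-binding \
  --clusterrole=edit \
  --serviceaccount=monitor:wuzhiheng
```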


Re: [flink native k8s] With HA enabled, the taskmanager pod keeps restarting

2022-08-30 Post by Xuyang
Hi, could you paste the TM logs? From the WARN messages it looks like the TM never manages to start.

[flink native k8s] With HA enabled, the taskmanager pod keeps restarting

2022-08-29 Post by Wu,Zhiheng
[Problem description]
After enabling the HA configuration, the taskmanager pods cycle through create-terminate-create and the job never starts.

1. Job configuration and startup procedure

a)  Edit the conf/flink-conf.yaml configuration file and add the HA settings:
kubernetes.cluster-id: realtime-monitor
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
high-availability.storageDir: file:///opt/flink/checkpoint/recovery/monitor   # an NFS path, mounted into the pod via a PVC

b)  First create a stateless deployment with the following command to bring up a session cluster:

./bin/kubernetes-session.sh \
-Dkubernetes.secrets=cdn-res-bd-keystore:/opt/flink/kafka/res/keystore/bd,cdn-res-bd-truststore:/opt/flink/kafka/res/truststore/bd,cdn-res-bj-keystore:/opt/flink/kafka/res/keystore/bj,cdn-res-bj-truststore:/opt/flink/kafka/res/truststore/bj \
-Dkubernetes.pod-template-file=./conf/pod-template.yaml \
-Dkubernetes.cluster-id=realtime-monitor \
-Dkubernetes.jobmanager.service-account=wuzhiheng \
-Dkubernetes.namespace=monitor \
-Dtaskmanager.numberOfTaskSlots=6 \
-Dtaskmanager.memory.process.size=8192m \
-Djobmanager.memory.process.size=2048m
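Elsewhere in this thread the missing -Dkubernetes.taskmanager.service-account option is identified as the likely cause. A sketch of the same startup with an explicit TM service account added (abbreviated to the HA-relevant flags; the account must exist in the `monitor` namespace and have ConfigMap access):

```shell
./bin/kubernetes-session.sh \
  -Dkubernetes.cluster-id=realtime-monitor \
  -Dkubernetes.namespace=monitor \
  -Dkubernetes.jobmanager.service-account=wuzhiheng \
  -Dkubernetes.taskmanager.service-account=wuzhiheng \
  -Dtaskmanager.numberOfTaskSlots=6 \
  -Dtaskmanager.memory.process.size=8192m \
  -Djobmanager.memory.process.size=2048m
```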

c)  Finally, submit a jar job via the web UI; the jobmanager then emits the following logs:

2022-08-29 23:49:04,150 INFO  org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Pod realtime-monitor-taskmanager-1-13 is created.

2022-08-29 23:49:04,152 INFO  org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Pod realtime-monitor-taskmanager-1-12 is created.

2022-08-29 23:49:04,161 INFO  org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Received new TaskManager pod: realtime-monitor-taskmanager-1-12

2022-08-29 23:49:04,162 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requested worker realtime-monitor-taskmanager-1-12 with resource spec WorkerResourceSpec {cpuCores=6.0, taskHeapSize=6.005gb (6447819631 bytes), taskOffHeapSize=0 bytes, networkMemSize=711.680mb (746250577 bytes), managedMemSize=0 bytes, numSlots=6}.

2022-08-29 23:49:04,162 INFO  org.apache.flink.kubernetes.KubernetesResourceManagerDriver  [] - Received new TaskManager pod: realtime-monitor-taskmanager-1-13

2022-08-29 23:49:04,162 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requested worker realtime-monitor-taskmanager-1-13 with resource spec WorkerResourceSpec {cpuCores=6.0, taskHeapSize=6.005gb (6447819631 bytes), taskOffHeapSize=0 bytes, networkMemSize=711.680mb (746250577 bytes), managedMemSize=0 bytes, numSlots=6}.

2022-08-29 23:49:07,176 WARN  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Reaching max start worker failure rate: 12 events detected in the recent interval, reaching the threshold 10.00.

2022-08-29 23:49:07,176 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Will not retry creating worker in 3000 ms.

2022-08-29 23:49:07,176 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker realtime-monitor-taskmanager-1-12 with resource spec WorkerResourceSpec {cpuCores=6.0, taskHeapSize=6.005gb (6447819631 bytes), taskOffHeapSize=0 bytes, networkMemSize=711.680mb (746250577 bytes), managedMemSize=0 bytes, numSlots=6} was requested in current attempt and has not registered. Current pending count after removing: 1.

2022-08-29 23:49:07,176 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker realtime-monitor-taskmanager-1-12 is terminated. Diagnostics: Pod terminated, container termination statuses: [flink-main-container(exitCode=1, reason=Error, message=null)], pod status: Failed(reason=null, message=null)

2022-08-29 23:49:07,176 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Requesting new worker with resource spec WorkerResourceSpec {cpuCores=6.0, taskHeapSize=6.005gb (6447819631 bytes), taskOffHeapSize=0 bytes, networkMemSize=711.680mb (746250577 bytes), managedMemSize=0 bytes, numSlots=6}, current pending count: 2.

2022-08-29 23:49:07,514 WARN  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Reaching max start worker failure rate: 13 events detected in the recent interval, reaching the threshold 10.00.

2022-08-29 23:49:07,514 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker realtime-monitor-taskmanager-1-13 with resource spec WorkerResourceSpec {cpuCores=6.0, taskHeapSize=6.005gb (6447819631 bytes), taskOffHeapSize=0 bytes, networkMemSize=711.680mb (746250577 bytes), managedMemSize=0 bytes, numSlots=6} was requested in current attempt and has not registered. Current pending count after removing: 1.

2022-08-29 23:49:07,514 INFO  org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Worker realtime-monitor-taskmanager-1-13 is terminated. Diagnostics: Pod terminated, container termination statuses: [flink-main-container(exitCode=1, reason=Error, message=null)], pod status: Failed(reason=null, message=null)

2022-08-29 23:49:07,515 INFO  org.apache.flink.runtime.resourcemanager.active.Activ

Re: [flink native k8s] Submitting a job per the docs cannot find the corresponding cluster

2022-07-18 Post by Yang Wang
Your understanding is correct.

FlinkSessionJob was split out as a separate CR mainly because that better fits K8s semantics: each job in a session cluster can then be managed as a K8s resource of its own, and job state changes are promptly reflected in its Status.


Best,
Yang
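To illustrate the two CRs discussed above, here is a minimal sketch of a session cluster plus a job submitted against it. The field values (image, resources, jar URI) are hypothetical; consult the flink-kubernetes-operator documentation for the exact schema of your operator version:

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: my-session-cluster          # session cluster: no job section
spec:
  image: flink:1.15
  flinkVersion: v1_15
  serviceAccount: flink
  jobManager:
    resource: {memory: "2048m", cpu: 1}
  taskManager:
    resource: {memory: "2048m", cpu: 1}
---
apiVersion: flink.apache.org/v1beta1
kind: FlinkSessionJob
metadata:
  name: my-session-job
spec:
  deploymentName: my-session-cluster   # references the session cluster above
  job:
    jarURI: https://example.org/jars/TopSpeedWindowing.jar   # hypothetical URL
    parallelism: 2
```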


Re: [flink native k8s] Submitting a job per the docs cannot find the corresponding cluster

2022-07-14 Post by yidan zhao
A follow-up question about flink-kubernetes-operator.
Looking at its docs, it provides two CRDs, FlinkDeployment and FlinkSessionJob. Is the following understanding correct?
(1) For jobs run in application mode, use FlinkDeployment with the job section configured. It automatically creates the
Flink cluster and runs the job according to the job spec.
 This way there are no separate cluster-creation and job-submission steps; they are a single unit.
(2) To create a session cluster, also use FlinkDeployment, just without specifying a job section.
(3) Against a session cluster created via (2), jobs can then be submitted with FlinkSessionJob.


Re: [flink native k8s] Submitting a job per the docs cannot find the corresponding cluster

2022-07-12 Post by Yang Wang
If the machines in your K8s cluster also use coredns as their DNS server, they can resolve the service behind the clusterIP just fine.

ClusterIP was originally designed for use by in-cluster management pods, e.g. flink-kubernetes-operator[1].

[1]. https://github.com/apache/flink-kubernetes-operator

Best,
Yang
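For a client outside the cluster, the discussion below points to NodePort or LoadBalancer. A sketch of the NodePort route (the option name is from the Flink native Kubernetes configuration; the session command is abbreviated):

```shell
# Expose the JobManager REST endpoint as a NodePort service at cluster start
./bin/kubernetes-session.sh \
  -Dkubernetes.cluster-id=my-first-flink-cluster \
  -Dkubernetes.rest-service.exposed.type=NodePort

# Then submit from outside the cluster; the client reaches the REST
# endpoint through <node-ip>:<node-port> instead of the ClusterIP service
./bin/flink run \
  --target kubernetes-session \
  -Dkubernetes.cluster-id=my-first-flink-cluster \
  ./examples/streaming/TopSpeedWindowing.jar
```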

yidan zhao wrote on Tue, Jul 12, 2022 at 13:17:

> Submitting with flink run -m and the clusterIp specified directly does work.
> So with --target kubernetes-session
> -Dkubernetes.cluster-id=my-first-flink-cluster, why can't the client be
> smart enough to look up the clusterIp of that cluster's svc and submit through it?
>
> yidan zhao wrote on Tue, Jul 12, 2022 at 12:50:
> >
> > If I am on a k8s master node, can I use the ClusterIp directly?
> >
> >
> > Also, I roughly understand NodePort, but I have never quite understood how the LoadBalancer approach works.
> >
> > yidan zhao wrote on Tue, Jul 12, 2022 at 12:48:
> > >
> > > My understanding of "inside the k8s cluster" is the machines that make up k8s — does it have to be inside a pod? Presumably a k8s node does not count either.
> > >
> > > Yang Wang wrote on Tue, Jul 12, 2022 at 12:07:
> > > >
> > > > The log already explains it fairly clearly: with the ClusterIP approach, your Flink
> > > > client must be inside the k8s cluster to submit successfully, e.g. start a pod and run flink run inside it.
> > > > Otherwise you need the NodePort or LoadBalancer approach.
> > > >
> > > > 2022-07-12 10:23:23,021 WARN org.apache.flink.kubernetes.KubernetesClusterDescriptor  [] - Please note that Flink client operations(e.g. cancel, list, stop, savepoint, etc.) won't work from outside the Kubernetes cluster since 'kubernetes.rest-service.exposed.type' has been set to ClusterIP.
> > > >
> > > >
> > > > Best,
> > > > Yang
> > > >
> > > > yidan zhao wrote on Tue, Jul 12, 2022 at 10:40:
> > > >
> > > > > The steps below follow this documentation:
> > > > > https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/resource-providers/native_kubernetes/#starting-a-flink-session-on-kubernetes
> > > > >
> > > > > Version: 1.15
> > > > >
> > > > > (1) Create the cluster:
> > > > > https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/resource-providers/native_kubernetes/#starting-a-flink-session-on-kubernetes
> > > > > (2) Submit the job:
> > > > > ./bin/flink run \
> > > > > --target kubernetes-session \
> > > > > -Dkubernetes.cluster-id=my-first-flink-cluster \
> > > > > ./examples/streaming/TopSpeedWindowing.jar
> > > > >
> > > > > The svc is of type ClusterIp
> > > > >
> > > > > The job-submission step prints the following:
> > > > > Executing example with default input data.
> > > > > Use --input to specify file input.
> > > > > Printing result to stdout. Use --output to specify output path.
> > > > > 2022-07-12 10:23:23,021 WARN org.apache.flink.kubernetes.KubernetesClusterDescriptor  [] - Please note that Flink client operations(e.g. cancel, list, stop, savepoint, etc.) won't work from outside the Kubernetes cluster since 'kubernetes.rest-service.exposed.type' has been set to ClusterIP.
> > > > > 2022-07-12 10:23:23,027 INFO org.apache.flink.kubernetes.KubernetesClusterDescriptor  [] - Retrieve flink cluster my-first-flink-cluster successfully, JobManager Web Interface: http://my-first-flink-cluster-rest.test:8081
> > > > > 2022-07-12 10:23:23,044 WARN org.apache.flink.kubernetes.KubernetesClusterDescriptor  [] - Please note that Flink client operations(e.g. cancel, list, stop, savepoint, etc.) won't work from outside the Kubernetes cluster since 'kubernetes.rest-service.exposed.type' has been set to ClusterIP.
> > > > >
> > > > >  The program finished with the following exception:
> > > > > org.apache.flink.client.program.ProgramInvocationException: The main
> > > > > method caused an error: Failed to execute job
> > > > > 'CarTopSpeedWindowingExample'.
> > > > > ...
> > > > > Caused by: org.apache.flink.util.FlinkException: Failed to execute job
> > > > > 'CarTopSpeedWindowingExample'.
> > > > > ...
> > > > > Caused by: org.apache.flink.runtime.client.JobSubmissionException:
> > > > > Failed to submit JobGraph.
> > > > > ...
> > > > > Caused by: org.apache.flink.util.concurrent.FutureUtils$RetryException:
> > > > > Could not complete the operation. Number of retries has been
> > > > > exhausted.
> > > > > ...
> > > > > Caused by: java.util.concurrent.CompletionException:
> > > > > java.net.UnknownHostException: my-first-flink-cluster-rest.test: Name
> > > > > or service not known
> > > > > ...
> > > > > Caused by: java.net.UnknownHostException:
> > > > > my-first-flink-cluster-rest.test: Name or service not known
> > > > >
> > > > >
> > > > > As above, with --target kubernetes-session
> > > > > -Dkubernetes.cluster-id=my-first-flink-cluster the resolved submission endpoint is
> > > > >
> my-first-flink-cluster-rest.test。这个应该是根据k8s生成的dns,test是flink的namespace。
> > > > >
> > > > > 我本地也的确并无法解析 my-first-flink-cluster-rest.test 这个。
> > > > >
>


Re: flink native k8s: job submitted per the docs cannot find the target cluster

2022-07-11 by yidan zhao
Submitting with flink run -m <clusterIp> does work.
So with --target kubernetes-session
-Dkubernetes.cluster-id=my-first-flink-cluster, why can't the client be smart
enough to resolve the cluster's rest svc ClusterIP and submit through that?
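The workaround described above can be sketched as follows. This is only a sketch: the namespace `test` and the cluster-id come from this thread, the `<cluster-id>-rest` service name matches the log output quoted later, and the jsonpath lookup is an assumed way to fetch the ClusterIP:

```shell
# Look up the ClusterIP of the session cluster's REST service
# (service name follows the <cluster-id>-rest convention seen in the logs).
REST_IP=$(kubectl -n test get svc my-first-flink-cluster-rest \
  -o jsonpath='{.spec.clusterIP}')

# Submit directly against that address instead of relying on DNS resolution
# of my-first-flink-cluster-rest.test from outside the cluster.
./bin/flink run -m "${REST_IP}:8081" ./examples/streaming/TopSpeedWindowing.jar
```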


Re: flink native k8s: job submitted per the docs cannot find the target cluster

2022-07-11 by yidan zhao
If the client is on a k8s master node, can it use the ClusterIP directly?


Also, I roughly understand NodePort, but I have never quite understood how
the LoadBalancer mode works.



Re: flink native k8s: job submitted per the docs cannot find the target cluster

2022-07-11 by yidan zhao
My understanding of "inside the k8s cluster" is the machines that make up k8s. Does the client have to be inside a pod, i.e. it won't work on a k8s node either?



Re: flink native k8s: job submitted per the docs cannot find the target cluster

2022-07-11 by Yang Wang
The log already explains this fairly clearly: with ClusterIP, your Flink
client must be inside the k8s cluster to submit successfully, e.g. start a
pod and run flink run inside it. Otherwise you need the NodePort or
LoadBalancer exposure type.

2022-07-12 10:23:23,021 WARN
org.apache.flink.kubernetes.KubernetesClusterDescriptor  [] -
Please note that Flink client operations(e.g. cancel, list, stop,
savepoint, etc.) won't work from outside the Kubernetes cluster since
'kubernetes.rest-service.exposed.type' has been set to ClusterIP.


Best,
Yang
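The NodePort alternative mentioned above can be sketched like this; `kubernetes.rest-service.exposed.type` is a real Flink option, while the cluster-id and example jar are taken from this thread:

```shell
# Recreate the session with the REST service exposed as NodePort, so the
# client can reach the JobManager from outside the cluster via a node address.
./bin/kubernetes-session.sh \
  -Dkubernetes.cluster-id=my-first-flink-cluster \
  -Dkubernetes.rest-service.exposed.type=NodePort

# The client then discovers the node address and allocated port itself when
# given the same target and cluster-id:
./bin/flink run \
  --target kubernetes-session \
  -Dkubernetes.cluster-id=my-first-flink-cluster \
  ./examples/streaming/TopSpeedWindowing.jar
```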



flink native k8s: job submitted per the docs cannot find the target cluster

2022-07-11 by yidan zhao
The steps below follow this documentation:
https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/resource-providers/native_kubernetes/#starting-a-flink-session-on-kubernetes

Version: 1.15

(1) Create the cluster, as described in:
https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/resource-providers/native_kubernetes/#starting-a-flink-session-on-kubernetes
(2) Submit the job:
./bin/flink run \
--target kubernetes-session \
-Dkubernetes.cluster-id=my-first-flink-cluster \
./examples/streaming/TopSpeedWindowing.jar

The svc is of type ClusterIP.

Submitting in step (2) prints the following:
Executing example with default input data.
Use --input to specify file input.
Printing result to stdout. Use --output to specify output path.
2022-07-12 10:23:23,021 WARN
org.apache.flink.kubernetes.KubernetesClusterDescriptor  [] -
Please note that Flink client operations(e.g. cancel, list, stop,
savepoint, etc.) won't work from outside the Kubernetes cluster since
'kubernetes.rest-service.exposed.type' has been set to ClusterIP.
2022-07-12 10:23:23,027 INFO
org.apache.flink.kubernetes.KubernetesClusterDescriptor  [] -
Retrieve flink cluster my-first-flink-cluster successfully, JobManager
Web Interface: http://my-first-flink-cluster-rest.test:8081
2022-07-12 10:23:23,044 WARN
org.apache.flink.kubernetes.KubernetesClusterDescriptor  [] -
Please note that Flink client operations(e.g. cancel, list, stop,
savepoint, etc.) won't work from outside the Kubernetes cluster since
'kubernetes.rest-service.exposed.type' has been set to ClusterIP.


 The program finished with the following exception:
org.apache.flink.client.program.ProgramInvocationException: The main
method caused an error: Failed to execute job
'CarTopSpeedWindowingExample'.
...
Caused by: org.apache.flink.util.FlinkException: Failed to execute job
'CarTopSpeedWindowingExample'.
...
Caused by: org.apache.flink.runtime.client.JobSubmissionException:
Failed to submit JobGraph.
...
Caused by: org.apache.flink.util.concurrent.FutureUtils$RetryException:
Could not complete the operation. Number of retries has been
exhausted.
...
Caused by: java.util.concurrent.CompletionException:
java.net.UnknownHostException: my-first-flink-cluster-rest.test: Name
or service not known
...
Caused by: java.net.UnknownHostException:
my-first-flink-cluster-rest.test: Name or service not known


As shown above, with --target kubernetes-session
-Dkubernetes.cluster-id=my-first-flink-cluster the resolved submission
endpoint is my-first-flink-cluster-rest.test. This should be a DNS name
generated by k8s, where test is the Flink namespace.

And indeed, my local machine cannot resolve my-first-flink-cluster-rest.test.
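One way to check that the name only resolves from inside the cluster (the throwaway pod name and busybox image here are illustrative assumptions, not from the thread):

```shell
# From outside the cluster, resolution fails, matching the
# UnknownHostException in the stack trace above:
getent hosts my-first-flink-cluster-rest.test || echo "not resolvable locally"

# From a pod inside the cluster, the k8s service DNS name does resolve:
kubectl -n test run dns-check --rm -it --image=busybox --restart=Never -- \
  nslookup my-first-flink-cluster-rest.test
```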


flink native k8s [subject garbled in the archive]

2021-04-22 by ??
flink 1.12.2 Native K8s. [body partially garbled in the archive]

Start the session cluster:
./bin/kubernetes-session.sh \
  -Dkubernetes.namespace=flink-session-cluster \
  -Dkubernetes.jobmanager.service-account=flink \
  -Dkubernetes.cluster-id=session001 \
  -Dtaskmanager.memory.process.size=1024m \
  -Dkubernetes.taskmanager.cpu=1 \
  -Dtaskmanager.numberOfTaskSlots=4 \
  -Dresourcemanager.taskmanager-timeout=360

[garbled text; appears to concern the svc and its cluster-Ip]

Submit the job:
./bin/flink run -d \
  -e kubernetes-session \
  -Dkubernetes.namespace=flink-session-cluster \
  -Dkubernetes.cluster-id=session001 \
  examples/streaming/WindowJoin.jar

[the question itself is garbled; it concerns k8s]

Re: flink native k8s: any plan to support hostAlias configuration?

2021-01-18 by Yang Wang
For features that are not used very often, the community plans to support
them uniformly via pod templates, which I believe should cover your need.
That approach is more flexible and more extensible; see this JIRA [1].

There is already a draft PR; a formal PR will be submitted for review soon
after it is finished. You are also welcome to try it out and report any
issues you hit.

[1]. https://issues.apache.org/jira/browse/FLINK-15656

Best,
Yang
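A sketch of the pod-template approach described above, once FLINK-15656 is available. The file name, IP, and hostnames below are made-up examples; the `hostAliases` stanza is a standard k8s pod-spec field, and `kubernetes.pod-template-file` is the option the pod-template feature provides:

```shell
# Write a pod template carrying the desired hostAliases (plain k8s pod spec).
cat > pod-template.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: pod-template
spec:
  hostAliases:
    - ip: "10.0.0.10"          # example IP, replace with your own
      hostnames:
        - "my-internal-host"   # example hostname
EOF

# Point the session cluster at the template instead of baking a custom image.
./bin/kubernetes-session.sh \
  -Dkubernetes.cluster-id=session001 \
  -Dkubernetes.pod-template-file=./pod-template.yaml
```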



flink native k8s: any plan to support hostAlias configuration?

2021-01-17 by 高函


Does the community plan to support configuring hostAliases in native k8s mode?
If I extend it myself, do I need to add the corresponding hostAliases config
options in the module and build a custom docker image?
Thanks~