Re: [EXTERNAL] Re: Native K8S Jobmanager restarts and job never recovers

2020-10-28 Thread Bohinski, Kevin
Hi Yang,

Thanks again for all the help!

We are still seeing this with 1.11.2 and ZK.
Looks like others are seeing this as well and they found a solution 
https://translate.google.com/translate?hl=en&sl=zh-CN&u=https://cloud.tencent.com/developer/article/1731416&prev=search

Should this solution be added to 1.12?

Best
kevin

On 2020/08/14 02:48:50, Yang Wang wrote:
> Hi kevin,
>
> Thanks for sharing more information. You are right. Actually, "too old
> resource version" is caused by a bug
> in the fabric8 kubernetes-client [1]. It was fixed in v4.6.1, and we have
> bumped the kubernetes-client version
> to v4.9.2 in Flink release-1.11. It has also been backported to
> release-1.10 and will be included in the next
> minor release (1.10.2).
>
> BTW, if you really want all your jobs to be recovered when the jobmanager
> crashes, you still need to configure ZooKeeper high availability.
>
> [1] https://github.com/fabric8io/kubernetes-client/pull/1800
>
>
> Best,
> Yang
>
> Bohinski, Kevin wrote on Friday, August 14, 2020 at 6:32 AM:
>
> > Might be useful
> >
> > https://stackoverflow.com/a/61437982
> >
> > Best,
> > kevin
> >
> > *From: *"Bohinski, Kevin"
> > *Date: *Thursday, August 13, 2020 at 6:13 PM
> > *To: *Yang Wang
> > *Cc: *"user@flink.apache.org"
> > *Subject: *Re: [EXTERNAL] Re: Native K8S Jobmanager restarts and job
> > never recovers
> >
> > Hi
> >
> > Got the logs on crash, hopefully they help.
> >
> > 2020-08-13 22:00:40,336 ERROR org.apache.flink.kubernetes.KubernetesResourceManager [] - Fatal error occurred in ResourceManager.
> > io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 8617182 (8633230)
> > at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.11-1.11.1.jar:1.11.1]
> > at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.11-1.11.1.jar:1.11.1]
> > at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.11-1.11.1.jar:1.11.1]
> > at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.11-1.11.1.jar:1.11.1]
> > at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.11-1.11.1.jar:1.11.1]
> > at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.11-1.11.1.jar:1.11.1]
> > at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.11-1.11.1.jar:1.11.1]
> > at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.11-1.11.1.jar:1.11.1]
> > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_262]
> > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_262]
> > at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]
> > 2020-08-13 22:00:40,337 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
> > io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 8617182
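For reference, the ZooKeeper high availability Yang recommends is enabled with configuration along these lines. This is only a sketch: the cluster-id, quorum address, and storage path below are placeholders, not values from this thread.

```shell
# Sketch: start a native K8s session with ZooKeeper HA (placeholder values).
./bin/kubernetes-session.sh \
  -Dkubernetes.cluster-id=my-flink-k8s-session \
  -Dhigh-availability=zookeeper \
  -Dhigh-availability.zookeeper.quorum=zk-1:2181,zk-2:2181,zk-3:2181 \
  -Dhigh-availability.storageDir=s3://my-bucket/flink/ha
```

With this in place, jobmanager metadata is persisted outside the pod, so a restarted jobmanager can recover running jobs.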

Native K8S HA Session Cluster Issue 1.12.1

2021-02-11 Thread Bohinski, Kevin
Hi All,

On long-lived session clusters we are seeing a k8s error `Error while watching 
the ConfigMap`.
Good news is it looks like the `too old resource version` issue is fixed :).

Logs are attached below. Any tips?

best
Kevin


2021-02-11 07:55:15,249 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 4 for job 58ec7a029cd31ad057e25479a9979cb4 (202852094 bytes in 49274 ms).
2021-02-11 08:00:15,732 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 5 (type=CHECKPOINT) @ 1613030415249 for job 58ec7a029cd31ad057e25479a9979cb4.
2021-02-11 08:00:25,446 ERROR org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Fatal error occurred in ResourceManager.
org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Error while watching the ConfigMap JOB_NAME-6a3361c3fdeb4dd9ba80d8e667a8093e-jobmanager-leader
at org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver$ConfigMapCallbackHandlerImpl.handleFatalError(KubernetesLeaderRetrievalDriver.java:120) [flink-dist_2.12-1.12.1.jar:1.12.1]
at org.apache.flink.kubernetes.kubeclient.resources.AbstractKubernetesWatcher.onClose(AbstractKubernetesWatcher.java:48) [flink-dist_2.12-1.12.1.jar:1.12.1]
at io.fabric8.kubernetes.client.utils.WatcherToggle.onClose(WatcherToggle.java:56) [flink-dist_2.12-1.12.1.jar:1.12.1]
at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.closeEvent(WatchConnectionManager.java:367) [flink-dist_2.12-1.12.1.jar:1.12.1]
at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.access$700(WatchConnectionManager.java:50) [flink-dist_2.12-1.12.1.jar:1.12.1]
at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.12-1.12.1.jar:1.12.1]
at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.12-1.12.1.jar:1.12.1]
at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.12-1.12.1.jar:1.12.1]
at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.12-1.12.1.jar:1.12.1]
at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.12-1.12.1.jar:1.12.1]
at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.12-1.12.1.jar:1.12.1]
at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.12-1.12.1.jar:1.12.1]
at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.12-1.12.1.jar:1.12.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_282]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_282]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
2021-02-11 08:00:25,456 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Error while watching the ConfigMap JOB_NAME-6a3361c3fdeb4dd9ba80d8e667a8093e-jobmanager-leader
at org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver$ConfigMapCallbackHandlerImpl.handleFatalError(KubernetesLeaderRetrievalDriver.java:120) [flink-dist_2.12-1.12.1.jar:1.12.1]
at org.apache.flink.kubernetes.kubeclient.resources.AbstractKubernetesWatcher.onClose(AbstractKubernetesWatcher.java:48) [flink-dist_2.12-1.12.1.jar:1.12.1]
at io.fabric8.kubernetes.client.utils.WatcherToggle.onClose(WatcherToggle.java:56) [flink-dist_2.12-1.12.1.jar:1.12.1]
at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.closeEvent(WatchConnectionManager.java:367) [flink-dist_2.12-1.12.1.jar:1.12.1]
at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager.access$700(WatchConnectionManager.java:50) [flink-dist_2.12-1.12.1.jar:1.12.1]
at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.12-1.12.1.jar:1.12.1]
at org.apache

1.12.2 docker image

2021-03-03 Thread Bohinski, Kevin
Hi,

Are there plans to provide a docker image for 1.12.2?

Best
kevin


Re: [EXTERNAL] Re: 1.12.2 docker image

2021-03-04 Thread Bohinski, Kevin
Hi,

I see the images on docker hub.

I tried to launch a k8s session with the following options, and I am getting 
the following logs:
```

  -Dkubernetes.container.image=flink:1.12-scala_2.12-java8 \

  
-Dcontainerized.master.env.ENABLE_BUILT_IN_PLUGINS=flink-s3-fs-hadoop-1.12.2.jar
 \
  
-Dcontainerized.taskmanager.env.ENABLE_BUILT_IN_PLUGINS=flink-s3-fs-hadoop-1.12.2.jar
 \
```

```
Enabling required built-in plugins
Linking flink-s3-fs-hadoop-1.12.2.jar to plugin directory
Plugin flink-s3-fs-hadoop-1.12.2.jar does not exist. Exiting.
```

Any suggestions?

Best
kevin

From: Chesnay Schepler 
Date: Wednesday, March 3, 2021 at 5:41 PM
To: "Bohinski, Kevin" , user 
Subject: [EXTERNAL] Re: 1.12.2 docker image

they should be released in a day or two.

On 3/3/2021 11:18 PM, Bohinski, Kevin wrote:
Hi,

Are there plans to provide a docker image for 1.12.2?

Best
kevin
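One way to narrow down the "Plugin ... does not exist" error above is to check which plugin jars the image actually ships; the name passed via ENABLE_BUILT_IN_PLUGINS must exactly match a jar (including its version) present under /opt/flink/opt in the container. A sketch, assuming a Docker daemon is available and the tag exists locally or on Docker Hub:

```shell
# List the bundled optional plugin jars in the image; ENABLE_BUILT_IN_PLUGINS
# must name one of these exactly (e.g. a 1.12.1 image has no ...-1.12.2.jar).
docker run --rm flink:1.12.2 ls /opt/flink/opt | grep flink-s3-fs-hadoop
```

If the listing shows `flink-s3-fs-hadoop-1.12.1.jar`, the image is an older (e.g. cached) build and the 1.12.2 plugin name cannot resolve.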




Re: [EXTERNAL] Re: 1.12.2 docker image

2021-03-04 Thread Bohinski, Kevin
Hi,

Actually that seemed to use a cached 1.12.1 image.

I’m seeing the following:
```
➜ docker pull flink:1.12.2
1.12.2: Pulling from library/flink
no matching manifest for linux/amd64 in the manifest list entries
```

Best
kevin
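To confirm which platforms a tag actually provides (and whether a stale local image is masking a missing `linux/amd64` manifest), the manifest list can be inspected. Note: `docker manifest` may require enabling experimental CLI features on older Docker versions.

```shell
# Show the os/architecture entries published for the tag; if linux/amd64 is
# absent, the pull error above is expected until multi-arch images are pushed.
docker manifest inspect flink:1.12.2 | grep -E '"(architecture|os)"'
```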

From: "Bohinski, Kevin" 
Date: Thursday, March 4, 2021 at 4:17 PM
To: Chesnay Schepler , user 
Subject: Re: [EXTERNAL] Re: 1.12.2 docker image

Hi,

I see the images on docker hub.

I tried to launch a k8s session with the following options, and I am getting 
the following logs:
```

  -Dkubernetes.container.image=flink:1.12-scala_2.12-java8 \

  
-Dcontainerized.master.env.ENABLE_BUILT_IN_PLUGINS=flink-s3-fs-hadoop-1.12.2.jar
 \
  
-Dcontainerized.taskmanager.env.ENABLE_BUILT_IN_PLUGINS=flink-s3-fs-hadoop-1.12.2.jar
 \
```

```
Enabling required built-in plugins
Linking flink-s3-fs-hadoop-1.12.2.jar to plugin directory
Plugin flink-s3-fs-hadoop-1.12.2.jar does not exist. Exiting.
```

Any suggestions?

Best
kevin

From: Chesnay Schepler 
Date: Wednesday, March 3, 2021 at 5:41 PM
To: "Bohinski, Kevin" , user 
Subject: [EXTERNAL] Re: 1.12.2 docker image

they should be released in a day or two.

On 3/3/2021 11:18 PM, Bohinski, Kevin wrote:
Hi,

Are there plans to provide a docker image for 1.12.2?

Best
kevin




PyFlink DataStream Example Kafka/Kinesis?

2021-03-24 Thread Bohinski, Kevin
Hi,

Is there an example kafka/kinesis source or sink for the PyFlink DataStream API?

Best,
kevin


Re: PyFlink DataStream Example Kafka/Kinesis?

2021-03-24 Thread Bohinski, Kevin
Nevermind, found this for anyone else looking: 
https://github.com/apache/flink/blob/master/flink-end-to-end-tests/flink-python-test/python/datastream/data_stream_job.py


From: "Bohinski, Kevin" 
Date: Wednesday, March 24, 2021 at 4:38 PM
To: user 
Subject: PyFlink DataStream Example Kafka/Kinesis?

Hi,

Is there an example kafka/kinesis source or sink for the PyFlink DataStream API?

Best,
kevin


Re: PyFlink DataStream Example Kafka/Kinesis?

2021-03-24 Thread Bohinski, Kevin
Is there a kinesis example?

From: "Bohinski, Kevin" 
Date: Wednesday, March 24, 2021 at 4:40 PM
To: "Bohinski, Kevin" 
Subject: Re: PyFlink DataStream Example Kafka/Kinesis?

Nevermind, found this for anyone else looking: 
https://github.com/apache/flink/blob/master/flink-end-to-end-tests/flink-python-test/python/datastream/data_stream_job.py

From: "Bohinski, Kevin" 
Date: Wednesday, March 24, 2021 at 4:38 PM
To: user 
Subject: PyFlink DataStream Example Kafka/Kinesis?

Hi,

Is there an example kafka/kinesis source or sink for the PyFlink DataStream API?

Best,
kevin


Re: [EXTERNAL] Re: PyFlink DataStream Example Kafka/Kinesis?

2021-03-24 Thread Bohinski, Kevin
Hi Shuiqiang,

Thanks for letting me know. Feel free to send any beginner-level contributions 
for this effort my way 😊 .

Best,
kevin

From: Shuiqiang Chen 
Date: Wednesday, March 24, 2021 at 10:31 PM
To: "Bohinski, Kevin" 
Cc: user 
Subject: [EXTERNAL] Re: PyFlink DataStream Example Kafka/Kinesis?

Hi Kevin,

Kinesis connector is not supported yet in Python DataStream API. We will add it 
in the future.

Best,
Shuiqiang

Bohinski, Kevin wrote on Thursday, March 25, 2021 at 5:03 AM:
Is there a kinesis example?

From: "Bohinski, Kevin" 
Date: Wednesday, March 24, 2021 at 4:40 PM
To: "Bohinski, Kevin" 
Subject: Re: PyFlink DataStream Example Kafka/Kinesis?

Nevermind, found this for anyone else looking: 
https://github.com/apache/flink/blob/master/flink-end-to-end-tests/flink-python-test/python/datastream/data_stream_job.py

From: "Bohinski, Kevin" 
Date: Wednesday, March 24, 2021 at 4:38 PM
To: user 
Subject: PyFlink DataStream Example Kafka/Kinesis?

Hi,

Is there an example kafka/kinesis source or sink for the PyFlink DataStream API?

Best,
kevin


Re: Native K8S not creating TMs

2020-06-08 Thread Bohinski, Kevin
Hi Yang

Thanks again for your help so far.
I tried your suggestion, still with no luck.

Attached are the logs, please let me know if there are more I should send.

Best
kevin

On 2020/06/08 03:02:40, Yang Wang wrote:
> Hi Kevin,
>
> It may be because of the character length limitation of K8s (no more than 63) [1].
> So the pod name could not be too long. I notice that you are using the client's
> automatically generated cluster-id. It may cause problems; could you set a
> meaningful cluster-id for your Flink session? For example,
>
> kubernetes-session.sh ... -Dkubernetes.cluster-id=my-flink-k8s-session
>
> This behavior has been improved in Flink 1.11 to check the length on the
> client side before submission.
>
> If it still could not work, could you share your full command and
> jobmanager logs? It will help a lot to find the root cause.
>
> [1]. https://stackoverflow.com/questions/50412837/kubernetes-label-name-63-character-limit
>
>
> Best,
> Yang
>
> kb wrote on Saturday, June 6, 2020 at 1:00 AM:
>
> > Thanks Yang for the suggestion, I have tried it and I'm still getting the
> > same exception. Is it possible it's due to the null pod name? Operation:
> > [create] for kind: [Pod] with name: [null] in namespace: [default]
> > failed.
> >
> > Best,
> > kevin
> >
> > --
> > Sent from:
> > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Best,
kevin

[REDACTED flink]$ kubectl create serviceaccount svc-flink
serviceaccount/svc-flink created
[REDACTED flink]$ kubectl create clusterrolebinding svc-flink-role-binding 
--clusterrole=cluster-admin --serviceaccount=default:svc-flink
clusterrolebinding.rbac.authorization.k8s.io/svc-flink-role-binding created
[REDACTED flink]$ ./flink-1.10.1/bin/kubernetes-session.sh 
-Dkubernetes.jobmanager.service-account=svc-flink 
-Dcontainerized.master.env.HTTP2_DISABLE=true 
-Dkubernetes.container.image=REDACTED/flink:1.10.1-scala_2.11-s3-3 
-Dkubernetes.cluster-id=ledger-flink-session
2020-06-08 14:50:57,215 INFO  
org.apache.flink.configuration.GlobalConfiguration- Loading 
configuration property: jobmanager.rpc.address, localhost
2020-06-08 14:50:57,216 INFO  
org.apache.flink.configuration.GlobalConfiguration- Loading 
configuration property: jobmanager.rpc.port, 6123
2020-06-08 14:50:57,217 INFO  
org.apache.flink.configuration.GlobalConfiguration- Loading 
configuration property: jobmanager.heap.size, 1024m
2020-06-08 14:50:57,217 INFO  
org.apache.flink.configuration.GlobalConfiguration- Loading 
configuration property: taskmanager.memory.process.size, 1728m
2020-06-08 14:50:57,217 INFO  
org.apache.flink.configuration.GlobalConfiguration- Loading 
configuration property: taskmanager.numberOfTaskSlots, 1
2020-06-08 14:50:57,217 INFO  
org.apache.flink.configuration.GlobalConfiguration- Loading 
configuration property: parallelism.default, 1
2020-06-08 14:50:57,218 INFO  
org.apache.flink.configuration.GlobalConfiguration- Loading 
configuration property: jobmanager.execution.failover-strategy, region
2020-06-08 14:50:58,542 INFO  
org.apache.flink.runtime.clusterframework.TaskExecutorProcessUtils  - The 
derived from fraction jvm overhead memory (172.800mb (181193935 bytes)) is less 
than its min value 192.000mb (201326592 bytes), min value will be used instead
2020-06-08 14:50:58,550 INFO  org.apache.flink.kubernetes.utils.KubernetesUtils 
- Kubernetes deployment requires a fixed port. Configuration 
blob.server.port will be set to 6124
2020-06-08 14:50:58,551 INFO  org.apache.flink.kubernetes.utils.KubernetesUtils 
- Kubernetes deployment requires a fixed port. Configuration 
taskmanager.rpc.port will be set to 6122
2020-06-08 14:50:59,532 INFO  
org.apache.flink.kubernetes.KubernetesClusterDescriptor   - Create flink 
session cluster ledger-flink-session successfully, JobManager Web Interface: 
REDACTED
[REDACTED flink]$ kubectl get services | fgrep flink
ledger-flink-sessionClusterIP  REDACTED
8081/TCP,6123/TCP,6124/TCP   24s
ledger-flink-session-rest   LoadBalancer   REDACTED 
8081:32379/TCP   24s
[REDACTED flink]$ kubectl get pods | fgrep flink
ledger-flink-session-7bf95b68b5-tsfw4   1/1 Running 0  6s
[REDACTED flink]$ nohup kubectl port-forward service/ledger-flink-session 8081 &
[1] 16722
[REDACTED flink]$ nohup: ignoring input and appending output to ‘nohup.out’
[REDACTED flink]$ kubectl exec -it ledger-flink-session-7bf95b68b5-tsfw4 bash
root@ledger-flink-session-7bf95b68b5-tsfw4:/opt/flink# cat log/jobmanager.log
2020-06-08 14:51:31,391 INFO  
org.apache.flink.runtime.entrypoint.ClusterEntrypoint - 

2020-06-08 14:51:31,393 INFO  
org.apache.flink.runtime.entrypoint.ClusterEntrypoint -  Starti

Re: Native K8S not creating TMs

2020-06-10 Thread Bohinski, Kevin
Hi Yang

I’m using DEBUG level; do you know what to search for to find the K8s apiserver 
address the kubernetes-client discovered? I don’t see anything useful so far.

Best
kevin

On 2020/06/08 16:02:07, "Bohinski, Kevin" wrote:
> Hi Yang
>
> Thanks again for your help so far.
> I tried your suggestion, still with no luck.
>
> Attached are the logs, please let me know if there are more I should send.
>
> Best
> kevin
>
> On 2020/06/08 03:02:40, Yang Wang wrote:
> > Hi Kevin,
> >
> > It may be because of the character length limitation of K8s (no more than 63) [1].
> > So the pod name could not be too long. I notice that you are using the client's
> > automatically generated cluster-id. It may cause problems; could you set a
> > meaningful cluster-id for your Flink session? For example,
> >
> > kubernetes-session.sh ... -Dkubernetes.cluster-id=my-flink-k8s-session
> >
> > This behavior has been improved in Flink 1.11 to check the length on the
> > client side before submission.
> >
> > If it still could not work, could you share your full command and
> > jobmanager logs? It will help a lot to find the root cause.
> >
> > [1]. https://stackoverflow.com/questions/50412837/kubernetes-label-name-63-character-limit
> >
> > Best,
> > Yang
> >
> > kb wrote on Saturday, June 6, 2020 at 1:00 AM:
> >
> > > Thanks Yang for the suggestion, I have tried it and I'm still getting the
> > > same exception. Is it possible it's due to the null pod name? Operation:
> > > [create] for kind: [Pod] with name: [null] in namespace: [default]
> > > failed.
> > >
> > > Best,
> > > kevin
> > >
> > > --
> > > Sent from:
> > > http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>
> Best,
> kevin
Best,
kevin



Re: Native K8S not creating TMs

2020-06-25 Thread Bohinski, Kevin
Hi Yang,

Thanks for your help, that command worked, so we connected a remote debugger 
and found the root exception was initially a timeout exception from okhttp. The 
increases you mentioned worked.

Thanks again for all the help!
Best,
kevin
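For anyone hitting the same okhttp timeout: the options Yang suggested are in the quote below, but the archive truncated their values. A sketch with placeholder values (60000 and milliseconds are assumptions, not the values from the original thread):

```shell
# Sketch: raise the fabric8 kubernetes-client timeouts for the jobmanager
# via container env vars (placeholder values; originals were truncated).
./bin/kubernetes-session.sh \
  -Dcontainerized.master.env.KUBERNETES_REQUEST_TIMEOUT=60000 \
  -Dcontainerized.master.env.KUBERNETES_CONNECTION_TIMEOUT=60000
```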


On 2020/06/19 03:46:36, Yang Wang wrote:
> Thanks for sharing the DEBUG level log.
>
> I carefully checked the logs and found that the kubernetes-client discovered
> the api server address and token successfully. However, it could not contact
> the api server (10.100.0.1:443). Could you check whether your api server is
> configured to allow access from within the cluster?
>
> I think you could start any pod and tunnel in to run the following command.
>
> KUBE_TOKEN=$(cat /var/run/secrets/kubernetes.io/serviceaccount/token)
> wget -vO- --ca-certificate /var/run/secrets/kubernetes.io/serviceaccount/ca.crt \
>   --header "Authorization: Bearer $KUBE_TOKEN" \
>   https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_PORT_443_TCP_PORT/api
>
> BTW, what's your kubernetes version? And I am not sure whether increasing
> the timeouts could help.
>
> -Dcontainerized.master.env.KUBERNETES_REQUEST_TIMEOUT=6
> -Dcontainerized.master.env.KUBERNETES_CONNECTION_TIMEOUT=6
>
>
> Best,
> Yang
>
>
> Yang Wang wrote on Tuesday, June 16, 2020 at 12:00 PM:
>
> > Hi Kevin,
> >
> > Sorry for not noticing your last response.
> > Could you share your full DEBUG level jobmanager logs? I will try to figure
> > out whether it is an issue of Flink or K8s, because I could not reproduce
> > your situation with my local K8s cluster.
> >
> > Best,
> > Yang
> >
> > Yang Wang wrote on Monday, June 8, 2020 at 11:02 AM:
> >
> >> Hi Kevin,
> >>
> >> It may be because of the character length limitation of K8s (no more than 63) [1].
> >> So the pod name could not be too long. I notice that you are using the client's
> >> automatically generated cluster-id. It may cause problems; could you set a
> >> meaningful cluster-id for your Flink session? For example,
> >>
> >> kubernetes-session.sh ... -Dkubernetes.cluster-id=my-flink-k8s-session
> >>
> >> This behavior has been improved in Flink 1.11 to check the length on the
> >> client side before submission.
> >>
> >> If it still could not work, could you share your full command and
> >> jobmanager logs? It will help a lot to find the root cause.
> >>
> >> [1]. https://stackoverflow.com/questions/50412837/kubernetes-label-name-63-character-limit
> >>
> >> Best,
> >> Yang
> >>
> >> kb wrote on Saturday, June 6, 2020 at 1:00 AM:
> >>
> >>> Thanks Yang for the suggestion, I have tried it and I'm still getting the
> >>> same exception. Is it possible it's due to the null pod name? Operation:
> >>> [create] for kind: [Pod] with name: [null] in namespace: [default]
> >>> failed.
> >>>
> >>> Best,
> >>> kevin
> >>>
> >>> --
> >>> Sent from:
> >>> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Best,
kevin



Native K8S IAM Role?

2020-06-25 Thread Bohinski, Kevin
Hi,

How do we attach an IAM role to the native K8S sessions?

Typically for our other pods we use the following in our yamls:
spec:
  template:
    metadata:
      annotations:
        iam.amazonaws.com/role: ROLE_ARN

Best
kevin


Re: Native K8S IAM Role?

2020-06-25 Thread Bohinski, Kevin
(via https://github.com/jtblin/kube2iam )

On 2020/06/25 19:08:41, "Bohinski, Kevin" wrote:
> Hi,
>
> How do we attach an IAM role to the native K8S sessions?
>
> Typically for our other pods we use the following in our yamls:
>
> spec:
>   template:
>     metadata:
>       annotations:
>         iam.amazonaws.com/role: ROLE_ARN
>
> Best
> kevin

Best,
kevin



Re: [EXTERNAL] Re: Native K8S IAM Role?

2020-06-28 Thread Bohinski, Kevin
Hi Yang,

Awesome, looking forward to 1.11!
In the meantime, we are using a mutating webhook, in case anyone else is facing 
this...

Best,
kevin


From: Yang Wang 
Date: Saturday, June 27, 2020 at 11:23 PM
To: "Bohinski, Kevin" 
Cc: "user@flink.apache.org" 
Subject: [EXTERNAL] Re: Native K8S IAM Role?

Hi kevin,

If you mean to add annotations for Flink native K8s session pods, you could use 
"kubernetes.jobmanager.annotations"
and "kubernetes.taskmanager.annotations"[1]. However, they are only supported 
from release-1.11. Maybe you could
wait for a little bit more time, 1.11 will be released soon. And we add more 
features for native K8s integration in 1.11
(e.g. application mode, label, annotation, toleration, etc.).


[1]. 
https://ci.apache.org/projects/flink/flink-docs-master/ops/config.html#kubernetes

Best,
Yang
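With 1.11, the kube2iam annotation from the original question can then be passed at session start. A sketch only: ROLE_ARN is a placeholder, and a real ARN contains colons, which need the quote-escaping discussed in the "Map type param escaping" thread.

```shell
# Sketch (Flink 1.11+): annotate jobmanager and taskmanager pods for kube2iam.
# ROLE_ARN is a placeholder; values containing ':' must be quote-escaped.
./bin/kubernetes-session.sh \
  -Dkubernetes.jobmanager.annotations=iam.amazonaws.com/role:ROLE_ARN \
  -Dkubernetes.taskmanager.annotations=iam.amazonaws.com/role:ROLE_ARN
```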

Bohinski, Kevin wrote on Friday, June 26, 2020 at 3:09 AM:
Hi,

How do we attach an IAM role to the native K8S sessions?

Typically for our other pods we use the following in our yamls:
spec:
  template:
    metadata:
      annotations:
        iam.amazonaws.com/role: ROLE_ARN

Best
kevin


Map type param escaping :

2020-07-14 Thread Bohinski, Kevin
Hi,

How do we escape `:` in a map-type param?

For example, we have:
-Dkubernetes.jobmanager.annotations=KEY:V:A:L:U:E

Which should result in {“KEY”: “V:A:L:U:E”}.

Best,
kevin



Re: Map type param escaping :

2020-07-14 Thread Bohinski, Kevin
Figured it out, pulled StructuredOptionsSplitter into a debugger and was able 
to get it working with:
-Dkubernetes.jobmanager.annotations="\"KEY:\"\"V:A:L:U:E\"\"\""

Best
kevin
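As a sanity check on that quoting, the shell layer can be inspected separately: after the shell strips its own escaping, Flink's StructuredOptionsSplitter receives a value wrapped in doubled quotes, which it then unwraps. A minimal sketch using the KEY/value from the example above:

```shell
# Print exactly what the JVM receives after the shell strips its layer of
# escaping; Flink's option splitter then unwraps the doubled quotes.
printf '%s\n' "\"KEY:\"\"V:A:L:U:E\"\"\""
```

This prints `"KEY:""V:A:L:U:E"""`, which the splitter parses as the single pair {"KEY": "V:A:L:U:E"}.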


Native K8S Jobmanager restarts and job never recovers

2020-08-07 Thread Bohinski, Kevin
Hi all,

In our 1.11.1 native k8s session, after we submit a job it will run successfully 
for a few hours, then fail when the jobmanager pod restarts.

Relevant logs after restart are attached below. Any suggestions?

Best
kevin

2020-08-06 21:50:24,425 INFO  
org.apache.flink.kubernetes.KubernetesResourceManager[] - Recovered 32 
pods from previous attempts, current attempt id is 2.
2020-08-06 21:50:24,610 DEBUG 
org.apache.flink.kubernetes.kubeclient.resources.KubernetesPodsWatcher [] - 
Received ADDED event for pod REDACTED-flink-session-taskmanager-1-16, details: 
PodStatus(conditions=[PodCondition(lastProbeTime=null, 
lastTransitionTime=2020-08-06T18:48:33Z, message=null, reason=null, 
status=True, type=Initialized, additionalProperties={}), 
PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:37Z, 
message=null, reason=null, status=True, type=Ready, additionalProperties={}), 
PodCondition(lastProbeTime=null, lastTransitionTime=2020-08-06T18:48:37Z, 
message=null, reason=null, status=True, type=ContainersReady, 
additionalProperties={}), PodCondition(lastProbeTime=null, 
lastTransitionTime=2020-08-06T18:48:33Z, message=null, reason=null, 
status=True, type=PodScheduled, additionalProperties={})], 
containerStatuses=[ContainerStatus(containerID=docker://REDACTED, 
image=REDACTED/flink:1.11.1-scala_2.11-s3-0, 
imageID=docker-pullable://REDACTED/flink@sha256:REDACTED, 
lastState=ContainerState(running=null, terminated=null, waiting=null, 
additionalProperties={}), name=flink-task-manager, ready=true, restartCount=0, 
started=true, 
state=ContainerState(running=ContainerStateRunning(startedAt=2020-08-06T18:48:35Z,
 additionalProperties={}), terminated=null, waiting=null, 
additionalProperties={}), additionalProperties={})], 
ephemeralContainerStatuses=[], hostIP=REDACTED, initContainerStatuses=[], 
message=null, nominatedNodeName=null, phase=Running, podIP=REDACTED, 
podIPs=[PodIP(ip=REDACTED, additionalProperties={})], qosClass=Guaranteed, 
reason=null, startTime=2020-08-06T18:48:33Z, additionalProperties={})
2020-08-06 21:50:24,613 DEBUG org.apache.flink.kubernetes.KubernetesResourceManager [] - Ignore TaskManager pod that is already added: REDACTED-flink-session-taskmanager-1-16
2020-08-06 21:50:24,615 INFO  org.apache.flink.kubernetes.KubernetesResourceManager [] - Received 0 new TaskManager pods. Remaining pending pod requests: 0
2020-08-06 21:50:24,631 DEBUG org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore [] - Could not load archived execution graph for job id 8c76f37962afd87783c65c95387fb828.
java.util.concurrent.ExecutionException: java.io.FileNotFoundException: Could not find file for archived execution graph 8c76f37962afd87783c65c95387fb828. This indicates that the file either has been deleted or never written.
at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture$Sync.getValue(AbstractFuture.java:299) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:286) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:116) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.shaded.guava18.com.google.common.util.concurrent.Uninterruptibles.getUninterruptibly(Uninterruptibles.java:137) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.getAndRecordStats(LocalCache.java:2348) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2320) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2282) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2197) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache.get(LocalCache.java:3937) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache.getOrLoad(LocalCache.java:3941) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.shaded.guava18.com.google.common.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4824) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.runtime.dispatcher.FileArchivedExecutionGraphStore.get(FileArchivedExecutionGraphStore.java:143) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.runtime.dispatcher.Dispatcher.lambda$requestJob$20(Dispatcher.java:554) ~[flink-dist_2.11-1.11.1.jar:

Re: [EXTERNAL] Re: Native K8S Jobmanager restarts and job never recovers

2020-08-13 Thread Bohinski, Kevin
Hi

Got the logs on crash, hopefully they help.

2020-08-13 22:00:40,336 ERROR org.apache.flink.kubernetes.KubernetesResourceManager [] - Fatal error occurred in ResourceManager.
io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 8617182 (8633230)
at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.11-1.11.1.jar:1.11.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_262]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_262]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]
2020-08-13 22:00:40,337 ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 8617182 (8633230)
at io.fabric8.kubernetes.client.dsl.internal.WatchConnectionManager$1.onMessage(WatchConnectionManager.java:259) [flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.kubernetes.shaded.okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.kubernetes.shaded.okhttp3.RealCall$AsyncCall.execute(RealCall.java:206) [flink-dist_2.11-1.11.1.jar:1.11.1]
at org.apache.flink.kubernetes.shaded.okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [flink-dist_2.11-1.11.1.jar:1.11.1]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_262]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_262]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_262]
2020-08-13 22:00:40,416 INFO  org.apache.flink.runtime.blob.BlobServer [] - Stopped BLOB server at 0.0.0.0:6124

Best,
kevin
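The "too old resource version" failure in the log above is the Kubernetes API's normal watch-expiry signal (surfaced as HTTP 410 Gone): the client's cached resourceVersion has aged out of etcd's history, and the expected reaction is to re-create the watch from a fresh resource version rather than treat the error as fatal. As a minimal, self-contained sketch of that recovery pattern (the WatchStarter interface and helper below are hypothetical illustrations, not Flink or fabric8 code):

```java
/**
 * Illustrative sketch only (not Flink's actual code): treat the
 * Kubernetes "too old resource version" watch failure as recoverable
 * by re-creating the watch instead of escalating to a fatal error.
 */
public class WatchRecovery {

    /** Hypothetical stand-in for starting a pod watch; real code would call the K8s client. */
    interface WatchStarter {
        void startWatch() throws Exception;
    }

    /** Matches the message the client puts on the exception in the log above. */
    static boolean isTooOldResourceVersion(Exception e) {
        return e.getMessage() != null
                && e.getMessage().contains("too old resource version");
    }

    /**
     * Start the watch; if it dies with "too old resource version",
     * re-create it (up to maxRestarts times). A fresh watch picks up
     * the current resourceVersion. Returns how many restarts were needed.
     */
    static int runWithRecovery(WatchStarter starter, int maxRestarts) throws Exception {
        int restarts = 0;
        while (true) {
            try {
                starter.startWatch();
                return restarts;
            } catch (Exception e) {
                if (isTooOldResourceVersion(e) && restarts < maxRestarts) {
                    restarts++; // recoverable: re-create the watch and retry
                } else {
                    throw e;    // anything else is still treated as fatal
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] attempts = {0};
        // Simulate a watch that expires twice before staying up.
        int restarts = runWithRecovery(() -> {
            if (++attempts[0] <= 2) {
                throw new Exception("too old resource version: 8617182 (8633230)");
            }
        }, 5);
        System.out.println("restarts=" + restarts); // prints restarts=2
    }
}
```

This is the general shape of the fix discussed for later Flink releases: the jobmanager should survive watch expiry instead of crashing the cluster entrypoint.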


From: Yang Wang 
Date: Sunday, August 9, 2020 at 10:29 PM
To: "Bohinski, Kevin" 
Cc: "user@flink.apache.org" 
Subject: [EXTERNAL] Re: Native K8S Jobmanager restarts and job never recovers

Hi Kevin,

I think you may not have set the high availability configuration in your native K8s 
session. Currently, we only support ZooKeeper HA, so you need to add the following 
configuration. Once HA is configured, the running jobs, checkpoints, and other 
metadata are persisted, so all jobs can be recovered when the jobmanager fails 
over. I have tested that this works properly.


high-availability: zookeeper

high-availability.zookeeper.quorum: zk-client:2181

high-availability.storageDir: hdfs:///flink/recovery
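For reference, the same options can also be passed as dynamic properties when starting the native K8s session; a sketch (the cluster-id, quorum address, and storage directory below are placeholders to substitute with your own values):

```shell
# Start a native Kubernetes session with ZooKeeper HA enabled.
# All -D values here are placeholder examples, not required names.
./bin/kubernetes-session.sh \
  -Dkubernetes.cluster-id=my-session-cluster \
  -Dhigh-availability=zookeeper \
  -Dhigh-availability.zookeeper.quorum=zk-client:2181 \
  -Dhigh-availability.storageDir=hdfs:///flink/recovery
```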

I know you may not have a ZooKeeper cluster. You could use a ZooKeeper K8s 
operator[1] to deploy a new one.

Moreover, it is not very convenient to use ZooKeeper for HA, so K8s-native HA 
support[2] is planned and we are trying to finish it in the next major release 
cycle (1.12).


[1]. https://github.com/pravega/zookeeper-operator

Re: [EXTERNAL] Re: Native K8S Jobmanager restarts and job never recovers

2020-08-13 Thread Bohinski, Kevin
Might be useful
https://stackoverflow.com/a/61437982

Best,
kevin



StreamingFileSink on EMR

2019-02-25 Thread Bohinski, Kevin (Contractor)
When running Flink 1.7 on EMR 5.21 using StreamingFileSink we see:

java.lang.UnsupportedOperationException: Recoverable writers on Hadoop are only 
supported for HDFS and for Hadoop version 2.7 or newer.

EMR is showing Hadoop version 2.8.5. Is anyone else seeing this issue?
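For context, the exception comes from a guard in Flink's Hadoop recoverable writer: it refuses to run unless the sink path is on HDFS and the Hadoop version is at least 2.7, so it can trip on the path scheme (for example an s3:// sink path) even when the reported Hadoop version is 2.8.5. A self-contained sketch of that guard (illustrative only, not Flink's actual implementation):

```java
/**
 * Illustrative sketch of the guard behind the exception above
 * (not Flink's actual implementation): recoverable writers require
 * an hdfs:// sink path and Hadoop >= 2.7.
 */
public class RecoverableWriterCheck {

    /** Parse "major.minor[.patch]" and compare against a minimum version. */
    static boolean isAtLeast(String version, int minMajor, int minMinor) {
        String[] parts = version.split("\\.");
        int major = Integer.parseInt(parts[0]);
        int minor = parts.length > 1 ? Integer.parseInt(parts[1]) : 0;
        return major > minMajor || (major == minMajor && minor >= minMinor);
    }

    /** Extract the scheme ("hdfs", "s3", ...) from a sink path. */
    static String scheme(String path) {
        int idx = path.indexOf("://");
        return idx < 0 ? "" : path.substring(0, idx);
    }

    /** Fails the same way the sink reports when either condition does not hold. */
    static void checkSupported(String sinkPath, String hadoopVersion) {
        if (!"hdfs".equalsIgnoreCase(scheme(sinkPath))
                || !isAtLeast(hadoopVersion, 2, 7)) {
            throw new UnsupportedOperationException(
                    "Recoverable writers on Hadoop are only supported for HDFS "
                            + "and for Hadoop version 2.7 or newer");
        }
    }

    public static void main(String[] args) {
        checkSupported("hdfs:///data/out", "2.8.5"); // passes: hdfs scheme, 2.8.5 >= 2.7
        try {
            checkSupported("s3://bucket/out", "2.8.5"); // scheme trips the guard
        } catch (UnsupportedOperationException e) {
            System.out.println("s3 path rejected: " + e.getMessage());
        }
    }
}
```

So when this error appears despite a new-enough Hadoop, the sink path's filesystem scheme is worth checking first.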