[jira] [Commented] (FLINK-17177) Handle ERROR event correctly in KubernetesResourceManager#onError

2021-04-22 Thread Flink Jira Bot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327911#comment-17327911
 ] 

Flink Jira Bot commented on FLINK-17177:


This major issue is unassigned, and neither it nor any of its Sub-Tasks has 
been updated for 30 days. So, it has been labeled "stale-major". If this ticket 
is indeed "major", please either assign yourself or give an update. Afterwards, 
please remove the label. In 7 days the issue will be deprioritized.

> Handle ERROR event correctly in KubernetesResourceManager#onError
> -
>
> Key: FLINK-17177
> URL: https://issues.apache.org/jira/browse/FLINK-17177
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Affects Versions: 1.10.0, 1.10.1
> Reporter: Canbin Zheng
> Priority: Major
> Labels: stale-major
> Fix For: 1.13.0
>
>
> Currently, once we receive an *ERROR* event that is sent from the K8s API 
> server via the K8s {{Watcher}}, then {{KubernetesResourceManager#onError}} 
> will handle it by calling the 
> {{KubernetesResourceManager#removePodIfTerminated}}. This may be incorrect 
> since the *ERROR* event may indicate an exception in the HTTP layer, which 
> means the previously created {{Watcher}} may be no longer available and we'd 
> better re-create the {{Watcher}} immediately.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-17177) Handle ERROR event correctly in KubernetesResourceManager#onError

2020-07-06 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151821#comment-17151821
 ] 

Yang Wang commented on FLINK-17177:
---

I agree that we could first raise the logging of the {{ERROR}} action in 
{{KubernetesPodsWatcher}} to warning level. This will help us debug 
potential issues in the future, although I have not run into this case.

> Handle ERROR event correctly in KubernetesResourceManager#onError
> -
>
> Key: FLINK-17177
> URL: https://issues.apache.org/jira/browse/FLINK-17177
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Affects Versions: 1.10.0, 1.10.1
> Reporter: Canbin Zheng
> Priority: Major
> Fix For: 1.12.0
>
>
> Currently, once we receive an *ERROR* event that is sent from the K8s API 
> server via the K8s {{Watcher}}, then {{KubernetesResourceManager#onError}} 
> will handle it by calling the 
> {{KubernetesResourceManager#removePodIfTerminated}}. This may be incorrect 
> since the *ERROR* event may indicate an exception in the HTTP layer, which 
> means the previously created {{Watcher}} may be no longer available and we'd 
> better re-create the {{Watcher}} immediately.





[jira] [Commented] (FLINK-17177) Handle ERROR event correctly in KubernetesResourceManager#onError

2020-07-05 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151533#comment-17151533
 ] 

Robert Metzger commented on FLINK-17177:


Thanks for your response. Let's see what [~fly_in_gis]'s opinion is on this.



[jira] [Commented] (FLINK-17177) Handle ERROR event correctly in KubernetesResourceManager#onError

2020-07-04 Thread Canbin Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151250#comment-17151250
 ] 

Canbin Zheng commented on FLINK-17177:
--

[~rmetzger] I haven't looked into this ticket in the past two months. Maybe it's a 
good idea to first log a "WARN" message in 
{{KubernetesResourceManager#onError}}.

> Handle ERROR event correctly in KubernetesResourceManager#onError
> -
>
> Key: FLINK-17177
> URL: https://issues.apache.org/jira/browse/FLINK-17177
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Affects Versions: 1.10.0, 1.10.1
> Reporter: Canbin Zheng
> Priority: Major
> Fix For: 1.11.0
>
>
> Currently, once we receive an *ERROR* event that is sent from the K8s API 
> server via the K8s {{Watcher}}, then {{KubernetesResourceManager#onError}} 
> will handle it by calling the 
> {{KubernetesResourceManager#removePodIfTerminated}}. This may be incorrect 
> since the *ERROR* event may indicate an exception in the HTTP layer, which 
> means the previously created {{Watcher}} may be no longer available and we'd 
> better re-create the {{Watcher}} immediately.





[jira] [Commented] (FLINK-17177) Handle ERROR event correctly in KubernetesResourceManager#onError

2020-07-03 Thread Robert Metzger (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150863#comment-17150863
 ] 

Robert Metzger commented on FLINK-17177:


Looking at the code, it seems that we are only logging (any) event at DEBUG 
level.

Maybe as an intermediate step, we could log at "WARN" that we've received an 
error from K8s?
Otherwise, we might get error reports from users that will be hard to debug.
Also, this might help us understand in the long run which types of errors K8s 
is reporting here.
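
The intermediate step suggested above could look roughly like the following. This is only a minimal sketch, not Flink's actual code: the {{Action}} enum, the class name and the {{levelFor}} helper are hypothetical stand-ins, and {{java.util.logging}} is used only to keep the example self-contained ({{Level.FINE}} plays the role of DEBUG, {{Level.WARNING}} the role of WARN).

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class WatchEventLogging {
    /** Hypothetical stand-in for the watch event types; not the fabric8 enum. */
    enum Action { ADDED, MODIFIED, DELETED, ERROR }

    private static final Logger LOG =
            Logger.getLogger(WatchEventLogging.class.getName());

    /** ERROR events are raised to WARNING; every other event stays at FINE. */
    static Level levelFor(Action action) {
        return action == Action.ERROR ? Level.WARNING : Level.FINE;
    }

    /** Logs the received event at the level chosen by levelFor. */
    static void onEvent(Action action, String podName) {
        LOG.log(levelFor(action), "Received {0} event for pod {1}",
                new Object[] {action, podName});
    }

    public static void main(String[] args) {
        // Suppressed under the default INFO threshold of java.util.logging.
        onEvent(Action.MODIFIED, "taskmanager-1");
        // Shows up as a WARNING, so users can include it in error reports.
        onEvent(Action.ERROR, "taskmanager-1");
    }
}
```

With such a change the handling of the event itself stays untouched; only its visibility in the logs changes.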



[jira] [Commented] (FLINK-17177) Handle ERROR event correctly in KubernetesResourceManager#onError

2020-04-18 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17086719#comment-17086719
 ] 

Yang Wang commented on FLINK-17177:
---

[~felixzheng], thanks a lot. Let's try to figure out when the {{Error 
WatchEvent}} will be sent from the K8s ApiServer.
 * If it happens in a resource spec check (e.g. resource version too old, format 
check failed), then the current handling logic is right: remove the pod and create a 
new one.
 * If it happens because of some K8s internal error, creating a new watcher 
will not solve the problem. Maybe we need to throw a fatal error and fail 
the current jobmanager attempt.
 * Some other case ...

 

Moreover, I am afraid that if there are some HTTP layer errors, the 
{{WatchConnectionManager}} could handle them and retry internally, i.e. by creating 
a new {{WebSocket}}. This is just like YARN: if there is some network problem, the IPC 
client of {{AMRMClient}} will handle the retry logic.
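
The case distinction listed above can be written down as a small decision function. This is a sketch under stated assumptions only: the {{Cause}} and {{Reaction}} enums are hypothetical names and not part of Kubernetes, fabric8 or Flink, and the reaction for the open "some other case" item (log and investigate) is an assumption, since the comment leaves that case unspecified.

```java
public class ErrorEventTriage {
    /** Hypothetical causes, following the cases in the list above. */
    enum Cause { RESOURCE_SPEC_CHECK, K8S_INTERNAL_ERROR, OTHER }

    /** Hypothetical reactions of the resource manager. */
    enum Reaction { REMOVE_POD_AND_CREATE_NEW, FATAL_ERROR, LOG_AND_INVESTIGATE }

    static Reaction reactTo(Cause cause) {
        switch (cause) {
            case RESOURCE_SPEC_CHECK:
                // Current handling is considered right: replace the pod.
                return Reaction.REMOVE_POD_AND_CREATE_NEW;
            case K8S_INTERNAL_ERROR:
                // A new watcher would not help: fail the jobmanager attempt.
                return Reaction.FATAL_ERROR;
            default:
                // Assumption: unknown cases are only logged for later analysis.
                return Reaction.LOG_AND_INVESTIGATE;
        }
    }

    public static void main(String[] args) {
        for (Cause cause : Cause.values()) {
            System.out.println(cause + " -> " + reactTo(cause));
        }
    }
}
```

How an incoming event would be mapped to one of these causes is exactly the open question of this ticket and is deliberately left out of the sketch.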



[jira] [Commented] (FLINK-17177) Handle ERROR event correctly in KubernetesResourceManager#onError

2020-04-18 Thread Canbin Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17086442#comment-17086442
 ] 

Canbin Zheng commented on FLINK-17177:
--

{quote}I find that the fabric8 kubernetes client has a different 
implementation from the K8s go client. It will never produce an {{onError}} 
{{WatchEvent}} on the client side[1].

But I am not familiar with when the K8s ApiServer will return an 
{{Error}} type {{WatchEvent}}. [~felixzheng] Could you share some insight with 
me?
{quote}
Hi, [~fly_in_gis]! I have searched the K8s source code and so far I haven't found any 
place where the server sends an {{ERROR}} event. Maybe we need more 
investigation; I will give you feedback when I have new discoveries.

 

> Handle ERROR event correctly in KubernetesResourceManager#onError
> -
>
> Key: FLINK-17177
> URL: https://issues.apache.org/jira/browse/FLINK-17177
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Affects Versions: 1.10.0, 1.10.1
> Reporter: Canbin Zheng
> Priority: Major
> Fix For: 1.11.0
>
>
> Currently, once we receive an *ERROR* event that is sent from the K8s API 
> server via the K8s {{Watcher}}, then {{KubernetesResourceManager#onError}} 
> will handle it by calling the 
> {{KubernetesResourceManager#removePodIfTerminated}}. This may be incorrect 
> since the *ERROR* event indicates an exception in the HTTP layer that is 
> caused by the K8s Server, which means the previously created {{Watcher}} may 
> be no longer available and we'd better re-create the {{Watcher}} immediately.





[jira] [Commented] (FLINK-17177) Handle ERROR event correctly in KubernetesResourceManager#onError

2020-04-17 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17085601#comment-17085601
 ] 

Yang Wang commented on FLINK-17177:
---

I find that the fabric8 kubernetes client has a different implementation from 
the K8s go client. It will never produce an {{onError}} {{WatchEvent}} on the client 
side[1].

But I am not familiar with when the K8s ApiServer will return an 
{{Error}} type {{WatchEvent}}. [~felixzheng] Could you share some insight with 
me?

 

[1]. 
[https://github.com/fabric8io/kubernetes-client/blob/1afea0c364a48d9a8745539f84de59630ef6f559/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchHTTPManager.java#L289]



[jira] [Commented] (FLINK-17177) Handle ERROR event correctly in KubernetesResourceManager#onError

2020-04-16 Thread Canbin Zheng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17085442#comment-17085442
 ] 

Canbin Zheng commented on FLINK-17177:
--

{quote}I posted the {{WatchEvent}} definition from K8s here[1]. I do not find that the "Error" type 
means "HTTP error". So could you share some information about how the "Error" type 
is introduced by an HTTP layer error?

 

[1]. 
[https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#watchevent-v1-meta]
{quote}
One case in K8s client-go is 
[https://github.com/kubernetes/kubernetes/blob/343c1e7636fe5c75cdd378c0b170b26935806de5/staging/src/k8s.io/apimachinery/pkg/watch/streamwatcher.go#L121]

Also, the K8s server could probably send an {{ERROR}} event if something goes 
wrong in the HTTP stream.



[jira] [Commented] (FLINK-17177) Handle ERROR event correctly in KubernetesResourceManager#onError

2020-04-16 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17085412#comment-17085412
 ] 

Yang Wang commented on FLINK-17177:
---

{code:java}
Object is: If Type is Added or Modified: the new state of the object. If Type 
is Deleted: the state of the object immediately before deletion. If Type is 
Error: Status is recommended; other types may make sense depending on context.
{code}
I posted the {{WatchEvent}} definition from K8s here[1]. I do not find that the "Error" type means 
"HTTP error". So could you share some information about how the "Error" type is 
introduced by an HTTP layer error?

 

[1]. 
[https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#watchevent-v1-meta]



[jira] [Commented] (FLINK-17177) Handle ERROR event correctly in KubernetesResourceManager#onError

2020-04-16 Thread Yang Wang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-17177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17085378#comment-17085378
 ] 

Yang Wang commented on FLINK-17177:
---

Hi [~felixzheng], thanks for creating this insightful ticket. I am not sure 
whether we need to create a new watcher here, since the 
{{WatchConnectionManager}} in the fabric8 kubernetes client has internal retry 
logic for http/websocket failures.

Another concern is what happens if the reconnect limit is exhausted, maybe because 
the pod network or the K8s API server is down. I think we need to throw a fatal error and exit the 
jobmanager pod. Then a new jobmanager pod will be started to take over. This is 
also the logic for the YARN deployment: if the {{AMRMClient}} heartbeat with the YARN 
ResourceManager fails (after the IPC client has retried enough times), then 
{{YarnResourceManager}} will also call onFatalError.
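
The "retry up to a limit, then fail fatally" behaviour described above can be sketched as follows. This is an illustration only, not the fabric8 or Flink implementation: the class name, the {{reconnect}} helper and its parameters are hypothetical, and the fatal-error callback merely stands in for a handler such as onFatalError.

```java
import java.util.function.BooleanSupplier;
import java.util.function.Consumer;

public class ReconnectOrFail {
    /**
     * Tries to reconnect up to maxAttempts times. If every attempt fails,
     * the fatal-error callback is invoked once so that the process can
     * exit and be replaced by a fresh one.
     */
    static boolean reconnect(BooleanSupplier connect, int maxAttempts,
                             Consumer<Throwable> onFatalError) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (connect.getAsBoolean()) {
                return true; // connection re-established, nothing else to do
            }
        }
        onFatalError.accept(new IllegalStateException(
                "Reconnect limit of " + maxAttempts + " attempts exhausted"));
        return false;
    }

    public static void main(String[] args) {
        // A connection that never succeeds stands in for an unreachable API server.
        boolean connected = reconnect(() -> false, 3,
                t -> System.err.println("fatal: " + t.getMessage()));
        System.out.println("connected=" + connected);
    }
}
```

A real implementation would also wait between attempts; the delay is omitted here to keep the sketch short.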
