[jira] [Commented] (FLINK-17177) Handle ERROR event correctly in KubernetesResourceManager#onError
[ https://issues.apache.org/jira/browse/FLINK-17177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17327911#comment-17327911 ]

Flink Jira Bot commented on FLINK-17177:

This major issue is unassigned and itself and all of its Sub-Tasks have not been updated for 30 days. So, it has been labeled "stale-major". If this ticket is indeed "major", please either assign yourself or give an update. Afterwards, please remove the label. In 7 days the issue will be deprioritized.

> Handle ERROR event correctly in KubernetesResourceManager#onError
> -----------------------------------------------------------------
>
>                 Key: FLINK-17177
>                 URL: https://issues.apache.org/jira/browse/FLINK-17177
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.10.0, 1.10.1
>            Reporter: Canbin Zheng
>            Priority: Major
>              Labels: stale-major
>             Fix For: 1.13.0
>
>
> Currently, once we receive an *ERROR* event that is sent from the K8s API
> server via the K8s {{Watcher}}, then {{KubernetesResourceManager#onError}}
> will handle it by calling the
> {{KubernetesResourceManager#removePodIfTerminated}}. This may be incorrect
> since the *ERROR* event may indicate an exception in the HTTP layer, which
> means the previously created {{Watcher}} may be no longer available and we'd
> better re-create the {{Watcher}} immediately.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (FLINK-17177) Handle ERROR event correctly in KubernetesResourceManager#onError
[ https://issues.apache.org/jira/browse/FLINK-17177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151821#comment-17151821 ]

Yang Wang commented on FLINK-17177:
-----------------------------------

I agree that we could first raise the logging of the {{ERROR}} action in {{KubernetesPodsWatcher}} to warning level. This will help us debug potential issues in the future, although I have not run into this case.
[jira] [Commented] (FLINK-17177) Handle ERROR event correctly in KubernetesResourceManager#onError
[ https://issues.apache.org/jira/browse/FLINK-17177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151533#comment-17151533 ]

Robert Metzger commented on FLINK-17177:

Thanks for your response. Let's see what [~fly_in_gis]'s opinion is on this.
[jira] [Commented] (FLINK-17177) Handle ERROR event correctly in KubernetesResourceManager#onError
[ https://issues.apache.org/jira/browse/FLINK-17177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17151250#comment-17151250 ]

Canbin Zheng commented on FLINK-17177:
--------------------------------------

[~rmetzger] I haven't looked into this ticket in the past two months. Maybe it's a good idea to first log a "WARN" message in {{KubernetesResourceManager#onError}}.
[jira] [Commented] (FLINK-17177) Handle ERROR event correctly in KubernetesResourceManager#onError
[ https://issues.apache.org/jira/browse/FLINK-17177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150863#comment-17150863 ]

Robert Metzger commented on FLINK-17177:

Looking at the code, it seems that we are only logging (any) event at DEBUG level. Maybe as an intermediate step, we could log at "WARN" that we've received an error from K8s? Otherwise, we might get error reports from users that will be hard to debug. Also, this might help us understand, in the long run, which types of errors K8s is reporting here.
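A minimal sketch of the intermediate step suggested above: events with an ERROR action are logged at warning level, everything else stays at debug level. The class, enum and method names are hypothetical stand-ins (they are not Flink's {{KubernetesPodsWatcher}} or any fabric8 type), and java.util.logging is used only to keep the example self-contained.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for the watcher callback discussed in this thread.
// It is not Flink's KubernetesPodsWatcher and uses no fabric8 types.
final class WatchEventLoggingSketch {

    /** Watch event types named in the K8s WatchEvent documentation. */
    enum Action { ADDED, MODIFIED, DELETED, ERROR }

    private static final Logger LOG =
            Logger.getLogger(WatchEventLoggingSketch.class.getName());

    /** ERROR events are logged at WARNING; everything else stays at FINE (debug). */
    static Level levelFor(Action action) {
        return action == Action.ERROR ? Level.WARNING : Level.FINE;
    }

    /** Logs the received event at the level chosen by levelFor. */
    static void eventReceived(Action action, String podName) {
        LOG.log(levelFor(action), "Received {0} event for pod {1}",
                new Object[] {action, podName});
    }

    public static void main(String[] args) {
        // The pod name is made up for the example.
        eventReceived(Action.MODIFIED, "taskmanager-1-1"); // hidden at the default log level
        eventReceived(Action.ERROR, "taskmanager-1-1");    // shows up as a warning
    }
}
```

With the default logging configuration only the ERROR event is printed, which is the visibility the comment asks for without changing how the event is handled.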
[jira] [Commented] (FLINK-17177) Handle ERROR event correctly in KubernetesResourceManager#onError
[ https://issues.apache.org/jira/browse/FLINK-17177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17086719#comment-17086719 ]

Yang Wang commented on FLINK-17177:
-----------------------------------

[~felixzheng], thanks a lot. Let's try to figure out when the {{Error WatchEvent}} will be sent from the K8s ApiServer.
 * If it happens in the resource spec check (e.g. resource version too old, format check failed), then the current handling logic is right: remove the pod and create a new one.
 * If it happens because of some K8s internal error, creating a new watcher would not solve the problem. Maybe we need to throw a fatal error and fail the current jobmanager attempt.
 * Some other case ...

Moreover, I am afraid that if there are HTTP layer errors, the {{WatchConnectionManager}} can already handle them and retry internally, i.e. by creating a new {{WebSocket}}. This is just like YARN: if there is a network problem, the IPC client of {{AMRMClient}} handles the retry logic.
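The case split above could be sketched roughly as follows. Everything here is hypothetical: the class, the enum and the idea of keying the decision on the HTTP-style code of the Status object are assumptions for illustration, not existing Flink or fabric8 behavior. The mapping of 410 and 422 to the "resource spec check" case and of 5xx to "K8s internal error" is likewise an assumption, and the third case is deliberately left undecided because the thread leaves it open.

```java
// Hypothetical sketch: none of these types exist in Flink or fabric8.
final class WatchErrorTriageSketch {

    enum Reaction { REMOVE_POD_AND_REQUEST_NEW, FATAL_ERROR, UNDECIDED }

    /**
     * Maps the HTTP-style code carried by the Status object of an ERROR event to a reaction.
     * The code-to-case mapping is an assumption made for illustration only.
     */
    static Reaction reactionFor(int statusCode) {
        if (statusCode == 410 || statusCode == 422) {
            // Assumed "resource spec check" failures: resource version too old (410)
            // or a failed format check (422). The existing handling is kept.
            return Reaction.REMOVE_POD_AND_REQUEST_NEW;
        }
        if (statusCode >= 500 && statusCode <= 599) {
            // Assumed K8s-internal errors: a new watcher would not help,
            // so the current jobmanager attempt is failed.
            return Reaction.FATAL_ERROR;
        }
        // "Some other case": left open in the discussion, so no reaction is chosen here.
        return Reaction.UNDECIDED;
    }

    public static void main(String[] args) {
        for (int code : new int[] {410, 422, 500, 403}) {
            System.out.println(code + " -> " + reactionFor(code));
        }
    }
}
```

Keeping the decision in one pure function makes it easy to extend once it is clearer which errors the ApiServer actually reports.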
[jira] [Commented] (FLINK-17177) Handle ERROR event correctly in KubernetesResourceManager#onError
[ https://issues.apache.org/jira/browse/FLINK-17177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17086442#comment-17086442 ]

Canbin Zheng commented on FLINK-17177:
--------------------------------------

{quote}I find that the fabric8 kubernetes client has a different implementation from the K8s go client: it will never produce an {{onError}} {{WatchEvent}} on the client side[1]. But I am not familiar enough with the K8s ApiServer to know when it will return an {{Error}} type {{WatchEvent}}. [~felixzheng] Could you share some insight with me?
{quote}
Hi, [~fly_in_gis]! I have searched the K8s source code and so far haven't found any place where the server sends an {{ERROR}} event. Maybe we need more investigation; I will give you feedback when I have new findings.

> Handle ERROR event correctly in KubernetesResourceManager#onError
> -----------------------------------------------------------------
>
>                 Key: FLINK-17177
>                 URL: https://issues.apache.org/jira/browse/FLINK-17177
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.10.0, 1.10.1
>            Reporter: Canbin Zheng
>            Priority: Major
>             Fix For: 1.11.0
>
>
> Currently, once we receive an *ERROR* event that is sent from the K8s API
> server via the K8s {{Watcher}}, then {{KubernetesResourceManager#onError}}
> will handle it by calling the
> {{KubernetesResourceManager#removePodIfTerminated}}. This may be incorrect
> since the *ERROR* event indicates an exception in the HTTP layer that is
> caused by the K8s Server, which means the previously created {{Watcher}} may
> be no longer available and we'd better re-create the {{Watcher}} immediately.
[jira] [Commented] (FLINK-17177) Handle ERROR event correctly in KubernetesResourceManager#onError
[ https://issues.apache.org/jira/browse/FLINK-17177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17085601#comment-17085601 ]

Yang Wang commented on FLINK-17177:
-----------------------------------

I find that the fabric8 kubernetes client has a different implementation from the K8s go client: it will never produce an {{onError}} {{WatchEvent}} on the client side[1]. But I am not familiar enough with the K8s ApiServer to know when it will return an {{Error}} type {{WatchEvent}}. [~felixzheng] Could you share some insight with me?

[1]. [https://github.com/fabric8io/kubernetes-client/blob/1afea0c364a48d9a8745539f84de59630ef6f559/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/dsl/internal/WatchHTTPManager.java#L289]
[jira] [Commented] (FLINK-17177) Handle ERROR event correctly in KubernetesResourceManager#onError
[ https://issues.apache.org/jira/browse/FLINK-17177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17085442#comment-17085442 ]

Canbin Zheng commented on FLINK-17177:
--------------------------------------

{quote}I have posted the K8s {{WatchEvent}} documentation here[1]. I do not find that the "Error" type means "HTTP error". So could you share some information about how the "Error" type is introduced by an HTTP layer error?

[1]. [https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#watchevent-v1-meta]
{quote}
One case in K8s client-go is [https://github.com/kubernetes/kubernetes/blob/343c1e7636fe5c75cdd378c0b170b26935806de5/staging/src/k8s.io/apimachinery/pkg/watch/streamwatcher.go#L121]

Also, the K8s server could probably send an {{ERROR}} event if something goes wrong in the HTTP stream.
[jira] [Commented] (FLINK-17177) Handle ERROR event correctly in KubernetesResourceManager#onError
[ https://issues.apache.org/jira/browse/FLINK-17177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17085412#comment-17085412 ]

Yang Wang commented on FLINK-17177:
-----------------------------------

{code:java}
Object is:
  If Type is Added or Modified: the new state of the object.
  If Type is Deleted: the state of the object immediately before deletion.
  If Type is Error: Status is recommended; other types may make sense depending on context.
{code}
I have posted the K8s {{WatchEvent}} documentation here[1]. I do not find that the "Error" type means "HTTP error". So could you share some information about how the "Error" type is introduced by an HTTP layer error?

[1]. [https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.10/#watchevent-v1-meta]
[jira] [Commented] (FLINK-17177) Handle ERROR event correctly in KubernetesResourceManager#onError
[ https://issues.apache.org/jira/browse/FLINK-17177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17085378#comment-17085378 ]

Yang Wang commented on FLINK-17177:
-----------------------------------

Hi [~felixzheng], thanks for creating this insightful ticket.

I am not sure whether we need to create a new watcher here, since the {{WatchConnectionManager}} in the fabric8 kubernetes client has internal retry logic for http/websocket failures.

Another concern is what happens if the reconnect limit is exhausted, perhaps because the pod network or the K8s api server is down. I think we need to throw a fatal error and exit the jobmanager pod. Then a new jobmanager pod will be started to take over. This is also the logic for the YARN deployment: if the {{AMRMClient}} heartbeat with the YARN ResourceManager fails (the ipc client has retried enough times), then {{YarnResourceManager}} will also call onFatalError.
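The "retry internally, then fail fatally once the limit is exhausted" idea above can be sketched as follows. This is only an illustration of the control flow: the class and method names are hypothetical, the reconnect action and the fatal-error handler are passed in as plain functions, and nothing here reflects how {{WatchConnectionManager}} or {{YarnResourceManager}} are actually implemented. Backoff between attempts is omitted for brevity.

```java
import java.util.function.BooleanSupplier;
import java.util.function.Consumer;

// Hypothetical sketch of bounded reconnects followed by a fatal error.
final class BoundedReconnectSketch {

    /**
     * Tries to reconnect up to maxAttempts times. If every attempt fails,
     * the fatal-error handler is invoked once.
     *
     * @return true if some reconnect attempt succeeded, false otherwise
     */
    static boolean reconnectOrFail(
            BooleanSupplier reconnect, int maxAttempts, Consumer<Throwable> onFatalError) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (reconnect.getAsBoolean()) {
                return true;
            }
        }
        onFatalError.accept(new IllegalStateException(
                "Watch could not be re-established after " + maxAttempts + " attempts"));
        return false;
    }

    public static void main(String[] args) {
        // A reconnect action that always fails, to exercise the fatal path.
        boolean reconnected = reconnectOrFail(
                () -> false, 3, t -> System.err.println("Fatal: " + t.getMessage()));
        System.out.println("reconnected = " + reconnected);
    }
}
```

In a real deployment the handler would be the place to terminate the jobmanager process so that a new pod can take over, as described in the comment.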