[ https://issues.apache.org/jira/browse/FLINK-31974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719355#comment-17719355 ]
Thomas Weise commented on FLINK-31974:
--------------------------------------

There are many cases where errors are transient. This specific case is quite obvious: resource availability on a large cluster changes constantly. A pod that cannot be scheduled now may be schedulable a few seconds later. Other k8s-related issues can also be transient. For example, a request that failed due to rate limiting will likely succeed soon after, and we would actually make things worse by not following a backoff/retry strategy and simply letting the job fail. I'm also leaning towards a retry-by-default strategy and identifying the cases that should be fatal errors.

> JobManager crashes after KubernetesClientException exception with FatalExitExceptionHandler
> -------------------------------------------------------------------------------------------
>
>                 Key: FLINK-31974
>                 URL: https://issues.apache.org/jira/browse/FLINK-31974
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.17.0
>            Reporter: Sergio Sainz
>            Assignee: Weijie Guo
>            Priority: Major
>
> When the resource quota limit is reached, the JobManager throws:
>
> org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is forbidden: exceeded quota: my-namespace-resource-quota, requested: limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
>
> In {*}1.16.1, this is handled gracefully{*}:
> {code}
> 2023-04-28 22:07:24,631 WARN org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] - Failed requesting worker with resource spec WorkerResourceSpec \{cpuCores=1.0, taskHeapSize=25.600mb (26843542 bytes), taskOffHeapSize=0 bytes, networkMemSize=64.000mb (67108864 bytes), managedMemSize=230.400mb (241591914 bytes), numSlots=4}, current pending count: 0
> java.util.concurrent.CompletionException: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-138" is forbidden: exceeded quota: my-namespace-resource-quota, requested: limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
>     at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
>     at java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) ~[?:?]
>     at java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown Source) ~[?:?]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
>     at java.lang.Thread.run(Unknown Source) ~[?:?]
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-138" is forbidden: exceeded quota: my-namespace-resource-quota, requested: limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
>     at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:684) ~[flink-dist-1.16.1.jar:1.16.1]
>     at io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:664) ~[flink-dist-1.16.1.jar:1.16.1]
>     at io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:613) ~[flink-dist-1.16.1.jar:1.16.1]
>     at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:558) ~[flink-dist-1.16.1.jar:1.16.1]
>     at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521) ~[flink-dist-1.16.1.jar:1.16.1]
>     at io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:308) ~[flink-dist-1.16.1.jar:1.16.1]
>     at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644) ~[flink-dist-1.16.1.jar:1.16.1]
>     at io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83) ~[flink-dist-1.16.1.jar:1.16.1]
>     at io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61) ~[flink-dist-1.16.1.jar:1.16.1]
>     at org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$createTaskManagerPod$1(Fabric8FlinkKubeClient.java:163) ~[flink-dist-1.16.1.jar:1.16.1]
>     ... 4 more
> {code}
> But {*}in Flink 1.17.0, Job Manager crashes{*}:
> {code}
> 2023-04-28 20:50:50,534 ERROR org.apache.flink.util.FatalExitExceptionHandler [] - FATAL: Thread 'flink-akka.actor.default-dispatcher-15' produced an uncaught exception. Stopping the process...
> java.util.concurrent.CompletionException: org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is forbidden: exceeded quota: my-namespace-resource-quota, requested: limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
>     at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
>     at java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) ~[?:?]
>     at java.util.concurrent.CompletableFuture$AsyncRun.run(Unknown Source) ~[?:?]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) ~[?:?]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) ~[?:?]
>     at java.lang.Thread.run(Unknown Source) ~[?:?]
> Caused by: org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://10.96.0.1/api/v1/namespaces/my-namespace/pods. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. pods "my-namespace-flink-cluster-taskmanager-1-2" is forbidden: exceeded quota: my-namespace-resource-quota, requested: limits.cpu=3, used: limits.cpu=12100m, limited: limits.cpu=13.
>     at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:684) ~[flink-dist-1.17.0.jar:1.17.0]
>     at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:664) ~[flink-dist-1.17.0.jar:1.17.0]
>     at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:613) ~[flink-dist-1.17.0.jar:1.17.0]
>     at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:558) ~[flink-dist-1.17.0.jar:1.17.0]
>     at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521) ~[flink-dist-1.17.0.jar:1.17.0]
>     at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:308) ~[flink-dist-1.17.0.jar:1.17.0]
>     at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:644) ~[flink-dist-1.17.0.jar:1.17.0]
>     at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:83) ~[flink-dist-1.17.0.jar:1.17.0]
>     at org.apache.flink.kubernetes.shaded.io.fabric8.kubernetes.client.dsl.base.CreateOnlyResourceOperation.create(CreateOnlyResourceOperation.java:61) ~[flink-dist-1.17.0.jar:1.17.0]
>     at org.apache.flink.kubernetes.kubeclient.Fabric8FlinkKubeClient.lambda$createTaskManagerPod$1(Fabric8FlinkKubeClient.java:163) ~[flink-dist-1.17.0.jar:1.17.0]
>     ... 4 more
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
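
The retry-by-default idea from the comment at the top of this message could look roughly like the sketch below. It is not Flink's or fabric8's code: the class `RetryWithBackoff`, the exception type `ApiException`, the helper names, and the choice of 400 and 401 as the only fatal status codes are all hypothetical, picked purely for illustration. The sketch treats every failure as transient unless it is explicitly listed as fatal, and waits with a capped exponential backoff between attempts.

```java
import java.util.function.Supplier;

final class RetryWithBackoff {

    /** Hypothetical stand-in for a client exception that carries an HTTP status code. */
    static final class ApiException extends RuntimeException {
        final int code;

        ApiException(int code, String message) {
            super(message);
            this.code = code;
        }
    }

    /**
     * Assumption for illustration only: errors are retried by default and only
     * the codes listed here are treated as fatal. A quota rejection (403 in the
     * log above) is deliberately not in this list.
     */
    static boolean isFatal(int code) {
        return code == 400 || code == 401;
    }

    /** Exponential backoff: base, 2*base, 4*base, ... capped at maxMillis. */
    static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
        long delay = baseMillis;
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    /** Runs the request, retrying non-fatal failures until maxAttempts is used up. */
    static <T> T callWithRetry(Supplier<T> request, int maxAttempts,
                               long baseMillis, long maxMillis) {
        for (int attempt = 0; ; attempt++) {
            try {
                return request.get();
            } catch (ApiException e) {
                // Give up on fatal errors or when no attempts are left.
                if (isFatal(e.code) || attempt + 1 >= maxAttempts) {
                    throw e;
                }
                try {
                    Thread.sleep(backoffMillis(attempt, baseMillis, maxMillis));
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while backing off", ie);
                }
            }
        }
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated pod creation: rejected twice by the quota, then accepted.
        String result = callWithRetry(() -> {
            if (calls[0]++ < 2) {
                throw new ApiException(403, "exceeded quota");
            }
            return "pod created";
        }, 5, 100, 1000);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Keeping the fatal list small and explicit matches the direction suggested in the comment: anything not known to be unrecoverable gets another chance, and the cap on the delay keeps a long outage from stretching the wait indefinitely.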