After upgrading the Flink Kubernetes Operator from v1.11 to v1.12, upgrades started to fail in all my jobs with the following error message:

```
Error during event processing ExecutionScope{ resource id: ResourceID{name='my-job-checkpoint-periodic-1741010907590', namespace='platform'}, version: 2446801878}
```

The upgrade was failing in a very weird way:

- First, a savepoint was taken and uploaded to S3
- After some time, that savepoint was removed from S3 but not from the cluster CR
- This made the upgrade fail because the savepoint could not be found

Could this be related to the following change from the 1.12.0 release announcement?

- https://flink.apache.org/2025/06/03/apache-flink-kubernetes-operator-1.12.0-release-announcement/#bug-fixes-and-stability-enhancements

*Savepoint Information Update*: Fixed a bug where upgrade savepoints were not added to the deprecated savepointInfo, ensuring accurate tracking of savepoints during upgrades.

In case it helps, here is the complete stack trace:

```json
{
  "threadId": 352,
  "loggerFqcn": "org.apache.logging.slf4j.Log4jLogger",
  "level": "ERROR",
  "thrown": {
    "extendedStackTrace": [
      { "file": "Controller.java", "method": "cleanup", "line": 212, "exact": false, "location": "flink-kubernetes-operator-1.12.0-shaded.jar", "class": "io.javaoperatorsdk.operator.processing.Controller", "version": "1.12.0" },
      { "file": "ReconciliationDispatcher.java", "method": "handleCleanup", "line": 291, "exact": false, "location": "flink-kubernetes-operator-1.12.0-shaded.jar", "class": "io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher", "version": "1.12.0" },
      { "file": "ReconciliationDispatcher.java", "method": "handleDispatch", "line": 89, "exact": false, "location": "flink-kubernetes-operator-1.12.0-shaded.jar", "class": "io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher", "version": "1.12.0" },
      { "file": "ReconciliationDispatcher.java", "method": "handleExecution", "line": 64, "exact": false, "location": "flink-kubernetes-operator-1.12.0-shaded.jar", "class": "io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher", "version": "1.12.0" },
      { "file": "EventProcessor.java", "method": "run", "line": 452, "exact": true, "location": "flink-kubernetes-operator-1.12.0-shaded.jar", "class": "io.javaoperatorsdk.operator.processing.event.EventProcessor$ReconcilerExecutor", "version": "1.12.0" },
      { "method": "runWorker", "line": -1, "exact": true, "location": "?", "class": "java.util.concurrent.ThreadPoolExecutor", "version": "?" },
      { "method": "run", "line": -1, "exact": true, "location": "?", "class": "java.util.concurrent.ThreadPoolExecutor$Worker", "version": "?" },
      { "method": "run", "line": -1, "exact": true, "location": "?", "class": "java.lang.Thread", "version": "?" }
    ],
    "localizedMessage": "java.lang.NullPointerException",
    "name": "io.javaoperatorsdk.operator.OperatorException",
    "cause": {
      "extendedStackTrace": [
        { "file": "FlinkResourceContextFactory.java", "method": "getFlinkStateSnapshotContext", "line": 96, "exact": false, "location": "flink-kubernetes-operator-1.12.0-shaded.jar", "class": "org.apache.flink.kubernetes.operator.service.FlinkResourceContextFactory", "version": "1.12.0" },
        { "file": "FlinkStateSnapshotController.java", "method": "cleanup", "line": 97, "exact": false, "location": "flink-kubernetes-operator-1.12.0-shaded.jar", "class": "org.apache.flink.kubernetes.operator.controller.FlinkStateSnapshotController", "version": "1.12.0" },
        { "file": "FlinkStateSnapshotController.java", "method": "cleanup", "line": 55, "exact": false, "location": "flink-kubernetes-operator-1.12.0-shaded.jar", "class": "org.apache.flink.kubernetes.operator.controller.FlinkStateSnapshotController", "version": "1.12.0" },
        { "file": "Controller.java", "method": "execute", "line": 199, "exact": false, "location": "flink-kubernetes-operator-1.12.0-shaded.jar", "class": "io.javaoperatorsdk.operator.processing.Controller$2", "version": "1.12.0" },
        { "file": "Controller.java", "method": "execute", "line": 162, "exact": false, "location": "flink-kubernetes-operator-1.12.0-shaded.jar", "class": "io.javaoperatorsdk.operator.processing.Controller$2", "version": "1.12.0" },
        { "file": "OperatorJosdkMetrics.java", "method": "timeControllerExecution", "line": 80, "exact": false, "location": "flink-kubernetes-operator-1.12.0-shaded.jar", "class": "org.apache.flink.kubernetes.operator.metrics.OperatorJosdkMetrics", "version": "1.12.0" },
        { "file": "Controller.java", "method": "cleanup", "line": 161, "exact": false, "location": "flink-kubernetes-operator-1.12.0-shaded.jar", "class": "io.javaoperatorsdk.operator.processing.Controller", "version": "1.12.0" }
      ],
      "name": "java.lang.NullPointerException",
      "commonElementCount": 7
    },
    "commonElementCount": 0,
    "message": "java.lang.NullPointerException"
  },
  "endOfBatch": false,
  "thread": "ReconcilerExecutor-flinkstatesnapshotcontroller-352",
  "loggerName": "io.javaoperatorsdk.operator.processing.event.EventProcessor",
  "threadPriority": 5,
  "instant": { "epochSecond": 1750744905, "nanoOfSecond": 13000000 }
}
```

On 2025/03/04 08:29:20 Salva Alcántara wrote:
> Hey all! I recently bumped the Flink Kubernetes Operator to v1.10.0 and one
> of the things I wanted to check is the usage of the new FlinkStateSnapshot
> CRD. I confirmed that the CRD was correctly created in my cluster, however
> I'm still seeing these logs:
>
> ```
> Starting Operator
> 2025-03-01T08:31:08.779422Z main ERROR appender CONSOLE has no parameter that matches element JsonLayout
> 2025-03-01T08:31:08.782927Z main ERROR Unable to locate appender "ConsoleAppender" for logger config "root"
> 2025-03-01 08:31:12,885 i.f.k.c.d.i.VersionUsageUtils [WARN ] The client is using resource type 'flinkstatesnapshots' with unstable version 'v1beta1'
> 2025-03-01 08:31:14,180 o.a.f.k.o.c.FlinkConfigManager [WARN ] FlinkStateSnapshot CRD was not installed, snapshot resources will be disabled!
> ```
>
> I think this relates to the RBAC stuff. For what it's worth, the
> FlinkStateSnapshot CRD was not installed log message goes away if I switch
> to a cluster-wide installation (which handles RBAC via clusterrole &
> clusterrolebinding). However, for a namespaced installation like mine
> (using a non-empty array for watchNamespaces) there must be something
> wrong, despite RBAC apparently being right, i.e.:
>
> ```
> kubectl auth can-i list flinkstatesnapshot -n a-watched-namespace --as=system:serviceaccount:flink-operator:flink-operator
> yes
> ```
>
> The answer is the same for any namespace within watchNamespaces (w.r.t.
> flink-operator, which is where I deploy the operator).
>
> The issue might be in this line:
>
> -
> https://github.com/apache/flink-kubernetes-operator/blob/9eb3c385b90a5a2f08376720f[ …]ache/flink/kubernetes/operator/utils/KubernetesClientUtils.java
> <https://github.com/apache/flink-kubernetes-operator/blob/9eb3c385b90a5a2f08376720f3204d1784981a0c/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/utils/KubernetesClientUtils.java#L72C31-L72C67>
>
> which is not passing any special config, maybe the idea was to use
> getKubernetesClient instead? Can anyone help troubleshoot the issue?
>