Sebastian Struß created FLINK-29535:
---------------------------------------
Summary: Flink Operator Certificate renew issue
Key: FLINK-29535
URL: https://issues.apache.org/jira/browse/FLINK-29535
Project: Flink
Issue Type: Bug
Components: Kubernetes Operator
Reporter: Sebastian Struß

It seems that there is an issue with the Kubernetes Operator (at least in version 1.1.0) when it comes to certificates for the webhook.

We've seen this error message pop up in the logs:

An exceptionCaught() event was fired, and it reached at the tail of the pipeline. It usually means the last handler in the pipeline did not handle the exception.

and

javax.net.ssl.SSLHandshakeException: Received fatal alert: bad_certificate
    at sun.security.ssl.Alert.createSSLException(Unknown Source) ~[?:?]
    at sun.security.ssl.Alert.createSSLException(Unknown Source) ~[?:?]
    at sun.security.ssl.TransportContext.fatal(Unknown Source) ~[?:?]
    at sun.security.ssl.Alert$AlertConsumer.consume(Unknown Source) ~[?:?]
    at sun.security.ssl.TransportContext.dispatch(Unknown Source) ~[?:?]
    at sun.security.ssl.SSLTransport.decode(Unknown Source) ~[?:?]
    at sun.security.ssl.SSLEngineImpl.decode(Unknown Source) ~[?:?]
    at sun.security.ssl.SSLEngineImpl.readRecord(Unknown Source) ~[?:?]
    at sun.security.ssl.SSLEngineImpl.unwrap(Unknown Source) ~[?:?]
    at sun.security.ssl.SSLEngineImpl.unwrap(Unknown Source) ~[?:?]
    at javax.net.ssl.SSLEngine.unwrap(Unknown Source) ~[?:?]
    at org.apache.flink.shaded.netty4.io.netty.handler.ssl.SslHandler$SslEngineType$3.unwrap(SslHandler.java:296) ~[flink-kubernetes-operator-1.1.0-shaded.jar:1.1.0]
    at org.apache.flink.shaded.netty4.io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1342) ~[flink-kubernetes-operator-1.1.0-shaded.jar:1.1.0]
    at org.apache.flink.shaded.netty4.io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1235) ~[flink-kubernetes-operator-1.1.0-shaded.jar:1.1.0]
    at org.apache.flink.shaded.netty4.io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1284) ~[flink-kubernetes-operator-1.1.0-shaded.jar:1.1.0]
    at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507) ~[flink-kubernetes-operator-1.1.0-shaded.jar:1.1.0]
    at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446) ~[flink-kubernetes-operator-1.1.0-shaded.jar:1.1.0]

This happens when our fluxcd tries to update the FlinkDeployment resource. The update seems to trigger a webhook call to an endpoint in the operator, which is serving a certificate that is invalid by then.

We noticed this after the operator had been running for 18 days, so maybe something short-lived was not renewed correctly?

--
This message was sent by Atlassian Jira
(v8.20.10#820010)