Hi all,

I did some tests based on the PR Christophe mentioned above. By changing the NettyClient to identify the server by its canonical host name instead of its host address (the IP), the SSL validation works!
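To illustrate the idea, here is a minimal sketch using only JDK classes. It is not the Flink NettyClient code or the PR diff; the class name PeerHostSketch and its methods are hypothetical. It assumes hostname verification is enabled on the engine, in which case the peer host string handed to the SSLEngine is what gets compared against the certificate's SAN entries. An IP string is matched as an IP address, which is what the "No subject alternative names matching IP address" error in the thread points to, while a DNS name is matched against the DNS entries.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.security.NoSuchAlgorithmException;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLParameters;

// Hypothetical helper, for illustration only (not part of Flink).
final class PeerHostSketch {

    // Picks the string used to identify the server during the handshake.
    static String peerHost(InetSocketAddress server, boolean useHostName) {
        InetAddress addr = server.getAddress();
        return useHostName
                ? addr.getCanonicalHostName() // DNS name, e.g. a Kubernetes service name
                : addr.getHostAddress();      // textual IP, e.g. 172.30.247.163
    }

    // Builds a client-side engine that verifies the server certificate
    // against the chosen peer host.
    static SSLEngine clientEngine(InetSocketAddress server, boolean useHostName) {
        SSLContext ctx;
        try {
            // Assumption: the JVM default context; a real setup would load
            // its own keystore and truststore.
            ctx = SSLContext.getDefault();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("No default SSLContext available", e);
        }
        SSLEngine engine = ctx.createSSLEngine(peerHost(server, useHostName), server.getPort());
        engine.setUseClientMode(true);

        // Enable endpoint identification so the peer host is checked against the SANs.
        SSLParameters params = engine.getSSLParameters();
        params.setEndpointIdentificationAlgorithm("HTTPS");
        engine.setSSLParameters(params);
        return engine;
    }
}
```

With useHostName set to false the engine is created with the IP, so a certificate that only lists DNS names in its SAN would be rejected. With true, the name comes from a reverse lookup, so it only helps if that lookup returns a name covered by the certificate.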
I created a PR with this change: https://github.com/apache/flink/pull/5789

Regards,
Edward

2018-03-28 17:22 GMT+02:00 Edward Alexander Rojas Clavijo <edward.roja...@gmail.com>:

> Hi Till,
>
> I just created the JIRA ticket: https://issues.apache.org/jira/browse/FLINK-9103
>
> I added the JobManager and TaskManager logs, Hope this helps to resolve
> the issue.
>
> Regards,
> Edward
>
> 2018-03-27 17:48 GMT+02:00 Till Rohrmann <trohrm...@apache.org>:
>
>> Hi Edward,
>>
>> could you please file a JIRA issue for this problem. It might be as
>> simple as that the TaskManager's network stack uses the IP instead of the
>> hostname as you suggested. But we have to look into this to be sure. Also
>> the logs of the JobManager as well as the TaskManagers could be helpful.
>>
>> Cheers,
>> Till
>>
>> On Tue, Mar 27, 2018 at 5:17 PM, Christophe Jolif <cjo...@gmail.com>
>> wrote:
>>
>>> I suspect this relates to: https://issues.apache.org/jira/browse/FLINK-5030
>>>
>>> For which there was a PR at some point but nothing has been done so far.
>>> It seems the current code explicitly uses the IP vs Hostname for Netty SSL
>>> configuration.
>>>
>>> Without that I'm really wondering how people are reasonably using SSL on
>>> a Kubernetes Flink-based cluster as every time a pod is (re-started) it can
>>> theoretically take a different IP? Or do I miss something?
>>>
>>> --
>>> Christophe
>>>
>>> On Tue, Mar 27, 2018 at 3:24 PM, Edward Alexander Rojas Clavijo <
>>> edward.roja...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Currently I have a Flink 1.4 cluster running on kubernetes and with SSL
>>>> configuration based on
>>>> https://ci.apache.org/projects/flink/flink-docs-master/ops/security-ssl.html.
>>>>
>>>> However, as the IP of the nodes are dynamic (from the nature of
>>>> kubernetes), we are using only the DNS which we can control using
>>>> kubernetes services. So we add to the Subject Alternative Name(SAN) the
>>>> flink-jobmanager DNS and also the DNS for the task managers
>>>> *.flink-taskmanager-svc (each task manager has a DNS in the form
>>>> flink-taskmanager-0.flink-taskmanager-svc).
>>>>
>>>> Additionally we set the jobmanager.rpc.address property on all the
>>>> nodes and each task manager sets the taskmanager.host property, all
>>>> matching the ones on the certificate.
>>>>
>>>> This is working well when using Job with Parallelism set to 1. The SSL
>>>> validations are good and the Jobmanager can communicate with Task manager
>>>> and vice versa.
>>>>
>>>> But when we set the parallelism to more than 1 we have exceptions on
>>>> the SSL validation like this:
>>>>
>>>> Caused by: java.security.cert.CertificateException: No subject alternative names matching IP address 172.30.247.163 found
>>>> at sun.security.util.HostnameChecker.matchIP(HostnameChecker.java:168)
>>>> at sun.security.util.HostnameChecker.match(HostnameChecker.java:94)
>>>> at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:455)
>>>> at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:436)
>>>> at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:252)
>>>> at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:136)
>>>> at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1601)
>>>> ... 21 more
>>>>
>>>> From the logs I see the Jobmanager is correctly registering the
>>>> taskmanagers:
>>>>
>>>> org.apache.flink.runtime.instance.InstanceManager - Registered
>>>> TaskManager at flink-taskmanager-1
>>>> (akka.ssl.tcp://flink@taiga-flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local:6122/user/taskmanager)
>>>> as 1a3f59693cec8b3929ed8898edcc2700. Current number of registered
>>>> hosts is 3. Current number of alive task slots is 6.
>>>>
>>>> And also each taskmanager is correctly registered to use the hostname
>>>> for communication:
>>>>
>>>> org.apache.flink.runtime.taskmanager.TaskManager - TaskManager will
>>>> use hostname/address
>>>> 'flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local'
>>>> (172.30.247.163) for communication.
>>>> ...
>>>> akka.remote.Remoting - Remoting started; listening on addresses
>>>> :[akka.ssl.tcp://flink@flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local:6122]
>>>> ...
>>>> org.apache.flink.runtime.io.network.netty.NettyConfig - NettyConfig
>>>> [server address:
>>>> flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local/172.30.247.163,
>>>> server port: 6121, ssl enabled: true, memory segment size (bytes): 32768,
>>>> transport type: NIO, number of server threads: 2 (manual), number of
>>>> client threads: 2 (manual), server connect backlog: 0 (use Netty's
>>>> default), client connect timeout (sec): 120, send/receive buffer size
>>>> (bytes): 0 (use Netty's default)]
>>>> ...
>>>> org.apache.flink.runtime.taskmanager.TaskManager - TaskManager data
>>>> connection information: bf4a9b50e57c99c17049adb66d65f685 @
>>>> flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local
>>>> (dataPort=6121)
>>>>
>>>> But even with that, it seems like the taskmanagers are using the IP
>>>> communicate between them and the SSL validation fails.
>>>>
>>>> Do you know if it's possible to make the taskmanagers to use the
>>>> hostname to communicate instead of the IP ?
>>>> or
>>>> Do you have any advice to get the SSL configuration to work on this
>>>> environment ?
>>>>
>>>> Thanks in advance.
>>>>
>>>> Regards,
>>>> Edward
>>>
>>> --
>>> Christophe
>
> --
> *Edward Alexander Rojas Clavijo*
>
> *Software Engineer*
> *Hybrid Cloud*
> *IBM France*