Hi all,

I ran some tests based on the PR Christophe mentioned above. After changing
NettyClient to identify the server by its canonical host name instead of its
host address (the IP), the SSL validation works!

I created a PR with this change: https://github.com/apache/flink/pull/5789
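
For anyone following along, here is a rough sketch of why the choice of peer
host matters. It is not the actual NettyClient code: the class and method names
are made up for illustration, the host name is shortened from the logs quoted
below, and it assumes that the "HTTPS" endpoint identification algorithm is
what triggers the hostname check seen in the stack trace. It uses getHostName()
rather than getCanonicalHostName() so that it runs without a reverse DNS lookup.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.security.NoSuchAlgorithmException;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLEngine;
import javax.net.ssl.SSLParameters;

// Hypothetical illustration class, not part of Flink.
final class PeerHostSketch {

    // Builds an address that carries both a name and an IP, without any DNS lookup.
    static InetSocketAddress exampleServer() {
        byte[] ip = {(byte) 172, (byte) 30, (byte) 247, (byte) 163};
        try {
            InetAddress addr = InetAddress.getByAddress(
                    "flink-taskmanager-1.flink-taskmanager-svc", ip);
            return new InetSocketAddress(addr, 6121);
        } catch (UnknownHostException e) {
            throw new IllegalStateException(e);
        }
    }

    // Peer host as an IP literal: hostname verification then looks for an IP SAN.
    static String peerHostByAddress(InetSocketAddress server) {
        return server.getAddress().getHostAddress();
    }

    // Peer host as a name: hostname verification then looks for a DNS SAN.
    static String peerHostByName(InetSocketAddress server) {
        return server.getAddress().getHostName();
    }

    // Creates a client-mode engine that records peerHost for endpoint identification.
    static SSLEngine clientEngine(String peerHost, int port) {
        try {
            SSLEngine engine = SSLContext.getDefault().createSSLEngine(peerHost, port);
            engine.setUseClientMode(true);
            SSLParameters params = engine.getSSLParameters();
            // Assumption: hostname verification is enabled via the "HTTPS" algorithm.
            params.setEndpointIdentificationAlgorithm("HTTPS");
            engine.setSSLParameters(params);
            return engine;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        InetSocketAddress server = exampleServer();
        System.out.println("by address: " + peerHostByAddress(server));
        System.out.println("by name:    " + peerHostByName(server));
        SSLEngine engine = clientEngine(peerHostByName(server), server.getPort());
        System.out.println("engine peer host: " + engine.getPeerHost());
    }
}
```

When the peer host is an IP literal, the JDK's hostname check looks for an
IP-type SAN entry, which the certificates described in the original message do
not contain. When the peer host is a name, the check is made against the DNS
entries instead.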

Regards,
Edward

2018-03-28 17:22 GMT+02:00 Edward Alexander Rojas Clavijo <
edward.roja...@gmail.com>:

> Hi Till,
>
> I just created the JIRA ticket:
> https://issues.apache.org/jira/browse/FLINK-9103
>
> I added the JobManager and TaskManager logs. I hope this helps to resolve
> the issue.
>
> Regards,
> Edward
>
> 2018-03-27 17:48 GMT+02:00 Till Rohrmann <trohrm...@apache.org>:
>
>> Hi Edward,
>>
>> Could you please file a JIRA issue for this problem? It might be as
>> simple as the TaskManager's network stack using the IP instead of the
>> hostname, as you suggested. But we have to look into this to be sure. The
>> logs of the JobManager and the TaskManagers would also be helpful.
>>
>> Cheers,
>> Till
>>
>> On Tue, Mar 27, 2018 at 5:17 PM, Christophe Jolif <cjo...@gmail.com>
>> wrote:
>>
>>>
>>> I suspect this relates to:
>>> https://issues.apache.org/jira/browse/FLINK-5030
>>>
>>> There was a PR for it at some point, but nothing has been done so far.
>>> It seems the current code explicitly uses the IP rather than the hostname
>>> for the Netty SSL configuration.
>>>
>>> Without that, I'm really wondering how people can reasonably use SSL on
>>> a Kubernetes-based Flink cluster, since every time a pod is (re)started it
>>> can theoretically get a different IP. Or am I missing something?
>>>
>>> --
>>> Christophe
>>>
>>> On Tue, Mar 27, 2018 at 3:24 PM, Edward Alexander Rojas Clavijo <
>>> edward.roja...@gmail.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Currently I have a Flink 1.4 cluster running on Kubernetes, with SSL
>>>> configured based on
>>>> https://ci.apache.org/projects/flink/flink-docs-master/ops/security-ssl.html.
>>>>
>>>> However, as the IPs of the nodes are dynamic (by the nature of
>>>> Kubernetes), we rely only on DNS names, which we can control using
>>>> Kubernetes services. So we add to the Subject Alternative Name (SAN) the
>>>> flink-jobmanager DNS name and also the wildcard DNS name for the task
>>>> managers, *.flink-taskmanager-svc (each task manager has a DNS name of the
>>>> form flink-taskmanager-0.flink-taskmanager-svc).
>>>>
>>>> Additionally, we set the jobmanager.rpc.address property on all the
>>>> nodes, and each task manager sets the taskmanager.host property, all
>>>> matching the names in the certificate.
>>>>
>>>> This works well when running a job with parallelism set to 1. The SSL
>>>> validation succeeds and the JobManager can communicate with the
>>>> TaskManagers and vice versa.
>>>>
>>>> But when we set the parallelism to more than 1 we have exceptions on
>>>> the SSL validation like this:
>>>>
>>>> Caused by: java.security.cert.CertificateException: No subject alternative names matching IP address 172.30.247.163 found
>>>>     at sun.security.util.HostnameChecker.matchIP(HostnameChecker.java:168)
>>>>     at sun.security.util.HostnameChecker.match(HostnameChecker.java:94)
>>>>     at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:455)
>>>>     at sun.security.ssl.X509TrustManagerImpl.checkIdentity(X509TrustManagerImpl.java:436)
>>>>     at sun.security.ssl.X509TrustManagerImpl.checkTrusted(X509TrustManagerImpl.java:252)
>>>>     at sun.security.ssl.X509TrustManagerImpl.checkServerTrusted(X509TrustManagerImpl.java:136)
>>>>     at sun.security.ssl.ClientHandshaker.serverCertificate(ClientHandshaker.java:1601)
>>>>     ... 21 more
>>>>
>>>>
>>>> From the logs I see the Jobmanager is correctly registering the
>>>> taskmanagers:
>>>>
>>>> org.apache.flink.runtime.instance.InstanceManager   - Registered TaskManager at flink-taskmanager-1 (akka.ssl.tcp://flink@taiga-flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local:6122/user/taskmanager) as 1a3f59693cec8b3929ed8898edcc2700. Current number of registered hosts is 3. Current number of alive task slots is 6.
>>>>
>>>> And also each taskmanager is correctly registered to use the hostname
>>>> for communication:
>>>>
>>>> org.apache.flink.runtime.taskmanager.TaskManager   - TaskManager will use hostname/address 'flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local' (172.30.247.163) for communication.
>>>> ...
>>>> akka.remote.Remoting   - Remoting started; listening on addresses :[akka.ssl.tcp://flink@flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local:6122]
>>>> ...
>>>> org.apache.flink.runtime.io.network.netty.NettyConfig   - NettyConfig [server address: flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local/172.30.247.163, server port: 6121, ssl enabled: true, memory segment size (bytes): 32768, transport type: NIO, number of server threads: 2 (manual), number of client threads: 2 (manual), server connect backlog: 0 (use Netty's default), client connect timeout (sec): 120, send/receive buffer size (bytes): 0 (use Netty's default)]
>>>> ...
>>>> org.apache.flink.runtime.taskmanager.TaskManager   - TaskManager data connection information: bf4a9b50e57c99c17049adb66d65f685 @ flink-taskmanager-1.flink-taskmanager-svc.default.svc.cluster.local (dataPort=6121)
>>>>
>>>>
>>>>
>>>> But even with that, it seems the taskmanagers are using the IP to
>>>> communicate with each other, and the SSL validation fails.
>>>>
>>>> Do you know if it's possible to make the taskmanagers use the
>>>> hostname instead of the IP to communicate?
>>>> Or
>>>> do you have any advice on getting the SSL configuration to work in this
>>>> environment?
>>>>
>>>> Thanks in advance.
>>>>
>>>> Regards,
>>>> Edward
>>>>
>>>
>>>
>>>
>>> --
>>> Christophe
>>>
>>
>>
>
>
> --
> *Edward Alexander Rojas Clavijo*
>
>
>
> *Software Engineer, Hybrid Cloud, IBM France*
>
