I am not saying that we shouldn't add a strong authentication mechanism if
there are good reasons for it. I primarily would like to understand the
context a bit better in order to give qualified feedback and come to a good
decision. To be honest, I have the feeling that we haven't yet fully
considered all the options that are on the table.

Does the problem of certificate expiry also apply to self-signed
certificates? If yes, then this should also be a problem for the
internal encryption of Flink's communication. If not, then one could use
self-signed certificates with a longer validity to solve the mentioned
issue.

I think you can set up Flink in such a way that you don't have to handle
all the different certificates. For example, you could deploy Flink with a
"sidecar proxy" which is responsible for the authentication using an
arbitrary method (e.g. Kerberos) and then bind the REST endpoint to a local
network interface. That way, the REST endpoint would only be available
through the sidecar proxy. Additionally, one could enable SSL for this
communication. Would this be a solution for the problem?
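
To make the idea a bit more concrete, here is a minimal sketch of such a
sidecar in Python. It is only an illustration under assumptions: HTTP Basic
stands in for the real authentication method (e.g. Kerberos), the USERS
table is hypothetical, and Flink's REST endpoint is assumed to listen only
on the loopback interface at port 8081 (e.g. via rest.bind-address).

```python
import base64
import hmac
import http.server
import urllib.request

# Assumption: Flink's REST endpoint is bound to the loopback interface only.
FLINK_REST = "http://127.0.0.1:8081"
# Hypothetical credential store; a real sidecar would delegate to Kerberos, LDAP, etc.
USERS = {"alice": "secret"}


def is_authorized(auth_header):
    """Return True if the header carries valid HTTP Basic credentials."""
    if not auth_header or not auth_header.startswith("Basic "):
        return False
    try:
        decoded = base64.b64decode(auth_header[len("Basic "):]).decode("utf-8")
    except ValueError:  # covers malformed base64 and invalid UTF-8
        return False
    user, _, password = decoded.partition(":")
    expected = USERS.get(user)
    if expected is None:
        return False
    # Constant-time comparison to avoid leaking information via timing.
    return hmac.compare_digest(expected.encode("utf-8"), password.encode("utf-8"))


class SidecarProxy(http.server.BaseHTTPRequestHandler):
    """Rejects unauthenticated GET requests, forwards the rest to Flink."""

    def do_GET(self):
        if not is_authorized(self.headers.get("Authorization")):
            self.send_response(401)
            self.send_header("WWW-Authenticate", 'Basic realm="flink"')
            self.end_headers()
            return
        # Forward the authenticated request to the locally bound REST endpoint.
        with urllib.request.urlopen(FLINK_REST + self.path) as upstream:
            body = upstream.read()
            content_type = upstream.headers.get("Content-Type", "application/json")
            self.send_response(upstream.status)
            self.send_header("Content-Type", content_type)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
```

One could serve this with http.server.ThreadingHTTPServer on an external
interface, ideally behind TLS. Upstream error handling and HTTP methods other
than GET are left out of the sketch.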

Cheers,
Till

On Thu, Jun 3, 2021 at 10:46 PM Márton Balassi <balassi.mar...@gmail.com>
wrote:

> That is an interesting idea, Till.
>
> The main issue with it is that TLS certificates have an expiration time,
> usually they get approved for a couple years. Forcing our users to restart
> jobs to reprovision TLS certificates would be weird when we could just
> implement a single proper strong authentication mechanism instead in a
> couple hundred lines of code. :-)
>
> In many cases it is also impractical to go the TLS mutual route, because
> the Flink Dashboard can end up on any node in the k8s/Yarn cluster which
> means that we need a certificate per node (due to the mutual auth), but if
> we also want to protect the private key of these from users accidentally or
> intentionally leaking them then we need this per user. As in we end up
> managing user*machine number certificates and having to renew them
> periodically, which albeit automatable is unfortunately not yet automated
> in all large organizations.
>
> I fully agree that TLS certificate mutual authentication has its nice
> properties, especially at very large (multiple thousand node) clusters -
> but it has its own challenges too. Thanks for bringing it up.
>
> Happy to have this added to the rejected alternative list so that we have
> the full picture documented.
>
> On Thu, Jun 3, 2021 at 5:52 PM Till Rohrmann <trohrm...@apache.org> wrote:
>
>> I guess the idea would then be to let the proxy do the authentication job
>> and only forward the request via a mutually authenticated SSL connection to
>> the Flink cluster. Would this be possible? The beauty of this setup is, in
>> my opinion, that it should work with all kinds of authentication
>> mechanisms.
>>
>> Cheers,
>> Till
>>
>> On Thu, Jun 3, 2021 at 3:12 PM Gabor Somogyi <gabor.g.somo...@gmail.com>
>> wrote:
>>
>>> Thanks for giving options to fulfil the need.
>>>
>>> Users are looking for a solution where users can be identified across the
>>> whole cluster and access to resources/actions can be restricted.
>>> A good example of such an action is cancelling other users' running jobs.
>>>
>>> * SSL does provide mutual authentication, but once authentication has
>>> passed, no user-based restrictions can be applied.
>>> * The less problematic part is that generating/maintaining short-lived
>>> certificates would be hard (that's the reason KDC-like servers
>>> exist).
>>> Having long-lived certificates would widen the attack surface, but
>>> since the first concern already applies, this is just a cosmetic issue.
>>>
>>> All in all using TLS certificates is not sufficient in these
>>> environments unfortunately.
>>>
>>> BR,
>>> G
>>>
>>>
>>> On Thu, Jun 3, 2021 at 12:49 PM Till Rohrmann <trohrm...@apache.org>
>>> wrote:
>>>
>>>> Thanks for the information Gabor. If it is about securing the
>>>> communication between the REST client and the REST server, then Flink
>>>> already supports enabling mutual SSL authentication [1]. Would this be
>>>> enough to secure the communication and to pass an audit?
>>>>
>>>> [1]
>>>> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/security/security-ssl/#external--rest-connectivity
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Thu, Jun 3, 2021 at 10:33 AM Gabor Somogyi <
>>>> gabor.g.somo...@gmail.com> wrote:
>>>>
>>>>> Hi Till,
>>>>>
>>>>> Since I have been working in the security area for 10+ years, let me
>>>>> share my thoughts. I would like to emphasise that there are better
>>>>> experts than me, but I know the basics.
>>>>> The discussion is open and I'm not trying to decide things alone...
>>>>>
>>>>> > I mean if an attacker can get access to one of the machines, then it
>>>>> should also be possible to obtain the right Kerberos token.
>>>>> Not necessarily. For example, if one gets access to a specific user's
>>>>> credentials, then it's not possible to compromise other users' jobs,
>>>>> data, etc...
>>>>> Security is like an onion: the more layers have been added, the more
>>>>> time an attacker needs to proceed.
>>>>> At the end of the day, if one is in, then one can most probably find
>>>>> a way, but this time is normally enough for sysadmins or security
>>>>> experts to close down the system and minimize the damage.
>>>>>
>>>>> The other thing is that all tokens have a timeout, and if the token is
>>>>> invalid then the attacker can't proceed further.
>>>>>
>>>>> > Is Kerberos also the standard authentication protocol for Kubernetes
>>>>> deployments?
>>>>> Kerberos is an industry standard which is cloud/deployment agnostic
>>>>> and it can be used in any deployment, including k8s.
>>>>> The main intention is to use Kerberos in k8s deployments too, since
>>>>> we're going in this direction as well.
>>>>> Please see how Spark does this:
>>>>>
>>>>> https://spark.apache.org/docs/latest/security.html#secure-interaction-with-kubernetes
>>>>>
>>>>> Last but not least, the most important reason to add at least one
>>>>> strong authentication mechanism is that we have users who have hard
>>>>> requirements on this. They're doing security audits and if they fail,
>>>>> then it's a deal breaker.
>>>>> That is why we added Kerberos in the first place. Unfortunately we
>>>>> can't name them on this public list; however,
>>>>> the customers who specifically asked for this were mainly in the
>>>>> banking and telco sectors.
>>>>>
>>>>> BR,
>>>>> G
>>>>>
>>>>>
>>>>> On Thu, Jun 3, 2021 at 9:20 AM Till Rohrmann <trohrm...@apache.org>
>>>>> wrote:
>>>>>
>>>>> > Thanks for updating the document Márton. Why is it that banks will
>>>>> > consider it more secure if Flink comes with Kerberos authentication
>>>>> > (assuming a properly secured setup)? I mean if an attacker can get
>>>>> access
>>>>> > to one of the machines, then it should also be possible to obtain
>>>>> the right
>>>>> > Kerberos token.
>>>>> >
>>>>> > I am not an authentication expert and that's why I wanted to ask
>>>>> what are
>>>>> > other authentication protocols other than Kerberos? Why did we select
>>>>> > Kerberos and not any other authentication protocol? Maybe you can
>>>>> list the
>>>>> > pros and cons for the different protocols. Is Kerberos also the
>>>>> standard
>>>>> > authentication protocol for Kubernetes deployments? If not, what
>>>>> would be
>>>>> > the answer when deploying on K8s?
>>>>> >
>>>>> > Cheers,
>>>>> > Till
>>>>> >
>>>>> > On Wed, Jun 2, 2021 at 12:07 PM Gabor Somogyi <
>>>>> gabor.g.somo...@gmail.com>
>>>>> > wrote:
>>>>> >
>>>>> >> Hi team,
>>>>> >>
>>>>> >> Happy to be here and hope I can provide quality additions in the
>>>>> future.
>>>>> >>
>>>>> >> Thank you all for the helpful suggestions!
>>>>> >> Taking them into account, the FLIP has been modified and the work
>>>>> >> continues on the already existing Jira.
>>>>> >>
>>>>> >> BR,
>>>>> >> G
>>>>> >>
>>>>> >>
>>>>> >> On Wed, Jun 2, 2021 at 11:23 AM Márton Balassi <
>>>>> balassi.mar...@gmail.com>
>>>>> >> wrote:
>>>>> >>
>>>>> >>> Thanks, Chesnay - I totally missed that. Answered on the ticket
>>>>> too, let
>>>>> >>> us continue there then.
>>>>> >>>
>>>>> >>> Till, I agree that we should keep this codepath as slim as
>>>>> possible. It
>>>>> >>> is an important design decision that we aim to keep the list of
>>>>> >>> authentication protocols to a minimum. We believe that this should
>>>>> not be a
>>>>> >>> primary concern of Flink and a trusted proxy service (for example
>>>>> Apache
>>>>> >>> Knox) should be used to enable a multitude of end-user
>>>>> authentication
>>>>> >>> mechanisms. The bare minimum of authentication mechanisms to
>>>>> support
>>>>> >>> consequently consists of a single strong authentication protocol,
>>>>> for which
>>>>> >>> Kerberos is the enterprise solution, and HTTP Basic, primarily for
>>>>> development
>>>>> >>> and light-weight scenarios.
>>>>> >>>
>>>>> >>> Added the above wording to G's doc.
>>>>> >>>
>>>>> >>>
>>>>> https://docs.google.com/document/d/1NMPeJ9H0G49TGy3AzTVVJVKmYC0okwOtqLTSPnGqzHw/edit
>>>>> >>>
>>>>> >>>
>>>>> >>>
>>>>> >>> On Tue, Jun 1, 2021 at 11:47 AM Chesnay Schepler <
>>>>> ches...@apache.org>
>>>>> >>> wrote:
>>>>> >>>
>>>>> >>>> There's a related effort:
>>>>> >>>> https://issues.apache.org/jira/browse/FLINK-21108
>>>>> >>>>
>>>>> >>>> On 6/1/2021 10:14 AM, Till Rohrmann wrote:
>>>>> >>>> > Hi Gabor, welcome to the Flink community!
>>>>> >>>> >
>>>>> >>>> > Thanks for sharing this proposal with the community Márton. In
>>>>> >>>> general, I
>>>>> >>>> > agree that authentication is missing and that this is required
>>>>> for
>>>>> >>>> using
>>>>> >>>> > Flink within an enterprise. The thing I am wondering is whether
>>>>> this
>>>>> >>>> > feature strictly needs to be implemented inside of Flink or
>>>>> whether a
>>>>> >>>> proxy
>>>>> >>>> > setup could do the job? Have you considered this option? If
>>>>> yes, then
>>>>> >>>> it
>>>>> >>>> > would be good to list it under the point of rejected
>>>>> alternatives.
>>>>> >>>> >
>>>>> >>>> > I do see the benefit of implementing this feature inside of
>>>>> Flink if
>>>>> >>>> many
>>>>> >>>> > users need it. If not, then it might be easier for the project
>>>>> to not
>>>>> >>>> > increase the surface area since it makes the overall maintenance
>>>>> >>>> harder.
>>>>> >>>> >
>>>>> >>>> > Cheers,
>>>>> >>>> > Till
>>>>> >>>> >
>>>>> >>>> > On Mon, May 31, 2021 at 4:57 PM Márton Balassi <
>>>>> mbala...@apache.org>
>>>>> >>>> wrote:
>>>>> >>>> >
>>>>> >>>> >> Hi team,
>>>>> >>>> >>
>>>>> >>>> >> Firstly I would like to introduce Gabor or G [1] for short to
>>>>> the
>>>>> >>>> >> community, he is a Spark committer who has recently
>>>>> transitioned to
>>>>> >>>> the
>>>>> >>>> >> Flink Engineering team at Cloudera and is looking forward to
>>>>> >>>> contributing
>>>>> >>>> >> to Apache Flink. Previously G primarily focused on Spark
>>>>> Streaming
>>>>> >>>> and
>>>>> >>>> >> security.
>>>>> >>>> >>
>>>>> >>>> >> Based on requests from our customers G has implemented
>>>>> Kerberos and
>>>>> >>>> HTTP
>>>>> >>>> >> Basic Authentication for the Flink Dashboard and HistoryServer.
>>>>> >>>> Previously
>>>>> >>>> >> these lacked an authentication story.
>>>>> >>>> >>
>>>>> >>>> >> We are looking to contribute this functionality back to the
>>>>> >>>> community, we
>>>>> >>>> >> believe that given Flink's maturity there should be a common
>>>>> code
>>>>> >>>> solution
>>>>> >>>> >> for this general pattern.
>>>>> >>>> >>
>>>>> >>>> >> We are looking forward to your feedback on G's design. [2]
>>>>> >>>> >>
>>>>> >>>> >> [1] http://gaborsomogyi.com/
>>>>> >>>> >> [2]
>>>>> >>>> >>
>>>>> >>>> >>
>>>>> >>>>
>>>>> https://docs.google.com/document/d/1NMPeJ9H0G49TGy3AzTVVJVKmYC0okwOtqLTSPnGqzHw/edit
>>>>> >>>> >>
>>>>> >>>>
>>>>> >>>>
>>>>>
>>>>
