I am not saying that we shouldn't add a strong authentication mechanism if there are good reasons for it. I would primarily like to understand the context better so that I can give qualified feedback and we can come to a good decision. To be honest, I have the feeling that we haven't yet fully considered all of the options on the table.

Does the problem of certificate expiry also apply to self-signed certificates? If yes, then it should also be a problem for the internal encryption of Flink's communication. If not, then one could use self-signed certificates with a longer validity to solve the mentioned issue.

I think you can set up Flink in such a way that you don't have to handle all the different certificates. For example, you could deploy Flink with a "sidecar proxy" that is responsible for authentication using an arbitrary method (e.g. Kerberos) and bind the REST endpoint to a local network interface. That way, the REST endpoint would only be reachable through the sidecar proxy. Additionally, one could enable SSL for this communication. Would this solve the problem?

Cheers,
Till

On Thu, Jun 3, 2021 at 10:46 PM Márton Balassi <balassi.mar...@gmail.com> wrote:

> That is an interesting idea, Till.
>
> The main issue with it is that TLS certificates have an expiration time,
> usually they get approved for a couple years. Forcing our users to restart
> jobs to reprovision TLS certificates would be weird when we could just
> implement a single proper strong authentication mechanism instead in a
> couple hundred lines of code. :-)
>
> In many cases it is also impractical to go the TLS mutual route, because
> the Flink Dashboard can end up on any node in the k8s/Yarn cluster which
> means that we need a certificate per node (due to the mutual auth), but if
> we also want to protect the private key of these from users accidentally or
> intentionally leaking them then we need this per user. As in we end up
> managing user*machine number certificates and having to renew them
> periodically, which albeit automatable is unfortunately not yet automated
> in all large organizations.
>
> I fully agree that TLS certificate mutual authentication has its nice
> properties, especially at very large (multiple thousand node) clusters -
> but it has its own challenges too.
> Thanks for bringing it up.
>
> Happy to have this added to the rejected alternative list so that we have
> the full picture documented.
>
> On Thu, Jun 3, 2021 at 5:52 PM Till Rohrmann <trohrm...@apache.org> wrote:
>
>> I guess the idea would then be to let the proxy do the authentication job
>> and only forward the request via an SSL mutually encrypted connection to
>> the Flink cluster. Would this be possible? The beauty of this setup is in
>> my opinion that this setup should work with all kinds of authentication
>> mechanisms.
>>
>> Cheers,
>> Till
>>
>> On Thu, Jun 3, 2021 at 3:12 PM Gabor Somogyi <gabor.g.somo...@gmail.com>
>> wrote:
>>
>>> Thanks for giving options to fulfil the need.
>>>
>>> Users are looking for a solution where users can be identified on the
>>> whole cluster and restrict access to resources/actions.
>>> A good example for such an action is cancelling other users running jobs.
>>>
>>> * SSL does provide mutual authentication but when authentication passed
>>> there is no user based on restrictions can be made.
>>> * The less problematic part is that generating/maintaining short time
>>> valid certificates would be a hard (that's the reason KDC like servers
>>> exist).
>>> Having long time valid certificates would widen the attack surface but
>>> since the first concern is there this is just a cosmetic issue.
>>>
>>> All in all using TLS certificates is not sufficient in these
>>> environments unfortunately.
>>>
>>> BR,
>>> G
>>>
>>>
>>> On Thu, Jun 3, 2021 at 12:49 PM Till Rohrmann <trohrm...@apache.org>
>>> wrote:
>>>
>>>> Thanks for the information Gabor. If it is about securing the
>>>> communication between the REST client and the REST server, then Flink
>>>> already supports enabling mutual SSL authentication [1]. Would this be
>>>> enough to secure the communication and to pass an audit?
>>>>
>>>> [1]
>>>> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/security/security-ssl/#external--rest-connectivity
>>>>
>>>> Cheers,
>>>> Till
>>>>
>>>> On Thu, Jun 3, 2021 at 10:33 AM Gabor Somogyi <gabor.g.somo...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Till,
>>>>>
>>>>> Since I'm working in security area 10+ years let me share my thought.
>>>>> I would like to emphasise there are experts better than me but I have some
>>>>> basics.
>>>>> The discussion is open and not trying to tell alone things...
>>>>>
>>>>> > I mean if an attacker can get access to one of the machines, then it
>>>>> > should also be possible to obtain the right Kerberos token.
>>>>> Not necessarily. For example if one gets access to a specific user's
>>>>> credentials then it's not possible to compromise other user's jobs, data,
>>>>> etc...
>>>>> Security is like an onion, the more layers has been added the more time an
>>>>> attacker needs to proceed.
>>>>> At the end of the day if one is in, then most probably can find the way but
>>>>> this time is normally enough to sysadmins or security experts to
>>>>> close down the system and minimize the damage.
>>>>>
>>>>> The other thing is that all tokens has a timeout and if the token is
>>>>> invalid then the attacker can't proceed further.
>>>>>
>>>>> > Is Kerberos also the standard authentication protocol for Kubernetes
>>>>> > deployments?
>>>>> Kerberos is an industry standard which is cloud/deployment agnostic and it
>>>>> can be used in any deployments including k8s.
>>>>> The main intention is to use kerberos in k8s deployments too since we're
>>>>> going this direction as well.
>>>>> Please see how Spark does this:
>>>>>
>>>>> https://spark.apache.org/docs/latest/security.html#secure-interaction-with-kubernetes
>>>>>
>>>>> Last but not least the most important reason to add at least one strong
>>>>> authentication is that we have users who has
>>>>> hard requirements on this.
>>>>> They're doing security audits and if they fail
>>>>> then it's deal breaking.
>>>>> That is why we have added kerberos at the first place. Unfortunately we
>>>>> can't name them in this public list, however
>>>>> the customers who specifically asked for this were mainly in the banking
>>>>> and telco sector.
>>>>>
>>>>> BR,
>>>>> G
>>>>>
>>>>>
>>>>> On Thu, Jun 3, 2021 at 9:20 AM Till Rohrmann <trohrm...@apache.org>
>>>>> wrote:
>>>>>
>>>>> > Thanks for updating the document Márton. Why is it that banks will
>>>>> > consider it more secure if Flink comes with Kerberos authentication
>>>>> > (assuming a properly secured setup)? I mean if an attacker can get access
>>>>> > to one of the machines, then it should also be possible to obtain the right
>>>>> > Kerberos token.
>>>>> >
>>>>> > I am not an authentication expert and that's why I wanted to ask what are
>>>>> > other authentication protocols other than Kerberos? Why did we select
>>>>> > Kerberos and not any other authentication protocol? Maybe you can list the
>>>>> > pros and cons for the different protocols. Is Kerberos also the standard
>>>>> > authentication protocol for Kubernetes deployments? If not, what would be
>>>>> > the answer when deploying on K8s?
>>>>> >
>>>>> > Cheers,
>>>>> > Till
>>>>> >
>>>>> > On Wed, Jun 2, 2021 at 12:07 PM Gabor Somogyi <gabor.g.somo...@gmail.com>
>>>>> > wrote:
>>>>> >
>>>>> >> Hi team,
>>>>> >>
>>>>> >> Happy to be here and hope I can provide quality additions in the future.
>>>>> >>
>>>>> >> Thank you all for helpful the suggestions!
>>>>> >> Considering them the FLIP has been modified and the work continues on the
>>>>> >> already existing Jira.
>>>>> >>
>>>>> >> BR,
>>>>> >> G
>>>>> >>
>>>>> >>
>>>>> >> On Wed, Jun 2, 2021 at 11:23 AM Márton Balassi <balassi.mar...@gmail.com>
>>>>> >> wrote:
>>>>> >>
>>>>> >>> Thanks, Chesney - I totally missed that.
>>>>> >>> Answered on the ticket too, let
>>>>> >>> us continue there then.
>>>>> >>>
>>>>> >>> Till, I agree that we should keep this codepath as slim as possible. It
>>>>> >>> is an important design decision that we aim to keep the list of
>>>>> >>> authentication protocols to a minimum. We believe that this should not be a
>>>>> >>> primary concern of Flink and a trusted proxy service (for example Apache
>>>>> >>> Knox) should be used to enable a multitude of enduser authentication
>>>>> >>> mechanisms. The bare minimum of authentication mechanisms to support
>>>>> >>> consequently consist of a single strong authentication protocol for which
>>>>> >>> Kerberos is the enterprise solution and HTTP Basic primary for development
>>>>> >>> and light-weight scenarios.
>>>>> >>>
>>>>> >>> Added the above wording to G's doc.
>>>>> >>>
>>>>> >>> https://docs.google.com/document/d/1NMPeJ9H0G49TGy3AzTVVJVKmYC0okwOtqLTSPnGqzHw/edit
>>>>> >>>
>>>>> >>>
>>>>> >>> On Tue, Jun 1, 2021 at 11:47 AM Chesnay Schepler <ches...@apache.org>
>>>>> >>> wrote:
>>>>> >>>
>>>>> >>>> There's a related effort:
>>>>> >>>> https://issues.apache.org/jira/browse/FLINK-21108
>>>>> >>>>
>>>>> >>>> On 6/1/2021 10:14 AM, Till Rohrmann wrote:
>>>>> >>>> > Hi Gabor, welcome to the Flink community!
>>>>> >>>> >
>>>>> >>>> > Thanks for sharing this proposal with the community Márton. In general, I
>>>>> >>>> > agree that authentication is missing and that this is required for using
>>>>> >>>> > Flink within an enterprise. The thing I am wondering is whether this
>>>>> >>>> > feature strictly needs to be implemented inside of Flink or whether a proxy
>>>>> >>>> > setup could do the job? Have you considered this option? If yes, then it
>>>>> >>>> > would be good to list it under the point of rejected alternatives.
>>>>> >>>> >
>>>>> >>>> > I do see the benefit of implementing this feature inside of Flink if many
>>>>> >>>> > users need it. If not, then it might be easier for the project to not
>>>>> >>>> > increase the surface area since it makes the overall maintenance harder.
>>>>> >>>> >
>>>>> >>>> > Cheers,
>>>>> >>>> > Till
>>>>> >>>> >
>>>>> >>>> > On Mon, May 31, 2021 at 4:57 PM Márton Balassi <mbala...@apache.org>
>>>>> >>>> > wrote:
>>>>> >>>> >
>>>>> >>>> >> Hi team,
>>>>> >>>> >>
>>>>> >>>> >> Firstly I would like to introduce Gabor or G [1] for short to the
>>>>> >>>> >> community, he is a Spark committer who has recently transitioned to the
>>>>> >>>> >> Flink Engineering team at Cloudera and is looking forward to contributing
>>>>> >>>> >> to Apache Flink. Previously G primarily focused on Spark Streaming and
>>>>> >>>> >> security.
>>>>> >>>> >>
>>>>> >>>> >> Based on requests from our customers G has implemented Kerberos and HTTP
>>>>> >>>> >> Basic Authentication for the Flink Dashboard and HistoryServer. Previously
>>>>> >>>> >> lacked an authentication story.
>>>>> >>>> >>
>>>>> >>>> >> We are looking to contribute this functionality back to the community, we
>>>>> >>>> >> believe that given Flink's maturity there should be a common code solution
>>>>> >>>> >> for this general pattern.
>>>>> >>>> >>
>>>>> >>>> >> We are looking forward to your feedback on G's design. [2]
>>>>> >>>> >>
>>>>> >>>> >> [1] http://gaborsomogyi.com/
>>>>> >>>> >> [2]
>>>>> >>>> >> https://docs.google.com/document/d/1NMPeJ9H0G49TGy3AzTVVJVKmYC0okwOtqLTSPnGqzHw/edit