Hello Folks, We have finalised the PR regarding TLS certificate upgrade: [FLINK-37504] Handle TLS Certificate Renewal <https://github.com/apache/flink/pull/27034>
As suggested by the community, we have added dedicated unit tests for specific components, notably the *BlobServer*. I also added a full end-to-end test to ensure Flink performs the certificate reload. If you have spare cycles, we would appreciate your attention to the PR. Please note that Nicolas has paused work on this functionality, so *I have taken over ownership of this feature.* Thanks Kind regards Oleksandr Nitavskyi On Tue, Jun 24, 2025 at 8:54 AM Gabor Somogyi <[email protected]> wrote: > Thanks for your efforts! > The FLIP looks good from my side. > > G > > > On Tue, Jun 24, 2025 at 7:22 AM Nicolas Fraison < > [email protected]> wrote: > >> Hi, >> >> This remaining part of the doc has been updated >> >> On Thu, Jun 19, 2025 at 9:37 AM Gabor Somogyi <[email protected]> >> wrote: >> >>> I think in this SSLContextLoader class call() function is not correct >>> and must be onFileOrDirectoryModified(...), right? >>> >>> [image: Screenshot 2025-06-19 at 9.35.14.png] >>> >>> On Thu, Jun 19, 2025 at 8:14 AM Nicolas Fraison < >>> [email protected]> wrote: >>> >>>> Hi, >>>> >>>> Which part of the doc are you referring to? >>>> >>>> On Wed, Jun 18, 2025 at 7:08 PM Gabor Somogyi < >>>> [email protected]> wrote: >>>> >>>>> There is still a call function there which is left there from the >>>>> previous >>>>> design... >>>>> >>>>> On Tue, Jun 17, 2025 at 9:48 AM Nicolas Fraison >>>>> <[email protected]> wrote: >>>>> >>>>> > Done >>>>> > >>>>> > On Mon, Jun 16, 2025 at 2:46 PM Gabor Somogyi < >>>>> [email protected]> >>>>> > wrote: >>>>> > >>>>> > > I still see 3 Callable occurrence in the FLIP... >>>>> > > >>>>> > > On Thu, Jun 12, 2025 at 2:16 PM Nicolas Fraison >>>>> > > <[email protected]> wrote: >>>>> > > >>>>> > > > Hi, FLIP has been updated with last comment >>>>> > > > >>>>> > > > On Thu, Jun 5, 2025 at 6:35 PM Gabor Somogyi < >>>>> > [email protected]> >>>>> > > > wrote: >>>>> > > > >>>>> > > > > The interface has been added. Do we intend to pass an >>>>> > > > > instance registerWatchedPath function as we discussed? >>>>> > > > > If yes then the FLIP needs further adjustments all places >>>>> where now >>>>> > > > > Callable provided. >>>>> > > > > >>>>> > > > > G >>>>> > > > > >>>>> > > > > >>>>> > > > > On Thu, Jun 5, 2025 at 2:27 PM Nicolas Fraison >>>>> > > > > <[email protected]> wrote: >>>>> > > > > >>>>> > > > > > Hi, >>>>> > > > > > >>>>> > > > > > It indeed make sense, FLIP has been updated >>>>> > > > > > >>>>> > > > > > Nicolas >>>>> > > > > > >>>>> > > > > > On Wed, Jun 4, 2025 at 12:10 PM Gabor Somogyi < >>>>> > > > [email protected] >>>>> > > > > > >>>>> > > > > > wrote: >>>>> > > > > > >>>>> > > > > > > Hi, >>>>> > > > > > > >>>>> > > > > > > I've read it through. Basically looks good with one >>>>> comment. >>>>> > > > > > > *registerWatchedPath function* has a *Callable* as >>>>> callback. For >>>>> > > the >>>>> > > > > > > current implementation that would be enough, >>>>> > > > > > > but when somebody would like to use that for any other >>>>> use-case >>>>> > > then >>>>> > > > it >>>>> > > > > > > would be hard. Examples: >>>>> > > > > > > * A single *call* function tells the user nothing what >>>>> kind of >>>>> > > event >>>>> > > > is >>>>> > > > > > > this >>>>> > > > > > > * the watch service supports 3 events (create, modify, >>>>> delete), >>>>> > now >>>>> > > > > when >>>>> > > > > > I >>>>> > > > > > > register I can get only updates (I presume) >>>>> > > > > > > * 1+ dir watch with the same callback is not possible >>>>> > > > > > > >>>>> > > > > > > My suggestion would be to create a proper callback with >>>>> all the >>>>> > > event >>>>> > > > > > type >>>>> > > > > > > functions and no-op default behavior with the following >>>>> names[1]. >>>>> > > > > > > This is ~30 lines addition but will increase the >>>>> readability >>>>> > > heavily >>>>> > > > + >>>>> > > > > no >>>>> > > > > > > need to touch this code later. >>>>> > > > > > > >>>>> > > > > > > BR, >>>>> > > > > > > G >>>>> > > > > > > >>>>> > > > > > > [1] >>>>> > > > > > > >>>>> > > > > > > >>>>> > > > > > >>>>> > > > > >>>>> > > > >>>>> > > >>>>> > >>>>> https://github.com/apache/flink-kubernetes-operator/commit/e5a325c48965a50d61d0aa29e61ba79e97f27082#diff-a30b3ed9b8c53e998b15d7da7ad2e54374c98ffc3c920f76a70bce3fb37a9b2eR87-R93 >>>>> > > > > > > >>>>> > > > > > > >>>>> > > > > > > On Tue, Jun 3, 2025 at 8:53 AM Nicolas Fraison >>>>> > > > > > > <[email protected]> wrote: >>>>> > > > > > > >>>>> > > > > > > > Hi, >>>>> > > > > > > > >>>>> > > > > > > > The FLIP has been updated. >>>>> > > > > > > > Let me know if you have some other comments. >>>>> > > > > > > > >>>>> > > > > > > > Nicolas >>>>> > > > > > > > >>>>> > > > > > > > On Tue, May 20, 2025 at 12:01 PM Nicolas Fraison < >>>>> > > > > > > > [email protected]> wrote: >>>>> > > > > > > > >>>>> > > > > > > > > Hi Gabor, >>>>> > > > > > > > > >>>>> > > > > > > > > Thanks for the feedback. >>>>> > > > > > > > > The overall proposal with atomic dirty flag being set >>>>> by the >>>>> > > > > callback >>>>> > > > > > > > will >>>>> > > > > > > > > indeed work with any kind of implementation. >>>>> > > > > > > > > >>>>> > > > > > > > > Will see to update the FLIP in a week or 2 if there >>>>> are no >>>>> > > other >>>>> > > > > > > > comments >>>>> > > > > > > > > on it >>>>> > > > > > > > > >>>>> > > > > > > > > Nicolas >>>>> > > > > > > > > >>>>> > > > > > > > > >>>>> > > > > > > > > On Tue, May 20, 2025 at 11:08 AM Gabor Somogyi < >>>>> > > > > > > > [email protected]> >>>>> > > > > > > > > wrote: >>>>> > > > > > > > > >>>>> > > > > > > > >> Hi Nicolas, >>>>> > > > > > > > >> >>>>> > > > > > > > >> Related SSLContext I've not gone through all the >>>>> cases where >>>>> > > we >>>>> > > > > need >>>>> > > > > > > to >>>>> > > > > > > > >> reload so instead I'm sharing the concept. >>>>> > > > > > > > >> The main intention is that we must control all >>>>> execution >>>>> > paths >>>>> > > > > which >>>>> > > > > > > > >> decide >>>>> > > > > > > > >> which certificate used for authentication. >>>>> > > > > > > > >> Creating an SSLContext decorator which checks reload >>>>> first >>>>> > and >>>>> > > > > then >>>>> > > > > > > > >> forwards all calls to the original (wrapped) context >>>>> > > > > > > > >> is one way to achieve that. If there are different >>>>> > > > implementations >>>>> > > > > > > which >>>>> > > > > > > > >> end up in similar behavior then it's fine. >>>>> > > > > > > > >> >>>>> > > > > > > > >> BR, >>>>> > > > > > > > >> G >>>>> > > > > > > > >> >>>>> > > > > > > > >> >>>>> > > > > > > > >> On Tue, May 20, 2025 at 10:15 AM Nicolas Fraison >>>>> > > > > > > > >> <[email protected]> wrote: >>>>> > > > > > > > >> >>>>> > > > > > > > >> > Hi, >>>>> > > > > > > > >> > >>>>> > > > > > > > >> > Overall your proposal looks great. >>>>> > > > > > > > >> > The event handling must indeed be super fast and we >>>>> must >>>>> > > also >>>>> > > > > not >>>>> > > > > > > > change >>>>> > > > > > > > >> > original code path if reload not needed >>>>> > > > > > > > >> > >>>>> > > > > > > > >> > I have some concerns around the ReloadableSSLContext >>>>> > > > > implementing >>>>> > > > > > > all >>>>> > > > > > > > >> > SSLContext >>>>> > > > > > > > >> > Do you really mean SSLContext (java SSLContext one) >>>>> or do >>>>> > > you >>>>> > > > > > refer >>>>> > > > > > > to >>>>> > > > > > > > >> the >>>>> > > > > > > > >> > SslContext from netty? >>>>> > > > > > > > >> > >>>>> > > > > > > > >> > If java SSLContext >>>>> > > > > > > > >> > - I'm not sure how this will manage reload from the >>>>> > > BlobServer >>>>> > > > > > > > >> > BlobServer relies on creation of an >>>>> SSLServerSocketFactory >>>>> > > > from >>>>> > > > > > the >>>>> > > > > > > > >> > SSLContext. >>>>> > > > > > > > >> > But from my current understanding the >>>>> > SSLServerSocketFactory >>>>> > > > > does >>>>> > > > > > > not >>>>> > > > > > > > >> have >>>>> > > > > > > > >> > any connection with the SSLContext. >>>>> > > > > > > > >> > It only has some with an SSLContextImpl extends >>>>> > > SSLContextSpi. >>>>> > > > > > > > >> > I think we will need a callback here to enforce >>>>> recreation >>>>> > > of >>>>> > > > > the >>>>> > > > > > > > >> > BlobServer socket. >>>>> > > > > > > > >> > - I'm also don't see how to attach this to the >>>>> SslContext >>>>> > > from >>>>> > > > > > netty >>>>> > > > > > > > >> > >>>>> > > > > > > > >> > If netty SslContext >>>>> > > > > > > > >> > - we would still need to have the callback to >>>>> recreate the >>>>> > > > > > > BlobServer >>>>> > > > > > > > >> > socket >>>>> > > > > > > > >> > - for netty and pekko we should be able to rely on >>>>> it >>>>> > > > > > > > >> > >>>>> > > > > > > > >> > Nicolas >>>>> > > > > > > > >> > >>>>> > > > > > > > >> > On Wed, May 7, 2025 at 12:40 PM Gabor Somogyi < >>>>> > > > > > > > >> [email protected]> >>>>> > > > > > > > >> > wrote: >>>>> > > > > > > > >> > >>>>> > > > > > > > >> > > Hi All, >>>>> > > > > > > > >> > > >>>>> > > > > > > > >> > > I've read through the concerns/proposals related >>>>> the >>>>> > watch >>>>> > > > > > service >>>>> > > > > > > > and >>>>> > > > > > > > >> > here >>>>> > > > > > > > >> > > are my conclusions: >>>>> > > > > > > > >> > > * Watch service can watch local file systems >>>>> only: It's >>>>> > > fair >>>>> > > > > to >>>>> > > > > > > say >>>>> > > > > > > > >> that >>>>> > > > > > > > >> > > certificates must be copied to local FS in order >>>>> to work >>>>> > > > (init >>>>> > > > > > > > >> container >>>>> > > > > > > > >> > cp >>>>> > > > > > > > >> > > command or something) >>>>> > > > > > > > >> > > * Watch service can send multiple events even for >>>>> a >>>>> > single >>>>> > > > > > > directory >>>>> > > > > > > > >> > > change: this can be mitigated with a single >>>>> atomic dirty >>>>> > > > flag >>>>> > > > > > (see >>>>> > > > > > > > my >>>>> > > > > > > > >> > > design suggestion) >>>>> > > > > > > > >> > > * Polling file modification time: When I hear any >>>>> kind >>>>> > of >>>>> > > > > > polling >>>>> > > > > > > > >> > > implemented by us is just something I'm mostly >>>>> opposing >>>>> > > > > > > > >> > > * security.ssl.internal.keystore.reload.duration: >>>>> > Security >>>>> > > > > > > features >>>>> > > > > > > > >> must >>>>> > > > > > > > >> > be >>>>> > > > > > > > >> > > executed nearly immediately no matter what >>>>> > > > > > > > >> > > >>>>> > > > > > > > >> > > As a general remark, not sure how many users are >>>>> using >>>>> > the >>>>> > > > > watch >>>>> > > > > > > > >> service >>>>> > > > > > > > >> > > based approach in the operator but until now I've >>>>> not >>>>> > seen >>>>> > > > any >>>>> > > > > > > issue >>>>> > > > > > > > >> > > related to that. >>>>> > > > > > > > >> > > If somebody is having some specifics then please >>>>> share. >>>>> > > > > > > > >> > > >>>>> > > > > > > > >> > > For now I would shoot for keystore not to have >>>>> feature >>>>> > > > creep. >>>>> > > > > > When >>>>> > > > > > > > >> there >>>>> > > > > > > > >> > is >>>>> > > > > > > > >> > > a valid use-case, community interest and the >>>>> keystore >>>>> > > story >>>>> > > > is >>>>> > > > > > > > already >>>>> > > > > > > > >> > rock >>>>> > > > > > > > >> > > stable >>>>> > > > > > > > >> > > then we can consider to involve truststore later. >>>>> > > > > > > > >> > > >>>>> > > > > > > > >> > > Having the mentioned assumption that the operator >>>>> > approach >>>>> > > > > > works, >>>>> > > > > > > > >> here is >>>>> > > > > > > > >> > > my high level proposal: >>>>> > > > > > > > >> > > * Let's have enable flag for all such watch >>>>> > functionality. >>>>> > > > If >>>>> > > > > > it's >>>>> > > > > > > > >> false >>>>> > > > > > > > >> > > then the currently existing functionality must >>>>> remain >>>>> > > as-is >>>>> > > > > > > > >> > > * Let's have a LocalFSWatchService which is a >>>>> singleton >>>>> > > > which >>>>> > > > > > has >>>>> > > > > > > no >>>>> > > > > > > > >> path >>>>> > > > > > > > >> > > registrations by default >>>>> > > > > > > > >> > > * Add path registration functionality which is >>>>> > > synchronised >>>>> > > > > > where >>>>> > > > > > > a >>>>> > > > > > > > >> > > callback can be registered >>>>> > > > > > > > >> > > * Let's have a ReloadableSSLContext which >>>>> implements the >>>>> > > > > > mentioned >>>>> > > > > > > > >> > callback >>>>> > > > > > > > >> > > * Inside the callback set an atomic dirty flag >>>>> only >>>>> > (this >>>>> > > > can >>>>> > > > > > > handle >>>>> > > > > > > > >> > > multiple events for the same directory change + >>>>> event >>>>> > > > handling >>>>> > > > > > in >>>>> > > > > > > > the >>>>> > > > > > > > >> > watch >>>>> > > > > > > > >> > > service must be extreme fast) >>>>> > > > > > > > >> > > * Inside ReloadableSSLContext all SSLContext >>>>> actions >>>>> > must >>>>> > > be >>>>> > > > > > > > >> overridden. >>>>> > > > > > > > >> > At >>>>> > > > > > > > >> > > the beginning of each function dirty flag must be >>>>> > checked >>>>> > > > > > > > >> > > and if dirty then certificates must be reloaded, >>>>> flag >>>>> > can >>>>> > > be >>>>> > > > > set >>>>> > > > > > > > back >>>>> > > > > > > > >> to >>>>> > > > > > > > >> > > false (then original functionality call). It's >>>>> extremely >>>>> > > > > > important >>>>> > > > > > > > >> that >>>>> > > > > > > > >> > > context reload must be synchronised. >>>>> > > > > > > > >> > > A synchronised boolean check + possible context >>>>> reload >>>>> > can >>>>> > > > > > consume >>>>> > > > > > > > >> some >>>>> > > > > > > > >> > > time but I wouldn't expect any significant >>>>> performance >>>>> > > drop. >>>>> > > > > > > > >> > > >>>>> > > > > > > > >> > > My proposal main drivers: >>>>> > > > > > > > >> > > * Original code path must run when no watch >>>>> service >>>>> > asked >>>>> > > > > > > > >> > > * Super fast event handling because million >>>>> events may >>>>> > > come >>>>> > > > in >>>>> > > > > > > (not >>>>> > > > > > > > >> > > expecting but we should be prepared) >>>>> > > > > > > > >> > > * Clean separation between dir/file watch and >>>>> file/dir >>>>> > > usage >>>>> > > > > > > > >> > > * Well considered synchronisation model >>>>> > > > > > > > >> > > * Extensive unit testing because we're intended >>>>> to touch >>>>> > > the >>>>> > > > > > heart >>>>> > > > > > > > of >>>>> > > > > > > > >> > Flink >>>>> > > > > > > > >> > > >>>>> > > > > > > > >> > > Happy to hear other opinions. >>>>> > > > > > > > >> > > >>>>> > > > > > > > >> > > BR, >>>>> > > > > > > > >> > > G >>>>> > > > > > > > >> > > >>>>> > > > > > > > >> > > >>>>> > > > > > > > >> > > On Mon, Apr 28, 2025 at 1:54 PM Nicolas Fraison >>>>> > > > > > > > >> > > <[email protected]> wrote: >>>>> > > > > > > > >> > > >>>>> > > > > > > > >> > > > Thanks all for your feedback and sorry for the >>>>> late >>>>> > > answer >>>>> > > > > (I >>>>> > > > > > > was >>>>> > > > > > > > on >>>>> > > > > > > > >> > > > holiday). >>>>> > > > > > > > >> > > > >>>>> > > > > > > > >> > > > 1. Indeed it would add 4 threads. >>>>> > > > > > > > >> > > > For the non pekko component we can indeed have >>>>> one >>>>> > > watcher >>>>> > > > > > > service >>>>> > > > > > > > >> used >>>>> > > > > > > > >> > > to >>>>> > > > > > > > >> > > > reload SSLContext for those components >>>>> > > > > > > > >> > > > For pekko this is a little more challenging as >>>>> the >>>>> > > > creation >>>>> > > > > of >>>>> > > > > > > the >>>>> > > > > > > > >> > pekko >>>>> > > > > > > > >> > > > ssl engine is managed by pekko himself. >>>>> > > > > > > > >> > > > Flink only generates appropriate config with >>>>> class to >>>>> > > > > execute >>>>> > > > > > to >>>>> > > > > > > > >> > initiate >>>>> > > > > > > > >> > > > the pekko ssl engine [1]. This means that I >>>>> will not >>>>> > be >>>>> > > > able >>>>> > > > > > to >>>>> > > > > > > > >> provide >>>>> > > > > > > > >> > > the >>>>> > > > > > > > >> > > > watcher service to this ssl engine. >>>>> > > > > > > > >> > > > One solution would be to rely on a singleton >>>>> instead >>>>> > of >>>>> > > a >>>>> > > > > > > service >>>>> > > > > > > > >> > > injected >>>>> > > > > > > > >> > > > in each component but I'm not sure this is fine >>>>> to use >>>>> > > > such >>>>> > > > > in >>>>> > > > > > > > >> flink. >>>>> > > > > > > > >> > > > WDYT? >>>>> > > > > > > > >> > > > >>>>> > > > > > > > >> > > > We can also add a specific flink configuration >>>>> > > > > > > > >> > > > (security.ssl.internal.keystore.reload.enable) >>>>> to only >>>>> > > add >>>>> > > > > > this >>>>> > > > > > > > >> watcher >>>>> > > > > > > > >> > > > mechanism if the config is enabled to avoid >>>>> adding >>>>> > those >>>>> > > > > > threads >>>>> > > > > > > > if >>>>> > > > > > > > >> > this >>>>> > > > > > > > >> > > is >>>>> > > > > > > > >> > > > not needed. >>>>> > > > > > > > >> > > > >>>>> > > > > > > > >> > > > >>>>> > > > > > > > >> > > > 2. I'm fine with the LocalFSWatchService naming. >>>>> > > > > > > > >> > > > >>>>> > > > > > > > >> > > > 3. Also agree that some e2e tests must be added. >>>>> > > > > > > > >> > > > >>>>> > > > > > > > >> > > > >>>>> > > > > > > > >> > > > @doguscan namal, Thanks for challenging the >>>>> proposal. >>>>> > > > > > > > >> > > > FYI, we are planning to rely on certificates >>>>> with >>>>> > really >>>>> > > > > short >>>>> > > > > > > > >> > validity. >>>>> > > > > > > > >> > > > >>>>> > > > > > > > >> > > > D1. It seems that the proposal to rely on a >>>>> reload >>>>> > > period >>>>> > > > > will >>>>> > > > > > > > still >>>>> > > > > > > > >> > face >>>>> > > > > > > > >> > > > potential issues: >>>>> > > > > > > > >> > > > - receiving events while file content updates >>>>> are >>>>> > still >>>>> > > in >>>>> > > > > > > > progress: >>>>> > > > > > > > >> > > > There is no guarantee that we will not load the >>>>> > > > certificate >>>>> > > > > > > while >>>>> > > > > > > > >> file >>>>> > > > > > > > >> > > > content updates are still in progress >>>>> > > > > > > > >> > > > - while it will not be affected by multiple >>>>> > > notifications, >>>>> > > > > we >>>>> > > > > > > can >>>>> > > > > > > > >> > reach a >>>>> > > > > > > > >> > > > point where only the truststore is updated when >>>>> the >>>>> > > reload >>>>> > > > > > > > happens. >>>>> > > > > > > > >> > > > Which means that if the keystore is updated just >>>>> > after, >>>>> > > it >>>>> > > > > > will >>>>> > > > > > > > not >>>>> > > > > > > > >> be >>>>> > > > > > > > >> > > > taken in account before next run of the reload >>>>> > mechanism >>>>> > > > > > > > >> > > > >>>>> > > > > > > > >> > > > I think that with WatchService and appropriate >>>>> reload >>>>> > > > grace >>>>> > > > > > > period >>>>> > > > > > > > >> > > > mechanism we should be able to mitigate those 2 >>>>> issues >>>>> > > > > > (ensuring >>>>> > > > > > > > >> > minimum >>>>> > > > > > > > >> > > > reload even with multiple notify) >>>>> > > > > > > > >> > > > >>>>> > > > > > > > >> > > > From KIP-1119 [2] it looks like the same kind of >>>>> > > > > requirements >>>>> > > > > > is >>>>> > > > > > > > >> under >>>>> > > > > > > > >> > > > discussion for Kafka to also rely on the >>>>> WatchService >>>>> > > Java >>>>> > > > > API >>>>> > > > > > > > >> > > (SpringBoot >>>>> > > > > > > > >> > > > seems to also rely on this API to manage ssl >>>>> reload >>>>> > > [3]). >>>>> > > > > > > > >> > > > >>>>> > > > > > > > >> > > > D3. Do we have a real use case for this? >>>>> > > > > > > > >> > > > >>>>> > > > > > > > >> > > > [1] >>>>> > > > > > > > >> > > > >>>>> > > > > > > > >> > > > >>>>> > > > > > > > >> > > >>>>> > > > > > > > >> > >>>>> > > > > > > > >> >>>>> > > > > > > > >>>>> > > > > > > >>>>> > > > > > >>>>> > > > > >>>>> > > > >>>>> > > >>>>> > >>>>> https://github.com/apache/flink/blob/b523264ab45d37cd9584a0e8c06f1ef6bd1aaed7/flink-rpc/flink-rpc-akka/src/main/java/org/apache/flink/runtime/rpc/pekko/PekkoUtils.java#L372 >>>>> > > > > > > > >> > > > >>>>> > > > > > > > >> > > > [2] >>>>> > > > > > > > >> > > > >>>>> > > > > > > > >> > > > >>>>> > > > > > > > >> > > >>>>> > > > > > > > >> > >>>>> > > > > > > > >> >>>>> > > > > > > > >>>>> > > > > > > >>>>> > > > > > >>>>> > > > > >>>>> > > > >>>>> > > >>>>> > >>>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1119%3A+Add+support+for+SSL+auto+reload >>>>> > > > > > > > >> > > > >>>>> > > > > > > > >> > > > [3] >>>>> > > > > > > > >> > > > >>>>> > > > > > > > >> > > > >>>>> > > > > > > > >> > > >>>>> > > > > > > > >> > >>>>> > > > > > > > >> >>>>> > > > > > > > >>>>> > > > > > > >>>>> > > > > > >>>>> > > > > >>>>> > > > >>>>> > > >>>>> > >>>>> https://docs.spring.io/spring-boot/reference/features/ssl.html#features.ssl.reloading >>>>> > > > > > > > >> > > > >>>>> > > > > > > > >> > > > Regards, >>>>> > > > > > > > >> > > > Nicolas >>>>> > > > > > > > >> > > > >>>>> > > > > > > > >> > > > On Wed, Apr 23, 2025 at 12:14 PM Doğuşcan Namal >>>>> < >>>>> > > > > > > > >> > > [email protected]> >>>>> > > > > > > > >> > > > wrote: >>>>> > > > > > > > >> > > > >>>>> > > > > > > > >> > > > > Hi Nicolas, thanks for the FLIP. >>>>> > > > > > > > >> > > > > >>>>> > > > > > > > >> > > > > I am fully supportive of the motivation and we >>>>> > should >>>>> > > be >>>>> > > > > > > > >> supporting >>>>> > > > > > > > >> > > this >>>>> > > > > > > > >> > > > > feature. Here are couple of comments from my >>>>> side: >>>>> > > > > > > > >> > > > > >>>>> > > > > > > > >> > > > > D1) >>>>> > > > > > > > >> > > > > Since you shared the implementation details >>>>> on the >>>>> > > FLIP >>>>> > > > as >>>>> > > > > > > > well, I >>>>> > > > > > > > >> > > would >>>>> > > > > > > > >> > > > > like to discuss whether using Java's >>>>> WatchService is >>>>> > > the >>>>> > > > > > best >>>>> > > > > > > > >> choice >>>>> > > > > > > > >> > > > here. >>>>> > > > > > > > >> > > > > >>>>> > > > > > > > >> > > > > I believe that the certificate renewal is not >>>>> a >>>>> > > frequent >>>>> > > > > > > > >> operation. >>>>> > > > > > > > >> > > Even >>>>> > > > > > > > >> > > > > once a day certificate renewals are not >>>>> > > realistic(except >>>>> > > > > for >>>>> > > > > > > > test >>>>> > > > > > > > >> > cases >>>>> > > > > > > > >> > > > > maybe) but let's assume that this covers up >>>>> the >>>>> > p99.99 >>>>> > > > of >>>>> > > > > > the >>>>> > > > > > > > use >>>>> > > > > > > > >> > > cases. >>>>> > > > > > > > >> > > > I >>>>> > > > > > > > >> > > > > am confident on this estimation since there >>>>> hasn't >>>>> > > been >>>>> > > > a >>>>> > > > > > > > request >>>>> > > > > > > > >> > from >>>>> > > > > > > > >> > > > the >>>>> > > > > > > > >> > > > > community for this feature so far, which >>>>> confirms >>>>> > that >>>>> > > > > > people >>>>> > > > > > > > were >>>>> > > > > > > > >> > okay >>>>> > > > > > > > >> > > > > with infrequent cluster restarts. Following >>>>> that it >>>>> > is >>>>> > > > > > > > >> infrequent, I >>>>> > > > > > > > >> > > > > believe that spawning up a thread that >>>>> watches the >>>>> > > file >>>>> > > > > > > > >> modification >>>>> > > > > > > > >> > > > > operations is not the best use of the limited >>>>> > > resources >>>>> > > > > on a >>>>> > > > > > > > >> cluster. >>>>> > > > > > > > >> > > > > >>>>> > > > > > > > >> > > > > There are some known limitations of the >>>>> WatchService >>>>> > > as >>>>> > > > > well >>>>> > > > > > > > such >>>>> > > > > > > > >> as >>>>> > > > > > > > >> > > > > receiving multiple modification events for >>>>> the same >>>>> > > > > > occurence >>>>> > > > > > > > [1], >>>>> > > > > > > > >> > > > > inotify's (WatchService's underlying >>>>> mechanism in >>>>> > > Linux >>>>> > > > > > > > >> environments) >>>>> > > > > > > > >> > > > > problems on containerized environments due to >>>>> remote >>>>> > > > file >>>>> > > > > > > > systems >>>>> > > > > > > > >> [2] >>>>> > > > > > > > >> > > or >>>>> > > > > > > > >> > > > > receiving events while file content updates >>>>> are >>>>> > still >>>>> > > in >>>>> > > > > > > > progress. >>>>> > > > > > > > >> > > [3]. I >>>>> > > > > > > > >> > > > > do not know if these limitations are >>>>> addressed in >>>>> > the >>>>> > > > > newer >>>>> > > > > > > > >> versions >>>>> > > > > > > > >> > > but >>>>> > > > > > > > >> > > > > regardless of that it is clear that we may >>>>> face with >>>>> > > > some >>>>> > > > > > ugly >>>>> > > > > > > > >> edge >>>>> > > > > > > > >> > > cases >>>>> > > > > > > > >> > > > > due to that. >>>>> > > > > > > > >> > > > > >>>>> > > > > > > > >> > > > > Given these complications, I would recommend >>>>> just >>>>> > > > > creating a >>>>> > > > > > > new >>>>> > > > > > > > >> > > > SSLContext >>>>> > > > > > > > >> > > > > after a configured duration is expired. We >>>>> could >>>>> > > record >>>>> > > > > the >>>>> > > > > > > > >> timestamp >>>>> > > > > > > > >> > > > when >>>>> > > > > > > > >> > > > > the previous SSLContext is created and update >>>>> it >>>>> > > after a >>>>> > > > > > > > >> configured >>>>> > > > > > > > >> > > > > duration is passed. This will be much easier >>>>> to test >>>>> > > and >>>>> > > > > > > reason >>>>> > > > > > > > >> about >>>>> > > > > > > > >> > > > when >>>>> > > > > > > > >> > > > > it is running on production. This will >>>>> eliminate the >>>>> > > > > > necessity >>>>> > > > > > > > to >>>>> > > > > > > > >> > > reason >>>>> > > > > > > > >> > > > > about the file modification operations as >>>>> well. >>>>> > > > > > > > >> > > > > >>>>> > > > > > > > >> > > > > I briefly skimmed through the classes that >>>>> need to >>>>> > be >>>>> > > > > > modified >>>>> > > > > > > > >> and it >>>>> > > > > > > > >> > > > > looked feasible for me. Let me know what are >>>>> your >>>>> > > > comments >>>>> > > > > > on >>>>> > > > > > > > >> these. >>>>> > > > > > > > >> > > > > >>>>> > > > > > > > >> > > > > Note that this is already used in the Kafka >>>>> world >>>>> > > where >>>>> > > > a >>>>> > > > > > new >>>>> > > > > > > > >> > > SSLContext >>>>> > > > > > > > >> > > > is >>>>> > > > > > > > >> > > > > created after 12 hours. >>>>> > > > > > > > >> > > > > >>>>> > > > > > > > >> > > > > D2) We could provide a configuration to the >>>>> user, >>>>> > such >>>>> > > > as >>>>> > > > > > > > >> > > > > >>>>> "security.ssl.internal.keystore.reload.duration" so >>>>> > > they >>>>> > > > > > could >>>>> > > > > > > > >> decide >>>>> > > > > > > > >> > > how >>>>> > > > > > > > >> > > > > often the new certificates should be loaded. >>>>> > > > > > > > >> > > > > >>>>> > > > > > > > >> > > > > D3) >>>>> > > > > > > > >> > > > > On the other hand, I wonder whether we should >>>>> also >>>>> > > > handle >>>>> > > > > > > > >> supporting >>>>> > > > > > > > >> > > > > updating the file paths of the truststores and >>>>> > > keystores >>>>> > > > > > under >>>>> > > > > > > > >> this >>>>> > > > > > > > >> > > FLIP >>>>> > > > > > > > >> > > > as >>>>> > > > > > > > >> > > > > well. Since the name of the FLIP is "Handle >>>>> TLS >>>>> > > > > Certificate >>>>> > > > > > > > >> Renewal" >>>>> > > > > > > > >> > I >>>>> > > > > > > > >> > > > > think we could bring that into scope too :) >>>>> > > > > > > > >> > > > > >>>>> > > > > > > > >> > > > > Thanks >>>>> > > > > > > > >> > > > > >>>>> > > > > > > > >> > > > > [1] >>>>> > > > > > > > >> > > > > >>>>> > > > > > > > >> > > > > >>>>> > > > > > > > >> > > > >>>>> > > > > > > > >> > > >>>>> > > > > > > > >> > >>>>> > > > > > > > >> >>>>> > > > > > > > >>>>> > > > > > > >>>>> > > > > > >>>>> > > > > >>>>> > > > >>>>> > > >>>>> > >>>>> https://stackoverflow.com/questions/16777869/java-7-watchservice-ignoring-multiple-occurrences-of-the-same-event >>>>> > > > > > > > >> > > > > >>>>> > > > > > > > >> > > > > [2] >>>>> > > > > > > > >>>>> https://blog.arkey.fr/2019/09/13/watchservice-and-bind-mount/ >>>>> > > > > > > > >> > > > > >>>>> > > > > > > > >> > > > > [3] >>>>> > > > > > > > >> > > > >>>>> > > > > > > > >> >>>>> > > > > > >>>>> > > >>>>> https://surajatreyac.github.io/2014-07-29/reactive_file_handling.html >>>>> > > > > > > > >> > > > > >>>>> > > > > > > > >> > > > > [4] See also Platform Dependencies - >>>>> > > > > > > > >> > > > > >>>>> > > > > > > > >> > > > > >>>>> > > > > > > > >> > > > >>>>> > > > > > > > >> > > >>>>> > > > > > > > >> > >>>>> > > > > > > > >> >>>>> > > > > > > > >>>>> > > > > > > >>>>> > > > > > >>>>> > > > > >>>>> > > > >>>>> > > >>>>> > >>>>> https://docs.oracle.com/javase/8/docs/api/index.html?java/nio/file/WatchService.html >>>>> > > > > > > > >> > > > > >>>>> > > > > > > > >> > > > > >>>>> > > > > > > > >> > > > > On Fri, 18 Apr 2025 at 18:25, Gabor Somogyi < >>>>> > > > > > > > >> > [email protected] >>>>> > > > > > > > >> > > > >>>>> > > > > > > > >> > > > > wrote: >>>>> > > > > > > > >> > > > > >>>>> > > > > > > > >> > > > > > Hi Robert, >>>>> > > > > > > > >> > > > > > >>>>> > > > > > > > >> > > > > > Since I've added the same feature to the >>>>> operator >>>>> > > I'll >>>>> > > > > > take >>>>> > > > > > > a >>>>> > > > > > > > >> look >>>>> > > > > > > > >> > at >>>>> > > > > > > > >> > > > it. >>>>> > > > > > > > >> > > > > > Though it won't be lightning fast since I'm >>>>> having >>>>> > > > > several >>>>> > > > > > > > weeks >>>>> > > > > > > > >> > off. >>>>> > > > > > > > >> > > > > > >>>>> > > > > > > > >> > > > > > Your questions are valid especially >>>>> considering >>>>> > the >>>>> > > > fact >>>>> > > > > > > that >>>>> > > > > > > > >> this >>>>> > > > > > > > >> > > > > feature >>>>> > > > > > > > >> > > > > > touches the hearth of the authentication so >>>>> this >>>>> > > must >>>>> > > > be >>>>> > > > > > > rock >>>>> > > > > > > > >> solid >>>>> > > > > > > > >> > > > > > in order to avoid grey hair :) >>>>> > > > > > > > >> > > > > > >>>>> > > > > > > > >> > > > > > (1) I would vote on a single service which >>>>> is >>>>> > > heavily >>>>> > > > > unit >>>>> > > > > > > > >> tested >>>>> > > > > > > > >> > > with >>>>> > > > > > > > >> > > > > > all the possible combinations including >>>>> threading. >>>>> > > > > > > > >> > > > > > >>>>> > > > > > > > >> > > > > > Some standalone app could be added to >>>>> really play >>>>> > > with >>>>> > > > > it >>>>> > > > > > > > (that >>>>> > > > > > > > >> > would >>>>> > > > > > > > >> > > > > help >>>>> > > > > > > > >> > > > > > review). >>>>> > > > > > > > >> > > > > > I mean, create X files, start Y threads, >>>>> and make >>>>> > > > > > > assertions. >>>>> > > > > > > > >> > > > > > The reason why I'm suggesting it is the >>>>> fact that >>>>> > > > AFAIR >>>>> > > > > > the >>>>> > > > > > > > >> watch >>>>> > > > > > > > >> > > > service >>>>> > > > > > > > >> > > > > > is quite sensitive even in single thread. >>>>> If we >>>>> > > could >>>>> > > > do >>>>> > > > > > > this >>>>> > > > > > > > >> in a >>>>> > > > > > > > >> > > > finite >>>>> > > > > > > > >> > > > > > time >>>>> > > > > > > > >> > > > > > consuming unit test then it's even better. >>>>> > > > > > > > >> > > > > > >>>>> > > > > > > > >> > > > > > (2) +1 on that name to avoid confusion >>>>> > > > > > > > >> > > > > > >>>>> > > > > > > > >> > > > > > (3) I agree that some e2e is must, however >>>>> this >>>>> > can >>>>> > > be >>>>> > > > > > > easily >>>>> > > > > > > > >> and >>>>> > > > > > > > >> > > > deeply >>>>> > > > > > > > >> > > > > > unit >>>>> > > > > > > > >> > > > > > tested so that part is also essential. One >>>>> key >>>>> > test >>>>> > > > here >>>>> > > > > > is >>>>> > > > > > > > when >>>>> > > > > > > > >> > > > > > certificates >>>>> > > > > > > > >> > > > > > are not changing then no action must be >>>>> performed >>>>> > > (not >>>>> > > > > to >>>>> > > > > > > > break >>>>> > > > > > > > >> the >>>>> > > > > > > > >> > > > whole >>>>> > > > > > > > >> > > > > > system apart). >>>>> > > > > > > > >> > > > > > >>>>> > > > > > > > >> > > > > > Purely personal opinion but such feature >>>>> > > developments >>>>> > > > > are >>>>> > > > > > > slow >>>>> > > > > > > > >> by >>>>> > > > > > > > >> > > > nature >>>>> > > > > > > > >> > > > > > because >>>>> > > > > > > > >> > > > > > of edge case / stress testing. >>>>> > > > > > > > >> > > > > > >>>>> > > > > > > > >> > > > > > BR, >>>>> > > > > > > > >> > > > > > G >>>>> > > > > > > > >> > > > > > >>>>> > > > > > > > >> > > > > > >>>>> > > > > > > > >> > > > > > On Fri, Apr 18, 2025 at 4:53 PM Robert >>>>> Metzger < >>>>> > > > > > > > >> > [email protected]> >>>>> > > > > > > > >> > > > > > wrote: >>>>> > > > > > > > >> > > > > > >>>>> > > > > > > > >> > > > > > > Hi Nicolas, >>>>> > > > > > > > >> > > > > > > >>>>> > > > > > > > >> > > > > > > This looks like a nice improvement, >>>>> thanks for >>>>> > the >>>>> > > > > write >>>>> > > > > > > up. >>>>> > > > > > > > >> > > > > > > Are you in touch with any committer who's >>>>> > willing >>>>> > > to >>>>> > > > > > > review >>>>> > > > > > > > / >>>>> > > > > > > > >> > merge >>>>> > > > > > > > >> > > > > this? >>>>> > > > > > > > >> > > > > > > >>>>> > > > > > > > >> > > > > > > Some random questions on the FLIP: >>>>> > > > > > > > >> > > > > > > (1) "Each service that depends on TLS >>>>> > > certificates >>>>> > > > > will >>>>> > > > > > > > >> > > initialize a >>>>> > > > > > > > >> > > > > > > FileSytemWatchService" >>>>> > > > > > > > >> > > > > > > >>>>> > > > > > > > >> > > > > > > It seems that there are 4 components >>>>> using SSL, >>>>> > > does >>>>> > > > > > this >>>>> > > > > > > > mean >>>>> > > > > > > > >> > > there >>>>> > > > > > > > >> > > > > will >>>>> > > > > > > > >> > > > > > > be 4 additional threads running, watching >>>>> the >>>>> > same >>>>> > > > set >>>>> > > > > > of >>>>> > > > > > > > >> files? >>>>> > > > > > > > >> > > > > > > Wouldn't it be better to introduce a >>>>> central >>>>> > file >>>>> > > > > > watching >>>>> > > > > > > > >> > service, >>>>> > > > > > > > >> > > > and >>>>> > > > > > > > >> > > > > > SSL >>>>> > > > > > > > >> > > > > > > users can subscribe to updates, to reduce >>>>> the >>>>> > > number >>>>> > > > > of >>>>> > > > > > > > >> threads? >>>>> > > > > > > > >> > > > > > > If this makes the whole effort 4x more >>>>> > > complicated, >>>>> > > > I >>>>> > > > > > > > wouldn't >>>>> > > > > > > > >> > > > consider >>>>> > > > > > > > >> > > > > > it, >>>>> > > > > > > > >> > > > > > > but if its roughly the same effort, we >>>>> should :) >>>>> > > > > > > > >> > > > > > > >>>>> > > > > > > > >> > > > > > > (2) "FileSytemWatchService" >>>>> > > > > > > > >> > > > > > > When I read this name, I was wondering, >>>>> whether >>>>> > > this >>>>> > > > > is >>>>> > > > > > > > >> somehow >>>>> > > > > > > > >> > > > related >>>>> > > > > > > > >> > > > > > to >>>>> > > > > > > > >> > > > > > > the Flink "FileSystem" classes. Which I >>>>> think >>>>> > its' >>>>> > > > > not. >>>>> > > > > > > > >> > > > > > > Maybe a different name, that makes this >>>>> > separation >>>>> > > > > more >>>>> > > > > > > > >> explicit, >>>>> > > > > > > > >> > > > would >>>>> > > > > > > > >> > > > > > > make sense. Maybe "LocalFSWatchService"? >>>>> > > > > > > > >> > > > > > > (I'm sorry to bring up naming stuff -- >>>>> its very >>>>> > > > > > > subjective, >>>>> > > > > > > > >> and >>>>> > > > > > > > >> > > > > > difficult) >>>>> > > > > > > > >> > > > > > > >>>>> > > > > > > > >> > > > > > > (3) For the test plan: There seem to be >>>>> some SSL >>>>> > > > > related >>>>> > > > > > > e2e >>>>> > > > > > > > >> > tests: >>>>> > > > > > > > >> > > > > > > >>>>> > > > > > > > >> > > > > > > >>>>> > > > > > > > >> > > > > > >>>>> > > > > > > > >> > > > > >>>>> > > > > > > > >> > > > >>>>> > > > > > > > >> > > >>>>> > > > > > > > >> > >>>>> > > > > > > > >> >>>>> > > > > > > > >>>>> > > > > > > >>>>> > > > > > >>>>> > > > > >>>>> > > > >>>>> > > >>>>> > >>>>> https://github.com/apache/flink/blob/master/flink-end-to-end-tests/test-scripts/common_ssl.sh >>>>> > > > > > > > >> > > > > > > It would be nice to extend them to cover >>>>> this >>>>> > > > feature >>>>> > > > > as >>>>> > > > > > > > >> well. I >>>>> > > > > > > > >> > > > would >>>>> > > > > > > > >> > > > > > hate >>>>> > > > > > > > >> > > > > > > for this feature to slowly break by future >>>>> > > changes, >>>>> > > > so >>>>> > > > > > > good >>>>> > > > > > > > >> e2e >>>>> > > > > > > > >> > > test >>>>> > > > > > > > >> > > > > > > coverage is key, in particular bc so many >>>>> > > components >>>>> > > > > are >>>>> > > > > > > > >> > involved. >>>>> > > > > > > > >> > > > > > > >>>>> > > > > > > > >> > > > > > > Best, >>>>> > > > > > > > >> > > > > > > Robert >>>>> > > > > > > > >> > > > > > > >>>>> > > > > > > > >> > > > > > > On Wed, Apr 16, 2025 at 11:55 AM Nicolas >>>>> Fraison >>>>> > > > > > > > >> > > > > > > <[email protected]> >>>>> wrote: >>>>> > > > > > > > >> > > > > > > >>>>> > > > > > > > >> > > > > > > > Hi All, >>>>> > > > > > > > >> > > > > > > > >>>>> > > > > > > > >> > > > > > > > I'd like to start a discussion to >>>>> Handle TLS >>>>> > > > > > Certificate >>>>> > > > > > > > >> > Renewal >>>>> > > > > > > > >> > > > > > > > Please provide some feedback on this >>>>> proposal: >>>>> > > > > > > > >> > > > > > > > >>>>> > > > > > > > >> > > > > > > > >>>>> > > > > > > > >> > > > > > > >>>>> > > > > > > > >> > > > > > >>>>> > > > > > > > >> > > > > >>>>> > > > > > > > >> > > > >>>>> > > > > > > > >> > > >>>>> > > > > > > > >> > >>>>> > > > > > > > >> >>>>> > > > > > > > >>>>> > > > > > > >>>>> > > > > > >>>>> > > > > >>>>> > > > >>>>> > > >>>>> > >>>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-523%3A+Handle+TLS+Certificate+Renewal >>>>> > > > > > > > >> > > > > > > > >>>>> > > > > > > > >> > > > > > > > Regards, >>>>> > > > > > > > >> > > > > > > > >>>>> > > > > > > > >> > > > > > > > Nicolas Fraison >>>>> > > > > > > > >> > > > > > > > >>>>> > > > > > > > >> > > > > > > >>>>> > > > > > > > >> > > > > > >>>>> > > > > > > > >> > > > > >>>>> > > > > > > > >> > > > >>>>> > > > > > > > >> > > >>>>> > > > > > > > >> > >>>>> > > > > > > > >> >>>>> > > > > > > > > >>>>> > > > > > > > >>>>> > > > > > > >>>>> > > > > > >>>>> > > > > >>>>> > > > >>>>> > > >>>>> > >>>>> >>>>
