Thanks for your efforts! The FLIP looks good from my side. G
On Tue, Jun 24, 2025 at 7:22 AM Nicolas Fraison < nicolas.frai...@datadoghq.com> wrote: > Hi, > > This remaining part of the doc has been updated > > On Thu, Jun 19, 2025 at 9:37 AM Gabor Somogyi <gabor.g.somo...@gmail.com> > wrote: > >> I think in this SSLContextLoader class call() function is not correct and >> must be onFileOrDirectoryModified(...), right? >> >> [image: Screenshot 2025-06-19 at 9.35.14.png] >> >> On Thu, Jun 19, 2025 at 8:14 AM Nicolas Fraison < >> nicolas.frai...@datadoghq.com> wrote: >> >>> Hi, >>> >>> Which part of the doc are you referring to? >>> >>> On Wed, Jun 18, 2025 at 7:08 PM Gabor Somogyi <gabor.g.somo...@gmail.com> >>> wrote: >>> >>>> There is still a call function there which is left there from the >>>> previous >>>> design... >>>> >>>> On Tue, Jun 17, 2025 at 9:48 AM Nicolas Fraison >>>> <nicolas.frai...@datadoghq.com.invalid> wrote: >>>> >>>> > Done >>>> > >>>> > On Mon, Jun 16, 2025 at 2:46 PM Gabor Somogyi < >>>> gabor.g.somo...@gmail.com> >>>> > wrote: >>>> > >>>> > > I still see 3 Callable occurrence in the FLIP... >>>> > > >>>> > > On Thu, Jun 12, 2025 at 2:16 PM Nicolas Fraison >>>> > > <nicolas.frai...@datadoghq.com.invalid> wrote: >>>> > > >>>> > > > Hi, FLIP has been updated with last comment >>>> > > > >>>> > > > On Thu, Jun 5, 2025 at 6:35 PM Gabor Somogyi < >>>> > gabor.g.somo...@gmail.com> >>>> > > > wrote: >>>> > > > >>>> > > > > The interface has been added. Do we intend to pass an >>>> > > > > instance registerWatchedPath function as we discussed? >>>> > > > > If yes then the FLIP needs further adjustments all places where >>>> now >>>> > > > > Callable provided. >>>> > > > > >>>> > > > > G >>>> > > > > >>>> > > > > >>>> > > > > On Thu, Jun 5, 2025 at 2:27 PM Nicolas Fraison >>>> > > > > <nicolas.frai...@datadoghq.com.invalid> wrote: >>>> > > > > >>>> > > > > > Hi, >>>> > > > > > >>>> > > > > > It indeed make sense, FLIP has been updated >>>> > > > > > >>>> > > > > > Nicolas >>>> > > > > > >>>> > > > > > On Wed, Jun 4, 2025 at 12:10 PM Gabor Somogyi < >>>> > > > gabor.g.somo...@gmail.com >>>> > > > > > >>>> > > > > > wrote: >>>> > > > > > >>>> > > > > > > Hi, >>>> > > > > > > >>>> > > > > > > I've read it through. Basically looks good with one comment. >>>> > > > > > > *registerWatchedPath function* has a *Callable* as >>>> callback. For >>>> > > the >>>> > > > > > > current implementation that would be enough, >>>> > > > > > > but when somebody would like to use that for any other >>>> use-case >>>> > > then >>>> > > > it >>>> > > > > > > would be hard. Examples: >>>> > > > > > > * A single *call* function tells the user nothing what kind >>>> of >>>> > > event >>>> > > > is >>>> > > > > > > this >>>> > > > > > > * the watch service supports 3 events (create, modify, >>>> delete), >>>> > now >>>> > > > > when >>>> > > > > > I >>>> > > > > > > register I can get only updates (I presume) >>>> > > > > > > * 1+ dir watch with the same callback is not possible >>>> > > > > > > >>>> > > > > > > My suggestion would be to create a proper callback with all >>>> the >>>> > > event >>>> > > > > > type >>>> > > > > > > functions and no-op default behavior with the following >>>> names[1]. >>>> > > > > > > This is ~30 lines addition but will increase the readability >>>> > > heavily >>>> > > > + >>>> > > > > no >>>> > > > > > > need to touch this code later. >>>> > > > > > > >>>> > > > > > > BR, >>>> > > > > > > G >>>> > > > > > > >>>> > > > > > > [1] >>>> > > > > > > >>>> > > > > > > >>>> > > > > > >>>> > > > > >>>> > > > >>>> > > >>>> > >>>> https://github.com/apache/flink-kubernetes-operator/commit/e5a325c48965a50d61d0aa29e61ba79e97f27082#diff-a30b3ed9b8c53e998b15d7da7ad2e54374c98ffc3c920f76a70bce3fb37a9b2eR87-R93 >>>> > > > > > > >>>> > > > > > > >>>> > > > > > > On Tue, Jun 3, 2025 at 8:53 AM Nicolas Fraison >>>> > > > > > > <nicolas.frai...@datadoghq.com.invalid> wrote: >>>> > > > > > > >>>> > > > > > > > Hi, >>>> > > > > > > > >>>> > > > > > > > The FLIP has been updated. >>>> > > > > > > > Let me know if you have some other comments. >>>> > > > > > > > >>>> > > > > > > > Nicolas >>>> > > > > > > > >>>> > > > > > > > On Tue, May 20, 2025 at 12:01 PM Nicolas Fraison < >>>> > > > > > > > nicolas.frai...@datadoghq.com> wrote: >>>> > > > > > > > >>>> > > > > > > > > Hi Gabor, >>>> > > > > > > > > >>>> > > > > > > > > Thanks for the feedback. >>>> > > > > > > > > The overall proposal with atomic dirty flag being set >>>> by the >>>> > > > > callback >>>> > > > > > > > will >>>> > > > > > > > > indeed work with any kind of implementation. >>>> > > > > > > > > >>>> > > > > > > > > Will see to update the FLIP in a week or 2 if there >>>> are no >>>> > > other >>>> > > > > > > > comments >>>> > > > > > > > > on it >>>> > > > > > > > > >>>> > > > > > > > > Nicolas >>>> > > > > > > > > >>>> > > > > > > > > >>>> > > > > > > > > On Tue, May 20, 2025 at 11:08 AM Gabor Somogyi < >>>> > > > > > > > gabor.g.somo...@gmail.com> >>>> > > > > > > > > wrote: >>>> > > > > > > > > >>>> > > > > > > > >> Hi Nicolas, >>>> > > > > > > > >> >>>> > > > > > > > >> Related SSLContext I've not gone through all the cases >>>> where >>>> > > we >>>> > > > > need >>>> > > > > > > to >>>> > > > > > > > >> reload so instead I'm sharing the concept. >>>> > > > > > > > >> The main intention is that we must control all >>>> execution >>>> > paths >>>> > > > > which >>>> > > > > > > > >> decide >>>> > > > > > > > >> which certificate used for authentication. >>>> > > > > > > > >> Creating an SSLContext decorator which checks reload >>>> first >>>> > and >>>> > > > > then >>>> > > > > > > > >> forwards all calls to the original (wrapped) context >>>> > > > > > > > >> is one way to achieve that. If there are different >>>> > > > implementations >>>> > > > > > > which >>>> > > > > > > > >> end up in similar behavior then it's fine. >>>> > > > > > > > >> >>>> > > > > > > > >> BR, >>>> > > > > > > > >> G >>>> > > > > > > > >> >>>> > > > > > > > >> >>>> > > > > > > > >> On Tue, May 20, 2025 at 10:15 AM Nicolas Fraison >>>> > > > > > > > >> <nicolas.frai...@datadoghq.com.invalid> wrote: >>>> > > > > > > > >> >>>> > > > > > > > >> > Hi, >>>> > > > > > > > >> > >>>> > > > > > > > >> > Overall your proposal looks great. >>>> > > > > > > > >> > The event handling must indeed be super fast and we >>>> must >>>> > > also >>>> > > > > not >>>> > > > > > > > change >>>> > > > > > > > >> > original code path if reload not needed >>>> > > > > > > > >> > >>>> > > > > > > > >> > I have some concerns around the ReloadableSSLContext >>>> > > > > implementing >>>> > > > > > > all >>>> > > > > > > > >> > SSLContext >>>> > > > > > > > >> > Do you really mean SSLContext (java SSLContext one) >>>> or do >>>> > > you >>>> > > > > > refer >>>> > > > > > > to >>>> > > > > > > > >> the >>>> > > > > > > > >> > SslContext from netty? >>>> > > > > > > > >> > >>>> > > > > > > > >> > If java SSLContext >>>> > > > > > > > >> > - I'm not sure how this will manage reload from the >>>> > > BlobServer >>>> > > > > > > > >> > BlobServer relies on creation of an >>>> SSLServerSocketFactory >>>> > > > from >>>> > > > > > the >>>> > > > > > > > >> > SSLContext. >>>> > > > > > > > >> > But from my current understanding the >>>> > SSLServerSocketFactory >>>> > > > > does >>>> > > > > > > not >>>> > > > > > > > >> have >>>> > > > > > > > >> > any connection with the SSLContext. >>>> > > > > > > > >> > It only has some with an SSLContextImpl extends >>>> > > SSLContextSpi. >>>> > > > > > > > >> > I think we will need a callback here to enforce >>>> recreation >>>> > > of >>>> > > > > the >>>> > > > > > > > >> > BlobServer socket. >>>> > > > > > > > >> > - I'm also don't see how to attach this to the >>>> SslContext >>>> > > from >>>> > > > > > netty >>>> > > > > > > > >> > >>>> > > > > > > > >> > If netty SslContext >>>> > > > > > > > >> > - we would still need to have the callback to >>>> recreate the >>>> > > > > > > BlobServer >>>> > > > > > > > >> > socket >>>> > > > > > > > >> > - for netty and pekko we should be able to rely on it >>>> > > > > > > > >> > >>>> > > > > > > > >> > Nicolas >>>> > > > > > > > >> > >>>> > > > > > > > >> > On Wed, May 7, 2025 at 12:40 PM Gabor Somogyi < >>>> > > > > > > > >> gabor.g.somo...@gmail.com> >>>> > > > > > > > >> > wrote: >>>> > > > > > > > >> > >>>> > > > > > > > >> > > Hi All, >>>> > > > > > > > >> > > >>>> > > > > > > > >> > > I've read through the concerns/proposals related >>>> the >>>> > watch >>>> > > > > > service >>>> > > > > > > > and >>>> > > > > > > > >> > here >>>> > > > > > > > >> > > are my conclusions: >>>> > > > > > > > >> > > * Watch service can watch local file systems only: >>>> It's >>>> > > fair >>>> > > > > to >>>> > > > > > > say >>>> > > > > > > > >> that >>>> > > > > > > > >> > > certificates must be copied to local FS in order >>>> to work >>>> > > > (init >>>> > > > > > > > >> container >>>> > > > > > > > >> > cp >>>> > > > > > > > >> > > command or something) >>>> > > > > > > > >> > > * Watch service can send multiple events even for a >>>> > single >>>> > > > > > > directory >>>> > > > > > > > >> > > change: this can be mitigated with a single atomic >>>> dirty >>>> > > > flag >>>> > > > > > (see >>>> > > > > > > > my >>>> > > > > > > > >> > > design suggestion) >>>> > > > > > > > >> > > * Polling file modification time: When I hear any >>>> kind >>>> > of >>>> > > > > > polling >>>> > > > > > > > >> > > implemented by us is just something I'm mostly >>>> opposing >>>> > > > > > > > >> > > * security.ssl.internal.keystore.reload.duration: >>>> > Security >>>> > > > > > > features >>>> > > > > > > > >> must >>>> > > > > > > > >> > be >>>> > > > > > > > >> > > executed nearly immediately no matter what >>>> > > > > > > > >> > > >>>> > > > > > > > >> > > As a general remark, not sure how many users are >>>> using >>>> > the >>>> > > > > watch >>>> > > > > > > > >> service >>>> > > > > > > > >> > > based approach in the operator but until now I've >>>> not >>>> > seen >>>> > > > any >>>> > > > > > > issue >>>> > > > > > > > >> > > related to that. >>>> > > > > > > > >> > > If somebody is having some specifics then please >>>> share. >>>> > > > > > > > >> > > >>>> > > > > > > > >> > > For now I would shoot for keystore not to have >>>> feature >>>> > > > creep. >>>> > > > > > When >>>> > > > > > > > >> there >>>> > > > > > > > >> > is >>>> > > > > > > > >> > > a valid use-case, community interest and the >>>> keystore >>>> > > story >>>> > > > is >>>> > > > > > > > already >>>> > > > > > > > >> > rock >>>> > > > > > > > >> > > stable >>>> > > > > > > > >> > > then we can consider to involve truststore later. >>>> > > > > > > > >> > > >>>> > > > > > > > >> > > Having the mentioned assumption that the operator >>>> > approach >>>> > > > > > works, >>>> > > > > > > > >> here is >>>> > > > > > > > >> > > my high level proposal: >>>> > > > > > > > >> > > * Let's have enable flag for all such watch >>>> > functionality. >>>> > > > If >>>> > > > > > it's >>>> > > > > > > > >> false >>>> > > > > > > > >> > > then the currently existing functionality must >>>> remain >>>> > > as-is >>>> > > > > > > > >> > > * Let's have a LocalFSWatchService which is a >>>> singleton >>>> > > > which >>>> > > > > > has >>>> > > > > > > no >>>> > > > > > > > >> path >>>> > > > > > > > >> > > registrations by default >>>> > > > > > > > >> > > * Add path registration functionality which is >>>> > > synchronised >>>> > > > > > where >>>> > > > > > > a >>>> > > > > > > > >> > > callback can be registered >>>> > > > > > > > >> > > * Let's have a ReloadableSSLContext which >>>> implements the >>>> > > > > > mentioned >>>> > > > > > > > >> > callback >>>> > > > > > > > >> > > * Inside the callback set an atomic dirty flag only >>>> > (this >>>> > > > can >>>> > > > > > > handle >>>> > > > > > > > >> > > multiple events for the same directory change + >>>> event >>>> > > > handling >>>> > > > > > in >>>> > > > > > > > the >>>> > > > > > > > >> > watch >>>> > > > > > > > >> > > service must be extreme fast) >>>> > > > > > > > >> > > * Inside ReloadableSSLContext all SSLContext >>>> actions >>>> > must >>>> > > be >>>> > > > > > > > >> overridden. >>>> > > > > > > > >> > At >>>> > > > > > > > >> > > the beginning of each function dirty flag must be >>>> > checked >>>> > > > > > > > >> > > and if dirty then certificates must be reloaded, >>>> flag >>>> > can >>>> > > be >>>> > > > > set >>>> > > > > > > > back >>>> > > > > > > > >> to >>>> > > > > > > > >> > > false (then original functionality call). It's >>>> extremely >>>> > > > > > important >>>> > > > > > > > >> that >>>> > > > > > > > >> > > context reload must be synchronised. >>>> > > > > > > > >> > > A synchronised boolean check + possible context >>>> reload >>>> > can >>>> > > > > > consume >>>> > > > > > > > >> some >>>> > > > > > > > >> > > time but I wouldn't expect any significant >>>> performance >>>> > > drop. >>>> > > > > > > > >> > > >>>> > > > > > > > >> > > My proposal main drivers: >>>> > > > > > > > >> > > * Original code path must run when no watch service >>>> > asked >>>> > > > > > > > >> > > * Super fast event handling because million events >>>> may >>>> > > come >>>> > > > in >>>> > > > > > > (not >>>> > > > > > > > >> > > expecting but we should be prepared) >>>> > > > > > > > >> > > * Clean separation between dir/file watch and >>>> file/dir >>>> > > usage >>>> > > > > > > > >> > > * Well considered synchronisation model >>>> > > > > > > > >> > > * Extensive unit testing because we're intended to >>>> touch >>>> > > the >>>> > > > > > heart >>>> > > > > > > > of >>>> > > > > > > > >> > Flink >>>> > > > > > > > >> > > >>>> > > > > > > > >> > > Happy to hear other opinions. >>>> > > > > > > > >> > > >>>> > > > > > > > >> > > BR, >>>> > > > > > > > >> > > G >>>> > > > > > > > >> > > >>>> > > > > > > > >> > > >>>> > > > > > > > >> > > On Mon, Apr 28, 2025 at 1:54 PM Nicolas Fraison >>>> > > > > > > > >> > > <nicolas.frai...@datadoghq.com.invalid> wrote: >>>> > > > > > > > >> > > >>>> > > > > > > > >> > > > Thanks all for your feedback and sorry for the >>>> late >>>> > > answer >>>> > > > > (I >>>> > > > > > > was >>>> > > > > > > > on >>>> > > > > > > > >> > > > holiday). >>>> > > > > > > > >> > > > >>>> > > > > > > > >> > > > 1. Indeed it would add 4 threads. >>>> > > > > > > > >> > > > For the non pekko component we can indeed have >>>> one >>>> > > watcher >>>> > > > > > > service >>>> > > > > > > > >> used >>>> > > > > > > > >> > > to >>>> > > > > > > > >> > > > reload SSLContext for those components >>>> > > > > > > > >> > > > For pekko this is a little more challenging as >>>> the >>>> > > > creation >>>> > > > > of >>>> > > > > > > the >>>> > > > > > > > >> > pekko >>>> > > > > > > > >> > > > ssl engine is managed by pekko himself. >>>> > > > > > > > >> > > > Flink only generates appropriate config with >>>> class to >>>> > > > > execute >>>> > > > > > to >>>> > > > > > > > >> > initiate >>>> > > > > > > > >> > > > the pekko ssl engine [1]. This means that I will >>>> not >>>> > be >>>> > > > able >>>> > > > > > to >>>> > > > > > > > >> provide >>>> > > > > > > > >> > > the >>>> > > > > > > > >> > > > watcher service to this ssl engine. >>>> > > > > > > > >> > > > One solution would be to rely on a singleton >>>> instead >>>> > of >>>> > > a >>>> > > > > > > service >>>> > > > > > > > >> > > injected >>>> > > > > > > > >> > > > in each component but I'm not sure this is fine >>>> to use >>>> > > > such >>>> > > > > in >>>> > > > > > > > >> flink. >>>> > > > > > > > >> > > > WDYT? >>>> > > > > > > > >> > > > >>>> > > > > > > > >> > > > We can also add a specific flink configuration >>>> > > > > > > > >> > > > (security.ssl.internal.keystore.reload.enable) >>>> to only >>>> > > add >>>> > > > > > this >>>> > > > > > > > >> watcher >>>> > > > > > > > >> > > > mechanism if the config is enabled to avoid >>>> adding >>>> > those >>>> > > > > > threads >>>> > > > > > > > if >>>> > > > > > > > >> > this >>>> > > > > > > > >> > > is >>>> > > > > > > > >> > > > not needed. >>>> > > > > > > > >> > > > >>>> > > > > > > > >> > > > >>>> > > > > > > > >> > > > 2. I'm fine with the LocalFSWatchService naming. >>>> > > > > > > > >> > > > >>>> > > > > > > > >> > > > 3. Also agree that some e2e tests must be added. >>>> > > > > > > > >> > > > >>>> > > > > > > > >> > > > >>>> > > > > > > > >> > > > @doguscan namal, Thanks for challenging the >>>> proposal. >>>> > > > > > > > >> > > > FYI, we are planning to rely on certificates with >>>> > really >>>> > > > > short >>>> > > > > > > > >> > validity. >>>> > > > > > > > >> > > > >>>> > > > > > > > >> > > > D1. It seems that the proposal to rely on a >>>> reload >>>> > > period >>>> > > > > will >>>> > > > > > > > still >>>> > > > > > > > >> > face >>>> > > > > > > > >> > > > potential issues: >>>> > > > > > > > >> > > > - receiving events while file content updates are >>>> > still >>>> > > in >>>> > > > > > > > progress: >>>> > > > > > > > >> > > > There is no guarantee that we will not load the >>>> > > > certificate >>>> > > > > > > while >>>> > > > > > > > >> file >>>> > > > > > > > >> > > > content updates are still in progress >>>> > > > > > > > >> > > > - while it will not be affected by multiple >>>> > > notifications, >>>> > > > > we >>>> > > > > > > can >>>> > > > > > > > >> > reach a >>>> > > > > > > > >> > > > point where only the truststore is updated when >>>> the >>>> > > reload >>>> > > > > > > > happens. >>>> > > > > > > > >> > > > Which means that if the keystore is updated just >>>> > after, >>>> > > it >>>> > > > > > will >>>> > > > > > > > not >>>> > > > > > > > >> be >>>> > > > > > > > >> > > > taken in account before next run of the reload >>>> > mechanism >>>> > > > > > > > >> > > > >>>> > > > > > > > >> > > > I think that with WatchService and appropriate >>>> reload >>>> > > > grace >>>> > > > > > > period >>>> > > > > > > > >> > > > mechanism we should be able to mitigate those 2 >>>> issues >>>> > > > > > (ensuring >>>> > > > > > > > >> > minimum >>>> > > > > > > > >> > > > reload even with multiple notify) >>>> > > > > > > > >> > > > >>>> > > > > > > > >> > > > From KIP-1119 [2] it looks like the same kind of >>>> > > > > requirements >>>> > > > > > is >>>> > > > > > > > >> under >>>> > > > > > > > >> > > > discussion for Kafka to also rely on the >>>> WatchService >>>> > > Java >>>> > > > > API >>>> > > > > > > > >> > > (SpringBoot >>>> > > > > > > > >> > > > seems to also rely on this API to manage ssl >>>> reload >>>> > > [3]). >>>> > > > > > > > >> > > > >>>> > > > > > > > >> > > > D3. Do we have a real use case for this? >>>> > > > > > > > >> > > > >>>> > > > > > > > >> > > > [1] >>>> > > > > > > > >> > > > >>>> > > > > > > > >> > > > >>>> > > > > > > > >> > > >>>> > > > > > > > >> > >>>> > > > > > > > >> >>>> > > > > > > > >>>> > > > > > > >>>> > > > > > >>>> > > > > >>>> > > > >>>> > > >>>> > >>>> https://github.com/apache/flink/blob/b523264ab45d37cd9584a0e8c06f1ef6bd1aaed7/flink-rpc/flink-rpc-akka/src/main/java/org/apache/flink/runtime/rpc/pekko/PekkoUtils.java#L372 >>>> > > > > > > > >> > > > >>>> > > > > > > > >> > > > [2] >>>> > > > > > > > >> > > > >>>> > > > > > > > >> > > > >>>> > > > > > > > >> > > >>>> > > > > > > > >> > >>>> > > > > > > > >> >>>> > > > > > > > >>>> > > > > > > >>>> > > > > > >>>> > > > > >>>> > > > >>>> > > >>>> > >>>> https://cwiki.apache.org/confluence/display/KAFKA/KIP-1119%3A+Add+support+for+SSL+auto+reload >>>> > > > > > > > >> > > > >>>> > > > > > > > >> > > > [3] >>>> > > > > > > > >> > > > >>>> > > > > > > > >> > > > >>>> > > > > > > > >> > > >>>> > > > > > > > >> > >>>> > > > > > > > >> >>>> > > > > > > > >>>> > > > > > > >>>> > > > > > >>>> > > > > >>>> > > > >>>> > > >>>> > >>>> https://docs.spring.io/spring-boot/reference/features/ssl.html#features.ssl.reloading >>>> > > > > > > > >> > > > >>>> > > > > > > > >> > > > Regards, >>>> > > > > > > > >> > > > Nicolas >>>> > > > > > > > >> > > > >>>> > > > > > > > >> > > > On Wed, Apr 23, 2025 at 12:14 PM Doğuşcan Namal < >>>> > > > > > > > >> > > namal.dogus...@gmail.com> >>>> > > > > > > > >> > > > wrote: >>>> > > > > > > > >> > > > >>>> > > > > > > > >> > > > > Hi Nicolas, thanks for the FLIP. >>>> > > > > > > > >> > > > > >>>> > > > > > > > >> > > > > I am fully supportive of the motivation and we >>>> > should >>>> > > be >>>> > > > > > > > >> supporting >>>> > > > > > > > >> > > this >>>> > > > > > > > >> > > > > feature. Here are couple of comments from my >>>> side: >>>> > > > > > > > >> > > > > >>>> > > > > > > > >> > > > > D1) >>>> > > > > > > > >> > > > > Since you shared the implementation details on >>>> the >>>> > > FLIP >>>> > > > as >>>> > > > > > > > well, I >>>> > > > > > > > >> > > would >>>> > > > > > > > >> > > > > like to discuss whether using Java's >>>> WatchService is >>>> > > the >>>> > > > > > best >>>> > > > > > > > >> choice >>>> > > > > > > > >> > > > here. >>>> > > > > > > > >> > > > > >>>> > > > > > > > >> > > > > I believe that the certificate renewal is not a >>>> > > frequent >>>> > > > > > > > >> operation. >>>> > > > > > > > >> > > Even >>>> > > > > > > > >> > > > > once a day certificate renewals are not >>>> > > realistic(except >>>> > > > > for >>>> > > > > > > > test >>>> > > > > > > > >> > cases >>>> > > > > > > > >> > > > > maybe) but let's assume that this covers up the >>>> > p99.99 >>>> > > > of >>>> > > > > > the >>>> > > > > > > > use >>>> > > > > > > > >> > > cases. >>>> > > > > > > > >> > > > I >>>> > > > > > > > >> > > > > am confident on this estimation since there >>>> hasn't >>>> > > been >>>> > > > a >>>> > > > > > > > request >>>> > > > > > > > >> > from >>>> > > > > > > > >> > > > the >>>> > > > > > > > >> > > > > community for this feature so far, which >>>> confirms >>>> > that >>>> > > > > > people >>>> > > > > > > > were >>>> > > > > > > > >> > okay >>>> > > > > > > > >> > > > > with infrequent cluster restarts. Following >>>> that it >>>> > is >>>> > > > > > > > >> infrequent, I >>>> > > > > > > > >> > > > > believe that spawning up a thread that watches >>>> the >>>> > > file >>>> > > > > > > > >> modification >>>> > > > > > > > >> > > > > operations is not the best use of the limited >>>> > > resources >>>> > > > > on a >>>> > > > > > > > >> cluster. >>>> > > > > > > > >> > > > > >>>> > > > > > > > >> > > > > There are some known limitations of the >>>> WatchService >>>> > > as >>>> > > > > well >>>> > > > > > > > such >>>> > > > > > > > >> as >>>> > > > > > > > >> > > > > receiving multiple modification events for the >>>> same >>>> > > > > > occurence >>>> > > > > > > > [1], >>>> > > > > > > > >> > > > > inotify's (WatchService's underlying mechanism >>>> in >>>> > > Linux >>>> > > > > > > > >> environments) >>>> > > > > > > > >> > > > > problems on containerized environments due to >>>> remote >>>> > > > file >>>> > > > > > > > systems >>>> > > > > > > > >> [2] >>>> > > > > > > > >> > > or >>>> > > > > > > > >> > > > > receiving events while file content updates are >>>> > still >>>> > > in >>>> > > > > > > > progress. >>>> > > > > > > > >> > > [3]. I >>>> > > > > > > > >> > > > > do not know if these limitations are addressed >>>> in >>>> > the >>>> > > > > newer >>>> > > > > > > > >> versions >>>> > > > > > > > >> > > but >>>> > > > > > > > >> > > > > regardless of that it is clear that we may >>>> face with >>>> > > > some >>>> > > > > > ugly >>>> > > > > > > > >> edge >>>> > > > > > > > >> > > cases >>>> > > > > > > > >> > > > > due to that. >>>> > > > > > > > >> > > > > >>>> > > > > > > > >> > > > > Given these complications, I would recommend >>>> just >>>> > > > > creating a >>>> > > > > > > new >>>> > > > > > > > >> > > > SSLContext >>>> > > > > > > > >> > > > > after a configured duration is expired. We >>>> could >>>> > > record >>>> > > > > the >>>> > > > > > > > >> timestamp >>>> > > > > > > > >> > > > when >>>> > > > > > > > >> > > > > the previous SSLContext is created and update >>>> it >>>> > > after a >>>> > > > > > > > >> configured >>>> > > > > > > > >> > > > > duration is passed. This will be much easier >>>> to test >>>> > > and >>>> > > > > > > reason >>>> > > > > > > > >> about >>>> > > > > > > > >> > > > when >>>> > > > > > > > >> > > > > it is running on production. This will >>>> eliminate the >>>> > > > > > necessity >>>> > > > > > > > to >>>> > > > > > > > >> > > reason >>>> > > > > > > > >> > > > > about the file modification operations as well. >>>> > > > > > > > >> > > > > >>>> > > > > > > > >> > > > > I briefly skimmed through the classes that >>>> need to >>>> > be >>>> > > > > > modified >>>> > > > > > > > >> and it >>>> > > > > > > > >> > > > > looked feasible for me. Let me know what are >>>> your >>>> > > > comments >>>> > > > > > on >>>> > > > > > > > >> these. >>>> > > > > > > > >> > > > > >>>> > > > > > > > >> > > > > Note that this is already used in the Kafka >>>> world >>>> > > where >>>> > > > a >>>> > > > > > new >>>> > > > > > > > >> > > SSLContext >>>> > > > > > > > >> > > > is >>>> > > > > > > > >> > > > > created after 12 hours. >>>> > > > > > > > >> > > > > >>>> > > > > > > > >> > > > > D2) We could provide a configuration to the >>>> user, >>>> > such >>>> > > > as >>>> > > > > > > > >> > > > > >>>> "security.ssl.internal.keystore.reload.duration" so >>>> > > they >>>> > > > > > could >>>> > > > > > > > >> decide >>>> > > > > > > > >> > > how >>>> > > > > > > > >> > > > > often the new certificates should be loaded. >>>> > > > > > > > >> > > > > >>>> > > > > > > > >> > > > > D3) >>>> > > > > > > > >> > > > > On the other hand, I wonder whether we should >>>> also >>>> > > > handle >>>> > > > > > > > >> supporting >>>> > > > > > > > >> > > > > updating the file paths of the truststores and >>>> > > keystores >>>> > > > > > under >>>> > > > > > > > >> this >>>> > > > > > > > >> > > FLIP >>>> > > > > > > > >> > > > as >>>> > > > > > > > >> > > > > well. Since the name of the FLIP is "Handle TLS >>>> > > > > Certificate >>>> > > > > > > > >> Renewal" >>>> > > > > > > > >> > I >>>> > > > > > > > >> > > > > think we could bring that into scope too :) >>>> > > > > > > > >> > > > > >>>> > > > > > > > >> > > > > Thanks >>>> > > > > > > > >> > > > > >>>> > > > > > > > >> > > > > [1] >>>> > > > > > > > >> > > > > >>>> > > > > > > > >> > > > > >>>> > > > > > > > >> > > > >>>> > > > > > > > >> > > >>>> > > > > > > > >> > >>>> > > > > > > > >> >>>> > > > > > > > >>>> > > > > > > >>>> > > > > > >>>> > > > > >>>> > > > >>>> > > >>>> > >>>> https://stackoverflow.com/questions/16777869/java-7-watchservice-ignoring-multiple-occurrences-of-the-same-event >>>> > > > > > > > >> > > > > >>>> > > > > > > > >> > > > > [2] >>>> > > > > > > > >>>> https://blog.arkey.fr/2019/09/13/watchservice-and-bind-mount/ >>>> > > > > > > > >> > > > > >>>> > > > > > > > >> > > > > [3] >>>> > > > > > > > >> > > > >>>> > > > > > > > >> >>>> > > > > > >>>> > > >>>> https://surajatreyac.github.io/2014-07-29/reactive_file_handling.html >>>> > > > > > > > >> > > > > >>>> > > > > > > > >> > > > > [4] See also Platform Dependencies - >>>> > > > > > > > >> > > > > >>>> > > > > > > > >> > > > > >>>> > > > > > > > >> > > > >>>> > > > > > > > >> > > >>>> > > > > > > > >> > >>>> > > > > > > > >> >>>> > > > > > > > >>>> > > > > > > >>>> > > > > > >>>> > > > > >>>> > > > >>>> > > >>>> > >>>> https://docs.oracle.com/javase/8/docs/api/index.html?java/nio/file/WatchService.html >>>> > > > > > > > >> > > > > >>>> > > > > > > > >> > > > > >>>> > > > > > > > >> > > > > On Fri, 18 Apr 2025 at 18:25, Gabor Somogyi < >>>> > > > > > > > >> > gabor.g.somo...@gmail.com >>>> > > > > > > > >> > > > >>>> > > > > > > > >> > > > > wrote: >>>> > > > > > > > >> > > > > >>>> > > > > > > > >> > > > > > Hi Robert, >>>> > > > > > > > >> > > > > > >>>> > > > > > > > >> > > > > > Since I've added the same feature to the >>>> operator >>>> > > I'll >>>> > > > > > take >>>> > > > > > > a >>>> > > > > > > > >> look >>>> > > > > > > > >> > at >>>> > > > > > > > >> > > > it. >>>> > > > > > > > >> > > > > > Though it won't be lightning fast since I'm >>>> having >>>> > > > > several >>>> > > > > > > > weeks >>>> > > > > > > > >> > off. >>>> > > > > > > > >> > > > > > >>>> > > > > > > > >> > > > > > Your questions are valid especially >>>> considering >>>> > the >>>> > > > fact >>>> > > > > > > that >>>> > > > > > > > >> this >>>> > > > > > > > >> > > > > feature >>>> > > > > > > > >> > > > > > touches the hearth of the authentication so >>>> this >>>> > > must >>>> > > > be >>>> > > > > > > rock >>>> > > > > > > > >> solid >>>> > > > > > > > >> > > > > > in order to avoid grey hair :) >>>> > > > > > > > >> > > > > > >>>> > > > > > > > >> > > > > > (1) I would vote on a single service which is >>>> > > heavily >>>> > > > > unit >>>> > > > > > > > >> tested >>>> > > > > > > > >> > > with >>>> > > > > > > > >> > > > > > all the possible combinations including >>>> threading. >>>> > > > > > > > >> > > > > > >>>> > > > > > > > >> > > > > > Some standalone app could be added to really >>>> play >>>> > > with >>>> > > > > it >>>> > > > > > > > (that >>>> > > > > > > > >> > would >>>> > > > > > > > >> > > > > help >>>> > > > > > > > >> > > > > > review). >>>> > > > > > > > >> > > > > > I mean, create X files, start Y threads, and >>>> make >>>> > > > > > > assertions. >>>> > > > > > > > >> > > > > > The reason why I'm suggesting it is the fact >>>> that >>>> > > > AFAIR >>>> > > > > > the >>>> > > > > > > > >> watch >>>> > > > > > > > >> > > > service >>>> > > > > > > > >> > > > > > is quite sensitive even in single thread. If >>>> we >>>> > > could >>>> > > > do >>>> > > > > > > this >>>> > > > > > > > >> in a >>>> > > > > > > > >> > > > finite >>>> > > > > > > > >> > > > > > time >>>> > > > > > > > >> > > > > > consuming unit test then it's even better. >>>> > > > > > > > >> > > > > > >>>> > > > > > > > >> > > > > > (2) +1 on that name to avoid confusion >>>> > > > > > > > >> > > > > > >>>> > > > > > > > >> > > > > > (3) I agree that some e2e is must, however >>>> this >>>> > can >>>> > > be >>>> > > > > > > easily >>>> > > > > > > > >> and >>>> > > > > > > > >> > > > deeply >>>> > > > > > > > >> > > > > > unit >>>> > > > > > > > >> > > > > > tested so that part is also essential. One >>>> key >>>> > test >>>> > > > here >>>> > > > > > is >>>> > > > > > > > when >>>> > > > > > > > >> > > > > > certificates >>>> > > > > > > > >> > > > > > are not changing then no action must be >>>> performed >>>> > > (not >>>> > > > > to >>>> > > > > > > > break >>>> > > > > > > > >> the >>>> > > > > > > > >> > > > whole >>>> > > > > > > > >> > > > > > system apart). >>>> > > > > > > > >> > > > > > >>>> > > > > > > > >> > > > > > Purely personal opinion but such feature >>>> > > developments >>>> > > > > are >>>> > > > > > > slow >>>> > > > > > > > >> by >>>> > > > > > > > >> > > > nature >>>> > > > > > > > >> > > > > > because >>>> > > > > > > > >> > > > > > of edge case / stress testing. >>>> > > > > > > > >> > > > > > >>>> > > > > > > > >> > > > > > BR, >>>> > > > > > > > >> > > > > > G >>>> > > > > > > > >> > > > > > >>>> > > > > > > > >> > > > > > >>>> > > > > > > > >> > > > > > On Fri, Apr 18, 2025 at 4:53 PM Robert >>>> Metzger < >>>> > > > > > > > >> > rmetz...@apache.org> >>>> > > > > > > > >> > > > > > wrote: >>>> > > > > > > > >> > > > > > >>>> > > > > > > > >> > > > > > > Hi Nicolas, >>>> > > > > > > > >> > > > > > > >>>> > > > > > > > >> > > > > > > This looks like a nice improvement, thanks >>>> for >>>> > the >>>> > > > > write >>>> > > > > > > up. >>>> > > > > > > > >> > > > > > > Are you in touch with any committer who's >>>> > willing >>>> > > to >>>> > > > > > > review >>>> > > > > > > > / >>>> > > > > > > > >> > merge >>>> > > > > > > > >> > > > > this? >>>> > > > > > > > >> > > > > > > >>>> > > > > > > > >> > > > > > > Some random questions on the FLIP: >>>> > > > > > > > >> > > > > > > (1) "Each service that depends on TLS >>>> > > certificates >>>> > > > > will >>>> > > > > > > > >> > > initialize a >>>> > > > > > > > >> > > > > > > FileSytemWatchService" >>>> > > > > > > > >> > > > > > > >>>> > > > > > > > >> > > > > > > It seems that there are 4 components using >>>> SSL, >>>> > > does >>>> > > > > > this >>>> > > > > > > > mean >>>> > > > > > > > >> > > there >>>> > > > > > > > >> > > > > will >>>> > > > > > > > >> > > > > > > be 4 additional threads running, watching >>>> the >>>> > same >>>> > > > set >>>> > > > > > of >>>> > > > > > > > >> files? >>>> > > > > > > > >> > > > > > > Wouldn't it be better to introduce a >>>> central >>>> > file >>>> > > > > > watching >>>> > > > > > > > >> > service, >>>> > > > > > > > >> > > > and >>>> > > > > > > > >> > > > > > SSL >>>> > > > > > > > >> > > > > > > users can subscribe to updates, to reduce >>>> the >>>> > > number >>>> > > > > of >>>> > > > > > > > >> threads? >>>> > > > > > > > >> > > > > > > If this makes the whole effort 4x more >>>> > > complicated, >>>> > > > I >>>> > > > > > > > wouldn't >>>> > > > > > > > >> > > > consider >>>> > > > > > > > >> > > > > > it, >>>> > > > > > > > >> > > > > > > but if its roughly the same effort, we >>>> should :) >>>> > > > > > > > >> > > > > > > >>>> > > > > > > > >> > > > > > > (2) "FileSytemWatchService" >>>> > > > > > > > >> > > > > > > When I read this name, I was wondering, >>>> whether >>>> > > this >>>> > > > > is >>>> > > > > > > > >> somehow >>>> > > > > > > > >> > > > related >>>> > > > > > > > >> > > > > > to >>>> > > > > > > > >> > > > > > > the Flink "FileSystem" classes. Which I >>>> think >>>> > its' >>>> > > > > not. >>>> > > > > > > > >> > > > > > > Maybe a different name, that makes this >>>> > separation >>>> > > > > more >>>> > > > > > > > >> explicit, >>>> > > > > > > > >> > > > would >>>> > > > > > > > >> > > > > > > make sense. Maybe "LocalFSWatchService"? >>>> > > > > > > > >> > > > > > > (I'm sorry to bring up naming stuff -- its >>>> very >>>> > > > > > > subjective, >>>> > > > > > > > >> and >>>> > > > > > > > >> > > > > > difficult) >>>> > > > > > > > >> > > > > > > >>>> > > > > > > > >> > > > > > > (3) For the test plan: There seem to be >>>> some SSL >>>> > > > > related >>>> > > > > > > e2e >>>> > > > > > > > >> > tests: >>>> > > > > > > > >> > > > > > > >>>> > > > > > > > >> > > > > > > >>>> > > > > > > > >> > > > > > >>>> > > > > > > > >> > > > > >>>> > > > > > > > >> > > > >>>> > > > > > > > >> > > >>>> > > > > > > > >> > >>>> > > > > > > > >> >>>> > > > > > > > >>>> > > > > > > >>>> > > > > > >>>> > > > > >>>> > > > >>>> > > >>>> > >>>> https://github.com/apache/flink/blob/master/flink-end-to-end-tests/test-scripts/common_ssl.sh >>>> > > > > > > > >> > > > > > > It would be nice to extend them to cover >>>> this >>>> > > > feature >>>> > > > > as >>>> > > > > > > > >> well. I >>>> > > > > > > > >> > > > would >>>> > > > > > > > >> > > > > > hate >>>> > > > > > > > >> > > > > > > for this feature to slowly break by future >>>> > > changes, >>>> > > > so >>>> > > > > > > good >>>> > > > > > > > >> e2e >>>> > > > > > > > >> > > test >>>> > > > > > > > >> > > > > > > coverage is key, in particular bc so many >>>> > > components >>>> > > > > are >>>> > > > > > > > >> > involved. >>>> > > > > > > > >> > > > > > > >>>> > > > > > > > >> > > > > > > Best, >>>> > > > > > > > >> > > > > > > Robert >>>> > > > > > > > >> > > > > > > >>>> > > > > > > > >> > > > > > > On Wed, Apr 16, 2025 at 11:55 AM Nicolas >>>> Fraison >>>> > > > > > > > >> > > > > > > <nicolas.frai...@datadoghq.com.invalid> >>>> wrote: >>>> > > > > > > > >> > > > > > > >>>> > > > > > > > >> > > > > > > > Hi All, >>>> > > > > > > > >> > > > > > > > >>>> > > > > > > > >> > > > > > > > I'd like to start a discussion to Handle >>>> TLS >>>> > > > > > Certificate >>>> > > > > > > > >> > Renewal >>>> > > > > > > > >> > > > > > > > Please provide some feedback on this >>>> proposal: >>>> > > > > > > > >> > > > > > > > >>>> > > > > > > > >> > > > > > > > >>>> > > > > > > > >> > > > > > > >>>> > > > > > > > >> > > > > > >>>> > > > > > > > >> > > > > >>>> > > > > > > > >> > > > >>>> > > > > > > > >> > > >>>> > > > > > > > >> > >>>> > > > > > > > >> >>>> > > > > > > > >>>> > > > > > > >>>> > > > > > >>>> > > > > >>>> > > > >>>> > > >>>> > >>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-523%3A+Handle+TLS+Certificate+Renewal >>>> > > > > > > > >> > > > > > > > >>>> > > > > > > > >> > > > > > > > Regards, >>>> > > > > > > > >> > > > > > > > >>>> > > > > > > > >> > > > > > > > Nicolas Fraison >>>> > > > > > > > >> > > > > > > > >>>> > > > > > > > >> > > > > > > >>>> > > > > > > > >> > > > > > >>>> > > > > > > > >> > > > > >>>> > > > > > > > >> > > > >>>> > > > > > > > >> > > >>>> > > > > > > > >> > >>>> > > > > > > > >> >>>> > > > > > > > > >>>> > > > > > > > >>>> > > > > > > >>>> > > > > > >>>> > > > > >>>> > > > >>>> > > >>>> > >>>> >>>