The relationship between the session deployment and the Flink jobs looks good to me except for the session deployment deletion.
I strongly suggest not setting the ownerReference of the FlinkSessionJob to the session FlinkDeployment. Otherwise, it will be a disaster if the session FlinkDeployment is deleted accidentally while many jobs are running. We should check that there is no running Flink job before deleting a session FlinkDeployment. This also forces users to confirm twice.

Best,
Yang

Aitozi <gjying1...@gmail.com> wrote on Tue, Mar 22, 2022 at 17:49:

> Hi Thomas:
>
> Thanks for your valuable question. Let's make the relationship between
> the session deployment and the jobs clearer.
>
> IMO, the session deployment and jobs interact in these situations:
>
> - Create the session job. The FlinkSessionJobController will wait for the
>   session cluster to be ready and then submit the job. The lookup key is
>   the namespace and clusterId.
>
> - Delete the session job. This will cancel the current session job.
>
> - Delete the session deployment. The session jobs have to be deleted
>   first; we could set the ownerReference of the FlinkSessionJob to let
>   Kubernetes clean up the session jobs before removing the session
>   deployment.
>
> - Upgrade the session deployment. This is a critical part, because it
>   affects all the session jobs. We should suspend the jobs first and then
>   upgrade the session cluster. So I tend to validate that all the jobs are
>   suspended and then perform the session cluster upgrade. After the
>   upgrade, the session jobs are switched back to running manually.
>
> What do you think about this? If there is no objection, I will clarify it
> in the FLIP doc.
>
> Besides, sorry for the rough vote and discussion process. It's my first
> time driving this, I will keep that in mind next time :)
>
> Best,
> Aitozi.
>
> Yang Wang <danrtsey...@gmail.com> wrote on Tue, Mar 22, 2022 at 10:11:
>
> > I think the session cluster should not be deleted unless all the running
> > jobs have finished or been cancelled. I agree this should be clarified in
> > the FLIP.
> >
> > Best,
> > Yang
> >
> > Thomas Weise <t...@apache.org> wrote on Tue, Mar 22, 2022 at 09:26:
> >
> > > Hi Aitozi,
> > >
> > > Thanks for the proposal. Can you please clarify in the FLIP the
> > > relationship between the session deployment and the jobs that depend on
> > > it? Will, for example, the operator ensure that the individual jobs are
> > > deleted when the underlying cluster is deleted?
> > >
> > > Side note: When the discussion thread started 5 days ago and a FLIP vote
> > > was started 2 days later and there is also a weekend included, then this
> > > is probably on the short side for broader feedback.
> > >
> > > Thanks,
> > > Thomas
> > >
> > > On Fri, Mar 18, 2022 at 4:01 AM Yang Wang <danrtsey...@gmail.com> wrote:
> > >
> > > > Great work. Since we are introducing a new public API, it deserves a
> > > > FLIP. And the FLIP will help later contributors catch up quickly.
> > > >
> > > > Best,
> > > > Yang
> > > >
> > > > Gyula Fóra <gyula.f...@gmail.com> wrote on Fri, Mar 18, 2022 at 18:11:
> > > >
> > > > > Thanks Aitozi, a FLIP might be overkill at this point but no harm in
> > > > > voting on it anyway :)
> > > > >
> > > > > Looks good!
> > > > >
> > > > > Gyula
> > > > >
> > > > > On Fri, Mar 18, 2022 at 10:25 AM Aitozi <gjying1...@gmail.com> wrote:
> > > > >
> > > > > > Hi Guys:
> > > > > >
> > > > > > FYI, I have integrated your comments and drawn up FLIP-215[1]. I
> > > > > > will create another thread to vote for it.
> > > > > >
> > > > > > [1]:
> > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-215%3A+Introduce+FlinkSessionJob+CRD+in+the+kubernetes+operator
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > > Aitozi.
> > > > > >
> > > > > > Aitozi <gjying1...@gmail.com> wrote on Thu, Mar 17, 2022 at 11:16:
> > > > > >
> > > > > > > Hi Biao Geng:
> > > > > > >
> > > > > > > Thanks for your feedback, I'm +1 to go with option#2.
> > > > > > > It's a good point that we should improve the error message
> > > > > > > debugging for the session job. I think it can be follow-up work,
> > > > > > > as an improvement after we support the session job operation.
> > > > > > >
> > > > > > > Best,
> > > > > > >
> > > > > > > Aitozi.
> > > > > > >
> > > > > > > Geng Biao <biaoge...@gmail.com> wrote on Thu, Mar 17, 2022 at 10:55:
> > > > > > >
> > > > > > > > Thanks Aitozi for the work!
> > > > > > > >
> > > > > > > > I lean towards option#2 of using JarRunHeaders with an uber job
> > > > > > > > jar as well. As Yang said, the user-defined dependencies may be
> > > > > > > > better supported in upstream Flink.
> > > > > > > > A follow-up thought: I think we should consider the potential
> > > > > > > > influence on user experience: as the job graph is generated in
> > > > > > > > the JM, when the generation fails due to some issue in the main()
> > > > > > > > method, we should do some work on showing such error messages in
> > > > > > > > this proposal or the later k8s operator implementation. The
> > > > > > > > reason for raising this is that if users submit many jobs to the
> > > > > > > > same session cluster, it may not be easy for them to find the
> > > > > > > > relevant error logs about the main() method of a specific job.
> > > > > > > > FLINK-25715 could help us later.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Biao Geng
> > > > > > > >
> > > > > > > > From: Aitozi <gjying1...@gmail.com>
> > > > > > > > Date: Wednesday, March 16, 2022, 5:19 PM
> > > > > > > > To: dev@flink.apache.org <dev@flink.apache.org>
> > > > > > > > Subject: Re: [DISCUSS] Support the session job management in kubernetes operator
> > > > > > > >
> > > > > > > > Hi Yang Wang
> > > > > > > >
> > > > > > > > Thanks for your feedback. Providing the local and HTTP
> > > > > > > > implementations for the first version makes sense to me.
> > > > > > > > +1 for it.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Aitozi
> > > > > > > >
> > > > > > > > Yang Wang <danrtsey...@gmail.com> wrote on Wed, Mar 16, 2022 at 16:44:
> > > > > > > >
> > > > > > > > > # How to download the user jars
> > > > > > > > > I agree with Gyula that it will be a burden if we bundle the
> > > > > > > > > Flink filesystem dependencies in the operator image.
> > > > > > > > > Maybe we could have an *ArtifactFetcher* interface in the
> > > > > > > > > flink-kubernetes-operator. By default, we provide the local and
> > > > > > > > > HTTP implementations, which means we could get the user jars
> > > > > > > > > from local files or HTTP URLs. Flink filesystem support could
> > > > > > > > > be done as a follow-up based on the feedback.
> > > > > > > > >
> > > > > > > > > If users want to use the local implementation, they need to
> > > > > > > > > mount a PV (persistent volume) to the operator first and then
> > > > > > > > > put their jars into the PV.
> > > > > > > > >
> > > > > > > > > # How to talk to session JobManager to submit the job
> > > > > > > > > After more consideration, I also prefer the second approach,
> > > > > > > > > via the REST API /jars/:jarid/run.
> > > > > > > > > If we have strong requirements to support dependency jars and
> > > > > > > > > artifacts, we could try to support this in the upstream
> > > > > > > > > project.
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Yang
> > > > > > > > >
> > > > > > > > > Aitozi <gjying1...@gmail.com> wrote on Wed, Mar 16, 2022 at 16:11:
> > > > > > > > >
> > > > > > > > > > Hi Gyula
> > > > > > > > > >
> > > > > > > > > > Thanks for your quick response. Regarding the different
> > > > > > > > > > filesystem dependencies, I think we can make them optional
> > > > > > > > > > and pluggable, and let users choose them when building their
> > > > > > > > > > operator image. Users can build their image from the base
> > > > > > > > > > operator image and add the filesystem dependencies they want
> > > > > > > > > > to use. BTW, we can support HTTP URIs by default.
> > > > > > > > > >
> > > > > > > > > > Thanks,
> > > > > > > > > > Aitozi.
> > > > > > > > > >
> > > > > > > > > > Gyula Fóra <gyula.f...@gmail.com> wrote on Wed, Mar 16, 2022 at 15:53:
> > > > > > > > > >
> > > > > > > > > > > Thank you Aitozi!
> > > > > > > > > > >
> > > > > > > > > > > I think this will be a very nice (and simple) addition to
> > > > > > > > > > > enable these use-cases.
> > > > > > > > > > >
> > > > > > > > > > > I have 2 comments regarding the proposal:
> > > > > > > > > > >
> > > > > > > > > > > 1. I think if we want to support different filesystems to
> > > > > > > > > > > download jars from, we probably need some clever ways to
> > > > > > > > > > > add external operator dependencies (jars, configs).
> > > > > > > > > > > I would prefer not to bundle them into the base operator
> > > > > > > > > > > image.
> > > > > > > > > > >
> > > > > > > > > > > 2. I think we should avoid creating the jobgraphs on the
> > > > > > > > > > > operator side and use the jar upload/run REST API instead,
> > > > > > > > > > > as you suggested. This will avoid Flink version and
> > > > > > > > > > > dependency conflicts.
> > > > > > > > > > >
> > > > > > > > > > > Cheers,
> > > > > > > > > > > Gyula
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Mar 16, 2022 at 8:41 AM Aitozi <gjying1...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Guys:
> > > > > > > > > > > >
> > > > > > > > > > > > I would like to open a discussion on supporting session
> > > > > > > > > > > > job management in the kubernetes operator. It's intended
> > > > > > > > > > > > to enhance the flink-kubernetes-operator to manage
> > > > > > > > > > > > session jobs with k8s tooling. I have drafted the design
> > > > > > > > > > > > doc[1]. Please refer to it and give me some feedback.
> > > > > > > > > > > >
> > > > > > > > > > > > [1]
> > > > > > > > > > > > https://docs.google.com/document/d/1WPGbur1eT3H_5gN-kyXfp7EDjdbJUURx6jN8nt6UT-s/edit#
> > > > > > > > > > > >
> > > > > > > > > > > > Best,
> > > > > > > > > > > >
> > > > > > > > > > > > Aitozi.
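
For illustration only, the deletion check suggested at the top of this thread (refuse to delete a session FlinkDeployment while jobs still depend on it) could look roughly like the sketch below. It is not the operator's implementation. The `SessionJob` type, the function names, and the set of terminal states are all assumptions made for the example; the only detail taken from the thread is that jobs are matched to a session cluster by namespace and clusterId.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Assumption: job states treated as "done". The operator's real status model
# may use different names or additional states.
TERMINAL_STATES = frozenset({"FINISHED", "CANCELED", "FAILED"})


@dataclass(frozen=True)
class SessionJob:
    """Hypothetical, simplified view of a FlinkSessionJob resource."""
    name: str
    namespace: str
    cluster_id: str
    state: str


def blocking_jobs(namespace: str, cluster_id: str,
                  jobs: Iterable[SessionJob]) -> List[SessionJob]:
    """Return the jobs that still depend on the given session cluster."""
    return [
        job for job in jobs
        if job.namespace == namespace
        and job.cluster_id == cluster_id
        and job.state not in TERMINAL_STATES
    ]


def can_delete_session_deployment(namespace: str, cluster_id: str,
                                  jobs: Iterable[SessionJob]) -> bool:
    """Allow deletion only when no non-terminal job targets the cluster."""
    return not blocking_jobs(namespace, cluster_id, jobs)
```

A validating step could call `can_delete_session_deployment` before the delete is accepted and report the names from `blocking_jobs` in the rejection message, so the user has to cancel or finish those jobs first.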