Re: [DISCUSS] Support the session job management in kubernetes operator

Yang Wang Tue, 22 Mar 2022 03:47:47 -0700

The relationship between the session deployment and the Flink jobs looks
good to me except for the session deployment deletion.


I strongly suggest not setting the ownerReference of the FlinkSessionJob to
the session FlinkDeployment.
Otherwise, it will be a disaster if the session FlinkDeployment is deleted
accidentally while many jobs are running.
We should check that there is no running Flink job before deleting a session
FlinkDeployment. This effectively forces users into a double
confirmation.
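
To make the check above concrete, here is a minimal sketch in Python. It is
not operator code: SessionJob, TERMINAL_STATES, blocking_jobs and
can_delete_session_deployment are hypothetical names, and the set of states
treated as "no longer running" is an assumption. Jobs are matched to the
session cluster by namespace and clusterId, the lookup key mentioned in the
quoted mail below.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class SessionJob:
    """Hypothetical stand-in for a FlinkSessionJob resource."""
    name: str
    namespace: str
    cluster_id: str  # name of the session FlinkDeployment the job targets
    state: str       # e.g. "RUNNING", "FINISHED", "CANCELED"


# Assumption: jobs in these states no longer block deletion.
TERMINAL_STATES = {"FINISHED", "CANCELED", "FAILED"}


def blocking_jobs(namespace: str, cluster_id: str,
                  jobs: Iterable[SessionJob]) -> List[SessionJob]:
    """Return the jobs on this session cluster that are still active."""
    return [
        job for job in jobs
        if job.namespace == namespace
        and job.cluster_id == cluster_id
        and job.state not in TERMINAL_STATES
    ]


def can_delete_session_deployment(namespace: str, cluster_id: str,
                                  jobs: Iterable[SessionJob]) -> bool:
    """Allow deletion only when no active job targets the session cluster."""
    return not blocking_jobs(namespace, cluster_id, jobs)
```

With a guard like this, a delete request for a session FlinkDeployment would
be rejected until the user has cancelled or finished every job on it, which
is the double confirmation I mean.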

Best,
Yang


Aitozi <gjying1...@gmail.com> wrote on Tue, Mar 22, 2022 at 17:49:

> Hi Thomas:
>
>     Thanks for your valuable question. Let’s clarify the relationship between
> the session deployment and the jobs.
>
> IMO, the session deployment and jobs interact in these situations:
>
> - Create the session job. The FlinkSessionJobController waits for the
> session cluster to be ready and then submits the job. The lookup key is the
> namespace and clusterId.
>
> - Delete the session job. This cancels the current session job.
>
> - Delete the session deployment. The session jobs have to be deleted
> first. We could set the ownerReference of the FlinkSessionJob so that
> Kubernetes triggers the cleanup of the session jobs before removing the
> session deployment.
>
> - Upgrade the session deployment. This is the critical part, because it
> affects all the session jobs. We should suspend the jobs first and then
> upgrade the session cluster. So I tend to validate that all the jobs are
> suspended before performing the session cluster upgrade. After the upgrade,
> the session jobs are switched back to running manually.
>
> What do you think about this? If there is no objection, I will clarify it
> in the FLIP doc.
>
>
> Besides, sorry for the rough vote and discussion process. It's my first
> time driving this, and I will keep that in mind next time :)
> Best,
> Aitozi.
>
> Yang Wang <danrtsey...@gmail.com> wrote on Tue, Mar 22, 2022 at 10:11:
>
> > I think the session cluster should not be deleted until all the running
> > jobs have finished or been cancelled. I agree this should be clarified in
> > the FLIP.
> >
> > Best,
> > Yang
> >
> > Thomas Weise <t...@apache.org> wrote on Tue, Mar 22, 2022 at 09:26:
> >
> > > Hi Aitozi,
> > >
> > > Thanks for the proposal. Can you please clarify in the FLIP the
> > > relationship between the session deployment and the jobs that depend on it?
> > > Will, for example, the operator ensure that the individual jobs are
> > > deleted when the underlying cluster is deleted?
> > >
> > > Side note: The discussion thread started 5 days ago, the FLIP vote
> > > was started 2 days later, and a weekend fell in between, so this is
> > > probably on the short side for broader feedback.
> > >
> > > Thanks,
> > > Thomas
> > >
> > >
> > > On Fri, Mar 18, 2022 at 4:01 AM Yang Wang <danrtsey...@gmail.com> wrote:
> > >
> > > > Great work. Since we are introducing a new public API, it deserves a FLIP.
> > > > The FLIP will also help later contributors catch up quickly.
> > > >
> > > > Best,
> > > > Yang
> > > >
> > > > Gyula Fóra <gyula.f...@gmail.com> wrote on Fri, Mar 18, 2022 at 18:11:
> > > >
> > > > > Thanks Aitozi, a FLIP might be overkill at this point, but there is no
> > > > > harm in voting on it anyway :)
> > > > >
> > > > > Looks good!
> > > > >
> > > > > Gyula
> > > > >
> > > > > On Fri, Mar 18, 2022 at 10:25 AM Aitozi <gjying1...@gmail.com> wrote:
> > > > >
> > > > > > Hi Guys:
> > > > > >
> > > > > >     FYI, I have integrated your comments and drafted FLIP-215 [1]. I will
> > > > > > create another thread to vote on it.
> > > > > >
> > > > > > [1]
> > > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-215%3A+Introduce+FlinkSessionJob+CRD+in+the+kubernetes+operator
> > > > > >
> > > > > > Best,
> > > > > >
> > > > > > Aitozi.
> > > > > >
> > > > > >
> > > > > > Aitozi <gjying1...@gmail.com> wrote on Thu, Mar 17, 2022 at 11:16:
> > > > > >
> > > > > > > Hi Biao Geng:
> > > > > > >
> > > > > > >    Thanks for your feedback, I'm +1 to go with option#2. It's a good
> > > > > > > point that we should improve error message debugging for the session
> > > > > > > job. I think it can be follow-up work, as an improvement after we
> > > > > > > support the session job operation.
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > Best,
> > > > > > >
> > > > > > > Aitozi.
> > > > > > >
> > > > > > >
> > > > > > > Geng Biao <biaoge...@gmail.com> wrote on Thu, Mar 17, 2022 at 10:55:
> > > > > > >
> > > > > > >> Thanks Aitozi for the work!
> > > > > > >>
> > > > > > >> I also lean towards option#2 of using JarRunHeaders with an uber job
> > > > > > >> jar. As Yang said, user-defined dependencies may be better supported
> > > > > > >> in upstream Flink.
> > > > > > >> A follow-up thought: I think we should care about the potential
> > > > > > >> influence on user experience. As the job graph is generated in the JM,
> > > > > > >> when the generation fails due to some issue in the main() method, we
> > > > > > >> should do some work on showing such error messages, either in this
> > > > > > >> proposal or in the later k8s operator implementation. The reason for
> > > > > > >> raising this is that if users submit many jobs to the same session
> > > > > > >> cluster, it may not be easy for them to find the relevant error logs
> > > > > > >> about the main() method of a specific job. FLINK-25715 could help us
> > > > > > >> later.
> > > > > > >>
> > > > > > >>
> > > > > > >> Best,
> > > > > > >> Biao Geng
> > > > > > >>
> > > > > > >>
> > > > > > >> From: Aitozi <gjying1...@gmail.com>
> > > > > > >> Date: Wednesday, March 16, 2022, 5:19 PM
> > > > > > >> To: dev@flink.apache.org <dev@flink.apache.org>
> > > > > > >> Subject: Re: [DISCUSS] Support the session job management in kubernetes operator
> > > > > > >> Hi Yang Wang
> > > > > > >>     Thanks for your feedback. Providing the local and HTTP implementations
> > > > > > >> for the first version makes sense to me.
> > > > > > >> +1 for it.
> > > > > > >>
> > > > > > >> Best,
> > > > > > >> Aitozi
> > > > > > >>
> > > > > > >> Yang Wang <danrtsey...@gmail.com> wrote on Wed, Mar 16, 2022 at 16:44:
> > > > > > >>
> > > > > > >> > # How to download the user jars
> > > > > > >> > I agree with Gyula that it will be a burden if we bundle the Flink
> > > > > > >> > filesystem dependencies in the operator image.
> > > > > > >> > Maybe we could have an *ArtifactFetcher* interface in the
> > > > > > >> > flink-kubernetes-operator. By default, we provide the local and HTTP
> > > > > > >> > implementations, which means we could get the user jars from local
> > > > > > >> > files or HTTP URLs. Flink filesystem support could be done as a
> > > > > > >> > follow-up based on the feedback.
> > > > > > >> >
> > > > > > >> > If users want to use the local implementation, they need to mount a
> > > > > > >> > PV (PersistentVolume) to the operator first and then put their jars
> > > > > > >> > into the PV.
> > > > > > >> >
> > > > > > >> > # How to talk to session JobManager to submit the job
> > > > > > >> > After more consideration, I also prefer the second approach, via the
> > > > > > >> > REST API /jars/:jarid/run. If we have strong requirements to support
> > > > > > >> > dependency jars and artifacts, we could try to support this in the
> > > > > > >> > upstream project.
> > > > > > >> >
> > > > > > >> > Best,
> > > > > > >> > Yang
> > > > > > >> >
> > > > > > >> >
> > > > > > >> > Aitozi <gjying1...@gmail.com> wrote on Wed, Mar 16, 2022 at 16:11:
> > > > > > >> >
> > > > > > >> > > Hi Gyula
> > > > > > >> > >     Thanks for your quick response. Regarding the dependencies on
> > > > > > >> > > different filesystems, I think we can make them optional and
> > > > > > >> > > pluggable, and let users choose them when building their operator
> > > > > > >> > > image. Users can build their image from the base operator image and
> > > > > > >> > > add the filesystem dependencies they want to use. BTW, we can
> > > > > > >> > > support HTTP URIs by default.
> > > > > > >> > >
> > > > > > >> > > Thanks,
> > > > > > >> > > Aitozi.
> > > > > > >> > >
> > > > > > >> > > Gyula Fóra <gyula.f...@gmail.com> wrote on Wed, Mar 16, 2022 at 15:53:
> > > > > > >> > >
> > > > > > >> > > > Thank you Aitozi!
> > > > > > >> > > >
> > > > > > >> > > > I think this will be a very nice (and simple) addition to enable
> > > > > > >> > > > these use-cases.
> > > > > > >> > > >
> > > > > > >> > > > I have 2 comments regarding the proposal:
> > > > > > >> > > >
> > > > > > >> > > > 1. I think if we want to support different filesystems to download
> > > > > > >> > > > jars from, we probably need some clever ways to add external
> > > > > > >> > > > operator dependencies (jars, configs).
> > > > > > >> > > > I would prefer not to bundle them into the base operator image.
> > > > > > >> > > >
> > > > > > >> > > > 2. I think we should avoid creating the job graphs on the operator
> > > > > > >> > > > side and use the jar upload/run REST API instead, as you suggested.
> > > > > > >> > > > This will avoid Flink version and dependency conflicts.
> > > > > > >> > > >
> > > > > > >> > > > Cheers,
> > > > > > >> > > > Gyula
> > > > > > >> > > >
> > > > > > >> > > > On Wed, Mar 16, 2022 at 8:41 AM Aitozi <gjying1...@gmail.com> wrote:
> > > > > > >> > > >
> > > > > > >> > > > > Hi Guys:
> > > > > > >> > > > >
> > > > > > >> > > > >     I would like to open a discussion on supporting session job
> > > > > > >> > > > > management in the Kubernetes operator. It is intended to enhance
> > > > > > >> > > > > the flink-kubernetes-operator to manage session jobs with k8s
> > > > > > >> > > > > tooling. I have drafted the design doc [1]. Please refer to it
> > > > > > >> > > > > and give me some feedback.
> > > > > > >> > > > >
> > > > > > >> > > > >
> > > > > > >> > > > > [1]
> > > > > > >> > > > > https://docs.google.com/document/d/1WPGbur1eT3H_5gN-kyXfp7EDjdbJUURx6jN8nt6UT-s/edit
> > > > > > >> > > > >
> > > > > > >> > > > > Best,
> > > > > > >> > > > >
> > > > > > >> > > > > Aitozi.
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
