Re: [DISCUSS] FLIP-212: Introduce Flink Kubernetes Operator

Thomas Weise Fri, 04 Feb 2022 17:41:36 -0800

Hi,

Thanks for the continued feedback and discussion. It looks like we are
ready to start a VOTE; I will initiate it shortly.


In parallel it would be good to agree on the repository name.

My suggestion would be: flink-kubernetes-operator

I thought "flink-operator" could be a bit misleading since the term
operator already has a meaning in Flink.

I also considered "flink-k8s-operator" but that would be almost
identical to the names of existing operator implementations and could
lead to confusion in the future.

Thoughts?

Thanks,
Thomas



On Fri, Feb 4, 2022 at 5:15 AM Gyula Fóra <[email protected]> wrote:
>
> Hi Danny,
>
> So far we have been focusing our dev efforts on the initial native
> implementation with the team.
> If the discussion and vote go well for this FLIP we are looking forward
> to contributing the initial version sometime next week (fingers crossed).
>
> At that point I think we can already start the dev work to support the
> standalone mode as well, especially if you can dedicate some effort to
> pushing that side.
> Working together on this sounds like a great idea and we should start as
> soon as possible! :)
>
> Cheers,
> Gyula
>
> On Fri, Feb 4, 2022 at 2:07 PM Danny Cranmer <[email protected]>
> wrote:
>
> > I have been discussing this one with my team. We are interested in the
> > Standalone mode, and are willing to contribute towards the implementation.
> > Potentially we can work together to support both modes in parallel?
> >
> > Thanks,
> >
> > On Wed, Feb 2, 2022 at 4:02 PM Gyula Fóra <[email protected]> wrote:
> >
> > > Hi Danny!
> > >
> > > Thanks for the feedback :)
> > >
> > > Versioning:
> > > Versioning will be independent from Flink, and each operator version will
> > > depend on a fixed Flink version.
> > > This should be the exact same setup as with Stateful Functions (
> > > https://github.com/apache/flink-statefun): an independent release cycle,
> > > but still within the Flink umbrella.
> > >
> > > Deployment error handling:
> > > I think that's a very good point, as general exception handling for the
> > > different failure scenarios is a tricky problem. I think the exception
> > > classifiers and retry strategies could avoid a lot of manual intervention
> > > from the user. We will definitely need to add something like this. Once we
> > > have the repo created with the initial operator code we should open some
> > > tickets for this and put it on the short term roadmap!
> > >
> > > Cheers,
> > > Gyula
> > >
> > > On Wed, Feb 2, 2022 at 4:50 PM Danny Cranmer <[email protected]>
> > > wrote:
> > >
> > > > Hey team,
> > > >
> > > > Great work on the FLIP, I am looking forward to this one. I agree that we
> > > > can move forward to the voting stage.
> > > >
> > > > I have general feedback around how we will handle job submission failure
> > > > and retry. As discussed in the Rejected Alternatives section, we can use
> > > > Java to handle job submission failures from the Flink client. It would be
> > > > useful to have the ability to configure exception classifiers and retry
> > > > strategy as part of operator configuration.
> > > >
> > > > Given this will be in a separate GitHub repository I am curious how the
> > > > versioning strategy will work in relation to the Flink version? Do we have
> > > > any other components with a similar setup I can look at? Will the operator
> > > > version track Flink or will it use its own versioning strategy with a Flink
> > > > version support matrix, or similar?
> > > >
> > > > Thanks,
> > > >
> > > >
> > > >
> > > > On Tue, Feb 1, 2022 at 2:33 PM Márton Balassi <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi team,
> > > > >
> > > > > Thank you for the great feedback, Thomas has updated the FLIP page
> > > > > accordingly. If you are comfortable with the currently existing design and
> > > > > depth in the FLIP [1] I suggest moving forward to the voting stage - once
> > > > > that reaches a positive conclusion it lets us create the separate code
> > > > > repository under the Flink project for the operator.
> > > > >
> > > > > I encourage everyone to keep improving the details in the meantime, however
> > > > > I believe given the existing design and the general sentiment on this
> > > > > thread that the most efficient path from here is starting the
> > > > > implementation so that we can collectively iterate over it.
> > > > >
> > > > > [1]
> > > > > https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
> > > > >
> > > > > On Mon, Jan 31, 2022 at 10:15 PM Thomas Weise <[email protected]> wrote:
> > > > >
> > > > > > Hi Xintong,
> > > > > >
> > > > > > Thanks for the feedback and please see responses below -->
> > > > > >
> > > > > > On Fri, Jan 28, 2022 at 12:21 AM Xintong Song <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > > > Thanks Thomas for drafting this FLIP, and everyone for the discussion.
> > > > > > >
> > > > > > > I also have a few questions and comments.
> > > > > > >
> > > > > > > ## Job Submission
> > > > > > > Deploying a Flink session cluster via kubectl & CR and then submitting jobs
> > > > > > > to the cluster via Flink cli / REST is probably the approach that requires
> > > > > > > the least effort. However, I'd like to point out 2 weaknesses.
> > > > > > > 1. A lot of users use Flink in per-job/application modes. For these users,
> > > > > > > having to run the job in two steps (deploy the cluster, and submit the job)
> > > > > > > is not that convenient.
> > > > > > > 2. One of our motivations is being able to manage Flink applications'
> > > > > > > lifecycles with kubectl. Submitting jobs from the cli does not sound aligned
> > > > > > > with this motivation.
> > > > > > > I think it's probably worth it to support submitting jobs via kubectl & CR
> > > > > > > in the first version, both together with deploying the cluster like in
> > > > > > > per-job/application mode and after deploying the cluster like in session
> > > > > > > mode.
> > > > > > >
> > > > > >
> > > > > > The intention is to support application management through operator and CR,
> > > > > > which means there won't be any 2 step submission process, which as you
> > > > > > allude to would defeat the purpose of this project. The CR example shows
> > > > > > the application part. Please note that the bare cluster support is an
> > > > > > *additional* feature for scenarios that require external job management. Is
> > > > > > there anything on the FLIP page that creates a different impression?
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > ## Versioning
> > > > > > > Which Flink versions does the operator plan to support?
> > > > > > > 1. Native K8s deployment was first introduced in Flink 1.10
> > > > > > > 2. Native K8s HA was introduced in Flink 1.12
> > > > > > > 3. The Pod template support was introduced in Flink 1.13
> > > > > > > 4. There were some changes to the Flink docker image entrypoint script in,
> > > > > > > IIRC, Flink 1.13
> > > > > > >
> > > > > >
> > > > > > Great, thanks for providing this. It is important for the compatibility
> > > > > > going forward also. We are targeting Flink 1.14.x upwards. Before the
> > > > > > operator is ready there will be another Flink release. Let's see if anyone
> > > > > > is interested in earlier versions?
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > ## Compatibility
> > > > > > > What kind of API compatibility can we commit to? It's probably fine to have
> > > > > > > alpha / beta version APIs that allow incompatible future changes for the
> > > > > > > first version. But eventually we would need to guarantee backwards
> > > > > > > compatibility, so that an early version CR can work with a new version
> > > > > > > operator.
> > > > > > >
> > > > > >
> > > > > > Another great point and please let me include that on the FLIP page. ;-)
> > > > > >
> > > > > > I think we should allow incompatible changes for the first one or two
> > > > > > versions, similar to how other major features have evolved recently, such
> > > > > > as FLIP-27.
> > > > > >
> > > > > > Would be great to get broader feedback on this one.
> > > > > >
> > > > > > Cheers,
> > > > > > Thomas
> > > > > >
> > > > > >
> > > > > >
> > > > > > >
> > > > > > > Thank you~
> > > > > > >
> > > > > > > Xintong Song
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > On Fri, Jan 28, 2022 at 1:18 PM Thomas Weise <[email protected]> wrote:
> > > > > > >
> > > > > > > > Thanks for the feedback!
> > > > > > > >
> > > > > > > > >
> > > > > > > > > # 1 Flink Native vs Standalone integration
> > > > > > > > > Maybe we should make this more clear in the FLIP but we agreed to do the
> > > > > > > > > first version of the operator based on the native integration.
> > > > > > > > > While this clearly does not cover all use-cases and requirements, it seems
> > > > > > > > > this would lead to a much smaller initial effort and a nicer first version.
> > > > > > > > >
> > > > > > > >
> > > > > > > > I'm also leaning towards the native integration, as long as it reduces the
> > > > > > > > MVP effort. Ultimately the operator will need to also support the
> > > > > > > > standalone mode. I would like to gain more confidence that native
> > > > > > > > integration reduces the effort. While it cuts the effort to handle the TM
> > > > > > > > pod creation, some mapping code from the CR to the native integration
> > > > > > > > client and config needs to be created. As mentioned in the FLIP, native
> > > > > > > > integration requires the Flink job manager to have access to the k8s API to
> > > > > > > > create pods, which in some scenarios may be seen as unfavorable.
> > > > > > > >
> > > > > > > > > > > > # Pod Template
> > > > > > > > > > > > Is the pod template in the CR the same as what Flink already
> > > > > > > > > > > > supports [4]? If so, I am afraid that not every arbitrary field
> > > > > > > > > > > > (e.g. cpu/memory resources) could take effect.
> > > > > > > >
> > > > > > > > Yes, pod template would look almost identical. There are a few settings
> > > > > > > > that the operator will control (and that may need to be blacklisted), but
> > > > > > > > in general we would not want to place restrictions. I think a mechanism
> > > > > > > > where a pod template is merged from multiple layers would also be
> > > > > > > > interesting to make this more flexible.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Thomas
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
