Hey team,

Great work on the FLIP - I am looking forward to this one. I agree that we
can move forward to the voting stage.

I have some general feedback on how we will handle job submission failures
and retries. As discussed in the Rejected Alternatives section, we can
handle job submission failures from the Flink client in Java. It would be
useful to be able to configure exception classifiers and retry strategies
as part of the operator configuration.
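
To make that concrete, here is a rough Java sketch of the kind of
classifier and retry hooks I have in mind; all names below are made up for
illustration and nothing here is meant to prescribe the actual API:

import java.time.Duration;
import java.util.concurrent.TimeoutException;

/** Illustrative sketch only; none of these names are part of the FLIP. */
public class SubmissionRetrySketch {

    /** Decides whether a given submission failure is worth retrying. */
    interface ExceptionClassifier {
        boolean isRetryable(Throwable t);
    }

    /** Fixed-delay retry policy with an attempt cap. */
    static final class RetryPolicy {
        final int maxAttempts;
        final Duration delay;

        RetryPolicy(int maxAttempts, Duration delay) {
            this.maxAttempts = maxAttempts;
            this.delay = delay;
        }
    }

    /** Runs the submission action, retrying per classifier and policy. */
    static void submitWithRetry(Runnable submit, ExceptionClassifier classifier,
            RetryPolicy policy) throws InterruptedException {
        for (int attempt = 1; ; attempt++) {
            try {
                submit.run();
                return;
            } catch (RuntimeException e) {
                // Give up on non-retryable errors or once the attempt cap is reached.
                if (attempt >= policy.maxAttempts || !classifier.isRetryable(e)) {
                    throw e;
                }
                Thread.sleep(policy.delay.toMillis());
            }
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Hypothetical classifier: treat timeouts as transient, everything else as fatal.
        ExceptionClassifier classifier =
                t -> t instanceof TimeoutException || t.getCause() instanceof TimeoutException;

        submitWithRetry(
                () -> System.out.println("submitting job via the Flink client..."),
                classifier,
                new RetryPolicy(3, Duration.ofSeconds(10)));
    }
}

On the CR/config side this could then surface as a few simple fields
(retryable exception classes, max attempts, delay) rather than
user-supplied code, but that is just one option.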

Given that this will be in a separate GitHub repository, I am curious how
the versioning strategy will work in relation to the Flink version. Do we
have any other components with a similar setup I can look at? Will the
operator version track Flink, or will it use its own versioning scheme with
a Flink version support matrix, or similar?

Thanks,



On Tue, Feb 1, 2022 at 2:33 PM Márton Balassi <balassi.mar...@gmail.com>
wrote:

> Hi team,
>
> Thank you for the great feedback; Thomas has updated the FLIP page
> accordingly. If you are comfortable with the current design and level of
> detail in the FLIP [1], I suggest moving forward to the voting stage -
> once that reaches a positive conclusion, it lets us create the separate
> code repository under the Flink project for the operator.
>
> I encourage everyone to keep improving the details in the meantime;
> however, given the existing design and the general sentiment on this
> thread, I believe the most efficient path from here is to start the
> implementation so that we can collectively iterate on it.
>
> [1]
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-212%3A+Introduce+Flink+Kubernetes+Operator
>
> On Mon, Jan 31, 2022 at 10:15 PM Thomas Weise <t...@apache.org> wrote:
>
> > Hi Xintong,
> >
> > Thanks for the feedback and please see responses below -->
> >
> > On Fri, Jan 28, 2022 at 12:21 AM Xintong Song <tonysong...@gmail.com>
> > wrote:
> >
> > > Thanks Thomas for drafting this FLIP, and everyone for the discussion.
> > >
> > > I also have a few questions and comments.
> > >
> > > ## Job Submission
> > > Deploying a Flink session cluster via kubectl & CR and then submitting
> > > jobs to the cluster via the Flink CLI / REST is probably the approach
> > > that requires the least effort. However, I'd like to point out 2
> > > weaknesses.
> > > 1. A lot of users use Flink in per-job/application modes. For these
> > > users, having to run the job in two steps (deploy the cluster, and
> > > submit the job) is not that convenient.
> > > 2. One of our motivations is being able to manage Flink applications'
> > > lifecycles with kubectl. Submitting jobs from the CLI does not sound
> > > aligned with this motivation.
> > > I think it's probably worth it to support submitting jobs via kubectl &
> > > CR in the first version, both together with deploying the cluster like
> > > in per-job/application mode and after deploying the cluster like in
> > > session mode.
> > >
> >
> > The intention is to support application management through the operator
> > and CR, which means there won't be any two-step submission process,
> > which, as you allude to, would defeat the purpose of this project. The
> > CR example shows the application part. Please note that the bare cluster
> > support is an *additional* feature for scenarios that require external
> > job management. Is there anything on the FLIP page that creates a
> > different impression?
> >
> >
> > >
> > > ## Versioning
> > > Which Flink versions does the operator plan to support?
> > > 1. Native K8s deployment was first introduced in Flink 1.10
> > > 2. Native K8s HA was introduced in Flink 1.12
> > > 3. The Pod template support was introduced in Flink 1.13
> > > 4. There were some changes to the Flink docker image entrypoint script
> > > in, IIRC, Flink 1.13
> > >
> >
> > Great, thanks for providing this. It is also important for compatibility
> > going forward. We are targeting Flink 1.14.x upwards. Before the operator
> > is ready there will be another Flink release. Let's see if anyone is
> > interested in earlier versions?
> >
> >
> > >
> > > ## Compatibility
> > > What kind of API compatibility can we commit to? It's probably fine to
> > > have alpha / beta version APIs that allow incompatible future changes
> > > for the first version. But eventually we would need to guarantee
> > > backwards compatibility, so that an early version CR can work with a
> > > new version operator.
> > >
> >
> > Another great point and please let me include that on the FLIP page. ;-)
> >
> > I think we should allow incompatible changes for the first one or two
> > versions, similar to how other major features have evolved recently, such
> > as FLIP-27.
> >
> > Would be great to get broader feedback on this one.
> >
> > Cheers,
> > Thomas
> >
> >
> >
> > >
> > > Thank you~
> > >
> > > Xintong Song
> > >
> > >
> > >
> > > On Fri, Jan 28, 2022 at 1:18 PM Thomas Weise <t...@apache.org> wrote:
> > >
> > > > Thanks for the feedback!
> > > >
> > > > >
> > > > > # 1 Flink Native vs Standalone integration
> > > > > Maybe we should make this more clear in the FLIP but we agreed to
> > > > > do the first version of the operator based on the native
> > > > > integration.
> > > > > While this clearly does not cover all use-cases and requirements,
> > > > > it seems this would lead to a much smaller initial effort and a
> > > > > nicer first version.
> > > > >
> > > >
> > > > I'm also leaning towards the native integration, as long as it
> > > > reduces the MVP effort. Ultimately the operator will need to also
> > > > support the standalone mode. I would like to gain more confidence
> > > > that native integration reduces the effort. While it cuts the effort
> > > > to handle the TM pod creation, some mapping code from the CR to the
> > > > native integration client and config needs to be created. As
> > > > mentioned in the FLIP, native integration requires the Flink job
> > > > manager to have access to the k8s API to create pods, which in some
> > > > scenarios may be seen as unfavorable.
> > > >
> > > > > > > # Pod Template
> > > > > > > Is the pod template in the CR the same as what Flink already
> > > > > > > supports[4]? Then I am afraid that not every arbitrary field
> > > > > > > (e.g. cpu/memory resources) could take effect.
> > > >
> > > > Yes, the pod template would look almost identical. There are a few
> > > > settings that the operator will control (and that may need to be
> > > > blacklisted), but in general we would not want to place restrictions.
> > > > I think a mechanism where a pod template is merged from multiple
> > > > layers would also be interesting to make this more flexible.
> > > >
> > > > Cheers,
> > > > Thomas
> > > >
> > >
> >
>
