Re: [DISCUSS] FLIP-250: Support Customized Kubernetes Schedulers Proposal

2022-07-14 Thread Yikun Jiang
> And maybe we also could ping Yikun Jiang who has done similar things in
> Spark.

Thanks for the ping, @wangyang. Yes, I was involved in Spark's customized
scheduler support work, as the main implementer.

For customized scheduler support, I can share the schedulers' requirements
here:

1. Help scheduler to *specify* the scheduler name

2. Help the scheduler to create the *scheduler-related label/annotation/CRD*,
such as
- Yunikorn needs labels/annotations
  <https://yunikorn.apache.org/docs/user_guide/labels_and_annotations_in_yunikorn/>
  (maybe a task group CRD in the future, or not)
- Volcano needs annotations and a CRD <https://volcano.sh/en/docs/podgroup/>
- Kube-batch needs annotations/CRD
  <https://github.com/kubernetes-sigs/kube-batch/tree/master/config/crds>
- Kueue needs annotation support
  <https://github.com/kubernetes-sigs/kueue/blob/888cedb6e62c315e008916086308a893cd21dd66/config/samples/sample-job.yaml#L6>
  and a cluster-level CRD

3. Help the scheduler to create the scheduler metadata/CRD at the *right time*.
For example, if users want to avoid pod max pending, the CRD required by the
scheduler must be created before the pods are created.
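
To make the three requirements concrete, here is a minimal sketch in Python
(chosen only for brevity; it is not Flink or Spark code). The class and
function names are hypothetical. The Volcano values used here (scheduler name
`volcano`, the `scheduling.k8s.io/group-name` annotation, and a `PodGroup` in
`scheduling.volcano.sh/v1beta1`) are assumptions for illustration and should be
checked against the Volcano docs linked above.

```python
from copy import deepcopy


class VolcanoLikeDecorator:
    """Hypothetical decorator for illustration; not a Flink or Spark class."""

    # Requirement 1: the scheduler name to put on every pod (assumed value).
    scheduler_name = "volcano"

    def __init__(self, group_name, min_member):
        self.group_name = group_name
        self.min_member = min_member

    def pre_create_resources(self):
        # Requirement 3: objects that must exist before any pod is created.
        return [{
            "apiVersion": "scheduling.volcano.sh/v1beta1",  # assumed version
            "kind": "PodGroup",
            "metadata": {"name": self.group_name},
            "spec": {"minMember": self.min_member},
        }]

    def decorate_pod(self, pod):
        # Work on a copy so the caller's pod spec is left untouched.
        pod = deepcopy(pod)
        pod.setdefault("spec", {})["schedulerName"] = self.scheduler_name
        # Requirement 2: scheduler-related annotation linking pod and group.
        annotations = pod.setdefault("metadata", {}).setdefault("annotations", {})
        annotations["scheduling.k8s.io/group-name"] = self.group_name
        return pod


def build_submission(decorator, pods):
    """Return manifests in creation order: scheduler objects first, then pods."""
    return decorator.pre_create_resources() + [
        decorator.decorate_pod(p) for p in pods
    ]


if __name__ == "__main__":
    decorator = VolcanoLikeDecorator("job-1", min_member=2)
    pods = [{"metadata": {"name": "jobmanager"}}, {"metadata": {"name": "taskmanager"}}]
    for manifest in build_submission(decorator, pods):
        print(manifest.get("kind", "Pod"), manifest["metadata"]["name"])
```

The only design point that matters here is ordering: the scheduler objects are
returned ahead of the pods, so whoever submits the manifests creates the CRD
instance first.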

For complex requirements, Spark uses feature steps (Flink decorators look very
similar to them). For simple requirements, users can just use configuration or
a Pod Template [1].

[1] https://spark.apache.org/docs/latest/running-on-kubernetes.html#customized-kubernetes-schedulers-for-spark-on-kubernetes

From the FLIP, I can see the above requirements are covered.

BTW, I think the existing and newly added interfaces of the Flink decorators
already cover all requirements of Kubernetes, so I personally think the
K8s-related scheduler requirements can also be well covered by them.

Regards,
Yikun


On Thu, Jul 14, 2022 at 5:11 PM Yang Wang  wrote:

> I think we could go over the customized scheduler plugin mechanism again
> with YuniKorn to make sure that it is common enough.
> But the implementation could be deferred.
>
> And maybe we also could ping Yikun Jiang who has done similar things in
> Spark.
>
> For the e2e tests, I admit that they could be improved. But I am not sure
> whether we really need a Java implementation instead.
> This is out of the scope of this FLIP, so let's keep the discussion
> under FLINK-20392.
>
>
> Best,
> Yang
>
> On Thu, Jul 14, 2022 at 15:28, Martijn Visser wrote:
>
> > Hi Bo,
> >
> > Thanks for the info! I think I see that you've already updated the FLIP to
> > reflect how customized schedulers are beneficial for both batch and
> > streaming jobs.
> >
> > The reason why I'm not too happy that we would only create a reference
> > implementation for Volcano is that we don't know if the generic support for
> > customized scheduler plugins will also work for others. We think it will,
> > but since there would be no other implementation available, we are not
> > sure. My concern is that when someone tries to add support for another
> > scheduler, we notice that we actually made a mistake or should improve the
> > generic support.
> >
> > Best regards,
> >
> > Martijn
> >
> >
> >
> > On Thu, Jul 14, 2022 at 05:30, bo zhaobo <bzhaojyathousa...@gmail.com> wrote:
> >
> > > Hi Martijn,
> > >
> > > Thank you for your comments. I will answer the questions one by one.
> > >
> > > ""
> > > * Regarding the motivation, it mentions that the development trend is that
> > > Flink supports both batch and stream processing. I think the vision and
> > > trend is that we have unified batch- and stream processing. What I'm
> > > missing is the vision on what's the impact for customized Kubernetes
> > > schedulers on stream processing. Could there be some elaboration on that?
> > > ""
> > >
> > > >>
> > >
> > > We very much agree with you and with the dev trend that Flink supports both
> > > batch and stream processing. Actually, using a customized K8s scheduler
> > > is beneficial for streaming scenarios too, for instance to avoid resource
> > > deadlock and similar problems. For example, suppose the remaining
> > > resources in the K8s cluster are only enough to run one job, but we
> > > submit two. With the default K8s scheduler, both jobs request resources
> > > at the same time, block each other, and hang. In this case, the
> > > customized scheduler Volcano won't schedule the overcommitted pods if
> > > the idle resources cannot fit all of the pods that follow. So the
> > > benefits mentioned in the FLIP are not only
Re: [DISCUSS] ARM support for Flink

2019-08-01 Thread Yikun Jiang
@Chesnay @Stephan Thanks for the suggestions and help. I have opened a JIRA
ticket [1].

If you have any other questions, feel free to ping us.

[1]  https://issues.apache.org/jira/browse/INFRA-18822

Regards,
Yikun

Jiang Yikun(Kero)
Mail: yikunk...@gmail.com


On Thu, Aug 1, 2019 at 4:41 PM, Stephan Ewen wrote:

> Asking INFRA to add support means filing a JIRA ticket.
>
> That works the same way as filing a FLINK Jira ticket, but selecting INFRA
> as the project to file the ticket for.
>
> On Thu, Aug 1, 2019 at 4:17 AM Xiyuan Wang wrote:
>
> > Thanks for your reply.
> >
> > We are still investigating and debugging Flink on ARM. It's hard for
> > us to say how many kinds of tests are enough for ARM support at this
> > moment, but `core` and `test` are necessary, of course, I think. What we
> > do now is follow travis-ci and add all the modules that travis-ci contains.
> >
> > During our local tests, just a few tests failed [1]. We have
> > solutions for some of them; others are still being debugged. Ideas from
> > the Flink team are welcome. And many thanks for your Jira issue [2]; we
> > will keep updating it.
> >
> > It would be great if the Infra team could add the OpenLab App [3] (or
> > another CI, if Flink chooses one) to the Flink repo. I'm not clear on how
> > to talk with the Infra team: should the Flink team start the discussion,
> > or should I send a mail to the Infra mailing list? I need your help.
> >
> > Then, once the app is added, perhaps we can add the `core` and `test` jobs
> > as the first step, make them run stably and successfully, and then add
> > more modules if needed.
> >
> > [1]: https://etherpad.net/p/flink_arm64_support
> > [2]: https://issues.apache.org/jira/browse/FLINK-13448
> > [3]: https://github.com/apps/theopenlab-ci
> >
> > Regards
> > wangxiyuan
> >
> > On Wed, Jul 31, 2019 at 9:46 PM, Stephan Ewen wrote:
> >
> > > Wow, that is pretty nice work, thanks a lot!
> > >
> > > We need some support from Apache Infra to see if we can connect the Flink
> > > Github Repo with the OpenLab CI.
> > > We would also need a discussion on the developer mailing list, to get
> > > community agreement.
> > >
> > > Have you looked at whether we need to run all tests with ARM, or whether
> > > maybe only the "core" and "tests" profile would be enough to get confidence
> > > that Flink runs on ARM?
> > > Just asking because Flink has a lot of long running tests by now that can
> > > easily eat up a lot of CI capacity.
> > >
> > > Best,
> > > Stephan
> > >
> > >
> > >
> > > On Tue, Jul 30, 2019 at 3:45 AM Xiyuan Wang wrote:
> > >
> > > > Hi Stephan,
> > > >   Maybe I misled you in the previous email. We don't need to migrate CI
> > > > completely; travis-ci is still there, working for the x86 arch. What we
> > > > need to do is add another CI tool for the ARM arch.
> > > >
> > > >   There are some ways to do it. As I wrote on
> > > > https://issues.apache.org/jira/browse/FLINK-13199 to @Chesnay:
> > > >
> > > > 1. Add the OpenLab CI system for ARM arch tests. OpenLab is very similar
> > > > to travis-ci. What Flink needs to do is add the OpenLab GitHub app to
> > > > the repo, then add the job definition files inside the Flink repo. Here
> > > > is a POC by me: https://github.com/theopenlab/flink/pull/1
> > > > 2. OpenLab will donate ARM resources to the Apache Infra team as well.
> > > > Then Flink can use the official Apache Jenkins system for Flink ARM
> > > > tests in the future. https://builds.apache.org/
> > > > 3. Use Drone CI, which supports the ARM arch as well. https://drone.io/
> > > >
> > > > Since I'm from the OpenLab community, if Flink chooses OpenLab CI, my
> > > > OpenLab colleagues and I can keep helping with and maintaining the ARM
> > > > CI job. If Flink chooses the 2nd way, the CI maintenance work may be
> > > > handled by the apache-infra team, I guess. If Flink chooses the 3rd
> > > > way, Drone CI, the help we can offer is very limited. AFAIK, Drone
> > > > uses containers for CI tests, which may not satisfy some requirements,
> > > > while OpenLab uses VMs for tests.
> > > >
> > > > We need the Flink core team's decision and reply.
> > > >
> > > > Thanks.
>