Re: YuniKorn for Streaming Use Cases

Chenya Zhang Wed, 05 Jan 2022 19:16:04 -0800

** Corrections: Apache YuniKorn meetup :) **

On Wed, Jan 5, 2022 at 4:56 PM Chenya Zhang <chenyazhangche...@gmail.com>
wrote:


> Hi Weiwei, thanks for sharing your past experience! This is a helpful
> discussion.
>
> We should set up some dedicated discussions and topic threads for
> "Streaming with Apache YuniKorn". I know a lot of folks from the industry
> would be interested. This would be a great opportunity to expand YuniKorn's
> footprint to more use cases.
>
> In our next Apache Flink meetup, I could help to invite some speakers
> (please feel free to recommend any) and organize a roundtable for
> streaming-specific discussions so folks could share their experience/needs
> to identify any gaps for future improvement together.
>
> Please let me know what you think. +devs
>
> Best,
> Chenya
>
>
>
> On Wed, Jan 5, 2022 at 9:52 AM Weiwei Yang <w...@apache.org> wrote:
>
>> hi Chenya
>>
>> > As we know, streaming applications are long-running and need to secure
>> > all requested resources before starting to run. In most cases, they do
>> > not have a strong need to be queued, ordered, or preempted to wait to
>> > obtain or give back their resource.
>>
>> You are right if we assume pure streaming cases where all jobs are
>> long-running and the cluster has sufficient resources for all of them.
>> Maybe it is fair to say it is not a day 1 challenge.
>> However, in my past experience, this is not always enough and will not be
>> enough. When operating large-scale Flink jobs, the major issues we were
>> dealing with were resource utilization, resource contention, hot spots,
>> isolation, etc. We used to have tens of queues per cluster, shared by many
>> users. Jobs have different priorities, and high-priority jobs can make
>> room by preempting lower-priority ones. We have a customized node-score
>> system in order to distribute pods more efficiently. As you see, resource
>> queues, app sorting, node sorting, and preemption all play a role here.
>> Central job management and scheduling latency/throughput are also
>> important.
>>
>> K8s and the cloud bring more challenges. I guess one thing that is both
>> challenging and interesting is how to do auto-scaling more efficiently.
>> Sometimes we need a strategy to warm up resources on the cloud in order to
>> fit new jobs in with low latency. Most likely the scheduler can give some
>> hints for that. This will be a fun part to explore too. With all that
>> being said, I do think a customized scheduler (instead of the pod-level
>> scheduler, default-k8s-scheduler) will be necessary.
>>
>> On Tue, Jan 4, 2022 at 10:18 PM Chenya Zhang <chenyazhangche...@gmail.com>
>> wrote:
>>
>> > Hi Weiwei
>> >
>> > Thanks for sharing. I checked the video and for Alibaba's use case, they
>> > have a mixed cluster for streaming and batch applications running with
>> > Apache Flink. Our use case is different. We only use Apache Flink for
>> > stream processing in physical clusters separate from Spark for batch
>> > processing.
>> >
>> > As we know, streaming applications are long-running and need to secure
>> > all requested resources before starting to run. In most cases, they do
>> > not have a strong need to be queued, ordered, or preempted to wait to
>> > obtain or give back their resource.
>> >
>> > I'm gathering more streaming use case requirements that could not be
>> > satisfied by K8s namespace for resource quota management or other
>> > advanced scheduling needs. Will keep this thread updated.
>> >
>> > Meanwhile, happy to hear more thoughts from you!
>> >
>> > Best,
>> > Chenya
>> >
>> > On Tue, Jan 4, 2022 at 9:20 PM Weiwei Yang <w...@apache.org> wrote:
>> >
>> > > Hi Chenya
>> > >
>> > > The use case is similar, and YK will play a big role there. Lots of
>> > > features are relevant, such as queues, job ordering, user/group ACLs,
>> > > preemption, over-subscription, performance, etc.
>> > > Some of the basic functionalities are available in YK; some more need
>> > > to be built.
>> > > Please take a look at the slides from the Alibaba Flink team; they have
>> > > shared how they use YK to address their use cases.
>> > > This was presented at ApacheCon:
>> > > https://www.youtube.com/watch?v=4hghJCuZk5M
>> > >
>> > > On Tue, Jan 4, 2022 at 6:35 PM Chenya Zhang <chenyazhangche...@gmail.com>
>> > > wrote:
>> > >
>> > > > Hey folks,
>> > > >
>> > > > We have some new streaming use cases with Apache Flink that could
>> > > > potentially leverage YuniKorn for resource scheduling.
>> > > >
>> > > > The initial implementation uses K8s namespaces for resource quota
>> > > > management. We are investigating what strong benefits there could be
>> > > > in switching to YuniKorn in streaming cases for long-running
>> > > > services. For example, job queueing, job ordering, resource
>> > > > reservation, user groups, etc. all seem to be more desirable for
>> > > > batch use cases.
>> > > >
>> > > > Any thoughts or suggestions?
>> > > >
>> > > > Thanks,
>> > > > Chenya
>> > > >
>> > >
>> >
>>
>
