** Corrections: Apache YuniKorn meetup :) **

On Wed, Jan 5, 2022 at 4:56 PM Chenya Zhang <chenyazhangche...@gmail.com> wrote:
> Hi Weiwei, thanks for sharing your past experience! This is a helpful
> discussion.
>
> We should set up some dedicated discussions and topic threads for
> "Streaming with Apache YuniKorn". I know a lot of folks from the industry
> would be interested. This would be a great opportunity to expand YuniKorn's
> footprints to more use case scenarios.
>
> In our next Apache Flink meetup, I could help to invite some speakers
> (please feel free to recommend any) and organize a roundtable for
> streaming-specific discussions so folks could share their experience/needs
> to identify any gaps for future improvement together.
>
> Please let me know what you think. +devs
>
> Best,
> Chenya
>
>
>
> On Wed, Jan 5, 2022 at 9:52 AM Weiwei Yang <w...@apache.org> wrote:
>
>> hi Chenya
>>
>> > As we know, streaming applications are long-running and need to secure all
>> > requested resources before starting to run. In most cases, they do not have
>> > a strong need to be queued, ordered, or preempted to wait to obtain or give
>> > back their resource.
>>
>> You are right if the assumption is pure streaming cases, all long-running
>> jobs, and the cluster has sufficient resources for all jobs. Maybe it is
>> fair to say it is not a day 1 challenge.
>> However, in my past experience, this is not always enough and will not be
>> enough. When we operate large-scale Flink jobs, the major issues we were
>> dealing with: resource utilization, resource contention, hot-spot,
>> isolation, etc. We used to have tens of queues per cluster and shared by
>> many users, and jobs have different priorities and high-priority jobs can
>> make room by preempting lower priority ones. We have a customized
>> node-score system in order to distribute pods more efficiently. As you see,
>> resource queues, app-sorting, node-sorting, preemption, all play a role
>> here. Also central job management, scheduling latency/throughput are also
>> important.
>>
>> On K8s and Cloud, it brings more challenges.
>> I guess one thing challenging and also interesting is how to do
>> auto-scaling more efficiently. Sometimes we need a strategy to warm up
>> resources on Cloud in order to fit new jobs in low latency. Most likely
>> the scheduler can give some hints for that. This will be a fun part to
>> explore too. With all being said, I do think a customized scheduler
>> (instead of the pod-level scheduler - default-k8s-scheduler) will be
>> necessary.
>>
>> On Tue, Jan 4, 2022 at 10:18 PM Chenya Zhang <chenyazhangche...@gmail.com>
>> wrote:
>>
>> > Hi Weiwei
>> >
>> > Thanks for sharing. I checked the video and for Alibaba's use case, they
>> > have a mixed cluster for streaming and batch applications running with
>> > Apache Flink. Our use case is different. We only use Apache Flink for
>> > stream processing in physical clusters separate from Spark for batch
>> > processing.
>> >
>> > As we know, streaming applications are long-running and need to secure all
>> > requested resources before starting to run. In most cases, they do not have
>> > a strong need to be queued, ordered, or preempted to wait to obtain or give
>> > back their resource.
>> >
>> > I'm gathering more streaming use case requirements that could not be
>> > satisfied by K8s namespace for resource quota management or other advanced
>> > scheduling needs. Will keep this thread updated.
>> >
>> > Meanwhile, happy to hear more thoughts from you!
>> >
>> > Best,
>> > Chenya
>> >
>> > On Tue, Jan 4, 2022 at 9:20 PM Weiwei Yang <w...@apache.org> wrote:
>> >
>> > > Hi Chenya
>> > >
>> > > The use case is similar, YK will play a big role there. Lots of features
>> > > are relevant, such as queues, job ordering, user/group ACLs, preemption,
>> > > over-subscription, and performance etc.
>> > > Some of the basic functionalities are available in YK, some more needs to
>> > > be built.
>> > > Please take a look at the slides from the Alibaba Flink team, they have
>> > > shared how they use YK to address their use cases.
>> > > This was presented in ApacheConf:
>> > > https://www.youtube.com/watch?v=4hghJCuZk5M
>> > >
>> > > On Tue, Jan 4, 2022 at 6:35 PM Chenya Zhang <chenyazhangche...@gmail.com>
>> > > wrote:
>> > >
>> > > > Hey folks,
>> > > >
>> > > > We have some new streaming use cases with Apache Flink that could
>> > > > potentially leverage YuniKorn for resource scheduling.
>> > > >
>> > > > The initial implementation is to use K8s namespace for resource quota
>> > > > management. We are investigating what could be some strong benefits
>> > > > switching to YuniKorn in streaming cases for long-running services. For
>> > > > example: Job queueing, job ordering, resource reservation, user groups
>> > > > etc all seem to be more desirable for batch use cases.
>> > > >
>> > > > Any thoughts or suggestions?
>> > > >
>> > > > Thanks,
>> > > > Chenya
>> > > >
>> > >
>> >
>>
>