@Holden Karau <hol...@pigscanfly.ca>

Thanks for the reminder, I will send the vote mail soon.

And thanks for all the help with the discussion and design review.

Regards,
Yikun


Holden Karau <hol...@pigscanfly.ca> wrote on Thu, 6 Jan 2022 at 03:16:

> Do we want to move the SPIP forward to a vote? It seems like we're mostly
> agreeing in principle?
>
> On Wed, Jan 5, 2022 at 11:12 AM Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> Hi Bo,
>>
>> Thanks for the info. Let me elaborate:
>>
>> In theory you can set the number of executors to a multiple of the number
>> of nodes. For example, if you have a three-node k8s cluster (in my case
>> Google GKE), you can set the number of executors to 6 and end up with six
>> executors queuing to start, but ultimately you finish with two running
>> executors plus the driver on the 3-node cluster, as shown below.
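>>
>> For reference, the submission was roughly along these lines (the master
>> endpoint, container image and application below are placeholders, not the
>> exact values from my run):
>>
>> spark-submit \
>>   --master k8s://https://<gke-master-endpoint> \
>>   --deploy-mode cluster \
>>   --name RandomDataBigQuery \
>>   --conf spark.kubernetes.namespace=spark \
>>   --conf spark.executor.instances=6 \
>>   --conf spark.kubernetes.container.image=<spark-container-image> \
>>   <application jar/py file and arguments>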
>>
>> hduser@ctpvm: /home/hduser> k get pods -n spark
>>
>> NAME                                         READY   STATUS    RESTARTS   AGE
>> randomdatabigquery-d42d067e2b91c88a-exec-1   1/1     Running   0          33s
>> randomdatabigquery-d42d067e2b91c88a-exec-2   1/1     Running   0          33s
>> randomdatabigquery-d42d067e2b91c88a-exec-3   0/1     Pending   0          33s
>> randomdatabigquery-d42d067e2b91c88a-exec-4   0/1     Pending   0          33s
>> randomdatabigquery-d42d067e2b91c88a-exec-5   0/1     Pending   0          33s
>> randomdatabigquery-d42d067e2b91c88a-exec-6   0/1     Pending   0          33s
>> sparkbq-0beda77e2b919e01-driver              1/1     Running   0          45s
>>
>> hduser@ctpvm: /home/hduser> k get pods -n spark
>>
>> NAME                                         READY   STATUS    RESTARTS   AGE
>> randomdatabigquery-d42d067e2b91c88a-exec-1   1/1     Running   0          38s
>> randomdatabigquery-d42d067e2b91c88a-exec-2   1/1     Running   0          38s
>> sparkbq-0beda77e2b919e01-driver              1/1     Running   0          50s
>>
>> hduser@ctpvm: /home/hduser> k get pods -n spark
>>
>> NAME                                         READY   STATUS    RESTARTS   AGE
>> randomdatabigquery-d42d067e2b91c88a-exec-1   1/1     Running   0          40s
>> randomdatabigquery-d42d067e2b91c88a-exec-2   1/1     Running   0          40s
>> sparkbq-0beda77e2b919e01-driver              1/1     Running   0          52s
>>
>> So you end up with the four pending executors dropping out. Hence the
>> conclusion seems to be that you want to fit exactly one Spark executor pod
>> per Kubernetes node with the current model.
>>
>> HTH
>>
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Wed, 5 Jan 2022 at 17:01, bo yang <bobyan...@gmail.com> wrote:
>>
>>> Hi Mich,
>>>
>>> Curious what you mean by "The constraint seems to be that you can fit one
>>> Spark executor pod per Kubernetes node and from my tests you don't seem to
>>> be able to allocate more than 50% of RAM on the node to the container".
>>> Could you help explain a bit? Asking this because there can be
>>> multiple executor pods running on a single Kubernetes node.
>>>
>>> Thanks,
>>> Bo
>>>
>>>
>>> On Wed, Jan 5, 2022 at 1:13 AM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Thanks William for the info.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> The current model of Spark on k8s has certain drawbacks with pod-based
>>>> scheduling, as I tested on Google Kubernetes Engine (GKE). The
>>>> constraint seems to be that you can fit one Spark executor pod per
>>>> Kubernetes node, and from my tests you don't seem to be able to allocate
>>>> more than 50% of the RAM on the node to the container.
>>>>
>>>>
>>>> [image: gke_memoeyPlot.png]
>>>>
>>>>
>>>> Anything more results in the container never being created (stuck at Pending):
>>>>
>>>> kubectl describe pod sparkbq-b506ac7dc521b667-driver -n spark
>>>>
>>>> Events:
>>>>   Type     Reason             Age                   From                Message
>>>>   ----     ------             ----                  ----                -------
>>>>   Warning  FailedScheduling   17m                   default-scheduler   0/3 nodes are available: 3 Insufficient memory.
>>>>   Warning  FailedScheduling   17m                   default-scheduler   0/3 nodes are available: 3 Insufficient memory.
>>>>   Normal   NotTriggerScaleUp  2m28s (x92 over 17m)  cluster-autoscaler  pod didn't trigger scale-up:
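>>>>
>>>> As a quick diagnostic, comparing the pod's memory request against each
>>>> node's allocatable memory shows where the limit comes from; something
>>>> along these lines (<node-name> is a placeholder):
>>>>
>>>> # Allocatable is what the scheduler can actually hand out on the node,
>>>> # which is less than the machine's total RAM.
>>>> kubectl describe node <node-name> | grep -A 6 Allocatable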
>>>>
>>>> Obviously this is far from ideal; this model works, but it is not
>>>> efficient.
>>>>
>>>>
>>>> Cheers,
>>>>
>>>>
>>>> Mich
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>    view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>> any loss, damage or destruction of data or any other property which may
>>>> arise from relying on this email's technical content is explicitly
>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>> arising from such loss, damage or destruction.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Wed, 5 Jan 2022 at 03:55, William Wang <wang.platf...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Mich,
>>>>>
>>>>> Here are some of the performance indicators for Volcano:
>>>>> 1. Scheduler throughput: 1.5k pods/s (default scheduler: 100 pods/s).
>>>>> 2. Spark application performance improved by 30%+ with the minimum resource
>>>>> reservation feature in the case of insufficient resources (tested with TPC-DS).
>>>>>
>>>>> We are still working on more optimizations. Besides performance,
>>>>> Volcano is continuously being enhanced in the following four directions to
>>>>> provide the abilities that users care about:
>>>>> - Full lifecycle management for jobs
>>>>> - Scheduling policies for high-performance workloads (fair-share,
>>>>> topology, SLA, reservation, preemption, backfill, etc.; a sketch of a
>>>>> weighted queue follows below)
>>>>> - Support for heterogeneous hardware
>>>>> - Performance optimization for high-performance workloads
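>>>>>
>>>>> For illustration only (the queue name and numbers below are made up, not
>>>>> a recommendation), a weighted Volcano queue with a capacity cap can be
>>>>> created roughly like this:
>>>>>
>>>>> cat <<EOF | kubectl apply -f -
>>>>> apiVersion: scheduling.volcano.sh/v1beta1
>>>>> kind: Queue
>>>>> metadata:
>>>>>   name: spark-queue        # hypothetical queue name
>>>>> spec:
>>>>>   weight: 4                # relative fair-share weight against other queues
>>>>>   reclaimable: true        # resources over the queue's share can be reclaimed when other queues need them
>>>>>   capability:              # hard cap for this queue
>>>>>     cpu: "8"
>>>>>     memory: 16Gi
>>>>> EOF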
>>>>>
>>>>> Thanks
>>>>> LeiBo
>>>>>
>>>>> Mich Talebzadeh <mich.talebza...@gmail.com> wrote on Tue, 4 Jan 2022 at 18:12:
>>>>>>
>>>>>> Interesting, thanks.
>>>>>>
>>>>>> Do you have a ballpark figure (a rough numerical estimate) for how much
>>>>>> adding Volcano as an alternative scheduler is going to improve Spark on
>>>>>> k8s performance?
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>    view my Linkedin profile
>>>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>>>>> any loss, damage or destruction of data or any other property which may
>>>>>> arise from relying on this email's technical content is explicitly
>>>>>> disclaimed. The author will in no case be liable for any monetary damages
>>>>>> arising from such loss, damage or destruction.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, 4 Jan 2022 at 09:43, Yikun Jiang <yikunk...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, folks! Wishing you all the best in 2022.
>>>>>>>
>>>>>>> I'd like to share the current status on "Support Customized K8S
>>>>>>> Scheduler in Spark".
>>>>>>>
>>>>>>>
>>>>>>> https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg/edit#heading=h.1quyr1r2kr5n
>>>>>>>
>>>>>>> Framework/Common support
>>>>>>>
>>>>>>> - The Volcano and Yunikorn teams have joined the discussion and completed
>>>>>>> the initial doc on the framework/common part.
>>>>>>>
>>>>>>> - SPARK-37145 <https://issues.apache.org/jira/browse/SPARK-37145>
>>>>>>> (under review): We proposed to extend the customized scheduler by just
>>>>>>> using a custom feature step; this will meet the requirements of a
>>>>>>> customized scheduler once it gets merged. After this, the user can enable
>>>>>>> the feature step and scheduler like:
>>>>>>>
>>>>>>> spark-submit \
>>>>>>>     --conf spark.kubernetes.scheduler.name=volcano \
>>>>>>>     --conf spark.kubernetes.driver.pod.featureSteps=org.apache.spark.deploy.k8s.features.scheduler.VolcanoFeatureStep \
>>>>>>>     --conf spark.kubernetes.job.queue=xxx
>>>>>>>
>>>>>>> (As above, the VolcanoFeatureStep will help to set the
>>>>>>> Spark scheduler queue according to the user-specified conf.)
>>>>>>>
>>>>>>> - SPARK-37331 <https://issues.apache.org/jira/browse/SPARK-37331>:
>>>>>>> Added the ability to create kubernetes resources before driver pod 
>>>>>>> creation.
>>>>>>>
>>>>>>> - SPARK-36059 <https://issues.apache.org/jira/browse/SPARK-36059>:
>>>>>>> Add the ability to specify a scheduler in driver/executor
>>>>>>>
>>>>>>> With all of the above, the framework/common support will be ready for
>>>>>>> most customized schedulers.
>>>>>>>
>>>>>>> Volcano part:
>>>>>>>
>>>>>>> - SPARK-37258 <https://issues.apache.org/jira/browse/SPARK-37258>:
>>>>>>> Upgrade kubernetes-client to 5.11.1 to add volcano scheduler API 
>>>>>>> support.
>>>>>>>
>>>>>>> - SPARK-36061 <https://issues.apache.org/jira/browse/SPARK-36061>:
>>>>>>> Add a VolcanoFeatureStep to help users create a PodGroup with the
>>>>>>> user-specified minimum resources required; there is also a WIP commit to
>>>>>>> show a preview of this
>>>>>>> <https://github.com/Yikun/spark/pull/45/commits/81bf6f98edb5c00ebd0662dc172bc73f980b6a34>.
>>>>>>> A rough sketch of such a PodGroup follows below.
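>>>>>>>
>>>>>>> For illustration only: the kind of PodGroup such a feature step would
>>>>>>> generate looks roughly like this. The name, namespace and resource
>>>>>>> numbers are made-up placeholders; the real object would be derived from
>>>>>>> the user's Spark conf.
>>>>>>>
>>>>>>> cat <<EOF | kubectl apply -f -
>>>>>>> apiVersion: scheduling.volcano.sh/v1beta1
>>>>>>> kind: PodGroup
>>>>>>> metadata:
>>>>>>>   name: spark-job-podgroup    # hypothetical name
>>>>>>>   namespace: spark
>>>>>>> spec:
>>>>>>>   queue: default              # Volcano queue the job is submitted into
>>>>>>>   minMember: 3                # gang size, e.g. driver + 2 executors
>>>>>>>   minResources:               # minimum resources reserved before scheduling starts
>>>>>>>     cpu: "3"
>>>>>>>     memory: 6Gi
>>>>>>> EOF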
>>>>>>>
>>>>>>> Yunikorn part:
>>>>>>>
>>>>>>> - @WeiweiYang is completing the doc for the Yunikorn part and working on
>>>>>>> its implementation.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Yikun
>>>>>>>
>>>>>>>
>>>>>>> Weiwei Yang <w...@apache.org> wrote on Thu, 2 Dec 2021 at 02:00:
>>>>>>>
>>>>>>>> Thank you Yikun for the info, and thanks for inviting me to a
>>>>>>>> meeting to discuss this.
>>>>>>>> I appreciate your effort to put these together, and I agree that
>>>>>>>> the purpose is to make Spark easy/flexible enough to support other K8s
>>>>>>>> schedulers (not just for Volcano).
>>>>>>>> As discussed, could you please help abstract out the things in
>>>>>>>> common and allow Spark to plug in different implementations? I'd be
>>>>>>>> happy to work with you guys on this issue.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Nov 30, 2021 at 6:49 PM Yikun Jiang <yikunk...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> @Weiwei @Chenya
>>>>>>>>>
>>>>>>>>> > Thanks for bringing this up. This is quite interesting, we
>>>>>>>>> definitely should participate more in the discussions.
>>>>>>>>>
>>>>>>>>> Thanks for your reply, and welcome to the discussion; I think
>>>>>>>>> the input from Yunikorn is critical.
>>>>>>>>>
>>>>>>>>> > The main thing here is, the Spark community should make Spark
>>>>>>>>> pluggable in order to support other schedulers, not just for Volcano. 
>>>>>>>>> It
>>>>>>>>> looks like this proposal is pushing really hard for adopting PodGroup,
>>>>>>>>> which isn't part of K8s yet, that to me is problematic.
>>>>>>>>>
>>>>>>>>> Definitely yes, we are on the same page.
>>>>>>>>>
>>>>>>>>> I think we have the same goal: propose a general and reasonable
>>>>>>>>> mechanism to make Spark on k8s with a custom scheduler more usable.
>>>>>>>>>
>>>>>>>>> But as for PodGroup, allow me to give a brief introduction:
>>>>>>>>> - The PodGroup definition has been approved by Kubernetes
>>>>>>>>> officially in KEP-583. [1]
>>>>>>>>> - It can be regarded as a general concept/standard in Kubernetes
>>>>>>>>> rather than a concept specific to Volcano; there are also other
>>>>>>>>> implementations of it, such as [2][3].
>>>>>>>>> - Kubernetes recommends using CRDs as the extension mechanism for
>>>>>>>>> implementing such features. [4]
>>>>>>>>> - Volcano, as such an extension, provides an interface to maintain the
>>>>>>>>> lifecycle of the PodGroup CRD and uses volcano-scheduler to perform the
>>>>>>>>> scheduling.
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/583-coscheduling
>>>>>>>>> [2]
>>>>>>>>> https://github.com/kubernetes-sigs/scheduler-plugins/tree/master/pkg/coscheduling#podgroup
>>>>>>>>> [3] https://github.com/kubernetes-sigs/kube-batch
>>>>>>>>> [4]
>>>>>>>>> https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Yikun
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Weiwei Yang <w...@apache.org> wrote on Wed, 1 Dec 2021 at 05:57:
>>>>>>>>>
>>>>>>>>>> Hi Chenya
>>>>>>>>>>
>>>>>>>>>> Thanks for bringing this up. This is quite interesting, we
>>>>>>>>>> definitely should participate more in the discussions.
>>>>>>>>>> The main thing here is, the Spark community should make Spark
>>>>>>>>>> pluggable in order to support other schedulers, not just for 
>>>>>>>>>> Volcano. It
>>>>>>>>>> looks like this proposal is pushing really hard for adopting 
>>>>>>>>>> PodGroup,
>>>>>>>>>> which isn't part of K8s yet, that to me is problematic.
>>>>>>>>>>
>>>>>>>>>> On Tue, Nov 30, 2021 at 9:21 AM Prasad Paravatha <
>>>>>>>>>> prasad.parava...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> This is a great feature/idea.
>>>>>>>>>>> I'd love to get involved in some form (testing and/or
>>>>>>>>>>> documentation). This could be my 1st contribution to Spark!
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Nov 30, 2021 at 10:46 PM John Zhuge <jzh...@apache.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> +1 Kudos to Yikun and the community for starting the discussion!
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Nov 30, 2021 at 8:47 AM Chenya Zhang <
>>>>>>>>>>>> chenyazhangche...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks folks for bringing up the topic of natively integrating
>>>>>>>>>>>>> Volcano and other alternative schedulers into Spark!
>>>>>>>>>>>>>
>>>>>>>>>>>>> +Weiwei, Wilfred, Chaoran. We would love to contribute to the
>>>>>>>>>>>>> discussion as well.
>>>>>>>>>>>>>
>>>>>>>>>>>>> From our side, we have been using and improving an
>>>>>>>>>>>>> alternative resource scheduler, Apache YuniKorn (
>>>>>>>>>>>>> https://yunikorn.apache.org/), for Spark on Kubernetes in
>>>>>>>>>>>>> production at Apple, with solid results over the past year. It is capable of
>>>>>>>>>>>>> supporting gang scheduling (similar to PodGroups), multi-tenant resource
>>>>>>>>>>>>> queues (similar to YARN), FIFO, and other handy features like bin packing
>>>>>>>>>>>>> to enable efficient autoscaling, etc.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Natively integrating with Spark would provide more flexibility
>>>>>>>>>>>>> for users and reduce the extra cost and potential inconsistency of
>>>>>>>>>>>>> maintaining different layers of resource strategies. One interesting topic
>>>>>>>>>>>>> we hope to discuss further is dynamic allocation, which would benefit
>>>>>>>>>>>>> from native coordination between Spark and resource schedulers in K8s and
>>>>>>>>>>>>> cloud environments for optimal resource efficiency.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Tue, Nov 30, 2021 at 8:10 AM Holden Karau <
>>>>>>>>>>>>> hol...@pigscanfly.ca> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Thanks for putting this together, I’m really excited for us
>>>>>>>>>>>>>> to add better batch scheduling integrations.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Nov 30, 2021 at 12:46 AM Yikun Jiang <
>>>>>>>>>>>>>> yikunk...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hey everyone,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'd like to start a discussion on "Support
>>>>>>>>>>>>>>> Volcano/Alternative Schedulers Proposal".
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> This SPIP proposes to make Spark's k8s scheduler support provide
>>>>>>>>>>>>>>> more YARN-like features (such as queues and minimum resources before
>>>>>>>>>>>>>>> scheduling jobs) that many folks want on Kubernetes.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The goal of this SPIP is to improve the current Spark k8s
>>>>>>>>>>>>>>> scheduler implementation, add the ability to do batch scheduling, and
>>>>>>>>>>>>>>> support Volcano as one of the implementations.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Design doc:
>>>>>>>>>>>>>>> https://docs.google.com/document/d/1xgQGRpaHQX6-QH_J9YV2C2Dh6RpXefUpLM7KGkzL6Fg
>>>>>>>>>>>>>>> JIRA: https://issues.apache.org/jira/browse/SPARK-36057
>>>>>>>>>>>>>>> Part of PRs:
>>>>>>>>>>>>>>> Ability to create resources
>>>>>>>>>>>>>>> https://github.com/apache/spark/pull/34599
>>>>>>>>>>>>>>> Add PodGroupFeatureStep:
>>>>>>>>>>>>>>> https://github.com/apache/spark/pull/34456
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Yikun
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Twitter: https://twitter.com/holdenkarau
>>>>>>>>>>>>>> Books (Learning Spark, High Performance Spark, etc.):
>>>>>>>>>>>>>> https://amzn.to/2MaRAG9
>>>>>>>>>>>>>> YouTube Live Streams:
>>>>>>>>>>>>>> https://www.youtube.com/user/holdenkarau
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> John Zhuge
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Regards,
>>>>>>>>>>> Prasad Paravatha
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>
> --
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
