Er, wait, this is what stage-level scheduling is, right? This has existed since 3.1: https://issues.apache.org/jira/browse/SPARK-27495
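For reference, a minimal sketch of how it looks with the RDD API. The GPU amounts, discovery script path, and the etlOutput / runInference names are placeholders, not from this thread:

import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Executors for the GPU stage: 2 cores and 1 GPU each (placeholder amounts).
val execReqs = new ExecutorResourceRequests()
  .cores(2)
  .resource("gpu", 1, "/opt/spark/scripts/getGpus.sh", "nvidia.com")

// One GPU per task, so each executor runs one task at a time in this stage.
val taskReqs = new TaskResourceRequests().resource("gpu", 1)

val gpuProfile = new ResourceProfileBuilder()
  .require(execReqs)
  .require(taskReqs)
  .build()

// Only the stages that compute this RDD use the GPU profile; with dynamic
// allocation enabled, Spark swaps executors when the profile changes.
val predictions = etlOutput.rdd
  .withResources(gpuProfile)
  .mapPartitions(runInference)

Everything outside that RDD's stages keeps the default profile.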
On Thu, Nov 3, 2022 at 12:10 PM bo yang <bobyan...@gmail.com> wrote:

> Interesting discussion here; it looks like Spark does not support configuring
> a different number of executors in different stages. Would love to see the
> community come out with such a feature.
>
> On Thu, Nov 3, 2022 at 9:10 AM Shay Elbaz <shay.el...@gm.com> wrote:
>
>> Thanks again Artemis, I really appreciate it. I have watched the video but
>> did not find an answer.
>>
>> Please bear with me for just one more iteration 🙂
>>
>> Maybe I'll be more specific:
>> Suppose I start the application with maxExecutors=500, executors.cores=2,
>> because that's the amount of resources needed for the ETL part. But for the
>> DL part I only need 20 GPUs. The SLS API only allows setting the resources
>> per executor/task, so Spark would (try to) allocate up to 500 GPUs, assuming
>> I configure the profile with 1 GPU per executor.
>> So, the question is: how do I limit the stage resources to 20 GPUs total?
>>
>> Thanks again,
>> Shay
>>
>> ------------------------------
>> *From:* Artemis User <arte...@dtechspace.com>
>> *Sent:* Thursday, November 3, 2022 5:23 PM
>> *To:* user@spark.apache.org <user@spark.apache.org>
>> *Subject:* [EXTERNAL] Re: Re: Stage level scheduling - lower the number
>> of executors when using GPUs
>>
>> Shay, you may find this video helpful (with some of the API code samples
>> you are looking for): https://www.youtube.com/watch?v=JNQu-226wUc&t=171s.
>> The issue here isn't how to limit the number of executors but how to
>> request the right GPU-enabled executors dynamically. The executors used in
>> pre-GPU stages should be returned to the resource manager with dynamic
>> resource allocation enabled (and with the right DRA policies). Hope this
>> helps.
>>
>> Unfortunately there isn't much detailed documentation on this topic, since
>> GPU acceleration is still fairly new in Spark (not straightforward like in
>> TF). I wish the Spark doc team could provide more details in the next
>> release...
>>
>> On 11/3/22 2:37 AM, Shay Elbaz wrote:
>>
>> Thanks Artemis. We are *not* using Rapids, but rather using GPUs through
>> the Stage Level Scheduling feature with ResourceProfile. In Kubernetes you
>> have to turn on shuffle tracking for dynamic allocation anyhow. The
>> question is how we can limit the *number of executors* when building a new
>> ResourceProfile, directly (API) or indirectly (some advanced workaround).
>>
>> Thanks,
>> Shay
>>
>> ------------------------------
>> *From:* Artemis User <arte...@dtechspace.com>
>> *Sent:* Thursday, November 3, 2022 1:16 AM
>> *To:* user@spark.apache.org <user@spark.apache.org>
>> *Subject:* [EXTERNAL] Re: Stage level scheduling - lower the number of
>> executors when using GPUs
>>
>> Are you using Rapids for GPU support in Spark? A couple of options you may
>> want to try:
>>
>> 1. In addition to turning on dynamic allocation, you may also need to turn
>> on the external shuffle service.
>> 2. Sounds like you are using Kubernetes. In that case, you may also need
>> to turn on shuffle tracking.
>> 3. The "stages" are controlled by the APIs. The APIs for dynamic resource
>> requests (change of stage) do exist, but only for RDDs (e.g.
>> TaskResourceRequest and ExecutorResourceRequest).
>>
>> On 11/2/22 11:30 AM, Shay Elbaz wrote:
>>
>> Hi,
>>
>> Our typical applications need fewer *executors* for a GPU stage than for a
>> CPU stage. We are using dynamic allocation with stage level scheduling, and
>> Spark tries to maximize the number of executors during the GPU stage as
>> well, causing a bit of resource chaos in the cluster. This forces us to use
>> a lower value for 'maxExecutors' in the first place, at the cost of CPU
>> stage performance. Or to try to solve this at the Kubernetes scheduler
>> level, which is not straightforward and doesn't feel like the right way to
>> go.
>>
>> Is there a way to effectively use fewer executors in Stage Level
>> Scheduling? The API does not seem to include such an option, but maybe
>> there is some more advanced workaround?
>>
>> Thanks,
>> Shay
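One possible workaround for the 20-GPU question that the thread itself never confirms: dynamic allocation sizes its executor request from the number of pending tasks, so if the GPU profile requests 1 GPU per task and 1 GPU per executor (one task per executor), capping the GPU stage at 20 partitions should keep Spark from asking for more than roughly 20 GPU executors, regardless of maxExecutors. A sketch, reusing the placeholder names from the first sketch above; verify against your Spark version:

// Assumption, not from the thread: cap concurrent GPU executors indirectly
// by capping the number of partitions in the GPU stage. With 1 GPU per task
// and 1 GPU per executor, at most ~20 executors are needed to run 20 pending
// tasks, so dynamic allocation should not request more than that.
val predictions = etlOutput.rdd
  .coalesce(20)
  .withResources(gpuProfile)
  .mapPartitions(runInference)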