May I ask why the ETL job and the DL (assuming you mean deep learning here) task cannot be run as two separate Spark jobs?
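Concretely, the split could look something like the sketch below: two independent spark-submit invocations, each with the profile its workload needs. The file names and numeric values are made up for illustration; only the `--conf` keys are real Spark settings.

```python
import subprocess

# ETL step: CPU-heavy, scale wide. (File names and values are illustrative.)
etl_cmd = [
    "spark-submit",
    "--conf", "spark.dynamicAllocation.maxExecutors=500",
    "--conf", "spark.executor.cores=2",
    "etl_job.py",
]

# DL step: its own job, so maxExecutors can simply be capped at the GPU budget.
dl_cmd = [
    "spark-submit",
    "--conf", "spark.dynamicAllocation.maxExecutors=20",
    "--conf", "spark.executor.resource.gpu.amount=1",
    "--conf", "spark.task.resource.gpu.amount=1",
    "dl_job.py",
]

# An orchestrator (Airflow, cron, even a shell script) would run them in order:
# subprocess.run(etl_cmd, check=True)
# subprocess.run(dl_cmd, check=True)
```

Because each job has its own config, there is no need to fight the single-application maxExecutors limit at all.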
IMHO it is better practice to split the entire pipeline into logical steps and orchestrate them. That way you can pick the right profile for each of two very different types of workloads.

Ayan

On Sun, 6 Nov 2022 at 12:04 am, Shay Elbaz <shay.el...@gm.com> wrote:

> Consider this:
>
> 1. The application is allowed to use only 20 GPUs.
> 2. To ensure exactly 20 GPUs, I use the
> *df.rdd.repartition(20).withResources(gpus.build).mapPartitions(func)*
> technique (maxExecutors >> 20).
> 3. Given the volume of the input data, it takes 20 hours *total* to run
> the DL part (computer vision) on 20 GPUs, or *1 hour per GPU task*.
>
> Normally, I would repartition to 200 partitions to get finer-grained ~6
> minute tasks instead of 1-hour ones. But here we are "forced" to use only
> 20 partitions. To be clear, I'm only referring to potential failures/lags
> here. The job needs at least 20 hours total (on 20 GPUs) no matter what,
> but if any task fails after 50 minutes, for example, we have to re-process
> those 50 minutes again. And if a task/executor lags behind due to
> environment issues, speculative execution will only trigger another task
> after 1 hour. These issues would be avoided if we used 200 partitions, but
> then Spark would try to allocate more than 20 GPUs.
>
> I hope that was clearer.
> Thank you very much for helping.
>
> Shay
>
> ------------------------------
> *From:* Tom Graves <tgraves...@yahoo.com>
> *Sent:* Friday, November 4, 2022 4:19 PM
> *To:* Tom Graves <tgraves...@yahoo.com.invalid>; Artemis User
> <arte...@dtechspace.com>; user@spark.apache.org <user@spark.apache.org>;
> Shay Elbaz <shay.el...@gm.com>
> *Subject:* [EXTERNAL] Re: Re: Re: Re: Stage level scheduling - lower the
> number of executors when using GPUs
>
> *ATTENTION:* This email originated from outside of GM.
>
> So I'm not sure I completely follow. Are you asking for a way to change
> the limit without having to do the repartition?
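For concreteness, the trade-off Shay describes can be put into numbers (a quick back-of-the-envelope sketch using the figures from his message):

```python
# The DL stage is a fixed amount of work: 20 tasks x 1 hour on 20 GPUs,
# i.e. 20 GPU-hours, however finely it is partitioned.
TOTAL_GPU_MINUTES = 20 * 60

def task_minutes(num_partitions):
    """Per-task duration if the fixed work is split evenly."""
    return TOTAL_GPU_MINUTES / num_partitions

# 20 partitions: 60-minute tasks; a failure at minute 50 wastes 50 minutes.
# 200 partitions: 6-minute tasks; a failure wastes at most ~6 minutes,
# but Spark would then try to acquire up to 200 GPUs under the profile.
print(task_minutes(20), task_minutes(200))  # 60.0 6.0
```

So finer partitioning only reduces the retry/straggler cost; the total GPU time stays the same either way.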
> And your DL software doesn't care if you got, say, 30 executors instead of
> 20? Normally I would expect the number of partitions at that point to be
> 200 (or whatever you set for your shuffle partitions) unless you are using
> the AQE partition-coalescing functionality, in which case it could change.
> Are you using the latter?
>
> Normally I try to aim for anything between 30s-5m per *task
> (failure-wise)*, depending on the cluster, its stability, etc. But in this
> case, individual tasks can take 30-60 minutes, if not much more. Any
> failure during this long time is pretty expensive.
>
> Are you saying that when you manually do the repartition, your DL tasks
> take 30-60 minutes? So again, you want something like AQE partition
> coalescing to kick in to pick partition sizes for you?
>
> Tom
>
> On Thursday, November 3, 2022 at 03:18:07 PM CDT, Shay Elbaz
> <shay.el...@gm.com> wrote:
>
> This is exactly what we ended up doing! The only drawback I saw with this
> approach is that the GPU tasks get pretty big (in terms of data and
> compute time), and task failures become expensive. That's why I reached
> out to the mailing list in the first place 🙂
> Normally I try to aim for anything between 30s-5m per *task
> (failure-wise)*, depending on the cluster, its stability, etc. But in this
> case, individual tasks can take 30-60 minutes, if not much more. Any
> failure during this long time is pretty expensive.
>
> Shay
>
> ------------------------------
> *From:* Tom Graves <tgraves...@yahoo.com.INVALID>
> *Sent:* Thursday, November 3, 2022 7:56 PM
> *To:* Artemis User <arte...@dtechspace.com>; user@spark.apache.org
> <user@spark.apache.org>; Shay Elbaz <shay.el...@gm.com>
> *Subject:* [EXTERNAL] Re: Re: Re: Stage level scheduling - lower the
> number of executors when using GPUs
>
> Stage level scheduling does not allow you to change configs right now.
> This is something we thought about as a follow-on but have never
> implemented. How many tasks are you running on the DL stage? The typical
> case is: run some ETL with lots of tasks..., do mapPartitions, and then
> run your DL stuff; before that mapPartitions you could do a repartition if
> necessary to get exactly the number of tasks you want (20). That way, even
> if maxExecutors=500, you will only ever need 20 (or whatever you
> repartition to), and Spark isn't going to ask for more than that.
>
> Tom
>
> On Thursday, November 3, 2022 at 11:10:31 AM CDT, Shay Elbaz
> <shay.el...@gm.com> wrote:
>
> Thanks again Artemis, I really appreciate it. I have watched the video but
> did not find an answer.
>
> Please bear with me for just one more iteration 🙂
>
> Maybe I'll be more specific:
> Suppose I start the application with maxExecutors=500, executors.cores=2,
> because that's the amount of resources needed for the ETL part. But for
> the DL part I only need 20 GPUs. The SLS API only allows setting the
> resources per executor/task, so Spark would (try to) allocate up to 500
> GPUs, assuming I configure the profile with 1 GPU per executor.
> So, the question is: how do I limit the stage resources to 20 GPUs total?
>
> Thanks again,
> Shay
>
> ------------------------------
> *From:* Artemis User <arte...@dtechspace.com>
> *Sent:* Thursday, November 3, 2022 5:23 PM
> *To:* user@spark.apache.org <user@spark.apache.org>
> *Subject:* [EXTERNAL] Re: Re: Stage level scheduling - lower the number
> of executors when using GPUs
>
> Shay, you may find this video helpful (with some of the API code samples
> you are looking for): https://www.youtube.com/watch?v=JNQu-226wUc&t=171s.
> The issue here isn't how to limit the number of executors but how to
> request the right GPU-enabled executors dynamically.
> Those executors used in pre-GPU stages should be returned to the resource
> manager with dynamic resource allocation enabled (and with the right DRA
> policies). Hope this helps.
>
> Unfortunately there isn't much detailed documentation on this topic, since
> GPU acceleration is still fairly new in Spark (not straightforward like in
> TF). I wish the Spark doc team could provide more details in the next
> release...
>
> On 11/3/22 2:37 AM, Shay Elbaz wrote:
>
> Thanks Artemis. We are *not* using Rapids, but rather using GPUs through
> the Stage Level Scheduling feature with ResourceProfile. In Kubernetes you
> have to turn on shuffle tracking for dynamic allocation anyhow.
> The question is how we can limit the *number of executors* when building a
> new ResourceProfile, directly (API) or indirectly (some advanced
> workaround).
>
> Thanks,
> Shay
>
> ------------------------------
> *From:* Artemis User <arte...@dtechspace.com>
> *Sent:* Thursday, November 3, 2022 1:16 AM
> *To:* user@spark.apache.org <user@spark.apache.org>
> *Subject:* [EXTERNAL] Re: Stage level scheduling - lower the number of
> executors when using GPUs
>
> Are you using Rapids for GPU support in Spark? A couple of options you may
> want to try:
>
> 1. In addition to having dynamic allocation turned on, you may also need
> to turn on the external shuffle service.
> 2. It sounds like you are using Kubernetes. In that case, you may also
> need to turn on shuffle tracking.
> 3. The "stages" are controlled by the APIs. The APIs for dynamic resource
> requests (change of stage) do exist, but only for RDDs (e.g.
> TaskResourceRequest and ExecutorResourceRequest).
>
> On 11/2/22 11:30 AM, Shay Elbaz wrote:
>
> Hi,
>
> Our typical applications need fewer *executors* for a GPU stage than for
> a CPU stage.
> We are using dynamic allocation with stage-level scheduling, and Spark
> tries to maximize the number of executors during the GPU stage as well,
> causing a bit of resource chaos in the cluster. This forces us to use a
> lower value for 'maxExecutors' in the first place, at the cost of the CPU
> stages' performance. Alternatively, we could try to solve this at the
> Kubernetes scheduler level, which is not straightforward and doesn't feel
> like the right way to go.
>
> Is there a way to effectively use fewer executors with Stage Level
> Scheduling? The API does not seem to include such an option, but maybe
> there is some more advanced workaround?
>
> Thanks,
> Shay

--
Best Regards,
Ayan Guha