Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

Mich Talebzadeh Mon, 15 May 2023 09:43:29 -0700

A couple of points:

Why use spot or preemptible instances when, as you stated, your application
shuffles heavily?
Have you looked at why you are getting these shuffles? What is causing
these large transformations to end up in a shuffle?


Also on your point:
"..then ideally we should expect that when an executor is killed/OOM'd and
a new executor is spawned on the same host, the new executor registers the
shuffle files to itself. Is that so?"

What guarantee is there that the new executor with inherited shuffle files
will succeed?

Also, OOM is often associated with some form of skewed data.
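
As a rough way to check for skew, here is a minimal sketch in plain Python
(not a Spark API). It assumes you can pull a sample of the join or group-by
keys out of the dataset being shuffled; the function name skew_ratio and the
sample data are made up for illustration. A ratio far above 1 suggests that a
few keys dominate, so the tasks handling them may need far more memory than
the rest.

```python
from collections import Counter

def skew_ratio(keys):
    """Return largest key count divided by mean key count (1.0 = perfectly even)."""
    counts = Counter(keys)
    if not counts:
        return 0.0
    mean = sum(counts.values()) / len(counts)
    return max(counts.values()) / mean

# Hypothetical sample of shuffle keys: one key accounts for 90% of the rows.
sample = ["a"] * 90 + ["b"] * 5 + ["c"] * 5
print(skew_ratio(sample))
```

If the ratio is high, it is probably worth addressing the skew itself before
worrying about whether shuffle files can be recovered.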

HTH
Mich Talebzadeh,
Lead Solutions Architect/Engineering Lead
Palantir Technologies Limited
London
United Kingdom


   view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 15 May 2023 at 13:11, Faiz Halde <[email protected]>
wrote:

> Hello,
>
> We've been in touch with a few Spark specialists who suggested a
> potential solution to improve the reliability of our shuffle-heavy
> jobs.
>
> Here is what our setup looks like
>
>    - Spark version: 3.3.1
>    - Java version: 1.8
>    - We do not use external shuffle service
>    - We use spot instances
>
> We run spark jobs on clusters that use Amazon EBS volumes. The
> spark.local.dir is mounted on this EBS volume. One of the offerings from
> the service we use is EBS migration which basically means if a host is
> about to get evicted, a new host is created and the EBS volume is attached
> to it
>
> When Spark assigns a new executor to the newly created instance, it
> basically can recover all the shuffle files that are already persisted in
> the migrated EBS volume
>
> Is this how it works? Do executors recover / re-register the shuffle files
> that they found?
>
> So far I have not come across any recovery mechanism. I can only see
> KubernetesLocalDiskShuffleDataIO, which has a pre-init step where it tries
> to register the available shuffle files to itself.
>
> A natural follow-up on this,
>
> If what they claim is true, then ideally we should expect that when an
> executor is killed/OOM'd and a new executor is spawned on the same host,
> the new executor registers the shuffle files to itself. Is that so?
>
> Thanks
>
> ------------------------------
> Confidentiality note: This e-mail may contain confidential information
> from Nu Holdings Ltd and/or its affiliates. If you have received it by
> mistake, please let us know by e-mail reply and delete it from your system;
> you may not copy this message or disclose its contents to anyone; for
> details about what personal information we collect and why, please refer to
> our privacy policy
> <https://api.mziq.com/mzfilemanager/v2/d/59a081d2-0d63-4bb5-b786-4c07ae26bc74/6f4939b9-5f74-a528-1835-596b481dca54>
> .
>
