Hello,

We've been in touch with a few Spark specialists who suggested a potential approach to improve the reliability of our shuffle-heavy jobs.
Here is what our setup looks like:
- Spark version: 3.3.1
- Java version: 1.8
- We do not use an external shuffle service
- We use spot instances

We run Spark jobs on clusters that use Amazon EBS volumes, with spark.local.dir mounted on the EBS volume. One of the offerings from the service we use is EBS migration: if a host is about to be evicted, a new host is created and the EBS volume is re-attached to it.

The claim is that when Spark assigns a new executor to the newly created instance, that executor can recover all the shuffle files already persisted on the migrated EBS volume. Is this how it works? Do executors recover / re-register shuffle files that they find on disk? So far I have not come across any such recovery mechanism. The only thing I can see is KubernetesLocalDiskShuffleDataIO, which has a pre-init step that tries to register the available shuffle files with itself.

A natural follow-up: if what they claim is true, then ideally we should also expect that when an executor is killed/OOM'd and a new executor is spawned on the same host, the new executor registers the existing shuffle files with itself. Is that so?

Thanks
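For context, the only mechanism of this kind I have found is the Kubernetes-specific plugin mentioned above. To my understanding it is enabled roughly as below; this is a sketch based on my reading of the Spark docs and source, assuming Spark 3.4+ on Kubernetes with PVC reuse, and is not something we have tested on our 3.3.1 setup:

```properties
# Sketch: shuffle-data recovery from a reused persistent volume on Kubernetes.
# KubernetesLocalDiskShuffleDataIO scans spark.local.dir when an executor
# starts and re-registers any shuffle files it finds there.
spark.shuffle.sort.io.plugin.class=org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO

# PVC reuse, so a replacement executor gets the previous executor's volume
# (and therefore its spark.local.dir contents) back.
spark.kubernetes.driver.reusePersistentVolumeClaim=true
spark.kubernetes.driver.ownPersistentVolumeClaim=true
```

If this is the intended recovery path, then it would seem our EBS-migration setup has no equivalent outside Kubernetes, which is essentially what I am asking.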