Hello,

We've been in touch with a few Spark specialists who suggested a potential approach to improve the reliability of our shuffle-heavy jobs.
Here is what our setup looks like:
- Spark version: 3.3.1
- Java version: 1.8
- We do not use an external shuffle service
- We use spot instances

We run Spark jobs on clusters that use Amazon EBS volumes, with spark.local.dir mounted on the EBS volume. One of the offerings from the service we use is EBS migration: if a host is about to be evicted, a new host is created and the EBS volume is re-attached to it.

The claim is that when Spark assigns a new executor to the newly created instance, that executor can recover all the shuffle files already persisted on the migrated EBS volume. Is this how it works? Do executors recover / re-register shuffle files that they find on disk? So far I have not come across any such recovery mechanism. The only thing I can see is KubernetesLocalDiskShuffleDataIO, which has a pre-init step that tries to register the available shuffle files with itself.

A natural follow-up: if what they claim is true, then ideally we should also expect that when an executor is killed/OOM'd and a new executor is spawned on the same host, the new executor registers the existing shuffle files with itself. Is that so?

Thanks
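For context, the only mechanism of this kind I have found is the Kubernetes-specific plugin mentioned above. To my understanding it is enabled roughly as below; this is a sketch based on my reading of the Spark docs and source, assuming Spark 3.4+ on Kubernetes with PVC reuse, and is not something we have tested on our 3.3.1 setup:

```properties
# Sketch: shuffle-data recovery from a reused persistent volume on Kubernetes.
# KubernetesLocalDiskShuffleDataIO scans spark.local.dir when an executor
# starts and re-registers any shuffle files it finds there.
spark.shuffle.sort.io.plugin.class=org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO

# PVC reuse, so a replacement executor gets the previous executor's volume
# (and therefore its spark.local.dir contents) back.
spark.kubernetes.driver.reusePersistentVolumeClaim=true
spark.kubernetes.driver.ownPersistentVolumeClaim=true
```

If this is the intended recovery path, then it would seem our EBS-migration setup has no equivalent outside Kubernetes, which is essentially what I am asking.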