Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

vaquar khan Wed, 17 May 2023 05:29:03 -0700

The following link covers the details you need:

https://aws.amazon.com/blogs/containers/best-practices-for-running-spark-on-amazon-eks/
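
On the KubernetesLocalDiskShuffleDataIO question in the quoted message, here is a minimal sketch of the settings that appear to be involved. It assumes Spark on Kubernetes with the executor local directory on a persistent volume claim, which the thread does not confirm. The code only assembles --conf arguments in plain Python. The helper names, volume name, storage class, size, and mount path are placeholders, and every key should be checked against the "Running on Kubernetes" documentation for your Spark version.

```python
# Sketch only: collects settings related to reusing shuffle files that live on
# Kubernetes persistent volumes. All values below are placeholder assumptions.

# A volume name starting with "spark-local-dir-" is treated as local storage.
VOLUME = "spark-local-dir-1"
PVC_PREFIX = f"spark.kubernetes.executor.volumes.persistentVolumeClaim.{VOLUME}"


def shuffle_reuse_conf(storage_class="gp2", size="100Gi", mount_path="/data"):
    """Return a dict of Spark settings (hypothetical helper, not a Spark API)."""
    return {
        # Shuffle IO plugin named in the quoted message.
        "spark.shuffle.sort.io.plugin.class":
            "org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO",
        # Let the driver own the executor PVCs and hand them to new executors.
        "spark.kubernetes.driver.ownPersistentVolumeClaim": "true",
        "spark.kubernetes.driver.reusePersistentVolumeClaim": "true",
        # On-demand PVC mounted as the executor's local directory.
        f"{PVC_PREFIX}.options.claimName": "OnDemand",
        f"{PVC_PREFIX}.options.storageClass": storage_class,
        f"{PVC_PREFIX}.options.sizeLimit": size,
        f"{PVC_PREFIX}.mount.path": mount_path,
        f"{PVC_PREFIX}.mount.readOnly": "false",
    }


def to_submit_args(conf):
    """Flatten the dict into ['--conf', 'key=value', ...] in sorted key order."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args
```

to_submit_args(shuffle_reuse_conf()) returns a flat list that can be appended to a spark-submit command line.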


Let me know if you need any further information.
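
As for the pre-init step mentioned in the quoted message: shuffle map output is written under spark.local.dir as files named shuffle_<shuffleId>_<mapId>_<reduceId>.data, each with a matching .index file. The sketch below is not Spark's code. It is a hypothetical helper (find_shuffle_files) in plain Python that lists such files under a directory, which may help check whether a migrated volume still holds them. Finding the files on disk does not by itself show that Spark will reuse them.

```python
import re
from pathlib import Path

# Matches shuffle_<shuffleId>_<mapId>_<reduceId>.data and the matching .index.
SHUFFLE_FILE = re.compile(r"^shuffle_(\d+)_(\d+)_(\d+)\.(data|index)$")


def find_shuffle_files(local_dir):
    """Return sorted (shuffle_id, map_id, reduce_id, kind) tuples found under local_dir."""
    found = []
    # Shuffle files sit in nested subdirectories, so search recursively.
    for path in Path(local_dir).rglob("shuffle_*"):
        match = SHUFFLE_FILE.match(path.name)
        if match and path.is_file():
            shuffle_id, map_id, reduce_id, kind = match.groups()
            found.append((int(shuffle_id), int(map_id), int(reduce_id), kind))
    return sorted(found)
```

Running find_shuffle_files on the mount path of the migrated volume gives a quick inventory of which shuffle and map IDs still have both a data and an index file.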


Regards,
Vaquar khan




On Mon, May 15, 2023, 10:14 PM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> A couple of points:
>
> Why use spot or preemptible instances when your application, as you stated,
> shuffles heavily?
> Have you looked at why you are getting these shuffles? What is causing these
> large transformations to end up in a shuffle?
>
> Also on your point:
> "..then ideally we should expect that when an executor is killed/OOM'd
> and a new executor is spawned on the same host, the new executor registers
> the shuffle files to itself. Is that so?"
>
> What guarantee is there that the new executor with the inherited shuffle
> files will succeed?
>
> Also, OOM is often associated with some form of skewed data.
>
> HTH
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
>
>    view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Mon, 15 May 2023 at 13:11, Faiz Halde <faiz.ha...@nubank.com.br.invalid>
> wrote:
>
>> Hello,
>>
>> We've been in touch with a few Spark specialists who suggested a
>> potential solution to improve the reliability of our shuffle-heavy jobs.
>>
>> Here is what our setup looks like
>>
>>    - Spark version: 3.3.1
>>    - Java version: 1.8
>>    - We do not use the external shuffle service
>>    - We use spot instances
>>
>> We run Spark jobs on clusters that use Amazon EBS volumes. The
>> spark.local.dir is mounted on this EBS volume. One of the offerings from
>> the service we use is EBS migration, which basically means that if a host
>> is about to get evicted, a new host is created and the EBS volume is
>> attached to it.
>>
>> When Spark assigns a new executor to the newly created instance, it
>> can basically recover all the shuffle files that are already persisted on
>> the migrated EBS volume.
>>
>> Is this how it works? Do executors recover / re-register the shuffle
>> files that they find?
>>
>> So far I have not come across any recovery mechanism. I can only see
>> KubernetesLocalDiskShuffleDataIO, which has a pre-init step where it tries
>> to register the available shuffle files to itself.
>>
>> A natural follow-up on this:
>>
>> If what they claim is true, then ideally we should expect that when an
>> executor is killed/OOM'd and a new executor is spawned on the same host,
>> the new executor registers the shuffle files to itself. Is that so?
>>
>> Thanks
>>
>> ------------------------------
>> Confidentiality note: This e-mail may contain confidential information
>> from Nu Holdings Ltd and/or its affiliates. If you have received it by
>> mistake, please let us know by e-mail reply and delete it from your system;
>> you may not copy this message or disclose its contents to anyone; for
>> details about what personal information we collect and why, please refer to
>> our privacy policy
>> <https://api.mziq.com/mzfilemanager/v2/d/59a081d2-0d63-4bb5-b786-4c07ae26bc74/6f4939b9-5f74-a528-1835-596b481dca54>
>> .
>>
>
