The following link has all the required details:
https://aws.amazon.com/blogs/containers/best-practices-for-running-spark-on-amazon-eks/
Let me know if you need any further information.

Regards,
Vaquar khan

On Mon, May 15, 2023, 10:14 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:

> A couple of points:
>
> Why use spot or preemptible instances when, as you stated, your
> application shuffles heavily?
> Have you looked at why you are having these shuffles? What is causing
> these large transformations to end up in a shuffle?
>
> Also, on your point:
> "..then ideally we should expect that when an executor is killed/OOM'd
> and a new executor is spawned on the same host, the new executor registers
> the shuffle files to itself. Is that so?"
>
> What guarantee is there that the new executor with inherited shuffle
> files will succeed?
>
> Also, OOM is often associated with some form of skewed data.
>
> HTH
>
> Mich Talebzadeh,
> Lead Solutions Architect/Engineering Lead
> Palantir Technologies Limited
> London
> United Kingdom
>
> On Mon, 15 May 2023 at 13:11, Faiz Halde <faiz.ha...@nubank.com.br.invalid>
> wrote:
>
>> Hello,
>>
>> We've been in touch with a few Spark specialists who suggested a
>> potential solution to improve the reliability of our shuffle-heavy jobs.
>>
>> Here is what our setup looks like:
>>
>> - Spark version: 3.3.1
>> - Java version: 1.8
>> - We do not use the external shuffle service
>> - We use spot instances
>>
>> We run Spark jobs on clusters that use Amazon EBS volumes. The
>> spark.local.dir is mounted on this EBS volume. One of the offerings from
>> the service we use is EBS migration, which basically means that if a host
>> is about to get evicted, a new host is created and the EBS volume is
>> attached to it.
>>
>> When Spark assigns a new executor to the newly created instance, it
>> can basically recover all the shuffle files that are already persisted in
>> the migrated EBS volume.
>>
>> Is this how it works? Do executors recover / re-register the shuffle
>> files that they find?
>>
>> So far I have not come across any recovery mechanism. I can only see
>>
>> KubernetesLocalDiskShuffleDataIO
>>
>> which has a pre-init step where it tries to register the available
>> shuffle files to itself.
>>
>> A natural follow-up on this:
>>
>> If what they claim is true, then ideally we should expect that when an
>> executor is killed/OOM'd and a new executor is spawned on the same host,
>> the new executor registers the shuffle files to itself. Is that so?
>>
>> Thanks
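
For reference, a minimal sketch of the settings that, as far as I know, opt a job into the KubernetesLocalDiskShuffleDataIO path mentioned above. It assumes the job runs on Spark's native Kubernetes scheduler and that the local directory is exposed to executors as a PersistentVolumeClaim, neither of which is confirmed in this thread. The setting names follow the Spark on Kubernetes documentation; the storage class, size and mount path are placeholder values, and `to_submit_args` is a hypothetical helper that only renders the flags. Whether the recovered files are actually usable after an EBS migration is a separate question that this does not answer.

```python
# Sketch only: values marked "placeholder" are assumptions, not from the thread.

# Prefix for an executor volume; the "spark-local-dir-" name prefix is what
# makes Spark treat the mounted volume as a local (shuffle/spill) directory.
PVC = "spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1."

SHUFFLE_RECOVERY_CONF = {
    # Let the driver own executor PVCs and hand them to replacement executors.
    "spark.kubernetes.driver.ownPersistentVolumeClaim": "true",
    "spark.kubernetes.driver.reusePersistentVolumeClaim": "true",
    # Shuffle IO plugin that scans the local dir for existing shuffle files
    # when an executor initializes.
    "spark.shuffle.sort.io.plugin.class":
        "org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIO",
    PVC + "options.claimName": "OnDemand",
    PVC + "options.storageClass": "gp2",   # placeholder
    PVC + "options.sizeLimit": "500Gi",    # placeholder
    PVC + "mount.path": "/data",           # placeholder
    PVC + "mount.readOnly": "false",
}


def to_submit_args(conf):
    """Render a settings dict as a flat list of spark-submit --conf arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    # Print the flags so they can be pasted into a spark-submit command line.
    print(" ".join(to_submit_args(SHUFFLE_RECOVERY_CONF)))
```

Keeping the settings in one dict makes it easy to diff them against what the cluster service already sets before changing anything.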