You can try shuffling to S3 using the Cloud Shuffle Storage Plugin for Apache Spark (https://aws.amazon.com/blogs/big-data/introducing-the-cloud-shuffle-storage-plugin-for-apache-spark/) - for many Spark jobs its performance is sufficient, and it also works on EMR. You can then use S3 lifecycle policies to expire objects older than one day (https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html), which also cleans up shuffle files left behind by crashed Spark jobs.
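As a minimal sketch of enabling the plugin - the plugin class and property names here are the ones I remember from the AWS blog post above, and the bucket/prefix is a placeholder, so double-check against the current docs:

    from pyspark.sql import SparkSession

    # Shuffle data is written to S3 instead of local disk. The plugin JAR
    # must be on the driver/executor classpath (EMR releases that bundle
    # the plugin take care of this, per the blog post).
    spark = (
        SparkSession.builder
        .appName("shuffle-on-s3")
        .config("spark.shuffle.sort.io.plugin.class",
                "com.amazonaws.spark.shuffle.io.cloud.ChopperPlugin")
        .config("spark.shuffle.storage.path", "s3://my-bucket/spark-shuffle/")
        .getOrCreate()
    )

And the lifecycle rule could look something like this (boto3, same hypothetical bucket/prefix), expiring shuffle objects after one day:

    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-bucket",
        LifecycleConfiguration={
            "Rules": [{
                "ID": "expire-spark-shuffle",
                "Filter": {"Prefix": "spark-shuffle/"},
                "Status": "Enabled",
                # Objects older than one day are deleted, including leftovers
                # from crashed jobs that never cleaned up after themselves.
                "Expiration": {"Days": 1},
            }]
        },
    )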
For shuffle on disk you don't have many options, as you mentioned. I would, though, avoid a long-lived app that loops - that never works well with Spark (it is designed for batch jobs that eventually stop). Maybe you can simply trigger a new job whenever a new file arrives (S3 events?) - see the sketch below the quoted message.

> On 18.02.2024 at 00:39, Saha, Daniel <dans...@amazon.com.invalid> wrote:
>
> Hi,
>
> Background: I am running into executor disk space issues when running a long-lived Spark 3.3 app with YARN on AWS EMR. The app performs back-to-back Spark jobs in a sequential loop, with each iteration performing 100 GB+ shuffles. The files taking up the space are related to shuffle blocks [1]. Disk is only cleared when restarting the YARN app. For all intents and purposes, each job is independent, so once a job/iteration is complete, there is no need to retain these shuffle files. I want to try stopping and recreating the Spark context between loop iterations/jobs to indicate to the Spark DiskBlockManager that these intermediate results are no longer needed [2].
>
> Questions:
> Are there better ways to remove/clean the directory containing these old, no-longer-used shuffle results (aside from cron or restarting the YARN app)?
> How can the Spark context be recreated within a single application? I see no methods in SparkSession for doing this, and each new Spark session re-uses the existing Spark context. After stopping the SparkContext, SparkSession does not re-create a new one. Further, creating a new SparkSession via the constructor and passing in a new SparkContext is not allowed, as it is a protected/private method.
>
> Thanks
> Daniel
>
> [1] /mnt/yarn/usercache/hadoop/appcache/application_1706835946137_0110/blockmgr-eda47882-56d6-4248-8e30-a959ddb912c5
> [2] https://stackoverflow.com/a/38791921
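Here is what I meant by triggering a job per file: a sketch of a Lambda function, subscribed to the bucket's ObjectCreated notifications, that submits one Spark step per new file to an existing EMR cluster. The cluster ID, script path, and bucket names are placeholders:

    import boto3

    emr = boto3.client("emr")

    def handler(event, context):
        # One S3 event notification can carry several records.
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            emr.add_job_flow_steps(
                JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
                Steps=[{
                    "Name": f"process {key}",
                    "ActionOnFailure": "CONTINUE",
                    "HadoopJarStep": {
                        "Jar": "command-runner.jar",
                        "Args": ["spark-submit",
                                 "s3://my-bucket/jobs/process_file.py",
                                 f"s3://{bucket}/{key}"],
                    },
                }],
            )

Each step then runs as its own Spark application, so YARN deletes the application's appcache (including the blockmgr-* shuffle directories) when the app finishes, and nothing accumulates across iterations.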