You can try shuffling to S3 using the Cloud Shuffle Storage Plugin for Apache Spark
(https://aws.amazon.com/blogs/big-data/introducing-the-cloud-shuffle-storage-plugin-for-apache-spark/)
- for many Spark jobs the plugin's performance is sufficient (it also works on
EMR). You can then use S3 lifecycle policies to clean up/expire objects older
than one day
(https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html),
which also cleans up shuffle files left behind by crashed Spark jobs. Sketches
of both follow below.
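Enabling the plugin is mostly Spark configuration. A minimal sketch of the
driver-side setup, assuming the plugin JAR is already on the driver/executor
classpath - the bucket and prefix are placeholders, and the exact plugin
class/artifact for your EMR release is listed in the blog post:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // Route shuffle data to S3 instead of local disk (placeholder bucket/prefix).
    val conf = new SparkConf()
      .set("spark.shuffle.sort.io.plugin.class",
           "com.amazonaws.spark.shuffle.io.cloud.ChopperPlugin")
      .set("spark.shuffle.storage.path", "s3://my-shuffle-bucket/spark-shuffle/")

    val spark = SparkSession.builder().config(conf).getOrCreate()

The lifecycle rule can be set up in the console or CLI; here is a sketch as a
one-off Scala script using the AWS SDK for Java v2, with the same placeholder
bucket/prefix:

    import software.amazon.awssdk.services.s3.S3Client
    import software.amazon.awssdk.services.s3.model.{
      AbortIncompleteMultipartUpload, BucketLifecycleConfiguration,
      ExpirationStatus, LifecycleExpiration, LifecycleRule,
      LifecycleRuleFilter, PutBucketLifecycleConfigurationRequest
    }

    // Expire shuffle objects under the prefix after one day.
    val rule = LifecycleRule.builder()
      .id("expire-spark-shuffle")
      .filter(LifecycleRuleFilter.builder().prefix("spark-shuffle/").build())
      .status(ExpirationStatus.ENABLED)
      .expiration(LifecycleExpiration.builder().days(1).build())
      // Also reap incomplete multipart uploads left by crashed jobs.
      .abortIncompleteMultipartUpload(
        AbortIncompleteMultipartUpload.builder().daysAfterInitiation(1).build())
      .build()

    val s3 = S3Client.create()
    s3.putBucketLifecycleConfiguration(
      PutBucketLifecycleConfigurationRequest.builder()
        .bucket("my-shuffle-bucket")
        .lifecycleConfiguration(
          BucketLifecycleConfiguration.builder().rules(rule).build())
        .build())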

For shuffle on local disk you don't have many choices, as you mentioned. I
would, however, avoid a long-lived app that loops - that never works well with
Spark (it is designed for batch jobs that eventually stop). Maybe you can
simply trigger a new job whenever a new file arrives (S3 events?), as sketched
below.
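The per-file job itself then becomes an ordinary batch program that receives
its input path from whatever reacts to the S3 event (e.g. a Lambda that
submits an EMR step - that plumbing is not shown here). A minimal sketch; the
object name, key column, and output path are made up:

    import org.apache.spark.sql.SparkSession

    // One-shot job: process a single newly arrived file, then exit so YARN
    // tears down the executors and their blockmgr-* shuffle directories.
    object PerFileJob {
      def main(args: Array[String]): Unit = {
        val inputPath = args(0) // S3 path passed in by the trigger

        val spark = SparkSession.builder()
          .appName(s"per-file-job $inputPath")
          .getOrCreate()

        // Placeholder for the real 100 GB+ shuffle work.
        spark.read.parquet(inputPath)
          .groupBy("some_key")
          .count()
          .write.mode("overwrite").parquet(s"$inputPath.out")

        spark.stop()
      }
    }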

> On 18.02.2024 at 00:39, Saha, Daniel <dans...@amazon.com.invalid> wrote:
> 
> Hi,
>  
> Background: I am running into executor disk space issues when running a
> long-lived Spark 3.3 app with YARN on AWS EMR. The app performs back-to-back
> Spark jobs in a sequential loop, with each iteration performing 100 GB+
> shuffles. The files taking up the space are related to shuffle blocks [1].
> Disk is only cleared when restarting the YARN app. For all intents and
> purposes, each job is independent, so once a job/iteration is complete there
> is no need to retain its shuffle files. I want to try stopping and
> recreating the SparkContext between loop iterations/jobs to indicate to
> Spark's DiskBlockManager that these intermediate results are no longer
> needed [2].
>  
> Questions:
> 1. Are there better ways to remove/clean the directories containing these
> old, no-longer-used shuffle results (aside from cron or restarting the YARN
> app)?
> 2. How can I recreate the SparkContext within a single application? I see no
> methods on SparkSession for doing this, and each new SparkSession re-uses
> the existing SparkContext. After stopping the SparkContext, SparkSession
> does not re-create a new one. Further, creating a new SparkSession via the
> constructor and passing in a new SparkContext is not allowed, as the
> constructor is protected/private.
>  
> Thanks
> Daniel
>  
> [1] 
> /mnt/yarn/usercache/hadoop/appcache/application_1706835946137_0110/blockmgr-eda47882-56d6-4248-8e30-a959ddb912c5
> [2] https://stackoverflow.com/a/38791921
