Daniel
>
> [1] https://github.com/apache/spark/blob/8f5a647b0bbb6e83ee484091d3422b24baea5a80/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala#L369
>
> [2] https://github.com/apache/spark/blob/c4e4497ff7e747eb71d087cdfb1b51673c53b83b/core/src/main/sc
February 18, 2024 at 1:38 AM
Cc: "user@spark.apache.org"
Subject: RE: [EXTERNAL] Re-create SparkContext of SparkSession inside long-lived Spark app
Hi,
What do you propose, or what do you think will help, when these Spark jobs are independent of each other? Once a job/iteration is complete, there is no need to retain its shuffle files. You have a number of options to consider, starting with Spark configuration parameters:
https://spa
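
(Not part of the original reply, but as a minimal sketch of the configuration route: the property names below are stock Spark settings, while the values are only illustrative and assume a long-lived app on Spark 3.x.)

import org.apache.spark.sql.SparkSession

// Sketch only: real Spark property names, illustrative values.
val spark = SparkSession.builder()
  .appName("long-lived-app")
  // Force a periodic JVM GC so the ContextCleaner notices shuffle
  // dependencies that are no longer referenced and deletes their files.
  .config("spark.cleaner.periodicGC.interval", "15min")
  // Block on shuffle cleanup so files are actually gone before the
  // next iteration starts writing new ones.
  .config("spark.cleaner.referenceTracking.blocking.shuffle", "true")
  .getOrCreate()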
You can try shuffling to S3 using the cloud shuffle storage plugin for S3
(https://aws.amazon.com/blogs/big-data/introducing-the-cloud-shuffle-storage-plugin-for-apache-spark/)
- the performance of the new plugin is sufficient for many Spark jobs (it
also works on EMR). Then you can use S3 lifecycle policies to clean up the
shuffle files automatically.
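
(For illustration: a sketch of enabling that plugin, with the class and property names as I read them in the AWS post above - worth double-checking against it - and a placeholder bucket prefix that an S3 lifecycle rule can then expire after a few days.)

import org.apache.spark.sql.SparkSession

// Sketch; assumes the Cloud Shuffle Storage Plugin jar is on the classpath.
val spark = SparkSession.builder()
  // Route shuffle I/O through the plugin instead of local disk.
  .config("spark.shuffle.sort.io.plugin.class",
          "com.amazonaws.spark.shuffle.io.cloud.ChopperPlugin")
  // Placeholder prefix; point a lifecycle expiration rule at it.
  .config("spark.shuffle.storage.path", "s3://my-bucket/shuffle/")
  .getOrCreate()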
If you're using dynamic allocation, this can be caused by executors holding
shuffle data being deallocated before the shuffle is cleaned up. Once that
happens, those shuffle files never get cleaned up until the YARN application
ends. This was a big issue for us, so I added support for deleting shuffle
files of deallocated executors.
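
(The thread doesn't say which patch this refers to, but stock Spark has related knobs; a sketch, assuming Spark 3.3+ with the external shuffle service enabled, not necessarily the poster's own fix.)

import org.apache.spark.sql.SparkSession

// Sketch of stock settings that address the same failure mode.
val spark = SparkSession.builder()
  // Keep executors with live shuffle output from being released too early.
  .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
  // Let the external shuffle service delete shuffle blocks of deallocated
  // executors once the shuffle is deregistered (SPARK-37618, Spark 3.3+).
  .config("spark.shuffle.service.removeShuffle", "true")
  .getOrCreate()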