EnricoMi opened a new pull request, #51199: URL: https://github.com/apache/spark/pull/51199
### What changes were proposed in this pull request?

Shuffle data of individual shuffles is deleted from the fallback storage during regular shuffle cleanup.

### Why are the changes needed?

Currently, shuffle data is only removed from the fallback storage on Spark context shutdown. Long-running Spark jobs therefore accumulate shuffle data in the fallback storage even though Spark no longer uses it. Those shuffles should be cleaned up while the Spark context is running.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests and a manual test via this [reproduction example](https://gist.github.com/EnricoMi/e9daa1176bce4c1211af3f3c5848112a/3140527bcbedec51ed2c571885db774c880cb941). Run the reproduction example without the ` <<< "$scala"`. In the Spark shell, execute this code:

```scala
import org.apache.spark.sql.SaveMode

val n = 100000000
val j = spark.sparkContext.broadcast(1000)
val x = spark.range(0, n, 1, 100).select($"id".cast("int"))

x.as[Int]
  .mapPartitions { it => if (it.hasNext && it.next < n / 100 * 80) Thread.sleep(2000); it }
  .groupBy($"value" % 1000).as[Int, Int]
  .flatMapSortedGroups($"value") { case (m, it) => if (it.hasNext && it.next == 0) Thread.sleep(10000); it }
  .write.mode(SaveMode.Overwrite).csv("/tmp/spark.csv")
```

This writes some data of shuffle 0 to the fallback storage. Invoking `System.gc()` removes that shuffle directory from the fallback storage. Exiting the Spark shell removes the whole application directory.

### Was this patch authored or co-authored using generative AI tooling?

No.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
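For readers who want to try the reproduction: the fallback storage has to be configured before any shuffle data can land there. A minimal sketch of the relevant `spark-defaults.conf` settings, assuming Spark 3.x decommission config names (the path is an example, not from this PR; see the gist above for the author's exact setup):

```properties
# Enable decommissioning and migration of shuffle blocks off dying executors
spark.decommission.enabled                          true
spark.storage.decommission.enabled                  true
spark.storage.decommission.shuffleBlocks.enabled    true

# Fallback storage target used when no peer executor can take the blocks
# (example path -- point this at a reliable store such as HDFS or S3)
spark.storage.decommission.fallbackStorage.path     hdfs://namenode/spark-fallback/

# Existing behaviour this PR complements: clean up the fallback storage
# for the whole application on Spark context shutdown
spark.storage.decommission.fallbackStorage.cleanUp  true
```

With this in place, the shuffle cleanup added by the PR removes per-shuffle directories under the fallback path while the application is still running, rather than only at shutdown.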
