EnricoMi opened a new pull request, #51199:
URL: https://github.com/apache/spark/pull/51199

   ### What changes were proposed in this pull request?
   Shuffle data of individual shuffles is now deleted from the fallback storage 
during regular shuffle cleanup.
   
   ### Why are the changes needed?
   Currently, shuffle data is only removed from the fallback storage on 
Spark context shutdown. Long-running Spark jobs therefore accumulate shuffle data 
even though Spark no longer uses it. Those shuffles should be cleaned up 
while the Spark context is running.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Unit tests and manual test via [reproduction 
example](https://gist.github.com/EnricoMi/e9daa1176bce4c1211af3f3c5848112a/3140527bcbedec51ed2c571885db774c880cb941).
   
   Run the reproduction example without the ` <<< "$scala"`. In the Spark 
shell, execute this code:
   
   ```scala
   import org.apache.spark.sql.SaveMode

   val n = 100000000
   val j = spark.sparkContext.broadcast(1000)
   val x = spark.range(0, n, 1, 100).select($"id".cast("int"))
   x.as[Int]
    .mapPartitions { it => if (it.hasNext && it.next < n / 100 * 80) Thread.sleep(2000); it }
    .groupBy($"value" % 1000).as[Int, Int]
    .flatMapSortedGroups($"value") { case (m, it) => if (it.hasNext && it.next == 0) Thread.sleep(10000); it }
    .write.mode(SaveMode.Overwrite).csv("/tmp/spark.csv")
   ```
   This writes some data of shuffle 0 to the fallback storage.
   
   Invoking `System.gc()` removes that shuffle directory from the fallback 
storage. Exiting the Spark shell removes the whole application directory.
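   The `System.gc()` step works because Spark's `ContextCleaner` tracks shuffles 
via weak references: once the last strong reference to a shuffle's dependency is 
gone, a garbage collection enqueues the weak reference and the cleaner removes the 
shuffle's files. Below is a minimal plain-Scala sketch of that pattern, with no 
Spark dependency; all names here (`ShuffleHandle`, `simulateCleanup`) are 
illustrative, not Spark's actual API:

   ```scala
   import java.lang.ref.{Reference, ReferenceQueue, WeakReference}

   object CleanerSketch {
     // Stands in for a shuffle's dependency object; purely illustrative.
     final class ShuffleHandle(val id: Int)

     // Returns true once GC has enqueued the handle's weak reference,
     // i.e. the point at which a cleaner could delete the shuffle's files.
     def simulateCleanup(): Boolean = {
       val queue = new ReferenceQueue[ShuffleHandle]
       var handle = new ShuffleHandle(0)
       // The WeakReference itself must stay strongly reachable,
       // otherwise it would be collected along with the handle.
       val weak = new WeakReference(handle, queue)

       handle = null // drop the last strong reference, like a job finishing

       // System.gc() is only a hint, so retry until the reference
       // shows up on the queue (or give up after a bounded number of tries).
       var collected: Reference[_ <: ShuffleHandle] = null
       var tries = 0
       while (collected == null && tries < 50) {
         System.gc()
         Thread.sleep(10)
         collected = queue.poll()
         tries += 1
       }
       collected eq weak
     }

     def main(args: Array[String]): Unit =
       // With this PR, a real cleanup would also delete the shuffle's
       // copy in the fallback storage, not just its local files.
       println(if (simulateCleanup()) "shuffle 0 collected" else "not collected")
   }
   ```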
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

