Re: Re-create SparkContext of SparkSession inside long-lived Spark app

2024-02-19 Thread Mich Talebzadeh
Daniel,

[1] https://github.com/apache/spark/blob/8f5a647b0bbb6e83ee484091d3422b24baea5a80/core/src/main/scala/org/apache/spark/storage/DiskBlockManager.scala#L369
[2] https://github.com/apache/spark/blob/c4e4497ff7e747eb71d087cdfb1b51673c53b83b/core/src/main/sc

Re: Re-create SparkContext of SparkSession inside long-lived Spark app

2024-02-19 Thread Saha, Daniel
, February 18, 2024 at 1:38 AM Cc: "user@spark.apache.org" Subject: RE: [EXTERNAL] Re-create SparkContext of SparkSession inside long-lived Spark app

Re: Re-create SparkContext of SparkSession inside long-lived Spark app

2024-02-18 Thread Mich Talebzadeh
Hi, what do you propose, or what do you think will help, when these Spark jobs are independent of each other? --> So once a job/iterator is complete, there is no need to retain these shuffle files. You have a number of options to consider, starting with the Spark configuration parameters and so forth: https://spa
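As a sketch of the kind of configuration parameters being referred to here, the properties below are from the standard Spark configuration documentation; the values shown are illustrative, not recommendations, and whether they help depends on why the shuffle files are being retained:

```properties
# Periodically force a driver-side GC so the ContextCleaner notices
# unreferenced shuffles and removes their files (default: 30min).
spark.cleaner.periodicGC.interval                 15min

# Track RDD/shuffle/broadcast references and clean them up automatically
# (this is already on by default).
spark.cleaner.referenceTracking                   true

# Whether the cleaner thread blocks on shuffle cleanup (off by default).
spark.cleaner.referenceTracking.blocking.shuffle  false
```

Note that this automatic cleanup only triggers once the RDD/DataFrame that produced the shuffle is garbage-collected on the driver, so in a long-lived app it also helps to drop references (e.g. unpersist and let variables go out of scope) when each independent job finishes.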

Re: Re-create SparkContext of SparkSession inside long-lived Spark app

2024-02-17 Thread Jörn Franke
You can try shuffling to S3 using the cloud shuffle plugin for S3 (https://aws.amazon.com/blogs/big-data/introducing-the-cloud-shuffle-storage-plugin-for-apache-spark/) - the performance of the new plugin is sufficient for many Spark jobs (it also works on EMR). Then you can use S3 lifecycle po
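To illustrate the S3 lifecycle-policy half of this suggestion: a minimal lifecycle rule that expires objects under the shuffle prefix after one day could look like the following (the bucket prefix `spark-shuffle/` and rule ID are made up for the example):

```json
{
  "Rules": [
    {
      "ID": "expire-spark-shuffle",
      "Status": "Enabled",
      "Filter": { "Prefix": "spark-shuffle/" },
      "Expiration": { "Days": 1 }
    }
  ]
}
```

The plugin side is configured via `spark.shuffle.sort.io.plugin.class` and a shuffle storage path pointing at the same bucket/prefix, per the linked AWS post; verify the exact property names against the current plugin documentation for your EMR/Glue version.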

Re: Re-create SparkContext of SparkSession inside long-lived Spark app

2024-02-17 Thread Adam Binford
If you're using dynamic allocation, this could be caused by executors with shuffle data being deallocated before the shuffle is cleaned up. Once that happens, those shuffle files will never be cleaned up until the YARN application ends. This was a big issue for us, so I added support for deleting shuff
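A sketch of the configuration this interaction involves, taken from the standard Spark configuration documentation (it is an assumption that the cleanup support being described is the shuffle-service-side removal available since Spark 3.3):

```properties
# Under dynamic allocation, keep executors that hold live shuffle data
# instead of deallocating them immediately when idle.
spark.dynamicAllocation.enabled                   true
spark.dynamicAllocation.shuffleTracking.enabled   true
# Deallocate shuffle-holding executors after this timeout regardless.
spark.dynamicAllocation.shuffleTracking.timeout   1h

# Spark 3.3+: allow the external shuffle service to delete shuffle
# blocks once the corresponding shuffle is no longer referenced.
spark.shuffle.service.enabled                     true
spark.shuffle.service.removeShuffle               true
```

With shuffle tracking enabled, executors holding shuffle data survive idle deallocation until their shuffles are cleaned up; the shuffle-service option covers the case where shuffle files outlive their executor.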