Instead of External Shuffle Shufle, Apache Celeborn might be a good option as a 
Remote Shuffle Service for Spark on K8s.

There are some useful resources you might be interested in.

[1] https://celeborn.apache.org/
[2] https://www.youtube.com/watch?v=s5xOtG6Venw
[3] https://github.com/aws-samples/emr-remote-shuffle-service
[4] https://github.com/apache/celeborn/issues/2140

Thanks,
Cheng Pan


> On Apr 6, 2024, at 21:41, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
> 
> I have seen some older references for shuffle service for k8s,
> although it is not clear they are talking about a generic shuffle
> service for k8s.
> 
> Anyhow with the advent of genai and the need to allow for a larger
> volume of data, I was wondering if there has been any more work on
> this matter. Specifically larger and scalable file systems like HDFS,
> GCS , S3 etc, offer significantly larger storage capacity than local
> disks on individual worker nodes in a k8s cluster, thus allowing
> handling much larger datasets more efficiently. Also the degree of
> parallelism and fault tolerance  with these files systems come into
> it. I will be interested in hearing more about any progress on this.
> 
> Thanks
> .
> 
> Mich Talebzadeh,
> 
> Technologist | Solutions Architect | Data Engineer  | Generative AI
> 
> London
> United Kingdom
> 
> 
>   view my Linkedin profile
> 
> 
> https://en.everybodywiki.com/Mich_Talebzadeh
> 
> 
> 
> Disclaimer: The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
> 
> ---------------------------------------------------------------------
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 


---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

Reply via email to