Re: External Spark shuffle service for k8s

2024-04-11 Thread Bjørn Jørgensen
I think this answers your question about what to do if you need more space on nodes: https://spark.apache.org/docs/latest/running-on-kubernetes.html#local-storage ("Local Storage": Spark supports using volumes to spill
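The "Local Storage" mechanism referenced above can be sketched as a set of Spark properties. This is a minimal sketch assuming a hostPath volume; the key names follow the Spark-on-Kubernetes docs, and the `spark-local-dir-` volume-name prefix is what tells Spark to use the mount as scratch space for spill. The disk path is a placeholder.

```python
# Minimal sketch: mount a hostPath volume as executor scratch space for spill.
# Volume names starting with "spark-local-dir-" are used by Spark as local dirs.
spill_conf = {
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path": "/tmp/spark-local",
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.readOnly": "false",
    # Placeholder: path to a fast local disk on the node
    "spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path": "/mnt/fast-disk",
}
```

These properties would be passed via `--conf` on `spark-submit` or through `SparkSession.builder.config(...)`.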

Re: External Spark shuffle service for k8s

2024-04-11 Thread Bjørn Jørgensen
" In the end for my usecase I started using pvcs and pvc aware scheduling along with decommissioning. So far performance is good with this choice." How did you do this? tor. 11. apr. 2024 kl. 04:13 skrev Arun Ravi : > Hi Everyone, > > I had to explored IBM's and AWS's S3 shuffle plugins (some

Re: External Spark shuffle service for k8s

2024-04-10 Thread Arun Ravi
Hi Everyone, I had explored IBM's and AWS's S3 shuffle plugins (some time back), and I had also explored AWS FSx Lustre in a few of my production jobs, which have ~20TB of shuffle operations with 200-300 executors. What I have observed is that S3 and FSx behaviour was fine during the write phase, however I

Re: External Spark shuffle service for k8s

2024-04-08 Thread Mich Talebzadeh
Hi, First, thanks everyone for their contributions. I was going to reply to @Enrico Minack but noticed additional info. As I understand it, for example, Apache Uniffle is an incubating project aimed at providing a pluggable shuffle service for Spark. So basically, all these "external shuffle

Re: External Spark shuffle service for k8s

2024-04-08 Thread Vakaris Baškirov
I see that both Uniffle and Celeborn support S3/HDFS backends, which is great. In case someone is using S3/HDFS, I wonder what the advantages would be of using Celeborn or Uniffle vs the IBM shuffle service plugin or the Cloud Shuffle Storage Plugin from AWS

Re: External Spark shuffle service for k8s

2024-04-08 Thread roryqi
Apache Uniffle (incubating) may be another solution. You can see https://github.com/apache/incubator-uniffle https://uniffle.apache.org/blog/2023/07/21/Uniffle%20-%20New%20chapter%20for%20the%20shuffle%20in%20the%20cloud%20native%20era Mich Talebzadeh wrote on Mon, 8 Apr 2024 at 07:15: > Splendid > > The

Re: External Spark shuffle service for k8s

2024-04-07 Thread Enrico Minack
There is the Apache incubator project Uniffle: https://github.com/apache/incubator-uniffle It stores shuffle data on remote servers in memory, on local disk, and in HDFS. Cheers, Enrico On 06.04.24 at 15:41, Mich Talebzadeh wrote: I have seen some older references for shuffle service for k8s,
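Pointing a Spark job at a Uniffle cluster boils down to swapping the shuffle manager. A hedged sketch: the class and key names below are taken from the Uniffle quick-start docs and may differ across versions, and the coordinator addresses are placeholders.

```python
# Sketch: route Spark shuffle through a Uniffle cluster.
uniffle_conf = {
    # Uniffle's Spark client replaces the built-in shuffle manager
    "spark.shuffle.manager": "org.apache.spark.shuffle.RssShuffleManager",
    # Placeholder coordinator quorum; use your cluster's coordinator addresses
    "spark.rss.coordinator.quorum": "uniffle-coordinator-0:19999,uniffle-coordinator-1:19999",
}
```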

Re: External Spark shuffle service for k8s

2024-04-07 Thread Mich Talebzadeh
Thanks Cheng for the heads up. I will have a look. Cheers Mich Talebzadeh, Technologist | Solutions Architect | Data Engineer | Generative AI London United Kingdom view my Linkedin profile

Re: External Spark shuffle service for k8s

2024-04-07 Thread Vakaris Baškirov
There is an IBM shuffle service plugin that supports S3: https://github.com/IBM/spark-s3-shuffle Though I would think a feature like this could be part of the main Spark repo. Trino already has out-of-the-box support for S3 exchange (shuffle) and it's very useful. Vakaris On Sun, Apr 7, 2024 at
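Enabling a plugin like this is mostly configuration. A hedged sketch: the class and key names below are assumptions based on the spark-s3-shuffle README (the plugin hooks into Spark's shuffle IO plugin interface), so check the repo for the exact settings for your Spark version; the bucket path is a placeholder.

```python
# Sketch (assumed keys, see plugin README): write shuffle data to S3.
s3_shuffle_conf = {
    # Assumed plugin class via Spark's shuffle IO plugin interface
    "spark.shuffle.sort.io.plugin.class": "org.apache.spark.shuffle.S3ShuffleDataIO",
    # Placeholder bucket/prefix for shuffle files
    "spark.shuffle.s3.rootDir": "s3a://my-shuffle-bucket/shuffle",
}
```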

Re: External Spark shuffle service for k8s

2024-04-07 Thread Cheng Pan
Instead of the External Shuffle Service, Apache Celeborn might be a good option as a Remote Shuffle Service for Spark on K8s. There are some useful resources you might be interested in. [1] https://celeborn.apache.org/ [2] https://www.youtube.com/watch?v=s5xOtG6Venw [3]
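Wiring Spark to Celeborn is again a shuffle-manager swap. A hedged sketch: the class and key names follow the Celeborn docs for recent Spark 3.x clients (older releases used a different manager class), so verify against your Celeborn version; the master endpoint is a placeholder.

```python
# Sketch: use Celeborn as the remote shuffle service for a Spark job.
celeborn_conf = {
    "spark.shuffle.manager": "org.apache.spark.shuffle.celeborn.SparkShuffleManager",
    # Placeholder: address of the Celeborn master
    "spark.celeborn.master.endpoints": "celeborn-master:9097",
    # Celeborn replaces the external shuffle service, so keep it disabled
    "spark.shuffle.service.enabled": "false",
}
```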

Re: External Spark shuffle service for k8s

2024-04-07 Thread Mich Talebzadeh
Splendid. The configurations below can be used with k8s deployments of Spark. Spark applications running on k8s can use these configurations to seamlessly access data stored in Google Cloud Storage (GCS) and Amazon S3. For Google GCS we may have spark_config_gcs = {
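A sketch of the kind of GCS and S3 access configuration the message refers to (the original dict is truncated above). The key names come from the GCS connector and Hadoop S3A docs; the keyfile path and credentials are placeholders, and in practice credentials would come from a secret or instance role rather than literals.

```python
# Sketch: Hadoop filesystem settings for GCS and S3 access from Spark on k8s.
spark_config_gcs = {
    "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    "spark.hadoop.google.cloud.auth.service.account.enable": "true",
    # Placeholder path to a mounted service-account key
    "spark.hadoop.google.cloud.auth.service.account.json.keyfile": "/secrets/gcs-key.json",
}
spark_config_s3 = {
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    # Placeholder credentials; prefer IAM roles or secrets in practice
    "spark.hadoop.fs.s3a.access.key": "PLACEHOLDER_ACCESS_KEY",
    "spark.hadoop.fs.s3a.secret.key": "PLACEHOLDER_SECRET_KEY",
}
```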

Re: External Spark shuffle service for k8s

2024-04-06 Thread Mich Talebzadeh
Thanks for your suggestion, which I take as a workaround. Whilst this workaround can potentially address storage allocation issues, I was more interested in exploring solutions that offer more seamless integration with large distributed file systems like HDFS, GCS, or S3. This would ensure

Re: External Spark shuffle service for k8s

2024-04-06 Thread Bjørn Jørgensen
You can create a PVC on K8s and call it 300gb. Make a folder in your Dockerfile: WORKDIR /opt/spark/work-dir RUN chmod g+w /opt/spark/work-dir Then start Spark adding this: .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.options.claimName", "300gb") \
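The snippet above is truncated; a hedged sketch of the full set of volume properties it implies is below. The executor-side keys and the mount path matching the Dockerfile's WORKDIR are assumptions; the key names follow the Spark-on-Kubernetes volume docs.

```python
# Sketch: mount the "300gb" PVC on both driver and executors at the work dir.
pvc_conf = {
    "spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.options.claimName": "300gb",
    "spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.path": "/opt/spark/work-dir",
    # Assumed: executors typically need the same mount as the driver
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.options.claimName": "300gb",
    "spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.path": "/opt/spark/work-dir",
}
```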

External Spark shuffle service for k8s

2024-04-06 Thread Mich Talebzadeh
I have seen some older references to a shuffle service for k8s, although it is not clear whether they are talking about a generic shuffle service for k8s. Anyhow, with the advent of GenAI and the need to handle larger volumes of data, I was wondering whether there has been any more work on this matter.