I see that both Uniffle and Celeborn support S3/HDFS backends, which is great. In the case where someone is using S3/HDFS, I wonder what the advantages would be of using Celeborn or Uniffle vs the IBM shuffle service plugin <https://github.com/IBM/spark-s3-shuffle> or the Cloud Shuffle Storage Plugin from AWS <https://docs.aws.amazon.com/glue/latest/dg/cloud-shuffle-storage-plugin.html>?
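For reference, the IBM plugin is enabled purely through Spark configuration once its jar is on the classpath (e.g. via --jars or spark.jars). A rough sketch based on my reading of its README, so the exact keys may differ between plugin versions, and the bucket name here is a placeholder:

spark_config_s3_shuffle = {
    # plugin class that redirects shuffle writes to the object store
    "spark.shuffle.sort.io.plugin.class": "org.apache.spark.shuffle.S3ShuffleDataIO",
    # root location where shuffle blocks are written (placeholder bucket)
    "spark.shuffle.s3.rootDir": "s3a://some-bucket/shuffle",
}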
These plugins do not require deploying a separate service. Are there any advantages to using Uniffle/Celeborn with an S3 backend, given that they require deploying a separate service?

Thanks
Vakaris

On Mon, Apr 8, 2024 at 10:03 AM roryqi <ror...@apache.org> wrote:

> Apache Uniffle (incubating) may be another solution.
> You can see
> https://github.com/apache/incubator-uniffle
> https://uniffle.apache.org/blog/2023/07/21/Uniffle%20-%20New%20chapter%20for%20the%20shuffle%20in%20the%20cloud%20native%20era
>
> Mich Talebzadeh <mich.talebza...@gmail.com> wrote on Mon, Apr 8, 2024 at 07:15:
>
>> Splendid
>>
>> The configurations below can be used with k8s deployments of Spark. Spark
>> applications running on k8s can utilize these configurations to seamlessly
>> access data stored in Google Cloud Storage (GCS) and Amazon S3.
>>
>> For Google GCS we may have
>>
>> spark_config_gcs = {
>>     "spark.kubernetes.authenticate.driver.serviceAccountName": "service_account_name",
>>     "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
>>     "spark.hadoop.google.cloud.auth.service.account.enable": "true",
>>     "spark.hadoop.google.cloud.auth.service.account.json.keyfile": "/path/to/keyfile.json",
>> }
>>
>> For Amazon S3, similarly
>>
>> spark_config_s3 = {
>>     "spark.kubernetes.authenticate.driver.serviceAccountName": "service_account_name",
>>     "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
>>     "spark.hadoop.fs.s3a.access.key": "s3_access_key",
>>     "spark.hadoop.fs.s3a.secret.key": "secret_key",
>> }
>>
>> To implement these configurations and enable Spark applications to
>> interact with GCS and S3, I guess we can approach it this way
>>
>> 1) Spark Repository Integration: These configurations need to be added to
>> the Spark repository as part of the supported configuration options for
>> k8s deployments.
>>
>> 2) Configuration Settings: Users need to specify these configurations
>> when submitting Spark applications to a Kubernetes cluster. They can
>> include these configurations in the Spark application code or pass them
>> as command-line arguments or environment variables during application
>> submission.
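>> As a minimal sketch of option 2), either dictionary above can be applied
>> to a SparkSession builder in a loop (the app name here is an illustrative
>> placeholder):
>>
>> from pyspark.sql import SparkSession
>>
>> builder = SparkSession.builder.appName("gcs-s3-access-demo")
>> # apply whichever dictionary matches the target object store
>> for key, value in spark_config_s3.items():
>>     builder = builder.config(key, value)
>> spark = builder.getOrCreate()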
>> HTH
>>
>> Mich Talebzadeh,
>> Technologist | Solutions Architect | Data Engineer | Generative AI
>> London
>> United Kingdom
>>
>> view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed. It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Wernher von Braun
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>
>> On Sun, 7 Apr 2024 at 13:31, Vakaris Baškirov <vakaris.bashki...@gmail.com> wrote:
>>
>>> There is an IBM shuffle service plugin that supports S3
>>> https://github.com/IBM/spark-s3-shuffle
>>>
>>> Though I would think a feature like this could be part of the main
>>> Spark repo. Trino already has out-of-the-box support for S3 exchange
>>> (shuffle), and it's very useful.
>>>
>>> Vakaris
>>>
>>> On Sun, Apr 7, 2024 at 12:27 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> Thanks for your suggestion, which I take as a workaround. Whilst this
>>>> workaround can potentially address storage allocation issues, I was more
>>>> interested in exploring solutions that offer a more seamless integration
>>>> with large distributed file systems like HDFS, GCS, or S3. This would
>>>> ensure better performance and scalability for handling larger datasets
>>>> efficiently.
>>>>
>>>> Mich Talebzadeh,
>>>> Technologist | Solutions Architect | Data Engineer | Generative AI
>>>> London
>>>> United Kingdom
>>>>
>>>> view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>> *Disclaimer:* The information provided is correct to the best of my
>>>> knowledge but of course cannot be guaranteed. It is essential to note
>>>> that, as with any advice, quote "one test result is worth one-thousand
>>>> expert opinions (Wernher von Braun
>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>>>
>>>> On Sat, 6 Apr 2024 at 21:28, Bjørn Jørgensen <bjornjorgen...@gmail.com>
>>>> wrote:
>>>>
>>>>> You can make a PVC on K8S; call it 300gb.
>>>>>
>>>>> Make a folder in your Dockerfile:
>>>>>
>>>>> WORKDIR /opt/spark/work-dir
>>>>> RUN chmod g+w /opt/spark/work-dir
>>>>>
>>>>> Start Spark with this added:
>>>>>
>>>>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.options.claimName", "300gb") \
>>>>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.path", "/opt/spark/work-dir") \
>>>>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.readOnly", "False") \
>>>>> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.options.claimName", "300gb") \
>>>>> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.path", "/opt/spark/work-dir") \
>>>>> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.readOnly", "False") \
>>>>> .config("spark.local.dir", "/opt/spark/work-dir")
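>>>>> The same six settings can also be generated in a small loop. A minimal
>>>>> sketch, assuming the PVC named "300gb" already exists in the namespace
>>>>> and the image contains /opt/spark/work-dir as above:
>>>>>
>>>>> from pyspark.sql import SparkSession
>>>>>
>>>>> builder = SparkSession.builder.appName("pvc-scratch-demo")
>>>>> # mount the same claim on both the driver and the executors
>>>>> for role in ("driver", "executor"):
>>>>>     prefix = f"spark.kubernetes.{role}.volumes.persistentVolumeClaim.300gb"
>>>>>     builder = (
>>>>>         builder.config(f"{prefix}.options.claimName", "300gb")
>>>>>         .config(f"{prefix}.mount.path", "/opt/spark/work-dir")
>>>>>         .config(f"{prefix}.mount.readOnly", "False")
>>>>>     )
>>>>> # point scratch/shuffle space at the mounted volume
>>>>> spark = builder.config("spark.local.dir", "/opt/spark/work-dir").getOrCreate()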
>>>>> On Sat, 6 Apr 2024 at 15:45, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> I have seen some older references to a shuffle service for k8s,
>>>>>> although it is not clear whether they are talking about a generic
>>>>>> shuffle service for k8s.
>>>>>>
>>>>>> Anyhow, with the advent of GenAI and the need to allow for larger
>>>>>> volumes of data, I was wondering if there has been any more work on
>>>>>> this matter. Specifically, larger and scalable file systems like HDFS,
>>>>>> GCS, S3 etc. offer significantly larger storage capacity than local
>>>>>> disks on individual worker nodes in a k8s cluster, thus allowing much
>>>>>> larger datasets to be handled more efficiently. The degree of
>>>>>> parallelism and fault tolerance of these file systems also comes into
>>>>>> it. I would be interested in hearing about any progress on this.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Mich Talebzadeh,
>>>>>> Technologist | Solutions Architect | Data Engineer | Generative AI
>>>>>> London
>>>>>> United Kingdom
>>>>>>
>>>>>> view my Linkedin profile
>>>>>>
>>>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>
>>>>>> Disclaimer: The information provided is correct to the best of my
>>>>>> knowledge but of course cannot be guaranteed. It is essential to note
>>>>>> that, as with any advice, quote "one test result is worth one-thousand
>>>>>> expert opinions (Wernher von Braun)".
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>>
>>>>> --
>>>>> Bjørn Jørgensen
>>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>>> Norge
>>>>>
>>>>> +47 480 94 297