Apache Uniffle (incubating) may be another solution. See:
https://github.com/apache/incubator-uniffle
https://uniffle.apache.org/blog/2023/07/21/Uniffle%20-%20New%20chapter%20for%20the%20shuffle%20in%20the%20cloud%20native%20era
Mich Talebzadeh <mich.talebza...@gmail.com> wrote on Mon, 8 Apr 2024 at 07:15:

> Splendid
>
> The configurations below can be used with k8s deployments of Spark. Spark
> applications running on k8s can utilize these configurations to seamlessly
> access data stored in Google Cloud Storage (GCS) and Amazon S3.
>
> For Google GCS we may have:
>
> spark_config_gcs = {
>     "spark.kubernetes.authenticate.driver.serviceAccountName": "service_account_name",
>     "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
>     "spark.hadoop.google.cloud.auth.service.account.enable": "true",
>     "spark.hadoop.google.cloud.auth.service.account.json.keyfile": "/path/to/keyfile.json",
> }
>
> For Amazon S3, similarly:
>
> spark_config_s3 = {
>     "spark.kubernetes.authenticate.driver.serviceAccountName": "service_account_name",
>     "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
>     "spark.hadoop.fs.s3a.access.key": "s3_access_key",
>     "spark.hadoop.fs.s3a.secret.key": "secret_key",
> }
>
> To implement these configurations and enable Spark applications to
> interact with GCS and S3, I guess we can approach it this way:
>
> 1) Spark repository integration: these configurations need to be added to
>    the Spark repository as part of the supported configuration options for
>    k8s deployments.
>
> 2) Configuration settings: users need to specify these configurations when
>    submitting Spark applications to a Kubernetes cluster. They can include
>    them in the Spark application code or pass them as command-line
>    arguments or environment variables during application submission.
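[Editor's note: the two dicts above can be folded into one submit-time helper. A minimal pure-Python sketch, not from the thread; the service account name, keyfile path, and S3 keys are the placeholders used in the mail, not real values, and the PySpark usage at the end is commented out since it assumes a running cluster.]

```python
def build_spark_conf(store: str) -> dict:
    """Return submit-time Spark settings for 'gcs' or 's3' on k8s.

    Merges the shared k8s service-account setting with the per-store
    Hadoop connector settings from the mail above. All credential
    values are placeholders.
    """
    base = {
        "spark.kubernetes.authenticate.driver.serviceAccountName":
            "service_account_name",
    }
    if store == "gcs":
        base.update({
            "spark.hadoop.fs.gs.impl":
                "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
            "spark.hadoop.google.cloud.auth.service.account.enable": "true",
            "spark.hadoop.google.cloud.auth.service.account.json.keyfile":
                "/path/to/keyfile.json",
        })
    elif store == "s3":
        base.update({
            "spark.hadoop.fs.s3a.impl":
                "org.apache.hadoop.fs.s3a.S3AFileSystem",
            "spark.hadoop.fs.s3a.access.key": "s3_access_key",
            "spark.hadoop.fs.s3a.secret.key": "secret_key",
        })
    else:
        raise ValueError(f"unknown store: {store}")
    return base

# With pyspark installed, the dict would be applied roughly like:
#   builder = SparkSession.builder
#   for k, v in build_spark_conf("gcs").items():
#       builder = builder.config(k, v)
```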
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Solutions Architect | Data Engineer | Generative AI
> London
> United Kingdom
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, "one test result is worth one-thousand expert
> opinions" (Wernher von Braun
> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>
> On Sun, 7 Apr 2024 at 13:31, Vakaris Baškirov <vakaris.bashki...@gmail.com>
> wrote:
>
>> There is an IBM shuffle service plugin that supports S3:
>> https://github.com/IBM/spark-s3-shuffle
>>
>> Though I would think a feature like this could be part of the main Spark
>> repo. Trino already has out-of-the-box support for S3 exchange (shuffle)
>> and it's very useful.
>>
>> Vakaris
>>
>> On Sun, Apr 7, 2024 at 12:27 PM Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>> Thanks for your suggestion, which I take as a workaround. Whilst this
>>> workaround can potentially address storage allocation issues, I was more
>>> interested in exploring solutions that offer a more seamless integration
>>> with large distributed file systems like HDFS, GCS, or S3. This would
>>> ensure better performance and scalability for handling larger datasets
>>> efficiently.
>>>
>>> Mich Talebzadeh,
>>> Technologist | Solutions Architect | Data Engineer | Generative AI
>>> London
>>> United Kingdom
>>>
>>> view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed.
>>> It is essential to note that, as with any advice, "one test result is
>>> worth one-thousand expert opinions" (Wernher von Braun
>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>>
>>> On Sat, 6 Apr 2024 at 21:28, Bjørn Jørgensen <bjornjorgen...@gmail.com>
>>> wrote:
>>>
>>>> You can make a PVC on K8s; call it 300gb.
>>>>
>>>> Make a folder in your Dockerfile:
>>>>
>>>>     WORKDIR /opt/spark/work-dir
>>>>     RUN chmod g+w /opt/spark/work-dir
>>>>
>>>> Start Spark adding this:
>>>>
>>>>     .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.options.claimName", "300gb") \
>>>>     .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.path", "/opt/spark/work-dir") \
>>>>     .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.readOnly", "false") \
>>>>     .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.options.claimName", "300gb") \
>>>>     .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.path", "/opt/spark/work-dir") \
>>>>     .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.readOnly", "false") \
>>>>     .config("spark.local.dir", "/opt/spark/work-dir")
>>>>
>>>> On Sat, 6 Apr 2024 at 15:45, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>> wrote:
>>>>
>>>>> I have seen some older references to a shuffle service for k8s,
>>>>> although it is not clear they are talking about a generic shuffle
>>>>> service for k8s.
>>>>>
>>>>> Anyhow, with the advent of GenAI and the need to allow for larger
>>>>> volumes of data, I was wondering if there has been any more work on
>>>>> this matter.
>>>>> Specifically, larger and scalable file systems like HDFS, GCS, S3,
>>>>> etc. offer significantly larger storage capacity than local disks on
>>>>> individual worker nodes in a k8s cluster, thus allowing much larger
>>>>> datasets to be handled more efficiently. The degree of parallelism
>>>>> and fault tolerance of these file systems also comes into it. I will
>>>>> be interested in hearing more about any progress on this.
>>>>>
>>>>> Thanks
>>>>>
>>>>> Mich Talebzadeh,
>>>>> Technologist | Solutions Architect | Data Engineer | Generative AI
>>>>> London
>>>>> United Kingdom
>>>>>
>>>>> view my Linkedin profile
>>>>>
>>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>> Disclaimer: The information provided is correct to the best of my
>>>>> knowledge but of course cannot be guaranteed. It is essential to note
>>>>> that, as with any advice, "one test result is worth one-thousand
>>>>> expert opinions" (Wernher von Braun).
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>
>>>> --
>>>> Bjørn Jørgensen
>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>> Norge
>>>>
>>>> +47 480 94 297
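[Editor's note: the six `spark.kubernetes.*.volumes.*` settings quoted from Bjørn's reply above follow a fixed pattern per role, so they can be generated rather than typed by hand. A small pure-Python sketch, not from the thread; the claim name "300gb" and the mount path are the values used in the mail, and `pvc_scratch_conf` is a hypothetical helper name.]

```python
def pvc_scratch_conf(claim: str, mount_path: str) -> dict:
    """Expand one PVC claim into the driver/executor mount settings
    plus spark.local.dir, matching the pattern in Bjørn's reply."""
    conf = {}
    for role in ("driver", "executor"):
        prefix = f"spark.kubernetes.{role}.volumes.persistentVolumeClaim.{claim}"
        conf[f"{prefix}.options.claimName"] = claim
        conf[f"{prefix}.mount.path"] = mount_path
        conf[f"{prefix}.mount.readOnly"] = "false"
    # Point Spark's shuffle/spill scratch space at the mounted volume.
    conf["spark.local.dir"] = mount_path
    return conf

# The resulting dict could then be fed to SparkSession.builder.config()
# key by key, exactly as in the .config(...) chain quoted above.
```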