There is an IBM shuffle service plugin that supports S3: https://github.com/IBM/spark-s3-shuffle
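For reference, a minimal, untested sketch of what wiring that plugin into a PySpark session could look like. spark.shuffle.manager and spark.shuffle.sort.io.plugin.class are standard Spark properties, but the S3ShuffleManager/S3ShuffleDataIO class names and the spark.shuffle.s3.rootDir key are assumptions based on my reading of the plugin's README, so please verify them against the repository for your Spark version:

from pyspark.sql import SparkSession

# Sketch only: assumes the spark-s3-shuffle jar has already been shipped to the
# driver and executors (e.g. via spark-submit --jars); the session will fail to
# start if the plugin classes are not on the classpath.
spark = (
    SparkSession.builder
    .appName("s3-shuffle-sketch")
    # Standard Spark keys; the class names below are assumed from the plugin's README.
    .config("spark.shuffle.manager", "org.apache.spark.shuffle.sort.S3ShuffleManager")
    .config("spark.shuffle.sort.io.plugin.class", "org.apache.spark.shuffle.S3ShuffleDataIO")
    # Assumed plugin property: the bucket/prefix where shuffle blocks are written.
    .config("spark.shuffle.s3.rootDir", "s3a://my-bucket/spark-shuffle")
    .getOrCreate()
)

The appeal over a node-local workaround is that shuffle data lands on object storage that outlives any single executor pod.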
Though I would think a feature like this could be part of the main Spark repo. Trino already has out-of-the-box support for S3 exchange (shuffle) and it is very useful.

Vakaris

On Sun, Apr 7, 2024 at 12:27 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
> Thanks for your suggestion, which I take as a workaround. Whilst this
> workaround can potentially address storage allocation issues, I was more
> interested in exploring solutions that offer a more seamless integration
> with large distributed file systems like HDFS, GCS, or S3. This would
> ensure better performance and scalability for handling larger datasets
> efficiently.
>
> Mich Talebzadeh,
> Technologist | Solutions Architect | Data Engineer | Generative AI
> London, United Kingdom
>
> View my LinkedIn profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, "one test result is worth one thousand
> expert opinions" (Wernher von Braun
> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>
> On Sat, 6 Apr 2024 at 21:28, Bjørn Jørgensen <bjornjorgen...@gmail.com>
> wrote:
>
>> You can make a PVC on K8s and call it 300gb.
>>
>> Make a folder in your Dockerfile:
>>
>> WORKDIR /opt/spark/work-dir
>> RUN chmod g+w /opt/spark/work-dir
>>
>> Then start Spark with the following added:
>>
>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.options.claimName", "300gb") \
>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.path", "/opt/spark/work-dir") \
>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.readOnly", "False") \
>> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.options.claimName", "300gb") \
>> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.path", "/opt/spark/work-dir") \
>> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.readOnly", "False") \
>> .config("spark.local.dir", "/opt/spark/work-dir")
>>
>> On Sat, 6 Apr 2024 at 15:45, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> I have seen some older references to a shuffle service for K8s,
>>> although it is not clear whether they are talking about a generic
>>> shuffle service for K8s.
>>>
>>> Anyhow, with the advent of GenAI and the need to allow for larger
>>> volumes of data, I was wondering whether there has been any more work
>>> on this matter. Specifically, larger and scalable file systems like
>>> HDFS, GCS, S3 etc. offer significantly more storage capacity than the
>>> local disks on individual worker nodes in a K8s cluster, thus allowing
>>> much larger datasets to be handled more efficiently. The degree of
>>> parallelism and fault tolerance offered by these file systems also
>>> comes into it. I would be interested in hearing more about any progress
>>> on this.
>>>
>>> Thanks
>>>
>>> Mich Talebzadeh,
>>> Technologist | Solutions Architect | Data Engineer | Generative AI
>>> London, United Kingdom
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
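For completeness, here is Bjørn's PVC workaround expanded into a self-contained PySpark sketch. The claim name 300gb and the /opt/spark/work-dir mount path are simply the values from his example; the master URL, namespace and container image are placeholders for your own cluster, and the PVC is assumed to already exist:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("pvc-local-dir-workaround")
    .master("k8s://https://<api-server-host>:6443")                   # placeholder
    .config("spark.kubernetes.namespace", "spark")                    # placeholder
    .config("spark.kubernetes.container.image", "my-spark:latest")    # placeholder
    # Mount the existing claim read-write on the driver at the work dir created in the Dockerfile.
    .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.options.claimName", "300gb")
    .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.path", "/opt/spark/work-dir")
    .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.readOnly", "false")
    # Same mount on every executor.
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.options.claimName", "300gb")
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.path", "/opt/spark/work-dir")
    .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.readOnly", "false")
    # Point shuffle spill and scratch space at the mounted volume instead of node-local disk.
    .config("spark.local.dir", "/opt/spark/work-dir")
    .getOrCreate()
)

Note that sharing one claim read-write across executors on different nodes generally needs a ReadWriteMany-capable storage class; if that is not available, Spark on K8s can instead create a claim per executor by using OnDemand as the claimName.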