Thanks for your suggestion; I will take it as a workaround. Whilst this workaround can potentially address storage allocation issues, I was more interested in exploring solutions that offer more seamless integration with large distributed file systems like HDFS, GCS, or S3. That would give better performance and scalability when handling larger datasets.
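To illustrate the kind of object-store integration I mean, below is a minimal sketch of the standard Hadoop s3a connector settings one would pass to a Spark session so data is read and written directly against S3 rather than local worker disks. The key names are the usual Hadoop/Spark ones; the connector version and credential values are placeholders, and in practice instance roles are preferable to inline keys.

```python
# Sketch: Spark settings for reading/writing s3a:// paths directly, so the
# dataset never has to fit on a worker node's local disk. Key names are the
# standard Hadoop s3a ones; the values shown are placeholders.
def s3a_conf(access_key: str, secret_key: str) -> dict:
    return {
        # ship the hadoop-aws connector with the job (version is illustrative)
        "spark.jars.packages": "org.apache.hadoop:hadoop-aws:3.3.4",
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # the "magic" committer avoids slow rename-based output on object stores
        "spark.hadoop.fs.s3a.committer.name": "magic",
    }
```

These entries would then be applied with `SparkSession.builder.config(k, v)` for each pair, after which `spark.read.parquet("s3a://<bucket>/<path>")` reads straight from the bucket.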
Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer | Generative AI
London
United Kingdom

view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, "one test result is worth one-thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).

On Sat, 6 Apr 2024 at 21:28, Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:

> You can make a PVC on K8S; call it 300gb.
>
> Make a folder in your Dockerfile:
>
> WORKDIR /opt/spark/work-dir
> RUN chmod g+w /opt/spark/work-dir
>
> Start Spark adding this:
>
> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.options.claimName", "300gb") \
> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.path", "/opt/spark/work-dir") \
> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.readOnly", "False") \
> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.options.claimName", "300gb") \
> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.path", "/opt/spark/work-dir") \
> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.readOnly", "False") \
> .config("spark.local.dir", "/opt/spark/work-dir")
>
> On Sat, 6 Apr 2024 at 15:45, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> I have seen some older references to a shuffle service for k8s,
>> although it is not clear whether they are talking about a generic
>> shuffle service for k8s.
>>
>> Anyhow, with the advent of genai and the need to allow for a larger
>> volume of data, I was wondering if there has been any more work on
>> this matter.
>> Specifically, larger and scalable file systems like HDFS,
>> GCS, S3 etc. offer significantly larger storage capacity than the
>> local disks on individual worker nodes in a k8s cluster, thus
>> allowing much larger datasets to be handled more efficiently. The
>> degree of parallelism and fault tolerance these file systems offer
>> also comes into it. I will be interested in hearing more about any
>> progress on this.
>>
>> Thanks
>>
>> Mich Talebzadeh
>>
>> ---------------------------------------------------------------------
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
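As a footnote on the PVC workaround quoted earlier: the driver and executor mount settings must agree on the claim name and mount path, so it can help to generate them from one place rather than typing each `.config` line by hand. The sketch below does exactly that, using the claim name "300gb" and mount path from the quoted example; the helper name is my own, not a Spark API.

```python
# Sketch: generate the PVC mount settings from the quoted workaround for both
# driver and executor, keeping claim name and mount path consistent. The keys
# are real spark.kubernetes.* properties; pvc_mount_conf itself is just a
# local helper, not part of Spark.
def pvc_mount_conf(claim: str, path: str, read_only: bool = False) -> dict:
    conf = {}
    for role in ("driver", "executor"):
        prefix = f"spark.kubernetes.{role}.volumes.persistentVolumeClaim.{claim}"
        conf[f"{prefix}.options.claimName"] = claim
        conf[f"{prefix}.mount.path"] = path
        conf[f"{prefix}.mount.readOnly"] = str(read_only).capitalize()
    # point Spark's scratch space (shuffle spill etc.) at the mounted PVC
    conf["spark.local.dir"] = path
    return conf
```

Applying each pair with `SparkSession.builder.config(k, v)` reproduces the settings from the quoted mail.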