There is an IBM shuffle service plugin that supports S3:
https://github.com/IBM/spark-s3-shuffle

Though I would think a feature like this could be part of the main Spark
repo. Trino already has out-of-the-box support for S3 exchange (shuffle),
and it is very useful.
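
For anyone who wants to try the plugin, wiring it up looks roughly like
this (a minimal sketch based on my reading of the plugin's README; the
plugin class, option key, and bucket path are assumptions to verify
against the repo, and the plugin jar must already be on the classpath):

from pyspark.sql import SparkSession

# Sketch: route shuffle data to S3 via the spark-s3-shuffle plugin.
spark = (
    SparkSession.builder
    .appName("s3-shuffle-demo")  # hypothetical app name
    # Plugin entry point, per the repo's README (verify for your version).
    .config("spark.shuffle.sort.io.plugin.class",
            "org.apache.spark.shuffle.S3ShuffleDataIO")
    # Hypothetical bucket/prefix where shuffle blocks are written.
    .config("spark.shuffle.s3.rootDir", "s3a://my-bucket/spark-shuffle")
    .getOrCreate()
)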

Vakaris

On Sun, Apr 7, 2024 at 12:27 PM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

>
> Thanks for your suggestion; I take it as a workaround. Whilst this
> workaround can potentially address storage allocation issues, I was more
> interested in exploring solutions that offer more seamless integration
> with large distributed file systems like HDFS, GCS, or S3. This would
> ensure better performance and scalability for handling larger datasets
> efficiently.
>
>
> Mich Talebzadeh,
> Technologist | Solutions Architect | Data Engineer  | Generative AI
> London
> United Kingdom
>
>
> LinkedIn: https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, "one test result is worth one thousand expert
> opinions" (Wernher von Braun
> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>
>
> On Sat, 6 Apr 2024 at 21:28, Bjørn Jørgensen <bjornjorgen...@gmail.com>
> wrote:
>
>> You can create a PVC on K8s; call it 300gb (the claim name used below).
>>
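>> For example, creating that PVC with the official kubernetes Python
>> client (a sketch; the namespace and storage size are assumptions, and
>> you likely want ReadWriteMany if the driver and executors run on
>> different nodes):
>>
>> from kubernetes import client, config
>>
>> config.load_kube_config()  # use load_incluster_config() inside the cluster
>> pvc = client.V1PersistentVolumeClaim(
>>     metadata=client.V1ObjectMeta(name="300gb"),
>>     spec=client.V1PersistentVolumeClaimSpec(
>>         access_modes=["ReadWriteMany"],
>>         # Newer client releases call this V1VolumeResourceRequirements.
>>         resources=client.V1ResourceRequirements(requests={"storage": "300Gi"}),
>>     ),
>> )
>> client.CoreV1Api().create_namespaced_persistent_volume_claim("default", pvc)
>>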
>> Create the work directory in your Dockerfile:
>> WORKDIR /opt/spark/work-dir
>> RUN chmod g+w /opt/spark/work-dir
>>
>> Then start Spark with these settings added (the volume name "300gb" in
>> each key must match the PVC's claim name):
>>
>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.options.claimName", "300gb") \
>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.path", "/opt/spark/work-dir") \
>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.readOnly", "false") \
>> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.options.claimName", "300gb") \
>> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.path", "/opt/spark/work-dir") \
>> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.readOnly", "false") \
>> .config("spark.local.dir", "/opt/spark/work-dir")
>>
>>
>>
>>
>> lør. 6. apr. 2024 kl. 15:45 skrev Mich Talebzadeh <
>> mich.talebza...@gmail.com>:
>>
>>> I have seen some older references to a shuffle service for k8s,
>>> although it is not clear whether they are talking about a generic
>>> shuffle service for k8s.
>>>
>>> Anyhow, with the advent of GenAI and the need to allow for larger
>>> volumes of data, I was wondering if there has been any more work on
>>> this matter. Specifically, larger and scalable file systems like HDFS,
>>> GCS, S3, etc. offer significantly larger storage capacity than the
>>> local disks on individual worker nodes in a k8s cluster, thus allowing
>>> much larger datasets to be handled more efficiently. The degree of
>>> parallelism and fault tolerance these file systems offer also comes
>>> into it. I will be interested in hearing about any progress on this.
>>>
>>> Thanks
>>>
>>> Mich Talebzadeh,
>>>
>>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>>>
>>> London
>>> United Kingdom
>>>
>>>
>>> LinkedIn: https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> Disclaimer: The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed. It is essential to note
>>> that, as with any advice, "one test result is worth one thousand
>>> expert opinions" (Wernher von Braun).
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>
