I see that both Uniffle and Celeborn support S3/HDFS backends, which is great. In the case where someone is using S3/HDFS, I wonder what the advantages would be of using Celeborn or Uniffle vs the IBM shuffle service plugin <https://github.com/IBM/spark-s3-shuffle> or the Cloud Shuffle Storage Plugin from AWS <https://docs.aws.amazon.com/glue/latest/dg/cloud-shuffle-storage-plugin.html>?
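For reference, the IBM plugin is enabled purely through Spark configuration once its jar is on the classpath (e.g. via --jars or spark.jars). A rough sketch based on my reading of its README, so the exact keys may differ between plugin versions, and the bucket name here is a placeholder:

spark_config_s3_shuffle = {
    # plugin class that redirects shuffle writes to the object store
    "spark.shuffle.sort.io.plugin.class": "org.apache.spark.shuffle.S3ShuffleDataIO",
    # root location where shuffle blocks are written (placeholder bucket)
    "spark.shuffle.s3.rootDir": "s3a://some-bucket/shuffle",
}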
These plugins do not require deploying a separate service. Are there any advantages to using Uniffle/Celeborn with an S3 backend, given that they require deploying a separate service?

Thanks
Vakaris

On Mon, Apr 8, 2024 at 10:03 AM roryqi <ror...@apache.org> wrote:

> Apache Uniffle (incubating) may be another solution.
> You can see
> https://github.com/apache/incubator-uniffle
> https://uniffle.apache.org/blog/2023/07/21/Uniffle%20-%20New%20chapter%20for%20the%20shuffle%20in%20the%20cloud%20native%20era
>
> Mich Talebzadeh <mich.talebza...@gmail.com> wrote on Mon, Apr 8, 2024 at 07:15:
>
>> Splendid
>>
>> The configurations below can be used with k8s deployments of Spark. Spark
>> applications running on k8s can utilize these configurations to seamlessly
>> access data stored in Google Cloud Storage (GCS) and Amazon S3.
>>
>> For Google GCS we may have
>>
>> spark_config_gcs = {
>>     "spark.kubernetes.authenticate.driver.serviceAccountName": "service_account_name",
>>     "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
>>     "spark.hadoop.google.cloud.auth.service.account.enable": "true",
>>     "spark.hadoop.google.cloud.auth.service.account.json.keyfile": "/path/to/keyfile.json",
>> }
>>
>> For Amazon S3, similarly
>>
>> spark_config_s3 = {
>>     "spark.kubernetes.authenticate.driver.serviceAccountName": "service_account_name",
>>     "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
>>     "spark.hadoop.fs.s3a.access.key": "s3_access_key",
>>     "spark.hadoop.fs.s3a.secret.key": "secret_key",
>> }
>>
>> To implement these configurations and enable Spark applications to
>> interact with GCS and S3, I guess we can approach it this way
>>
>> 1) Spark Repository Integration: These configurations need to be added to
>> the Spark repository as part of the supported configuration options for
>> k8s deployments.
>>
>> 2) Configuration Settings: Users need to specify these configurations
>> when submitting Spark applications to a Kubernetes cluster. They can
>> include these configurations in the Spark application code or pass them
>> as command-line arguments or environment variables during application
>> submission.
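>> As a minimal sketch of option 2), either dictionary above can be applied
>> to a SparkSession builder in a loop (the app name here is an illustrative
>> placeholder):
>>
>> from pyspark.sql import SparkSession
>>
>> builder = SparkSession.builder.appName("gcs-s3-access-demo")
>> # apply whichever dictionary matches the target object store
>> for key, value in spark_config_s3.items():
>>     builder = builder.config(key, value)
>> spark = builder.getOrCreate()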
>> HTH
>>
>> Mich Talebzadeh,
>> Technologist | Solutions Architect | Data Engineer | Generative AI
>> London
>> United Kingdom
>>
>> view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed. It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Wernher von Braun
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>
>> On Sun, 7 Apr 2024 at 13:31, Vakaris Baškirov <vakaris.bashki...@gmail.com> wrote:
>>
>>> There is an IBM shuffle service plugin that supports S3
>>> https://github.com/IBM/spark-s3-shuffle
>>>
>>> Though I would think a feature like this could be part of the main
>>> Spark repo. Trino already has out-of-the-box support for S3 exchange
>>> (shuffle), and it's very useful.
>>>
>>> Vakaris
>>>
>>> On Sun, Apr 7, 2024 at 12:27 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> Thanks for your suggestion, which I take as a workaround. Whilst this
>>>> workaround can potentially address storage allocation issues, I was more
>>>> interested in exploring solutions that offer a more seamless integration
>>>> with large distributed file systems like HDFS, GCS, or S3. This would
>>>> ensure better performance and scalability for handling larger datasets
>>>> efficiently.
>>>>
>>>> Mich Talebzadeh,
>>>> Technologist | Solutions Architect | Data Engineer | Generative AI
>>>> London
>>>> United Kingdom
>>>>
>>>> view my Linkedin profile
>>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>>
>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>
>>>> *Disclaimer:* The information provided is correct to the best of my
>>>> knowledge but of course cannot be guaranteed. It is essential to note
>>>> that, as with any advice, quote "one test result is worth one-thousand
>>>> expert opinions (Wernher von Braun
>>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>)".
>>>>
>>>> On Sat, 6 Apr 2024 at 21:28, Bjørn Jørgensen <bjornjorgen...@gmail.com>
>>>> wrote:
>>>>
>>>>> You can make a PVC on K8S; call it 300gb.
>>>>>
>>>>> Make a folder in your Dockerfile:
>>>>>
>>>>> WORKDIR /opt/spark/work-dir
>>>>> RUN chmod g+w /opt/spark/work-dir
>>>>>
>>>>> Start Spark with this added:
>>>>>
>>>>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.options.claimName", "300gb") \
>>>>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.path", "/opt/spark/work-dir") \
>>>>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.readOnly", "False") \
>>>>> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.options.claimName", "300gb") \
>>>>> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.path", "/opt/spark/work-dir") \
>>>>> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.readOnly", "False") \
>>>>> .config("spark.local.dir", "/opt/spark/work-dir")
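>>>>> The same six settings can also be generated in a small loop. A minimal
>>>>> sketch, assuming the PVC named "300gb" already exists in the namespace
>>>>> and the image contains /opt/spark/work-dir as above:
>>>>>
>>>>> from pyspark.sql import SparkSession
>>>>>
>>>>> builder = SparkSession.builder.appName("pvc-scratch-demo")
>>>>> # mount the same claim on both the driver and the executors
>>>>> for role in ("driver", "executor"):
>>>>>     prefix = f"spark.kubernetes.{role}.volumes.persistentVolumeClaim.300gb"
>>>>>     builder = (
>>>>>         builder.config(f"{prefix}.options.claimName", "300gb")
>>>>>         .config(f"{prefix}.mount.path", "/opt/spark/work-dir")
>>>>>         .config(f"{prefix}.mount.readOnly", "False")
>>>>>     )
>>>>> # point scratch/shuffle space at the mounted volume
>>>>> spark = builder.config("spark.local.dir", "/opt/spark/work-dir").getOrCreate()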
>>>>> On Sat, 6 Apr 2024 at 15:45, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>>> I have seen some older references to a shuffle service for k8s,
>>>>>> although it is not clear whether they are talking about a generic
>>>>>> shuffle service for k8s.
>>>>>>
>>>>>> Anyhow, with the advent of GenAI and the need to allow for larger
>>>>>> volumes of data, I was wondering if there has been any more work on
>>>>>> this matter. Specifically, larger and scalable file systems like HDFS,
>>>>>> GCS, S3 etc. offer significantly larger storage capacity than local
>>>>>> disks on individual worker nodes in a k8s cluster, thus allowing much
>>>>>> larger datasets to be handled more efficiently. The degree of
>>>>>> parallelism and fault tolerance of these file systems also comes into
>>>>>> it. I would be interested in hearing about any progress on this.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> Mich Talebzadeh,
>>>>>> Technologist | Solutions Architect | Data Engineer | Generative AI
>>>>>> London
>>>>>> United Kingdom
>>>>>>
>>>>>> view my Linkedin profile
>>>>>>
>>>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>>
>>>>>> Disclaimer: The information provided is correct to the best of my
>>>>>> knowledge but of course cannot be guaranteed. It is essential to note
>>>>>> that, as with any advice, quote "one test result is worth one-thousand
>>>>>> expert opinions (Wernher von Braun)".
>>>>>>
>>>>>> ---------------------------------------------------------------------
>>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>>
>>>>> --
>>>>> Bjørn Jørgensen
>>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>>> Norge
>>>>>
>>>>> +47 480 94 297