Hi,

First, thanks everyone for their contributions.
I was going to reply to @Enrico Minack <i...@enrico.minack.dev> but noticed the additional info. As I understand it, Apache Uniffle, for example, is an incubating project aimed at providing a pluggable shuffle service for Spark. What all these "external shuffle services" have in common is that they offload shuffle data management to an external service, thus reducing the memory and CPU overhead on Spark executors. That is great. While Uniffle and others enhance shuffle performance and scalability, it would be good to integrate them with the Spark UI. This may require additional development effort. I suppose the interest would be to have these external metrics incorporated into Spark with one look and feel, which may require customizing the UI to fetch and display metrics or statistics from the external shuffle services. Has any project done this?

Thanks

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer | Generative AI
London
United Kingdom

view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>

https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, "one test result is worth one-thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).

On Mon, 8 Apr 2024 at 14:19, Vakaris Baškirov <vakaris.bashki...@gmail.com> wrote:

> I see that both Uniffle and Celeborn support S3/HDFS backends, which is
> great.
> In the case someone is using S3/HDFS, I wonder what the advantages would
> be of using Celeborn or Uniffle vs the IBM shuffle service plugin
> <https://github.com/IBM/spark-s3-shuffle> or the Cloud Shuffle Storage
> Plugin from AWS
> <https://docs.aws.amazon.com/glue/latest/dg/cloud-shuffle-storage-plugin.html>?
>
> These plugins do not require deploying a separate service.
> Are there any advantages to using Uniffle/Celeborn in the case of an S3
> backend, given that they would require deploying a separate service?
>
> Thanks
> Vakaris
>
> On Mon, Apr 8, 2024 at 10:03 AM roryqi <ror...@apache.org> wrote:
>
>> Apache Uniffle (incubating) may be another solution. You can see:
>> https://github.com/apache/incubator-uniffle
>> https://uniffle.apache.org/blog/2023/07/21/Uniffle%20-%20New%20chapter%20for%20the%20shuffle%20in%20the%20cloud%20native%20era
>>
>> On Mon, 8 Apr 2024 at 07:15, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> Splendid
>>>
>>> The configurations below can be used with k8s deployments of Spark.
>>> Spark applications running on k8s can use these configurations to
>>> seamlessly access data stored in Google Cloud Storage (GCS) and Amazon S3.
>>>
>>> For Google GCS we may have:
>>>
>>> spark_config_gcs = {
>>>     "spark.kubernetes.authenticate.driver.serviceAccountName": "service_account_name",
>>>     "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
>>>     "spark.hadoop.google.cloud.auth.service.account.enable": "true",
>>>     "spark.hadoop.google.cloud.auth.service.account.json.keyfile": "/path/to/keyfile.json",
>>> }
>>>
>>> For Amazon S3, similarly:
>>>
>>> spark_config_s3 = {
>>>     "spark.kubernetes.authenticate.driver.serviceAccountName": "service_account_name",
>>>     "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
>>>     "spark.hadoop.fs.s3a.access.key": "s3_access_key",
>>>     "spark.hadoop.fs.s3a.secret.key": "secret_key",
>>> }
>>>
>>> To implement these configurations and enable Spark applications to
>>> interact with GCS and S3, I guess we can approach it this way:
>>>
>>> 1) Spark repository integration: these configurations need to be added
>>> to the Spark repository as part of the supported configuration options
>>> for k8s deployments.
>>> 2) Configuration settings: users need to specify these configurations
>>> when submitting Spark applications to a Kubernetes cluster. They can
>>> include them in the Spark application code, or pass them as
>>> command-line arguments or environment variables during application
>>> submission.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Technologist | Solutions Architect | Data Engineer | Generative AI
>>> London, United Kingdom
>>>
>>> On Sun, 7 Apr 2024 at 13:31, Vakaris Baškirov <vakaris.bashki...@gmail.com> wrote:
>>>
>>>> There is an IBM shuffle service plugin that supports S3:
>>>> https://github.com/IBM/spark-s3-shuffle
>>>>
>>>> Though I would think a feature like this could be part of the main
>>>> Spark repo. Trino already has out-of-the-box support for S3 exchange
>>>> (shuffle), and it is very useful.
>>>>
>>>> Vakaris
>>>>
>>>> On Sun, Apr 7, 2024 at 12:27 PM Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>
>>>>> Thanks for your suggestion, which I take as a workaround. While this
>>>>> workaround can potentially address storage allocation issues, I was more
>>>>> interested in exploring solutions that offer a more seamless integration
>>>>> with large distributed file systems like HDFS, GCS, or S3. This would
>>>>> ensure better performance and scalability for handling larger datasets
>>>>> efficiently.
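As an aside on the "pass them as command-line arguments" route mentioned in the thread: a Spark config dictionary like the GCS/S3 ones quoted above can be turned into `spark-submit --conf` flags mechanically. A minimal sketch (the keys and values here are placeholders, not working credentials, and `app.py` is a hypothetical application name):

```python
# Turn a Spark config dict into spark-submit --conf flags.
# Values are placeholders -- substitute real credentials via a secret manager.
spark_config_s3 = {
    "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
    "spark.hadoop.fs.s3a.access.key": "s3_access_key",
    "spark.hadoop.fs.s3a.secret.key": "secret_key",
}

conf_args = []
for key, value in sorted(spark_config_s3.items()):
    conf_args += ["--conf", f"{key}={value}"]

# Assemble a full command line for illustration.
print(" ".join(["spark-submit"] + conf_args + ["app.py"]))
```

The same dictionary could equally be looped over with `SparkSession.builder.config(key, value)` inside the application itself; keeping the settings in one dict means the two submission styles stay in sync.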
>>>>>
>>>>> Mich Talebzadeh,
>>>>> Technologist | Solutions Architect | Data Engineer | Generative AI
>>>>> London, United Kingdom
>>>>>
>>>>> On Sat, 6 Apr 2024 at 21:28, Bjørn Jørgensen <bjornjorgen...@gmail.com> wrote:
>>>>>
>>>>>> You can make a PVC on K8s; call it 300gb.
>>>>>>
>>>>>> Make a folder in your Dockerfile:
>>>>>>
>>>>>> WORKDIR /opt/spark/work-dir
>>>>>> RUN chmod g+w /opt/spark/work-dir
>>>>>>
>>>>>> Start Spark adding this:
>>>>>>
>>>>>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.options.claimName", "300gb") \
>>>>>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.path", "/opt/spark/work-dir") \
>>>>>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.readOnly", "False") \
>>>>>> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.options.claimName", "300gb") \
>>>>>> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.path", "/opt/spark/work-dir") \
>>>>>> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.readOnly", "False") \
>>>>>> .config("spark.local.dir", "/opt/spark/work-dir")
>>>>>>
>>>>>> On Sat, 6 Apr 2024 at 15:45, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>>
>>>>>>> I have seen some older references to a shuffle service for k8s,
>>>>>>> although it is not clear whether they are talking about a generic
>>>>>>> shuffle service for k8s.
>>>>>>>
>>>>>>> Anyhow, with the advent of GenAI and the need to allow for larger
>>>>>>> volumes of data, I was wondering whether there has been any more work
>>>>>>> on this matter. Specifically, large and scalable file systems like
>>>>>>> HDFS, GCS, S3, etc. offer significantly larger storage capacity than
>>>>>>> the local disks on individual worker nodes in a k8s cluster, thus
>>>>>>> allowing much larger datasets to be handled more efficiently. The
>>>>>>> degree of parallelism and fault tolerance these file systems offer
>>>>>>> also comes into it. I would be interested in hearing about any
>>>>>>> progress on this.
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Mich Talebzadeh,
>>>>>>> Technologist | Solutions Architect | Data Engineer | Generative AI
>>>>>>> London, United Kingdom
>>>>>>>
>>>>>>> ---------------------------------------------------------------------
>>>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>>>
>>>>>> --
>>>>>> Bjørn Jørgensen
>>>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>>>> Norge
>>>>>>
>>>>>> +47 480 94 297
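Editor's note: Bjørn's PVC recipe repeats the same three mount settings for driver and executor. Purely as an illustration (the claim name and mount path are the ones from his example), the pairs can be generated programmatically and fed to a session builder, which avoids typos in the long property names:

```python
def pvc_confs(claim_name, mount_path, read_only=False):
    """Build the driver and executor PVC mount settings as a config dict."""
    confs = {}
    for role in ("driver", "executor"):
        prefix = f"spark.kubernetes.{role}.volumes.persistentVolumeClaim.{claim_name}"
        confs[f"{prefix}.options.claimName"] = claim_name
        confs[f"{prefix}.mount.path"] = mount_path
        confs[f"{prefix}.mount.readOnly"] = str(read_only)
    # Point Spark's scratch space (shuffle spill, temp files) at the volume.
    confs["spark.local.dir"] = mount_path
    return confs

# Prints the seven settings from Bjørn's example.
for key, value in pvc_confs("300gb", "/opt/spark/work-dir").items():
    print(f"{key} = {value}")
```

Each key/value pair would then be applied via `SparkSession.builder.config(key, value)`, exactly as in the quoted snippet.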