Apache Uniffle (incubating) may be another solution. See:
https://github.com/apache/incubator-uniffle
https://uniffle.apache.org/blog/2023/07/21/Uniffle%20-%20New%20chapter%20for%20the%20shuffle%20in%20the%20cloud%20native%20era
Mich Talebzadeh <mich.talebza...@gmail.com> wrote on Mon, 8 Apr 2024 at 07:15:

> Splendid
>
> The configurations below can be used with k8s deployments of Spark. Spark
> applications running on k8s can utilize these configurations to seamlessly
> access data stored in Google Cloud Storage (GCS) and Amazon S3.
>
> For Google GCS we may have:
>
> spark_config_gcs = {
>     "spark.kubernetes.authenticate.driver.serviceAccountName": "service_account_name",
>     "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
>     "spark.hadoop.google.cloud.auth.service.account.enable": "true",
>     "spark.hadoop.google.cloud.auth.service.account.json.keyfile": "/path/to/keyfile.json",
> }
>
> For Amazon S3, similarly:
>
> spark_config_s3 = {
>     "spark.kubernetes.authenticate.driver.serviceAccountName": "service_account_name",
>     "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
>     "spark.hadoop.fs.s3a.access.key": "s3_access_key",
>     "spark.hadoop.fs.s3a.secret.key": "secret_key",
> }
>
> To implement these configurations and enable Spark applications to
> interact with GCS and S3, I guess we can approach it this way:
>
> 1) Spark repository integration: these configurations need to be added to
>    the Spark repository as part of the supported configuration options for
>    k8s deployments.
>
> 2) Configuration settings: users need to specify these configurations when
>    submitting Spark applications to a Kubernetes cluster. They can include
>    them in the Spark application code or pass them as command-line
>    arguments or environment variables during application submission.
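[Editor's note: the two dicts above can be folded into one submit-time helper. A minimal pure-Python sketch, not from the thread; the service account name, keyfile path, and S3 keys are the placeholders used in the mail, not real values, and the PySpark usage at the end is commented out since it assumes a running cluster.]

```python
def build_spark_conf(store: str) -> dict:
    """Return submit-time Spark settings for 'gcs' or 's3' on k8s.

    Merges the shared k8s service-account setting with the per-store
    Hadoop connector settings from the mail above. All credential
    values are placeholders.
    """
    base = {
        "spark.kubernetes.authenticate.driver.serviceAccountName":
            "service_account_name",
    }
    if store == "gcs":
        base.update({
            "spark.hadoop.fs.gs.impl":
                "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
            "spark.hadoop.google.cloud.auth.service.account.enable": "true",
            "spark.hadoop.google.cloud.auth.service.account.json.keyfile":
                "/path/to/keyfile.json",
        })
    elif store == "s3":
        base.update({
            "spark.hadoop.fs.s3a.impl":
                "org.apache.hadoop.fs.s3a.S3AFileSystem",
            "spark.hadoop.fs.s3a.access.key": "s3_access_key",
            "spark.hadoop.fs.s3a.secret.key": "secret_key",
        })
    else:
        raise ValueError(f"unknown store: {store}")
    return base

# With pyspark installed, the dict would be applied roughly like:
#   builder = SparkSession.builder
#   for k, v in build_spark_conf("gcs").items():
#       builder = builder.config(k, v)
```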
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Solutions Architect | Data Engineer | Generative AI
> London
> United Kingdom
>
> view my Linkedin profile
> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>
> https://en.everybodywiki.com/Mich_Talebzadeh
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, "one test result is worth one-thousand expert
> opinions" (Wernher von Braun
> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>
> On Sun, 7 Apr 2024 at 13:31, Vakaris Baškirov <vakaris.bashki...@gmail.com>
> wrote:
>
>> There is an IBM shuffle service plugin that supports S3:
>> https://github.com/IBM/spark-s3-shuffle
>>
>> Though I would think a feature like this could be part of the main Spark
>> repo. Trino already has out-of-the-box support for S3 exchange (shuffle)
>> and it's very useful.
>>
>> Vakaris
>>
>> On Sun, Apr 7, 2024 at 12:27 PM Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>> Thanks for your suggestion, which I take as a workaround. Whilst this
>>> workaround can potentially address storage allocation issues, I was more
>>> interested in exploring solutions that offer a more seamless integration
>>> with large distributed file systems like HDFS, GCS, or S3. This would
>>> ensure better performance and scalability for handling larger datasets
>>> efficiently.
>>>
>>> Mich Talebzadeh,
>>> Technologist | Solutions Architect | Data Engineer | Generative AI
>>> London
>>> United Kingdom
>>>
>>> view my Linkedin profile
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>
>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed.
>>> It is essential to note that, as with any advice, "one test result is
>>> worth one-thousand expert opinions" (Wernher von Braun
>>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>>
>>> On Sat, 6 Apr 2024 at 21:28, Bjørn Jørgensen <bjornjorgen...@gmail.com>
>>> wrote:
>>>
>>>> You can make a PVC on K8s; call it 300gb.
>>>>
>>>> Make a folder in your Dockerfile:
>>>>
>>>>     WORKDIR /opt/spark/work-dir
>>>>     RUN chmod g+w /opt/spark/work-dir
>>>>
>>>> Start Spark adding this:
>>>>
>>>>     .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.options.claimName", "300gb") \
>>>>     .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.path", "/opt/spark/work-dir") \
>>>>     .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.readOnly", "false") \
>>>>     .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.options.claimName", "300gb") \
>>>>     .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.path", "/opt/spark/work-dir") \
>>>>     .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.readOnly", "false") \
>>>>     .config("spark.local.dir", "/opt/spark/work-dir")
>>>>
>>>> On Sat, 6 Apr 2024 at 15:45, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>> wrote:
>>>>
>>>>> I have seen some older references to a shuffle service for k8s,
>>>>> although it is not clear they are talking about a generic shuffle
>>>>> service for k8s.
>>>>>
>>>>> Anyhow, with the advent of GenAI and the need to allow for larger
>>>>> volumes of data, I was wondering if there has been any more work on
>>>>> this matter.
>>>>> Specifically, larger and scalable file systems like HDFS, GCS, S3,
>>>>> etc. offer significantly larger storage capacity than local disks on
>>>>> individual worker nodes in a k8s cluster, thus allowing much larger
>>>>> datasets to be handled more efficiently. The degree of parallelism
>>>>> and fault tolerance of these file systems also comes into it. I will
>>>>> be interested in hearing more about any progress on this.
>>>>>
>>>>> Thanks
>>>>>
>>>>> Mich Talebzadeh,
>>>>> Technologist | Solutions Architect | Data Engineer | Generative AI
>>>>> London
>>>>> United Kingdom
>>>>>
>>>>> view my Linkedin profile
>>>>>
>>>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>>>>
>>>>> Disclaimer: The information provided is correct to the best of my
>>>>> knowledge but of course cannot be guaranteed. It is essential to note
>>>>> that, as with any advice, "one test result is worth one-thousand
>>>>> expert opinions" (Wernher von Braun).
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>
>>>> --
>>>> Bjørn Jørgensen
>>>> Vestre Aspehaug 4, 6010 Ålesund
>>>> Norge
>>>>
>>>> +47 480 94 297
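[Editor's note: the six `spark.kubernetes.*.volumes.*` settings quoted from Bjørn's reply above follow a fixed pattern per role, so they can be generated rather than typed by hand. A small pure-Python sketch, not from the thread; the claim name "300gb" and the mount path are the values used in the mail, and `pvc_scratch_conf` is a hypothetical helper name.]

```python
def pvc_scratch_conf(claim: str, mount_path: str) -> dict:
    """Expand one PVC claim into the driver/executor mount settings
    plus spark.local.dir, matching the pattern in Bjørn's reply."""
    conf = {}
    for role in ("driver", "executor"):
        prefix = f"spark.kubernetes.{role}.volumes.persistentVolumeClaim.{claim}"
        conf[f"{prefix}.options.claimName"] = claim
        conf[f"{prefix}.mount.path"] = mount_path
        conf[f"{prefix}.mount.readOnly"] = "false"
    # Point Spark's shuffle/spill scratch space at the mounted volume.
    conf["spark.local.dir"] = mount_path
    return conf

# The resulting dict could then be fed to SparkSession.builder.config()
# key by key, exactly as in the .config(...) chain quoted above.
```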