Re: External Spark shuffle service for k8s

2024-04-08 Thread Vakaris Baškirov
I see that both Uniffle and Celeborn support S3/HDFS backends, which is
great.
If someone is already using S3/HDFS, I wonder what the advantages would be
of using Celeborn or Uniffle versus the IBM shuffle service plugin
<https://github.com/IBM/spark-s3-shuffle> or the Cloud Shuffle Storage
Plugin from AWS
<https://docs.aws.amazon.com/glue/latest/dg/cloud-shuffle-storage-plugin.html>?

These plugins do not require deploying a separate service. Are there any
advantages to using Uniffle/Celeborn with an S3 backend, given that they
require deploying a separate service?
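
For reference, the plugin-based route hooks into Spark's shuffle IO plugin
interface instead of running a standalone service. A minimal, hedged sketch of
what enabling it might look like (the plugin class and the rootDir option name
below are assumptions on my part and should be verified against the
spark-s3-shuffle README):

spark_config_shuffle_plugin = {
    # Spark's pluggable shuffle IO interface (part of Spark itself)
    "spark.shuffle.sort.io.plugin.class": "org.apache.spark.shuffle.S3ShuffleDataIO",
    # Assumed option name for where the plugin writes shuffle blocks
    "spark.shuffle.s3.rootDir": "s3a://my-bucket/spark-shuffle",
}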

Thanks
Vakaris

On Mon, Apr 8, 2024 at 10:03 AM roryqi  wrote:

> Apache Uniffle (incubating) may be another solution.
> You can see
> https://github.com/apache/incubator-uniffle
>
> https://uniffle.apache.org/blog/2023/07/21/Uniffle%20-%20New%20chapter%20for%20the%20shuffle%20in%20the%20cloud%20native%20era
>
> Mich Talebzadeh wrote on Mon, Apr 8, 2024 at 07:15:
>
>> Splendid
>>
>> The configurations below can be used with k8s deployments of Spark. Spark
>> applications running on k8s can utilize these configurations to seamlessly
>> access data stored in Google Cloud Storage (GCS) and Amazon S3.
>>
>> For Google Cloud Storage (GCS) we may have:
>>
>> spark_config_gcs = {
>>     "spark.kubernetes.authenticate.driver.serviceAccountName": "service_account_name",
>>     "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
>>     "spark.hadoop.google.cloud.auth.service.account.enable": "true",
>>     "spark.hadoop.google.cloud.auth.service.account.json.keyfile": "/path/to/keyfile.json",
>> }
>>
>> For Amazon S3, similarly:
>>
>> spark_config_s3 = {
>>     "spark.kubernetes.authenticate.driver.serviceAccountName": "service_account_name",
>>     "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
>>     "spark.hadoop.fs.s3a.access.key": "s3_access_key",
>>     "spark.hadoop.fs.s3a.secret.key": "secret_key",
>> }
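>>
>> A minimal sketch (assuming PySpark and that the credentials above are in
>> place) of applying either dictionary when building the SparkSession from
>> application code:
>>
>> from pyspark.sql import SparkSession
>>
>> builder = SparkSession.builder.appName("k8s-object-store-app")
>> # Apply whichever block matches the target object store
>> for key, value in spark_config_gcs.items():
>>     builder = builder.config(key, value)
>> spark = builder.getOrCreate()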
>>
>>
>> To implement these configurations and enable Spark applications to
>> interact with GCS and S3, I guess we can approach it this way
>>
>> 1) Spark Repository Integration: These configurations need to be added to
>> the Spark repository as part of the supported configuration options for k8s
>> deployments.
>>
>> 2) Configuration Settings: Users need to specify these configurations
>> when submitting Spark applications to a Kubernetes cluster. They can
>> include these configurations in the Spark application code, or pass them
>> as command-line arguments or environment variables during application
>> submission (see the sketch below).
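>>
>> For the command-line route, a hedged sketch (the helper below is purely
>> hypothetical, not an existing Spark API) of turning one of the
>> dictionaries above into spark-submit "--conf key=value" arguments:
>>
>> def to_submit_args(conf):
>>     """Hypothetical helper: flatten a config dict into --conf arguments."""
>>     args = []
>>     for key, value in conf.items():
>>         args += ["--conf", f"{key}={value}"]
>>     return args
>>
>> # e.g. subprocess.run(["spark-submit", *to_submit_args(spark_config_gcs), "my_app.py"])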
>>
>> HTH
>>
>> Mich Talebzadeh,
>>
>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed. It is essential to note
>> that, as with any advice, "one test result is worth one-thousand expert
>> opinions" (Wernher von Braun
>> <https://en.wikipedia.org/wiki/Wernher_von_Braun>).
>>
>>
>> On Sun, 7 Apr 2024 at 13:31, Vakaris Baškirov <
>> vakaris.bashki...@gmail.com> wrote:
>>
>>> There is an IBM shuffle service plugin that supports S3
>>> https://github.com/IBM/spark-s3-shuffle
>>>
>>> Though I would think a feature like this could be part of the main
>>> Spark repo. Trino already has out-of-the-box support for S3 exchange
>>> (shuffle) and it's very useful.
>>>
>>> Vakaris
>>>
>>> On Sun, Apr 7, 2024 at 12:27 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>>
>>>> Thanks for your suggestion, which I take as a workaround. Whilst this
>>>> workaround can potentially address storage allocation issues, I was more
>>>> interested in exploring solutions that offer a more seamless integration
>>>> with large distributed file systems like HDFS, GCS, or S3. This would
>>>> ensure better performance and scalability for handling larger datasets
>>>> efficiently.
>>>>

Re: External Spark shuffle service for k8s

2024-04-07 Thread Vakaris Baškirov
There is an IBM shuffle service plugin that supports S3
https://github.com/IBM/spark-s3-shuffle

Though I would think a feature like this could be part of the main Spark
repo. Trino already has out-of-the-box support for S3 exchange (shuffle) and
it's very useful.

Vakaris

On Sun, Apr 7, 2024 at 12:27 PM Mich Talebzadeh 
wrote:

>
> Thanks for your suggestion, which I take as a workaround. Whilst this
> workaround can potentially address storage allocation issues, I was more
> interested in exploring solutions that offer a more seamless integration
> with large distributed file systems like HDFS, GCS, or S3. This would
> ensure better performance and scalability for handling larger datasets
> efficiently.
>
>
> Mich Talebzadeh,
> Technologist | Solutions Architect | Data Engineer  | Generative AI
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed. It is essential to note
> that, as with any advice, "one test result is worth one-thousand expert
> opinions" (Wernher von Braun).
>
>
> On Sat, 6 Apr 2024 at 21:28, Bjørn Jørgensen 
> wrote:
>
>> You can create a PVC on K8s and call it 300gb (for example, 300 GB in size).
>>
>> Make a folder in your Dockerfile:
>>
>> WORKDIR /opt/spark/work-dir
>> RUN chmod g+w /opt/spark/work-dir
>>
>> Then start Spark with the following config added:
>>
>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.options.claimName",
>> "300gb") \
>>
>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.path",
>> "/opt/spark/work-dir") \
>>
>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.readOnly",
>> "False") \
>>
>> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.options.claimName",
>> "300gb") \
>>
>> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.path",
>> "/opt/spark/work-dir") \
>>
>> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.readOnly",
>> "False") \
>>   .config("spark.local.dir", "/opt/spark/work-dir")
>>
>>
>>
>>
>> On Sat, 6 Apr 2024 at 15:45, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> I have seen some older references to a shuffle service for k8s,
>>> although it is not clear whether they are talking about a generic
>>> shuffle service for k8s.
>>>
>>> Anyhow, with the advent of GenAI and the need to handle larger volumes
>>> of data, I was wondering if there has been any more work on this
>>> matter. Specifically, larger and scalable file systems like HDFS, GCS,
>>> S3 etc. offer significantly larger storage capacity than local disks on
>>> individual worker nodes in a k8s cluster, thus allowing much larger
>>> datasets to be handled more efficiently. The degree of parallelism and
>>> fault tolerance of these file systems also comes into play. I would be
>>> interested in hearing about any progress on this.
>>>
>>> Thanks
>>>
>>> Mich Talebzadeh,
>>>
>>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>>>
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> Disclaimer: The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed. It is essential to note
>>> that, as with any advice, "one test result is worth one-thousand
>>> expert opinions" (Wernher von Braun).
>>>
>>>
>>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>


Re: [DISCUSSION] SPIP: An Official Kubernetes Operator for Apache Spark

2024-03-20 Thread Vakaris Baškirov
Hi!
Just wanted to inquire about the status of the official operator. We are
looking forward to contributing and later on switching to a Spark Operator
and we would prefer it to be the official one.

Thanks,
Vakaris

On Thu, Nov 30, 2023 at 7:09 AM Shiqi Sun  wrote:

> Hi Zhou,
>
> Thanks for the reply. For the language choice, since I don't think I've
> used many k8s components written in Java, I can't really tell, but at
> least the components written in Golang are well-organized, easy to
> read/maintain, and run well in general. In addition, goroutines really
> ease things a lot when writing concurrent code. Golang also has a lot
> less boilerplate, no complicated inheritance, and easier dependency
> management and linting tooling. For all these reasons, I prefer Golang
> for this k8s operator. I understand the Spark maintainers are more
> familiar with JVM languages, but I think we should weigh performance and
> maintainability against the learning curve, to choose an option that can
> win in the long run. Plus, I believe most of the Spark maintainers who
> touch the k8s-related parts of the Spark project already have experience
> with Golang, so it shouldn't be a big problem. Our team had some
> experience with the fabric8 client a couple of years ago, and we ran into
> some issues with its reliability, mainly a request-dropping issue (i.e.
> a code call is made but the apiserver never receives the request), but
> that was a while ago and I'm not sure whether everything is good with the
> client now. Anyway, this is my opinion about the language choice, and I
> will let other people comment on it as well.
>
> For compatibility, yes, please make the CRD compatible from the user's
> standpoint, so that it's easy for people to adopt the new operator. The
> goal is to consolidate the many Spark operators on the market into this
> new official operator, so an easy adoption experience is key.
>
> Also, I feel that the discussion is pretty high level, because the only
> info revealed about this new operator is the SPIP doc and I haven't had a
> chance to see the code yet. I understand the new operator project might
> not be open-sourced yet, but is there any way for me to take an early peek
> at the code of your operator, so that we can discuss the points of
> language choice and compatibility more specifically? Thank you so much!
>
> Best,
> Shiqi
>
> On Tue, Nov 28, 2023 at 10:42 AM Zhou Jiang 
> wrote:
>
>> Hi Shiqi,
>>
>> Thanks for the cross-posting here - sorry for the response delay during
>> the holiday break :)
>> We prefer Java for the operator project as it's JVM-based and widely
>> familiar within the Spark community. This choice aims to facilitate better
>> adoption and ease of onboarding for future maintainers. In addition, the
>> Java Kubernetes API client can be considered a mature option, widely used
>> by Spark itself and by other operator implementations like Flink's.
>> For easier onboarding and potential migration, we'll consider
>> compatibility with existing CRD designs - the goal is to maintain
>> compatibility as well as possible while minimizing duplication of effort.
>> I'm enthusiastic about the idea of a lean, version-agnostic submission
>> worker. It aligns with one of the primary goals in the operator design.
>> Let's continue exploring this idea further in the design doc.
>>
>> Thanks,
>> Zhou
>>
>>
>> On Wed, Nov 22, 2023 at 3:35 PM Shiqi Sun  wrote:
>>
>>> Hi all,
>>>
>>> Sorry for being late to the party. I went through the SPIP doc and I
>>> think this is a great proposal! I left a comment in the SPIP doc a couple
>>> days ago, but I don't see much activity there and no one replied, so I
>>> wanted to cross-post it here to get some feedback.
>>>
>>> I'm Shiqi Sun, and I work on the Big Data Platform team at Salesforce. My
>>> team has been running the Spark on k8s operator (OSS from Google) in my
>>> company to serve Spark users in production for 4+ years, and we've been
>>> actively contributing to the Spark on k8s operator OSS and also,
>>> occasionally, the Spark OSS. In our experience, Google's Spark Operator
>>> has its own problems, like its close coupling with the Spark version, as
>>> well as the JVM overhead during job submission. On the other hand, it's
>>> been a great component in our team's service in the company; especially
>>> since it is written in Golang, it's really easy to have it interact with
>>> k8s, and its CRD covers a lot of different use cases, as it has been built
>>> up over time thanks to many users' contributions over the years. There
>>> were also a handful of Spark Summit sessions about Google's Spark Operator
>>> that helped it become widely adopted.
>>>
>>> For this SPIP, I really love the idea of this proposal for an official
>>> k8s operator for the Spark project, as well as the separate layer of the
>>> 

Re: [VOTE] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-14 Thread Vakaris Baškirov
+1 (non-binding)

On Tue, Nov 14, 2023 at 8:03 PM Chao Sun  wrote:

> +1
>
> On Tue, Nov 14, 2023 at 9:52 AM L. C. Hsieh  wrote:
> >
> > +1
> >
> > On Tue, Nov 14, 2023 at 9:46 AM Ye Zhou  wrote:
> > >
> > > +1(Non-binding)
> > >
> > > On Tue, Nov 14, 2023 at 9:42 AM L. C. Hsieh  wrote:
> > >>
> > >> Hi all,
> > >>
> > >> I’d like to start a vote for SPIP: An Official Kubernetes Operator for
> > >> Apache Spark.
> > >>
> > >> The proposal is to develop an official Java-based Kubernetes operator
> > >> for Apache Spark to automate the deployment and simplify the lifecycle
> > >> management and orchestration of Spark applications and Spark clusters
> > >> on k8s at prod scale.
> > >>
> > >> This aims to reduce the learning curve and operation overhead for
> > >> Spark users so they can concentrate on core Spark logic.
> > >>
> > >> Please also refer to:
> > >>
> > >>- Discussion thread:
> > >> https://lists.apache.org/thread/wdy7jfhf7m8jy74p6s0npjfd15ym5rxz
> > >>- JIRA ticket: https://issues.apache.org/jira/browse/SPARK-45923
> > >>- SPIP doc:
> https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE
> > >>
> > >>
> > >> Please vote on the SPIP for the next 72 hours:
> > >>
> > >> [ ] +1: Accept the proposal as an official SPIP
> > >> [ ] +0
> > >> [ ] -1: I don’t think this is a good idea because …
> > >>
> > >>
> > >> Thank you!
> > >>
> > >> Liang-Chi Hsieh
> > >>
> > >>
> > >
> > >
> > > --
> > >
> > > Zhou, Ye  周晔
> >
> >
>
>
>