Hi! Just wanted to inquire about the status of the official operator. We are looking forward to contributing, and later switching to a Spark operator, and we would prefer it to be the official one.
Thanks,
Vakaris

On Thu, Nov 30, 2023 at 7:09 AM Shiqi Sun <jack.sun...@gmail.com> wrote:

> Hi Zhou,
>
> Thanks for the reply. For the language choice, since I don't think I've used many k8s components written in Java on k8s, I can't really tell, but at least the components written in Golang are well organized, easy to read and maintain, and run well in general. In addition, goroutines really ease things a lot when writing concurrent code. Golang also has much less boilerplate, no complicated inheritance, and simpler dependency management and linting tooling. Taken together, that's why I prefer Golang for this k8s operator. I understand the Spark maintainers are more familiar with JVM languages, but I think we should weigh performance and maintainability against the learning curve, and choose the option that wins in the long run. Plus, I believe most of the Spark maintainers who touch the k8s-related parts of the Spark project already have experience with Golang, so it shouldn't be a big problem. Our team had some experience with the fabric8 client a couple of years ago, and we experienced some reliability issues with it, mainly the request-dropping issue (i.e. the code call is made but the apiserver never receives the request), but that was a while ago and I'm not sure whether everything is good with the client now. Anyway, this is my opinion about the language choice, and I will let other people comment on it as well.
>
> For compatibility, yes, please make the CRD compatible from the user's standpoint, so that it's easy for people to adopt the new operator. The goal is to consolidate the many Spark operators on the market into this new official operator, so an easy adoption experience is key.
>
> Also, I feel that the discussion is pretty high level, and that's because the only info revealed about this new operator is the SPIP doc, and I haven't had a chance to see the code yet.
> I understand the new operator project might not be open-sourced yet, but is there any way for me to take an early peek at the code of your operator, so that we can discuss the language choice and compatibility points more specifically? Thank you so much!
>
> Best,
> Shiqi
>
> On Tue, Nov 28, 2023 at 10:42 AM Zhou Jiang <zhou.c.ji...@gmail.com> wrote:
>
>> Hi Shiqi,
>>
>> Thanks for the cross-posting here - sorry for the response delay during the holiday break :)
>> We prefer Java for the operator project as it's JVM-based and widely familiar within the Spark community. This choice aims to facilitate better adoption and ease of onboarding for future maintainers. In addition, the Java API client can be considered a mature, widely used option, used by Spark itself and by other operator implementations like Flink's.
>> For easier onboarding and potential migration, we'll consider compatibility with existing CRD designs - the goal is to maintain compatibility as much as possible while minimizing duplicated effort.
>> I'm enthusiastic about the idea of a lean, version-agnostic submission worker. It aligns with one of the primary goals of the operator design. Let's continue exploring this idea further in the design doc.
>>
>> Thanks,
>> Zhou
>>
>> On Wed, Nov 22, 2023 at 3:35 PM Shiqi Sun <jack.sun...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> Sorry for being late to the party. I went through the SPIP doc and I think this is a great proposal! I left a comment in the SPIP doc a couple of days ago, but I don't see much activity there and no one replied, so I wanted to cross-post it here to get some feedback.
>>>
>>> I'm Shiqi Sun, and I work on the Big Data Platform at Salesforce.
>>> My team has been running the Spark on k8s operator <https://github.com/GoogleCloudPlatform/spark-on-k8s-operator> (OSS from Google) in my company, serving Spark users in production for 4+ years, and we've been actively contributing to the Spark on k8s operator OSS and also, occasionally, to Spark OSS. In our experience, Google's Spark operator has its own problems, like its tight coupling with the Spark version, as well as the JVM overhead during job submission. On the other hand, it's been a great component of our team's service in the company: being written in Golang, it's really easy to have it interact with k8s, and its CRD covers a lot of different use cases, as it has been built up over time thanks to many users' contributions over the years. There were also a handful of Spark Summit sessions on Google's Spark operator that helped it become widely adopted.
>>>
>>> For this SPIP, I really love the idea of an official k8s operator for the Spark project, as well as the separate submission-worker layer and being Spark-version agnostic. I think we can get the best of both:
>>> 1. I would advocate that the new project still use Golang for the implementation, as Golang is the go-to cloud-native language that works best with k8s.
>>> 2. We should make sure the functionality of the current Google Spark operator CRD is preserved in the new official Spark operator; if we can make it compatible, or even merge the two projects into the new official operator in the Spark project, that would be best.
>>> 3. The new Spark operator should continue being Spark-version agnostic and keep this lightweight, separate submission-worker layer. We've seen scalability issues caused by the heavy JVM during spark-submit in Google's Spark operator, and we implemented an internal fix for it within our company.
>>>
>>> We can continue the discussion in more detail, but generally I love this move toward an official Spark operator, and I really appreciate the effort! In the SPIP doc, I see my comment has gained several upvotes from people I don't know, so I believe there are other Spark / Spark operator users who agree with some of my points. Let me know what you all think and let's continue the discussion, so that we can make this operator a great new component of the open-source Spark project!
>>>
>>> Thanks!
>>>
>>> Shiqi
>>>
>>> On Mon, Nov 13, 2023 at 11:50 PM L. C. Hsieh <vii...@gmail.com> wrote:
>>>
>>>> Thanks for all the support from the community for the SPIP proposal.
>>>>
>>>> Since all questions/discussion have settled down (if I didn't miss any major ones), and if there are no more questions or concerns, I'll be the shepherd for this SPIP proposal and call for a vote tomorrow.
>>>>
>>>> Thank you all!
>>>>
>>>> On Mon, Nov 13, 2023 at 6:43 PM Zhou Jiang <zhou.c.ji...@gmail.com> wrote:
>>>> >
>>>> > Hi Holden,
>>>> >
>>>> > Thanks a lot for your feedback!
>>>> > Yes, this proposal attempts to integrate existing solutions, especially from the CRD perspective. The proposed schema retains similarity with current designs, while reducing duplication and maintaining a single source of truth from conf properties. It also aims to stay close to native k8s integration, to minimize schema changes for new features.
>>>> > For dependencies, packing everything is the easiest way to get started. It would be straightforward to add --packages and --repositories support for Maven dependencies. It's technically possible to pull dependencies in cloud storage from init containers (if defined by the user). It could be tricky to design a general solution that supports different cloud providers at the operator layer.
>>>> > An enhancement that I can think of is to add support for profile scripts that can enable additional user-defined actions in application containers.
>>>> > The operator does not have to build everything for k8s version compatibility. Similar to Spark, the operator can be built on the Fabric8 client (https://github.com/fabric8io/kubernetes-client) for support across versions, given that it makes similar API calls for resource management as Spark. For tests, in addition to the fabric8 mock server, we may also borrow the idea from the Flink operator and start a minikube cluster for integration tests.
>>>> > This operator is not starting from scratch, as it is derived from an internal project which has been working at prod scale for a few years. It aims to include a few new features and enhancements, and some re-architecture, mostly to incorporate lessons learned in CRD / API design.
>>>> > Benchmarking operator performance alone can be nuanced, as it's often tied to the underlying cluster. There's a testing strategy that Aaruna and I discussed at a previous Data + AI Summit, involving scheduling wide (massive numbers of lightweight applications) and deep (a single application requesting a lot of executors with heavy IO) cases, revealing typical bottlenecks at the k8s API server and in scheduler performance. Similar tests can be performed for this as well.
>>>> >
>>>> > On Sun, Nov 12, 2023 at 4:32 PM Holden Karau <hol...@pigscanfly.ca> wrote:
>>>> >>
>>>> >> To be clear: I am generally supportive of the idea (+1) but have some follow-up questions:
>>>> >>
>>>> >> Have we taken the time to learn from the other operators? Do we have a compatible CRD/API or not (and if so, why)?
>>>> >> The API seems to assume that everything is packaged in the container in advance, but I imagine that might not be the case for many folks who have Java or Python packages published to cloud storage that they want to use?
>>>> >> What's our plan for testing the potential version explosion (not tying ourselves to operator version -> Spark version makes a lot of sense, but how do we reasonably assure ourselves that the cross product of operator version, Kube version, and Spark version all function)? Do we have CI resources for this?
>>>> >> Is there a current (non-open-source) operator that folks from Apple are using and planning to open source, or is this a fresh "from the ground up" operator proposal?
>>>> >> One of the key reasons for this is listed as "An out-of-the-box automation solution that scales effectively", but I don't see any discussion of the target scale or plans to achieve it?
>>>> >>
>>>> >> On Thu, Nov 9, 2023 at 9:02 PM Zhou Jiang <zhou.c.ji...@gmail.com> wrote:
>>>> >>>
>>>> >>> Hi Spark community,
>>>> >>>
>>>> >>> I'm reaching out to initiate a conversation about the possibility of developing a Java-based Kubernetes operator for Apache Spark. Following the operator pattern (https://kubernetes.io/docs/concepts/extend-kubernetes/operator/), Spark users could manage applications and related components seamlessly using native tools like kubectl. The primary goal is to simplify the Spark user experience on Kubernetes, minimizing the learning curve and operational complexity, and thereby enabling users to focus on Spark application development.
>>>> >>>
>>>> >>> Although there are several open-source Spark-on-Kubernetes operators available, none of them is officially integrated into the Apache Spark project. As a result, these operators may lack active support and development of new features.
>>>> >>> Within this proposal, our aim is to introduce a Java-based Spark operator as an integral component of the Apache Spark project. This solution has been employed internally at Apple for multiple years, operating millions of executors in real production environments. The use of Java in this solution is intended to accommodate a wider user and contributor audience, especially those who are familiar with Scala.
>>>> >>>
>>>> >>> Ideally, this operator should have its own dedicated repository, similar to Spark Connect Golang or Spark Docker, allowing it to maintain a loose connection with the Spark release cycle. This model is also followed by the Apache Flink Kubernetes operator.
>>>> >>>
>>>> >>> We believe that this project holds the potential to evolve into a thriving community project over the long run. A comparison can be drawn with the Flink Kubernetes operator: Apple open-sourced its internal Flink Kubernetes operator, making it a part of the Apache Flink project (https://github.com/apache/flink-kubernetes-operator). This move has gained wide industry adoption and contributions from the community. In a mere year, the Flink operator has garnered more than 600 stars and attracted contributions from over 80 contributors. This showcases the level of community interest and collaborative momentum that can be achieved in similar scenarios.
>>>> >>>
>>>> >>> More details can be found in the SPIP doc: Spark Kubernetes Operator
>>>> >>> https://docs.google.com/document/d/1f5mm9VpSKeWC72Y9IiKN2jbBn32rHxjWKUfLRaGEcLE
>>>> >>>
>>>> >>> Thanks,
>>>> >>>
>>>> >>> --
>>>> >>> Zhou JIANG
>>>> >>
>>>> >>
>>>> >> --
>>>> >> Twitter: https://twitter.com/holdenkarau
>>>> >> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
>>>> >> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>>> >
>>>> >
>>>> > --
>>>> > Zhou JIANG
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org

>> --
>> *Zhou JIANG*
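[Editorial note: the operator pattern discussed throughout this thread means a Spark job becomes a declarative custom resource that users manage with kubectl. As a purely illustrative sketch, since the proposed CRD schema was not public at the time of this thread, such a resource might look like the following; the API group, kind, and field names are hypothetical and do not reflect the final operator CRD.]

```yaml
# Hypothetical SparkApplication custom resource; all field names are
# illustrative only, not taken from the proposed operator's schema.
apiVersion: spark.apache.org/v1alpha1
kind: SparkApplication
metadata:
  name: pi-example
  namespace: spark-jobs
spec:
  mainClass: org.apache.spark.examples.SparkPi
  jars: local:///opt/spark/examples/jars/spark-examples.jar
  sparkConf:
    spark.executor.instances: "2"
    spark.kubernetes.container.image: "apache/spark:latest"
```

Users would then submit and inspect jobs with native tooling, e.g. `kubectl apply -f pi-example.yaml` and `kubectl get sparkapplications -n spark-jobs`, which is the "manage applications using native tools like kubectl" experience the proposal describes.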