Re: [DISCUSSION] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-13 Thread L. C. Hsieh
Thanks for all the support from the community for the SPIP proposal.

Since all questions and discussion have settled down (if I didn't miss
any major ones), and if there are no more questions or concerns, I'll act
as the shepherd for this SPIP proposal and call for a vote tomorrow.

Thank you all!


Spark Docker Official image (Java 17) coming soon

2023-11-13 Thread Yikun Jiang
We added Java 17 support for the Apache Spark Docker Official Image at [1]
(thanks to @vakarisbk for the efforts).

Once [2] is merged, the first Java 17 series Docker Official Images will
be available.

You can also try the GHCR test images:

all-in-one image: ghcr.io/apache/spark-docker/spark:3.5.0-scala2.12-java17-python3-r-ubuntu
python3 image:    ghcr.io/apache/spark-docker/spark:3.5.0-scala2.12-java17-python3-ubuntu
r image:          ghcr.io/apache/spark-docker/spark:3.5.0-scala2.12-java17-r-ubuntu
scala image:      ghcr.io/apache/spark-docker/spark:3.5.0-scala2.12-java17-ubuntu
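
For a quick local sanity check (assuming a local Docker installation; the
spark-submit path below follows the apache/spark image layout), something
like the following should print the Spark, Scala, and Java versions the
image was built with:

  docker run --rm ghcr.io/apache/spark-docker/spark:3.5.0-scala2.12-java17-ubuntu \
    /opt/spark/bin/spark-submit --version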

Please let us know if you have any other questions or concerns;
otherwise, the images will be published soon [2].

[1] https://github.com/apache/spark-docker/pull/56
[2] https://github.com/docker-library/official-images/pull/15697

Regards,
Yikun


Re: [DISCUSSION] SPIP: An Official Kubernetes Operator for Apache Spark

2023-11-13 Thread Zhou Jiang
Hi Holden,

Thanks a lot for your feedback!
Yes, this proposal attempts to integrate lessons from existing solutions,
especially from the CRD perspective. The proposed schema retains
similarity with current designs, while reducing duplication and keeping
conf properties as the single source of truth. It also aims to stay close
to native k8s integration, to minimize schema changes for new features.
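
As a rough illustration of that direction (a hypothetical sketch only;
the API group, version, and field names below are placeholders, not the
proposed schema), a Fabric8-style CRD POJO could keep Spark conf
properties as the single source of truth:

  import java.util.Map;
  import io.fabric8.kubernetes.api.model.Namespaced;
  import io.fabric8.kubernetes.client.CustomResource;
  import io.fabric8.kubernetes.model.annotation.Group;
  import io.fabric8.kubernetes.model.annotation.Version;

  // Hypothetical spec: tunables live in sparkConf instead of duplicated
  // typed fields, so spark.* conf stays the single source of truth.
  class SparkApplicationSpec {
    public String mainApplicationFile;     // e.g. local:///opt/app/app.jar
    public Map<String, String> sparkConf;  // spark.* properties, passed through
  }

  class SparkApplicationStatus {
    public String state;                   // e.g. SUBMITTED, RUNNING, COMPLETED
  }

  @Group("spark.apache.org")               // placeholder group
  @Version("v1alpha1")                     // placeholder version
  public class SparkApplication
      extends CustomResource<SparkApplicationSpec, SparkApplicationStatus>
      implements Namespaced {}
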
For dependencies, packaging everything into the image is the easiest way
to get started. It would be straightforward to add --packages and
--repositories support for Maven dependencies. It's technically possible
to pull dependencies from cloud storage in init containers (if defined by
the user), though it could be tricky to design a general solution that
supports different cloud providers at the operator layer. One enhancement
I can think of is support for profile scripts that enable additional
user-defined actions in application containers.
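
For example (a sketch only, assuming Fabric8's builder API; the fetcher
image, bucket, and paths are made up for illustration), a user-defined
init container could stage cloud-storage jars into a volume shared with
the driver container:

  import io.fabric8.kubernetes.api.model.Container;
  import io.fabric8.kubernetes.api.model.ContainerBuilder;
  import io.fabric8.kubernetes.api.model.PodSpec;
  import io.fabric8.kubernetes.api.model.PodSpecBuilder;

  public class DepStaging {
    // Builds a driver pod spec whose init container copies dependencies
    // from cloud storage into an emptyDir before Spark starts.
    static PodSpec driverPodSpecWithDeps() {
      Container fetchDeps = new ContainerBuilder()
          .withName("fetch-deps")
          .withImage("example.com/dep-fetcher:latest")  // hypothetical image
          .withArgs("cp", "s3://my-bucket/libs/", "/opt/spark/deps/")
          .addNewVolumeMount()
            .withName("deps").withMountPath("/opt/spark/deps")
          .endVolumeMount()
          .build();

      return new PodSpecBuilder()
          .addToInitContainers(fetchDeps)
          .addNewVolume().withName("deps").withNewEmptyDir().endEmptyDir().endVolume()
          .build();
    }
  }
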
The operator does not have to build everything itself for k8s version
compatibility. Like Spark, the operator can be built on the Fabric8
client (https://github.com/fabric8io/kubernetes-client) for cross-version
support, given that it makes similar API calls for resource management as
Spark does. For tests, in addition to the Fabric8 mock server, we may
also borrow the idea from the Flink operator and start a minikube cluster
for integration tests.
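
As a concrete example of the mock-server approach (a hedged sketch;
assumes Fabric8's JUnit 5 extension from the kubernetes-server-mock
artifact, and the pod name is illustrative):

  import static org.junit.jupiter.api.Assertions.assertNotNull;

  import io.fabric8.kubernetes.api.model.Pod;
  import io.fabric8.kubernetes.api.model.PodBuilder;
  import io.fabric8.kubernetes.client.KubernetesClient;
  import io.fabric8.kubernetes.client.server.mock.EnableKubernetesMockClient;
  import org.junit.jupiter.api.Test;

  // CRUD mode emulates a tiny API server in memory; no real cluster needed.
  @EnableKubernetesMockClient(crud = true)
  class DriverPodLifecycleTest {

    KubernetesClient client;  // injected by the mock-client extension

    @Test
    void operatorCanCreateAndFetchDriverPod() {
      Pod driver = new PodBuilder()
          .withNewMetadata().withName("spark-pi-driver").withNamespace("spark").endMetadata()
          .build();
      client.pods().inNamespace("spark").resource(driver).create();
      assertNotNull(client.pods().inNamespace("spark").withName("spark-pi-driver").get());
    }
  }
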
This operator is not starting from scratch: it is derived from an
internal project that has been running at production scale for a few
years. It aims to include a few new features/enhancements and some
re-architecture, mostly to incorporate lessons learned in CRD and API
design.
Benchmarking operator performance alone can be nuanced, as results are
often tied to the underlying cluster. There's a testing strategy that
Aaruna and I discussed at a previous Data + AI Summit: scheduling "wide"
(massive numbers of lightweight applications) and "deep" (a single
application requesting many executors with heavy I/O) cases, which
reveals typical bottlenecks at the k8s API server and in scheduler
performance. Similar tests can be performed here as well.
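
A minimal sketch of the "wide" case (hypothetical; assumes a Fabric8
typed client plus a SparkApplication resource class like the one sketched
earlier, and the namespace and counts are arbitrary):

  import java.util.Map;
  import io.fabric8.kubernetes.client.KubernetesClient;
  import io.fabric8.kubernetes.client.KubernetesClientBuilder;

  public class WideLoadBenchmark {
    public static void main(String[] args) {
      try (KubernetesClient client = new KubernetesClientBuilder().build()) {
        // "Wide": many lightweight applications at once, to stress the k8s
        // API server and scheduler rather than any single application.
        for (int i = 0; i < 500; i++) {   // count is arbitrary for illustration
          SparkApplication app = new SparkApplication();
          app.getMetadata().setName("wide-app-" + i);
          SparkApplicationSpec spec = new SparkApplicationSpec();
          spec.sparkConf = Map.of("spark.executor.instances", "1");
          app.setSpec(spec);
          client.resources(SparkApplication.class)
              .inNamespace("bench").resource(app).create();
        }
        // The "deep" case would instead submit one application with a large
        // spark.executor.instances value and heavy I/O in the job itself.
      }
    }
  }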

On Sun, Nov 12, 2023 at 4:32 PM Holden Karau  wrote:

> To be clear: I am generally supportive of the idea (+1) but have some
> follow-up questions:
>
> Have we taken the time to learn from the other operators? Do we have a
> compatible CRD/API or not (and if so why?)
> The API seems to assume that everything is packaged in the container in
> advance, but I imagine that might not be the case for many folks who have
> Java or Python packages published to cloud storage that they want to use?
> What's our plan for testing the potential version explosion (not
> tying ourselves to operator version -> spark version makes a lot of sense,
> but how do we reasonably assure ourselves that the cross product of
> Operator Version, Kube Version, and Spark Version all function)? Do we have
> CI resources for this?
> Is there a current (non-open source operator) that folks from Apple are
> using and planning to open source, or is this a fresh "from the ground up"
> operator proposal?
> One of the key reasons for this is listed as "An out-of-the-box automation
> solution that scales effectively" but I don't see any discussion of the
> target scale or plans to achieve it?
>
>
>
> On Thu, Nov 9, 2023 at 9:02 PM Zhou Jiang  wrote:
>
>> Hi Spark community,
>>
>> I'm reaching out to initiate a conversation about the possibility of
>> developing a Java-based Kubernetes operator for Apache Spark. Following the
>> operator pattern (
>> https://kubernetes.io/docs/concepts/extend-kubernetes/operator/), Spark
>> users may manage applications and related components seamlessly using
>> native tools like kubectl. The primary goal is to simplify the Spark user
>> experience on Kubernetes, minimizing the learning curve and operational
>> complexities, thereby enabling users to focus on Spark application
>> development.
>>
>> Although there are several open-source Spark on Kubernetes operators
>> available, none of them are officially integrated into the Apache Spark
>> project. As a result, these operators may lack active support and
>> development for new features. Within this proposal, our aim is to introduce
>> a Java-based Spark operator as an integral component of the Apache Spark
>> project. This solution has been employed internally at Apple for multiple
>> years, operating millions of executors in real production environments. The
>> use of Java in this solution is intended to accommodate a wider user and
>> contributor audience, especially those who are not familiar with Scala.
>>
>> Ideally, this operator should have its dedicated repository, similar to
>> Spark Connect Golang or Spark Docker, allowing it to maintain a loose
>> connection with the Spark release cycle. This model is also followed by the
>> Apache Flink Kubernetes operator.
>>
>> We believe that this project holds the potential to evolve into a
>> thriving community project over the long run. A