Ready for Review: spark-kubernetes-operator Alpha Release

2024-04-02 Thread Zhou Jiang
Hi dev members,

I am writing to let you know that the first pull request has been raised
against the newly established spark-kubernetes-operator, as previously
discussed within the group. This PR includes the alpha release of the
project.

https://github.com/apache/spark-kubernetes-operator/pull/2

Here are some key highlights of the PR:
* Introduction of the alpha version of spark-kubernetes-operator.
* Start and stop Spark apps with a simple YAML schema.
* Deploy and monitor SparkApplications throughout their lifecycle.
* Version-agnostic for Spark 3.2 and above.
* Full logging and metrics integration.
* Flexible deployments and native integration with Kubernetes tooling.
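To make the "simple YAML schema" point concrete, here is a purely hypothetical sketch of what submitting an app declaratively could look like; the field names and API group below are illustrative assumptions, and the actual CRD schema is defined in the PR itself:

```yaml
# Hypothetical illustration only -- consult the PR for the real
# SparkApplication CRD fields; apiVersion, spec keys, and values
# here are assumed, not taken from the operator.
apiVersion: spark.apache.org/v1alpha1
kind: SparkApplication
metadata:
  name: pi-example
spec:
  mainClass: org.apache.spark.examples.SparkPi
  jars: local:///opt/spark/examples/jars/spark-examples.jar
  sparkConf:
    spark.executor.instances: "2"
```

The appeal of this style is that `kubectl apply` / `kubectl delete` then start and stop the app like any other Kubernetes resource.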

To facilitate the review process, we have provided detailed documentation
and comments within the PR.

This PR also includes contributions from Qi Tan, Shruti Gumma, Nishchal
Venkataramana and Swami Jayaraman, whose efforts have been instrumental in
reaching this stage of the project.

We are currently in the phase of actively developing and refining the
project. This includes extensive testing across diverse workloads and the
integration of additional test frameworks to ensure the robustness and
reliability of Spark applications. We are calling for reviews and input on
this PR. Please feel free to provide any suggestions, concerns, or feedback
that could help improve the quality and functionality of the project. We
look forward to hearing from you.

-- 
*Zhou JIANG*


Re: Scheduling jobs using FAIR pool

2024-04-02 Thread Varun Shah
Hi Hussein,

Thanks for clarifying my doubts.

This means that even if I configure two separate pools for the two jobs, or
submit both jobs to the same pool, the submission order only takes effect
when both jobs are running in parallel (i.e., if job 1 takes all the
resources, job 2 has to wait unless pool 2 has been assigned a minimum
number of executors).

However, with separate pools (small, statically defined pools being
preferable to dynamic ones), you get more control, such as assigning
weights to jobs when multiple jobs are competing for resources, and setting
a minimum number of executors for each pool.
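The static-pool setup described above can be sketched as follows. The `fairscheduler.xml` format and the `spark.scheduler.*` properties are standard Spark; the pool names, weights, and minShare values here are illustrative assumptions:

```python
# Build a fairscheduler.xml with two static pools. Pool names and
# values are illustrative; the XML format is Spark's standard one.
import xml.etree.ElementTree as ET

pools = {
    "etl":   {"schedulingMode": "FAIR", "weight": "2", "minShare": "4"},
    "adhoc": {"schedulingMode": "FAIR", "weight": "1", "minShare": "2"},
}

allocations = ET.Element("allocations")
for name, props in pools.items():
    pool = ET.SubElement(allocations, "pool", name=name)
    for key, value in props.items():
        ET.SubElement(pool, key).text = value

ET.ElementTree(allocations).write("fairscheduler.xml")

# To wire it up (requires a running SparkSession; shown as comments):
#   spark.scheduler.mode=FAIR
#   spark.scheduler.allocation.file=fairscheduler.xml
# and in each submitting thread:
#   sc.setLocalProperty("spark.scheduler.pool", "etl")
```

Jobs submitted from a thread inherit that thread's pool property, which is what gives per-pool weights and minimum shares their effect.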

Regards,
Varun Shah


On Mon, Apr 1, 2024, 18:50 Hussein Awala  wrote:

> IMO the questions are not limited to Databricks.
>
> > The Round-Robin distribution of executors only works in case of empty
> > executors (achievable by enabling dynamic allocation). In case the jobs
> > (part of the same pool) require all executors, the second job will still
> > need to wait.
>
> This feature in Spark allows for optimal resource utilization. Consider a
> scenario with two stages, each with 500 tasks (500 partitions), generated
> by two threads, and a total of 100 Spark executors available in the fair
> pool.
> The first thread may be instantiated microseconds ahead of the second,
> resulting in the fair scheduler allocating 100 tasks to the first stage
> initially. Once some of the tasks are complete, the scheduler dynamically
> redistributes resources, ultimately splitting the capacity equally between
> both stages. This will work in the same way if you have a single stage but
> without splitting the capacity.
>
> Regarding the other three questions, dynamically creating pools may not be
> advisable due to several considerations (cleanup issues, mixing application
> and infrastructure management, + a lot of unexpected issues).
>
> For scenarios involving stages with few long-running tasks like yours,
> it's recommended to enable dynamic allocation to let Spark add executors as
> needed.
>
> In the context of streaming workloads, streaming dynamic allocation is
> preferred to address the specific issues detailed in SPARK-12133. Although
> the configurations for this feature are not documented, they can be found
> in the source code.
> But for Structured Streaming (your case), you should use the batch one
> (spark.dynamicAllocation.*), as SPARK-24815
> is not ready yet (it was accepted and will be ready soon). The batch one
> does have some issues in the downscale step; you can check the JIRA issue
> for more details.
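The round-robin redistribution described above (two 500-task stages sharing 100 executor slots) can be caricatured with a deliberately simplified, Spark-free toy model; this is not the actual scheduler logic, just an illustration of how freed slots converge to a fair split:

```python
# Toy model: stage A was submitted first and grabbed all 100 slots.
# Each time a task finishes, the freed slot goes to the stage furthest
# below its fair share, converging toward a 50/50 split.
running = {"A": 100, "B": 0}

for _ in range(50):
    done = max(running, key=running.get)   # a task of the busiest stage completes
    running[done] -= 1
    fair = min(running, key=running.get)   # scheduler hands the freed slot to the
    running[fair] += 1                     # stage with the smallest current share

print(running)  # {'A': 50, 'B': 50}
```

With a single stage in the pool, the same mechanism simply hands every freed slot back to that stage, which matches the "same way but without splitting the capacity" remark.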
>
> On Mon, Apr 1, 2024 at 2:07 PM Varun Shah 
> wrote:
>
>> Hi Mich,
>>
>> I did not post in the databricks community, as most of the questions were
>> related to spark itself.
>>
>> But let me also post the question on databricks community.
>>
>> Thanks,
>> Varun Shah
>>
>> On Mon, Apr 1, 2024, 16:28 Mich Talebzadeh 
>> wrote:
>>
>>> Hi,
>>>
>>> Have you put this question to the Databricks forum?
>>>
>>> Data Engineering - Databricks
>>>
>>>
>>> Mich Talebzadeh,
>>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>>> London
>>> United Kingdom
>>>
>>>
>>> View my LinkedIn profile
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed. It is essential to note
>>> that, as with any advice: "one test result is worth one-thousand expert
>>> opinions" (Werner von Braun).
>>>
>>>
>>> On Mon, 1 Apr 2024 at 07:22, Varun Shah 
>>> wrote:
>>>
 Hi Community,

 I am currently exploring the best use of "Scheduler Pools" for
 executing jobs in parallel, and require clarification and suggestions on a
 few points.

 The implementation consists of executing "Structured Streaming" jobs on
 Databricks using AutoLoader. Each stream is executed with trigger =
 'AvailableNow', ensuring that the streams don't keep running indefinitely
 (we have ~4000 such streams, with no continuous flow from the source,
 hence we avoid keeping the streams running with the other triggers).

 One way to achieve parallelism in the jobs is to use "MultiThreading",
 all using the same SparkContext, as quoted from the official docs: "Inside
 a given Spark application (SparkContext instance), multiple parallel jobs
 can run simultaneously if they were submitted from separate threads."
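The multithreaded-submission pattern quoted above can be sketched with a thread pool; `run_stream` is a placeholder assumption standing in for one AvailableNow streaming query on the shared SparkSession:

```python
# One driver, jobs submitted from separate threads so Spark can run
# them concurrently. `run_stream` is a stand-in: with Spark it would
# set the scheduler pool, start the AvailableNow query, and await it.
from concurrent.futures import ThreadPoolExecutor

def run_stream(stream_name):
    # Placeholder for, e.g.:
    #   sc.setLocalProperty("spark.scheduler.pool", "etl")
    #   q = (spark.readStream...
    #        .writeStream.trigger(availableNow=True).start())
    #   q.awaitTermination()
    return f"{stream_name}: done"

streams = [f"stream_{i}" for i in range(8)]
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(run_stream, streams))

print(results[0])  # stream_0: done
```

Bounding `max_workers` is the usual way to cap how many of the ~4000 streams compete for the cluster at once.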

 There's also the availability of a "FAIR Scheduler", which, instead of the
 FIFO Scheduler (default), assigns executors in 

Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-02 Thread Tom Graves
 +1
Tom

On Sunday, March 31, 2024 at 10:09:28 PM CDT, Ruifeng Zheng wrote:

 +1

On Mon, Apr 1, 2024 at 10:06 AM Haejoon Lee 
 wrote:

+1

On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon  wrote:

Hi all,

I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark
Connect)

JIRA
Prototype
SPIP doc

Please vote on the SPIP for the next 72 hours:

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …

Thanks.



-- 
Ruifeng Zheng
E-mail: zrfli...@gmail.com
  

Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-02 Thread Hyukjin Kwon
Yes

On Tue, Apr 2, 2024 at 6:36 PM Femi Anthony  wrote:

> So, to clarify - the purpose of this package is to enable connectivity to
> a remote Spark cluster without having to install any local JVM
> dependencies, right ?
>
> Sent from my iPhone
>
> On Mar 31, 2024, at 10:07 PM, Haejoon Lee
>  wrote:
>
> 
>
> +1
>
> On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark
>> Connect)
>>
>> JIRA 
>> Prototype 
>> SPIP doc
>> 
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Thanks.
>>
>


Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-02 Thread Femi Anthony
So, to clarify - the purpose of this package is to enable connectivity to a
remote Spark cluster without having to install any local JVM dependencies,
right ?

Sent from my iPhone

On Mar 31, 2024, at 10:07 PM, Haejoon Lee  wrote:

+1

On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon  wrote:

Hi all,

I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark
Connect)

JIRA
Prototype
SPIP doc

Please vote on the SPIP for the next 72 hours:

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …

Thanks.