Re: SPIP: Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-09 Thread Holden Karau
I like the idea of improving the flexibility of Spark's physical plans, and
really anything that might reduce code duplication among the ~4 or so
different accelerators.

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Tue, Apr 9, 2024 at 3:14 AM Dongjoon Hyun 
wrote:

> Thank you for sharing, Jia.
>
> I have the same questions as in Weiting's previous thread.
>
> Do you think you can share the future milestone of Apache Gluten?
> I'm wondering when the first stable release will come and how we can
> coordinate across the ASF communities.
>
> > This project is still under active development now, and doesn't have a
> stable release.
> > https://github.com/apache/incubator-gluten/releases/tag/v1.1.1
>
> In the Apache Spark community, Apache Spark 3.2 and 3.3 have reached the end of
> support.
> And 3.4 will have 3.4.3 next week, with 3.4.4 (another EOL release)
> scheduled for October.
>
> For the SPIP, I guess it's applicable to Apache Spark 4.0.0 only if there
> is something we need to do from the Spark side.
>
+1 I think any changes need to target 4.0

>
> Thanks,
> Dongjoon.
>
>
> On Tue, Apr 9, 2024 at 12:22 AM Ke Jia  wrote:
>
>> Apache Spark currently lacks an official mechanism to support
>> cross-platform execution of physical plans. The Gluten project offers a
>> mechanism that utilizes the Substrait standard to convert and optimize
>> Spark's physical plans. By introducing Gluten's plan conversion,
>> validation, and fallback mechanisms into Spark, we can significantly
>> enhance the portability and interoperability of Spark's physical plans,
>> enabling them to operate across a broader spectrum of execution
>> environments without requiring users to migrate, while also improving
>> Spark's execution efficiency through the utilization of Gluten's advanced
>> optimization techniques. And the integration of Gluten into Spark has
>> already shown significant performance improvements with ClickHouse and
>> Velox backends and has been successfully deployed in production by several
>> customers.
>>
>> References:
>> JIRA Ticket 
>> SPIP Doc
>> 
>>
>> Your feedback and comments are welcome and appreciated.  Thanks.
>>
>> Thanks,
>> Jia Ke
>>
>


Re: Versioning of Spark Operator

2024-04-09 Thread L. C. Hsieh
For Spark Operator, I think the answer is yes. My impression is that
the Spark Operator should be Spark version-agnostic. Zhou,
please correct me if I'm wrong.
I am not sure about the Spark Connect Go client, but if it is going
to talk to a Spark cluster, I guess it is still tied to the Spark
version (there is a compatibility issue).


> On 2024/04/09 21:35:45 bo yang wrote:
> > Thanks Liang-Chi for the Spark Operator work, and also the discussion here!
> >
> > For Spark Operator and Connector Go Client, I am guessing they need to
> > support multiple versions of Spark? e.g. same Spark Operator may support
> > running multiple versions of Spark, and Connector Go Client might support
> > multiple versions of Spark driver as well.
> >
> > How do people think of using the minimum supported Spark version as the
> > version name for Spark Operator and Connector Go Client? For example,
> > Spark Operator 3.5.x supports Spark 3.5 and above.
> >
> > Best,
> > Bo
> >
> >
> > On Tue, Apr 9, 2024 at 10:14 AM Dongjoon Hyun  wrote:
> >
> > > Ya, that's simple and possible.
> > >
> > > However, it may cause many confusions because it implies that new `Spark
> > > K8s Operator 4.0.0` and `Spark Connect Go 4.0.0` follow the same `Semantic
> > > Versioning` policy like Apache Spark 4.0.0.
> > >
> > > In addition, `Versioning` is directly related to the Release Cadence. It's
> > > unlikely for us to have `Spark K8s Operator` and `Spark Connect Go`
> > > releases at every Apache Spark maintenance release. For example, there is
> > > no commit in Spark Connect Go repository.
> > >
> > > I believe the versioning and release cadence is related to those
> > > subprojects' maturity more.
> > >
> > > Dongjoon.
> > >
> > > On 2024/04/09 16:59:40 DB Tsai wrote:
> > > >  Aligning with Spark releases is sensible, as it allows us to guarantee
> > > that the Spark operator functions correctly with the new version while 
> > > also
> > > maintaining support for previous versions.
> > > >
> > > > DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
> > > >
> > > > > On Apr 9, 2024, at 9:45 AM, Mridul Muralidharan 
> > > wrote:
> > > > >
> > > > >
> > > > >   I am trying to understand if we can simply align with Spark's
> > > version for this ?
> > > > > Makes the release and jira management much more simpler for developers
> > > and intuitive for users.
> > > > >
> > > > > Regards,
> > > > > Mridul
> > > > >
> > > > >
> > > > > On Tue, Apr 9, 2024 at 10:09 AM Dongjoon Hyun  > > > wrote:
> > > > >> Hi, Liang-Chi.
> > > > >>
> > > > >> Thank you for leading Apache Spark K8s operator as a shepherd.
> > > > >>
> > > > >> I took a look at `Apache Spark Connect Go` repo mentioned in the
> > > thread. Sadly, there is no release at all and no activity since last 6
> > > months. It seems to be the first time for Apache Spark community to
> > > consider these sister repositories (Go and K8s Operator).
> > > > >>
> > > > >> https://github.com/apache/spark-connect-go/commits/master/
> > > > >>
> > > > >> Dongjoon.
> > > > >>
> > > > >> On 2024/04/08 17:48:18 "L. C. Hsieh" wrote:
> > > > >> > Hi all,
> > > > >> >
> > > > >> > We've opened the dedicated repository of Spark Kubernetes Operator,
> > > > >> > and the first PR is created.
> > > > >> > Thank you for the review from the community so far.
> > > > >> >
> > > > >> > About the versioning of Spark Operator, there are questions.
> > > > >> >
> > > > >> > As we are using Spark JIRA, when we are going to merge PRs, we need
> > > to
> > > > >> > choose a Spark version. However, the Spark Operator is versioning
> > > > >> > differently than Spark. I'm wondering how we deal with this?
> > > > >> >
> > > > >> > Not sure if Connect also has its versioning different to Spark? If
> > > so,
> > > > >> > maybe we can follow how Connect does.
> > > > >> >
> > > > >> > Can someone who is familiar with Connect versioning give some
> > > suggestions?
> > > > >> >
> > > > >> > Thank you.
> > > > >> >
> > > > >> > Liang-Chi
> > > > >> >
> > > > >> >
> > > -
> > > > >> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org  > > dev-unsubscr...@spark.apache.org>
> > > > >> >
> > > > >> >
> > > > >>
> > > > >> -
> > > > >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org  > > dev-unsubscr...@spark.apache.org>
> > > > >>
> > > >
> > > >
> > >
> > > -
> > > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> > >
> > >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Versioning of Spark Operator

2024-04-09 Thread Dongjoon Hyun
Do we have a compatibility matrix for the Apache Spark Connect Go client already, Bo?

Specifically, I'm wondering which Spark versions the existing Apache Spark Connect Go
repository is able to support as of now.

We know that it is supposed to always be compatible, but do we have a way to
actually verify that via CI inside the Go repository?

Dongjoon.

On 2024/04/09 21:35:45 bo yang wrote:
> Thanks Liang-Chi for the Spark Operator work, and also the discussion here!
> 
> For Spark Operator and Connector Go Client, I am guessing they need to
> support multiple versions of Spark? e.g. same Spark Operator may support
> running multiple versions of Spark, and Connector Go Client might support
> multiple versions of Spark driver as well.
> 
> How do people think of using the minimum supported Spark version as the
> version name for Spark Operator and Connector Go Client? For example,
> Spark Operator 3.5.x supports Spark 3.5 and above.
> 
> Best,
> Bo
> 
> 
> On Tue, Apr 9, 2024 at 10:14 AM Dongjoon Hyun  wrote:
> 
> > Ya, that's simple and possible.
> >
> > However, it may cause many confusions because it implies that new `Spark
> > K8s Operator 4.0.0` and `Spark Connect Go 4.0.0` follow the same `Semantic
> > Versioning` policy like Apache Spark 4.0.0.
> >
> > In addition, `Versioning` is directly related to the Release Cadence. It's
> > unlikely for us to have `Spark K8s Operator` and `Spark Connect Go`
> > releases at every Apache Spark maintenance release. For example, there is
> > no commit in Spark Connect Go repository.
> >
> > I believe the versioning and release cadence is related to those
> > subprojects' maturity more.
> >
> > Dongjoon.
> >
> > On 2024/04/09 16:59:40 DB Tsai wrote:
> > >  Aligning with Spark releases is sensible, as it allows us to guarantee
> > that the Spark operator functions correctly with the new version while also
> > maintaining support for previous versions.
> > >
> > > DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
> > >
> > > > On Apr 9, 2024, at 9:45 AM, Mridul Muralidharan 
> > wrote:
> > > >
> > > >
> > > >   I am trying to understand if we can simply align with Spark's
> > version for this ?
> > > > Makes the release and jira management much more simpler for developers
> > and intuitive for users.
> > > >
> > > > Regards,
> > > > Mridul
> > > >
> > > >
> > > > On Tue, Apr 9, 2024 at 10:09 AM Dongjoon Hyun  > > wrote:
> > > >> Hi, Liang-Chi.
> > > >>
> > > >> Thank you for leading Apache Spark K8s operator as a shepherd.
> > > >>
> > > >> I took a look at `Apache Spark Connect Go` repo mentioned in the
> > thread. Sadly, there is no release at all and no activity since last 6
> > months. It seems to be the first time for Apache Spark community to
> > consider these sister repositories (Go and K8s Operator).
> > > >>
> > > >> https://github.com/apache/spark-connect-go/commits/master/
> > > >>
> > > >> Dongjoon.
> > > >>
> > > >> On 2024/04/08 17:48:18 "L. C. Hsieh" wrote:
> > > >> > Hi all,
> > > >> >
> > > >> > We've opened the dedicated repository of Spark Kubernetes Operator,
> > > >> > and the first PR is created.
> > > >> > Thank you for the review from the community so far.
> > > >> >
> > > >> > About the versioning of Spark Operator, there are questions.
> > > >> >
> > > >> > As we are using Spark JIRA, when we are going to merge PRs, we need
> > to
> > > >> > choose a Spark version. However, the Spark Operator is versioning
> > > >> > differently than Spark. I'm wondering how we deal with this?
> > > >> >
> > > >> > Not sure if Connect also has its versioning different to Spark? If
> > so,
> > > >> > maybe we can follow how Connect does.
> > > >> >
> > > >> > Can someone who is familiar with Connect versioning give some
> > suggestions?
> > > >> >
> > > >> > Thank you.
> > > >> >
> > > >> > Liang-Chi
> > > >> >
> > > >> >
> > -
> > > >> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org  > dev-unsubscr...@spark.apache.org>
> > > >> >
> > > >> >
> > > >>
> > > >> -
> > > >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org  > dev-unsubscr...@spark.apache.org>
> > > >>
> > >
> > >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
> >
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Versioning of Spark Operator

2024-04-09 Thread bo yang
Thanks Liang-Chi for the Spark Operator work, and also the discussion here!

For the Spark Operator and the Connect Go Client, I am guessing they need to
support multiple versions of Spark? E.g., the same Spark Operator may support
running multiple versions of Spark, and the Connect Go Client might support
multiple versions of the Spark driver as well.

What do people think about using the minimum supported Spark version as the
version name for the Spark Operator and the Connect Go Client? For example,
Spark Operator 3.5.x would support Spark 3.5 and above.

Best,
Bo


On Tue, Apr 9, 2024 at 10:14 AM Dongjoon Hyun  wrote:

> Ya, that's simple and possible.
>
> However, it may cause many confusions because it implies that new `Spark
> K8s Operator 4.0.0` and `Spark Connect Go 4.0.0` follow the same `Semantic
> Versioning` policy like Apache Spark 4.0.0.
>
> In addition, `Versioning` is directly related to the Release Cadence. It's
> unlikely for us to have `Spark K8s Operator` and `Spark Connect Go`
> releases at every Apache Spark maintenance release. For example, there is
> no commit in Spark Connect Go repository.
>
> I believe the versioning and release cadence is related to those
> subprojects' maturity more.
>
> Dongjoon.
>
> On 2024/04/09 16:59:40 DB Tsai wrote:
> >  Aligning with Spark releases is sensible, as it allows us to guarantee
> that the Spark operator functions correctly with the new version while also
> maintaining support for previous versions.
> >
> > DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
> >
> > > On Apr 9, 2024, at 9:45 AM, Mridul Muralidharan 
> wrote:
> > >
> > >
> > >   I am trying to understand if we can simply align with Spark's
> version for this ?
> > > Makes the release and jira management much more simpler for developers
> and intuitive for users.
> > >
> > > Regards,
> > > Mridul
> > >
> > >
> > > On Tue, Apr 9, 2024 at 10:09 AM Dongjoon Hyun  > wrote:
> > >> Hi, Liang-Chi.
> > >>
> > >> Thank you for leading Apache Spark K8s operator as a shepherd.
> > >>
> > >> I took a look at `Apache Spark Connect Go` repo mentioned in the
> thread. Sadly, there is no release at all and no activity since last 6
> months. It seems to be the first time for Apache Spark community to
> consider these sister repositories (Go and K8s Operator).
> > >>
> > >> https://github.com/apache/spark-connect-go/commits/master/
> > >>
> > >> Dongjoon.
> > >>
> > >> On 2024/04/08 17:48:18 "L. C. Hsieh" wrote:
> > >> > Hi all,
> > >> >
> > >> > We've opened the dedicated repository of Spark Kubernetes Operator,
> > >> > and the first PR is created.
> > >> > Thank you for the review from the community so far.
> > >> >
> > >> > About the versioning of Spark Operator, there are questions.
> > >> >
> > >> > As we are using Spark JIRA, when we are going to merge PRs, we need
> to
> > >> > choose a Spark version. However, the Spark Operator is versioning
> > >> > differently than Spark. I'm wondering how we deal with this?
> > >> >
> > >> > Not sure if Connect also has its versioning different to Spark? If
> so,
> > >> > maybe we can follow how Connect does.
> > >> >
> > >> > Can someone who is familiar with Connect versioning give some
> suggestions?
> > >> >
> > >> > Thank you.
> > >> >
> > >> > Liang-Chi
> > >> >
> > >> >
> -
> > >> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org  dev-unsubscr...@spark.apache.org>
> > >> >
> > >> >
> > >>
> > >> -
> > >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org  dev-unsubscr...@spark.apache.org>
> > >>
> >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Versioning of Spark Operator

2024-04-09 Thread Dongjoon Hyun
Ya, that's simple and possible.

However, it may cause a lot of confusion because it implies that a new `Spark K8s
Operator 4.0.0` and `Spark Connect Go 4.0.0` follow the same `Semantic
Versioning` policy as Apache Spark 4.0.0.

In addition, `Versioning` is directly related to the Release Cadence. It's
unlikely for us to have `Spark K8s Operator` and `Spark Connect Go` releases at
every Apache Spark maintenance release. For example, there are no commits in the
Spark Connect Go repository.

I believe the versioning and release cadence are related more to those
subprojects' maturity.

Dongjoon.

On 2024/04/09 16:59:40 DB Tsai wrote:
>  Aligning with Spark releases is sensible, as it allows us to guarantee that 
> the Spark operator functions correctly with the new version while also 
> maintaining support for previous versions.
>  
> DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1
> 
> > On Apr 9, 2024, at 9:45 AM, Mridul Muralidharan  wrote:
> > 
> > 
> >   I am trying to understand if we can simply align with Spark's version for 
> > this ?
> > Makes the release and jira management much more simpler for developers and 
> > intuitive for users.
> > 
> > Regards,
> > Mridul
> > 
> > 
> > On Tue, Apr 9, 2024 at 10:09 AM Dongjoon Hyun  > > wrote:
> >> Hi, Liang-Chi.
> >> 
> >> Thank you for leading Apache Spark K8s operator as a shepherd. 
> >> 
> >> I took a look at `Apache Spark Connect Go` repo mentioned in the thread. 
> >> Sadly, there is no release at all and no activity since last 6 months. It 
> >> seems to be the first time for Apache Spark community to consider these 
> >> sister repositories (Go and K8s Operator).
> >> 
> >> https://github.com/apache/spark-connect-go/commits/master/
> >> 
> >> Dongjoon.
> >> 
> >> On 2024/04/08 17:48:18 "L. C. Hsieh" wrote:
> >> > Hi all,
> >> > 
> >> > We've opened the dedicated repository of Spark Kubernetes Operator,
> >> > and the first PR is created.
> >> > Thank you for the review from the community so far.
> >> > 
> >> > About the versioning of Spark Operator, there are questions.
> >> > 
> >> > As we are using Spark JIRA, when we are going to merge PRs, we need to
> >> > choose a Spark version. However, the Spark Operator is versioning
> >> > differently than Spark. I'm wondering how we deal with this?
> >> > 
> >> > Not sure if Connect also has its versioning different to Spark? If so,
> >> > maybe we can follow how Connect does.
> >> > 
> >> > Can someone who is familiar with Connect versioning give some 
> >> > suggestions?
> >> > 
> >> > Thank you.
> >> > 
> >> > Liang-Chi
> >> > 
> >> > -
> >> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
> >> > 
> >> > 
> >> > 
> >> 
> >> -
> >> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
> >> 
> >> 
> 
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Versioning of Spark Operator

2024-04-09 Thread DB Tsai
 Aligning with Spark releases is sensible, as it allows us to guarantee that 
the Spark operator functions correctly with the new version while also 
maintaining support for previous versions.
 
DB Tsai  |  https://www.dbtsai.com/  |  PGP 42E5B25A8F7A82C1

> On Apr 9, 2024, at 9:45 AM, Mridul Muralidharan  wrote:
> 
> 
>   I am trying to understand if we can simply align with Spark's version for 
> this ?
> Makes the release and jira management much more simpler for developers and 
> intuitive for users.
> 
> Regards,
> Mridul
> 
> 
> On Tue, Apr 9, 2024 at 10:09 AM Dongjoon Hyun  > wrote:
>> Hi, Liang-Chi.
>> 
>> Thank you for leading Apache Spark K8s operator as a shepherd. 
>> 
>> I took a look at `Apache Spark Connect Go` repo mentioned in the thread. 
>> Sadly, there is no release at all and no activity since last 6 months. It 
>> seems to be the first time for Apache Spark community to consider these 
>> sister repositories (Go and K8s Operator).
>> 
>> https://github.com/apache/spark-connect-go/commits/master/
>> 
>> Dongjoon.
>> 
>> On 2024/04/08 17:48:18 "L. C. Hsieh" wrote:
>> > Hi all,
>> > 
>> > We've opened the dedicated repository of Spark Kubernetes Operator,
>> > and the first PR is created.
>> > Thank you for the review from the community so far.
>> > 
>> > About the versioning of Spark Operator, there are questions.
>> > 
>> > As we are using Spark JIRA, when we are going to merge PRs, we need to
>> > choose a Spark version. However, the Spark Operator is versioning
>> > differently than Spark. I'm wondering how we deal with this?
>> > 
>> > Not sure if Connect also has its versioning different to Spark? If so,
>> > maybe we can follow how Connect does.
>> > 
>> > Can someone who is familiar with Connect versioning give some suggestions?
>> > 
>> > Thank you.
>> > 
>> > Liang-Chi
>> > 
>> > -
>> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
>> > 
>> > 
>> > 
>> 
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org 
>> 
>> 



Re: Versioning of Spark Operator

2024-04-09 Thread Mridul Muralidharan
  I am trying to understand if we can simply align with Spark's version for
this?
That makes the release and JIRA management much simpler for developers and
intuitive for users.

Regards,
Mridul


On Tue, Apr 9, 2024 at 10:09 AM Dongjoon Hyun  wrote:

> Hi, Liang-Chi.
>
> Thank you for leading Apache Spark K8s operator as a shepherd.
>
> I took a look at `Apache Spark Connect Go` repo mentioned in the thread.
> Sadly, there is no release at all and no activity since last 6 months. It
> seems to be the first time for Apache Spark community to consider these
> sister repositories (Go and K8s Operator).
>
> https://github.com/apache/spark-connect-go/commits/master/
>
> Dongjoon.
>
> On 2024/04/08 17:48:18 "L. C. Hsieh" wrote:
> > Hi all,
> >
> > We've opened the dedicated repository of Spark Kubernetes Operator,
> > and the first PR is created.
> > Thank you for the review from the community so far.
> >
> > About the versioning of Spark Operator, there are questions.
> >
> > As we are using Spark JIRA, when we are going to merge PRs, we need to
> > choose a Spark version. However, the Spark Operator is versioning
> > differently than Spark. I'm wondering how we deal with this?
> >
> > Not sure if Connect also has its versioning different to Spark? If so,
> > maybe we can follow how Connect does.
> >
> > Can someone who is familiar with Connect versioning give some
> suggestions?
> >
> > Thank you.
> >
> > Liang-Chi
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
> >
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Versioning of Spark Operator

2024-04-09 Thread Dongjoon Hyun
Hi, Liang-Chi.

Thank you for leading Apache Spark K8s operator as a shepherd. 

I took a look at the `Apache Spark Connect Go` repo mentioned in the thread. Sadly, 
there is no release at all and there has been no activity in the last 6 months. It 
seems to be the first time for the Apache Spark community to consider these sister 
repositories (Go and K8s Operator).

https://github.com/apache/spark-connect-go/commits/master/

Dongjoon.

On 2024/04/08 17:48:18 "L. C. Hsieh" wrote:
> Hi all,
> 
> We've opened the dedicated repository of Spark Kubernetes Operator,
> and the first PR is created.
> Thank you for the review from the community so far.
> 
> About the versioning of Spark Operator, there are questions.
> 
> As we are using Spark JIRA, when we are going to merge PRs, we need to
> choose a Spark version. However, the Spark Operator is versioning
> differently than Spark. I'm wondering how we deal with this?
> 
> Not sure if Connect also has its versioning different to Spark? If so,
> maybe we can follow how Connect does.
> 
> Can someone who is familiar with Connect versioning give some suggestions?
> 
> Thank you.
> 
> Liang-Chi
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 
> 

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: SPIP: Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-09 Thread Dongjoon Hyun
Thank you for sharing, Jia.

I have the same questions as in Weiting's previous thread.

Do you think you can share the future milestone of Apache Gluten?
I'm wondering when the first stable release will come and how we can
coordinate across the ASF communities.

> This project is still under active development now, and doesn't have a
stable release.
> https://github.com/apache/incubator-gluten/releases/tag/v1.1.1

In the Apache Spark community, Apache Spark 3.2 and 3.3 have reached the end of
support.
And 3.4 will have 3.4.3 next week, with 3.4.4 (another EOL release)
scheduled for October.

For the SPIP, I guess it's applicable to Apache Spark 4.0.0 only if there
is something we need to do from the Spark side.

Thanks,
Dongjoon.


On Tue, Apr 9, 2024 at 12:22 AM Ke Jia  wrote:

> Apache Spark currently lacks an official mechanism to support
> cross-platform execution of physical plans. The Gluten project offers a
> mechanism that utilizes the Substrait standard to convert and optimize
> Spark's physical plans. By introducing Gluten's plan conversion,
> validation, and fallback mechanisms into Spark, we can significantly
> enhance the portability and interoperability of Spark's physical plans,
> enabling them to operate across a broader spectrum of execution
> environments without requiring users to migrate, while also improving
> Spark's execution efficiency through the utilization of Gluten's advanced
> optimization techniques. And the integration of Gluten into Spark has
> already shown significant performance improvements with ClickHouse and
> Velox backends and has been successfully deployed in production by several
> customers.
>
> References:
> JIRA Ticket 
> SPIP Doc
> 
>
> Your feedback and comments are welcome and appreciated.  Thanks.
>
> Thanks,
> Jia Ke
>


Re: Introducing Apache Gluten(incubating), a middle layer to offload Spark to native engine

2024-04-09 Thread Dongjoon Hyun
Thank you for sharing, Weiting.

Do you think you can share the future milestone of Apache Gluten?
I'm wondering when the first stable release will come and how we can
coordinate across the ASF communities.

> This project is still under active development now, and doesn't have a
stable release.
> https://github.com/apache/incubator-gluten/releases/tag/v1.1.1

In the Apache Spark community, Apache Spark 3.2 and 3.3 have reached the end of
support.
And 3.4 will have 3.4.3 next week, with 3.4.4 (another EOL release)
scheduled for October.

For the SPIP, I guess it's applicable to Apache Spark 4.0.0 only if there
is something we need to do from the Spark side.

Thanks,
Dongjoon.


On Mon, Apr 8, 2024 at 11:19 PM WeitingChen  wrote:

> Hi all,
>
> We are excited to introduce a new Apache incubating project called Gluten.
> Gluten serves as a middleware layer designed to offload Spark to native
> engines like Velox or ClickHouse.
> For more detailed information, please visit the project repository at
> https://github.com/apache/incubator-gluten
>
> Additionally, a new Spark SPIP related to Spark + Gluten collaboration has
> been proposed at https://issues.apache.org/jira/browse/SPARK-47773.
> We eagerly await feedback from the Spark community.
>
> Thanks,
> Weiting.
>
>


Introducing Apache Gluten(incubating), a middle layer to offload Spark to native engine

2024-04-09 Thread WeitingChen
Hi all,

We are excited to introduce a new Apache incubating project called Gluten.
Gluten serves as a middleware layer designed to offload Spark to native
engines like Velox or ClickHouse.
For more detailed information, please visit the project repository at
https://github.com/apache/incubator-gluten

Additionally, a new Spark SPIP related to Spark + Gluten collaboration has
been proposed at https://issues.apache.org/jira/browse/SPARK-47773.
We eagerly await feedback from the Spark community.

Thanks,
Weiting.


SPIP: Enhancing the Flexibility of Spark's Physical Plan to Enable Execution on Various Native Engines

2024-04-09 Thread Ke Jia
Apache Spark currently lacks an official mechanism to support
cross-platform execution of physical plans. The Gluten project offers a
mechanism that utilizes the Substrait standard to convert and optimize
Spark's physical plans. By introducing Gluten's plan conversion,
validation, and fallback mechanisms into Spark, we can significantly
enhance the portability and interoperability of Spark's physical plans,
enabling them to operate across a broader spectrum of execution
environments without requiring users to migrate. This also improves
Spark's execution efficiency through Gluten's advanced optimization
techniques. The integration of Gluten into Spark has already shown
significant performance improvements with the ClickHouse and Velox
backends and has been successfully deployed in production by several
customers.
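
For context, a minimal sketch of how Gluten is enabled on an existing Spark
deployment today (assuming the Gluten jar for the chosen backend is on the
driver and executor classpath; the plugin and shuffle manager class names
below follow the Gluten 1.1.x docs as I understand them and may differ
between releases):

from pyspark.sql import SparkSession

# Assumed config keys/classes per the Gluten 1.1.x documentation; verify
# against the release you actually deploy.
spark = (
    SparkSession.builder
    .appName("gluten-velox-sketch")
    .config("spark.plugins", "io.glutenproject.GlutenPlugin")
    .config("spark.memory.offHeap.enabled", "true")
    .config("spark.memory.offHeap.size", "4g")
    .config("spark.shuffle.manager",
            "org.apache.spark.shuffle.sort.ColumnarShuffleManager")
    .getOrCreate()
)

# Plans that pass Gluten's validation run on the native backend; operators
# that fail validation fall back to vanilla Spark execution.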

References:
JIRA Ticket 
SPIP Doc


Your feedback and comments are welcome and appreciated.  Thanks.

Thanks,
Jia Ke


Versioning of Spark Operator

2024-04-08 Thread L. C. Hsieh
Hi all,

We've opened the dedicated repository of Spark Kubernetes Operator,
and the first PR is created.
Thank you for the review from the community so far.

About the versioning of the Spark Operator, there are some questions.

As we are using Spark JIRA, when we are going to merge PRs, we need to
choose a Spark version. However, the Spark Operator is versioned
differently from Spark. I'm wondering how we should deal with this?

I'm not sure whether Connect also versions differently from Spark. If so,
maybe we can follow what Connect does.

Can someone who is familiar with Connect versioning give some suggestions?

Thank you.

Liang-Chi

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Unsubscribe

2024-04-08 Thread bruce COTTMAN



-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Apache Spark 3.4.3 (?)

2024-04-08 Thread Dongjoon Hyun
Thank you, Holden, Mridul,  Kent, Liang-Chi, Mich, Jungtaek.

I added `Target Version: 3.4.3` to SPARK-47318 and am going to continue to 
prepare for RC1 (April 15th).

Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: External Spark shuffle service for k8s

2024-04-08 Thread Mich Talebzadeh
Hi,

First, thanks everyone for their contributions.

I was going to reply to @Enrico Minack but
noticed additional info. As I understand it, for example, Apache Uniffle is an
incubating project aimed at providing a pluggable shuffle service for
Spark. So basically, what all these "external shuffle services" have in common
is offloading shuffle data management to external services, thus reducing
the memory and CPU overhead on Spark executors. That is great. While
Uniffle and others enhance shuffle performance and scalability, it would be
great to integrate them with the Spark UI. This may require additional
development effort. I suppose the interest would be to have these
external metrics incorporated into Spark with one look and feel. This may
require customizing the UI to fetch and display metrics or statistics from
the external shuffle services. Has any project done this?

Thanks

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Mon, 8 Apr 2024 at 14:19, Vakaris Baškirov 
wrote:

> I see that both Uniffle and Celebron support S3/HDFS backends which is
> great.
> In the case someone is using S3/HDFS, I wonder what would be the
> advantages of using Celebron or Uniffle vs IBM shuffle service plugin
>  or Cloud Shuffle Storage Plugin
> from AWS
> 
> ?
>
> These plugins do not require deploying a separate service. Are there any
> advantages to using Uniffle/Celebron in the case of using S3 backend, which
> would require deploying a separate service?
>
> Thanks
> Vakaris
>
> On Mon, Apr 8, 2024 at 10:03 AM roryqi  wrote:
>
>> Apache Uniffle (incubating) may be another solution.
>> You can see
>> https://github.com/apache/incubator-uniffle
>>
>> https://uniffle.apache.org/blog/2023/07/21/Uniffle%20-%20New%20chapter%20for%20the%20shuffle%20in%20the%20cloud%20native%20era
>>
>> On Mon, Apr 8, 2024 at 7:15 AM Mich Talebzadeh wrote:
>>
>>> Splendid
>>>
>>> The configurations below can be used with k8s deployments of Spark.
>>> Spark applications running on k8s can utilize these configurations to
>>> seamlessly access data stored in Google Cloud Storage (GCS) and Amazon S3.
>>>
>>> For Google GCS we may have
>>>
>>> spark_config_gcs = {
>>> "spark.kubernetes.authenticate.driver.serviceAccountName":
>>> "service_account_name",
>>> "spark.hadoop.fs.gs.impl":
>>> "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
>>> "spark.hadoop.google.cloud.auth.service.account.enable": "true",
>>> "spark.hadoop.google.cloud.auth.service.account.json.keyfile":
>>> "/path/to/keyfile.json",
>>> }
>>>
>>> For Amazon S3 similar
>>>
>>> spark_config_s3 = {
>>> "spark.kubernetes.authenticate.driver.serviceAccountName":
>>> "service_account_name",
>>> "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
>>> "spark.hadoop.fs.s3a.access.key": "s3_access_key",
>>> "spark.hadoop.fs.s3a.secret.key": "secret_key",
>>> }
>>>
>>>
>>> To implement these configurations and enable Spark applications to
>>> interact with GCS and S3, I guess we can approach it this way
>>>
>>> 1) Spark Repository Integration: These configurations need to be added
>>> to the Spark repository as part of the supported configuration options for
>>> k8s deployments.
>>>
>>> 2) Configuration Settings: Users need to specify these configurations
>>> when submitting Spark applications to a Kubernetes cluster. They can
>>> include these configurations in the Spark application code or pass them as
>>> command-line arguments or environment variables during application
>>> submission.
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>>
>>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed . It is essential to note
>>> that, as with any advice, quote "one test result is worth one-thousand
>>> expert opinions (Werner
>>> Von Braun
>>> )".
>>>
>>>
>>> On Sun, 7 Apr 2024 at 13:31, Vakaris Baškirov <
>>> vakaris.bashki...@gmail.com> wrote:
>>>
 There is an IBM shuffle 

Re: External Spark shuffle service for k8s

2024-04-08 Thread Vakaris Baškirov
I see that both Uniffle and Celeborn support S3/HDFS backends, which is
great.
In case someone is using S3/HDFS, I wonder what the advantages would be
of using Celeborn or Uniffle vs the IBM shuffle service plugin
or the Cloud Shuffle Storage Plugin from AWS?

These plugins do not require deploying a separate service. Are there any
advantages to using Uniffle/Celeborn in the case of an S3 backend, which
would require deploying a separate service?

Thanks
Vakaris

On Mon, Apr 8, 2024 at 10:03 AM roryqi  wrote:

> Apache Uniffle (incubating) may be another solution.
> You can see
> https://github.com/apache/incubator-uniffle
>
> https://uniffle.apache.org/blog/2023/07/21/Uniffle%20-%20New%20chapter%20for%20the%20shuffle%20in%20the%20cloud%20native%20era
>
> On Mon, Apr 8, 2024 at 7:15 AM Mich Talebzadeh wrote:
>
>> Splendid
>>
>> The configurations below can be used with k8s deployments of Spark. Spark
>> applications running on k8s can utilize these configurations to seamlessly
>> access data stored in Google Cloud Storage (GCS) and Amazon S3.
>>
>> For Google GCS we may have
>>
>> spark_config_gcs = {
>> "spark.kubernetes.authenticate.driver.serviceAccountName":
>> "service_account_name",
>> "spark.hadoop.fs.gs.impl":
>> "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
>> "spark.hadoop.google.cloud.auth.service.account.enable": "true",
>> "spark.hadoop.google.cloud.auth.service.account.json.keyfile":
>> "/path/to/keyfile.json",
>> }
>>
>> For Amazon S3 similar
>>
>> spark_config_s3 = {
>> "spark.kubernetes.authenticate.driver.serviceAccountName":
>> "service_account_name",
>> "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
>> "spark.hadoop.fs.s3a.access.key": "s3_access_key",
>> "spark.hadoop.fs.s3a.secret.key": "secret_key",
>> }
>>
>>
>> To implement these configurations and enable Spark applications to
>> interact with GCS and S3, I guess we can approach it this way
>>
>> 1) Spark Repository Integration: These configurations need to be added to
>> the Spark repository as part of the supported configuration options for k8s
>> deployments.
>>
>> 2) Configuration Settings: Users need to specify these configurations
>> when submitting Spark applications to a Kubernetes cluster. They can
>> include these configurations in the Spark application code or pass them as
>> command-line arguments or environment variables during application
>> submission.
>>
>> HTH
>>
>> Mich Talebzadeh,
>>
>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> Von Braun
>> )".
>>
>>
>> On Sun, 7 Apr 2024 at 13:31, Vakaris Baškirov <
>> vakaris.bashki...@gmail.com> wrote:
>>
>>> There is an IBM shuffle service plugin that supports S3
>>> https://github.com/IBM/spark-s3-shuffle
>>>
>>> Though I would think a feature like this could be a part of the main
>>> Spark repo. Trino already has out-of-box support for s3 exchange (shuffle)
>>> and it's very useful.
>>>
>>> Vakaris
>>>
>>> On Sun, Apr 7, 2024 at 12:27 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>

 Thanks for your suggestion that I take it as a workaround. Whilst this
 workaround can potentially address storage allocation issues, I was more
 interested in exploring solutions that offer a more seamless integration
 with large distributed file systems like HDFS, GCS, or S3. This would
 ensure better performance and scalability for handling larger datasets
 efficiently.


 Mich Talebzadeh,
 Technologist | Solutions Architect | Data Engineer  | Generative AI
 London
 United Kingdom


view my Linkedin profile
 


  https://en.everybodywiki.com/Mich_Talebzadeh



 *Disclaimer:* The information provided is correct to the best of my
 knowledge but of course cannot be guaranteed . It is essential to note
 that, as with any advice, quote "one test result is worth one-thousand
 expert opinions (Werner
 Von Braun
 )".


 On Sat, 6 Apr 2024 at 21:28, Bjørn Jørgensen 
 wrote:

> You can make a PVC on K8S call it 300GB
>
> make a folder in yours dockerfile
> 

Re: External Spark shuffle service for k8s

2024-04-08 Thread roryqi
Apache Uniffle (incubating) may be another solution.
You can see
https://github.com/apache/incubator-uniffle
https://uniffle.apache.org/blog/2023/07/21/Uniffle%20-%20New%20chapter%20for%20the%20shuffle%20in%20the%20cloud%20native%20era

On Mon, Apr 8, 2024 at 7:15 AM Mich Talebzadeh wrote:

> Splendid
>
> The configurations below can be used with k8s deployments of Spark. Spark
> applications running on k8s can utilize these configurations to seamlessly
> access data stored in Google Cloud Storage (GCS) and Amazon S3.
>
> For Google GCS we may have
>
> spark_config_gcs = {
> "spark.kubernetes.authenticate.driver.serviceAccountName":
> "service_account_name",
> "spark.hadoop.fs.gs.impl":
> "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
> "spark.hadoop.google.cloud.auth.service.account.enable": "true",
> "spark.hadoop.google.cloud.auth.service.account.json.keyfile":
> "/path/to/keyfile.json",
> }
>
> For Amazon S3 similar
>
> spark_config_s3 = {
> "spark.kubernetes.authenticate.driver.serviceAccountName":
> "service_account_name",
> "spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
> "spark.hadoop.fs.s3a.access.key": "s3_access_key",
> "spark.hadoop.fs.s3a.secret.key": "secret_key",
> }
>
>
> To implement these configurations and enable Spark applications to
> interact with GCS and S3, I guess we can approach it this way
>
> 1) Spark Repository Integration: These configurations need to be added to
> the Spark repository as part of the supported configuration options for k8s
> deployments.
>
> 2) Configuration Settings: Users need to specify these configurations when
> submitting Spark applications to a Kubernetes cluster. They can include
> these configurations in the Spark application code or pass them as
> command-line arguments or environment variables during application
> submission.
>
> HTH
>
> Mich Talebzadeh,
>
> Technologist | Solutions Architect | Data Engineer  | Generative AI
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  Von
> Braun )".
>
>
> On Sun, 7 Apr 2024 at 13:31, Vakaris Baškirov 
> wrote:
>
>> There is an IBM shuffle service plugin that supports S3
>> https://github.com/IBM/spark-s3-shuffle
>>
>> Though I would think a feature like this could be a part of the main
>> Spark repo. Trino already has out-of-box support for s3 exchange (shuffle)
>> and it's very useful.
>>
>> Vakaris
>>
>> On Sun, Apr 7, 2024 at 12:27 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>>
>>> Thanks for your suggestion that I take it as a workaround. Whilst this
>>> workaround can potentially address storage allocation issues, I was more
>>> interested in exploring solutions that offer a more seamless integration
>>> with large distributed file systems like HDFS, GCS, or S3. This would
>>> ensure better performance and scalability for handling larger datasets
>>> efficiently.
>>>
>>>
>>> Mich Talebzadeh,
>>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed . It is essential to note
>>> that, as with any advice, quote "one test result is worth one-thousand
>>> expert opinions (Werner
>>> Von Braun
>>> )".
>>>
>>>
>>> On Sat, 6 Apr 2024 at 21:28, Bjørn Jørgensen 
>>> wrote:
>>>
 You can make a PVC on K8S call it 300GB

 make a folder in yours dockerfile
 WORKDIR /opt/spark/work-dir
 RUN chmod g+w /opt/spark/work-dir

 start spark with adding this

 .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.options.claimName",
 "300gb") \

 .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.path",
 "/opt/spark/work-dir") \

 .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.readOnly",
 "False") \

 .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.options.claimName",
 "300gb") \

 .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.path",
 "/opt/spark/work-dir") \

 .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.readOnly",
 

Re: Apache Spark 3.4.3 (?)

2024-04-07 Thread Jungtaek Lim
Sounds like a plan. +1 (non-binding) Thanks for volunteering!

On Sun, Apr 7, 2024 at 5:45 AM Dongjoon Hyun 
wrote:

> Hi, All.
>
> Apache Spark 3.4.2 tag was created on Nov 24th and `branch-3.4` has 85
> commits including important security and correctness patches like
> SPARK-45580, SPARK-46092, SPARK-46466, SPARK-46794, and SPARK-46862.
>
> https://github.com/apache/spark/releases/tag/v3.4.2
>
> $ git log --oneline v3.4.2..HEAD | wc -l
>   85
>
> SPARK-45580 Subquery changes the output schema of the outer query
> SPARK-46092 Overflow in Parquet row group filter creation causes incorrect
> results
> SPARK-46466 Vectorized parquet reader should never do rebase for timestamp
> ntz
> SPARK-46794 Incorrect results due to inferred predicate from checkpoint
> with subquery
> SPARK-46862 Incorrect count() of a dataframe loaded from CSV datasource
> SPARK-45445 Upgrade snappy to 1.1.10.5
> SPARK-47428 Upgrade Jetty to 9.4.54.v20240208
> SPARK-46239 Hide `Jetty` info
>
>
> Currently, I'm checking more applicable patches for branch-3.4. I'd like
> to propose to release Apache Spark 3.4.3 and volunteer as the release
> manager for Apache Spark 3.4.3. If there are no additional blockers, the
> first tentative RC1 vote date is April 15th (Monday).
>
> WDYT?
>
> Dongjoon.
>


Fwd: Apache Spark 3.4.3 (?)

2024-04-07 Thread Mich Talebzadeh
Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


-- Forwarded message -
From: Mich Talebzadeh 
Date: Sun, 7 Apr 2024 at 11:56
Subject: Re: Apache Spark 3.4.3 (?)
To: Dongjoon Hyun 


Yes, given that a good number of people are using some flavour of 3.4.n,
this will be a good fit.

+1 for me


Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Sat, 6 Apr 2024 at 23:02, Dongjoon Hyun  wrote:

> Hi, All.
>
> Apache Spark 3.4.2 tag was created on Nov 24th and `branch-3.4` has 85
> commits including important security and correctness patches like
> SPARK-45580, SPARK-46092, SPARK-46466, SPARK-46794, and SPARK-46862.
>
> https://github.com/apache/spark/releases/tag/v3.4.2
>
> $ git log --oneline v3.4.2..HEAD | wc -l
>   85
>
> SPARK-45580 Subquery changes the output schema of the outer query
> SPARK-46092 Overflow in Parquet row group filter creation causes incorrect
> results
> SPARK-46466 Vectorized parquet reader should never do rebase for timestamp
> ntz
> SPARK-46794 Incorrect results due to inferred predicate from checkpoint
> with subquery
> SPARK-46862 Incorrect count() of a dataframe loaded from CSV datasource
> SPARK-45445 Upgrade snappy to 1.1.10.5
> SPARK-47428 Upgrade Jetty to 9.4.54.v20240208
> SPARK-46239 Hide `Jetty` info
>
>
> Currently, I'm checking more applicable patches for branch-3.4. I'd like
> to propose to release Apache Spark 3.4.3 and volunteer as the release
> manager for Apache Spark 3.4.3. If there are no additional blockers, the
> first tentative RC1 vote date is April 15th (Monday).
>
> WDYT?
>
>
> Dongjoon.
>


Re: Apache Spark 3.4.3 (?)

2024-04-07 Thread L. C. Hsieh
+1

Thanks Dongjoon!

On Sun, Apr 7, 2024 at 1:56 AM Kent Yao  wrote:
>
> +1, thank you, Dongjoon
>
>
> Kent
>
> Holden Karau  于2024年4月7日周日 14:54写道:
> >
> > Sounds good to me :)
> >
> > Twitter: https://twitter.com/holdenkarau
> > Books (Learning Spark, High Performance Spark, etc.): 
> > https://amzn.to/2MaRAG9
> > YouTube Live Streams: https://www.youtube.com/user/holdenkarau
> >
> >
> > On Sat, Apr 6, 2024 at 2:51 PM Dongjoon Hyun  
> > wrote:
> >>
> >> Hi, All.
> >>
> >> Apache Spark 3.4.2 tag was created on Nov 24th and `branch-3.4` has 85 
> >> commits including important security and correctness patches like 
> >> SPARK-45580, SPARK-46092, SPARK-46466, SPARK-46794, and SPARK-46862.
> >>
> >> https://github.com/apache/spark/releases/tag/v3.4.2
> >>
> >> $ git log --oneline v3.4.2..HEAD | wc -l
> >>   85
> >>
> >> SPARK-45580 Subquery changes the output schema of the outer query
> >> SPARK-46092 Overflow in Parquet row group filter creation causes incorrect 
> >> results
> >> SPARK-46466 Vectorized parquet reader should never do rebase for timestamp 
> >> ntz
> >> SPARK-46794 Incorrect results due to inferred predicate from checkpoint 
> >> with subquery
> >> SPARK-46862 Incorrect count() of a dataframe loaded from CSV datasource
> >> SPARK-45445 Upgrade snappy to 1.1.10.5
> >> SPARK-47428 Upgrade Jetty to 9.4.54.v20240208
> >> SPARK-46239 Hide `Jetty` info
> >>
> >>
> >> Currently, I'm checking more applicable patches for branch-3.4. I'd like 
> >> to propose to release Apache Spark 3.4.3 and volunteer as the release 
> >> manager for Apache Spark 3.4.3. If there are no additional blockers, the 
> >> first tentative RC1 vote date is April 15th (Monday).
> >>
> >> WDYT?
> >>
> >>
> >> Dongjoon.
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: External Spark shuffle service for k8s

2024-04-07 Thread Mich Talebzadeh
Thanks Cheng for the heads up. I will have a look.

Cheers

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Sun, 7 Apr 2024 at 15:08, Cheng Pan  wrote:

> Instead of the External Shuffle Service, Apache Celeborn might be a good option
> as a Remote Shuffle Service for Spark on K8s.
>
> There are some useful resources you might be interested in.
>
> [1] https://celeborn.apache.org/
> [2] https://www.youtube.com/watch?v=s5xOtG6Venw
> [3] https://github.com/aws-samples/emr-remote-shuffle-service
> [4] https://github.com/apache/celeborn/issues/2140
>
> Thanks,
> Cheng Pan
>
>
> > On Apr 6, 2024, at 21:41, Mich Talebzadeh 
> wrote:
> >
> > I have seen some older references for shuffle service for k8s,
> > although it is not clear they are talking about a generic shuffle
> > service for k8s.
> >
> > Anyhow with the advent of genai and the need to allow for a larger
> > volume of data, I was wondering if there has been any more work on
> > this matter. Specifically larger and scalable file systems like HDFS,
> > GCS , S3 etc, offer significantly larger storage capacity than local
> > disks on individual worker nodes in a k8s cluster, thus allowing
> > handling much larger datasets more efficiently. Also the degree of
> > parallelism and fault tolerance  with these files systems come into
> > it. I will be interested in hearing more about any progress on this.
> >
> > Thanks
> > .
> >
> > Mich Talebzadeh,
> >
> > Technologist | Solutions Architect | Data Engineer  | Generative AI
> >
> > London
> > United Kingdom
> >
> >
> >   view my Linkedin profile
> >
> >
> > https://en.everybodywiki.com/Mich_Talebzadeh
> >
> >
> >
> > Disclaimer: The information provided is correct to the best of my
> > knowledge but of course cannot be guaranteed . It is essential to note
> > that, as with any advice, quote "one test result is worth one-thousand
> > expert opinions (Werner Von Braun)".
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
>
>


Re: External Spark shuffle service for k8s

2024-04-07 Thread Cheng Pan
Instead of the External Shuffle Service, Apache Celeborn might be a good option as a 
Remote Shuffle Service for Spark on K8s.

There are some useful resources you might be interested in.

[1] https://celeborn.apache.org/
[2] https://www.youtube.com/watch?v=s5xOtG6Venw
[3] https://github.com/aws-samples/emr-remote-shuffle-service
[4] https://github.com/apache/celeborn/issues/2140

Thanks,
Cheng Pan


> On Apr 6, 2024, at 21:41, Mich Talebzadeh  wrote:
> 
> I have seen some older references for shuffle service for k8s,
> although it is not clear they are talking about a generic shuffle
> service for k8s.
> 
> Anyhow with the advent of genai and the need to allow for a larger
> volume of data, I was wondering if there has been any more work on
> this matter. Specifically larger and scalable file systems like HDFS,
> GCS , S3 etc, offer significantly larger storage capacity than local
> disks on individual worker nodes in a k8s cluster, thus allowing
> handling much larger datasets more efficiently. Also the degree of
> parallelism and fault tolerance  with these files systems come into
> it. I will be interested in hearing more about any progress on this.
> 
> Thanks
> .
> 
> Mich Talebzadeh,
> 
> Technologist | Solutions Architect | Data Engineer  | Generative AI
> 
> London
> United Kingdom
> 
> 
>   view my Linkedin profile
> 
> 
> https://en.everybodywiki.com/Mich_Talebzadeh
> 
> 
> 
> Disclaimer: The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
> 
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> 


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: External Spark shuffle service for k8s

2024-04-07 Thread Mich Talebzadeh
Splendid

The configurations below can be used with k8s deployments of Spark. Spark
applications running on k8s can utilize these configurations to seamlessly
access data stored in Google Cloud Storage (GCS) and Amazon S3.

For Google GCS we may have

spark_config_gcs = {
"spark.kubernetes.authenticate.driver.serviceAccountName":
"service_account_name",
"spark.hadoop.fs.gs.impl":
"com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
"spark.hadoop.google.cloud.auth.service.account.enable": "true",
"spark.hadoop.google.cloud.auth.service.account.json.keyfile":
"/path/to/keyfile.json",
}

For Amazon S3 similar

spark_config_s3 = {
"spark.kubernetes.authenticate.driver.serviceAccountName":
"service_account_name",
"spark.hadoop.fs.s3a.impl": "org.apache.hadoop.fs.s3a.S3AFileSystem",
"spark.hadoop.fs.s3a.access.key": "s3_access_key",
"spark.hadoop.fs.s3a.secret.key": "secret_key",
}


To implement these configurations and enable Spark applications to interact
with GCS and S3, I guess we can approach it this way

1) Spark Repository Integration: These configurations need to be added to
the Spark repository as part of the supported configuration options for k8s
deployments.

2) Configuration Settings: Users need to specify these configurations when
submitting Spark applications to a Kubernetes cluster. They can include
these configurations in the Spark application code or pass them as
command-line arguments or environment variables during application
submission.
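
As an illustration, a minimal PySpark sketch reusing the spark_config_gcs dict
above (the service account name, key file path, and bucket below are
placeholders, not working values):

from pyspark.sql import SparkSession

builder = SparkSession.builder.appName("gcs-shuffle-example")
# Apply the GCS settings defined above; spark_config_s3 can be applied the
# same way for S3 access.
for key, value in spark_config_gcs.items():
    builder = builder.config(key, value)
spark = builder.getOrCreate()

# Placeholder path; replace with a real bucket and object prefix.
df = spark.read.parquet("gs://<bucket>/<path>")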

HTH

Mich Talebzadeh,

Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Sun, 7 Apr 2024 at 13:31, Vakaris Baškirov 
wrote:

> There is an IBM shuffle service plugin that supports S3
> https://github.com/IBM/spark-s3-shuffle
>
> Though I would think a feature like this could be a part of the main Spark
> repo. Trino already has out-of-box support for s3 exchange (shuffle) and
> it's very useful.
>
> Vakaris
>
> On Sun, Apr 7, 2024 at 12:27 PM Mich Talebzadeh 
> wrote:
>
>>
>> Thanks for your suggestion that I take it as a workaround. Whilst this
>> workaround can potentially address storage allocation issues, I was more
>> interested in exploring solutions that offer a more seamless integration
>> with large distributed file systems like HDFS, GCS, or S3. This would
>> ensure better performance and scalability for handling larger datasets
>> efficiently.
>>
>>
>> Mich Talebzadeh,
>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> Von Braun
>> )".
>>
>>
>> On Sat, 6 Apr 2024 at 21:28, Bjørn Jørgensen 
>> wrote:
>>
>>> You can make a PVC on K8S call it 300GB
>>>
>>> make a folder in yours dockerfile
>>> WORKDIR /opt/spark/work-dir
>>> RUN chmod g+w /opt/spark/work-dir
>>>
>>> start spark with adding this
>>>
>>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.options.claimName",
>>> "300gb") \
>>>
>>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.path",
>>> "/opt/spark/work-dir") \
>>>
>>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.readOnly",
>>> "False") \
>>>
>>> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.options.claimName",
>>> "300gb") \
>>>
>>> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.path",
>>> "/opt/spark/work-dir") \
>>>
>>> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.readOnly",
>>> "False") \
>>>   .config("spark.local.dir", "/opt/spark/work-dir")
>>>
>>>
>>>
>>>
>>> On Sat, 6 Apr 2024 at 15:45, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
 I have seen some older references for shuffle service for k8s,
 although it is not clear they are talking about a generic shuffle
 service for k8s.

 Anyhow with the advent of genai and the need to allow for a larger
 volume of data, I was wondering if there has been any more work on
 this 

Re: External Spark shuffle service for k8s

2024-04-07 Thread Vakaris Baškirov
There is an IBM shuffle service plugin that supports S3
https://github.com/IBM/spark-s3-shuffle

Though I would think a feature like this could be a part of the main Spark
repo. Trino already has out-of-box support for s3 exchange (shuffle) and
it's very useful.

Vakaris

On Sun, Apr 7, 2024 at 12:27 PM Mich Talebzadeh 
wrote:

>
> Thanks for your suggestion that I take it as a workaround. Whilst this
> workaround can potentially address storage allocation issues, I was more
> interested in exploring solutions that offer a more seamless integration
> with large distributed file systems like HDFS, GCS, or S3. This would
> ensure better performance and scalability for handling larger datasets
> efficiently.
>
>
> Mich Talebzadeh,
> Technologist | Solutions Architect | Data Engineer  | Generative AI
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  Von
> Braun )".
>
>
> On Sat, 6 Apr 2024 at 21:28, Bjørn Jørgensen 
> wrote:
>
>> You can make a PVC on K8S call it 300GB
>>
>> make a folder in yours dockerfile
>> WORKDIR /opt/spark/work-dir
>> RUN chmod g+w /opt/spark/work-dir
>>
>> start spark with adding this
>>
>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.options.claimName",
>> "300gb") \
>>
>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.path",
>> "/opt/spark/work-dir") \
>>
>> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.readOnly",
>> "False") \
>>
>> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.options.claimName",
>> "300gb") \
>>
>> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.path",
>> "/opt/spark/work-dir") \
>>
>> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.readOnly",
>> "False") \
>>   .config("spark.local.dir", "/opt/spark/work-dir")
>>
>>
>>
>>
>> On Sat, 6 Apr 2024 at 15:45, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>>> I have seen some older references for shuffle service for k8s,
>>> although it is not clear they are talking about a generic shuffle
>>> service for k8s.
>>>
>>> Anyhow with the advent of genai and the need to allow for a larger
>>> volume of data, I was wondering if there has been any more work on
>>> this matter. Specifically larger and scalable file systems like HDFS,
>>> GCS , S3 etc, offer significantly larger storage capacity than local
>>> disks on individual worker nodes in a k8s cluster, thus allowing
>>> handling much larger datasets more efficiently. Also the degree of
>>> parallelism and fault tolerance  with these files systems come into
>>> it. I will be interested in hearing more about any progress on this.
>>>
>>> Thanks
>>> .
>>>
>>> Mich Talebzadeh,
>>>
>>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>>>
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> Disclaimer: The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed . It is essential to note
>>> that, as with any advice, quote "one test result is worth one-thousand
>>> expert opinions (Werner Von Braun)".
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>
>>
>> --
>> Bjørn Jørgensen
>> Vestre Aspehaug 4, 6010 Ålesund
>> Norge
>>
>> +47 480 94 297
>>
>


Re: Apache Spark 3.4.3 (?)

2024-04-07 Thread Kent Yao
+1, thank you, Dongjoon


Kent

Holden Karau  于2024年4月7日周日 14:54写道:
>
> Sounds good to me :)
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Sat, Apr 6, 2024 at 2:51 PM Dongjoon Hyun  wrote:
>>
>> Hi, All.
>>
>> Apache Spark 3.4.2 tag was created on Nov 24th and `branch-3.4` has 85 
>> commits including important security and correctness patches like 
>> SPARK-45580, SPARK-46092, SPARK-46466, SPARK-46794, and SPARK-46862.
>>
>> https://github.com/apache/spark/releases/tag/v3.4.2
>>
>> $ git log --oneline v3.4.2..HEAD | wc -l
>>   85
>>
>> SPARK-45580 Subquery changes the output schema of the outer query
>> SPARK-46092 Overflow in Parquet row group filter creation causes incorrect 
>> results
>> SPARK-46466 Vectorized parquet reader should never do rebase for timestamp 
>> ntz
>> SPARK-46794 Incorrect results due to inferred predicate from checkpoint with 
>> subquery
>> SPARK-46862 Incorrect count() of a dataframe loaded from CSV datasource
>> SPARK-45445 Upgrade snappy to 1.1.10.5
>> SPARK-47428 Upgrade Jetty to 9.4.54.v20240208
>> SPARK-46239 Hide `Jetty` info
>>
>>
>> Currently, I'm checking more applicable patches for branch-3.4. I'd like to 
>> propose to release Apache Spark 3.4.3 and volunteer as the release manager 
>> for Apache Spark 3.4.3. If there are no additional blockers, the first 
>> tentative RC1 vote date is April 15th (Monday).
>>
>> WDYT?
>>
>>
>> Dongjoon.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Apache Spark 3.4.3 (?)

2024-04-06 Thread Mridul Muralidharan
Hi Dongjoon,

  Thanks for volunteering !
I would suggest to wait for SPARK-47318 to be merged as well for 3.4

Regards,
Mridul

On Sat, Apr 6, 2024 at 6:49 PM Dongjoon Hyun 
wrote:

> Hi, All.
>
> Apache Spark 3.4.2 tag was created on Nov 24th and `branch-3.4` has 85
> commits including important security and correctness patches like
> SPARK-45580, SPARK-46092, SPARK-46466, SPARK-46794, and SPARK-46862.
>
> https://github.com/apache/spark/releases/tag/v3.4.2
>
> $ git log --oneline v3.4.2..HEAD | wc -l
>   85
>
> SPARK-45580 Subquery changes the output schema of the outer query
> SPARK-46092 Overflow in Parquet row group filter creation causes incorrect
> results
> SPARK-46466 Vectorized parquet reader should never do rebase for timestamp
> ntz
> SPARK-46794 Incorrect results due to inferred predicate from checkpoint
> with subquery
> SPARK-46862 Incorrect count() of a dataframe loaded from CSV datasource
> SPARK-45445 Upgrade snappy to 1.1.10.5
> SPARK-47428 Upgrade Jetty to 9.4.54.v20240208
> SPARK-46239 Hide `Jetty` info
>
>
> Currently, I'm checking more applicable patches for branch-3.4. I'd like
> to propose to release Apache Spark 3.4.3 and volunteer as the release
> manager for Apache Spark 3.4.3. If there are no additional blockers, the
> first tentative RC1 vote date is April 15th (Monday).
>
> WDYT?
>
>
> Dongjoon.
>


Re: Apache Spark 3.4.3 (?)

2024-04-06 Thread Holden Karau
Sounds good to me :)

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Sat, Apr 6, 2024 at 2:51 PM Dongjoon Hyun 
wrote:

> Hi, All.
>
> Apache Spark 3.4.2 tag was created on Nov 24th and `branch-3.4` has 85
> commits including important security and correctness patches like
> SPARK-45580, SPARK-46092, SPARK-46466, SPARK-46794, and SPARK-46862.
>
> https://github.com/apache/spark/releases/tag/v3.4.2
>
> $ git log --oneline v3.4.2..HEAD | wc -l
>   85
>
> SPARK-45580 Subquery changes the output schema of the outer query
> SPARK-46092 Overflow in Parquet row group filter creation causes incorrect
> results
> SPARK-46466 Vectorized parquet reader should never do rebase for timestamp
> ntz
> SPARK-46794 Incorrect results due to inferred predicate from checkpoint
> with subquery
> SPARK-46862 Incorrect count() of a dataframe loaded from CSV datasource
> SPARK-45445 Upgrade snappy to 1.1.10.5
> SPARK-47428 Upgrade Jetty to 9.4.54.v20240208
> SPARK-46239 Hide `Jetty` info
>
>
> Currently, I'm checking more applicable patches for branch-3.4. I'd like
> to propose to release Apache Spark 3.4.3 and volunteer as the release
> manager for Apache Spark 3.4.3. If there are no additional blockers, the
> first tentative RC1 vote date is April 15th (Monday).
>
> WDYT?
>
>
> Dongjoon.
>


Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-04-06 Thread Pavan Kotikalapudi
Hi Jungtaek,

Status on current SPARK-24815: Thomas Graves is reviewing the draft PR. I need
to add documentation about the configs and usage details; I am planning to do
that this week.
He did mention that it would be great if somebody with experience in
structured streaming would take a look at the algorithm. Will you be able
to review it?

Another point I wanted to discuss: as you might have already seen in the
design doc, we use the traditional DRA configs
spark.dynamicAllocation.enabled,
spark.dynamicAllocation.schedulerBacklogTimeout,
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout,
spark.dynamicAllocation.executorIdleTimeout,
spark.dynamicAllocation.cachedExecutorIdleTimeout,

and few new configs

spark.dynamicAllocation.streaming.enabled,
spark.dynamicAllocation.streaming.executorDeallocationRatio,
spark.dynamicAllocation.streaming.executorDeallocationTimeout.

to make the DRA work for structured streaming.

While in the design doc I did mention that we have to calculate and set
scale out/back thresholds based on the trigger interval.
We (internally in the company) do have helper functions to auto-generate
the above configs based on the trigger interval and the threshold configs
(we also got similar feedback in reviews).
Here are such configs:

  # required - should be greater than 3 seconds as that gives enough
seconds for scaleOut and scaleBack thresholds to work with.
  "spark.sql.streaming.triggerInterval.seconds": 
  # optional - value should be between 0 and 1 and greater than
scaleBackThreshold : default is 0.9
  "spark.dynamicAllocation.streaming.scaleOutThreshold": 
  # optional - value should be between 0 and 1 and less than
scaleOutThreshold : default is 0.5
  "spark.dynamicAllocation.streaming.scaleBackThreshold": 

The above configs help us to generate the below configs for apps with
different trigger intervals (or if they change them for some reason):

spark.dynamicAllocation.schedulerBacklogTimeout,
spark.dynamicAllocation.sustainedSchedulerBacklogTimeout,
spark.dynamicAllocation.executorIdleTimeout,
spark.dynamicAllocation.cachedExecutorIdleTimeout.

While our additional configs have their own limitations, I would like to get
some feedback on whether adding such kinds of configs to automate
the process of calculating the thresholds and their respective configs
makes sense.
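
For what it is worth, a rough sketch of what such a helper could look like is
below; the function name and the multipliers are illustrative assumptions only,
not the actual internal implementation or the logic in the draft PR.

def generate_streaming_dra_configs(trigger_interval_s: int,
                                   scale_out_threshold: float = 0.9,
                                   scale_back_threshold: float = 0.5) -> dict:
    # Illustrative only: derive the DRA timeouts from the trigger interval,
    # so that scale-out reacts within one trigger and scale-back a bit faster.
    assert trigger_interval_s > 3, "trigger interval should be greater than 3 seconds"
    assert 0 < scale_back_threshold < scale_out_threshold < 1

    scale_out_s = int(trigger_interval_s * scale_out_threshold)
    scale_back_s = int(trigger_interval_s * scale_back_threshold)
    return {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.streaming.enabled": "true",
        "spark.dynamicAllocation.schedulerBacklogTimeout": f"{scale_out_s}s",
        "spark.dynamicAllocation.sustainedSchedulerBacklogTimeout": f"{scale_out_s}s",
        "spark.dynamicAllocation.executorIdleTimeout": f"{scale_back_s}s",
        "spark.dynamicAllocation.cachedExecutorIdleTimeout": f"{scale_back_s}s",
    }

# Example: a 60s trigger would give a 54s backlog timeout and a 30s idle timeout.
configs = generate_streaming_dra_configs(60)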

Thank you,

Pavan

On Thu, Mar 28, 2024 at 3:38 PM Pavan Kotikalapudi 
wrote:

> Hi Jungtaek,
>
> Sorry for the late reply.
>
> I understand the concerns towards finding PMC members, I had similar
> concerns in the past. Do you think we have something to improve in the SPIP
> (certain areas) so that it would get traction from PMC members? Or this
> SPIP might not be a priority to the PMC right now?
>
> I agree this change is small enough that it might not be tagged as an
> SPIP. I started with the template SPIP questions so that it would be easier
> to understand the limitations of the current system, new solution, how it
> works, how to use it, limitations, etc. As you might have already
> noticed in the PR, this change is turned off by default and will only work if
> `spark.dynamicAllocation.streaming.enabled` is true.
>
> Regarding the concerns about expertise in DRA,  I will find some core
> contributors of this module/DRA and tag them to this email with details,
> Mich has also highlighted the same in the past. Once we get approval from
> them we can further discuss and enhance this to make the user experience
> better.
>
> Thank you,
>
> Pavan
>
>
> On Tue, Mar 26, 2024 at 8:12 PM Jungtaek Lim 
> wrote:
>
>> Sounds good.
>>
>> One thing I'd like to clarify before shepherding this SPIP is the process
>> itself. Getting enough traction from PMC members is another issue to pass
>> the SPIP vote. Even a vote from committer is not counted. (I don't have a
>> binding vote.) I only see one PMC member (Thomas Graves, not my team) in
>> the design doc and we still don't get positive feedback. So still a long
>> way to go. We need three supporters from PMC members.
>>
>> Another thing is, I get the proposal at a high level, but I don't have
>> actual expertise in DRA. I could review the code in general, but I feel
>> like I'm not qualified to approve the code. We still need an expert on the
>> CORE area, especially who has expertise with DRA. (Could you please
>> annotate the code and enumerate several people who worked on the codebase?)
>> If they need an expertise of streaming to understand how things will work
>> then either you or I can explain, but I can't just approve and merge the
>> code.

Re: External Spark shuffle service for k8s

2024-04-06 Thread Mich Talebzadeh
Thanks for your suggestion, which I take as a workaround. Whilst this
workaround can potentially address storage allocation issues, I was more
interested in exploring solutions that offer a more seamless integration
with large distributed file systems like HDFS, GCS, or S3. This would
ensure better performance and scalability for handling larger datasets
efficiently.


Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".


On Sat, 6 Apr 2024 at 21:28, Bjørn Jørgensen 
wrote:

> You can make a PVC on K8S call it 300GB
>
> make a folder in yours dockerfile
> WORKDIR /opt/spark/work-dir
> RUN chmod g+w /opt/spark/work-dir
>
> start spark with adding this
>
> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.options.claimName",
> "300gb") \
>
> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.path",
> "/opt/spark/work-dir") \
>
> .config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.readOnly",
> "False") \
>
> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.options.claimName",
> "300gb") \
>
> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.path",
> "/opt/spark/work-dir") \
>
> .config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.readOnly",
> "False") \
>   .config("spark.local.dir", "/opt/spark/work-dir")
>
>
>
>
> On Sat, 6 Apr 2024 at 15:45, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> I have seen some older references for shuffle service for k8s,
>> although it is not clear they are talking about a generic shuffle
>> service for k8s.
>>
>> Anyhow with the advent of genai and the need to allow for a larger
>> volume of data, I was wondering if there has been any more work on
>> this matter. Specifically larger and scalable file systems like HDFS,
>> GCS , S3 etc, offer significantly larger storage capacity than local
>> disks on individual worker nodes in a k8s cluster, thus allowing
>> handling much larger datasets more efficiently. Also the degree of
>> parallelism and fault tolerance  with these files systems come into
>> it. I will be interested in hearing more about any progress on this.
>>
>> Thanks
>> .
>>
>> Mich Talebzadeh,
>>
>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>>
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> Disclaimer: The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner Von Braun)".
>>
>> -
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>
> --
> Bjørn Jørgensen
> Vestre Aspehaug 4, 6010 Ålesund
> Norge
>
> +47 480 94 297
>


Apache Spark 3.4.3 (?)

2024-04-06 Thread Dongjoon Hyun
Hi, All.

Apache Spark 3.4.2 tag was created on Nov 24th and `branch-3.4` has 85
commits including important security and correctness patches like
SPARK-45580, SPARK-46092, SPARK-46466, SPARK-46794, and SPARK-46862.

https://github.com/apache/spark/releases/tag/v3.4.2

$ git log --oneline v3.4.2..HEAD | wc -l
  85

SPARK-45580 Subquery changes the output schema of the outer query
SPARK-46092 Overflow in Parquet row group filter creation causes incorrect
results
SPARK-46466 Vectorized parquet reader should never do rebase for timestamp
ntz
SPARK-46794 Incorrect results due to inferred predicate from checkpoint
with subquery
SPARK-46862 Incorrect count() of a dataframe loaded from CSV datasource
SPARK-45445 Upgrade snappy to 1.1.10.5
SPARK-47428 Upgrade Jetty to 9.4.54.v20240208
SPARK-46239 Hide `Jetty` info


Currently, I'm checking more applicable patches for branch-3.4. I'd like to
propose to release Apache Spark 3.4.3 and volunteer as the release manager
for Apache Spark 3.4.3. If there are no additional blockers, the first
tentative RC1 vote date is April 15th (Monday).

WDYT?

Dongjoon.


Re: External Spark shuffle service for k8s

2024-04-06 Thread Bjørn Jørgensen
You can make a PVC on K8S and call it "300gb".

Make a folder in your Dockerfile:
WORKDIR /opt/spark/work-dir
RUN chmod g+w /opt/spark/work-dir

Start Spark adding this:

.config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.options.claimName",
"300gb") \

.config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.path",
"/opt/spark/work-dir") \

.config("spark.kubernetes.driver.volumes.persistentVolumeClaim.300gb.mount.readOnly",
"False") \

.config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.options.claimName",
"300gb") \

.config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.path",
"/opt/spark/work-dir") \

.config("spark.kubernetes.executor.volumes.persistentVolumeClaim.300gb.mount.readOnly",
"False") \
  .config("spark.local.dir", "/opt/spark/work-dir")




On Sat, 6 Apr 2024 at 15:45, Mich Talebzadeh wrote:

> I have seen some older references for shuffle service for k8s,
> although it is not clear they are talking about a generic shuffle
> service for k8s.
>
> Anyhow with the advent of genai and the need to allow for a larger
> volume of data, I was wondering if there has been any more work on
> this matter. Specifically larger and scalable file systems like HDFS,
> GCS , S3 etc, offer significantly larger storage capacity than local
> disks on individual worker nodes in a k8s cluster, thus allowing
> handling much larger datasets more efficiently. Also the degree of
> parallelism and fault tolerance  with these files systems come into
> it. I will be interested in hearing more about any progress on this.
>
> Thanks
> .
>
> Mich Talebzadeh,
>
> Technologist | Solutions Architect | Data Engineer  | Generative AI
>
> London
> United Kingdom
>
>
>view my Linkedin profile
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> Disclaimer: The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>

-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


External Spark shuffle service for k8s

2024-04-06 Thread Mich Talebzadeh
I have seen some older references for shuffle service for k8s,
although it is not clear they are talking about a generic shuffle
service for k8s.

Anyhow, with the advent of genai and the need to handle larger
volumes of data, I was wondering if there has been any more work on
this matter. Specifically, larger and scalable file systems like HDFS,
GCS, S3, etc. offer significantly more storage capacity than the local
disks on individual worker nodes in a k8s cluster, thus allowing much
larger datasets to be handled more efficiently. The degree of
parallelism and fault tolerance of these file systems also comes into
it. I will be interested in hearing more about any progress on this.

Thanks
.

Mich Talebzadeh,

Technologist | Solutions Architect | Data Engineer  | Generative AI

London
United Kingdom


   view my Linkedin profile


 https://en.everybodywiki.com/Mich_Talebzadeh



Disclaimer: The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



[VOTE][RESULT] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-03 Thread Hyukjin Kwon
The vote passes with 19 +1s (13 binding +1s).

(* = binding)
+1:
Haejoon Lee
Ruifeng Zheng(*)
Dongjoon Hyun(*)
Gengliang Wang(*)
Mridul Muralidharan(*)
Liang-Chi Hsieh(*)
Takuya Ueshin(*)
Kent Yao
Chao Sun(*)
Hussein Awala
Xiao Li(*)
Yuanjian Li(*)
Denny Lee
Felix Cheung(*)
Bo Yang
Xinrong Meng(*)
Holden Karau(*)
Femi Anthony
Tom Graves(*)

+0: None

-1: None

Thanks.


Participate in the ASF 25th Anniversary Campaign

2024-04-03 Thread Brian Proffitt
Hi everyone,

As part of The ASF’s 25th anniversary campaign[1], we will be celebrating
projects and communities in multiple ways.

We invite all projects and contributors to participate in the following
ways:

* Individuals - submit your first contribution:
https://news.apache.org/foundation/entry/the-asf-launches-firstasfcontribution-campaign
* Projects - share your public good story:
https://docs.google.com/forms/d/1vuN-tUnBwpTgOE5xj3Z5AG1hsOoDNLBmGIqQHwQT6k8/viewform?edit_requested=true
* Projects - submit a project spotlight for the blog:
https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=278466116
* Projects - contact the Voice of Apache podcast (formerly Feathercast) to
be featured: https://feathercast.apache.org/help/
*  Projects - use the 25th anniversary template and the #ASF25Years hashtag
on social media:
https://docs.google.com/presentation/d/1oDbMol3F_XQuCmttPYxBIOIjRuRBksUjDApjd8Ve3L8/edit#slide=id.g26b0919956e_0_13

If you have questions, email the Marketing & Publicity team at
mark...@apache.org.

Peace,
BKP

[1] https://apache.org/asf25years/

[NOTE: You are receiving this message because you are a contributor to an
Apache Software Foundation project. The ASF will very occasionally send out
messages relating to the Foundation to contributors and members, such as
this one.]

Brian Proffitt
VP, Marketing & Publicity
VP, Conferences


Ready for Review: spark-kubernetes-operator Alpha Release

2024-04-02 Thread Zhou Jiang
Hi dev members,

I am writing to let you know that the first pull request has been raised to
the newly established spark-kubernetes-operator, as previously discussed
within the group. This PR includes the alpha version release of this
project.

https://github.com/apache/spark-kubernetes-operator/pull/2

Here are some key highlights of the PR:
* Introduction of the alpha version of spark-kubernetes-operator.
* Start & stop Spark apps with simple yaml schema
* Deploy and monitor SparkApplications throughout its lifecycle
* Version agnostic for Spark 3.2 and above
* Full logging and metrics integration
* Flexible deployments and native integration with Kubernetes tooling

To facilitate the review process, we have provided detailed documentation
and comments within the PR.

This PR also includes contributions from Qi Tan, Shruti Gumma, Nishchal
Venkataramana and Swami Jayaraman, whose efforts have been instrumental in
reaching this stage of the project.

We are currently in the phase of actively developing and refining the
project. This includes extensive testing across diverse workloads and the
integration of additional test frameworks to ensure the robustness and
reliability of Spark applications. We are calling for reviews and inputs on
this PR. Please feel free to provide any suggestions, concerns, or feedback
that could help to improve the quality and functionality of the project. We
look forward to your feedback.

-- 
*Zhou JIANG*


Re: Scheduling jobs using FAIR pool

2024-04-02 Thread Varun Shah
Hi Hussein,

Thanks for clarifying my doubts.

It means that even if I configure 2 separate pools for 2 jobs, or submit the
2 jobs in the same pool, the submission time only takes effect when both
jobs are "running" in parallel (i.e. if job 1 gets all resources, job 2 has
to wait, unless pool 2 has been assigned a set of minimum executors).

However, with separate pools (small, preferably static pools rather than
dynamic ones), more control is possible, such as setting weights for jobs
when multiple jobs are competing for resources and assigning minimum
executors to each pool.
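
(For illustration, a minimal sketch of this threads-plus-pools pattern follows;
the helper name, paths, pool names and the `streams` list are placeholders,
spark.scheduler.mode=FAIR is assumed to be set, and the cloudFiles source
assumes Databricks Auto Loader.)

from concurrent.futures import ThreadPoolExecutor

def run_stream(spark, source_path, checkpoint_path, output_path, pool_name):
    # Jobs started from this thread are assigned to pool_name.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", pool_name)
    query = (
        spark.readStream.format("cloudFiles")      # Auto Loader source (Databricks)
        .option("cloudFiles.format", "json")
        .load(source_path)
        .writeStream
        .format("parquet")
        .option("checkpointLocation", checkpoint_path)
        .trigger(availableNow=True)                # stop once the backlog is drained
        .start(output_path)
    )
    query.awaitTermination()

# streams is a placeholder list of (source, checkpoint, output, pool) tuples,
# and spark is the already-created SparkSession (predefined on Databricks).
with ThreadPoolExecutor(max_workers=8) as executor:
    futures = [executor.submit(run_stream, spark, s, c, o, p)
               for s, c, o, p in streams]
    for f in futures:
        f.result()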

Regards,
Varun Shah


On Mon, Apr 1, 2024, 18:50 Hussein Awala  wrote:

> IMO the questions are not limited to Databricks.
>
> > The Round-Robin distribution of executors only work in case of empty
> executors (achievable by enabling dynamic allocation). In case the jobs
> (part of the same pool) requires all executors, second jobs will still need
> to wait.
>
> This feature in Spark allows for optimal resource utilization. Consider a
> scenario with two stages, each with 500 tasks (500 partitions), generated
> by two threads, and a total of 100 Spark executors available in the fair
> pool.
> The first thread may be instantiated microseconds ahead of the second,
> resulting in the fair scheduler allocating 100 tasks to the first stage
> initially. Once some of the tasks are complete, the scheduler dynamically
> redistributes resources, ultimately splitting the capacity equally between
> both stages. This will work in the same way if you have a single stage but
> without splitting the capacity.
>
> Regarding the other three questions, dynamically creating pools may not be
> advisable due to several considerations (cleanup issues, mixing application
> and infrastructure management, + a lot of unexpected issues).
>
> For scenarios involving stages with few long-running tasks like yours,
> it's recommended to enable dynamic allocation to let Spark add executors as
> needed.
>
> In the context of streaming workloads, streaming dynamic allocation is
> preferred to address specific issues detailed in SPARK-12133
> . Although the
> configurations for this feature are not documented, they can be found in the
> source code
> 
> .
> But for structured streaming (your case), you should use batch one (
> spark.dynamicAllocation.*), as SPARK-24815
>  is not ready yet (it
> was accepted and will be ready soon), but it has some issues in the
> downscale step, you can check the JIRA issue for more details.
>
> On Mon, Apr 1, 2024 at 2:07 PM Varun Shah 
> wrote:
>
>> Hi Mich,
>>
>> I did not post in the databricks community, as most of the questions were
>> related to spark itself.
>>
>> But let me also post the question on databricks community.
>>
>> Thanks,
>> Varun Shah
>>
>> On Mon, Apr 1, 2024, 16:28 Mich Talebzadeh 
>> wrote:
>>
>>> Hi,
>>>
>>> Have you put this question to Databricks forum
>>>
>>> Data Engineering - Databricks
>>> 
>>>
>>>
>>> Mich Talebzadeh,
>>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed . It is essential to note
>>> that, as with any advice, quote "one test result is worth one-thousand
>>> expert opinions (Werner
>>> Von Braun
>>> )".
>>>
>>>
>>> On Mon, 1 Apr 2024 at 07:22, Varun Shah 
>>> wrote:
>>>
 Hi Community,

 I am currently exploring the best use of "Scheduler Pools" for
 executing jobs in parallel, and require clarification and suggestions on a
 few points.

 The implementation consists of executing "Structured Streaming" jobs on
 Databricks using AutoLoader. Each stream is executed with trigger =
 'AvailableNow', ensuring that the streams don't keep running for the
 source. (we have ~4000 such streams, with no continuous stream from source,
 hence not keeping the streams running infinitely using other triggers).

 One way to achieve parallelism in the jobs is to use "MultiThreading",
 all using same SparkContext, as quoted from official docs: "Inside a given
 Spark application (SparkContext instance), multiple parallel jobs can run
 simultaneously if they were submitted from separate threads."

 There's also a availability of "FAIR Scheduler", which instead of FIFO
 Scheduler (default), assigns executors in 

Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-02 Thread Tom Graves
 +1
Tom

On Sunday, March 31, 2024 at 10:09:28 PM CDT, Ruifeng Zheng 
 wrote:  
 
 +1

On Mon, Apr 1, 2024 at 10:06 AM Haejoon Lee 
 wrote:

+1

On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon  wrote:

Hi all,

I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark 
Connect) 
JIRA
Prototype
SPIP doc

Please vote on the SPIP for the next 72 hours:

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …

Thanks.



-- 
Ruifeng Zheng
E-mail: zrfli...@gmail.com
  

Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-02 Thread Hyukjin Kwon
Yes
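
(For illustration only: a minimal sketch of connecting from such a pure Python
client; the endpoint is a placeholder and the example assumes the package
exposes the usual pyspark.sql namespace.)

from pyspark.sql import SparkSession

# No local JVM is started; the query plan is sent to the remote Spark Connect endpoint.
spark = SparkSession.builder.remote("sc://spark.example.com:15002").getOrCreate()
spark.range(10).filter("id % 2 = 0").show()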

On Tue, Apr 2, 2024 at 6:36 PM Femi Anthony  wrote:

> So, to clarify - the purpose of this package is to enable connectivity to
> a remote Spark cluster without having to install any local JVM
> dependencies, right ?
>
> Sent from my iPhone
>
> On Mar 31, 2024, at 10:07 PM, Haejoon Lee
>  wrote:
>
> 
>
> +1
>
> On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark
>> Connect)
>>
>> JIRA 
>> Prototype 
>> SPIP doc
>> 
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Thanks.
>>
>


Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-02 Thread Femi Anthony
So, to clarify - the purpose of this package is to enable connectivity to a remote Spark cluster without having to install any local JVM dependencies, right ?

Sent from my iPhone

On Mar 31, 2024, at 10:07 PM, Haejoon Lee  wrote:

+1

On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon  wrote:

Hi all,

I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark Connect)

JIRA
Prototype
SPIP doc

Please vote on the SPIP for the next 72 hours:

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …

Thanks.



Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-01 Thread Mridul Muralidharan
+1

Regards,
Mridul


On Mon, Apr 1, 2024 at 11:26 PM Holden Karau  wrote:

> +1
>
> Twitter: https://twitter.com/holdenkarau
> Books (Learning Spark, High Performance Spark, etc.):
> https://amzn.to/2MaRAG9  
> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>
>
> On Mon, Apr 1, 2024 at 5:44 PM Xinrong Meng  wrote:
>
>> +1
>>
>> Thank you @Hyukjin Kwon 
>>
>> On Mon, Apr 1, 2024 at 10:19 AM Felix Cheung 
>> wrote:
>>
>>> +1
>>> --
>>> *From:* Denny Lee 
>>> *Sent:* Monday, April 1, 2024 10:06:14 AM
>>> *To:* Hussein Awala 
>>> *Cc:* Chao Sun ; Hyukjin Kwon ;
>>> Mridul Muralidharan ; dev 
>>> *Subject:* Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)
>>>
>>> +1 (non-binding)
>>>
>>>
>>> On Mon, Apr 1, 2024 at 9:24 AM Hussein Awala  wrote:
>>>
>>> +1(non-binding) I add to the difference will it make that it will also
>>> simplify package maintenance and easily release a bug fix/new feature
>>> without needing to wait for Pyspark to release.
>>>
>>> On Mon, Apr 1, 2024 at 4:56 PM Chao Sun  wrote:
>>>
>>> +1
>>>
>>> On Sun, Mar 31, 2024 at 10:31 PM Hyukjin Kwon 
>>> wrote:
>>>
>>> Oh I didn't send the discussion thread out as it's pretty simple,
>>> non-invasive and the discussion was sort of done as part of the Spark
>>> Connect initial discussion ..
>>>
>>> On Mon, Apr 1, 2024 at 1:59 PM Mridul Muralidharan 
>>> wrote:
>>>
>>>
>>> Can you point me to the SPIP’s discussion thread please ?
>>> I was not able to find it, but I was on vacation, and so might have
>>> missed this …
>>>
>>>
>>> Regards,
>>> Mridul
>>>
>>>
>>> On Sun, Mar 31, 2024 at 9:08 PM Haejoon Lee
>>>  wrote:
>>>
>>> +1
>>>
>>> On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon 
>>> wrote:
>>>
>>> Hi all,
>>>
>>> I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark
>>> Connect)
>>>
>>> JIRA 
>>> Prototype 
>>> SPIP doc
>>> 
>>>
>>> Please vote on the SPIP for the next 72 hours:
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>> Thanks.
>>>
>>>


Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-01 Thread Holden Karau
+1

Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.):
https://amzn.to/2MaRAG9  
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


On Mon, Apr 1, 2024 at 5:44 PM Xinrong Meng  wrote:

> +1
>
> Thank you @Hyukjin Kwon 
>
> On Mon, Apr 1, 2024 at 10:19 AM Felix Cheung 
> wrote:
>
>> +1
>> --
>> *From:* Denny Lee 
>> *Sent:* Monday, April 1, 2024 10:06:14 AM
>> *To:* Hussein Awala 
>> *Cc:* Chao Sun ; Hyukjin Kwon ;
>> Mridul Muralidharan ; dev 
>> *Subject:* Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)
>>
>> +1 (non-binding)
>>
>>
>> On Mon, Apr 1, 2024 at 9:24 AM Hussein Awala  wrote:
>>
>> +1(non-binding) I add to the difference will it make that it will also
>> simplify package maintenance and easily release a bug fix/new feature
>> without needing to wait for Pyspark to release.
>>
>> On Mon, Apr 1, 2024 at 4:56 PM Chao Sun  wrote:
>>
>> +1
>>
>> On Sun, Mar 31, 2024 at 10:31 PM Hyukjin Kwon 
>> wrote:
>>
>> Oh I didn't send the discussion thread out as it's pretty simple,
>> non-invasive and the discussion was sort of done as part of the Spark
>> Connect initial discussion ..
>>
>> On Mon, Apr 1, 2024 at 1:59 PM Mridul Muralidharan 
>> wrote:
>>
>>
>> Can you point me to the SPIP’s discussion thread please ?
>> I was not able to find it, but I was on vacation, and so might have
>> missed this …
>>
>>
>> Regards,
>> Mridul
>>
>>
>> On Sun, Mar 31, 2024 at 9:08 PM Haejoon Lee
>>  wrote:
>>
>> +1
>>
>> On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon 
>> wrote:
>>
>> Hi all,
>>
>> I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark
>> Connect)
>>
>> JIRA 
>> Prototype 
>> SPIP doc
>> 
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Thanks.
>>
>>


Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-01 Thread Xinrong Meng
+1

Thank you @Hyukjin Kwon 

On Mon, Apr 1, 2024 at 10:19 AM Felix Cheung 
wrote:

> +1
> --
> *From:* Denny Lee 
> *Sent:* Monday, April 1, 2024 10:06:14 AM
> *To:* Hussein Awala 
> *Cc:* Chao Sun ; Hyukjin Kwon ;
> Mridul Muralidharan ; dev 
> *Subject:* Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)
>
> +1 (non-binding)
>
>
> On Mon, Apr 1, 2024 at 9:24 AM Hussein Awala  wrote:
>
> +1(non-binding) I add to the difference will it make that it will also
> simplify package maintenance and easily release a bug fix/new feature
> without needing to wait for Pyspark to release.
>
> On Mon, Apr 1, 2024 at 4:56 PM Chao Sun  wrote:
>
> +1
>
> On Sun, Mar 31, 2024 at 10:31 PM Hyukjin Kwon 
> wrote:
>
> Oh I didn't send the discussion thread out as it's pretty simple,
> non-invasive and the discussion was sort of done as part of the Spark
> Connect initial discussion ..
>
> On Mon, Apr 1, 2024 at 1:59 PM Mridul Muralidharan 
> wrote:
>
>
> Can you point me to the SPIP’s discussion thread please ?
> I was not able to find it, but I was on vacation, and so might have
> missed this …
>
>
> Regards,
> Mridul
>
>
> On Sun, Mar 31, 2024 at 9:08 PM Haejoon Lee
>  wrote:
>
> +1
>
> On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon  wrote:
>
> Hi all,
>
> I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark
> Connect)
>
> JIRA 
> Prototype 
> SPIP doc
> 
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Thanks.
>
>


Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-01 Thread bo yang
+1 (non-binding)

On Mon, Apr 1, 2024 at 10:19 AM Felix Cheung 
wrote:

> +1
> --
> *From:* Denny Lee 
> *Sent:* Monday, April 1, 2024 10:06:14 AM
> *To:* Hussein Awala 
> *Cc:* Chao Sun ; Hyukjin Kwon ;
> Mridul Muralidharan ; dev 
> *Subject:* Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)
>
> +1 (non-binding)
>
>
> On Mon, Apr 1, 2024 at 9:24 AM Hussein Awala  wrote:
>
> +1(non-binding) I add to the difference will it make that it will also
> simplify package maintenance and easily release a bug fix/new feature
> without needing to wait for Pyspark to release.
>
> On Mon, Apr 1, 2024 at 4:56 PM Chao Sun  wrote:
>
> +1
>
> On Sun, Mar 31, 2024 at 10:31 PM Hyukjin Kwon 
> wrote:
>
> Oh I didn't send the discussion thread out as it's pretty simple,
> non-invasive and the discussion was sort of done as part of the Spark
> Connect initial discussion ..
>
> On Mon, Apr 1, 2024 at 1:59 PM Mridul Muralidharan 
> wrote:
>
>
> Can you point me to the SPIP’s discussion thread please ?
> I was not able to find it, but I was on vacation, and so might have
> missed this …
>
>
> Regards,
> Mridul
>
>
> On Sun, Mar 31, 2024 at 9:08 PM Haejoon Lee
>  wrote:
>
> +1
>
> On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon  wrote:
>
> Hi all,
>
> I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark
> Connect)
>
> JIRA 
> Prototype 
> SPIP doc
> 
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Thanks.
>
>


Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-01 Thread Felix Cheung
+1

From: Denny Lee 
Sent: Monday, April 1, 2024 10:06:14 AM
To: Hussein Awala 
Cc: Chao Sun ; Hyukjin Kwon ; Mridul 
Muralidharan ; dev 
Subject: Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

+1 (non-binding)


On Mon, Apr 1, 2024 at 9:24 AM Hussein Awala 
mailto:huss...@awala.fr>> wrote:
+1(non-binding) I add to the difference will it make that it will also simplify 
package maintenance and easily release a bug fix/new feature without needing to 
wait for Pyspark to release.

On Mon, Apr 1, 2024 at 4:56 PM Chao Sun 
mailto:sunc...@apache.org>> wrote:
+1

On Sun, Mar 31, 2024 at 10:31 PM Hyukjin Kwon 
mailto:gurwls...@apache.org>> wrote:
Oh I didn't send the discussion thread out as it's pretty simple, non-invasive 
and the discussion was sort of done as part of the Spark Connect initial 
discussion ..

On Mon, Apr 1, 2024 at 1:59 PM Mridul Muralidharan 
mailto:mri...@gmail.com>> wrote:

Can you point me to the SPIP’s discussion thread please ?
I was not able to find it, but I was on vacation, and so might have missed this 
…


Regards,
Mridul

On Sun, Mar 31, 2024 at 9:08 PM Haejoon Lee 
 wrote:
+1

On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon 
mailto:gurwls...@apache.org>> wrote:
Hi all,

I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark Connect)

JIRA
Prototype
SPIP 
doc

Please vote on the SPIP for the next 72 hours:

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …

Thanks.


Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-01 Thread Yuanjian Li
+1

Chao Sun  于2024年4月1日周一 07:56写道:

> +1
>
> On Sun, Mar 31, 2024 at 10:31 PM Hyukjin Kwon 
> wrote:
>
>> Oh I didn't send the discussion thread out as it's pretty simple,
>> non-invasive and the discussion was sort of done as part of the Spark
>> Connect initial discussion ..
>>
>> On Mon, Apr 1, 2024 at 1:59 PM Mridul Muralidharan 
>> wrote:
>>
>>>
>>> Can you point me to the SPIP’s discussion thread please ?
>>> I was not able to find it, but I was on vacation, and so might have
>>> missed this …
>>>
>>>
>>> Regards,
>>> Mridul
>>>
>>
>>> On Sun, Mar 31, 2024 at 9:08 PM Haejoon Lee
>>>  wrote:
>>>
 +1

 On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon 
 wrote:

> Hi all,
>
> I'd like to start the vote for SPIP: Pure Python Package in PyPI
> (Spark Connect)
>
> JIRA 
> Prototype 
> SPIP doc
> 
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Thanks.
>



Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-01 Thread Denny Lee
+1 (non-binding)


On Mon, Apr 1, 2024 at 9:24 AM Hussein Awala  wrote:

> +1(non-binding) I add to the difference will it make that it will also
> simplify package maintenance and easily release a bug fix/new feature
> without needing to wait for Pyspark to release.
>
> On Mon, Apr 1, 2024 at 4:56 PM Chao Sun  wrote:
>
>> +1
>>
>> On Sun, Mar 31, 2024 at 10:31 PM Hyukjin Kwon 
>> wrote:
>>
>>> Oh I didn't send the discussion thread out as it's pretty simple,
>>> non-invasive and the discussion was sort of done as part of the Spark
>>> Connect initial discussion ..
>>>
>>> On Mon, Apr 1, 2024 at 1:59 PM Mridul Muralidharan 
>>> wrote:
>>>

 Can you point me to the SPIP’s discussion thread please ?
 I was not able to find it, but I was on vacation, and so might have
 missed this …


 Regards,
 Mridul

>>>
 On Sun, Mar 31, 2024 at 9:08 PM Haejoon Lee
  wrote:

> +1
>
> On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon 
> wrote:
>
>> Hi all,
>>
>> I'd like to start the vote for SPIP: Pure Python Package in PyPI
>> (Spark Connect)
>>
>> JIRA 
>> Prototype 
>> SPIP doc
>> 
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Thanks.
>>
>


Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-01 Thread Xiao Li
+1

Hussein Awala  于2024年4月1日周一 08:07写道:

> +1(non-binding) I add to the difference will it make that it will also
> simplify package maintenance and easily release a bug fix/new feature
> without needing to wait for Pyspark to release.
>
> On Mon, Apr 1, 2024 at 4:56 PM Chao Sun  wrote:
>
>> +1
>>
>> On Sun, Mar 31, 2024 at 10:31 PM Hyukjin Kwon 
>> wrote:
>>
>>> Oh I didn't send the discussion thread out as it's pretty simple,
>>> non-invasive and the discussion was sort of done as part of the Spark
>>> Connect initial discussion ..
>>>
>>> On Mon, Apr 1, 2024 at 1:59 PM Mridul Muralidharan 
>>> wrote:
>>>

 Can you point me to the SPIP’s discussion thread please ?
 I was not able to find it, but I was on vacation, and so might have
 missed this …


 Regards,
 Mridul

>>>
 On Sun, Mar 31, 2024 at 9:08 PM Haejoon Lee
  wrote:

> +1
>
> On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon 
> wrote:
>
>> Hi all,
>>
>> I'd like to start the vote for SPIP: Pure Python Package in PyPI
>> (Spark Connect)
>>
>> JIRA 
>> Prototype 
>> SPIP doc
>> 
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Thanks.
>>
>


Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-01 Thread Hussein Awala
+1 (non-binding). To add to "what difference will it make": it will also
simplify package maintenance and make it easy to release a bug fix/new feature
without needing to wait for a PySpark release.

On Mon, Apr 1, 2024 at 4:56 PM Chao Sun  wrote:

> +1
>
> On Sun, Mar 31, 2024 at 10:31 PM Hyukjin Kwon 
> wrote:
>
>> Oh I didn't send the discussion thread out as it's pretty simple,
>> non-invasive and the discussion was sort of done as part of the Spark
>> Connect initial discussion ..
>>
>> On Mon, Apr 1, 2024 at 1:59 PM Mridul Muralidharan 
>> wrote:
>>
>>>
>>> Can you point me to the SPIP’s discussion thread please ?
>>> I was not able to find it, but I was on vacation, and so might have
>>> missed this …
>>>
>>>
>>> Regards,
>>> Mridul
>>>
>>
>>> On Sun, Mar 31, 2024 at 9:08 PM Haejoon Lee
>>>  wrote:
>>>
 +1

 On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon 
 wrote:

> Hi all,
>
> I'd like to start the vote for SPIP: Pure Python Package in PyPI
> (Spark Connect)
>
> JIRA 
> Prototype 
> SPIP doc
> 
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Thanks.
>



Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-01 Thread Chao Sun
+1

On Sun, Mar 31, 2024 at 10:31 PM Hyukjin Kwon  wrote:

> Oh I didn't send the discussion thread out as it's pretty simple,
> non-invasive and the discussion was sort of done as part of the Spark
> Connect initial discussion ..
>
> On Mon, Apr 1, 2024 at 1:59 PM Mridul Muralidharan 
> wrote:
>
>>
>> Can you point me to the SPIP’s discussion thread please ?
>> I was not able to find it, but I was on vacation, and so might have
>> missed this …
>>
>>
>> Regards,
>> Mridul
>>
>
>> On Sun, Mar 31, 2024 at 9:08 PM Haejoon Lee
>>  wrote:
>>
>>> +1
>>>
>>> On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon 
>>> wrote:
>>>
 Hi all,

 I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark
 Connect)

 JIRA 
 Prototype 
 SPIP doc
 

 Please vote on the SPIP for the next 72 hours:

 [ ] +1: Accept the proposal as an official SPIP
 [ ] +0
 [ ] -1: I don’t think this is a good idea because …

 Thanks.

>>>


Re: Scheduling jobs using FAIR pool

2024-04-01 Thread Hussein Awala
IMO the questions are not limited to Databricks.

> The Round-Robin distribution of executors only work in case of empty
executors (achievable by enabling dynamic allocation). In case the jobs
(part of the same pool) requires all executors, second jobs will still need
to wait.

This feature in Spark allows for optimal resource utilization. Consider a
scenario with two stages, each with 500 tasks (500 partitions), generated
by two threads, and a total of 100 Spark executors available in the fair
pool.
The first thread may be instantiated microseconds ahead of the second,
resulting in the fair scheduler allocating 100 tasks to the first stage
initially. Once some of the tasks are complete, the scheduler dynamically
redistributes resources, ultimately splitting the capacity equally between
both stages. This will work in the same way if you have a single stage but
without splitting the capacity.

Regarding the other three questions, dynamically creating pools may not be
advisable due to several considerations (cleanup issues, mixing application
and infrastructure management, + a lot of unexpected issues).

For scenarios involving stages with few long-running tasks like yours, it's
recommended to enable dynamic allocation to let Spark add executors as
needed.

In the context of streaming workloads, streaming dynamic allocation is
preferred to address specific issues detailed in SPARK-12133. Although the
configurations for this feature are not documented, they can be found in the
source code.
But for structured streaming (your case), you should use the batch one
(spark.dynamicAllocation.*), as SPARK-24815 is not ready yet (it was accepted
and will be ready soon); it has some issues in the downscale step, though, and
you can check the JIRA issue for more details.
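
(For reference, a minimal hedged sketch of the batch DRA settings mentioned
above; the values are placeholders, and spark.dynamicAllocation.shuffleTracking.enabled
is assumed because there is typically no external shuffle service on k8s.)

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("streaming-with-batch-dra")  # placeholder app name
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
    .getOrCreate()
)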

On Mon, Apr 1, 2024 at 2:07 PM Varun Shah  wrote:

> Hi Mich,
>
> I did not post in the databricks community, as most of the questions were
> related to spark itself.
>
> But let me also post the question on databricks community.
>
> Thanks,
> Varun Shah
>
> On Mon, Apr 1, 2024, 16:28 Mich Talebzadeh 
> wrote:
>
>> Hi,
>>
>> Have you put this question to Databricks forum
>>
>> Data Engineering - Databricks
>> 
>>
>>
>> Mich Talebzadeh,
>> Technologist | Solutions Architect | Data Engineer  | Generative AI
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> Von Braun
>> )".
>>
>>
>> On Mon, 1 Apr 2024 at 07:22, Varun Shah 
>> wrote:
>>
>>> Hi Community,
>>>
>>> I am currently exploring the best use of "Scheduler Pools" for executing
>>> jobs in parallel, and require clarification and suggestions on a few points.
>>>
>>> The implementation consists of executing "Structured Streaming" jobs on
>>> Databricks using AutoLoader. Each stream is executed with trigger =
>>> 'AvailableNow', ensuring that the streams don't keep running for the
>>> source. (we have ~4000 such streams, with no continuous stream from source,
>>> hence not keeping the streams running infinitely using other triggers).
>>>
>>> One way to achieve parallelism in the jobs is to use "MultiThreading",
>>> all using same SparkContext, as quoted from official docs: "Inside a given
>>> Spark application (SparkContext instance), multiple parallel jobs can run
>>> simultaneously if they were submitted from separate threads."
>>>
>>> There's also a availability of "FAIR Scheduler", which instead of FIFO
>>> Scheduler (default), assigns executors in Round-Robin fashion, ensuring the
>>> smaller jobs that were submitted later do not starve due to bigger jobs
>>> submitted early consuming all resources.
>>>
>>> Here are my questions:
>>> 1. The Round-Robin distribution of executors only work in case of empty
>>> executors (achievable by enabling dynamic allocation). In case the jobs
>>> (part of the same pool) requires all executors, second jobs will still need
>>> to wait.
>>> 2. If we create dynamic pools for submitting each stream (by setting
>>> spark property -> "spark.scheduler.pool" to a dynamic value as
>>> spark.sparkContext.setLocalProperty("spark.scheduler.pool", "<pool string>"), how does executor allocation happen? Since all pools created
>>> are created dynamically, they share equal weight. Does this also work the
>>> same way as 

Re: Scheduling jobs using FAIR pool

2024-04-01 Thread Varun Shah
Hi Mich,

I did not post in the databricks community, as most of the questions were
related to spark itself.

But let me also post the question on databricks community.

Thanks,
Varun Shah

On Mon, Apr 1, 2024, 16:28 Mich Talebzadeh 
wrote:

> Hi,
>
> Have you put this question to Databricks forum
>
> Data Engineering - Databricks
> 
>
>
> Mich Talebzadeh,
> Technologist | Solutions Architect | Data Engineer  | Generative AI
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  Von
> Braun )".
>
>
> On Mon, 1 Apr 2024 at 07:22, Varun Shah  wrote:
>
>> Hi Community,
>>
>> I am currently exploring the best use of "Scheduler Pools" for executing
>> jobs in parallel, and require clarification and suggestions on a few points.
>>
>> The implementation consists of executing "Structured Streaming" jobs on
>> Databricks using AutoLoader. Each stream is executed with trigger =
>> 'AvailableNow', ensuring that the streams don't keep running for the
>> source. (we have ~4000 such streams, with no continuous stream from source,
>> hence not keeping the streams running infinitely using other triggers).
>>
>> One way to achieve parallelism in the jobs is to use "MultiThreading",
>> all using same SparkContext, as quoted from official docs: "Inside a given
>> Spark application (SparkContext instance), multiple parallel jobs can run
>> simultaneously if they were submitted from separate threads."
>>
>> There's also a availability of "FAIR Scheduler", which instead of FIFO
>> Scheduler (default), assigns executors in Round-Robin fashion, ensuring the
>> smaller jobs that were submitted later do not starve due to bigger jobs
>> submitted early consuming all resources.
>>
>> Here are my questions:
>> 1. The Round-Robin distribution of executors only work in case of empty
>> executors (achievable by enabling dynamic allocation). In case the jobs
>> (part of the same pool) requires all executors, second jobs will still need
>> to wait.
>> 2. If we create dynamic pools for submitting each stream (by setting
>> spark property -> "spark.scheduler.pool" to a dynamic value as
>> spark.sparkContext.setLocalProperty("spark.scheduler.pool", "<pool string>"), how does executor allocation happen? Since all pools created
>> are created dynamically, they share equal weight. Does this also work the
>> same way as submitting streams to a single pool as a FAIR scheduler ?
>> 3. Official docs quote "inside each pool, jobs run in FIFO order.". Is
>> this true for the FAIR scheduler also ? By definition, it does not seem
>> right, but it's confusing. It says "By Default" , so does it mean for FIFO
>> scheduler or by default for both scheduling types ?
>> 4. Are there any overhead for spark driver while creating / using a
>> dynamically created spark pool vs pre-defined pools ?
>>
>> Apart from these, any suggestions or ways you have implemented
>> auto-scaling for such loads ? We are currently trying to auto-scale the
>> resources based on requests, but scaling down is an issue (known already
>> for which SPIP is already in discussion, but it does not cater to
>> submitting multiple streams in a single cluster.
>>
>> Thanks for reading !! Looking forward to your suggestions
>>
>> Regards,
>> Varun Shah
>>
>>
>>
>>
>>


Re: Scheduling jobs using FAIR pool

2024-04-01 Thread Mich Talebzadeh
Hi,

Have you put this question to Databricks forum

Data Engineering - Databricks



Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner Von Braun)".


On Mon, 1 Apr 2024 at 07:22, Varun Shah  wrote:

> Hi Community,
>
> I am currently exploring the best use of "Scheduler Pools" for executing
> jobs in parallel, and require clarification and suggestions on a few points.
>
> The implementation consists of executing "Structured Streaming" jobs on
> Databricks using AutoLoader. Each stream is executed with trigger =
> 'AvailableNow', ensuring that the streams don't keep running for the
> source. (we have ~4000 such streams, with no continuous stream from source,
> hence not keeping the streams running infinitely using other triggers).
>
> One way to achieve parallelism in the jobs is to use "MultiThreading", all
> using same SparkContext, as quoted from official docs: "Inside a given
> Spark application (SparkContext instance), multiple parallel jobs can run
> simultaneously if they were submitted from separate threads."
>
> There's also the availability of a "FAIR Scheduler" which, instead of the
> default FIFO scheduler, assigns executors in a round-robin fashion, ensuring
> that smaller jobs submitted later do not starve because bigger jobs
> submitted earlier consume all the resources.
>
> Here are my questions:
> 1. The round-robin distribution of executors only works when there are idle
> executors (achievable by enabling dynamic allocation). If the jobs
> (part of the same pool) require all executors, subsequent jobs will still
> need to wait.
> 2. If we create dynamic pools for submitting each stream (by setting the
> spark property "spark.scheduler.pool" to a dynamic value via
> spark.sparkContext.setLocalProperty("spark.scheduler.pool", "<pool string>")),
> how does executor allocation happen? Since all pools are created
> dynamically, they share equal weight. Does this also work the same way as
> submitting streams to a single pool with the FAIR scheduler?
> 3. Official docs state "inside each pool, jobs run in FIFO order.". Is
> this true for the FAIR scheduler also? By definition, it does not seem
> right, but it's confusing. It says "by default", so does it mean for the
> FIFO scheduler only, or by default for both scheduling modes?
> 4. Is there any overhead for the Spark driver when creating / using a
> dynamically created scheduler pool vs pre-defined pools?
>
> Apart from these, are there any suggestions or ways you have implemented
> auto-scaling for such loads? We are currently trying to auto-scale the
> resources based on requests, but scaling down is an issue (known already,
> for which an SPIP is already in discussion, but it does not cater to
> submitting multiple streams in a single cluster).
>
> Thanks for reading !! Looking forward to your suggestions
>
> Regards,
> Varun Shah
>
>
>
>
>


Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-04-01 Thread Kent Yao
+1(non-binding). Thank you, Hyukjin.

Kent Yao

On Mon, Apr 1, 2024 at 18:04, Takuya UESHIN  wrote:
>
> +1
>
> On Sun, Mar 31, 2024 at 6:16 PM Hyukjin Kwon  wrote:
>>
>> Hi all,
>>
>> I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark 
>> Connect)
>>
>> JIRA
>> Prototype
>> SPIP doc
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Thanks.
>
>
>
> --
> Takuya UESHIN
>

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-03-31 Thread L. C. Hsieh
+1

Thanks Hyukjin.

On Sun, Mar 31, 2024 at 10:52 PM Dongjoon Hyun  wrote:
>
> +1
>
> Thank you, Hyukjin.
>
> Dongjoon
>
> On Sun, Mar 31, 2024 at 19:07 Haejoon Lee 
>  wrote:
>>
>> +1
>>
>> On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon  wrote:
>>>
>>> Hi all,
>>>
>>> I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark 
>>> Connect)
>>>
>>> JIRA
>>> Prototype
>>> SPIP doc
>>>
>>> Please vote on the SPIP for the next 72 hours:
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>> Thanks.

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-03-31 Thread Takuya UESHIN
+1

On Sun, Mar 31, 2024 at 6:16 PM Hyukjin Kwon  wrote:

> Hi all,
>
> I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark
> Connect)
>
> JIRA 
> Prototype 
> SPIP doc
> 
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Thanks.
>


-- 
Takuya UESHIN


Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-03-31 Thread Hyukjin Kwon
Oh I didn't send the discussion thread out as it's pretty simple,
non-invasive and the discussion was sort of done as part of the Spark
Connect initial discussion ..

On Mon, Apr 1, 2024 at 1:59 PM Mridul Muralidharan  wrote:

>
> Can you point me to the SPIP’s discussion thread please ?
> I was not able to find it, but I was on vacation, and so might have
> missed this …
>
>
> Regards,
> Mridul
>
> On Sun, Mar 31, 2024 at 9:08 PM Haejoon Lee
>  wrote:
>
>> +1
>>
>> On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon 
>> wrote:
>>
>>> Hi all,
>>>
>>> I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark
>>> Connect)
>>>
>>> JIRA 
>>> Prototype 
>>> SPIP doc
>>> 
>>>
>>> Please vote on the SPIP for the next 72 hours:
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>> Thanks.
>>>
>>


Scheduling jobs using FAIR pool

2024-03-31 Thread Varun Shah
Hi Community,

I am currently exploring the best use of "Scheduler Pools" for executing
jobs in parallel, and require clarification and suggestions on a few points.

The implementation consists of executing "Structured Streaming" jobs on
Databricks using AutoLoader. Each stream is executed with trigger =
'AvailableNow', ensuring that the streams don't keep running waiting on the
source (we have ~4000 such streams, with no continuous stream from the source,
hence we do not keep the streams running indefinitely using other triggers).

One way to achieve parallelism in the jobs is to use "MultiThreading", all
using the same SparkContext, as quoted from official docs: "Inside a given
Spark application (SparkContext instance), multiple parallel jobs can run
simultaneously if they were submitted from separate threads."

There's also the availability of a "FAIR Scheduler" which, instead of the
default FIFO scheduler, assigns executors in a round-robin fashion, ensuring
that smaller jobs submitted later do not starve because bigger jobs
submitted earlier consume all the resources.
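
To make the setup concrete, here is a simplified sketch of how we submit these
streams (the plain text file source, the paths, and the pool name are
placeholders standing in for our AutoLoader streams; it assumes
spark.scheduler.mode=FAIR and a pool named "streams" defined in
fairscheduler.xml):

from concurrent.futures import ThreadPoolExecutor

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parallel-availablenow-streams").getOrCreate()

def run_stream(stream_id: str) -> None:
    # Jobs issued from this thread inherit the pool set via this local property.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "streams")
    df = spark.readStream.format("text").load(f"/landing/{stream_id}/")
    query = (
        df.writeStream
        .trigger(availableNow=True)  # process what is available now, then stop
        .format("parquet")
        .option("path", f"/bronze/{stream_id}/")
        .option("checkpointLocation", f"/checkpoints/{stream_id}/")
        .start()
    )
    query.awaitTermination()

# Submit the streams from separate threads so that their jobs can run in
# parallel inside the same SparkContext.
with ThreadPoolExecutor(max_workers=8) as thread_pool:
    list(thread_pool.map(run_stream, [f"stream_{i}" for i in range(4000)]))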

Here are my questions:
1. The round-robin distribution of executors only works when there are idle
executors (achievable by enabling dynamic allocation). If the jobs
(part of the same pool) require all executors, subsequent jobs will still
need to wait.
2. If we create dynamic pools for submitting each stream (by setting the spark
property "spark.scheduler.pool" to a dynamic value via
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "<pool string>")),
how does executor allocation happen? Since all pools are created
dynamically, they share equal weight. Does this also work the same way as
submitting streams to a single pool with the FAIR scheduler?
3. Official docs state "inside each pool, jobs run in FIFO order.". Is this
true for the FAIR scheduler also? By definition, it does not seem right,
but it's confusing. It says "by default", so does it mean for the FIFO
scheduler only, or by default for both scheduling modes?
4. Is there any overhead for the Spark driver when creating / using a
dynamically created scheduler pool vs pre-defined pools?

Apart from these, are there any suggestions or ways you have implemented
auto-scaling for such loads? We are currently trying to auto-scale the
resources based on requests, but scaling down is an issue (known already, for
which an SPIP is already in discussion, but it does not cater to submitting
multiple streams in a single cluster).

Thanks for reading !! Looking forward to your suggestions

Regards,
Varun Shah


Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-03-31 Thread Mridul Muralidharan
Can you point me to the SPIP’s discussion thread please ?
I was not able to find it, but I was on vacation, and so might have missed
this …


Regards,
Mridul

On Sun, Mar 31, 2024 at 9:08 PM Haejoon Lee
 wrote:

> +1
>
> On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark
>> Connect)
>>
>> JIRA 
>> Prototype 
>> SPIP doc
>> 
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Thanks.
>>
>


Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-03-31 Thread Gengliang Wang
+1

On Sun, Mar 31, 2024 at 8:24 PM Dongjoon Hyun 
wrote:

> +1
>
> Thank you, Hyukjin.
>
> Dongjoon
>
> On Sun, Mar 31, 2024 at 19:07 Haejoon Lee
>  wrote:
>
>> +1
>>
>> On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon 
>> wrote:
>>
>>> Hi all,
>>>
>>> I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark
>>> Connect)
>>>
>>> JIRA 
>>> Prototype 
>>> SPIP doc
>>> 
>>>
>>> Please vote on the SPIP for the next 72 hours:
>>>
>>> [ ] +1: Accept the proposal as an official SPIP
>>> [ ] +0
>>> [ ] -1: I don’t think this is a good idea because …
>>>
>>> Thanks.
>>>
>>


Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-03-31 Thread Dongjoon Hyun
+1

Thank you, Hyukjin.

Dongjoon

On Sun, Mar 31, 2024 at 19:07 Haejoon Lee
 wrote:

> +1
>
> On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark
>> Connect)
>>
>> JIRA 
>> Prototype 
>> SPIP doc
>> 
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Thanks.
>>
>


Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-03-31 Thread Ruifeng Zheng
+1

On Mon, Apr 1, 2024 at 10:06 AM Haejoon Lee
 wrote:

> +1
>
> On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>> I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark
>> Connect)
>>
>> JIRA 
>> Prototype 
>> SPIP doc
>> 
>>
>> Please vote on the SPIP for the next 72 hours:
>>
>> [ ] +1: Accept the proposal as an official SPIP
>> [ ] +0
>> [ ] -1: I don’t think this is a good idea because …
>>
>> Thanks.
>>
>

-- 
Ruifeng Zheng
E-mail: zrfli...@gmail.com


Re: [VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-03-31 Thread Haejoon Lee
+1

On Mon, Apr 1, 2024 at 10:15 AM Hyukjin Kwon  wrote:

> Hi all,
>
> I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark
> Connect)
>
> JIRA 
> Prototype 
> SPIP doc
> 
>
> Please vote on the SPIP for the next 72 hours:
>
> [ ] +1: Accept the proposal as an official SPIP
> [ ] +0
> [ ] -1: I don’t think this is a good idea because …
>
> Thanks.
>


[VOTE] SPIP: Pure Python Package in PyPI (Spark Connect)

2024-03-31 Thread Hyukjin Kwon
Hi all,

I'd like to start the vote for SPIP: Pure Python Package in PyPI (Spark
Connect)

JIRA 
Prototype 
SPIP doc
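
For context, a client-only PySpark session over Spark Connect looks roughly
like this with the existing remote API (a rough sketch, not taken from the
SPIP; the address is illustrative and assumes a Spark Connect server is
already running, e.g. via sbin/start-connect-server.sh):

from pyspark.sql import SparkSession

# 15002 is the default Spark Connect server port.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

spark.range(10).selectExpr("sum(id) AS total").show()
spark.stop()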


Please vote on the SPIP for the next 72 hours:

[ ] +1: Accept the proposal as an official SPIP
[ ] +0
[ ] -1: I don’t think this is a good idea because …

Thanks.


Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-03-28 Thread Pavan Kotikalapudi
Hi Andrew, Sandy, Jerry, Thomas, marcelo, Whenchen, YangJie, Shixiong,

My apologies for tagging so many of you (on multiple emails). I am in the
process of finding the core contributors of the Dynamic Resource Allocation
(DRA) feature in apache/spark, and I could identify you folks as some of the
core contributing members of this feature.

We (cc'd) would like to extend the current DRA to work for structured
streaming [SPARK-24815 ]
use-case (based on the heuristics of trigger interval).
Here is the design doc
.
We also have a draft PR  with
initial implementation.

This feature has been running well for the past one year at my company
(twilio) and there are a lot of folks in the community who are interested
in this feature.

To get the PR to a mergeable state, we would love to leverage your
expertise on DRA. I request that you please review the design doc and the
draft PR and let us know your thoughts and concerns, if any. This will hugely
benefit the community members using structured streaming applications in
their data pipelines.

Looking forward to hearing back from you.

Thank you,

Pavan

On Thu, Mar 28, 2024 at 3:38 PM Pavan Kotikalapudi 
wrote:

> Hi Jungtaek,
>
> Sorry for the late reply.
>
> I understand the concerns towards finding PMC members, I had similar
> concerns in the past. Do you think we have something to improve in the SPIP
> (certain areas) so that it would get traction from PMC members? Or this
> SPIP might not be a priority to the PMC right now?
>
> I agree this change is small enough that it might not be tagged as an
> SPIP. I started with the template SPIP questions so that it would be easier
> to understand the limitations of the current system, the new solution, how
> it works, how to use it, its limitations, etc. As you might have already
> noticed in the PR, this change is turned off by default and will only work
> if `spark.dynamicAllocation.streaming.enabled` is true.
>
> Regarding the concerns about expertise in DRA,  I will find some core
> contributors of this module/DRA and tag them to this email with details,
> Mich has also highlighted the same in the past. Once we get approval from
> them we can further discuss and enhance this to make the user experience
> better.
>
> Thank you,
>
> Pavan
>
>
> On Tue, Mar 26, 2024 at 8:12 PM Jungtaek Lim 
> wrote:
>
>> Sounds good.
>>
>> One thing I'd like to clarify before shepherding this SPIP is the process
>> itself. Getting enough traction from PMC members is another issue to pass
>> the SPIP vote. Even a vote from committer is not counted. (I don't have a
>> binding vote.) I only see one PMC member (Thomas Graves, not my team) in
>> the design doc and we still don't get positive feedback. So still a long
>> way to go. We need three supporters from PMC members.
>>
>> Another thing is, I get the proposal at a high level, but I don't have
>> actual expertise in DRA. I could review the code in general, but I feel
>> like I'm not qualified to approve the code. We still need an expert on the
>> CORE area, especially who has expertise with DRA. (Could you please
>> annotate the code and enumerate several people who worked on the codebase?)
>> If they need an expertise of streaming to understand how things will work
>> then either you or I can explain, but I can't just approve and merge the
>> code.
>>
>> That said, if we succeed in finding one and they review the code and
>> LGTM, I'd rather say not to go with taking the process of SPIP unless the
>> expert reviewing your code requires us to do so. The change you proposed is
>> rather small and does not seem to be invasive (experts can also weigh), and
>> there must never be the case that this feature is turned on by default (as
>> we pointed out limitation). It doesn't look like requiring SPIP, if we
>> carefully document the new change and also clearly describe the limitation.
>> (Also a warning in the codebase that this must not be enabled by default.)
>>
>>
>> On Tue, Mar 26, 2024 at 7:02 PM Pavan Kotikalapudi <
>> pkotikalap...@twilio.com> wrote:
>>
>>> Hi Bhuwan,
>>>
>>> Glad to hear back from you! Very much appreciate your help on reviewing
>>> the design doc/PR and endorsing this proposal.
>>>
>>> Thank you so much @Jungtaek Lim  , @Mich
>>> Talebzadeh   for graciously agreeing to
>>> mentor/shepherd this effort.
>>>
>>> Regarding Twilio copyright in Notice binary file:
>>> Twilio Opensource counsel was involved all through the process, I have
>>> placed it in the project file prior to Twilio signing a CCLA for the spark
>>> project contribution( Aug '23).
>>>
>>> Since the CCLA is signed now, I have removed the twilio copyright from
>>> that file. I didn't get a chance to update the PR after github-actions
>>> closed it.
>>>
>>> Please let me know of next steps needed to 

Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-03-28 Thread Pavan Kotikalapudi
Hi Jungtaek,

Sorry for the late reply.

I understand the concerns about finding PMC members; I had similar
concerns in the past. Do you think there is something we can improve in the
SPIP (certain areas) so that it would get traction from PMC members? Or might
this SPIP not be a priority for the PMC right now?

I agree this change is small enough that it might not be tagged as an SPIP.
I started with the template SPIP questions so that it would be easier to
understand the limitations of the current system, the new solution, how it
works, how to use it, its limitations, etc. As you might have already
noticed in the PR, this change is turned off by default and will only work if
`spark.dynamicAllocation.streaming.enabled` is true.
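
For concreteness, opting in under the draft PR would look roughly like this
(the streaming flag is the config proposed in the draft PR and is not part of
any Spark release; the other settings are the existing DRA knobs it builds on,
with illustrative values):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("dra-structured-streaming-poc")
    # Existing dynamic resource allocation settings.
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .config("spark.dynamicAllocation.minExecutors", "2")
    .config("spark.dynamicAllocation.maxExecutors", "20")
    # Proposed in the draft PR; off by default and not in any released Spark.
    .config("spark.dynamicAllocation.streaming.enabled", "true")
    .getOrCreate()
)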

Regarding the concerns about expertise in DRA, I will find some core
contributors of this module/DRA and tag them on this email with details;
Mich has also highlighted the same in the past. Once we get approval from
them, we can further discuss and enhance this to make the user experience
better.

Thank you,

Pavan


On Tue, Mar 26, 2024 at 8:12 PM Jungtaek Lim 
wrote:

> Sounds good.
>
> One thing I'd like to clarify before shepherding this SPIP is the process
> itself. Getting enough traction from PMC members is another issue to pass
> the SPIP vote. Even a vote from committer is not counted. (I don't have a
> binding vote.) I only see one PMC member (Thomas Graves, not my team) in
> the design doc and we still don't get positive feedback. So still a long
> way to go. We need three supporters from PMC members.
>
> Another thing is, I get the proposal at a high level, but I don't have
> actual expertise in DRA. I could review the code in general, but I feel
> like I'm not qualified to approve the code. We still need an expert on the
> CORE area, especially who has expertise with DRA. (Could you please
> annotate the code and enumerate several people who worked on the codebase?)
> If they need an expertise of streaming to understand how things will work
> then either you or I can explain, but I can't just approve and merge the
> code.
>
> That said, if we succeed in finding one and they review the code and LGTM,
> I'd rather say not to go with taking the process of SPIP unless the expert
> reviewing your code requires us to do so. The change you proposed is rather
> small and does not seem to be invasive (experts can also weigh), and there
> must never be the case that this feature is turned on by default (as we
> pointed out limitation). It doesn't look like requiring SPIP, if we
> carefully document the new change and also clearly describe the limitation.
> (Also a warning in the codebase that this must not be enabled by default.)
>
>
> On Tue, Mar 26, 2024 at 7:02 PM Pavan Kotikalapudi <
> pkotikalap...@twilio.com> wrote:
>
>> Hi Bhuwan,
>>
>> Glad to hear back from you! Very much appreciate your help on reviewing
>> the design doc/PR and endorsing this proposal.
>>
>> Thank you so much @Jungtaek Lim  , @Mich
>> Talebzadeh   for graciously agreeing to
>> mentor/shepherd this effort.
>>
>> Regarding Twilio copyright in Notice binary file:
>> Twilio Opensource counsel was involved all through the process, I have
>> placed it in the project file prior to Twilio signing a CCLA for the spark
>> project contribution( Aug '23).
>>
>> Since the CCLA is signed now, I have removed the twilio copyright from
>> that file. I didn't get a chance to update the PR after github-actions
>> closed it.
>>
>> Please let me know of next steps needed to bring this draft PR/effort to
>> completion.
>>
>> Thank you,
>>
>> Pavan
>>
>>
>> On Tue, Mar 26, 2024 at 12:01 AM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> I'm happy to, but it looks like I need to check one more thing about the
>>> license, according to the WIP PR
>>> 
>>> .
>>>
>>> @Pavan Kotikalapudi 
>>> I see you've added the copyright of Twilio in the NOTICE-binary file,
>>> which makes me wonder if Twilio had filed CCLA to the Apache Software
>>> Foundation.
>>>
>>> PMC members can correct me if I'm mistaken, but from my understanding
>>> (and experiences of PMC member in other ASF project), code contribution is
>>> considered as code donation and copyright belongs to ASF. That's why you
>>> can't find the copyright of employers for contributors in the codebase.
>>> What you see copyrights in NOTICE-binary is due to the fact we have binary
>>> dependency and their licenses may require to explicitly mention about
>>> copyright. It's not about direct code contribution.
>>>
>>> Is Twilio aware of this? Also, if Twilio did not file CCLA in prior,
>>> could you please engage with a relevant group in the company (could be a
>>> legal team, or similar with OSS advocate team if there is any) and ensure
>>> that CCLA is filed? The copyright issue is a legal issue, 

Re: The dedicated repository for Kubernetes Operator for Apache Spark

2024-03-28 Thread Dongjoon Hyun
Thank you, Liang-Chi!

Dongjoon.

On Wed, Mar 27, 2024 at 10:56 PM L. C. Hsieh  wrote:

> Hi all,
>
> For the passed SPIP: An Official Kubernetes Operator for Apache Spark,
> the developers have been working on code cleaning and refactoring for
> open source in the last few months. They are ready to contribute the
> code to Spark now.
>
> As we discussed, I will go to create a dedicated repository for the
> Kubernetes Operator for Apache Spark. I think the repository name will
> be "spark-kubernetes-operator". I will try to create the repository
> tomorrow.
>
> After that, they will contribute the code as an initial PR for review
> from the Spark community.
>
> Thank you.
>
> Liang-Chi
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: [DISCUSSION] SPIP: An Official Kubernetes Operator for Apache Spark

2024-03-28 Thread L. C. Hsieh
Hi Vakaris,

Sorry for the late reply. Thanks for being interested in the official operator.
The developers have been working on code cleaning and refactoring the
internal code for open source over the last few months.
They are ready to contribute the code to Spark.

We will create a dedicated repository and contribute the code as an
initial PR for review soon.

Liang-Chi

On Wed, Mar 20, 2024 at 8:27 AM Vakaris Baškirov
 wrote:
>
> Hi!
> Just wanted to inquire about the status of the official operator. We are 
> looking forward to contributing and later on switching to a Spark Operator 
> and we would prefer it to be the official one.
>
> Thanks,
> Vakaris
>
> On Thu, Nov 30, 2023 at 7:09 AM Shiqi Sun  wrote:
>>
>> Hi Zhou,
>>
>> Thanks for the reply. For the language choice, since I don't think I've used 
>> many k8s components written in Java on k8s, I can't really tell, but at 
>> least for the components written in Golang, they are well-organized, easy to 
>> read/maintain and run well in general. In addition, goroutines really ease 
>> things a lot when writing concurrency code. Golang also has a lot less 
>> boilerplates, no complicated inheritance and easier dependency management 
>> and linting toolings. Together with all these points, that's why I prefer 
>> Golang for this k8s operator. I understand the Spark maintainers are more 
>> familiar with JVM languages, but I think we should consider the performance 
>> and maintainability vs the learning curve, to choose an option that can win 
>> in the long run. Plus, I believe most of the Spark maintainers who touch k8s 
>> related parts in the Spark project already have experiences with Golang, so 
>> it shouldn't be a big problem. Our team had some experience with the fabric8 
>> client a couple years ago, and we've experienced some issues with its 
>> reliability, mainly about the request dropping issue (i.e. code call is made 
>> but the apiserver never receives the request), but that was awhile ago and 
>> I'm not sure whether everything is good with the client now. Anyway, this is 
>> my opinion about the language choice, and I will let other people comment 
>> about it as well.
>>
>> For compatibility, yes please make the CRD compatible from the user's 
>> standpoint, so that it's easy for people to adopt the new operator. The goal 
>> is to consolidate the many spark operators on the market to this new 
>> official operator, so an easy adoption experience is the key.
>>
>> Also, I feel that the discussion is pretty high level, and it's because the 
>> only info revealed for this new operator is the SPIP doc and I haven't got a 
>> chance to see the code yet. I understand the new operator project might 
>> still not be open-sourced yet, but is there any way for me to take an early 
>> peek into the code of your operator, so that we can discuss more 
>> specifically about the points of language choice and compatibility? Thank 
>> you so much!
>>
>> Best,
>> Shiqi
>>
>> On Tue, Nov 28, 2023 at 10:42 AM Zhou Jiang  wrote:
>>>
>>> Hi Shiqi,
>>>
>>> Thanks for the cross-posting here - sorry for the response delay during the 
>>> holiday break :)
>>> We prefer Java for the operator project as it's JVM-based and widely 
>>> familiar within the Spark community. This choice aims to facilitate better 
>>> adoption and ease of onboarding for future maintainers. In addition, the 
>>> Java API client can also be considered as a mature option widely used, by 
>>> Spark itself and by other operator implementations like Flink.
>>> For easier onboarding and potential migration, we'll consider compatibility 
>>> with existing CRD designs - the goal is to maintain compatibility as best 
>>> as possible while minimizing duplication efforts.
>>> I'm enthusiastic about the idea of lean, version agnostic submission 
>>> worker. It aligns with one of the primary goals in the operator design. 
>>> Let's continue exploring this idea further in design doc.
>>>
>>> Thanks,
>>> Zhou
>>>
>>>
>>> On Wed, Nov 22, 2023 at 3:35 PM Shiqi Sun  wrote:

 Hi all,

 Sorry for being late to the party. I went through the SPIP doc and I think 
 this is a great proposal! I left a comment in the SPIP doc a couple days 
 ago, but I don't see much activity there and no one replied, so I wanted 
 to cross-post it here to get some feedback.

 I'm Shiqi Sun, and I work for Big Data Platform in Salesforce. My team has 
 been running the Spark on k8s operator (OSS from Google) in my company to 
 serve Spark users on production for 4+ years, and we've been actively 
 contributing to the Spark on k8s operator OSS and also, occasionally, the 
 Spark OSS. According to our experience, Google's Spark Operator has its 
 own problems, like its close coupling with the spark version, as well as 
 the JVM overhead during job submission. However on the other side, it's 
 been a great component in our team's service in the company, especially 

The dedicated repository for Kubernetes Operator for Apache Spark

2024-03-27 Thread L. C. Hsieh
Hi all,

For the passed SPIP: An Official Kubernetes Operator for Apache Spark,
the developers have been working on code cleaning and refactoring for
open source in the last few months. They are ready to contribute the
code to Spark now.

As we discussed, I will go ahead and create a dedicated repository for the
Kubernetes Operator for Apache Spark. I think the repository name will
be "spark-kubernetes-operator". I will try to create the repository
tomorrow.

After that, they will contribute the code as an initial PR for review
from the Spark community.

Thank you.

Liang-Chi

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: Allowing Unicode Whitespace in Lexer

2024-03-27 Thread Mich Talebzadeh
Looks fine, except that processing all Unicode whitespace characters might
add overhead to the parsing process, potentially impacting performance,
although I think this is a moot point.

+1

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Wed, 27 Mar 2024 at 22:57, Gengliang Wang  wrote:

> +1, this is a reasonable change.
>
> Gengliang
>
> On Wed, Mar 27, 2024 at 9:54 AM serge rielau.com  wrote:
>
>> Going once, going twice, …. last call for objections
>> On Mar 23, 2024 at 5:29 PM -0700, serge rielau.com ,
>> wrote:
>>
>> Hello,
>>
>> I have a PR https://github.com/apache/spark/pull/45620  ready to go that
>> will extend the definition of whitespace (what separates token) from the
>> small set of ASCII characters space, tab, linefeed to those defined in
>> Unicode.
>> While this is a small and safe change, it is one where we would have a
>> hard time changing our minds about later.
>> It is also a change that, AFAIK, cannot be controlled under a config.
>>
>> What does the community think?
>>
>> Cheers
>> Serge
>> SQL Architect at Databricks
>>
>>


Re: Allowing Unicode Whitespace in Lexer

2024-03-27 Thread Gengliang Wang
+1, this is a reasonable change.

Gengliang

On Wed, Mar 27, 2024 at 9:54 AM serge rielau.com  wrote:

> Going once, going twice, …. last call for objections
> On Mar 23, 2024 at 5:29 PM -0700, serge rielau.com ,
> wrote:
>
> Hello,
>
> I have a PR https://github.com/apache/spark/pull/45620  ready to go that
> will extend the definition of whitespace (what separates token) from the
> small set of ASCII characters space, tab, linefeed to those defined in
> Unicode.
> While this is a small and safe change, it is one where we would have a
> hard time changing our minds about later.
> It is also a change that, AFAIK, cannot be controlled under a config.
>
> What does the community think?
>
> Cheers
> Serge
> SQL Architect at Databricks
>
>


Re: Allowing Unicode Whitespace in Lexer

2024-03-27 Thread serge rielau . com
Going once, going twice, …. last call for objections
On Mar 23, 2024 at 5:29 PM -0700, serge rielau.com , wrote:
Hello,

I have a PR https://github.com/apache/spark/pull/45620  ready to go that will 
extend the definition of whitespace (what separates token) from the small set 
of ASCII characters space, tab, linefeed to those defined in Unicode.
While this is a small and safe change, it is one where we would have a hard 
time changing our minds about later.
It is also a change that, AFAIK, cannot be controlled under a config.
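
As a concrete (hypothetical) illustration, with U+00A0 (no-break space)
standing in for any Unicode whitespace character:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# U+00A0 separates the tokens below. Today the lexer only treats ASCII space,
# tab and linefeed as separators, so this raises a parse error; with the
# proposed change it would parse the same as "SELECT 1 AS one".
spark.sql("SELECT\u00a01\u00a0AS\u00a0one").show()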

What does the community think?

Cheers
Serge
SQL Architect at Databricks



Re: Allowing Unicode Whitespace in Lexer

2024-03-27 Thread serge rielau . com
Yeah, I heard about that. This IMHO is a bit more worrying, and we do not have
the "excuse" that it is transparent.
Also, which of these would be STRING and which IDENTIFIER?

On Mar 25, 2024 at 1:06 PM -0700, Alex Cruise , wrote:
While we're at it, maybe consider allowing "smart quotes" too :)

-0xe1a

On Sat, Mar 23, 2024 at 5:29 PM serge rielau.com 
mailto:se...@rielau.com>> wrote:
Hello,

I have a PR https://github.com/apache/spark/pull/45620  ready to go that will 
extend the definition of whitespace (what separates token) from the small set 
of ASCII characters space, tab, linefeed to those defined in Unicode.
While this is a small and safe change, it is one where we would have a hard 
time changing our minds about later.
It is also a change that, AFAIK, cannot be controlled under a config.

What does the community think?

Cheers
Serge
SQL Architect at Databricks



Community Over Code NA 2024 Travel Assistance Applications now open!

2024-03-27 Thread Gavin McDonald
Hello to all users, contributors and Committers!

[ You are receiving this email as a subscriber to one or more ASF project
dev or user
  mailing lists; it is not being sent to you directly. It is important that
we reach all of our
  users and contributors/committers so that they may get a chance to
benefit from this.
  We apologise in advance if this doesn't interest you, but it is on topic
for the mailing
  lists of the Apache Software Foundation; and it is important, please, that
you do not
  mark this as spam in your email client. Thank You! ]

The Travel Assistance Committee (TAC) are pleased to announce that
travel assistance applications for Community over Code NA 2024 are now
open!

We will be supporting Community over Code NA, Denver Colorado in
October 7th to the 10th 2024.

TAC exists to help those that would like to attend Community over Code
events, but are unable to do so for financial reasons. For more info
on this year's applications and qualifying criteria, please visit the
TAC website at < https://tac.apache.org/ >. Applications are already
open on https://tac-apply.apache.org/, so don't delay!

The Apache Travel Assistance Committee will only be accepting
applications from those people that are able to attend the full event.

Important: Applications close on Monday 6th May, 2024.

Applicants have until the closing date above to submit their
applications (which should contain as much supporting material as
required to efficiently and accurately process their request), this
will enable TAC to announce successful applications shortly
afterwards.

As usual, TAC expects to deal with a range of applications from a
diverse range of backgrounds; therefore, we encourage (as always)
anyone thinking about sending in an application to do so ASAP.

For those who will need a visa to enter the country, we advise you to apply
now so that you have enough time in case of interview delays. Do not
wait until you know whether you have been accepted or not.

We look forward to greeting many of you in Denver, Colorado, in October 2024!

Kind Regards,

Gavin

(On behalf of the Travel Assistance Committee)


Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-03-26 Thread Jungtaek Lim
Sounds good.

One thing I'd like to clarify before shepherding this SPIP is the process
itself. Getting enough traction from PMC members is another issue to pass
the SPIP vote. Even a vote from committer is not counted. (I don't have a
binding vote.) I only see one PMC member (Thomas Graves, not my team) in
the design doc and we still don't get positive feedback. So still a long
way to go. We need three supporters from PMC members.

Another thing is, I get the proposal at a high level, but I don't have
actual expertise in DRA. I could review the code in general, but I feel
like I'm not qualified to approve the code. We still need an expert in the
CORE area, especially someone who has expertise with DRA. (Could you please
annotate the code and enumerate several people who worked on the codebase?)
If they need expertise in streaming to understand how things will work,
then either you or I can explain, but I can't just approve and merge the
code.

That said, if we succeed in finding one and they review the code and LGTM,
I'd rather say not to go through the SPIP process unless the expert
reviewing your code requires us to do so. The change you proposed is rather
small and does not seem to be invasive (experts can also weigh in), and there
must never be the case that this feature is turned on by default (as we
pointed out as a limitation). It doesn't look like it requires an SPIP, if we
carefully document the new change and also clearly describe the limitation.
(Also a warning in the codebase that this must not be enabled by default.)


On Tue, Mar 26, 2024 at 7:02 PM Pavan Kotikalapudi 
wrote:

> Hi Bhuwan,
>
> Glad to hear back from you! Very much appreciate your help on reviewing
> the design doc/PR and endorsing this proposal.
>
> Thank you so much @Jungtaek Lim  , @Mich
> Talebzadeh   for graciously agreeing to
> mentor/shepherd this effort.
>
> Regarding Twilio copyright in Notice binary file:
> Twilio Opensource counsel was involved all through the process, I have
> placed it in the project file prior to Twilio signing a CCLA for the spark
> project contribution( Aug '23).
>
> Since the CCLA is signed now, I have removed the twilio copyright from
> that file. I didn't get a chance to update the PR after github-actions
> closed it.
>
> Please let me know of next steps needed to bring this draft PR/effort to
> completion.
>
> Thank you,
>
> Pavan
>
>
> On Tue, Mar 26, 2024 at 12:01 AM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> I'm happy to, but it looks like I need to check one more thing about the
>> license, according to the WIP PR
>> 
>> .
>>
>> @Pavan Kotikalapudi 
>> I see you've added the copyright of Twilio in the NOTICE-binary file,
>> which makes me wonder if Twilio had filed CCLA to the Apache Software
>> Foundation.
>>
>> PMC members can correct me if I'm mistaken, but from my understanding
>> (and experiences of PMC member in other ASF project), code contribution is
>> considered as code donation and copyright belongs to ASF. That's why you
>> can't find the copyright of employers for contributors in the codebase.
>> What you see copyrights in NOTICE-binary is due to the fact we have binary
>> dependency and their licenses may require to explicitly mention about
>> copyright. It's not about direct code contribution.
>>
>> Is Twilio aware of this? Also, if Twilio did not file CCLA in prior,
>> could you please engage with a relevant group in the company (could be a
>> legal team, or similar with OSS advocate team if there is any) and ensure
>> that CCLA is filed? The copyright issue is a legal issue, so we have to be
>> conservative and 100% sure that the employer is aware of what is the
>> meaning of donating the code to ASF via reviewing CCLA and relevant doc,
>> and explicitly express that they are OK with it via filing CCLA.
>>
>> You can read the description of agreements on contribution and ICLA/CCLA
>> form from this page.
>> https://www.apache.org/licenses/contributor-agreements.html
>> 
>>
>> Please let me know if this is resolved. This seems to me as a blocker to
>> move on. Please also let me know if the contribution is withdrawn from the
>> employer.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>>
>> On Mon, Mar 25, 2024 at 11:47 PM Bhuwan Sahni
>>  wrote:
>>
>>> Hi Pavan,
>>>
>>> I looked at the PR, and the changes look simple and contained. It would
>>> be useful to add dynamic resource allocation to Spark Structured Streaming.
>>>
>>> Jungtaek. Would you be able to shepherd this change?
>>>
>>>
>>> On Tue, Mar 19, 2024 at 10:38 AM Bhuwan Sahni <
>>> bhuwan.sa...@databricks.com> wrote:
>>>
 

Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-03-26 Thread Pavan Kotikalapudi
Sounds good.

Thanks again for your help on guiding the effort from discussion/review
through voting phases in the spark dev community.

Thank you,

Pavan

On Tue, Mar 26, 2024 at 4:20 AM Mich Talebzadeh 
wrote:

> Hi Pavan,
>
> Thanks for initiating this proposal. Looks like the proposal is ready and
> has enough votes to be implemented. Having a shepherd will make it more
> fruitful.
>
> I will leave it to @Jungtaek Lim  's
> capable hands to drive it forward.
>
> Will be there to help if needed.
>
> Cheers
>
> Mich Talebzadeh,
> Technologist | Solutions Architect | Data Engineer  | Generative AI
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
> 
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner
> Von
> Braun
> 
> )".
>
>
> On Tue, 26 Mar 2024 at 10:02, Pavan Kotikalapudi 
> wrote:
>
>> Hi Bhuwan,
>>
>> Glad to hear back from you! Very much appreciate your help on reviewing
>> the design doc/PR and endorsing this proposal.
>>
>> Thank you so much @Jungtaek Lim  , @Mich
>> Talebzadeh   for graciously agreeing to
>> mentor/shepherd this effort.
>>
>> Regarding Twilio copyright in Notice binary file:
>> Twilio Opensource counsel was involved all through the process, I have
>> placed it in the project file prior to Twilio signing a CCLA for the spark
>> project contribution( Aug '23).
>>
>> Since the CCLA is signed now, I have removed the twilio copyright from
>> that file. I didn't get a chance to update the PR after github-actions
>> closed it.
>>
>> Please let me know of next steps needed to bring this draft PR/effort to
>> completion.
>>
>> Thank you,
>>
>> Pavan
>>
>>
>> On Tue, Mar 26, 2024 at 12:01 AM Jungtaek Lim <
>> kabhwan.opensou...@gmail.com> wrote:
>>
>>> I'm happy to, but it looks like I need to check one more thing about the
>>> license, according to the WIP PR
>>> 
>>> .
>>>
>>> @Pavan Kotikalapudi 
>>> I see you've added the copyright of Twilio in the NOTICE-binary file,
>>> which makes me wonder if Twilio had filed CCLA to the Apache Software
>>> Foundation.
>>>
>>> PMC members can correct me if I'm mistaken, but from my understanding
>>> (and experiences of PMC member in other ASF project), code contribution is
>>> considered as code donation and copyright belongs to ASF. That's why you
>>> can't find the copyright of employers for contributors in the codebase.
>>> What you see copyrights in NOTICE-binary is due to the fact we have binary
>>> dependency and their licenses may require to explicitly mention about
>>> copyright. It's not about direct code contribution.
>>>
>>> Is Twilio aware of this? Also, if Twilio did not file CCLA in prior,
>>> could you please engage with a relevant group in the company (could be a
>>> legal team, or similar with OSS advocate team if there is any) and ensure
>>> that CCLA is filed? The copyright issue is a legal issue, so we have to be
>>> conservative and 100% sure that the employer is aware of what is the
>>> meaning of donating the code to ASF via reviewing CCLA and relevant doc,
>>> and explicitly express that they are OK with it via filing CCLA.
>>>
>>> You can read the description of agreements on contribution and ICLA/CCLA
>>> form from this page.
>>> https://www.apache.org/licenses/contributor-agreements.html
>>> 
>>>
>>> Please let me know if this is resolved. This seems to me as a blocker to
>>> move on. Please also let me know if the contribution is withdrawn from the
>>> employer.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>>
>>> On Mon, Mar 25, 2024 at 11:47 PM Bhuwan Sahni
>>>  wrote:
>>>
 Hi Pavan,

 I looked at the PR, and the changes look simple 

Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-03-26 Thread Mich Talebzadeh
Hi Pavan,

Thanks for initiating this proposal. Looks like the proposal is ready and
has enough votes to be implemented. Having a shepherd will make it more
fruitful.

I will leave it to @Jungtaek Lim  's
capable hands to drive it forward.

Will be there to help if needed.

Cheers

Mich Talebzadeh,
Technologist | Solutions Architect | Data Engineer  | Generative AI
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Tue, 26 Mar 2024 at 10:02, Pavan Kotikalapudi 
wrote:

> Hi Bhuwan,
>
> Glad to hear back from you! Very much appreciate your help on reviewing
> the design doc/PR and endorsing this proposal.
>
> Thank you so much @Jungtaek Lim  , @Mich
> Talebzadeh   for graciously agreeing to
> mentor/shepherd this effort.
>
> Regarding Twilio copyright in Notice binary file:
> Twilio Opensource counsel was involved all through the process, I have
> placed it in the project file prior to Twilio signing a CCLA for the spark
> project contribution( Aug '23).
>
> Since the CCLA is signed now, I have removed the twilio copyright from
> that file. I didn't get a chance to update the PR after github-actions
> closed it.
>
> Please let me know of next steps needed to bring this draft PR/effort to
> completion.
>
> Thank you,
>
> Pavan
>
>
> On Tue, Mar 26, 2024 at 12:01 AM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> I'm happy to, but it looks like I need to check one more thing about the
>> license, according to the WIP PR
>> 
>> .
>>
>> @Pavan Kotikalapudi 
>> I see you've added the copyright of Twilio in the NOTICE-binary file,
>> which makes me wonder if Twilio had filed CCLA to the Apache Software
>> Foundation.
>>
>> PMC members can correct me if I'm mistaken, but from my understanding
>> (and experiences of PMC member in other ASF project), code contribution is
>> considered as code donation and copyright belongs to ASF. That's why you
>> can't find the copyright of employers for contributors in the codebase.
>> What you see copyrights in NOTICE-binary is due to the fact we have binary
>> dependency and their licenses may require to explicitly mention about
>> copyright. It's not about direct code contribution.
>>
>> Is Twilio aware of this? Also, if Twilio did not file CCLA in prior,
>> could you please engage with a relevant group in the company (could be a
>> legal team, or similar with OSS advocate team if there is any) and ensure
>> that CCLA is filed? The copyright issue is a legal issue, so we have to be
>> conservative and 100% sure that the employer is aware of what is the
>> meaning of donating the code to ASF via reviewing CCLA and relevant doc,
>> and explicitly express that they are OK with it via filing CCLA.
>>
>> You can read the description of agreements on contribution and ICLA/CCLA
>> form from this page.
>> https://www.apache.org/licenses/contributor-agreements.html
>> 
>>
>> Please let me know if this is resolved. This seems to me as a blocker to
>> move on. Please also let me know if the contribution is withdrawn from the
>> employer.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>>
>> On Mon, Mar 25, 2024 at 11:47 PM Bhuwan Sahni
>>  wrote:
>>
>>> Hi Pavan,
>>>
>>> I looked at the PR, and the changes look simple and contained. It would
>>> be useful to add dynamic resource allocation to Spark Structured Streaming.
>>>
>>> Jungtaek. Would you be able to shepherd this change?
>>>
>>>
>>> On Tue, Mar 19, 2024 at 10:38 AM Bhuwan Sahni <
>>> bhuwan.sa...@databricks.com> wrote:
>>>
 Thanks a lot for creating the risk table Pavan. My apologies. I was
 tied up with high priority items for the last couple weeks and could not
 respond. I will review the PR by tomorrow's end, and get back to you.

 Appreciate your patience.

 Thanks
 Bhuwan Sahni

 On Sun, Mar 17, 2024 at 4:42 PM Pavan Kotikalapudi <
 pkotikalap...@twilio.com> wrote:

> Hi Bhuwan,
>
> I hope the team got a chance to review the draft PR, looking for some
> comments to see if the plan looks alright?. I have updated the document
> about the risks
> 

Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-03-26 Thread Pavan Kotikalapudi
Hi Bhuwan,

Glad to hear back from you! Very much appreciate your help on reviewing the
design doc/PR and endorsing this proposal.

Thank you so much @Jungtaek Lim  , @Mich
Talebzadeh   for graciously agreeing to
mentor/shepherd this effort.

Regarding the Twilio copyright in the NOTICE-binary file:
Twilio's open source counsel was involved all through the process; I had
placed it in the project file prior to Twilio signing a CCLA for the Spark
project contribution (Aug '23).

Since the CCLA is signed now, I have removed the Twilio copyright from that
file. I didn't get a chance to update the PR after GitHub Actions closed it.

Please let me know the next steps needed to bring this draft PR/effort to
completion.

Thank you,

Pavan


On Tue, Mar 26, 2024 at 12:01 AM Jungtaek Lim 
wrote:

> I'm happy to, but it looks like I need to check one more thing about the
> license, according to the WIP PR
> 
> .
>
> @Pavan Kotikalapudi 
> I see you've added the copyright of Twilio in the NOTICE-binary file,
> which makes me wonder if Twilio had filed CCLA to the Apache Software
> Foundation.
>
> PMC members can correct me if I'm mistaken, but from my understanding (and
> experiences of PMC member in other ASF project), code contribution is
> considered as code donation and copyright belongs to ASF. That's why you
> can't find the copyright of employers for contributors in the codebase.
> What you see copyrights in NOTICE-binary is due to the fact we have binary
> dependency and their licenses may require to explicitly mention about
> copyright. It's not about direct code contribution.
>
> Is Twilio aware of this? Also, if Twilio did not file CCLA in prior, could
> you please engage with a relevant group in the company (could be a legal
> team, or similar with OSS advocate team if there is any) and ensure that
> CCLA is filed? The copyright issue is a legal issue, so we have to be
> conservative and 100% sure that the employer is aware of what is the
> meaning of donating the code to ASF via reviewing CCLA and relevant doc,
> and explicitly express that they are OK with it via filing CCLA.
>
> You can read the description of agreements on contribution and ICLA/CCLA
> form from this page.
> https://www.apache.org/licenses/contributor-agreements.html
> 
>
> Please let me know if this is resolved. This seems to me as a blocker to
> move on. Please also let me know if the contribution is withdrawn from the
> employer.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
>
> On Mon, Mar 25, 2024 at 11:47 PM Bhuwan Sahni
>  wrote:
>
>> Hi Pavan,
>>
>> I looked at the PR, and the changes look simple and contained. It would
>> be useful to add dynamic resource allocation to Spark Structured Streaming.
>>
>> Jungtaek. Would you be able to shepherd this change?
>>
>>
>> On Tue, Mar 19, 2024 at 10:38 AM Bhuwan Sahni <
>> bhuwan.sa...@databricks.com> wrote:
>>
>>> Thanks a lot for creating the risk table Pavan. My apologies. I was tied
>>> up with high priority items for the last couple weeks and could not
>>> respond. I will review the PR by tomorrow's end, and get back to you.
>>>
>>> Appreciate your patience.
>>>
>>> Thanks
>>> Bhuwan Sahni
>>>
>>> On Sun, Mar 17, 2024 at 4:42 PM Pavan Kotikalapudi <
>>> pkotikalap...@twilio.com> wrote:
>>>
 Hi Bhuwan,

 I hope the team got a chance to review the draft PR, looking for some
 comments to see if the plan looks alright?. I have updated the document
 about the risks
 .(also
 mentioned below). Please confirm if it looks alright?

 Spark application type                              | auto-scaling capability | with new auto-scaling capability
 Spark Batch job                                     | Works with current DRA  | No - change
 Streaming query without trigger interval            | No implementation       | Can work with this implementation (have to set certain scale back configs based on previous usage pattern) - maybe automate with future work?
 Spark Streaming query with Trigger interval         | No implementation       | With this implementation
 Spark Streaming query with one-time micro batch     | Works with current DRA  | No - change
 Spark Streaming query with AvailableNow micro batch | Works with current DRA  | No - change
 Batch + Streaming query (default / trigger-interval / once / availablenow modes), other notebook use cases | No implementation | No implementation

 

Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-03-26 Thread Jungtaek Lim
I'm happy to, but it looks like I need to check one more thing about the
license, according to the WIP PR
.

@Pavan Kotikalapudi 
I see you've added the copyright of Twilio in the NOTICE-binary file, which
makes me wonder if Twilio had filed CCLA to the Apache Software Foundation.

PMC members can correct me if I'm mistaken, but from my understanding (and
my experience as a PMC member in another ASF project), a code contribution is
considered a code donation and the copyright belongs to the ASF. That's why you
can't find the copyright of employers for contributors in the codebase.
The copyrights you see in NOTICE-binary are there because we have binary
dependencies whose licenses may require the copyright to be mentioned
explicitly. It's not about direct code contribution.

Is Twilio aware of this? Also, if Twilio did not file a CCLA previously, could
you please engage with a relevant group in the company (could be a legal
team, or an OSS advocate team if there is one) and ensure that a CCLA is
filed? The copyright issue is a legal issue, so we have to be
conservative and 100% sure that the employer understands what donating the
code to the ASF means, by reviewing the CCLA and relevant docs,
and explicitly expresses that they are OK with it by filing the CCLA.

You can read the description of agreements on contribution and ICLA/CCLA
form from this page.
https://www.apache.org/licenses/contributor-agreements.html

Please let me know if this is resolved. This seems to me to be a blocker to
moving on. Please also let me know if the contribution is withdrawn by the
employer.

Thanks,
Jungtaek Lim (HeartSaVioR)


On Mon, Mar 25, 2024 at 11:47 PM Bhuwan Sahni
 wrote:

> Hi Pavan,
>
> I looked at the PR, and the changes look simple and contained. It would be
> useful to add dynamic resource allocation to Spark Structured Streaming.
>
> Jungtaek. Would you be able to shepherd this change?
>
>
> On Tue, Mar 19, 2024 at 10:38 AM Bhuwan Sahni 
> wrote:
>
>> Thanks a lot for creating the risk table Pavan. My apologies. I was tied
>> up with high priority items for the last couple weeks and could not
>> respond. I will review the PR by tomorrow's end, and get back to you.
>>
>> Appreciate your patience.
>>
>> Thanks
>> Bhuwan Sahni
>>
>> On Sun, Mar 17, 2024 at 4:42 PM Pavan Kotikalapudi <
>> pkotikalap...@twilio.com> wrote:
>>
>>> Hi Bhuwan,
>>>
>>> I hope the team got a chance to review the draft PR, looking for some
>>> comments to see if the plan looks alright?. I have updated the document
>>> about the risks
>>> .(also
>>> mentioned below). Please confirm if it looks alright?
>>>
>>> Spark application type                              | auto-scaling capability | with new auto-scaling capability
>>> Spark Batch job                                     | Works with current DRA  | No - change
>>> Streaming query without trigger interval            | No implementation       | Can work with this implementation (have to set certain scale back configs based on previous usage pattern) - maybe automate with future work?
>>> Spark Streaming query with Trigger interval         | No implementation       | With this implementation
>>> Spark Streaming query with one-time micro batch     | Works with current DRA  | No - change
>>> Spark Streaming query with AvailableNow micro batch | Works with current DRA  | No - change
>>> Batch + Streaming query (default / trigger-interval / once / availablenow modes), other notebook use cases | No implementation | No implementation
>>>
>>>
>>>
>>> We are more than happy to collaborate on a call to make better progress
>>> on this enhancement. Please let us know.
>>>
>>> Thank you,
>>>
>>> Pavan
>>>
>>> On Fri, Mar 1, 2024 at 12:26 PM Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>

 Hi Bhuwan et al,

 Thank you for passing on the DataBricks Structured Streaming team's
 review of the SPIP document. FYI, I work closely with Pawan and other
 members to help deliver this piece of work. We appreciate your insights,
 especially regarding the cost savings potential from the PoC.

 Pavan already furnished you with some additional info. Your team's
 point about the SPIP currently addressing a specific use case (single
 streaming query with Processing Time trigger) is well-taken. We agree that
 maintaining simplicity is key, particularly as we explore more general
 resource allocation mechanisms in the future. To address the concerns and
 foster open discussion, The DataBricks team are invited to directly add
 their comments and suggestions to the Jira itself

 [SPARK-24815] Structured Streaming should support dynamic allocation -
 ASF JIRA (apache.org)
 

Re: Improved Structured Streaming Documentation Proof-of-Concept

2024-03-25 Thread Neil Ramaswamy
I'm glad you think it's generally a good idea!

I will mention, though, that with these better docs I've almost finished,
I'm hoping that Structured Streaming no longer stays a specialist topic
that requires "trench warfare." With good pedagogy, I think that it's very
approachable. The Knowledge Sharing Hub could be useful for e2e real-world
use-cases, but I think that operator semantics, stream configurations, etc.
have a better home in the official documentation.

Thanks for your engagement, Mich. Looking forward to hearing others'
opinions.

Neil

On Mon, Mar 25, 2024 at 2:50 PM Mich Talebzadeh 
wrote:

> Hi,
>
> Your intended work on improving the Structured Streaming documentation is
> great! Clear and well-organized instructions are important for everyone
> using Spark, beginners and experts alike.
> Having said that, Spark Structured Streaming much like other specialist
> topics with Spark say (k8s) or otherwise cannot be mastered by
> documentation alone. These topics require a considerable amount of practice
> and trench warfare so to speak to master them. Suffice to say that I agree
> with the proposals of making examples. However, it is an area that many try
> to master but fail( judging by typical issues brought up in the user group
> and otherwise). Perhaps using a section such as the proposed "Knowledge
> Sharing Hub'', may become more relevant. Moreover, the examples have to
> reflect real life scenarios and conversly will be of limited use otherwise.
>
> HTH
>
> Mich Talebzadeh,
> Technologist | Data | Generative AI | Financial Fraud
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  Von
> Braun )".
>
>
> On Mon, 25 Mar 2024 at 21:19, Neil Ramaswamy  wrote:
>
>> Hi all,
>>
>> I recently started an effort to improve the Structured Streaming
>> documentation. I thought that the current documentation, while very
>> comprehensive, could be improved in terms of organization, clarity, and
>> presence of examples.
>>
>> You can view the repo here
>> , and you can see
>> a preview of the site here .
>> It's almost at full parity with the programming guide, and it also has
>> additional content, like a guide on unit testing and an in-depth
>> explanation of watermarks. I think it's at a point where we can bring this
>> to completion if it's something that the community wants.
>>
>> I'd love to hear feedback from everyone: is this something that we would
>> want to move forward with? As it borrows certain parts from the programming
>> guide, it has an Apache License, so I'd be more than happy if it is adopted
>> by an official Spark repo.
>>
>> Best,
>> Neil
>>
>


Re: Improved Structured Streaming Documentation Proof-of-Concept

2024-03-25 Thread Mich Talebzadeh
Hi,

Your intended work on improving the Structured Streaming documentation is
great! Clear and well-organized instructions are important for everyone
using Spark, beginners and experts alike.
Having said that, Spark Structured Streaming much like other specialist
topics with Spark say (k8s) or otherwise cannot be mastered by
documentation alone. These topics require a considerable amount of practice
and trench warfare, so to speak, to master them. Suffice to say that I agree
with the proposal of adding examples. However, it is an area that many try
to master but fail (judging by typical issues brought up in the user group
and otherwise). Perhaps using a section such as the proposed "Knowledge
Sharing Hub" may become more relevant. Moreover, the examples have to
reflect real-life scenarios and will otherwise be of limited use.

HTH

Mich Talebzadeh,
Technologist | Data | Generative AI | Financial Fraud
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Mon, 25 Mar 2024 at 21:19, Neil Ramaswamy  wrote:

> Hi all,
>
> I recently started an effort to improve the Structured Streaming
> documentation. I thought that the current documentation, while very
> comprehensive, could be improved in terms of organization, clarity, and
> presence of examples.
>
> You can view the repo here
> , and you can see
> a preview of the site here .
> It's almost at full parity with the programming guide, and it also has
> additional content, like a guide on unit testing and an in-depth
> explanation of watermarks. I think it's at a point where we can bring this
> to completion if it's something that the community wants.
>
> I'd love to hear feedback from everyone: is this something that we would
> want to move forward with? As it borrows certain parts from the programming
> guide, it has an Apache License, so I'd be more than happy if it is adopted
> by an official Spark repo.
>
> Best,
> Neil
>


Re: Allowing Unicode Whitespace in Lexer

2024-03-25 Thread Alex Cruise
While we're at it, maybe consider allowing "smart quotes" too :)

-0xe1a

On Sat, Mar 23, 2024 at 5:29 PM serge rielau.com  wrote:

> Hello,
>
> I have a PR https://github.com/apache/spark/pull/45620  ready to go that
> will extend the definition of whitespace (what separates tokens) from the
> small set of ASCII characters (space, tab, linefeed) to those defined in
> Unicode.
> While this is a small and safe change, it is one where we would have a
> hard time changing our minds about later.
> It is also a change that, AFAIK, cannot be controlled under a config.
>
> What does the community think?
>
> Cheers
> Serge
> SQL Architect at Databricks
>
>


Improved Structured Streaming Documentation Proof-of-Concept

2024-03-25 Thread Neil Ramaswamy
Hi all,

I recently started an effort to improve the Structured Streaming
documentation. I thought that the current documentation, while very
comprehensive, could be improved in terms of organization, clarity, and
presence of examples.

You can view the repo here
, and you can see a
preview of the site here . It's
almost at full parity with the programming guide, and it also has
additional content, like a guide on unit testing and an in-depth
explanation of watermarks. I think it's at a point where we can bring this
to completion if it's something that the community wants.
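As a small taste of that content, here is a minimal, self-contained watermark example in Scala. It is illustrative only and not copied from the PoC repo; the rate source, column names, and intervals are my own assumptions.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object WatermarkSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("watermark-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical event stream; the built-in rate source stands in for Kafka etc.
    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "10")
      .load()
      .withColumnRenamed("timestamp", "eventTime")

    // Events arriving more than 10 minutes behind the max event time seen so far
    // are treated as too late, which bounds the windowed state the engine keeps.
    val counts = events
      .withWatermark("eventTime", "10 minutes")
      .groupBy(window($"eventTime", "5 minutes"))
      .count()

    counts.writeStream
      .outputMode("update")
      .format("console")
      .start()
      .awaitTermination()
  }
}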

I'd love to hear feedback from everyone: is this something that we would
want to move forward with? As it borrows certain parts from the programming
guide, it has an Apache License, so I'd be more than happy if it is adopted
by an official Spark repo.

Best,
Neil


Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-03-25 Thread Bhuwan Sahni
Hi Pavan,

I looked at the PR, and the changes look simple and contained. It would be
useful to add dynamic resource allocation to Spark Structured Streaming.

Jungtaek, would you be able to shepherd this change?


On Tue, Mar 19, 2024 at 10:38 AM Bhuwan Sahni 
wrote:

> Thanks a lot for creating the risk table Pavan. My apologies. I was tied
> up with high priority items for the last couple weeks and could not
> respond. I will review the PR by tomorrow's end, and get back to you.
>
> Appreciate your patience.
>
> Thanks
> Bhuwan Sahni
>
> On Sun, Mar 17, 2024 at 4:42 PM Pavan Kotikalapudi <
> pkotikalap...@twilio.com> wrote:
>
>> Hi Bhuwan,
>>
>> I hope the team got a chance to review the draft PR; I am looking for some
>> comments to see if the plan looks alright. I have updated the document
>> about the risks (also
>> mentioned below). Please confirm it looks alright.
>>
>> *Spark application type* | *auto-scaling capability* | *with New auto-scaling capability*
>> Spark Batch job | Works with current DRA | No - change
>> Streaming query without trigger interval | No implementation | Can work with this implementation (have to set certain scale-back configs based on previous usage pattern) - maybe automate with future work?
>> Spark Streaming query with trigger interval | No implementation | With this implementation
>> Spark Streaming query with one-time micro batch | Works with current DRA | No - change
>> Spark Streaming query with AvailableNow micro batch | Works with current DRA | No - change
>> Batch + Streaming query (default / trigger-interval / once / availablenow modes), other notebook use cases | No implementation | No implementation
>>
>>
>>
>> We are more than happy to collaborate on a call to make better progress
>> on this enhancement. Please let us know.
>>
>> Thank you,
>>
>> Pavan
>>
>> On Fri, Mar 1, 2024 at 12:26 PM Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>>
>>> Hi Bhuwan et al,
>>>
>>> Thank you for passing on the Databricks Structured Streaming team's
>>> review of the SPIP document. FYI, I work closely with Pavan and other
>>> members to help deliver this piece of work. We appreciate your insights,
>>> especially regarding the cost savings potential from the PoC.
>>>
>>> Pavan already furnished you with some additional info. Your team's point
>>> about the SPIP currently addressing a specific use case (single streaming
>>> query with Processing Time trigger) is well-taken. We agree that
>>> maintaining simplicity is key, particularly as we explore more general
>>> resource allocation mechanisms in the future. To address the concerns and
>>> foster open discussion, the Databricks team is invited to add their
>>> comments and suggestions directly to the Jira itself
>>>
>>> [SPARK-24815] Structured Streaming should support dynamic allocation -
>>> ASF JIRA (apache.org)
>>> 
>>> This will ensure everyone involved can benefit from your team's
>>> expertise and facilitate further collaboration.
>>>
>>> Thanks
>>>
>>> Mich Talebzadeh,
>>> Dad | Technologist | Solutions Architect | Engineer
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>> 
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>> 
>>>
>>>
>>>
>>> *Disclaimer:* The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed . It is essential to note
>>> that, as with any advice, quote "one test result is worth one-thousand
>>> expert opinions (Werner
>>> Von
>>> Braun
>>> 
>>> )".
>>>
>>>
>>> On Fri, 1 Mar 2024 at 19:59, Pavan Kotikalapudi
>>>  wrote:
>>>
 Thanks Bhuwan and the rest of the Databricks team for the reviews,

 I appreciate your reviews; they were very helpful in evaluating a few options
 that were overlooked earlier (especially about mixed Spark apps running on
 notebooks). 

Re: [DISCUSS] MySQL version support policy

2024-03-25 Thread Cheng Pan
Thanks Dongjoon’s reply and questions,

> A. Adding a new Apache Spark community policy (contract) to guarantee MySQL
> LTS Versions Support.

Yes, at least the latest MySQL LTS version. To reduce the maintenance efforts 
on the Spark side, I think we can only run CI with the latest LTS version but 
accept reasonable patches for compatibility with older LTS versions. For 
example, Spark on K8s is only verified with the latest minikube in CI, and also 
accepts reasonable patches for older K8s.

> B. Dropping the support of non-LTS version support (MySQL 8.3/8.2/8.1)

Those versions likely work well too. For example, Spark currently officially 
supports JDK 17 and 21, it likely works on JDK 20 too, but has not been 
verified by the community.

> 1. For (A), do you mean MySQL LTS versions are not supported by Apache Spark 
> releases properly due to the improper test suite?

Not yet. MySQL retains good backward compatibility so far; I see a lot of
users using MySQL 8.0 drivers to access both MySQL 5.7 and 8.0 servers through the Spark
JDBC data source, and everything has gone well so far.
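For illustration only, a minimal Scala sketch of that JDBC usage pattern; the host, database, table, and credentials below are placeholders, and a MySQL Connector/J 8.x jar is assumed to be on the classpath:

import org.apache.spark.sql.SparkSession

object MySqlJdbcReadSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("mysql-jdbc-sketch").getOrCreate()

    // Placeholder connection details for a hypothetical "sales" database.
    val orders = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://mysql-host:3306/sales")
      .option("dbtable", "orders")
      .option("user", "spark_reader")
      .option("password", sys.env.getOrElse("MYSQL_PASSWORD", ""))
      .option("driver", "com.mysql.cj.jdbc.Driver")
      .load()

    orders.show(5)
    spark.stop()
  }
}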

> 2. For (B), why does Apache Spark need to drop non-LTS MySQL support?

I think we can accept reasonable patches with careful review, but neither 
official support declaration nor CI verification is required, just like we do 
for JDK version support.

> 3. What about MariaDB? Do we need to stick to some versions?

I’m not familiar with MariaDB, but I would treat it as a MySQL-compatible
product, in the same position as Amazon RDS for MySQL: neither an official support
declaration nor CI verification is required, but considering the adoption rate
of those products, reasonable patches should be considered too.

Thanks,
Cheng Pan

On 2024/03/25 06:47:10 Dongjoon Hyun wrote:
> Hi, Cheng.
> 
> Thank you for the suggestion. Your suggestion seems to have at least two
> themes.
> 
> A. Adding a new Apache Spark community policy (contract) to guarantee MySQL
> LTS Versions Support.
> B. Dropping the support of non-LTS version support (MySQL 8.3/8.2/8.1)
> 
> And, it brings me three questions.
> 
> 1. For (A), do you mean MySQL LTS versions are not supported by Apache
> Spark releases properly due to the improper test suite?
> 2. For (B), why does Apache Spark need to drop non-LTS MySQL support?
> 3. What about MariaDB? Do we need to stick to some versions?
> 
> To be clear, if needed, we can have daily GitHub Action CIs easily like
> Python CI (Python 3.8/3.10/3.11/3.12).
> 
> -
> https://github.com/apache/spark/blob/master/.github/workflows/build_python.yml
> 
> Thanks,
> Dongjoon.
> 
> 
> On Sun, Mar 24, 2024 at 10:29 PM Cheng Pan  wrote:
> 
> > Hi, Spark community,
> >
> > I noticed that the Spark JDBC connector MySQL dialect is testing against
> > the 8.3.0[1] now, a non-LTS version.
> >
> > MySQL changed the version policy recently[2], which is now very similar to
> > the Java version policy. In short, 5.5, 5.6, 5.7, and 8.0 are LTS versions;
> > 8.1, 8.2, and 8.3 are non-LTS; and the next LTS version is 8.4.
> >
> > I would say that MySQL is one of the most important infrastructures today,
> > I checked the AWS RDS MySQL[4] and Azure Database for MySQL[5] version
> > support policy, and both only support 5.7 and 8.0.
> >
> > Also, Spark officially only supports LTS Java versions, like JDK 17 and
> > 21, but not 22. I would recommend using MySQL 8.0 for testing until the
> > next MySQL LTS version (8.4) is available.
> >
> > Additional discussion can be found at [3]
> >
> > [1] https://issues.apache.org/jira/browse/SPARK-47453
> > [2]
> > https://dev.mysql.com/blog-archive/introducing-mysql-innovation-and-long-term-support-lts-versions/
> > [3] https://github.com/apache/spark/pull/45581
> > [4] https://aws.amazon.com/rds/mysql/
> > [5] https://learn.microsoft.com/en-us/azure/mysql/concepts-version-policy
> >
> > Thanks,
> > Cheng Pan
> >
> >
> >
> > -
> > To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
> >
> >
>  



-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] MySQL version support policy

2024-03-25 Thread Dongjoon Hyun
Hi, Cheng.

Thank you for the suggestion. Your suggestion seems to have at least two
themes.

A. Adding a new Apache Spark community policy (contract) to guarantee MySQL
LTS Versions Support.
B. Dropping the support of non-LTS version support (MySQL 8.3/8.2/8.1)

And, it brings me three questions.

1. For (A), do you mean MySQL LTS versions are not supported by Apache
Spark releases properly due to the improper test suite?
2. For (B), why does Apache Spark need to drop non-LTS MySQL support?
3. What about MariaDB? Do we need to stick to some versions?

To be clear, if needed, we can have daily GitHub Action CIs easily like
Python CI (Python 3.8/3.10/3.11/3.12).

-
https://github.com/apache/spark/blob/master/.github/workflows/build_python.yml

Thanks,
Dongjoon.


On Sun, Mar 24, 2024 at 10:29 PM Cheng Pan  wrote:

> Hi, Spark community,
>
> I noticed that the Spark JDBC connector MySQL dialect is testing against
> the 8.3.0[1] now, a non-LTS version.
>
> MySQL changed the version policy recently[2], which is now very similar to
> the Java version policy. In short, 5.5, 5.6, 5.7, and 8.0 are LTS versions;
> 8.1, 8.2, and 8.3 are non-LTS; and the next LTS version is 8.4.
>
> I would say that MySQL is one of the most important infrastructures today,
> I checked the AWS RDS MySQL[4] and Azure Database for MySQL[5] version
> support policy, and both only support 5.7 and 8.0.
>
> Also, Spark officially only supports LTS Java versions, like JDK 17 and
> 21, but not 22. I would recommend using MySQL 8.0 for testing until the
> next MySQL LTS version (8.4) is available.
>
> Additional discussion can be found at [3]
>
> [1] https://issues.apache.org/jira/browse/SPARK-47453
> [2]
> https://dev.mysql.com/blog-archive/introducing-mysql-innovation-and-long-term-support-lts-versions/
> [3] https://github.com/apache/spark/pull/45581
> [4] https://aws.amazon.com/rds/mysql/
> [5] https://learn.microsoft.com/en-us/azure/mysql/concepts-version-policy
>
> Thanks,
> Cheng Pan
>
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


[DISCUSS] MySQL version support policy

2024-03-24 Thread Cheng Pan
Hi, Spark community,

I noticed that the Spark JDBC connector MySQL dialect is testing against the 
8.3.0[1] now, a non-LTS version.

MySQL changed the version policy recently[2], which is now very similar to the 
Java version policy. In short, 5.5, 5.6, 5.7, and 8.0 are LTS versions; 8.1, 8.2,
and 8.3 are non-LTS; and the next LTS version is 8.4.

I would say that MySQL is one of the most important infrastructures today, I 
checked the AWS RDS MySQL[4] and Azure Database for MySQL[5] version support 
policy, and both only support 5.7 and 8.0.

Also, Spark officially only supports LTS Java versions, like JDK 17 and 21, but 
not 22. I would recommend using MySQL 8.0 for testing until the next MySQL LTS 
version (8.4) is available.

Additional discussion can be found at [3]

[1] https://issues.apache.org/jira/browse/SPARK-47453
[2] 
https://dev.mysql.com/blog-archive/introducing-mysql-innovation-and-long-term-support-lts-versions/
[3] https://github.com/apache/spark/pull/45581
[4] https://aws.amazon.com/rds/mysql/
[5] https://learn.microsoft.com/en-us/azure/mysql/concepts-version-policy

Thanks,
Cheng Pan



-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Allowing Unicode Whitespace in Lexer

2024-03-23 Thread serge rielau . com
Hello,

I have a PR https://github.com/apache/spark/pull/45620  ready to go that will 
extend the definition of whitespace (what separates tokens) from the small set
of ASCII characters (space, tab, linefeed) to those defined in Unicode.
While this is a small and safe change, it is one where we would have a hard 
time changing our minds about later.
It is also a change that, AFAIK, cannot be controlled under a config.
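To make the scope concrete, here is a tiny illustrative Scala snippet, not taken from the PR, contrasting the current ASCII-only separator set with a Unicode-aware check:

object WhitespaceSketch {
  // Roughly the separator set the lexer recognizes today (illustrative only).
  private val asciiWhitespace = Set(' ', '\t', '\n')

  def main(args: Array[String]): Unit = {
    val ideographicSpace = '\u3000' // a Unicode space character common in CJK text
    println(asciiWhitespace.contains(ideographicSpace))  // false: not a token separator today
    println(Character.isWhitespace(ideographicSpace))    // true: would separate tokens under a Unicode-aware rule
  }
}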

What does the community think?

Cheers
Serge
SQL Architect at Databricks



Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-23 Thread Jay Han
+1. It sounds awesome!

Kiran Kumar Dusi  于2024年3月21日周四 14:16写道:

> +1
>
> On Thu, 21 Mar 2024 at 7:46 AM, Farshid Ashouri <
> farsheed.asho...@gmail.com> wrote:
>
>> +1
>>
>> On Mon, 18 Mar 2024, 11:00 Mich Talebzadeh, 
>> wrote:
>>
>>> Some of you may be aware that Databricks community Home | Databricks
>>> have just launched a knowledge sharing hub. I thought it would be a
>>> good idea for the Apache Spark user group to have the same, especially
>>> for repeat questions on Spark core, Spark SQL, Spark Structured
>>> Streaming, Spark MLlib and so forth.
>>>
>>> Apache Spark user and dev groups have been around for a good while.
>>> They are serving their purpose. We went through creating a slack
>>> community that managed to create more heat than light. This is
>>> what Databricks community came up with and I quote
>>>
>>> "Knowledge Sharing Hub
>>> Dive into a collaborative space where members like YOU can exchange
>>> knowledge, tips, and best practices. Join the conversation today and
>>> unlock a wealth of collective wisdom to enhance your experience and
>>> drive success."
>>>
>>> I don't know the logistics of setting it up, but I am sure that should
>>> not be that difficult. If anyone is supportive of this proposal, let
>>> the usual +1, 0, -1 decide
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Dad | Technologist | Solutions Architect | Engineer
>>> London
>>> United Kingdom
>>>
>>>
>>>view my Linkedin profile
>>>
>>>
>>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>>
>>>
>>>
>>> Disclaimer: The information provided is correct to the best of my
>>> knowledge but of course cannot be guaranteed . It is essential to note
>>> that, as with any advice, quote "one test result is worth one-thousand
>>> expert opinions (Werner Von Braun)".
>>>
>>> -
>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>
>>>


Unsubscribe

2024-03-22 Thread Dusty Williams
Unsubscribe


unsubscribe

2024-03-22 Thread madhan kumar



Re: [DISCUSSION] SPIP: An Official Kubernetes Operator for Apache Spark

2024-03-20 Thread Vakaris Baškirov
Hi!
Just wanted to inquire about the status of the official operator. We are
looking forward to contributing and later on switching to a Spark Operator
and we would prefer it to be the official one.

Thanks,
Vakaris

On Thu, Nov 30, 2023 at 7:09 AM Shiqi Sun  wrote:

> Hi Zhou,
>
> Thanks for the reply. For the language choice, since I don't think I've
> used many k8s components written in Java on k8s, I can't really tell, but
> at least for the components written in Golang, they are well-organized,
> easy to read/maintain and run well in general. In addition, goroutines
> really ease things a lot when writing concurrency code. Golang also has a
> lot less boilerplates, no complicated inheritance and easier dependency
> management and linting toolings. Together with all these points, that's why
> I prefer Golang for this k8s operator. I understand the Spark maintainers
> are more familiar with JVM languages, but I think we should consider the
> performance and maintainability vs the learning curve, to choose an option
> that can win in the long run. Plus, I believe most of the Spark maintainers
> who touch k8s related parts in the Spark project already have experiences
> with Golang, so it shouldn't be a big problem. Our team had some experience
> with the fabric8 client a couple years ago, and we've experienced some
> issues with its reliability, mainly about the request dropping issue (i.e.
> code call is made but the apiserver never receives the request), but that
> was a while ago and I'm not sure whether everything is good with the client
> now. Anyway, this is my opinion about the language choice, and I will let
> other people comment about it as well.
>
> For compatibility, yes please make the CRD compatible from the user's
> standpoint, so that it's easy for people to adopt the new operator. The
> goal is to consolidate the many spark operators on the market to this new
> official operator, so an easy adoption experience is the key.
>
> Also, I feel that the discussion is pretty high level, and it's because
> the only info revealed for this new operator is the SPIP doc and I haven't
> got a chance to see the code yet. I understand the new operator project
> might still not be open-sourced yet, but is there any way for me to take an
> early peek into the code of your operator, so that we can discuss more
> specifically about the points of language choice and compatibility? Thank
> you so much!
>
> Best,
> Shiqi
>
> On Tue, Nov 28, 2023 at 10:42 AM Zhou Jiang 
> wrote:
>
>> Hi Shiqi,
>>
>> Thanks for the cross-posting here - sorry for the response delay during
>> the holiday break :)
>> We prefer Java for the operator project as it's JVM-based and widely
>> familiar within the Spark community. This choice aims to facilitate better
>> adoption and ease of onboarding for future maintainers. In addition, the
>> Java API client can also be considered a mature option widely used by
>> Spark itself and by other operator implementations like Flink.
>> For easier onboarding and potential migration, we'll consider
>> compatibility with existing CRD designs - the goal is to maintain
>> compatibility as best as possible while minimizing duplication efforts.
>> I'm enthusiastic about the idea of lean, version agnostic submission
>> worker. It aligns with one of the primary goals in the operator design.
>> Let's continue exploring this idea further in design doc.
>>
>> Thanks,
>> Zhou
>>
>>
>> On Wed, Nov 22, 2023 at 3:35 PM Shiqi Sun  wrote:
>>
>>> Hi all,
>>>
>>> Sorry for being late to the party. I went through the SPIP doc and I
>>> think this is a great proposal! I left a comment in the SPIP doc a couple
>>> days ago, but I don't see much activity there and no one replied, so I
>>> wanted to cross-post it here to get some feedback.
>>>
>>> I'm Shiqi Sun, and I work for Big Data Platform in Salesforce. My team
>>> has been running the Spark on k8s operator
>>>  (OSS
>>> from Google) in my company to serve Spark users in production for 4+ years,
>>> and we've been actively contributing to the Spark on k8s operator OSS and
>>> also, occasionally, the Spark OSS. According to our experience, Google's
>>> Spark Operator has its own problems, like its close coupling with the spark
>>> version, as well as the JVM overhead during job submission. However on the
>>> other side, it's been a great component in our team's service in the
>>> company, especially being written in golang, it's really easy to have it
>>> interact with k8s, and also its CRD covers a lot of different use cases, as
>>> it has been built up through time thanks to many users' contribution during
>>> these years. There were also a handful of Spark Summit sessions on Google's
>>> Spark Operator that made it widely adopted.
>>>
>>> For this SPIP, I really love the idea of this proposal for the official
>>> k8s operator of Spark project, as well as the separate layer of the
>>> 

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-19 Thread Mich Talebzadeh
I concur. Whilst Databricks' (a commercial entity) Knowledge Sharing Hub
can be a useful resource for sharing knowledge and engaging with their
respective community, ASF likely prioritizes platforms and channels that
align more closely with its principles of open source and vendor
neutrality.

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Tue, 19 Mar 2024 at 21:14, Steve Loughran 
wrote:

>
> ASF will be unhappy about this, and Stack Overflow exists. Otherwise:
> Apache Confluent and LinkedIn exist; LI is the option I'd point at
>
> On Mon, 18 Mar 2024 at 10:59, Mich Talebzadeh 
> wrote:
>
>> Some of you may be aware that Databricks community Home | Databricks
>> have just launched a knowledge sharing hub. I thought it would be a
>> good idea for the Apache Spark user group to have the same, especially
>> for repeat questions on Spark core, Spark SQL, Spark Structured
>> Streaming, Spark MLlib and so forth.
>>
>> Apache Spark user and dev groups have been around for a good while.
>> They are serving their purpose. We went through creating a slack
>> community that managed to create more heat than light. This is
>> what Databricks community came up with and I quote
>>
>> "Knowledge Sharing Hub
>> Dive into a collaborative space where members like YOU can exchange
>> knowledge, tips, and best practices. Join the conversation today and
>> unlock a wealth of collective wisdom to enhance your experience and
>> drive success."
>>
>> I don't know the logistics of setting it up, but I am sure that should
>> not be that difficult. If anyone is supportive of this proposal, let
>> the usual +1, 0, -1 decide
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> Disclaimer: The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner Von Braun)".
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-19 Thread Steve Loughran
ASF will be unhappy about this, and Stack Overflow exists. Otherwise:
Apache Confluent and LinkedIn exist; LI is the option I'd point at

On Mon, 18 Mar 2024 at 10:59, Mich Talebzadeh 
wrote:

> Some of you may be aware that Databricks community Home | Databricks
> have just launched a knowledge sharing hub. I thought it would be a
> good idea for the Apache Spark user group to have the same, especially
> for repeat questions on Spark core, Spark SQL, Spark Structured
> Streaming, Spark MLlib and so forth.
>
> Apache Spark user and dev groups have been around for a good while.
> They are serving their purpose. We went through creating a slack
> community that managed to create more heat than light. This is
> what Databricks community came up with and I quote
>
> "Knowledge Sharing Hub
> Dive into a collaborative space where members like YOU can exchange
> knowledge, tips, and best practices. Join the conversation today and
> unlock a wealth of collective wisdom to enhance your experience and
> drive success."
>
> I don't know the logistics of setting it up, but I am sure that should
> not be that difficult. If anyone is supportive of this proposal, let
> the usual +1, 0, -1 decide
>
> HTH
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> Disclaimer: The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: Vote on Dynamic resource allocation for structured streaming [SPARK-24815]

2024-03-19 Thread Bhuwan Sahni
Thanks a lot for creating the risk table Pavan. My apologies. I was tied up
with high priority items for the last couple weeks and could not respond. I
will review the PR by tomorrow's end, and get back to you.

Appreciate your patience.

Thanks
Bhuwan Sahni

On Sun, Mar 17, 2024 at 4:42 PM Pavan Kotikalapudi 
wrote:

> Hi Bhuwan,
>
> I hope the team got a chance to review the draft PR; I am looking for some
> comments to see if the plan looks alright. I have updated the document
> about the risks (also
> mentioned below). Please confirm it looks alright.
>
> *Spark application type* | *auto-scaling capability* | *with New auto-scaling capability*
> Spark Batch job | Works with current DRA | No - change
> Streaming query without trigger interval | No implementation | Can work with this implementation (have to set certain scale-back configs based on previous usage pattern) - maybe automate with future work?
> Spark Streaming query with trigger interval | No implementation | With this implementation
> Spark Streaming query with one-time micro batch | Works with current DRA | No - change
> Spark Streaming query with AvailableNow micro batch | Works with current DRA | No - change
> Batch + Streaming query (default / trigger-interval / once / availablenow modes), other notebook use cases | No implementation | No implementation
>
>
>
> We are more than happy to collaborate on a call to make better progress
> on this enhancement. Please let us know.
>
> Thank you,
>
> Pavan
>
> On Fri, Mar 1, 2024 at 12:26 PM Mich Talebzadeh 
> wrote:
>
>>
>> Hi Bhuwan et al,
>>
>> Thank you for passing on the Databricks Structured Streaming team's
>> review of the SPIP document. FYI, I work closely with Pavan and other
>> members to help deliver this piece of work. We appreciate your insights,
>> especially regarding the cost savings potential from the PoC.
>>
>> Pavan already furnished you with some additional info. Your team's point
>> about the SPIP currently addressing a specific use case (single streaming
>> query with Processing Time trigger) is well-taken. We agree that
>> maintaining simplicity is key, particularly as we explore more general
>> resource allocation mechanisms in the future. To address the concerns and
>> foster open discussion, the Databricks team is invited to add their
>> comments and suggestions directly to the Jira itself
>>
>> [SPARK-24815] Structured Streaming should support dynamic allocation -
>> ASF JIRA (apache.org)
>> 
>> This will ensure everyone involved can benefit from your team's expertise
>> and facilitate further collaboration.
>>
>> Thanks
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>> 
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> Von
>> Braun
>> 
>> )".
>>
>>
>> On Fri, 1 Mar 2024 at 19:59, Pavan Kotikalapudi
>>  wrote:
>>
>>> Thanks Bhuwan and the rest of the Databricks team for the reviews,
>>>
>>> I appreciate your reviews; they were very helpful in evaluating a few options
>>> that were overlooked earlier (especially about mixed Spark apps running on
>>> notebooks). Regarding the use cases, it could handle multiple streaming
>>> queries provided that they run on the same trigger-interval processing
>>> time (very similar to how the current batch DRA is set up), but I felt it
>>> would be beneficial to separate out streaming queries when setting up
>>> production pipelines.
>>>
>>> Regarding the implementation, here is the draft PR
>>> https://github.com/apache/spark/pull/42352
>>> 
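For readers following SPARK-24815, here is a minimal Scala sketch of the kind of job discussed above: today's dynamic-allocation settings applied to a Structured Streaming query with a processing-time trigger. The configuration values and the rate source are illustrative assumptions only; the trigger-aware scale-up and scale-back behaviour is what the draft PR proposes and is not shown here.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object StreamingDraSketch {
  def main(args: Array[String]): Unit = {
    // Existing dynamic-allocation settings; values here are illustrative.
    val spark = SparkSession.builder()
      .appName("streaming-dra-sketch")
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
      .config("spark.dynamicAllocation.minExecutors", "2")
      .config("spark.dynamicAllocation.maxExecutors", "20")
      .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
      .getOrCreate()

    // Hypothetical source; a production job would typically read from Kafka.
    val events = spark.readStream
      .format("rate")
      .option("rowsPerSecond", "100")
      .load()

    events.writeStream
      .format("console")
      .trigger(Trigger.ProcessingTime("1 minute")) // the trigger interval the SPIP wants DRA to respect
      .start()
      .awaitTermination()
  }
}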

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-19 Thread Mich Talebzadeh
One option that comes to my mind is that, given the cyclic nature of these
types of proposals in these two forums, we should be able to use
Databricks's existing knowledge sharing hub Knowledge Sharing Hub -
Databricks

as well.

The majority of topics will be of interest to their audience as well. In
addition, they seem to invite everyone to contribute. Unless you have an
overriding concern why we should not take this approach, I can enquire from
Databricks community managers whether they can entertain this idea. They
seem to have a well defined structure for hosting topics.

Let me know your thoughts

Thanks

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Tue, 19 Mar 2024 at 08:25, Joris Billen 
wrote:

> +1
>
>
> On 18 Mar 2024, at 21:53, Mich Talebzadeh 
> wrote:
>
> Well as long as it works.
>
> Please all check this link from Databricks and let us know your thoughts.
> Will something similar work for us? Of course Databricks has much deeper
> pockets than our ASF community. Will it require moderation on our side to
> block spam and nutcases?
>
> Knowledge Sharing Hub - Databricks
> 
>
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  Von
> Braun )".
>
>
> On Mon, 18 Mar 2024 at 20:31, Bjørn Jørgensen 
> wrote:
>
>> something like this  Spark community · GitHub
>> 
>>
>>
>> man. 18. mars 2024 kl. 17:26 skrev Parsian, Mahmoud
>> :
>>
>>> Good idea. Will be useful
>>>
>>>
>>>
>>> +1
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> *From: *ashok34...@yahoo.com.INVALID 
>>> *Date: *Monday, March 18, 2024 at 6:36 AM
>>> *To: *user @spark , Spark dev list <
>>> dev@spark.apache.org>, Mich Talebzadeh 
>>> *Cc: *Matei Zaharia 
>>> *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for
>>> Apache Spark Community
>>>
>>> External message, be mindful when clicking links or attachments
>>>
>>>
>>>
>>> Good idea. Will be useful
>>>
>>>
>>>
>>> +1
>>>
>>>
>>>
>>> On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>
>>>
>>>
>>>
>>> Some of you may be aware that Databricks community Home | Databricks
>>>
>>> have just launched a knowledge sharing hub. I thought it would be a
>>>
>>> good idea for the Apache Spark user group to have the same, especially
>>>
>>> for repeat questions on Spark core, Spark SQL, Spark Structured
>>>
>>> Streaming, Spark MLlib and so forth.
>>>
>>>
>>>
>>> Apache Spark user and dev groups have been around for a good while.
>>>
>>> They are serving their purpose. We went through creating a slack
>>>
>>> community that managed to create more heat than light. This is
>>>
>>> what Databricks community came up with and I quote
>>>
>>>
>>>
>>> "Knowledge Sharing Hub
>>>
>>> Dive into a collaborative space where members like YOU can exchange
>>>
>>> knowledge, tips, and best practices. Join the conversation today and
>>>
>>> unlock a wealth of collective wisdom to enhance your experience and
>>>
>>> drive success."
>>>
>>>
>>>
>>> I don't know the logistics of setting it up, but I am sure that should
>>>
>>> not be that difficult. If anyone is supportive of this proposal, let
>>>
>>> the usual +1, 0, -1 decide
>>>
>>>
>>>
>>> HTH
>>>
>>>
>>>
>>> Mich Talebzadeh,
>>>
>>> Dad | Technologist | Solutions Architect | Engineer
>>>
>>> London
>>>
>>> United Kingdom
>>>
>>>
>>>
>>>
>>>
>>>   view my Linkedin profile
>>>
>>>
>>>
>>>
>>>
>>> https://en.everybodywiki.com/Mich_Talebzadeh
>>> 
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> 

Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-19 Thread Varun Shah
+1  Great initiative.

QQ: Stack Overflow has a similar feature called "Collectives", but I am
not sure of the expense of creating one for Apache Spark. With SO being used
(at least before ChatGPT became quite the norm for searching questions), it
already has a lot of questions asked and answered by the community over a
period of time; hence, if possible, we could leverage it as the starting
point for building a community before creating a completely new website from
scratch. Any thoughts on this?

Regards,
Varun Shah


On Mon, Mar 18, 2024, 16:29 Mich Talebzadeh 
wrote:

> Some of you may be aware that Databricks community Home | Databricks
> have just launched a knowledge sharing hub. I thought it would be a
> good idea for the Apache Spark user group to have the same, especially
> for repeat questions on Spark core, Spark SQL, Spark Structured
> Streaming, Spark MLlib and so forth.
>
> Apache Spark user and dev groups have been around for a good while.
> They are serving their purpose. We went through creating a slack
> community that managed to create more heat than light. This is
> what Databricks community came up with and I quote
>
> "Knowledge Sharing Hub
> Dive into a collaborative space where members like YOU can exchange
> knowledge, tips, and best practices. Join the conversation today and
> unlock a wealth of collective wisdom to enhance your experience and
> drive success."
>
> I don't know the logistics of setting it up, but I am sure that should
> not be that difficult. If anyone is supportive of this proposal, let
> the usual +1, 0, -1 decide
>
> HTH
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> Disclaimer: The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner Von Braun)".
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Deepak Sharma
+1 .
I can contribute to it as well .

On Tue, 19 Mar 2024 at 9:19 AM, Code Tutelage 
wrote:

> +1
>
> Thanks for proposing
>
> On Mon, Mar 18, 2024 at 9:25 AM Parsian, Mahmoud
>  wrote:
>
>> Good idea. Will be useful
>>
>>
>>
>> +1
>>
>>
>>
>>
>>
>>
>>
>> *From: *ashok34...@yahoo.com.INVALID 
>> *Date: *Monday, March 18, 2024 at 6:36 AM
>> *To: *user @spark , Spark dev list <
>> dev@spark.apache.org>, Mich Talebzadeh 
>> *Cc: *Matei Zaharia 
>> *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for
>> Apache Spark Community
>>
>> External message, be mindful when clicking links or attachments
>>
>>
>>
>> Good idea. Will be useful
>>
>>
>>
>> +1
>>
>>
>>
>> On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>
>>
>>
>>
>> Some of you may be aware that Databricks community Home | Databricks
>>
>> have just launched a knowledge sharing hub. I thought it would be a
>>
>> good idea for the Apache Spark user group to have the same, especially
>>
>> for repeat questions on Spark core, Spark SQL, Spark Structured
>>
>> Streaming, Spark MLlib and so forth.
>>
>>
>>
>> Apache Spark user and dev groups have been around for a good while.
>>
>> They are serving their purpose. We went through creating a slack
>>
>> community that managed to create more heat than light. This is
>>
>> what Databricks community came up with and I quote
>>
>>
>>
>> "Knowledge Sharing Hub
>>
>> Dive into a collaborative space where members like YOU can exchange
>>
>> knowledge, tips, and best practices. Join the conversation today and
>>
>> unlock a wealth of collective wisdom to enhance your experience and
>>
>> drive success."
>>
>>
>>
>> I don't know the logistics of setting it up, but I am sure that should
>>
>> not be that difficult. If anyone is supportive of this proposal, let
>>
>> the usual +1, 0, -1 decide
>>
>>
>>
>> HTH
>>
>>
>>
>> Mich Talebzadeh,
>>
>> Dad | Technologist | Solutions Architect | Engineer
>>
>> London
>>
>> United Kingdom
>>
>>
>>
>>
>>
>>   view my Linkedin profile
>>
>>
>>
>>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>> 
>>
>>
>>
>>
>>
>>
>>
>> Disclaimer: The information provided is correct to the best of my
>>
>> knowledge but of course cannot be guaranteed . It is essential to note
>>
>> that, as with any advice, quote "one test result is worth one-thousand
>>
>> expert opinions (Werner Von Braun)".
>>
>>
>>
>> -
>>
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>>
>


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Hyukjin Kwon
One very good example is SparkR releases in the Conda channel (
https://github.com/conda-forge/r-sparkr-feedstock).
This is fully run by the community unofficially.

On Tue, 19 Mar 2024 at 09:54, Mich Talebzadeh 
wrote:

> +1 for me
>
> Mich Talebzadeh,
> Dad | Technologist | Solutions Architect | Engineer
> London
> United Kingdom
>
>
>view my Linkedin profile
> 
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* The information provided is correct to the best of my
> knowledge but of course cannot be guaranteed . It is essential to note
> that, as with any advice, quote "one test result is worth one-thousand
> expert opinions (Werner  Von
> Braun )".
>
>
> On Mon, 18 Mar 2024 at 16:23, Parsian, Mahmoud 
> wrote:
>
>> Good idea. Will be useful
>>
>>
>>
>> +1
>>
>>
>>
>>
>>
>>
>>
>> *From: *ashok34...@yahoo.com.INVALID 
>> *Date: *Monday, March 18, 2024 at 6:36 AM
>> *To: *user @spark , Spark dev list <
>> dev@spark.apache.org>, Mich Talebzadeh 
>> *Cc: *Matei Zaharia 
>> *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for
>> Apache Spark Community
>>
>> External message, be mindful when clicking links or attachments
>>
>>
>>
>> Good idea. Will be useful
>>
>>
>>
>> +1
>>
>>
>>
>> On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>
>>
>>
>>
>> Some of you may be aware that Databricks community Home | Databricks
>>
>> have just launched a knowledge sharing hub. I thought it would be a
>>
>> good idea for the Apache Spark user group to have the same, especially
>>
>> for repeat questions on Spark core, Spark SQL, Spark Structured
>>
>> Streaming, Spark MLlib and so forth.
>>
>>
>>
>> Apache Spark user and dev groups have been around for a good while.
>>
>> They are serving their purpose. We went through creating a slack
>>
>> community that managed to create more heat than light. This is
>>
>> what Databricks community came up with and I quote
>>
>>
>>
>> "Knowledge Sharing Hub
>>
>> Dive into a collaborative space where members like YOU can exchange
>>
>> knowledge, tips, and best practices. Join the conversation today and
>>
>> unlock a wealth of collective wisdom to enhance your experience and
>>
>> drive success."
>>
>>
>>
>> I don't know the logistics of setting it up, but I am sure that should
>>
>> not be that difficult. If anyone is supportive of this proposal, let
>>
>> the usual +1, 0, -1 decide
>>
>>
>>
>> HTH
>>
>>
>>
>> Mich Talebzadeh,
>>
>> Dad | Technologist | Solutions Architect | Engineer
>>
>> London
>>
>> United Kingdom
>>
>>
>>
>>
>>
>>   view my Linkedin profile
>>
>>
>>
>>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>> 
>>
>>
>>
>>
>>
>>
>>
>> Disclaimer: The information provided is correct to the best of my
>>
>> knowledge but of course cannot be guaranteed . It is essential to note
>>
>> that, as with any advice, quote "one test result is worth one-thousand
>>
>> expert opinions (Werner Von Braun)".
>>
>>
>>
>> -
>>
>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>
>>
>>
>


Re: A proposal for creating a Knowledge Sharing Hub for Apache Spark Community

2024-03-18 Thread Mich Talebzadeh
OK thanks for the update.

What does officially blessed signify here? Can we have and run it as a
sister site? The reason this comes to my mind is that the interested
parties should have easy access to this site (from ISUG Spark sites) as a
reference repository. I guess the advice would be that the information
(topics) is provided on a best-effort basis and cannot be guaranteed.

Mich Talebzadeh,
Dad | Technologist | Solutions Architect | Engineer
London
United Kingdom


   view my Linkedin profile



 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed . It is essential to note
that, as with any advice, quote "one test result is worth one-thousand
expert opinions (Werner  Von
Braun )".


On Mon, 18 Mar 2024 at 21:04, Reynold Xin  wrote:

> One of the problems in the past when something like this was brought up was
> that the ASF couldn't have officially blessed venues beyond the already
> approved ones. So that's something to look into.
>
> Now of course you are welcome to run unofficial things unblessed as long
> as they follow trademark rules.
>
>
>
> On Mon, Mar 18, 2024 at 1:53 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> Well as long as it works.
>>
>> Please all check this link from Databricks and let us know your thoughts.
>> Will something similar work for us? Of course Databricks has much deeper
>> pockets than our ASF community. Will it require moderation on our side to
>> block spam and nutcases?
>>
>> Knowledge Sharing Hub - Databricks
>> 
>>
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> Von Braun
>> )".
>>
>>
>> On Mon, 18 Mar 2024 at 20:31, Bjørn Jørgensen 
>> wrote:
>>
>>> something like this  Spark community · GitHub
>>> 
>>>
>>>
>>> man. 18. mars 2024 kl. 17:26 skrev Parsian, Mahmoud <
>>> mpars...@illumina.com.invalid>:
>>>
 Good idea. Will be useful



 +1







 *From: *ashok34...@yahoo.com.INVALID 
 *Date: *Monday, March 18, 2024 at 6:36 AM
 *To: *user @spark , Spark dev list <
 dev@spark.apache.org>, Mich Talebzadeh 
 *Cc: *Matei Zaharia 
 *Subject: *Re: A proposal for creating a Knowledge Sharing Hub for
 Apache Spark Community

 External message, be mindful when clicking links or attachments



 Good idea. Will be useful



 +1



 On Monday, 18 March 2024 at 11:00:40 GMT, Mich Talebzadeh <
 mich.talebza...@gmail.com> wrote:





 Some of you may be aware that Databricks community Home | Databricks

 have just launched a knowledge sharing hub. I thought it would be a

 good idea for the Apache Spark user group to have the same, especially

 for repeat questions on Spark core, Spark SQL, Spark Structured

 Streaming, Spark MLlib and so forth.



 Apache Spark user and dev groups have been around for a good while.

 They are serving their purpose. We went through creating a slack

 community that managed to create more heat than light. This is

 what Databricks community came up with and I quote



 "Knowledge Sharing Hub

 Dive into a collaborative space where members like YOU can exchange

 knowledge, tips, and best practices. Join the conversation today and

 unlock a wealth of collective wisdom to enhance your experience and

 drive success."



 I don't know the logistics of setting it up, but I am sure that should

 not be that difficult. If anyone is supportive of this proposal, let

 the usual +1, 0, -1 decide



 HTH



 Mich Talebzadeh,

 Dad | Technologist | Solutions Architect | Engineer

 London

 United Kingdom





   view my Linkedin profile





 https://en.everybodywiki.com/Mich_Talebzadeh
 
