Re: [CONNECT] New Clients for Go and Rust

2023-05-25 Thread Martin Grund
Thanks everyone for your feedback! I will work on figuring out what it
takes to get started with a repo for the Go client.

On Thu 25. May 2023 at 21:51 Chao Sun  wrote:

> +1 on separate repo too
>
> On Thu, May 25, 2023 at 12:43 PM Dongjoon Hyun 
> wrote:
> >
> > +1 for starting on a separate repo.
> >
> > Dongjoon.
> >
> > On Thu, May 25, 2023 at 9:53 AM yangjie01  wrote:
> >>
> >> +1 on starting this with a separate repo.
> >>
> >> Which new clients can be placed in the main repo should be discussed
> after they are mature enough.
> >>
> >>
> >>
> >> Yang Jie
> >>
> >>
> >>
> >> From: Denny Lee 
> >> Date: Wednesday, May 24, 2023 21:31
> >> To: Hyukjin Kwon 
> >> Cc: Maciej , "dev@spark.apache.org" <dev@spark.apache.org>
> >> Subject: Re: [CONNECT] New Clients for Go and Rust
> >>
> >>
> >>
> >> +1 on a separate repo, allowing different APIs to move at different
> speeds and ensuring they get community support.
> >>
> >>
> >>
> >> On Wed, May 24, 2023 at 00:37 Hyukjin Kwon 
> wrote:
> >>
> >> I think we can just start this with a separate repo.
> >> I am fine with the second option too, but in that case we would have to
> triage which languages to add into the main repo.
> >>
> >>
> >>
> >> On Fri, 19 May 2023 at 22:28, Maciej  wrote:
> >>
> >> Hi,
> >>
> >>
> >>
> >> Personally, I'm strongly against the second option and have some
> preference towards the third one (or maybe a mix of the first one and the
> third one).
> >>
> >>
> >>
> >> The project is already pretty large as-is and, with an extremely
> conservative approach towards removal of APIs, it only tends to grow over
> time. Making it even larger is not going to make things more maintainable
> and is likely to create an entry barrier for new contributors (that's
> similar to Jia's arguments).
> >>
> >>
> >>
> >> Moreover, we've seen quite a few different language clients over the
> years, and only one or two survived, while none is particularly active, as
> far as I'm aware. Taking responsibility for more clients, without being
> sure that we have the resources to maintain them and that there is enough
> community around them to make the effort worthwhile, doesn't seem like a
> good idea.
> >>
> >>
> >>
> >> --
> >>
> >> Best regards,
> >>
> >> Maciej Szymkiewicz
> >>
> >>
> >>
> >> Web: https://zero323.net
> >>
> >> PGP: A30CEF0C31A501EC
> >>
> >>
> >>
> >>
> >>
> >> On 5/19/23 14:57, Jia Fan wrote:
> >>
> >> Hi,
> >>
> >>
> >>
> >> Thanks for the contribution!
> >>
> >> I prefer (1). There are a few reasons:
> >>
> >>
> >>
> >> 1. Separate repositories can maintain independent versions, different
> release cadences, and faster bug-fix releases.
> >>
> >>
> >>
> >> 2. Different languages have different build tools. Putting them in one
> repository will make the main repository more and more complicated, and it
> will become extremely difficult to perform a complete build in the main
> repository.
> >>
> >>
> >>
> >> 3. Separate repositories make CI configuration and execution easier,
> and the PR and commit lists will be clearer.
> >>
> >>
> >>
> >> 4. Other projects also govern clients in separate repositories;
> ClickHouse, for example, uses separate repositories for its JDBC, ODBC,
> and C++ clients. Please refer to:
> >>
> >> https://github.com/ClickHouse/clickhouse-java
> >>
> >> https://github.com/ClickHouse/clickhouse-odbc
> >>
> >> https://github.com/ClickHouse/clickhouse-cpp
> >>
> >>
> >>
> >> PS: I'm looking forward to the JavaScript Connect client!
> >>
> >>
> >>
> >> Thanks and regards,
> >>
> >> Jia Fan
> >>
> >>
> >>
> >> Martin Grund wrote on Friday, May 19, 2023 at 20:03:
> >>
> >> Hi folks,
> >>
> >>
> >>
> >> When Bo (thanks for the time and contribution) started the work on
> https://github.com/apache/spark/pull/41036, he started the Go client
> directly in the Spark repository. In the meantime, I was approached by
> other engineers who are willing to contribute to working on a Rust client
> for Spark Connect.
> >>
> >>
> >>
> >> Now one of the key questions is where these connectors should live and
> how we manage expectations most effectively.
> >>
> >>
> >>
> >> At a high level, there are three approaches:
> >>
> >>
> >>
> >> (1) "3rd party" (non-JVM / Python) clients should live in separate
> repositories owned and governed by the Apache Spark community.
> >>
> >>
> >>
> >> (2) All clients should live in the main Apache Spark repository in the
> `connector/connect/client` directory.
> >>
> >>
> >>
> >> (3) Spark Connect clients other than the native ones (Python, JVM)
> should not be part of the Apache Spark repository and its governance rules.
> >>
> >>
> >>
> >> Before we iron out exactly how we mark these clients as experimental
> and how we align their release process etc. with Spark, my suggestion
> would be to get consensus on this first question.
> >>
> >>
> >>
> >> Personally, I'm fine with (1) and (2) with a preference for (2).
> >>
> >>
> >>
> >> Would love to get feedback from other members of the community!
> >>
> >>
> >>
> >> Thanks
> >>
> >> Martin

Re: [DISCUSS] Add SQL functions into Scala, Python and R API

2023-05-25 Thread Hyukjin Kwon
I don't think it'd be a release blocker. I think we can implement them
across multiple releases.

On Fri, May 26, 2023 at 1:01 AM Dongjoon Hyun 
wrote:

> Thank you for the proposal.
>
> I'm wondering if we are going to consider them as release blockers or not.
>
> In general, I don't think those SQL functions should be available in all
> languages as release blockers.
> (Especially in R or new Spark Connect languages like Go and Rust).
>
> If they are not release blockers, we may allow some existing or future
> community PRs only before feature freeze (= branch cut).
>
> Thanks,
> Dongjoon.
>
>
> On Wed, May 24, 2023 at 7:09 PM Jia Fan  wrote:
>
>> +1
>> It is important that different APIs can be used to call the same function.
>>
>> Ryan Berti wrote on Thursday, May 25, 2023 at 01:48:
>>
>>> During my recent experience developing functions, I found that
>>> identifying the locations (sql + connect functions.scala + functions.py,
>>> FunctionRegistry, + whatever is required for R) and the standards for
>>> adding function signatures were not straightforward (should you use
>>> optional args or overload functions? which col/lit helpers should be used
>>> when?). Are there docs describing all of the locations + standards for
>>> defining a function? If not, that'd be great to have too.
>>>
>>> Ryan Berti
>>>
>>> Senior Data Engineer  |  Ads DE
>>>
>>> M 7023217573
>>>
>>> 5808 W Sunset Blvd  |  Los Angeles, CA 90028
>>> 
>>>
>>>
>>>
>>> On Wed, May 24, 2023 at 12:44 AM Enrico Minack 
>>> wrote:
>>>
 +1

 Functions available in SQL (more generally, in one API) should be
 available in all APIs. I am very much in favor of this.

 Enrico


 On 24.05.23 at 09:41, Hyukjin Kwon wrote:

 Hi all,

 I would like to discuss adding all SQL functions into Scala, Python and
 R API.
 We have around 175 SQL functions that do not exist in Scala, Python and R.
 For example, we don’t have pyspark.sql.functions.percentile but you
 can invoke
 it as a SQL function, e.g., SELECT percentile(...).
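
 (Illustrative sketch, not from the original email: it assumes a live Spark
 session, and that SHOW FUNCTIONS returns a single column named "function".)

     # Enumerate SQL functions that have no counterpart in pyspark.sql.functions.
     from pyspark.sql import SparkSession
     from pyspark.sql import functions as F

     spark = SparkSession.builder.getOrCreate()
     sql_funcs = {row.function for row in spark.sql("SHOW FUNCTIONS").collect()}
     py_funcs = {name for name in dir(F) if not name.startswith("_")}
     missing = sorted(sql_funcs - py_funcs)
     # Rough count; it also picks up operators and aliases, so it is only
     # in the ballpark of the "around 175" mentioned above.
     print(len(missing), missing[:10])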

 The reason why we did not add all functions in the first place is that we
 wanted to add only commonly used functions; see also
 https://github.com/apache/spark/pull/21318 (which I agreed with at the time).

 However, this has been raised multiple times over the years, from the OSS
 community, the dev mailing list, JIRAs, Stack Overflow, etc.
 It seems confusing which functions are available and which are not.

 Yes, we have a workaround: we can call any such expression by expr("...")
 or call_udf("...", Columns ...).
 But it still seems not very user-friendly, because users expect these
 functions to be available under the functions namespace.
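
 (Illustrative sketch of the workaround, not from the original email; the
 DataFrame, view, and column names are made up.)

     from pyspark.sql import SparkSession
     from pyspark.sql import functions as F

     spark = SparkSession.builder.getOrCreate()
     df = spark.createDataFrame([(1.0,), (2.0,), (3.0,)], ["value"])

     # Works today: invoke percentile through SQL ...
     df.createOrReplaceTempView("tbl")
     spark.sql("SELECT percentile(value, 0.5) AS median FROM tbl").show()

     # ... or reach the same expression from the DataFrame API via expr().
     df.agg(F.expr("percentile(value, 0.5)").alias("median")).show()

     # What does not exist at the time of this thread: F.percentile(...).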

 Therefore, I would like to propose adding all expressions into all
 languages so that Spark is simpler and there is less confusion about which
 API is available in functions and which is not.

 Any thoughts?





Re: [CONNECT] New Clients for Go and Rust

2023-05-25 Thread Chao Sun
+1 on separate repo too





Re: [CONNECT] New Clients for Go and Rust

2023-05-25 Thread Dongjoon Hyun
+1 for starting on a separate repo.

Dongjoon.



Re: [CONNECT] New Clients for Go and Rust

2023-05-25 Thread yangjie01
+1 on starting this with a separate repo.
Which new clients can be placed in the main repo should be discussed after
they are mature enough.

Yang Jie



Re: [DISCUSS] Add SQL functions into Scala, Python and R API

2023-05-25 Thread Dongjoon Hyun
Thank you for the proposal.

I'm wondering if we are going to consider them as release blockers or not.

In general, I don't think those SQL functions should be available in all
languages as release blockers.
(Especially in R or new Spark Connect languages like Go and Rust).

If they are not release blockers, we may allow some existing or future
community PRs only before feature freeze (= branch cut).

Thanks,
Dongjoon.

