Re: [DISCUSS] Add SQL functions into Scala, Python and R API

2023-05-31 Thread Hyukjin Kwon
Thanks all. I created a JIRA at
https://issues.apache.org/jira/browse/SPARK-43907.


Re: [DISCUSS] Add SQL functions into Scala, Python and R API

2023-05-28 Thread Hyukjin Kwon
Yes, some of them were cases like you mentioned.
But I found myself explaining that reasoning to a lot of people, not only
developers but also users - I have been asked at conferences, over email, on
Slack, both internally and externally.
Then I realised that maybe we're doing something wrong. This is based on my
experience, so I wanted to open a discussion and see what others think about
this :-).


Re: [DISCUSS] Add SQL functions into Scala, Python and R API

2023-05-26 Thread Maciej
Weren't some of these functions provided only for compatibility and
intentionally left out of the language APIs?


--
Best regards,
Maciej


Re: [DISCUSS] Add SQL functions into Scala, Python and R API

2023-05-25 Thread Hyukjin Kwon
I don't think it'd be a release blocker... I think we can implement them
across multiple releases.


Re: [DISCUSS] Add SQL functions into Scala, Python and R API

2023-05-25 Thread Dongjoon Hyun
Thank you for the proposal.

I'm wondering if we are going to consider them as release blockers or not.

In general, I don't think making those SQL functions available in all
languages should be a release blocker
(especially in R, or in new Spark Connect languages like Go and Rust).

If they are not release blockers, we can accept existing or future
community PRs, but only before the feature freeze (= branch cut).

Thanks,
Dongjoon.


Re: [DISCUSS] Add SQL functions into Scala, Python and R API

2023-05-24 Thread Jia Fan
+1
It is important that the same function can be called from all the different APIs.


Re: [DISCUSS] Add SQL functions into Scala, Python and R API

2023-05-24 Thread Ryan Berti
During my recent experience developing functions, I found that identifying the
locations (SQL, the Connect functions.scala and functions.py, FunctionRegistry,
plus whatever is required for R) and the standards for adding function
signatures was not straightforward (should you use optional args or overload
functions? Which col/lit helpers should be used when?). Are there docs
describing all of the locations and standards for defining a function? If not,
that'd be great to have too.
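
For illustration, here is a minimal Scala sketch of the overload-versus-optional-args
question. The percentile wrappers are hypothetical and are routed through call_udf
only so the sketch compiles; this is not how functions.scala actually builds them:

    import org.apache.spark.sql.Column
    import org.apache.spark.sql.functions.call_udf

    // Hypothetical wrappers for a built-in function. A real entry would build
    // the Catalyst expression directly and also register the SQL name in
    // FunctionRegistry; call_udf is used here only to keep the sketch compilable.
    object WrapperStyles {
      // Style 1: overloads -- stays Java-friendly and binary compatible.
      def percentile(e: Column, percentage: Column): Column =
        call_udf("percentile", e, percentage)
      def percentile(e: Column, percentage: Column, frequency: Column): Column =
        call_udf("percentile", e, percentage, frequency)

      // Style 2: a default argument -- shorter, but it cannot coexist with the
      // overloads above and it complicates Java interop:
      //   def percentile(e: Column, percentage: Column,
      //                  frequency: Column = lit(1L)): Column = ...
    }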

Ryan Berti

Senior Data Engineer  |  Ads DE


Re: [DISCUSS] Add SQL functions into Scala, Python and R API

2023-05-24 Thread Enrico Minack

+1

Functions available in SQL (or, more generally, in one API) should be available
in all APIs. I am very much in favor of this.


Enrico


[DISCUSS] Add SQL functions into Scala, Python and R API

2023-05-24 Thread Hyukjin Kwon
Hi all,

I would like to discuss adding all SQL functions into the Scala, Python and R
APIs.
We have around 175 SQL functions that do not exist in Scala, Python and R.
For example, we don't have pyspark.sql.functions.percentile, but you can
invoke it as a SQL function, e.g., SELECT percentile(...).

The reason why we do not have all the functions in the first place is that we
wanted to add only commonly used functions; see also
https://github.com/apache/spark/pull/21318 (which I agreed with at the time).

However, this has been raised multiple times over the years, by the OSS
community, on the dev mailing list, in JIRAs, on Stack Overflow, etc.
It seems confusing which functions are available and which are not.

Yes, we have a workaround: we can invoke any expression via expr("...") or
call_udf("...", Columns ...).
But that is still not very user-friendly, because users expect these
functions to be available under the functions namespace.
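
To make the workaround concrete, here is a minimal Scala sketch contrasting it
with what this proposal would enable; the native percentile signature in the
comment is hypothetical:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.expr

    object PercentileWorkaround {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[*]").appName("demo").getOrCreate()
        import spark.implicits._

        val df = Seq(1, 2, 3, 4, 5).toDF("value")

        // Works today: percentile is registered as a SQL function, so a
        // string routed through the SQL parser can reach it.
        df.select(expr("percentile(value, 0.5)")).show()

        // What the proposal would enable (hypothetical native signature):
        //   df.select(percentile($"value", lit(0.5))).show()

        spark.stop()
      }
    }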

Therefore, I would like to propose adding all these expressions to all
languages so that Spark is simpler and less confusing, leaving no question
about which API is in functions and which is not.

Any thoughts?