[ https://issues.apache.org/jira/browse/SPARK-26979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andre Sa de Mello updated SPARK-26979:
--------------------------------------
    Description: 
Most SQL functions defined in _org.apache.spark.sql.functions_ come in two 
variants: one takes a Column object as input, and the other takes a string 
with a column name, which is converted into a Column object internally.

There are, however, a few notable exceptions:
 * lower()
 * upper()
 * abs()
 * bitwiseNOT()

While this doesn't break anything, since you can easily create a Column object 
yourself before passing it to one of these functions (see the sketch after 
this list), it has two undesirable consequences:
 # It is surprising - it breaks coders' expectations when they are first 
starting with Spark. An API should be as consistent as possible, to make the 
learning curve smoother and to reduce opportunities for human error;
 # It gets in the way of stylistic conventions. Most of the time, 
Python/Scala/Java code reads better with column name literals, and the API 
provides ample support for that, but these few exceptions prevent the pattern 
from being universally applicable.
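
For illustration, here is a minimal PySpark sketch of the current behavior and 
the workaround. The SparkSession setup, DataFrame, and column names are 
hypothetical, and avg() merely stands in for the many functions that already 
accept a column name:
{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, lower

spark = SparkSession.builder.appName("column-name-example").getOrCreate()
df = spark.createDataFrame([("Alice", 1), ("Bob", 2)], ["name", "value"])

# Most functions accept either a Column or a plain column name:
df.select(avg("value")).show()
df.select(avg(col("value"))).show()

# lower() is one of the exceptions: in affected versions you have to build
# the Column object yourself before passing it in.
df.select(lower(col("name"))).show()
# df.select(lower("name"))   # fails in affected versions - no string variant
{code}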

This is a very easy fix, and I see no reason not to apply it. I have a PR ready.

*UPDATE:* It turns out there are many more exceptions to this pattern than I 
was aware of. I missed them because I had been looking at things from 
PySpark's point of view, and the API there does support column name literals 
for almost all SQL functions.

Exceptions for the PySpark API include all the above plus:
 * ltrim()
 * rtrim()
 * trim()
 * ascii()
 * base64()
 * unbase64()

The argument for making the API consistent still stands, however. I have been 
working on a PR to fix this on *PySpark's side*, and it should still be a 
painless change. The sketch below shows the general shape of the fix.
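
As a rough, hypothetical sketch of what that change amounts to (this is not 
the actual PR - the helper name and placement are made up for illustration), 
the idea is simply to normalize the argument to a Column before forwarding it 
to the underlying JVM function:
{code:python}
from pyspark.sql import Column
from pyspark.sql.functions import col

def _to_col(c):
    # Hypothetical helper: accept either a Column or a column-name string
    # and normalize it to a Column, mirroring what the string-accepting
    # functions already do.
    return c if isinstance(c, Column) else col(c)

# A Column-only function such as lower() could then wrap its argument with
# _to_col(...) so that both lower("name") and lower(col("name")) work.
{code}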


> [PySpark] Some SQL functions do not take column names
> -----------------------------------------------------
>
>                 Key: SPARK-26979
>                 URL: https://issues.apache.org/jira/browse/SPARK-26979
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.4.0
>            Reporter: Andre Sa de Mello
>            Priority: Minor
>              Labels: easyfix, pull-request-available, usability
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>


