[ https://issues.apache.org/jira/browse/SPARK-26979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Andre Sa de Mello updated SPARK-26979:
--------------------------------------
    Component/s:     (was: SQL)
                     PySpark

> [PySpark] Some SQL functions do not take column names
> -----------------------------------------------------
>
>                 Key: SPARK-26979
>                 URL: https://issues.apache.org/jira/browse/SPARK-26979
>             Project: Spark
>          Issue Type: Improvement
>          Components: PySpark
>    Affects Versions: 2.4.0
>            Reporter: Andre Sa de Mello
>            Priority: Minor
>              Labels: easyfix, pull-request-available, usability
>   Original Estimate: 0h
>  Remaining Estimate: 0h
>
> Most SQL functions defined in _org.apache.spark.sql.functions_ have two
> variations: one taking a Column object as input, and another taking a string
> representing a column name, which is then converted into a Column object
> internally.
> There are, however, a few notable exceptions:
> * lower()
> * upper()
> * abs()
> * bitwiseNOT()
> While this doesn't break anything, since you can easily create a Column
> object yourself before passing it to one of these functions, it has two
> undesirable consequences:
> # It is surprising: it breaks coders' expectations when they are first
> starting with Spark. Every API should be as consistent as possible, so as to
> make the learning curve smoother and to reduce causes for human error;
> # It gets in the way of stylistic conventions. Most of the time it makes
> Python/Scala/Java code more readable to use literal names, and the API
> provides ample support for that, but these few exceptions prevent this
> pattern from being universally applicable.
> This is a very easy fix, and I see no reason not to apply it. I have a PR
> ready.
> *UPDATE:* It turns out there are many exceptions to this pattern that I
> wasn't aware of. The reason I missed them is that I had been looking at
> things from PySpark's point of view, and the API there does support column
> name literals for almost all SQL functions.
> Exceptions for the PySpark API include all the above plus:
> * ltrim()
> * rtrim()
> * trim()
> * ascii()
> * initcap()
> * base64()
> * unbase64()
> The argument for making the API consistent still stands, however. I have
> been working on a PR to fix this on PySpark's side, and it should still be
> a painless change.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org