[ 
https://issues.apache.org/jira/browse/SPARK-33310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224074#comment-17224074
 ] 

Apache Spark commented on SPARK-33310:
--------------------------------------

User 'dhimmel' has created a pull request for this issue:
https://github.com/apache/spark/pull/30209

> Relax pyspark typing for sql str functions
> ------------------------------------------
>
>                 Key: SPARK-33310
>                 URL: https://issues.apache.org/jira/browse/SPARK-33310
>             Project: Spark
>          Issue Type: Wish
>          Components: PySpark
>    Affects Versions: 3.1.0
>            Reporter: Daniel Himmelstein
>            Priority: Minor
>              Labels: pyspark.sql.functions, type
>             Fix For: 3.1.0
>
>
> Several pyspark.sql.functions have overly strict typing, in that the declared 
> type is more restrictive than the functionality. Specifically, these functions 
> accept either a pyspark.sql.Column or a str for the column to operate on. This 
> is handled internally by 
> [_to_java_column|https://github.com/apache/spark/blob/491a0fb08b0c57a99894a0b33c5814854db8de3d/python/pyspark/sql/column.py#L39-L50],
>  which accepts a Column or string.
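A minimal illustration (not the actual pyspark source) of the Column-or-str dispatch that _to_java_column performs; the Column class and to_column function here are simplified stand-ins:

```python
from typing import Union


class Column:
    """Hypothetical stand-in for pyspark.sql.Column (illustration only)."""

    def __init__(self, name: str) -> None:
        self.name = name


def to_column(col: Union[Column, str]) -> Column:
    """Accept a Column or a column name, mirroring _to_java_column's behavior."""
    if isinstance(col, str):
        # A bare string is interpreted as a column name and wrapped.
        return Column(col)
    return col
```

Because the runtime already dispatches on both types, a Union-style annotation simply documents existing behavior.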
> There is a pre-existing type for this: 
> [ColumnOrName|https://github.com/apache/spark/blob/72ad9dcd5d484a8dd64c08889de85ef9de2a6077/python/pyspark/sql/_typing.pyi#L37].
>  ColumnOrName is used for many of the type definitions of 
> pyspark.sql.functions arguments, but [not 
> for|https://github.com/apache/spark/blob/72ad9dcd5d484a8dd64c08889de85ef9de2a6077/python/pyspark/sql/functions.pyi#L158-L162]
>  locate, lpad, rpad, repeat, and split.
> {code:python}
> def locate(substr: str, str: Column, pos: int = ...) -> Column: ...
> def lpad(col: Column, len: int, pad: str) -> Column: ...
> def rpad(col: Column, len: int, pad: str) -> Column: ...
> def repeat(col: Column, n: int) -> Column: ...
> def split(str: Column, pattern: str, limit: int = ...) -> Column: ...{code}
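A hedged sketch (not the actual patch) of what the relaxed stubs might look like with the ColumnOrName alias applied; Column here is a stand-in class rather than pyspark's, and the bodies are elided stub-style:

```python
from typing import Union


class Column:
    """Hypothetical stand-in for pyspark.sql.Column (illustration only)."""


# The alias already defined in pyspark/sql/_typing.pyi:
ColumnOrName = Union[Column, str]


# Relaxed signatures for the five functions (stub-style, bodies elided):
def locate(substr: str, str: ColumnOrName, pos: int = ...) -> Column: ...
def lpad(col: ColumnOrName, len: int, pad: str) -> Column: ...
def rpad(col: ColumnOrName, len: int, pad: str) -> Column: ...
def repeat(col: ColumnOrName, n: int) -> Column: ...
def split(str: ColumnOrName, pattern: str, limit: int = ...) -> Column: ...
```

Note that only the column-selecting parameter is widened; string parameters that carry data (pad, pattern, substr) stay plain str, which is what makes the two roles distinguishable.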
> ColumnOrName was not applied to these functions by [~zero323] since Maciej 
> "was concerned that this might be confusing or ambiguous", because these 
> functions take a column to operate on as well as strings that are used in the 
> operation.
> But I think ColumnOrName makes clear that this variable refers to the column 
> and not to a string parameter. Also, there are other ways to address 
> confusion, such as via the docstring or by renaming the column argument from 
> str to col.
> Finally, it is considerably more convenient for users not to have to wrap 
> column names in pyspark.sql.functions.col. Elsewhere the API seems pretty 
> consistent in its willingness to accept a column by name as well as by Column 
> object (at least when a string value has no alternative meaning; 
> .when/.otherwise is an exception).
> For example, we were calling pyspark.sql.functions.split with a string value 
> for the str argument (specifying which column to split). I noticed this when 
> we enforced typing with pyspark-stubs in preparation for pyspark 3.1.
> Pre-existing PRs to address this:
>  * https://github.com/apache/spark/pull/30209
>  * https://github.com/zero323/pyspark-stubs/pull/420



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
