[ 
https://issues.apache.org/jira/browse/SPARK-33310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Daniel Himmelstein updated SPARK-33310:
---------------------------------------
    Description: 
Several pyspark.sql.functions have overly strict type annotations: the declared 
type is more restrictive than the actual functionality. Specifically, these 
functions let you specify the column to operate on as either a 
pyspark.sql.Column or a str. The argument is handled internally by 
[_to_java_column|https://github.com/apache/spark/blob/491a0fb08b0c57a99894a0b33c5814854db8de3d/python/pyspark/sql/column.py#L39-L50],
 which accepts a Column or a string.
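
For context, the conversion is roughly the following dispatch; this is a 
simplified sketch of the linked code, not the exact implementation:
{code:python}
# Simplified sketch of _to_java_column's dispatch; see the linked source for
# the real code. _create_column_from_name is the helper used there to resolve
# a column name through the JVM.
from pyspark.sql.column import Column, _create_column_from_name

def to_java_column_sketch(col):
    if isinstance(col, Column):
        return col._jc                        # already wraps a Java column
    if isinstance(col, str):
        return _create_column_from_name(col)  # look the column up by name
    raise TypeError("Invalid argument, not a string or column: %r" % (col,))
{code}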

There is a pre-existing type alias for this: 
[ColumnOrName|https://github.com/apache/spark/blob/72ad9dcd5d484a8dd64c08889de85ef9de2a6077/python/pyspark/sql/_typing.pyi#L37]
 (Union[Column, str]). ColumnOrName is used in many of the argument annotations 
in pyspark.sql.functions, but [not 
for|https://github.com/apache/spark/blob/72ad9dcd5d484a8dd64c08889de85ef9de2a6077/python/pyspark/sql/functions.pyi#L158-L162]
 locate, lpad, rpad, repeat, and split:
{code:python}
def locate(substr: str, str: Column, pos: int = ...) -> Column: ...
def lpad(col: Column, len: int, pad: str) -> Column: ...
def rpad(col: Column, len: int, pad: str) -> Column: ...
def repeat(col: Column, n: int) -> Column: ...
def split(str: Column, pattern: str, limit: int = ...) -> Column: ...{code}
ColumnOrName was not added by [~zero323], since Maciej "was concerned that this 
might be confusing or ambiguous": these functions take a column to operate on 
as well as str arguments that are used in the operation.

But I think ColumnOrName makes it clear that this argument refers to the column 
and not to a string parameter. There are also other ways to address any 
confusion, such as clarifying the docstring or renaming the column argument 
from str to col, as sketched below.
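
For illustration, a sketch of what the relaxed stubs could look like, keeping 
the existing parameter names (whether to also rename str to col is a separate 
question):
{code:python}
def locate(substr: str, str: ColumnOrName, pos: int = ...) -> Column: ...
def lpad(col: ColumnOrName, len: int, pad: str) -> Column: ...
def rpad(col: ColumnOrName, len: int, pad: str) -> Column: ...
def repeat(col: ColumnOrName, n: int) -> Column: ...
def split(str: ColumnOrName, pattern: str, limit: int = ...) -> Column: ...{code}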

Finally, there is considerable convenience for users in not having to wrap 
column names in pyspark.sql.functions.col. Elsewhere the API is fairly 
consistent in its willingness to accept a column by name as well as by Column 
object, at least where a string value has no alternative meaning (the exception 
being .when/.otherwise, where a bare string is treated as a literal value; see 
the sketch below).
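
To illustrate that exception (column names here are made up): in 
.when/.otherwise a bare string is interpreted as a literal value rather than a 
column name, so a str in that position already has another meaning.
{code:python}
from pyspark.sql import functions as F

# "unknown" is a literal string value, not a reference to a column:
F.when(F.col("status").isNull(), "unknown").otherwise(F.col("status"))

# to fill in from another column, a Column object must be passed explicitly:
F.when(F.col("status").isNull(), F.col("fallback")).otherwise(F.col("status"))
{code}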

For example, we were calling pyspark.sql.functions.split with a string value 
for the str argument (specifying which column to split), and I only noticed the 
mismatch when we enforced typing with pyspark-stubs in preparation for pyspark 
3.1. For users who enable type checking in 3.1, this is a restriction in 
functionality.
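
Concretely (assuming a DataFrame df with a string column named "sentence"; 
names are made up for illustration), both calls behave identically at runtime, 
but the current stub for split only accepts the second:
{code:python}
from pyspark.sql import functions as F

df.select(F.split("sentence", " "))          # rejected by the strict annotation
df.select(F.split(F.col("sentence"), " "))   # accepted; same runtime behavior
{code}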

Pre-existing PRs to address this:
 * [https://github.com/apache/spark/pull/30209]
 * [https://github.com/zero323/pyspark-stubs/pull/420]

> Relax pyspark typing for sql str functions
> ------------------------------------------
>
>                 Key: SPARK-33310
>                 URL: https://issues.apache.org/jira/browse/SPARK-33310
>             Project: Spark
>          Issue Type: Wish
>          Components: PySpark
>    Affects Versions: 3.1.0
>            Reporter: Daniel Himmelstein
>            Priority: Minor
>              Labels: pyspark.sql.functions, type
>             Fix For: 3.1.0
>
>



