[ https://issues.apache.org/jira/browse/SPARK-33310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17224074#comment-17224074 ]
Apache Spark commented on SPARK-33310:
--------------------------------------

User 'dhimmel' has created a pull request for this issue:
https://github.com/apache/spark/pull/30209

> Relax pyspark typing for sql str functions
> ------------------------------------------
>
>                 Key: SPARK-33310
>                 URL: https://issues.apache.org/jira/browse/SPARK-33310
>             Project: Spark
>          Issue Type: Wish
>          Components: PySpark
>    Affects Versions: 3.1.0
>            Reporter: Daniel Himmelstein
>            Priority: Minor
>              Labels: pyspark.sql.functions, type
>             Fix For: 3.1.0
>
> Several pyspark.sql.functions have overly strict typing, in that the type
> is more restrictive than the functionality: each of these functions allows
> the column to operate on to be specified as either a pyspark.sql.Column or
> a str. This is handled internally by
> [_to_java_column|https://github.com/apache/spark/blob/491a0fb08b0c57a99894a0b33c5814854db8de3d/python/pyspark/sql/column.py#L39-L50],
> which accepts a Column or a string (a simplified sketch of this helper
> appears at the end of this message).
>
> There is a pre-existing type alias for this:
> [ColumnOrName|https://github.com/apache/spark/blob/72ad9dcd5d484a8dd64c08889de85ef9de2a6077/python/pyspark/sql/_typing.pyi#L37].
> ColumnOrName is used in the type definitions of many pyspark.sql.functions
> arguments, but [not
> for|https://github.com/apache/spark/blob/72ad9dcd5d484a8dd64c08889de85ef9de2a6077/python/pyspark/sql/functions.pyi#L158-L162]
> locate, lpad, rpad, repeat, and split:
>
> {code:python}
> def locate(substr: str, str: Column, pos: int = ...) -> Column: ...
> def lpad(col: Column, len: int, pad: str) -> Column: ...
> def rpad(col: Column, len: int, pad: str) -> Column: ...
> def repeat(col: Column, n: int) -> Column: ...
> def split(str: Column, pattern: str, limit: int = ...) -> Column: ...{code}
>
> ColumnOrName was not used here by [~zero323], since Maciej "was concerned
> that this might be confusing or ambiguous": these functions take a column
> to operate on as well as strings that are used in the operation. But I
> think ColumnOrName makes it clear that the variable refers to the column,
> not to a string parameter. There are also other ways to address any
> confusion, such as the docstring, or renaming the column argument from str
> to col.
>
> Finally, it is considerably more convenient for users not to have to wrap
> column names in pyspark.sql.functions.col. Elsewhere the API seems pretty
> consistent in its willingness to accept a column by name rather than as a
> Column object, at least when a string value has no alternative meaning
> (.when/.otherwise would be an exception). For example, we were calling
> pyspark.sql.functions.split with a string value for the str argument
> (specifying which column to split), and I noticed the mismatch when we
> enforced typing with pyspark-stubs in preparation for pyspark 3.1 (see the
> usage sketch at the end of this message).
>
> Pre-existing PRs to address this:
>  * https://github.com/apache/spark/pull/30209
>  * https://github.com/zero323/pyspark-stubs/pull/420
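>
> As referenced above, a simplified sketch of the _to_java_column helper;
> the details here are from memory and may differ from the linked column.py,
> but the shape is what matters: a Column and a column name are both
> accepted, and everything else raises.
>
> {code:python}
> def _to_java_column(col):
>     # A Column already wraps the underlying JVM column object; a string
>     # is treated as a column name and resolved to a JVM column.
>     if isinstance(col, Column):
>         jcol = col._jc
>     elif isinstance(col, str):
>         jcol = _create_column_from_name(col)
>     else:
>         raise TypeError("Invalid argument, not a string or column: %r" % (col,))
>     return jcol
> {code}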
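>
> For reference, the relaxation amounts to roughly the following (a sketch;
> see the PRs above for the exact stub changes):
>
> {code:python}
> def locate(substr: str, str: ColumnOrName, pos: int = ...) -> Column: ...
> def lpad(col: ColumnOrName, len: int, pad: str) -> Column: ...
> def rpad(col: ColumnOrName, len: int, pad: str) -> Column: ...
> def repeat(col: ColumnOrName, n: int) -> Column: ...
> def split(str: ColumnOrName, pattern: str, limit: int = ...) -> Column: ...{code}
>
> Note that only the parameter naming the column is widened; pad and pattern
> remain plain str because they really are string values, not columns.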
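>
> And the usage sketch: both spellings below behave identically at runtime,
> but only the second satisfies the current stubs. The toy DataFrame and
> column name are invented for the example.
>
> {code:python}
> from pyspark.sql import SparkSession
> import pyspark.sql.functions as F
>
> spark = SparkSession.builder.getOrCreate()
> df = spark.createDataFrame([("a,b,c",)], ["csv"])
>
> # Passing the column by name: rejected by the current stubs,
> # accepted at runtime via _to_java_column.
> df.select(F.split("csv", ",")).show()
>
> # Passing a Column object: accepted by both the stubs and the runtime.
> df.select(F.split(F.col("csv"), ",")).show()
> {code}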