This is an automated email from the ASF dual-hosted git repository.

ruifengz pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
     new 36da27fae56a [SPARK-46765][PYTHON][CONNECT] Make `shuffle` specify the datatype of `seed`
36da27fae56a is described below

commit 36da27fae56aae4132ee1a2d646d0e4249e7d7e3
Author: Ruifeng Zheng <ruife...@apache.org>
AuthorDate: Fri Jan 19 14:11:44 2024 +0800

[SPARK-46765][PYTHON][CONNECT] Make `shuffle` specify the datatype of `seed`

### What changes were proposed in this pull request?
Make `shuffle` specify the datatype of `seed`.

### Why are the changes needed?
The `shuffle` function may fail with an extremely low probability (~2e-10): `shuffle` requires a Long-typed `seed` as an unregistered function, and this Long value is extracted in the Planner. In the Scala client, `SparkClassUtils.random.nextLong` guarantees the type; in Python, however, `lit(random.randint(0, sys.maxsize))` may produce a Literal Integer instead of a Literal Long.

```
In [26]: from pyspark.sql import functions as sf

In [27]: df = spark.createDataFrame([([1, 20, 3, 5],)], ['data'])

In [28]: df.select(sf.shuffle(df.data)).show()
+-------------+
|shuffle(data)|
+-------------+
|[1, 3, 5, 20]|
+-------------+

In [29]: df.select(sf.call_udf("shuffle", df.data, sf.lit(123456789000000))).show()
+-------------+
|shuffle(data)|
+-------------+
|[20, 1, 5, 3]|
+-------------+

In [30]: df.select(sf.call_udf("shuffle", df.data, sf.lit(12345))).show()
...
SparkConnectGrpcException: (org.apache.spark.sql.connect.common.InvalidPlanInput) seed should be a literal long, but got 12345
```

Another case is `uuid`, but it is not supported in Python due to a namespace conflict. I did not find other similar cases.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manually checked.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #44793 from zhengruifeng/py_shuffle_long.
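The failure mode above can be sketched in plain Python. Note this is a hypothetical illustration, not the actual Connect client's literal-inference code: it assumes a 64-bit `sys.maxsize` (2**63 - 1) and an inference rule that types any int fitting in 32 bits as a Literal Integer, which yields the ~2e-10 probability cited in the description.

```python
INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def inferred_literal_type(value: int) -> str:
    # Hypothetical inference rule (assumption, not the real Connect code):
    # an int fitting in 32 bits becomes a Literal Integer, wider ints a
    # Literal Long.
    return "integer" if INT32_MIN <= value <= INT32_MAX else "long"


# randint(0, sys.maxsize) draws uniformly from 2**63 values on a 64-bit
# build; only the 2**31 values in [0, INT32_MAX] fit in 32 bits.
p_int_literal = (2**31) / (2**63)  # = 2**-32, roughly 2.3e-10

print(inferred_literal_type(12345))            # integer -> Planner rejects it
print(inferred_literal_type(123456789000000))  # long    -> Planner accepts it
print(f"{p_int_literal:.1e}")                  # 2.3e-10
```

This also shows why the fix builds the seed as `LiteralExpression(..., LongType())` rather than relying on `lit()` to infer the type from the drawn value.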
Authored-by: Ruifeng Zheng <ruife...@apache.org>
Signed-off-by: Ruifeng Zheng <ruife...@apache.org>
---
 python/pyspark/sql/connect/functions/builtin.py | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/python/pyspark/sql/connect/functions/builtin.py b/python/pyspark/sql/connect/functions/builtin.py
index 1e22a42c6241..7276cead88ef 100644
--- a/python/pyspark/sql/connect/functions/builtin.py
+++ b/python/pyspark/sql/connect/functions/builtin.py
@@ -59,7 +59,14 @@ from pyspark.sql.connect.udf import _create_py_udf
 from pyspark.sql.connect.udtf import AnalyzeArgument, AnalyzeResult  # noqa: F401
 from pyspark.sql.connect.udtf import _create_py_udtf
 from pyspark.sql import functions as pysparkfuncs
-from pyspark.sql.types import _from_numpy_type, DataType, StructType, ArrayType, StringType
+from pyspark.sql.types import (
+    _from_numpy_type,
+    DataType,
+    LongType,
+    StructType,
+    ArrayType,
+    StringType,
+)
 
 # The implementation of pandas_udf is embedded in pyspark.sql.function.pandas_udf
 # for code reuse.
@@ -2116,7 +2123,11 @@ schema_of_xml.__doc__ = pysparkfuncs.schema_of_xml.__doc__
 
 
 def shuffle(col: "ColumnOrName") -> Column:
-    return _invoke_function("shuffle", _to_col(col), lit(random.randint(0, sys.maxsize)))
+    return _invoke_function(
+        "shuffle",
+        _to_col(col),
+        LiteralExpression(random.randint(0, sys.maxsize), LongType()),
+    )
 
 
 shuffle.__doc__ = pysparkfuncs.shuffle.__doc__

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org