Takuya Ueshin created SPARK-46684:
-------------------------------------

             Summary: CoGroup.applyInPandas/Arrow should pass arguments properly
                 Key: SPARK-46684
                 URL: https://issues.apache.org/jira/browse/SPARK-46684
             Project: Spark
          Issue Type: Bug
          Components: Connect
    Affects Versions: 3.5.0
            Reporter: Takuya Ueshin

In Spark Connect, {{CoGroup.applyInPandas/Arrow}} does not pass arguments to the UDF properly, so the UDF can receive broken inputs:

{noformat}
>>> import pandas as pd
>>>
>>> df1 = spark.createDataFrame(
...     [(1, 1.0, "a"), (2, 2.0, "b"), (1, 3.0, "c"), (2, 4.0, "d")], ("id", "v1", "v2")
... )
>>> df2 = spark.createDataFrame([(1, "x"), (2, "y"), (1, "z")], ("id", "v3"))
>>>
>>> def summarize(left, right):
...     return pd.DataFrame(
...         {
...             "left_rows": [len(left)],
...             "left_columns": [len(left.columns)],
...             "right_rows": [len(right)],
...             "right_columns": [len(right.columns)],
...         }
...     )
...
>>> df = (
...     df1.groupby("id")
...     .cogroup(df2.groupby("id"))
...     .applyInPandas(
...         summarize,
...         schema="left_rows long, left_columns long, right_rows long, right_columns long",
...     )
... )
>>>
>>> df.show()
+---------+------------+----------+-------------+
|left_rows|left_columns|right_rows|right_columns|
+---------+------------+----------+-------------+
|        2|           1|         2|            1|
|        2|           1|         1|            1|
+---------+------------+----------+-------------+
{noformat}

The result should be:

{noformat}
+---------+------------+----------+-------------+
|left_rows|left_columns|right_rows|right_columns|
+---------+------------+----------+-------------+
|        2|           3|         2|            2|
|        2|           3|         1|            2|
+---------+------------+----------+-------------+
{noformat}
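
As a cross-check of the expected table, here is a minimal pandas-only sketch that calls the same {{summarize}} function once per {{id}} key, with no Spark session involved. The helper {{emulate_cogroup}} and the names {{left_pdf}} and {{right_pdf}} are hypothetical and not part of the report or of PySpark. The sketch only illustrates which frames the UDF would be expected to receive; it does not reproduce the Spark Connect code path.

```python
import pandas as pd

# Local pandas stand-ins for df1 and df2 from the report (assumed names).
left_pdf = pd.DataFrame(
    [(1, 1.0, "a"), (2, 2.0, "b"), (1, 3.0, "c"), (2, 4.0, "d")],
    columns=["id", "v1", "v2"],
)
right_pdf = pd.DataFrame([(1, "x"), (2, "y"), (1, "z")], columns=["id", "v3"])


def summarize(left, right):
    # Same UDF body as in the report.
    return pd.DataFrame(
        {
            "left_rows": [len(left)],
            "left_columns": [len(left.columns)],
            "right_rows": [len(right)],
            "right_columns": [len(right.columns)],
        }
    )


def emulate_cogroup(left, right, key, func):
    # Hypothetical helper: for every key value found on either side, pass the
    # matching rows of each frame (all columns, grouping key included) to func.
    keys = sorted(set(left[key]) | set(right[key]))
    parts = [func(left[left[key] == k], right[right[key] == k]) for k in keys]
    return pd.concat(parts, ignore_index=True)


expected = emulate_cogroup(left_pdf, right_pdf, "id", summarize)
print(expected)
```

Each call sees the full three-column left frame and two-column right frame for its key, which agrees with the column counts in the expected output. Row order in the sketch follows the sorted keys; Spark does not guarantee any particular row order.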