Takuya Ueshin created SPARK-46684:
-------------------------------------

             Summary: CoGroup.applyInPandas/Arrow should pass arguments properly
                 Key: SPARK-46684
                 URL: https://issues.apache.org/jira/browse/SPARK-46684
             Project: Spark
          Issue Type: Bug
          Components: Connect
    Affects Versions: 3.5.0
            Reporter: Takuya Ueshin


In Spark Connect, {{CoGroup.applyInPandas/Arrow}} doesn't pass arguments to the
UDF properly, so the UDF can receive incorrect inputs. In the example below,
the UDF sees only one column on each side of the cogroup:
{noformat}
>>> import pandas as pd
>>>
>>> df1 = spark.createDataFrame(
...     [(1, 1.0, "a"), (2, 2.0, "b"), (1, 3.0, "c"), (2, 4.0, "d")], ("id", "v1", "v2")
... )
>>> df2 = spark.createDataFrame([(1, "x"), (2, "y"), (1, "z")], ("id", "v3"))
>>>
>>> def summarize(left, right):
...     return pd.DataFrame(
...         {
...             "left_rows": [len(left)],
...             "left_columns": [len(left.columns)],
...             "right_rows": [len(right)],
...             "right_columns": [len(right.columns)],
...         }
...     )
...
>>> df = (
...     df1.groupby("id")
...     .cogroup(df2.groupby("id"))
...     .applyInPandas(
...         summarize,
...         schema="left_rows long, left_columns long, right_rows long, right_columns long",
...     )
... )
>>>
>>> df.show()
+---------+------------+----------+-------------+
|left_rows|left_columns|right_rows|right_columns|
+---------+------------+----------+-------------+
|        2|           1|         2|            1|
|        2|           1|         1|            1|
+---------+------------+----------+-------------+
{noformat}

The result should be:

{noformat}
+---------+------------+----------+-------------+
|left_rows|left_columns|right_rows|right_columns|
+---------+------------+----------+-------------+
|        2|           3|         2|            2|
|        2|           3|         1|            2|
+---------+------------+----------+-------------+
{noformat}
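
The expected counts can be cross-checked without a Spark session. The sketch below is only a pandas emulation of cogrouping, not Spark's implementation: {{cogroup_apply}} is a hypothetical helper introduced here for illustration, and it assumes each side of the cogroup is handed to the UDF with all of its columns, including the grouping column.

```python
import pandas as pd

# Same data as df1 and df2 in the report, built as plain pandas DataFrames.
left_df = pd.DataFrame(
    {"id": [1, 2, 1, 2], "v1": [1.0, 2.0, 3.0, 4.0], "v2": ["a", "b", "c", "d"]}
)
right_df = pd.DataFrame({"id": [1, 2, 1], "v3": ["x", "y", "z"]})


def summarize(left, right):
    # Identical to the UDF in the report.
    return pd.DataFrame(
        {
            "left_rows": [len(left)],
            "left_columns": [len(left.columns)],
            "right_rows": [len(right)],
            "right_columns": [len(right.columns)],
        }
    )


def cogroup_apply(left, right, key, func):
    # Hypothetical helper (not a Spark or pandas API): split both frames by
    # `key` and call `func` once per key with the full group from each side.
    left_groups = {k: g for k, g in left.groupby(key)}
    right_groups = {k: g for k, g in right.groupby(key)}
    parts = []
    for k in sorted(set(left_groups) | set(right_groups)):
        # A key missing on one side yields an empty frame with that side's columns.
        left_part = left_groups.get(k, left.iloc[0:0])
        right_part = right_groups.get(k, right.iloc[0:0])
        parts.append(func(left_part, right_part))
    return pd.concat(parts, ignore_index=True)


expected = cogroup_apply(left_df, right_df, "id", summarize)
print(expected)
```

Under that assumption the left side has 3 columns and the right side has 2 for every key, which matches the expected table above rather than the output seen in Spark Connect.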
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
