[ 
https://issues.apache.org/jira/browse/SPARK-46684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon resolved SPARK-46684.
----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 44695
[https://github.com/apache/spark/pull/44695]

> CoGroup.applyInPandas/Arrow should pass arguments properly
> ----------------------------------------------------------
>
>                 Key: SPARK-46684
>                 URL: https://issues.apache.org/jira/browse/SPARK-46684
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect
>    Affects Versions: 3.5.0
>            Reporter: Takuya Ueshin
>            Assignee: Takuya Ueshin
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>
> In Spark Connect, {{CoGroup.applyInPandas/Arrow}} does not pass arguments to
> the UDF correctly, so the UDF can receive malformed inputs. In the example
> below, each input DataFrame arrives with only one column:
> {noformat}
> >>> import pandas as pd
> >>>
> >>> df1 = spark.createDataFrame(
> ...     [(1, 1.0, "a"), (2, 2.0, "b"), (1, 3.0, "c"), (2, 4.0, "d")], ("id", "v1", "v2")
> ... )
> >>> df2 = spark.createDataFrame([(1, "x"), (2, "y"), (1, "z")], ("id", "v3"))
> >>>
> >>> def summarize(left, right):
> ...     return pd.DataFrame(
> ...         {
> ...             "left_rows": [len(left)],
> ...             "left_columns": [len(left.columns)],
> ...             "right_rows": [len(right)],
> ...             "right_columns": [len(right.columns)],
> ...         }
> ...     )
> ...
> >>> df = (
> ...     df1.groupby("id")
> ...     .cogroup(df2.groupby("id"))
> ...     .applyInPandas(
> ...         summarize,
> ...         schema="left_rows long, left_columns long, right_rows long, right_columns long",
> ...     )
> ... )
> >>>
> >>> df.show()
> +---------+------------+----------+-------------+
> |left_rows|left_columns|right_rows|right_columns|
> +---------+------------+----------+-------------+
> |        2|           1|         2|            1|
> |        2|           1|         1|            1|
> +---------+------------+----------+-------------+
> {noformat}
> The result should be:
> {noformat}
> +---------+------------+----------+-------------+
> |left_rows|left_columns|right_rows|right_columns|
> +---------+------------+----------+-------------+
> |        2|           3|         2|            2|
> |        2|           3|         1|            2|
> +---------+------------+----------+-------------+
> {noformat}
>  
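
The expected table above can be reproduced without a Spark session. The following is a minimal pandas-only sketch, not Spark's implementation: `cogroup_apply_local` is a hypothetical helper introduced here for illustration, and it assumes each cogroup hands the UDF every column of both inputs, including the grouping key, which is what the expected column counts (3 and 2) imply.

```python
import pandas as pd

# Same input data as in the issue description, built as plain pandas frames.
left_all = pd.DataFrame(
    [(1, 1.0, "a"), (2, 2.0, "b"), (1, 3.0, "c"), (2, 4.0, "d")],
    columns=["id", "v1", "v2"],
)
right_all = pd.DataFrame([(1, "x"), (2, "y"), (1, "z")], columns=["id", "v3"])


def summarize(left, right):
    # Same UDF as in the issue: report row and column counts of each input.
    return pd.DataFrame(
        {
            "left_rows": [len(left)],
            "left_columns": [len(left.columns)],
            "right_rows": [len(right)],
            "right_columns": [len(right.columns)],
        }
    )


def cogroup_apply_local(left_df, right_df, key, func):
    # Hypothetical local stand-in for a cogrouped apply: for every key value
    # seen on either side, pass the full matching rows of both frames to func.
    keys = sorted(set(left_df[key]) | set(right_df[key]))
    parts = [
        func(left_df[left_df[key] == k], right_df[right_df[key] == k])
        for k in keys
    ]
    return pd.concat(parts, ignore_index=True)


expected = cogroup_apply_local(left_all, right_all, "id", summarize)
print(expected)
```

Under these assumptions, both groups report 3 left columns and 2 right columns, matching the expected table rather than the single-column result observed in Spark Connect.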



