[ https://issues.apache.org/jira/browse/SPARK-46684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hyukjin Kwon resolved SPARK-46684.
----------------------------------
    Fix Version/s: 4.0.0
       Resolution: Fixed

Issue resolved by pull request 44695
[https://github.com/apache/spark/pull/44695]

> CoGroup.applyInPandas/Arrow should pass arguments properly
> ----------------------------------------------------------
>
>                 Key: SPARK-46684
>                 URL: https://issues.apache.org/jira/browse/SPARK-46684
>             Project: Spark
>          Issue Type: Bug
>          Components: Connect
>    Affects Versions: 3.5.0
>            Reporter: Takuya Ueshin
>            Assignee: Takuya Ueshin
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>
> In Spark Connect, {{CoGroup.applyInPandas/Arrow}} doesn't take arguments
> properly, so the arguments of the UDF can be broken:
> {noformat}
> >>> import pandas as pd
> >>>
> >>> df1 = spark.createDataFrame(
> ...     [(1, 1.0, "a"), (2, 2.0, "b"), (1, 3.0, "c"), (2, 4.0, "d")], ("id", "v1", "v2")
> ... )
> >>> df2 = spark.createDataFrame([(1, "x"), (2, "y"), (1, "z")], ("id", "v3"))
> >>>
> >>> def summarize(left, right):
> ...     return pd.DataFrame(
> ...         {
> ...             "left_rows": [len(left)],
> ...             "left_columns": [len(left.columns)],
> ...             "right_rows": [len(right)],
> ...             "right_columns": [len(right.columns)],
> ...         }
> ...     )
> ...
> >>> df = (
> ...     df1.groupby("id")
> ...     .cogroup(df2.groupby("id"))
> ...     .applyInPandas(
> ...         summarize,
> ...         schema="left_rows long, left_columns long, right_rows long, right_columns long",
> ...     )
> ... )
> >>>
> >>> df.show()
> +---------+------------+----------+-------------+
> |left_rows|left_columns|right_rows|right_columns|
> +---------+------------+----------+-------------+
> |        2|           1|         2|            1|
> |        2|           1|         1|            1|
> +---------+------------+----------+-------------+
> {noformat}
> The result should be:
> {noformat}
> +---------+------------+----------+-------------+
> |left_rows|left_columns|right_rows|right_columns|
> +---------+------------+----------+-------------+
> |        2|           3|         2|            2|
> |        2|           3|         1|            2|
> +---------+------------+----------+-------------+
> {noformat}
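
The expected numbers in the report can be cross-checked without a Spark session. The following is only a rough pandas-only sketch, not Spark's implementation: `cogroup_apply` is a hypothetical helper written for this illustration, and it assumes that each cogrouped side reaches the UDF as a full pandas DataFrame that still includes the grouping column. The input rows and `summarize` are copied from the report.

```python
import pandas as pd


def summarize(left, right):
    # Same UDF as in the report: row and column counts of each side.
    return pd.DataFrame(
        {
            "left_rows": [len(left)],
            "left_columns": [len(left.columns)],
            "right_rows": [len(right)],
            "right_columns": [len(right.columns)],
        }
    )


def cogroup_apply(left_df, right_df, key, func):
    # Hypothetical stand-in for groupby(key).cogroup(...).applyInPandas(func):
    # pair up the groups that share a key and call func(left_group, right_group).
    left_groups = dict(iter(left_df.groupby(key)))
    right_groups = dict(iter(right_df.groupby(key)))
    results = []
    for k in sorted(set(left_groups) | set(right_groups)):
        # A key missing on one side gets an empty frame with that side's columns.
        left = left_groups.get(k, left_df.iloc[0:0])
        right = right_groups.get(k, right_df.iloc[0:0])
        results.append(func(left, right))
    return pd.concat(results, ignore_index=True)


left_df = pd.DataFrame(
    [(1, 1.0, "a"), (2, 2.0, "b"), (1, 3.0, "c"), (2, 4.0, "d")],
    columns=["id", "v1", "v2"],
)
right_df = pd.DataFrame([(1, "x"), (2, "y"), (1, "z")], columns=["id", "v3"])

result = cogroup_apply(left_df, right_df, "id", summarize)
print(result)
```

Under these assumptions the left side always reports 3 columns and the right side 2, which agrees with the "should be" table in the report. The column counts of 1 in the broken output indicate that the UDF was not receiving the complete left and right frames.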