ueshin commented on PR #40692: URL: https://github.com/apache/spark/pull/40692#issuecomment-1499840711
> Just FYI, vanilla PySpark's DataFrame.toPandas also has this issue [issues.apache.org/jira/browse/SPARK-41971](https://issues.apache.org/jira/browse/SPARK-41971) Is it possible to move the changes to ArrowUtils to fix them all?

Yes, I'm aware of the issue, but let me defer it to follow-up PRs.

TL;DR: this PR still has an issue with `toPandas`.

```py
>>> spark.sql("values (1, struct(1 as a, 2 as a)) as t(x, y)").toPandas()
   x                     y
0  1  {'a_0': 1, 'a_1': 2}
```

The duplicated fields get suffixes `_0`, `_1`, and so on.

Also, how `toPandas` handles struct types was never well defined, and the behavior differs even between Arrow enabled and disabled in PySpark:

```py
>>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', False)
>>> spark.sql("values (1, struct(1 as a, 2 as b)) as t(x, y)").toPandas()
   x       y
0  1  (1, 2)
>>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
>>> spark.sql("values (1, struct(1 as a, 2 as b)) as t(x, y)").toPandas()
   x                 y
0  1  {'a': 1, 'b': 2}
```

Currently, PySpark with Arrow enabled and Spark Connect return a dict for struct-type values, whereas PySpark without Arrow returns a `Row` object.

The options are:

1. It's OK for the results to differ, keeping the suffixes.
   - In this case, the suffixes are a must, because a dict object can hold only one value per duplicated key.
2. A `Row` object should be used for structs.
   - In this case, we lose the benefit of the fast Arrow-to-pandas conversion.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
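To make option 1's constraint concrete: a Python dict keeps only one value per key, so duplicated struct field names must be suffixed before they become dict keys. Here is a minimal sketch of such a dedup pass; `dedup_field_names` is a hypothetical helper for illustration, not Spark's actual implementation:

```python
from collections import Counter

def dedup_field_names(names):
    # Hypothetical helper: append positional suffixes (_0, _1, ...) to
    # field names that occur more than once, mirroring the 'a_0'/'a_1'
    # output shown above. Unique names pass through unchanged.
    counts = Counter(names)
    seen = Counter()
    out = []
    for name in names:
        if counts[name] > 1:
            out.append(f"{name}_{seen[name]}")
            seen[name] += 1
        else:
            out.append(name)
    return out

# Without the suffixes, a dict silently drops all but one duplicate:
print(dict([("a", 1), ("a", 2)]))          # {'a': 2}
print(dedup_field_names(["a", "a", "b"]))  # ['a_0', 'a_1', 'b']
```

A `Row`, being tuple-based, could carry both values without renaming, which is what option 2 trades the fast Arrow conversion for.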