ueshin commented on PR #40692: URL: https://github.com/apache/spark/pull/40692#issuecomment-1499840711
> Just FYI, vanilla PySpark's DataFrame.toPandas also has this issue [issues.apache.org/jira/browse/SPARK-41971](https://issues.apache.org/jira/browse/SPARK-41971) Is it possible to move the changes to ArrowUtils to fix them all?

Yes, I'm aware of the issue, but let me defer it to follow-up PRs.

TL;DR: this PR still has an issue with `toPandas`.

```py
>>> spark.sql("values (1, struct(1 as a, 2 as a)) as t(x, y)").toPandas()
   x                     y
0  1  {'a_0': 1, 'a_1': 2}
```

The duplicated fields get suffixes `_0`, `_1`, and so on.

Also, how `toPandas` handles struct types was never well defined, and the behavior differs even between Arrow enabled and disabled in PySpark:

```py
>>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', False)
>>> spark.sql("values (1, struct(1 as a, 2 as b)) as t(x, y)").toPandas()
   x       y
0  1  (1, 2)
>>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
>>> spark.sql("values (1, struct(1 as a, 2 as b)) as t(x, y)").toPandas()
   x                 y
0  1  {'a': 1, 'b': 2}
```

Currently, PySpark with Arrow enabled and Spark Connect return a dict for struct-type values, whereas PySpark without Arrow returns a `Row` object.

The options are:

1. It's OK for the results to differ, keeping the suffixes.
   - In this case, the suffixes are a must, because a dict object can hold only one value per duplicated key.
2. A `Row` object should be used for structs.
   - In this case, we lose the benefit of the fast Arrow-to-pandas conversion.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
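To make option 1's constraint concrete: a Python dict keeps only one value per key, so duplicated struct field names must be suffixed before they become dict keys. Here is a minimal sketch of such a dedup pass; `dedup_field_names` is a hypothetical helper for illustration, not Spark's actual implementation:

```python
from collections import Counter

def dedup_field_names(names):
    # Hypothetical helper: append positional suffixes (_0, _1, ...) to
    # field names that occur more than once, mirroring the 'a_0'/'a_1'
    # output shown above. Unique names pass through unchanged.
    counts = Counter(names)
    seen = Counter()
    out = []
    for name in names:
        if counts[name] > 1:
            out.append(f"{name}_{seen[name]}")
            seen[name] += 1
        else:
            out.append(name)
    return out

# Without the suffixes, a dict silently drops all but one duplicate:
print(dict([("a", 1), ("a", 2)]))          # {'a': 2}
print(dedup_field_names(["a", "a", "b"]))  # ['a_0', 'a_1', 'b']
```

A `Row`, being tuple-based, could carry both values without renaming, which is what option 2 trades the fast Arrow conversion for.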