ueshin opened a new pull request, #41190: URL: https://github.com/apache/spark/pull/41190
### What changes were proposed in this pull request?

Support duplicated field names in `createDataFrame` with pandas DataFrame.

With Arrow, without Arrow, and in Spark Connect:

```py
>>> spark.createDataFrame(pdf, schema).show()
+--------+---------------+
|struct_0|       struct_1|
+--------+---------------+
|  {a, 1}|{2, 3, b, 4, c}|
|  {x, 6}|{7, 8, y, 9, z}|
+--------+---------------+
```

### Why are the changes needed?

If there are duplicated field names, `createDataFrame` with pandas DataFrame falls back to the non-Arrow path, or fails in Spark Connect.

```py
>>> import pandas as pd
>>> from pyspark.sql.types import *
>>>
>>> schema = (
...     StructType()
...     .add("struct_0", StructType().add("x", StringType()).add("x", IntegerType()))
...     .add(
...         "struct_1",
...         StructType()
...         .add("a", IntegerType())
...         .add("x", IntegerType())
...         .add("x", StringType())
...         .add("y", IntegerType())
...         .add("y", StringType()),
...     )
... )
>>>
>>> data = [Row(Row("a", 1), Row(2, 3, "b", 4, "c")), Row(Row("x", 6), Row(7, 8, "y", 9, "z"))]
>>> pdf = pd.DataFrame.from_records(data, columns=schema.names)
```

- Without Arrow: Works fine.

```py
>>> spark.createDataFrame(pdf, schema).show()
+--------+---------------+
|struct_0|       struct_1|
+--------+---------------+
|  {a, 1}|{2, 3, b, 4, c}|
|  {x, 6}|{7, 8, y, 9, z}|
+--------+---------------+
```

- With Arrow: Works with fallback.

```py
>>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
>>> spark.createDataFrame(pdf, schema).show()
/.../pyspark/sql/pandas/conversion.py:347: UserWarning: createDataFrame attempted Arrow optimization because 'spark.sql.execution.arrow.pyspark.enabled' is set to true; however, failed by the reason below:
  [DUPLICATED_FIELD_NAME_IN_ARROW_STRUCT] Duplicated field names in Arrow Struct are not allowed, got [x, x].
Attempting non-optimization as 'spark.sql.execution.arrow.pyspark.fallback.enabled' is set to true.
  warn(msg)
+--------+---------------+
|struct_0|       struct_1|
+--------+---------------+
|  {a, 1}|{2, 3, b, 4, c}|
|  {x, 6}|{7, 8, y, 9, z}|
+--------+---------------+
```

- Spark Connect: Fails.

```py
>>> spark.createDataFrame(pdf, schema).show()
...
Traceback (most recent call last):
...
pyspark.errors.exceptions.connect.IllegalArgumentException: not all nodes and buffers were consumed.
...
```

### Does this PR introduce _any_ user-facing change?

Yes. `createDataFrame` with a pandas DataFrame now works when the schema contains duplicated field names.

### How was this patch tested?

Added the related test.
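
For reviewers who want to reason about the failure mode without a Spark session, the following is a minimal pure-Python sketch of the underlying problem: a struct's field names can repeat, so any step that keys values by name needs either a duplicate check or a renaming scheme. The helper names `find_duplicated_names` and `dedup_names` and the `_N` suffix scheme are hypothetical, chosen for illustration only. They are not PySpark APIs, and this is not necessarily how the PR resolves duplicates.

```python
from collections import Counter
from typing import List


def find_duplicated_names(names: List[str]) -> List[str]:
    """Return every occurrence of a name that appears more than once, in order."""
    counts = Counter(names)
    return [name for name in names if counts[name] > 1]


def dedup_names(names: List[str]) -> List[str]:
    """Append a per-name counter suffix to duplicated names (hypothetical scheme).

    Unique names are returned unchanged. This sketch does not guard against a
    generated name colliding with an existing one (e.g. a real field "x_0").
    """
    counts = Counter(names)
    seen: Counter = Counter()
    result = []
    for name in names:
        if counts[name] > 1:
            result.append(f"{name}_{seen[name]}")
            seen[name] += 1
        else:
            result.append(name)
    return result


# Field names of struct_0 and struct_1 from the schema in this description.
struct_0 = ["x", "x"]
struct_1 = ["a", "x", "x", "y", "y"]

print(find_duplicated_names(struct_0))  # ['x', 'x']
print(dedup_names(struct_1))            # ['a', 'x_0', 'x_1', 'y_0', 'y_1']
```

Because the renaming is positional, the original names can be restored afterwards by zipping the deduplicated list back against the schema's own field names.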