mbutrovich commented on issue #1875:
URL:
https://github.com/apache/datafusion-comet/issues/1875#issuecomment-2967624631
Spark 3.4 and 3.5 handle struct conversion in this test case differently:
3.4 inserts a `cast` expression in the Project operator, while 3.5 uses a
`named_struct` expression.
Spark 3.4:
```
== Physical Plan ==
CommandResult <empty>
+- Execute InsertIntoHadoopFsRelationCommand
file:/Users/matt/git/datafusion-comet/spark-warehouse/tab2, false, Parquet,
[path=file:/Users/matt/git/datafusion-comet/spark-warehouse/tab2], Append,
`spark_catalog`.`default`.`tab2`,
org.apache.spark.sql.execution.datasources.InMemoryFileIndex(file:/Users/matt/git/datafusion-comet/spark-warehouse/tab2),
[p]
+- WriteFiles
+- *(1) Project [cast(s#3 as struct<b:string,a:string>) AS p#6]
+- *(1) ColumnarToRow
+- FileScan parquet spark_catalog.default.tab1[s#3] Batched:
true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1
paths)[file:/Users/matt/git/datafusion-comet/spark-warehouse/tab1],
PartitionFilters: [], PushedFilters: [], ReadSchema:
struct<s:struct<a:string,b:string>>
```
Spark 3.5:
```
== Physical Plan ==
CommandResult <empty>
+- Execute InsertIntoHadoopFsRelationCommand
file:/Users/matt/git/datafusion-comet/spark-warehouse/tab2, false, Parquet,
[path=file:/Users/matt/git/datafusion-comet/spark-warehouse/tab2], Append,
`spark_catalog`.`default`.`tab2`,
org.apache.spark.sql.execution.datasources.InMemoryFileIndex(file:/Users/matt/git/datafusion-comet/spark-warehouse/tab2),
[p]
+- WriteFiles
+- *(1) Project [if (isnull(s#12)) null else named_struct(b,
s#12.a, a, s#12.b) AS p#20]
+- *(1) ColumnarToRow
+- FileScan parquet spark_catalog.default.tab1[s#12] Batched:
true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(1
paths)[file:/Users/matt/git/datafusion-comet/spark-warehouse/tab1],
PartitionFilters: [], PushedFilters: [], ReadSchema:
struct<s:struct<a:string,b:string>>
```
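Both plans implement the same semantics: target struct fields are bound to source fields by position, not by name (note how 3.5's `named_struct(b, s#12.a, a, s#12.b)` assigns `s.a` to the new field `b`), and a null struct stays null. A minimal plain-Scala sketch of that behavior (illustrative names only, not Spark API):

```scala
// Plain-Scala model of the struct cast semantics seen in both plans
// (hypothetical names, not Spark API).
case class F(name: String, value: String)

def castStruct(src: Option[Seq[F]], targetNames: Seq[String]): Option[Seq[F]] =
  // A null struct stays null (the `if (isnull(s)) null else ...` guard in 3.5);
  // otherwise each target field is bound to the source field at the same position.
  src.map(fields => targetNames.zip(fields).map { case (n, f) => F(n, f.value) })

val s = Some(Seq(F("a", "1"), F("b", "2")))
val p = castStruct(s, Seq("b", "a"))
// The first target field `b` receives s.a's value, matching named_struct(b, s.a, a, s.b).
```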
This test doesn't fail on Spark 3.5 with Comet, but that doesn't mean the
cast logic is correct--we simply never receive the `cast` expression in 3.5.
Here's a succinct test case showing the cast issue still exists on 3.5:
```scala
test("cast StructType to StructType with different names") {
  withTable("tab1") {
    sql("""
        |CREATE TABLE tab1 (s struct<a: string, b: string>)
        |USING parquet
        """.stripMargin)
    sql("INSERT INTO TABLE tab1 SELECT named_struct('col1','1','col2','2')")
    if (usingDataSourceExec) {
      checkSparkAnswerAndOperator(
        "SELECT CAST(s AS struct<field1:string, field2:string>) AS new_struct FROM tab1")
    } else {
      // Should just fall back to Spark since the non-DataSourceExec scan does
      // not support nested types.
      checkSparkAnswer(
        "SELECT CAST(s AS struct<field1:string, field2:string>) AS new_struct FROM tab1")
    }
  }
}
```
I will work on a fix in cast.rs, and include the provided test.
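For reference, here is what the test's data flow should produce under positional struct-cast semantics, modeled in plain Scala (a hedged sketch with illustrative names, not Spark or Comet API):

```scala
// Hypothetical model of the test above (names are illustrative only).
// INSERT: named_struct('col1','1','col2','2') is stored into s: struct<a, b>
// by position, so s.a = "1" and s.b = "2".
val stored = Seq("a" -> "1", "b" -> "2")

// SELECT CAST(s AS struct<field1:string, field2:string>): the fields are
// renamed by position; the values are unchanged.
val result = Seq("field1", "field2").zip(stored.map(_._2))
// result == Seq(("field1", "1"), ("field2", "2"))
```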
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]