[ https://issues.apache.org/jira/browse/SPARK-44517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated SPARK-44517:
-----------------------------------
    Labels: pull-request-available  (was: )

first operator should respect the nullability of child expression as well as ignoreNulls option
-----------------------------------------------------------------------------------------------

                 Key: SPARK-44517
                 URL: https://issues.apache.org/jira/browse/SPARK-44517
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.2.0, 3.2.1, 3.3.0, 3.2.2, 3.3.1, 3.2.3, 3.2.4, 3.3.2, 3.4.0, 3.4.1
            Reporter: Nan Zhu
            Priority: Major
              Labels: pull-request-available

I found the following problem when using Spark recently:

{code:java}
import spark.implicits._

val s = Seq((1.2, "s", 2.2)).toDF("v1", "v2", "v3")
val schema = StructType(Seq(
  StructField("v1", DoubleType, nullable = false),
  StructField("v2", StringType, nullable = true),
  StructField("v3", DoubleType, nullable = false)))
val df = spark.createDataFrame(s.rdd, schema)
val inputDF = df.dropDuplicates("v3")

spark.sql("CREATE TABLE local.db.table (\n v1 DOUBLE NOT NULL,\n v2 STRING,\n v3 DOUBLE NOT NULL)")
inputDF.write.mode("overwrite").format("iceberg").save("local.db.table")
{code}

When I use the above code to write to Iceberg (I guess Delta Lake will have the same problem), I get a very confusing exception:

{code:java}
Exception in thread "main" java.lang.IllegalArgumentException: Cannot write incompatible dataset to table with schema:
table {
  1: v1: required double
  2: v2: optional string
  3: v3: required double
}
Provided schema:
table {
  1: v1: optional double
  2: v2: optional string
  3: v3: required double
}
{code}

Basically, it complains that v1 is a nullable column in `inputDF`, which is not allowed since we created the table with v1 as not nullable.
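The mismatch can be modeled without Spark at all. Below is a minimal, hedged sketch (the `Attr` and `First` classes here are hypothetical stand-ins, not Spark's actual expression classes) contrasting the current hard-coded nullability with the behavior the issue title asks for, where first() derives nullability from its child:

```scala
// Hypothetical stand-ins for illustration only, not Spark's internals.
case class Attr(name: String, nullable: Boolean)

case class First(child: Attr, ignoreNulls: Boolean) {
  // Current behavior: nullable is hard-coded, so first(v1) is always
  // reported as nullable even when v1 is declared non-nullable.
  def nullableHardCoded: Boolean = true

  // What the issue title suggests: derive nullability from the child.
  // Within a non-empty group, first() of a non-nullable child can never
  // produce a null, regardless of the ignoreNulls option.
  def nullableFromChild: Boolean = child.nullable
}

object Demo extends App {
  val v1 = Attr("v1", nullable = false)
  println(First(v1, ignoreNulls = false).nullableHardCoded) // true  -> schema mismatch on write
  println(First(v1, ignoreNulls = false).nullableFromChild) // false -> matches v1's declared schema
}
```

Under this reading, the `optional double` for v1 in the "Provided schema" above is exactly the hard-coded `true` leaking into the output schema of the rewritten aggregate.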
The confusion comes from the fact that, if we check the schema of inputDF with printSchema(), v1 is not nullable:

{noformat}
root
 |-- v1: double (nullable = false)
 |-- v2: string (nullable = true)
 |-- v3: double (nullable = false)
{noformat}

Clearly, something changed v1's nullability unexpectedly!

After some debugging, I found that the key is dropDuplicates("v3"). In the optimization phase, ReplaceDeduplicateWithAggregate replaces the Deduplicate with an aggregate on v3 that runs first() over all the other columns. However, the first() operator hard-codes nullable as always "true", which is the source of v1's changed nullability.

This is a very confusing behavior of Spark, and probably no one really noticed it before, since nullability mattered little without the newer table formats like Delta Lake and Iceberg, which enforce nullability checks correctly. Now that users adopt these formats more and more, the issue has surfaced.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org