amogh-jahagirdar opened a new issue, #9555:
URL: https://github.com/apache/iceberg/issues/9555
### Apache Iceberg version
1.4.3 (latest release)
### Query engine
Spark
### Please describe the bug 🐞
Reproduction:
Here's a simple unit test (can copy/paste this into `TestMerge`)
```
@Test
public void testMergeIntoTsIssue() {
  // Target table: the second row has a NULL ts.
  createAndInitTable(
      "id INT, ts TIMESTAMP",
      "{ \"id\": 1, \"ts\": \"2000-01-01 00:00:00\" }\n" + "{ \"id\": 6, \"ts\": null }");

  createOrReplaceView(
      "source",
      "id INT NOT NULL, dep STRING",
      "{ \"id\": 1, \"ts\": \"2000-01-01 00:00:00\" }\n");

  // Only id=1 matches; the id=6 row should be carried over unchanged.
  sql(
      "MERGE INTO %s t USING source s "
          + "ON t.id == s.id "
          + "WHEN MATCHED THEN "
          + " UPDATE SET id=123, ts=current_timestamp()",
      commitTarget());

  sql("SELECT * FROM %s", commitTarget());
}
```
In short:
1.) Create a table.
2.) Insert some records where at least one record has a NULL column value.
3.) MERGE into the table with an update on matched records that sets the column containing the NULL value.
Expected:
Record 1: id=123, ts=current_timestamp()
Record 2: id=6, ts=null
However, in Spark 3.4 we get:
Record 1: id=123, ts=current_timestamp()
Record 2: id=6, ts=1970-01-01 00:00:00 (basically the Unix epoch; in practice it's a timestamp with time zone, so it will be rendered in your local time zone)
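To illustrate the time zone remark, here is a small standalone sketch using only `java.time` (the class and method names are made up for this example and are not part of Iceberg or Spark). It shows that the same epoch instant prints as a different wall-clock time depending on the zone, which is why the bad value may not look like midnight on 1970-01-01 in your output.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;

public class EpochRendering {
  // Returns the Unix epoch instant (0) as seen from the given time zone.
  static ZonedDateTime epochIn(String zoneId) {
    return Instant.EPOCH.atZone(ZoneId.of(zoneId));
  }

  public static void main(String[] args) {
    // Same instant, two different wall-clock renderings.
    System.out.println(epochIn("UTC"));
    System.out.println(epochIn("America/Los_Angeles"));
  }
}
```

For example, in `America/Los_Angeles` the epoch falls at 16:00 on 1969-12-31 local time, so a row showing that value is the same symptom.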
I've done some debugging, and what's happening is that the schema for the `SparkWrite` in 3.4 treats all the fields as required, so the null value ends up being written as the default value.
The fields are treated as required because the `nullability` of the Spark attributes for the fields
https://github.com/apache/iceberg/blob/main/spark/v3.4/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteMergeIntoTable.scala#L182
isn't being passed to the merge output schema during planning. This nullability needs to be passed correctly so that the null values in the non-matched cases get written correctly.
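The effect can be pictured with a deliberately simplified toy model. This is not Iceberg's or Spark's writer code; `RequiredFieldSketch` and `write` are hypothetical names, and the model assumes that a required (non-nullable) column skips the null check and falls back to the primitive default of 0, which for a microsecond timestamp is the Unix epoch.

```java
import java.util.ArrayList;
import java.util.List;

public class RequiredFieldSketch {
  // Toy writer for a timestamp column stored as microseconds since the epoch.
  // ASSUMPTION: a required column never records nulls and uses 0 instead.
  static List<String> write(boolean nullable, Long[] valuesMicros) {
    List<String> out = new ArrayList<>();
    for (Long v : valuesMicros) {
      if (nullable) {
        // Optional column: nulls are preserved.
        out.add(v == null ? "null" : String.valueOf(v));
      } else {
        // Required column: a missing value degrades to the default 0 (epoch).
        long raw = (v == null) ? 0L : v;
        out.add(String.valueOf(raw));
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // 2000-01-01 00:00:00 UTC in microseconds, followed by a NULL ts.
    Long[] rows = {946684800000000L, null};
    System.out.println(write(true, rows));  // nullability carried through
    System.out.println(write(false, rows)); // nullability lost during planning
  }
}
```

With the nullable flag carried through, the second row stays null; with it lost, the same row comes out as 0, which matches the epoch value seen in the Spark 3.4 output.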
I'm currently looking into this, but creating this issue for tracking and
awareness.
Important Note:
Spark 3.3 and Spark 3.5 do not have this bug based on my testing.