[I] Spark 3.4 MERGE INTO for CoW replacing NULL unmatched records with default values [iceberg]

via GitHub Wed, 24 Jan 2024 09:43:59 -0800


amogh-jahagirdar opened a new issue, #9555:
URL: https://github.com/apache/iceberg/issues/9555


   ### Apache Iceberg version
   
   1.4.3 (latest release)
   
   ### Query engine
   
   Spark
   
   ### Please describe the bug 🐞
   
   Reproduction:
   
   Here's a simple unit test (can copy/paste this into `TestMerge`)
   
   ```
     @Test
     public void testMergeIntoTsIssue() {
       createAndInitTable(
           "id INT, ts TIMESTAMP",
           "{ \"id\": 1, \"ts\": \"2000-01-01 00:00:00\" }\n" + "{ \"id\": 6, \"ts\": null }");
   
       createOrReplaceView(
           "source",
           "id INT NOT NULL, dep STRING",
           "{ \"id\": 1, \"ts\": \"2000-01-01 00:00:00\" }\n");
       sql(
           "MERGE INTO %s t USING source s "
               + "ON t.id == s.id "
               + "WHEN MATCHED THEN "
               + "  UPDATE SET id=123, ts=current_timestamp()",
           commitTarget());
   
       sql("SELECT * FROM %s", commitTarget());
     }
   ```
   
   In short:
   1.) Create a table.
   2.) Insert some records where at least one record has a NULL column value.
   3.) MERGE into the table with an update on matched records that sets the column containing the NULL value.
   
   Expected:
   
   Record 1: id=123, ts=current_timestamp()
   Record 2: id=6, ts=null
   
   However, in Spark 3.4 we get:
   
   Record 1: id=123, ts=current_timestamp()
   Record 2: id=6, ts=01-01-1970 00:00:000 (basically the Unix epoch; in practice it's a timestamp with time zone, so it will be displayed in your local time zone)
   
   
   I've done some debugging, and what's happening is that the schema for the `SparkWrite` in 3.4 treats all the fields as required, so NULL values end up written as default values.
   
   The fields are treated as required because the Spark expression nullability of the field attributes (https://github.com/apache/iceberg/blob/main/spark/v3.4/spark-extensions/src/main/scala/org/apache/spark/sql/catalyst/analysis/RewriteMergeIntoTable.scala#L182) isn't being passed to the merge output schema during planning. This nullability needs to be passed correctly so that the null values in the non-matched cases get written correctly.
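   
   To make the symptom easier to picture, here is a toy model in plain Java. It is only a sketch under stated assumptions, not Iceberg's or Spark's actual writer code: `RequiredFieldSketch`, `writeTsColumn`, and `fieldIsOptional` are hypothetical names, and the timestamp is assumed to be stored as microseconds since the epoch. It shows how a writer that believes a column is required could turn a null into the epoch value.
   
   ```java
   import java.util.ArrayList;
   import java.util.Arrays;
   import java.util.List;

   // Toy model only: NOT Iceberg's or Spark's actual writer code.
   public class RequiredFieldSketch {

     // Hypothetical writer for one timestamp column (microseconds since epoch).
     // If the write schema marks the column as required, this sketch assumes the
     // writer cannot record "missing" and falls back to the primitive default 0L.
     static List<Long> writeTsColumn(List<Long> values, boolean fieldIsOptional) {
       List<Long> written = new ArrayList<>();
       for (Long v : values) {
         if (v != null) {
           written.add(v);
         } else if (fieldIsOptional) {
           written.add(null); // nullability preserved, so the null survives
         } else {
           written.add(0L); // nullability lost: 0 micros is the Unix epoch
         }
       }
       return written;
     }

     public static void main(String[] args) {
       // 2000-01-01 00:00:00 UTC in microseconds, plus one null value.
       List<Long> ts = Arrays.asList(946684800000000L, null);
       System.out.println(writeTsColumn(ts, true));  // [946684800000000, null]
       System.out.println(writeTsColumn(ts, false)); // [946684800000000, 0]
     }
   }
   ```
   
   In this model the only difference between the two calls is the nullability flag, which mirrors the report above: the data is unchanged, but the schema handed to the writer no longer says the column may be null.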
   
   I'm currently looking into this, and I'm creating this issue for tracking and awareness.
   
   Important Note:
   
   Spark 3.3 and Spark 3.5 do not have this bug based on my testing.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]