Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

via GitHub Mon, 23 Oct 2023 11:02:03 -0700


yihua commented on code in PR #9876:
URL: https://github.com/apache/hudi/pull/9876#discussion_r1369060242



##########
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/payload/ExpressionPayload.scala:
##########
@@ -411,10 +414,14 @@ object ExpressionPayload {
     parseSchema(props.getProperty(PAYLOAD_RECORD_AVRO_SCHEMA))
   }
 
-  private def getWriterSchema(props: Properties): Schema = {
-    ValidationUtils.checkArgument(props.containsKey(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key),
-      s"Missing ${HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key} property")
-    parseSchema(props.getProperty(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key))
+  private def getWriterSchema(props: Properties, isPartialUpdate: Boolean): Schema = {
+    if (isPartialUpdate) {
+      parseSchema(props.getProperty(HoodieWriteConfig.WRITE_PARTIAL_UPDATE_SCHEMA.key))

Review Comment:
   In this PR, for updates in MOR tables, after processing the Spark SQL
   MERGE INTO statement, the writer gets the updates with the partial schema
   and passes them to the `HoodieAppendHandle`.  Regardless, the original
   intent of including `FULL_SCHEMA` is to merge partial updates on the
   reader side.
   
   If we assume that, when merging partial updates, the value of a non-updated
   column is either the existing value (the column is in the existing schema)
   or null (the column is new in the evolved schema), then `FULL_SCHEMA` may
   not need to be stored in the log block header.  See the following examples:
   
   ```
   Example 1:
   base file: schema (col1, col2) (full schema at this instant: (col1, col2))
   log 1: partial, schema (col2, col3) (full schema at this instant: (col1, col2, col3))
   after log merging: schema (col1, col2, col3)
   (col1 values from base file, col2, col3 values from log1 for overwrite with latest)
   
   Example 2:
   base file: schema (col1, col2) (full schema at this instant: (col1, col2))
   log 1: partial, schema (col2, col3) (full schema at this instant: (col1, col2, col3, col4))
   after log merging: schema (col1, col2, col3)
   project to full schema: (col1, col2, col3) -> (col1, col2, col3, col4), with nulls in col4
   (col1 values from base file, col2, col3 values from log1 for overwrite with latest, col4 has nulls)
   ```
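   
   The two examples can be sketched in plain Scala as below.  This is only an
   illustration of the assumed merge semantics (overwrite with latest, then
   project to the full schema with nulls), not the actual Hudi reader code.
   `PartialMergeSketch`, `mergePartial` and `projectToFullSchema` are
   hypothetical names, and records are modeled as simple maps instead of Avro
   records:
   
   ```scala
   // Hypothetical sketch; none of these names exist in Hudi.
   object PartialMergeSketch {
     // A record is modeled as column name -> value (assumption for illustration).
     type Record = Map[String, Any]
   
     // Overwrite with latest: columns present in the partial update win,
     // all other columns keep the value from the existing record.
     def mergePartial(existing: Record, partialUpdate: Record): Record =
       existing ++ partialUpdate
   
     // Project a merged record onto the full schema at read time.
     // Columns that appear in neither the base file nor the log are null.
     def projectToFullSchema(record: Record, fullSchema: Seq[String]): Seq[(String, Any)] =
       fullSchema.map(col => col -> record.getOrElse(col, null))
   
     def main(args: Array[String]): Unit = {
       val base: Record = Map("col1" -> "a", "col2" -> "b")   // base file: (col1, col2)
       val log1: Record = Map("col2" -> "b2", "col3" -> "c")  // log 1, partial: (col2, col3)
   
       val merged = mergePartial(base, log1)
       // Example 1: full schema is (col1, col2, col3)
       println(projectToFullSchema(merged, Seq("col1", "col2", "col3")))
       // Example 2: full schema is (col1, col2, col3, col4), col4 is filled with null
       println(projectToFullSchema(merged, Seq("col1", "col2", "col3", "col4")))
     }
   }
   ```
   
   In this sketch the merge itself never consults the full schema; only the
   final projection does, and the reader can supply that schema at query time.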



