yihua commented on code in PR #9876: URL: https://github.com/apache/hudi/pull/9876#discussion_r1369060242
##########
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/payload/ExpressionPayload.scala:
##########
@@ -411,10 +414,14 @@ object ExpressionPayload {
     parseSchema(props.getProperty(PAYLOAD_RECORD_AVRO_SCHEMA))
   }
 
-  private def getWriterSchema(props: Properties): Schema = {
-    ValidationUtils.checkArgument(props.containsKey(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key),
-      s"Missing ${HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key} property")
-    parseSchema(props.getProperty(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key))
+  private def getWriterSchema(props: Properties, isPartialUpdate: Boolean): Schema = {
+    if (isPartialUpdate) {
+      parseSchema(props.getProperty(HoodieWriteConfig.WRITE_PARTIAL_UPDATE_SCHEMA.key))

Review Comment:
   In this PR, for updates in MOR tables, after processing the Spark SQL MERGE INTO statement, the writer gets the updates with the partial schema and passes them to the `HoodieAppendHandle`. Regardless, the original intent of including `FULL_SCHEMA` is to support merging partial updates on the reader side. If we assume that, when merging partial updates, the value of a non-updated column is either the existing value (the column is in the existing schema) or null (the column is new in the evolved schema), then the `FULL_SCHEMA` may not need to be stored in the log block header.
   See the following examples:

   ```
   Example 1:
   base file: schema (col1, col2)
              (full schema at this instant: (col1, col2))
   log 1: partial, schema (col2, col3)
          (full schema at this instant: (col1, col2, col3))
   after log merging: schema (col1, col2, col3)
   (col1 values from base file, col2, col3 values from log1 for overwrite with latest)

   Example 2:
   base file: schema (col1, col2)
              (full schema at this instant: (col1, col2))
   log 1: partial, schema (col2, col3)
          (full schema at this instant: (col1, col2, col3, col4))
   after log merging: schema (col1, col2, col3)
   project to full schema: (col1, col2, col3) -> (col1, col2, col3, col4), with nulls in col4
   (col1 values from base file, col2, col3 values from log1 for overwrite with latest, col4 has nulls)
   ```
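
   The two examples above can be sketched in a few lines of Scala. This is only a simplified model of the assumption under discussion, not Hudi's reader code: a schema is modelled as an ordered list of column names and a record as a `Map`, and the names `PartialMergeSketch`, `mergedSchema`, `mergePartial`, and `projectTo` are hypothetical helpers introduced for illustration.

   ```scala
   // Simplified model, NOT Hudi's actual API: a schema is an ordered list of
   // column names and a record maps column name -> value.
   object PartialMergeSketch {
     type Schema = Seq[String]
     type Record = Map[String, Any]

     /** Schema after log merging: base columns first, then columns that only the log block has. */
     def mergedSchema(base: Schema, partial: Schema): Schema =
       base ++ partial.filterNot(base.contains)

     /** Overwrite with latest: columns in the partial update win, other columns keep the base value. */
     def mergePartial(baseRecord: Record, partialRecord: Record): Record =
       baseRecord ++ partialRecord

     /** Project onto the reader's full schema; columns without a value become null. */
     def projectTo(fullSchema: Schema, record: Record): Record =
       fullSchema.map(col => col -> record.getOrElse(col, null)).toMap

     def main(args: Array[String]): Unit = {
       val base: Record = Map("col1" -> "a", "col2" -> "b")   // base file row
       val log1: Record = Map("col2" -> "b2", "col3" -> "c")  // partial update row

       // Example 1: merge the base file row with the partial log record.
       val merged = mergePartial(base, log1)
       println(mergedSchema(Seq("col1", "col2"), Seq("col2", "col3")))
       println(merged)

       // Example 2: the reader's schema has evolved to include col4.
       println(projectTo(Seq("col1", "col2", "col3", "col4"), merged))
     }
   }
   ```

   In this model neither step needs the full schema that was current when log 1 was written: the merge uses only the base and partial schemas, and the projection uses the schema the reader already has.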