Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
yihua merged PR #9876: URL: https://github.com/apache/hudi/pull/9876 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
hudi-bot commented on PR #9876: URL: https://github.com/apache/hudi/pull/9876#issuecomment-1780092114 ## CI report: * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN * eb5b62e94807c1b2b6942402b117fe9dc57d425b Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20487) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
hudi-bot commented on PR #9876: URL: https://github.com/apache/hudi/pull/9876#issuecomment-1779778741 ## CI report: * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN * bfdb36f31ef0b8670c82c308494f9af2f7ef1272 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20467) * eb5b62e94807c1b2b6942402b117fe9dc57d425b Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20487) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
hudi-bot commented on PR #9876: URL: https://github.com/apache/hudi/pull/9876#issuecomment-1779765466 ## CI report: * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN * bfdb36f31ef0b8670c82c308494f9af2f7ef1272 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20467) * eb5b62e94807c1b2b6942402b117fe9dc57d425b UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
yihua commented on code in PR #9876: URL: https://github.com/apache/hudi/pull/9876#discussion_r1371173652 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable.scala: ## @@ -261,7 +262,8 @@ class TestMergeIntoTable extends HoodieSparkSqlTestBase with ScalaAssertionSuppo } test("Test MergeInto for MOR table ") { -withRecordType()(withTempDir {tmp => +spark.sql(s"set ${MERGE_SMALL_FILE_GROUP_CANDIDATES_LIMIT.key} = 0") +withRecordType()(withTempDir { tmp => Review Comment: Yes, I'd like to make sure that my changes do not break MERGE INTO on MOR tables. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
danny0405 commented on code in PR #9876: URL: https://github.com/apache/hudi/pull/9876#discussion_r1371170669 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable.scala: ## @@ -261,7 +262,8 @@ class TestMergeIntoTable extends HoodieSparkSqlTestBase with ScalaAssertionSuppo } test("Test MergeInto for MOR table ") { -withRecordType()(withTempDir {tmp => +spark.sql(s"set ${MERGE_SMALL_FILE_GROUP_CANDIDATES_LIMIT.key} = 0") +withRecordType()(withTempDir { tmp => Review Comment: Got it, is it related with this change? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
yihua commented on code in PR #9876: URL: https://github.com/apache/hudi/pull/9876#discussion_r1371157540 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable.scala: ## @@ -261,7 +262,8 @@ class TestMergeIntoTable extends HoodieSparkSqlTestBase with ScalaAssertionSuppo } test("Test MergeInto for MOR table ") { -withRecordType()(withTempDir {tmp => +spark.sql(s"set ${MERGE_SMALL_FILE_GROUP_CANDIDATES_LIMIT.key} = 0") +withRecordType()(withTempDir { tmp => Review Comment: This is to ensure that for MOR table, log files are written. Otherwise, the MOR table generated by the test may not contain log files, which is not different than COW. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
hudi-bot commented on PR #9876: URL: https://github.com/apache/hudi/pull/9876#issuecomment-1778503193 ## CI report: * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN * bfdb36f31ef0b8670c82c308494f9af2f7ef1272 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20467) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
hudi-bot commented on PR #9876: URL: https://github.com/apache/hudi/pull/9876#issuecomment-1778461940 ## CI report: * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN * d96a7423b1c1bae13148744547726ed95ee5c6b7 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20465) * bfdb36f31ef0b8670c82c308494f9af2f7ef1272 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20467) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
danny0405 commented on code in PR #9876: URL: https://github.com/apache/hudi/pull/9876#discussion_r1371105049 ## hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable.scala: ## @@ -261,7 +262,8 @@ class TestMergeIntoTable extends HoodieSparkSqlTestBase with ScalaAssertionSuppo } test("Test MergeInto for MOR table ") { -withRecordType()(withTempDir {tmp => +spark.sql(s"set ${MERGE_SMALL_FILE_GROUP_CANDIDATES_LIMIT.key} = 0") +withRecordType()(withTempDir { tmp => Review Comment: Why this change -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
hudi-bot commented on PR #9876: URL: https://github.com/apache/hudi/pull/9876#issuecomment-1778351985 ## CI report: * b8bc65dc87cfd1305634bf16f96a97944ce85816 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20432) * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN * d96a7423b1c1bae13148744547726ed95ee5c6b7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20465) * bfdb36f31ef0b8670c82c308494f9af2f7ef1272 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20467) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
hudi-bot commented on PR #9876: URL: https://github.com/apache/hudi/pull/9876#issuecomment-1778343787 ## CI report: * b8bc65dc87cfd1305634bf16f96a97944ce85816 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20432) * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN * d96a7423b1c1bae13148744547726ed95ee5c6b7 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20465) * bfdb36f31ef0b8670c82c308494f9af2f7ef1272 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
hudi-bot commented on PR #9876: URL: https://github.com/apache/hudi/pull/9876#issuecomment-1778336139 ## CI report: * b8bc65dc87cfd1305634bf16f96a97944ce85816 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20432) * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN * d96a7423b1c1bae13148744547726ed95ee5c6b7 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
hudi-bot commented on PR #9876: URL: https://github.com/apache/hudi/pull/9876#issuecomment-1778296319 ## CI report: * b8bc65dc87cfd1305634bf16f96a97944ce85816 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20432) * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
yihua commented on PR #9876: URL: https://github.com/apache/hudi/pull/9876#issuecomment-1778076328 I discussed the comments with @danny0405 offline. Two things to address in this PR: (1) Instead of putting both partial and full schemas in the log block header, when partial updates are enabled, only the partial schema is added to the log block header in the same `SCHEMA` header and the full schema for snapshot reads is always going to be passed in from the table schema. To indicate the schema is partial, a new log block header `IS_PARTIAL` should be added. (2) We should let users in the MERGE INTO statement to specify if they want partial updates in the log files in MOR tables, e.g., using sth like `col = EXISTING` to indicate that the column values should be kept as is. We may not support this in the PR, but instead we should have an interim write config to control this behavior. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
yihua commented on code in PR #9876: URL: https://github.com/apache/hudi/pull/9876#discussion_r1370828538 ## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/payload/ExpressionPayload.scala: ## @@ -411,10 +414,14 @@ object ExpressionPayload { parseSchema(props.getProperty(PAYLOAD_RECORD_AVRO_SCHEMA)) } - private def getWriterSchema(props: Properties): Schema = { - ValidationUtils.checkArgument(props.containsKey(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key), - s"Missing ${HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key} property") -parseSchema(props.getProperty(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key)) + private def getWriterSchema(props: Properties, isPartialUpdate: Boolean): Schema = { +if (isPartialUpdate) { + parseSchema(props.getProperty(HoodieWriteConfig.WRITE_PARTIAL_UPDATE_SCHEMA.key)) Review Comment: Agree that option 1 is the most natural handling. In the current schema evolution on write, the write schema is evolved based on the input, and the evolved schema is written to the commit. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
danny0405 commented on code in PR #9876: URL: https://github.com/apache/hudi/pull/9876#discussion_r1369447936 ## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/payload/ExpressionPayload.scala: ## @@ -411,10 +414,14 @@ object ExpressionPayload { parseSchema(props.getProperty(PAYLOAD_RECORD_AVRO_SCHEMA)) } - private def getWriterSchema(props: Properties): Schema = { - ValidationUtils.checkArgument(props.containsKey(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key), - s"Missing ${HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key} property") -parseSchema(props.getProperty(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key)) + private def getWriterSchema(props: Properties, isPartialUpdate: Boolean): Schema = { +if (isPartialUpdate) { + parseSchema(props.getProperty(HoodieWriteConfig.WRITE_PARTIAL_UPDATE_SCHEMA.key)) Review Comment: Generally we may have 3 modes for fields that not updated in partial update: 1. keep it as it is; 2. force update it as null; (which I think should never happen in real case); 3. overwrite with default (if the detault is defined in the schema) I think 1 is the most natural handling, but in any case, there reader should always use it's own reader schema for merging, not the writer schema. Another question is when to evolve the table schema, does it happends before or after the commit succeed? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
yihua commented on code in PR #9876: URL: https://github.com/apache/hudi/pull/9876#discussion_r1369060242 ## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/payload/ExpressionPayload.scala: ## @@ -411,10 +414,14 @@ object ExpressionPayload { parseSchema(props.getProperty(PAYLOAD_RECORD_AVRO_SCHEMA)) } - private def getWriterSchema(props: Properties): Schema = { - ValidationUtils.checkArgument(props.containsKey(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key), - s"Missing ${HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key} property") -parseSchema(props.getProperty(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key)) + private def getWriterSchema(props: Properties, isPartialUpdate: Boolean): Schema = { +if (isPartialUpdate) { + parseSchema(props.getProperty(HoodieWriteConfig.WRITE_PARTIAL_UPDATE_SCHEMA.key)) Review Comment: In this PR, for updates in MOR tables, after processing the Spark SQL MERGE INTO statement, the writer gets the updates with partial schema and pass them to the `HoodieAppendHandle`. Regardless, the original intent to include `FULL_SCHEMA` is for merging partial updates at the reader side. If we assume that values for a non-updated column should be either existing value (column in the existing schema) or null (new column in the evolved schema) in merging partial updates, the `FULL_SCHEMA` may not be stored in the log block header. See the following examples: ``` Example 1: base file: schema (col1, col2) (full schema at this instant: (col1, col2)) log 1: partial, schema (col2, col3) (full schema at this instant: (col1, col2, col3)) after log merging: schema (col1, col2, col3) (col1 values from base file, col2, col3 values from log1 for overwrite with latest) Example 2: base file: schema (col1, col2) (full schema at this instant: (col1, col2)) log 1: partial, schema (col2, col3) (full schema at this instant: (col1, col2, col3, col4)) after log merging: schema (col1, col2, col3) project to full schema: (col1, col2, col3) -> (col1, col2, col3, col4), with nulls in col4 (col1 values from base file, col2, col3 values from log1 for overwrite with latest, col4 has nulls) ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
yihua commented on code in PR #9876: URL: https://github.com/apache/hudi/pull/9876#discussion_r1369039030 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java: ## @@ -652,6 +660,16 @@ private static Map getUpdatedHeader(Map
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
danny0405 commented on code in PR #9876: URL: https://github.com/apache/hudi/pull/9876#discussion_r1368076982 ## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/payload/ExpressionPayload.scala: ## @@ -411,10 +414,14 @@ object ExpressionPayload { parseSchema(props.getProperty(PAYLOAD_RECORD_AVRO_SCHEMA)) } - private def getWriterSchema(props: Properties): Schema = { - ValidationUtils.checkArgument(props.containsKey(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key), - s"Missing ${HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key} property") -parseSchema(props.getProperty(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key)) + private def getWriterSchema(props: Properties, isPartialUpdate: Boolean): Schema = { +if (isPartialUpdate) { + parseSchema(props.getProperty(HoodieWriteConfig.WRITE_PARTIAL_UPDATE_SCHEMA.key)) Review Comment: I'm not very approval with two schemas here, partial update is actually a special case of schema evolution (that only supplementing new columns and never drop columns), the writer should still take the full schema always, and we can persist the partial schema to the lock block for some optimization purposes (for example, the handing of missing values as nulls, with partial schema, we can know that whether a null value comes from partial update or it should be force updated as null). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
danny0405 commented on code in PR #9876: URL: https://github.com/apache/hudi/pull/9876#discussion_r1368076982 ## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/payload/ExpressionPayload.scala: ## @@ -411,10 +414,14 @@ object ExpressionPayload { parseSchema(props.getProperty(PAYLOAD_RECORD_AVRO_SCHEMA)) } - private def getWriterSchema(props: Properties): Schema = { - ValidationUtils.checkArgument(props.containsKey(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key), - s"Missing ${HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key} property") -parseSchema(props.getProperty(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key)) + private def getWriterSchema(props: Properties, isPartialUpdate: Boolean): Schema = { +if (isPartialUpdate) { + parseSchema(props.getProperty(HoodieWriteConfig.WRITE_PARTIAL_UPDATE_SCHEMA.key)) Review Comment: I'm not very approval with two schemas here, partial update is actually a special case of schema evolution (that only supplementing new columns and never drop columns), the writer should still takes the full schema always, and we can persist the partial schema to the lock block for some optimization purposes (for example, the handing of missing values as nulls, with partial schema, we can know that whether a null value comes from partial update or it should be force updated as null). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
danny0405 commented on code in PR #9876: URL: https://github.com/apache/hudi/pull/9876#discussion_r1368075473 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java: ## @@ -652,6 +660,16 @@ private static Map getUpdatedHeader(Map
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
hudi-bot commented on PR #9876: URL: https://github.com/apache/hudi/pull/9876#issuecomment-1773965020 ## CI report: * b8bc65dc87cfd1305634bf16f96a97944ce85816 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20432) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
hudi-bot commented on PR #9876: URL: https://github.com/apache/hudi/pull/9876#issuecomment-1773948275 ## CI report: * 794904512405851fa42c10927c315ca55d82fbdc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20372) * b8bc65dc87cfd1305634bf16f96a97944ce85816 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20432) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
hudi-bot commented on PR #9876: URL: https://github.com/apache/hudi/pull/9876#issuecomment-1773947285 ## CI report: * 794904512405851fa42c10927c315ca55d82fbdc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20372) * b8bc65dc87cfd1305634bf16f96a97944ce85816 UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
yihua commented on code in PR #9876: URL: https://github.com/apache/hudi/pull/9876#discussion_r1367807590 ## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala: ## @@ -449,21 +466,58 @@ case class MergeIntoHoodieTableCommand(mergeInto: MergeIntoTable) extends Hoodie new StructType(targetTableSchema), structName, nameSpace) } + /** + * @param conditionalAssignments Conditional assignments. + * @return Updated fields based on the conditional assignments in the MERGE INTO statement. + */ + private def getUpdatedFields(conditionalAssignments: Seq[Seq[Assignment]]): Seq[String] = { +val updatedFieldsSeq = { + conditionalAssignments.flatMap { +case assignments => + // Extract all fields that are updated through the assignments + if (assignments.nonEmpty) { +assignments.map { + case Assignment(attr: Attribute, _) => attr + case a => +throw new AnalysisException(s"Only assignments of the form `t.field = ...` are supported at the moment (provided: `${a.sql}`)") Review Comment: No, this is the same logic as how the expressions are processed to generate assignments, just that here the logic only concerns about the updated fields. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
yihua commented on code in PR #9876: URL: https://github.com/apache/hudi/pull/9876#discussion_r1367807326 ## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/payload/ExpressionPayload.scala: ## @@ -411,10 +414,14 @@ object ExpressionPayload { parseSchema(props.getProperty(PAYLOAD_RECORD_AVRO_SCHEMA)) } - private def getWriterSchema(props: Properties): Schema = { - ValidationUtils.checkArgument(props.containsKey(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key), - s"Missing ${HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key} property") -parseSchema(props.getProperty(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key)) + private def getWriterSchema(props: Properties, isPartialUpdate: Boolean): Schema = { +if (isPartialUpdate) { + parseSchema(props.getProperty(HoodieWriteConfig.WRITE_PARTIAL_UPDATE_SCHEMA.key)) Review Comment: It depends on what the `SCHEMA` header in the log block refers to. My assumption is that the `SCHEMA` header corresponds to the records written to the log block, so if the records are partial, `SCHMEA` is partial, and we use another header to indicate it's partial, to differentiate from schema evolution. Either way, do you agree that we need to keep both full and partial schemas in the log block header? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
yihua commented on code in PR #9876: URL: https://github.com/apache/hudi/pull/9876#discussion_r1367806982 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java: ## @@ -652,6 +660,16 @@ private static Map getUpdatedHeader(Map
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
yihua commented on code in PR #9876: URL: https://github.com/apache/hudi/pull/9876#discussion_r1367806866 ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java: ## @@ -755,6 +756,24 @@ public class HoodieWriteConfig extends HoodieConfig { .withDocumentation("Whether to write record positions to the block header for data blocks containing updates and delete blocks. " + "The record positions can be used to improve the performance of merging records from base and log files."); + public static final ConfigProperty WRITE_PARTIAL_UPDATES = ConfigProperty + .key("hoodie.write.partial.updates") + .defaultValue(false) + .markAdvanced() + .sinceVersion("1.0.0") + .withDocumentation("Whether to write partial updates to the data blocks containing updates " + + "in MOR tables. The data blocks containing partial updates have a schema with a " + + "subset of fields compared to the full schema of the table. Partial updates are " + + "automatically turned on for Spark SQL MERGE INTO statement with upserts to MOR tables."); + + public static final ConfigProperty WRITE_PARTIAL_UPDATE_SCHEMA = ConfigProperty + .key("hoodie.write.partial.update.schema") + .defaultValue("") + .markAdvanced() Review Comment: Makes sense. will fix it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
danny0405 commented on code in PR #9876: URL: https://github.com/apache/hudi/pull/9876#discussion_r1365081774 ## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala: ## @@ -449,21 +466,58 @@ case class MergeIntoHoodieTableCommand(mergeInto: MergeIntoTable) extends Hoodie new StructType(targetTableSchema), structName, nameSpace) } + /** + * @param conditionalAssignments Conditional assignments. + * @return Updated fields based on the conditional assignments in the MERGE INTO statement. + */ + private def getUpdatedFields(conditionalAssignments: Seq[Seq[Assignment]]): Seq[String] = { +val updatedFieldsSeq = { + conditionalAssignments.flatMap { +case assignments => + // Extract all fields that are updated through the assignments + if (assignments.nonEmpty) { +assignments.map { + case Assignment(attr: Attribute, _) => attr + case a => +throw new AnalysisException(s"Only assignments of the form `t.field = ...` are supported at the moment (provided: `${a.sql}`)") Review Comment: Does the throwing cause any functional regression? ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java: ## @@ -755,6 +756,24 @@ public class HoodieWriteConfig extends HoodieConfig { .withDocumentation("Whether to write record positions to the block header for data blocks containing updates and delete blocks. " + "The record positions can be used to improve the performance of merging records from base and log files."); + public static final ConfigProperty WRITE_PARTIAL_UPDATES = ConfigProperty + .key("hoodie.write.partial.updates") + .defaultValue(false) + .markAdvanced() + .sinceVersion("1.0.0") + .withDocumentation("Whether to write partial updates to the data blocks containing updates " + + "in MOR tables. The data blocks containing partial updates have a schema with a " + + "subset of fields compared to the full schema of the table. Partial updates are " + + "automatically turned on for Spark SQL MERGE INTO statement with upserts to MOR tables."); + + public static final ConfigProperty WRITE_PARTIAL_UPDATE_SCHEMA = ConfigProperty + .key("hoodie.write.partial.update.schema") + .defaultValue("") + .markAdvanced() Review Comment: Can we merge the two options into one, if `hoodie.write.partial.update.schema` is non-empty, that means the partial update is turned on. ## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/payload/ExpressionPayload.scala: ## @@ -411,10 +414,14 @@ object ExpressionPayload { parseSchema(props.getProperty(PAYLOAD_RECORD_AVRO_SCHEMA)) } - private def getWriterSchema(props: Properties): Schema = { - ValidationUtils.checkArgument(props.containsKey(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key), - s"Missing ${HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key} property") -parseSchema(props.getProperty(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key)) + private def getWriterSchema(props: Properties, isPartialUpdate: Boolean): Schema = { +if (isPartialUpdate) { + parseSchema(props.getProperty(HoodieWriteConfig.WRITE_PARTIAL_UPDATE_SCHEMA.key)) Review Comment: I was actually expecting we always use the full schema as the write schema, while for partial update, it takes along another configuration for the partial schema, which is more straight forward. ## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java: ## @@ -652,6 +660,16 @@ private static Map getUpdatedHeader(Map
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
hudi-bot commented on PR #9876: URL: https://github.com/apache/hudi/pull/9876#issuecomment-1767619892 ## CI report: * 794904512405851fa42c10927c315ca55d82fbdc Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20372) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
hudi-bot commented on PR #9876: URL: https://github.com/apache/hudi/pull/9876#issuecomment-1767534035 ## CI report: * 794904512405851fa42c10927c315ca55d82fbdc Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20372) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]
hudi-bot commented on PR #9876: URL: https://github.com/apache/hudi/pull/9876#issuecomment-1767526069 ## CI report: * 794904512405851fa42c10927c315ca55d82fbdc UNKNOWN Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org