Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-25 Thread via GitHub


yihua merged PR #9876:
URL: https://github.com/apache/hudi/pull/9876


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9876:
URL: https://github.com/apache/hudi/pull/9876#issuecomment-1780092114

   
   ## CI report:
   
   * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN
   * eb5b62e94807c1b2b6942402b117fe9dc57d425b Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20487)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9876:
URL: https://github.com/apache/hudi/pull/9876#issuecomment-1779778741

   
   ## CI report:
   
   * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN
   * bfdb36f31ef0b8670c82c308494f9af2f7ef1272 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20467)
 
   * eb5b62e94807c1b2b6942402b117fe9dc57d425b Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20487)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-25 Thread via GitHub


hudi-bot commented on PR #9876:
URL: https://github.com/apache/hudi/pull/9876#issuecomment-1779765466

   
   ## CI report:
   
   * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN
   * bfdb36f31ef0b8670c82c308494f9af2f7ef1272 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20467)
 
   * eb5b62e94807c1b2b6942402b117fe9dc57d425b UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-24 Thread via GitHub


yihua commented on code in PR #9876:
URL: https://github.com/apache/hudi/pull/9876#discussion_r1371173652


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable.scala:
##
@@ -261,7 +262,8 @@ class TestMergeIntoTable extends HoodieSparkSqlTestBase 
with ScalaAssertionSuppo
   }
 
   test("Test MergeInto for MOR table ") {
-withRecordType()(withTempDir {tmp =>
+spark.sql(s"set ${MERGE_SMALL_FILE_GROUP_CANDIDATES_LIMIT.key} = 0")
+withRecordType()(withTempDir { tmp =>

Review Comment:
   Yes, I'd like to make sure that my changes do not break MERGE INTO on MOR 
tables. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-24 Thread via GitHub


danny0405 commented on code in PR #9876:
URL: https://github.com/apache/hudi/pull/9876#discussion_r1371170669


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable.scala:
##
@@ -261,7 +262,8 @@ class TestMergeIntoTable extends HoodieSparkSqlTestBase 
with ScalaAssertionSuppo
   }
 
   test("Test MergeInto for MOR table ") {
-withRecordType()(withTempDir {tmp =>
+spark.sql(s"set ${MERGE_SMALL_FILE_GROUP_CANDIDATES_LIMIT.key} = 0")
+withRecordType()(withTempDir { tmp =>

Review Comment:
   Got it, is it related with this change?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-24 Thread via GitHub


yihua commented on code in PR #9876:
URL: https://github.com/apache/hudi/pull/9876#discussion_r1371157540


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable.scala:
##
@@ -261,7 +262,8 @@ class TestMergeIntoTable extends HoodieSparkSqlTestBase 
with ScalaAssertionSuppo
   }
 
   test("Test MergeInto for MOR table ") {
-withRecordType()(withTempDir {tmp =>
+spark.sql(s"set ${MERGE_SMALL_FILE_GROUP_CANDIDATES_LIMIT.key} = 0")
+withRecordType()(withTempDir { tmp =>

Review Comment:
   This is to ensure that for MOR table, log files are written.  Otherwise, the 
MOR table generated by the test may not contain log files, which is not 
different than COW.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-24 Thread via GitHub


hudi-bot commented on PR #9876:
URL: https://github.com/apache/hudi/pull/9876#issuecomment-1778503193

   
   ## CI report:
   
   * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN
   * bfdb36f31ef0b8670c82c308494f9af2f7ef1272 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20467)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-24 Thread via GitHub


hudi-bot commented on PR #9876:
URL: https://github.com/apache/hudi/pull/9876#issuecomment-1778461940

   
   ## CI report:
   
   * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN
   * d96a7423b1c1bae13148744547726ed95ee5c6b7 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20465)
 
   * bfdb36f31ef0b8670c82c308494f9af2f7ef1272 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20467)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-24 Thread via GitHub


danny0405 commented on code in PR #9876:
URL: https://github.com/apache/hudi/pull/9876#discussion_r1371105049


##
hudi-spark-datasource/hudi-spark/src/test/scala/org/apache/spark/sql/hudi/TestMergeIntoTable.scala:
##
@@ -261,7 +262,8 @@ class TestMergeIntoTable extends HoodieSparkSqlTestBase 
with ScalaAssertionSuppo
   }
 
   test("Test MergeInto for MOR table ") {
-withRecordType()(withTempDir {tmp =>
+spark.sql(s"set ${MERGE_SMALL_FILE_GROUP_CANDIDATES_LIMIT.key} = 0")
+withRecordType()(withTempDir { tmp =>

Review Comment:
   Why this change



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-24 Thread via GitHub


hudi-bot commented on PR #9876:
URL: https://github.com/apache/hudi/pull/9876#issuecomment-1778351985

   
   ## CI report:
   
   * b8bc65dc87cfd1305634bf16f96a97944ce85816 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20432)
 
   * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN
   * d96a7423b1c1bae13148744547726ed95ee5c6b7 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20465)
 
   * bfdb36f31ef0b8670c82c308494f9af2f7ef1272 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20467)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-24 Thread via GitHub


hudi-bot commented on PR #9876:
URL: https://github.com/apache/hudi/pull/9876#issuecomment-1778343787

   
   ## CI report:
   
   * b8bc65dc87cfd1305634bf16f96a97944ce85816 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20432)
 
   * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN
   * d96a7423b1c1bae13148744547726ed95ee5c6b7 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20465)
 
   * bfdb36f31ef0b8670c82c308494f9af2f7ef1272 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-24 Thread via GitHub


hudi-bot commented on PR #9876:
URL: https://github.com/apache/hudi/pull/9876#issuecomment-1778336139

   
   ## CI report:
   
   * b8bc65dc87cfd1305634bf16f96a97944ce85816 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20432)
 
   * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN
   * d96a7423b1c1bae13148744547726ed95ee5c6b7 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-24 Thread via GitHub


hudi-bot commented on PR #9876:
URL: https://github.com/apache/hudi/pull/9876#issuecomment-1778296319

   
   ## CI report:
   
   * b8bc65dc87cfd1305634bf16f96a97944ce85816 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20432)
 
   * 3672dea3c9d2512071dc27b99e24dfb3922a3b38 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-24 Thread via GitHub


yihua commented on PR #9876:
URL: https://github.com/apache/hudi/pull/9876#issuecomment-1778076328

   I discussed the comments with @danny0405 offline.  Two things to address in 
this PR:
   
   (1) Instead of putting both partial and full schemas in the log block 
header, when partial updates are enabled, only the partial schema is added to 
the log block header in the same `SCHEMA` header and the full schema for 
snapshot reads is always going to be passed in from the table schema.  To 
indicate the schema is partial, a new log block header `IS_PARTIAL` should be 
added.
   (2) We should let users in the MERGE INTO statement to specify if they want 
partial updates in the log files in MOR tables, e.g., using sth like `col = 
EXISTING` to indicate that the column values should be kept as is.  We may not 
support this in the PR, but instead we should have an interim write config to 
control this behavior.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-24 Thread via GitHub


yihua commented on code in PR #9876:
URL: https://github.com/apache/hudi/pull/9876#discussion_r1370828538


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/payload/ExpressionPayload.scala:
##
@@ -411,10 +414,14 @@ object ExpressionPayload {
 parseSchema(props.getProperty(PAYLOAD_RECORD_AVRO_SCHEMA))
   }
 
-  private def getWriterSchema(props: Properties): Schema = {
-
ValidationUtils.checkArgument(props.containsKey(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key),
-  s"Missing ${HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key} property")
-parseSchema(props.getProperty(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key))
+  private def getWriterSchema(props: Properties, isPartialUpdate: Boolean): 
Schema = {
+if (isPartialUpdate) {
+  
parseSchema(props.getProperty(HoodieWriteConfig.WRITE_PARTIAL_UPDATE_SCHEMA.key))

Review Comment:
   Agree that option 1 is the most natural handling.
   
   In the current schema evolution on write, the write schema is evolved based 
on the input, and the evolved schema is written to the commit.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-23 Thread via GitHub


danny0405 commented on code in PR #9876:
URL: https://github.com/apache/hudi/pull/9876#discussion_r1369447936


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/payload/ExpressionPayload.scala:
##
@@ -411,10 +414,14 @@ object ExpressionPayload {
 parseSchema(props.getProperty(PAYLOAD_RECORD_AVRO_SCHEMA))
   }
 
-  private def getWriterSchema(props: Properties): Schema = {
-
ValidationUtils.checkArgument(props.containsKey(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key),
-  s"Missing ${HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key} property")
-parseSchema(props.getProperty(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key))
+  private def getWriterSchema(props: Properties, isPartialUpdate: Boolean): 
Schema = {
+if (isPartialUpdate) {
+  
parseSchema(props.getProperty(HoodieWriteConfig.WRITE_PARTIAL_UPDATE_SCHEMA.key))

Review Comment:
   Generally we may have 3 modes for fields that not updated in partial update:
   1. keep it as it is;
   2. force update it as null; (which I think should never happen in real case);
   3. overwrite with default (if the detault is defined in the schema)
   
   I think 1 is the most natural handling, but in any case, there reader should 
always use it's own reader schema for merging, not the writer schema.
   
   Another question is when to evolve the table schema, does it happends before 
or after the commit succeed?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-23 Thread via GitHub


yihua commented on code in PR #9876:
URL: https://github.com/apache/hudi/pull/9876#discussion_r1369060242


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/payload/ExpressionPayload.scala:
##
@@ -411,10 +414,14 @@ object ExpressionPayload {
 parseSchema(props.getProperty(PAYLOAD_RECORD_AVRO_SCHEMA))
   }
 
-  private def getWriterSchema(props: Properties): Schema = {
-
ValidationUtils.checkArgument(props.containsKey(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key),
-  s"Missing ${HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key} property")
-parseSchema(props.getProperty(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key))
+  private def getWriterSchema(props: Properties, isPartialUpdate: Boolean): 
Schema = {
+if (isPartialUpdate) {
+  
parseSchema(props.getProperty(HoodieWriteConfig.WRITE_PARTIAL_UPDATE_SCHEMA.key))

Review Comment:
   In this PR, for updates in MOR tables, after processing the Spark SQL MERGE 
INTO statement, the writer gets the updates with partial schema and pass them 
to the `HoodieAppendHandle`.  Regardless, the original intent to include 
`FULL_SCHEMA` is for merging partial updates at the reader side.
   
   If we assume that values for a non-updated column should be either existing 
value (column in the existing schema) or null (new column in the evolved 
schema) in merging partial updates, the `FULL_SCHEMA` may not be stored in the 
log block header.  See the following examples:
   
   ```
   Example 1:
   base file: schema (col1, col2) (full schema at this instant: (col1, col2))
   log 1: partial, schema (col2, col3) (full schema at this instant: (col1, 
col2, col3))
   after log merging: schema (col1, col2, col3) 
   (col1 values from base file, col2, col3 values from log1 for overwrite with 
latest)
   
   Example 2:
   base file: schema (col1, col2) (full schema at this instant: (col1, col2))
   log 1: partial, schema (col2, col3) (full schema at this instant: (col1, 
col2, col3, col4))
   after log merging: schema (col1, col2, col3)
   project to full schema: (col1, col2, col3) -> (col1, col2, col3, col4), with 
nulls in col4
   (col1 values from base file, col2, col3 values from log1 for overwrite with 
latest, col4 has nulls)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-23 Thread via GitHub


yihua commented on code in PR #9876:
URL: https://github.com/apache/hudi/pull/9876#discussion_r1369039030


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:
##
@@ -652,6 +660,16 @@ private static Map 
getUpdatedHeader(Map

Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-22 Thread via GitHub


danny0405 commented on code in PR #9876:
URL: https://github.com/apache/hudi/pull/9876#discussion_r1368076982


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/payload/ExpressionPayload.scala:
##
@@ -411,10 +414,14 @@ object ExpressionPayload {
 parseSchema(props.getProperty(PAYLOAD_RECORD_AVRO_SCHEMA))
   }
 
-  private def getWriterSchema(props: Properties): Schema = {
-
ValidationUtils.checkArgument(props.containsKey(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key),
-  s"Missing ${HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key} property")
-parseSchema(props.getProperty(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key))
+  private def getWriterSchema(props: Properties, isPartialUpdate: Boolean): 
Schema = {
+if (isPartialUpdate) {
+  
parseSchema(props.getProperty(HoodieWriteConfig.WRITE_PARTIAL_UPDATE_SCHEMA.key))

Review Comment:
   I'm not very approval with two schemas here, partial update is actually a 
special case of schema evolution (that only supplementing new columns and never 
drop columns), the writer should still take the full schema always, and we can 
persist the partial schema to the lock block for some optimization purposes 
(for example, the handing of missing values as nulls, with partial schema, we 
can know that whether a null value comes from partial update or it should be 
force updated as null).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-22 Thread via GitHub


danny0405 commented on code in PR #9876:
URL: https://github.com/apache/hudi/pull/9876#discussion_r1368076982


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/payload/ExpressionPayload.scala:
##
@@ -411,10 +414,14 @@ object ExpressionPayload {
 parseSchema(props.getProperty(PAYLOAD_RECORD_AVRO_SCHEMA))
   }
 
-  private def getWriterSchema(props: Properties): Schema = {
-
ValidationUtils.checkArgument(props.containsKey(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key),
-  s"Missing ${HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key} property")
-parseSchema(props.getProperty(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key))
+  private def getWriterSchema(props: Properties, isPartialUpdate: Boolean): 
Schema = {
+if (isPartialUpdate) {
+  
parseSchema(props.getProperty(HoodieWriteConfig.WRITE_PARTIAL_UPDATE_SCHEMA.key))

Review Comment:
   I'm not very approval with two schemas here, partial update is actually a 
special case of schema evolution (that only supplementing new columns and never 
drop columns), the writer should still takes the full schema always, and we can 
persist the partial schema to the lock block for some optimization purposes 
(for example, the handing of missing values as nulls, with partial schema, we 
can know that whether a null value comes from partial update or it should be 
force updated as null).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-22 Thread via GitHub


danny0405 commented on code in PR #9876:
URL: https://github.com/apache/hudi/pull/9876#discussion_r1368075473


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:
##
@@ -652,6 +660,16 @@ private static Map 
getUpdatedHeader(Map

Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-21 Thread via GitHub


hudi-bot commented on PR #9876:
URL: https://github.com/apache/hudi/pull/9876#issuecomment-1773965020

   
   ## CI report:
   
   * b8bc65dc87cfd1305634bf16f96a97944ce85816 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20432)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-21 Thread via GitHub


hudi-bot commented on PR #9876:
URL: https://github.com/apache/hudi/pull/9876#issuecomment-1773948275

   
   ## CI report:
   
   * 794904512405851fa42c10927c315ca55d82fbdc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20372)
 
   * b8bc65dc87cfd1305634bf16f96a97944ce85816 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20432)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-21 Thread via GitHub


hudi-bot commented on PR #9876:
URL: https://github.com/apache/hudi/pull/9876#issuecomment-1773947285

   
   ## CI report:
   
   * 794904512405851fa42c10927c315ca55d82fbdc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20372)
 
   * b8bc65dc87cfd1305634bf16f96a97944ce85816 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-21 Thread via GitHub


yihua commented on code in PR #9876:
URL: https://github.com/apache/hudi/pull/9876#discussion_r1367807590


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala:
##
@@ -449,21 +466,58 @@ case class MergeIntoHoodieTableCommand(mergeInto: 
MergeIntoTable) extends Hoodie
   new StructType(targetTableSchema), structName, nameSpace)
   }
 
+  /**
+   * @param conditionalAssignments Conditional assignments.
+   * @return Updated fields based on the conditional assignments in the MERGE 
INTO statement.
+   */
+  private def getUpdatedFields(conditionalAssignments: Seq[Seq[Assignment]]): 
Seq[String] = {
+val updatedFieldsSeq = {
+  conditionalAssignments.flatMap {
+case assignments =>
+  // Extract all fields that are updated through the assignments
+  if (assignments.nonEmpty) {
+assignments.map {
+  case Assignment(attr: Attribute, _) => attr
+  case a =>
+throw new AnalysisException(s"Only assignments of the form 
`t.field = ...` are supported at the moment (provided: `${a.sql}`)")

Review Comment:
   No, this is the same logic as how the expressions are processed to generate 
assignments, just that here the logic only concerns about the updated fields.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-21 Thread via GitHub


yihua commented on code in PR #9876:
URL: https://github.com/apache/hudi/pull/9876#discussion_r1367807326


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/payload/ExpressionPayload.scala:
##
@@ -411,10 +414,14 @@ object ExpressionPayload {
 parseSchema(props.getProperty(PAYLOAD_RECORD_AVRO_SCHEMA))
   }
 
-  private def getWriterSchema(props: Properties): Schema = {
-
ValidationUtils.checkArgument(props.containsKey(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key),
-  s"Missing ${HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key} property")
-parseSchema(props.getProperty(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key))
+  private def getWriterSchema(props: Properties, isPartialUpdate: Boolean): 
Schema = {
+if (isPartialUpdate) {
+  
parseSchema(props.getProperty(HoodieWriteConfig.WRITE_PARTIAL_UPDATE_SCHEMA.key))

Review Comment:
   It depends on what the `SCHEMA` header in the log block refers to.  My 
assumption is that the `SCHEMA` header corresponds to the records written to 
the log block, so if the records are partial, `SCHMEA` is partial, and we use 
another header to indicate it's partial, to differentiate from schema evolution.
   
   Either way, do you agree that we need to keep both full and partial schemas 
in the log block header?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-21 Thread via GitHub


yihua commented on code in PR #9876:
URL: https://github.com/apache/hudi/pull/9876#discussion_r1367806982


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:
##
@@ -652,6 +660,16 @@ private static Map 
getUpdatedHeader(Map

Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-21 Thread via GitHub


yihua commented on code in PR #9876:
URL: https://github.com/apache/hudi/pull/9876#discussion_r1367806866


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##
@@ -755,6 +756,24 @@ public class HoodieWriteConfig extends HoodieConfig {
   .withDocumentation("Whether to write record positions to the block 
header for data blocks containing updates and delete blocks. "
   + "The record positions can be used to improve the performance of 
merging records from base and log files.");
 
+  public static final ConfigProperty WRITE_PARTIAL_UPDATES = 
ConfigProperty
+  .key("hoodie.write.partial.updates")
+  .defaultValue(false)
+  .markAdvanced()
+  .sinceVersion("1.0.0")
+  .withDocumentation("Whether to write partial updates to the data blocks 
containing updates "
+  + "in MOR tables. The data blocks containing partial updates have a 
schema with a "
+  + "subset of fields compared to the full schema of the table. 
Partial updates are "
+  + "automatically turned on for Spark SQL MERGE INTO statement with 
upserts to MOR tables.");
+
+  public static final ConfigProperty WRITE_PARTIAL_UPDATE_SCHEMA = 
ConfigProperty
+  .key("hoodie.write.partial.update.schema")
+  .defaultValue("")
+  .markAdvanced()

Review Comment:
   Makes sense.  will fix it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-19 Thread via GitHub


danny0405 commented on code in PR #9876:
URL: https://github.com/apache/hudi/pull/9876#discussion_r1365081774


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/MergeIntoHoodieTableCommand.scala:
##
@@ -449,21 +466,58 @@ case class MergeIntoHoodieTableCommand(mergeInto: 
MergeIntoTable) extends Hoodie
   new StructType(targetTableSchema), structName, nameSpace)
   }
 
+  /**
+   * @param conditionalAssignments Conditional assignments.
+   * @return Updated fields based on the conditional assignments in the MERGE 
INTO statement.
+   */
+  private def getUpdatedFields(conditionalAssignments: Seq[Seq[Assignment]]): 
Seq[String] = {
+val updatedFieldsSeq = {
+  conditionalAssignments.flatMap {
+case assignments =>
+  // Extract all fields that are updated through the assignments
+  if (assignments.nonEmpty) {
+assignments.map {
+  case Assignment(attr: Attribute, _) => attr
+  case a =>
+throw new AnalysisException(s"Only assignments of the form 
`t.field = ...` are supported at the moment (provided: `${a.sql}`)")

Review Comment:
   Does the throwing cause any functional regression?



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieWriteConfig.java:
##
@@ -755,6 +756,24 @@ public class HoodieWriteConfig extends HoodieConfig {
   .withDocumentation("Whether to write record positions to the block 
header for data blocks containing updates and delete blocks. "
   + "The record positions can be used to improve the performance of 
merging records from base and log files.");
 
+  public static final ConfigProperty WRITE_PARTIAL_UPDATES = 
ConfigProperty
+  .key("hoodie.write.partial.updates")
+  .defaultValue(false)
+  .markAdvanced()
+  .sinceVersion("1.0.0")
+  .withDocumentation("Whether to write partial updates to the data blocks 
containing updates "
+  + "in MOR tables. The data blocks containing partial updates have a 
schema with a "
+  + "subset of fields compared to the full schema of the table. 
Partial updates are "
+  + "automatically turned on for Spark SQL MERGE INTO statement with 
upserts to MOR tables.");
+
+  public static final ConfigProperty WRITE_PARTIAL_UPDATE_SCHEMA = 
ConfigProperty
+  .key("hoodie.write.partial.update.schema")
+  .defaultValue("")
+  .markAdvanced()

Review Comment:
   Can we merge the two options into one, if 
`hoodie.write.partial.update.schema` is non-empty, that means the partial 
update is turned on.



##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/payload/ExpressionPayload.scala:
##
@@ -411,10 +414,14 @@ object ExpressionPayload {
 parseSchema(props.getProperty(PAYLOAD_RECORD_AVRO_SCHEMA))
   }
 
-  private def getWriterSchema(props: Properties): Schema = {
-
ValidationUtils.checkArgument(props.containsKey(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key),
-  s"Missing ${HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key} property")
-parseSchema(props.getProperty(HoodieWriteConfig.WRITE_SCHEMA_OVERRIDE.key))
+  private def getWriterSchema(props: Properties, isPartialUpdate: Boolean): 
Schema = {
+if (isPartialUpdate) {
+  
parseSchema(props.getProperty(HoodieWriteConfig.WRITE_PARTIAL_UPDATE_SCHEMA.key))

Review Comment:
   I was actually expecting we always use the full schema as the write schema, 
while for partial update, it takes along another configuration for the partial 
schema, which is more straight forward.



##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:
##
@@ -652,6 +660,16 @@ private static Map 
getUpdatedHeader(Map

Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9876:
URL: https://github.com/apache/hudi/pull/9876#issuecomment-1767619892

   
   ## CI report:
   
   * 794904512405851fa42c10927c315ca55d82fbdc Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20372)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9876:
URL: https://github.com/apache/hudi/pull/9876#issuecomment-1767534035

   
   ## CI report:
   
   * 794904512405851fa42c10927c315ca55d82fbdc Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=20372)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-6800] Support writing partial updates to the data blocks in MOR tables [hudi]

2023-10-17 Thread via GitHub


hudi-bot commented on PR #9876:
URL: https://github.com/apache/hudi/pull/9876#issuecomment-1767526069

   
   ## CI report:
   
   * 794904512405851fa42c10927c315ca55d82fbdc UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org