Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-18 Thread via GitHub


yihua closed pull request #10980: [HUDI-7578] Avoid unnecessary rewriting when 
copy old data from old base to new base file to improve compaction performance
URL: https://github.com/apache/hudi/pull/10980


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-16 Thread via GitHub


danny0405 commented on PR #10980:
URL: https://github.com/apache/hudi/pull/10980#issuecomment-2060281356

   Closing because this is fixed in #11028.





Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #11028:
URL: https://github.com/apache/hudi/pull/11028#issuecomment-2058284585

   
   ## CI report:
   
   * 67ca721df255223c873303aeccf7900c29f7811a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23278)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #11028:
URL: https://github.com/apache/hudi/pull/11028#issuecomment-2058193126

   
   ## CI report:
   
   * 67832fce75903cce3b3f66beb125f6a02fb82e11 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23277)
   * 67ca721df255223c873303aeccf7900c29f7811a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23278)
 
   
   



Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #11028:
URL: https://github.com/apache/hudi/pull/11028#issuecomment-2058187793

   
   ## CI report:
   
   * 67832fce75903cce3b3f66beb125f6a02fb82e11 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23277)
 
   * 67ca721df255223c873303aeccf7900c29f7811a UNKNOWN
   
   



Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #11028:
URL: https://github.com/apache/hudi/pull/11028#issuecomment-2058148444

   
   ## CI report:
   
   * 67832fce75903cce3b3f66beb125f6a02fb82e11 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23277)
 
   
   



Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #11028:
URL: https://github.com/apache/hudi/pull/11028#issuecomment-2058142670

   
   ## CI report:
   
   * 8fc55507a82ee1295f14c1125876b8395cfc27df Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23276)
   * 67832fce75903cce3b3f66beb125f6a02fb82e11 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23277)
 
   
   



Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #11028:
URL: https://github.com/apache/hudi/pull/11028#issuecomment-2058137317

   
   ## CI report:
   
   * 8fc55507a82ee1295f14c1125876b8395cfc27df Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23276)
 
   * 67832fce75903cce3b3f66beb125f6a02fb82e11 UNKNOWN
   
   



Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #11028:
URL: https://github.com/apache/hudi/pull/11028#issuecomment-2058104791

   
   ## CI report:
   
   * 8fc55507a82ee1295f14c1125876b8395cfc27df Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23276)
 
   
   



Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-15 Thread via GitHub


hudi-bot commented on PR #11028:
URL: https://github.com/apache/hudi/pull/11028#issuecomment-2058098918

   
   ## CI report:
   
   * 8fc55507a82ee1295f14c1125876b8395cfc27df UNKNOWN
   
   



[PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-15 Thread via GitHub


danny0405 opened a new pull request, #11028:
URL: https://github.com/apache/hudi/pull/11028

   ### Change Logs
   
   There is no need to copy for most of the use cases.
   
   ### Impact
   
   no impact.
   
   ### Risk level (write none, low medium or high below)
   
   none
   
   ### Documentation Update
   
   none
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-09 Thread via GitHub


danny0405 commented on code in PR #10980:
URL: https://github.com/apache/hudi/pull/10980#discussion_r1557147035


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -147,6 +149,13 @@ public HoodieMergeHandle(HoodieWriteConfig config, String 
instantTime, HoodieTab
 this.preserveMetadata = true;
 init(fileId, this.partitionPath, dataFileToBeMerged);
 validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
+// The compactor avoids heavy rewriting when copy the old record from old 
base file into new base file
+if (config.populateMetaFields()) {
+  LOG.info("Using update instead of rewriting during compaction");

Review Comment:
   `config.populateMetaFields()` is always true for the data table (unlike the 
metadata table). If schema evolution is handled correctly, I guess we can 
always set the metadata fields directly?
   
   And please change the log level to DEBUG.






Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-09 Thread via GitHub


danny0405 commented on code in PR #10980:
URL: https://github.com/apache/hudi/pull/10980#discussion_r1557137910


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -147,6 +149,13 @@ public HoodieMergeHandle(HoodieWriteConfig config, String 
instantTime, HoodieTab
 this.preserveMetadata = true;
 init(fileId, this.partitionPath, dataFileToBeMerged);
 validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
+// The compactor avoids heavy rewriting when copy the old record from old 
base file into new base file

Review Comment:
   `The compactor avoids heavy rewriting when copy the old record from old base file into new base file` ->
   `The compactor avoids costly rewriting while copying the old record from the old base file into a new one`






Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-08 Thread via GitHub


hudi-bot commented on PR #10980:
URL: https://github.com/apache/hudi/pull/10980#issuecomment-2044184717

   
   ## CI report:
   
   * c382de2b71540404831449de82e40d9488a38575 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23155)
 
   
   



Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-08 Thread via GitHub


hudi-bot commented on PR #10980:
URL: https://github.com/apache/hudi/pull/10980#issuecomment-2044130880

   
   ## CI report:
   
   * 36b0e8f8e5e00096b9844f8db6cc51cbc114f42c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23148)
   * c382de2b71540404831449de82e40d9488a38575 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23155)
 
   
   



Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-08 Thread via GitHub


hudi-bot commented on PR #10980:
URL: https://github.com/apache/hudi/pull/10980#issuecomment-2044125667

   
   ## CI report:
   
   * 36b0e8f8e5e00096b9844f8db6cc51cbc114f42c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23148)
 
   * c382de2b71540404831449de82e40d9488a38575 UNKNOWN
   
   



Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-08 Thread via GitHub


beyond1920 commented on code in PR #10980:
URL: https://github.com/apache/hudi/pull/10980#discussion_r1556818304


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -147,6 +149,13 @@ public HoodieMergeHandle(HoodieWriteConfig config, String 
instantTime, HoodieTab
 this.preserveMetadata = true;
 init(fileId, this.partitionPath, dataFileToBeMerged);
 validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
+// The compactor avoids heavy rewriting when copy the old record from old 
base file into new base file
+if (config.populateMetaFields()) {
+  LOG.info("Using update instead rewriting during compaction");

Review Comment:
   > Set the log as debug level
   
   Using the info level here does not cost much, right? It only logs in the 
class constructor, not for each input record.
   
   >  "instead" -> "instead of".
   Done
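
A minimal sketch of the cost argument above, assuming `java.util.logging` in place of the project's own logger so that the example is self-contained. `MergeHandleLoggingSketch`, `write`, and `recordsWritten` are hypothetical names, and `FINE` stands in for a DEBUG level. The message is emitted once per handle, so its cost does not grow with the number of records.

```java
// Illustrative only: a stand-in for a merge handle, not Hudi's class.
final class MergeHandleLoggingSketch {
    private static final java.util.logging.Logger LOG =
        java.util.logging.Logger.getLogger("MergeHandleLoggingSketch");

    // Number of records passed through the per-record path.
    int recordsWritten = 0;

    MergeHandleLoggingSketch() {
        // Logged once per handle instance, at the level closest to DEBUG.
        LOG.fine("Using update instead of rewriting during compaction");
    }

    void write(Object record) {
        // Per-record path: deliberately free of logging.
        recordsWritten++;
    }
}
```

Whether the line is logged at info or debug level, the number of log calls stays at one per constructed handle.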






Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-08 Thread via GitHub


beyond1920 commented on code in PR #10980:
URL: https://github.com/apache/hudi/pull/10980#discussion_r1556818304


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -147,6 +149,13 @@ public HoodieMergeHandle(HoodieWriteConfig config, String 
instantTime, HoodieTab
 this.preserveMetadata = true;
 init(fileId, this.partitionPath, dataFileToBeMerged);
 validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
+// The compactor avoids heavy rewriting when copy the old record from old 
base file into new base file
+if (config.populateMetaFields()) {
+  LOG.info("Using update instead rewriting during compaction");

Review Comment:
   > Set the log as debug level
   
   Using the info level here does not cost much, right? It only logs in the 
class constructor, not for each input record.
   
   >  "instead" -> "instead of".
   
   Done






Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-08 Thread via GitHub


beyond1920 commented on code in PR #10980:
URL: https://github.com/apache/hudi/pull/10980#discussion_r1556818304


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -147,6 +149,13 @@ public HoodieMergeHandle(HoodieWriteConfig config, String 
instantTime, HoodieTab
 this.preserveMetadata = true;
 init(fileId, this.partitionPath, dataFileToBeMerged);
 validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
+// The compactor avoids heavy rewriting when copy the old record from old 
base file into new base file
+if (config.populateMetaFields()) {
+  LOG.info("Using update instead rewriting during compaction");

Review Comment:
   > Set the log as debug level
   
   Using info level here does not cost much, right? 
   
   >  "instead" -> "instead of".
   Done






Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-08 Thread via GitHub


beyond1920 commented on code in PR #10980:
URL: https://github.com/apache/hudi/pull/10980#discussion_r1556817370


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -147,6 +149,13 @@ public HoodieMergeHandle(HoodieWriteConfig config, String 
instantTime, HoodieTab
 this.preserveMetadata = true;
 init(fileId, this.partitionPath, dataFileToBeMerged);
 validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
+// if the old schema equals to the new schema, avoid heavy rewriting
+if (config.populateMetaFields() && useWriterSchemaForCompaction) {
+  LOG.info("Using update instead rewriting during compaction");
+  copyOldFunc = (key, record, schema, prop) -> 
this.updateMetadataToOldRecord(key, record, schema, prop);

Review Comment:
   Good question.
   This method is only responsible for merging the base record with the 
incremental record; it does not handle schema evolution. 
   Schema evolution is handled before the `HoodieMergeHandle#write` method is 
called.
   https://github.com/apache/hudi/assets/1525333/3a03e08b-fe2e-4da6-a788-07cbb6feeadd
   https://github.com/apache/hudi/assets/1525333/def1f2ee-ed97-47f8-92b6-76d45500bea7
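
A rough sketch of that separation of concerns, using plain maps rather than Hudi's record or payload types. `EvolveThenMergeSketch`, `projectTo`, and `merge` are hypothetical names, and the rule that a non-null incoming value wins is an assumption made only for illustration. The point is that the merge step can require both inputs to share a schema already, because reconciliation happens in an earlier step.

```java
// Illustrative only: toy maps stand in for base and incremental records.
final class EvolveThenMergeSketch {
    // Step 1 (runs before the merge): project a record onto the latest schema.
    // Fields that the old record lacks are filled with null.
    static java.util.Map<String, Object> projectTo(java.util.List<String> latestSchema,
                                                   java.util.Map<String, Object> record) {
        java.util.Map<String, Object> out = new java.util.LinkedHashMap<>();
        for (String field : latestSchema) {
            out.put(field, record.get(field));
        }
        return out;
    }

    // Step 2: merge only. Both inputs must already have the same field set.
    // Assumed rule for this sketch: a non-null incoming value overrides the base value.
    static java.util.Map<String, Object> merge(java.util.Map<String, Object> base,
                                               java.util.Map<String, Object> incoming) {
        if (!base.keySet().equals(incoming.keySet())) {
            throw new IllegalArgumentException("schemas must be reconciled before merging");
        }
        java.util.Map<String, Object> out = new java.util.LinkedHashMap<>(base);
        incoming.forEach((field, value) -> {
            if (value != null) {
                out.put(field, value);
            }
        });
        return out;
    }
}
```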
   






Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-08 Thread via GitHub


beyond1920 commented on code in PR #10980:
URL: https://github.com/apache/hudi/pull/10980#discussion_r1556818304


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -147,6 +149,13 @@ public HoodieMergeHandle(HoodieWriteConfig config, String 
instantTime, HoodieTab
 this.preserveMetadata = true;
 init(fileId, this.partitionPath, dataFileToBeMerged);
 validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
+// The compactor avoids heavy rewriting when copy the old record from old 
base file into new base file
+if (config.populateMetaFields()) {
+  LOG.info("Using update instead rewriting during compaction");

Review Comment:
   > Set the log as debug level,






Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-08 Thread via GitHub


danny0405 commented on code in PR #10980:
URL: https://github.com/apache/hudi/pull/10980#discussion_r1556687381


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -147,6 +149,13 @@ public HoodieMergeHandle(HoodieWriteConfig config, String 
instantTime, HoodieTab
 this.preserveMetadata = true;
 init(fileId, this.partitionPath, dataFileToBeMerged);
 validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
+// if the old schema equals to the new schema, avoid heavy rewriting
+if (config.populateMetaFields() && useWriterSchemaForCompaction) {
+  LOG.info("Using update instead rewriting during compaction");
+  copyOldFunc = (key, record, schema, prop) -> 
this.updateMetadataToOldRecord(key, record, schema, prop);

Review Comment:
   But it still uses the latest schema as the write schema. What if the 
schema has already evolved?






Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-08 Thread via GitHub


danny0405 commented on code in PR #10980:
URL: https://github.com/apache/hudi/pull/10980#discussion_r1556687736


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -147,6 +149,13 @@ public HoodieMergeHandle(HoodieWriteConfig config, String 
instantTime, HoodieTab
 this.preserveMetadata = true;
 init(fileId, this.partitionPath, dataFileToBeMerged);
 validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
+// The compactor avoids heavy rewriting when copy the old record from old 
base file into new base file
+if (config.populateMetaFields()) {
+  LOG.info("Using update instead rewriting during compaction");

Review Comment:
   Set the log as debug level, "instead" -> "instead of".






Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-08 Thread via GitHub


hudi-bot commented on PR #10980:
URL: https://github.com/apache/hudi/pull/10980#issuecomment-2042891233

   
   ## CI report:
   
   * 36b0e8f8e5e00096b9844f8db6cc51cbc114f42c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23148)
 
   
   



Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-08 Thread via GitHub


hudi-bot commented on PR #10980:
URL: https://github.com/apache/hudi/pull/10980#issuecomment-2042643487

   
   ## CI report:
   
   * 07e398007c1557d3e17adc3d8a36d8778ed3e976 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23147)
   * 36b0e8f8e5e00096b9844f8db6cc51cbc114f42c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23148)
 
   
   



Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-08 Thread via GitHub


hudi-bot commented on PR #10980:
URL: https://github.com/apache/hudi/pull/10980#issuecomment-2042627787

   
   ## CI report:
   
   * 07e398007c1557d3e17adc3d8a36d8778ed3e976 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23147)
 
   * 36b0e8f8e5e00096b9844f8db6cc51cbc114f42c UNKNOWN
   
   



Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-08 Thread via GitHub


beyond1920 commented on code in PR #10980:
URL: https://github.com/apache/hudi/pull/10980#discussion_r1555662848


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -147,6 +149,13 @@ public HoodieMergeHandle(HoodieWriteConfig config, String 
instantTime, HoodieTab
 this.preserveMetadata = true;
 init(fileId, this.partitionPath, dataFileToBeMerged);
 validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
+// if the old schema equals to the new schema, avoid heavy rewriting
+if (config.populateMetaFields() && useWriterSchemaForCompaction) {
+  LOG.info("Using update instead rewriting during compaction");
+  copyOldFunc = (key, record, schema, prop) -> 
this.updateMetadataToOldRecord(key, record, schema, prop);

Review Comment:
   Not exactly.
   The behavior is consistent with the old behavior.
   
   https://github.com/apache/hudi/assets/1525333/e254eab0-9c22-4658-a4a5-cc8faae9d2af
   
   https://github.com/apache/hudi/assets/1525333/438c9ee9-1189-4928-9c48-e102625c5967
   
   In the pictures above, if `config.populateMetaFields()` is true for the 
compaction job, `oldSchema` equals `writeSchemaWithMetaFields`.
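
A minimal sketch of why matching schemas make the copy cheap, assuming plain `Object[]` records and `String[]` schemas in place of Hudi's Avro types. `CopyOldRecordSketch`, `withMetaFields`, `rewrite`, and `copyOld` are hypothetical names, and treating the file-name column as the only meta field that changes during the copy is an assumption made for illustration.

```java
// Illustrative only: arrays stand in for Avro records and schemas.
final class CopyOldRecordSketch {
    // The five Hudi meta columns, placed ahead of the data columns.
    static final String[] META_FIELDS = {
        "_hoodie_commit_time", "_hoodie_commit_seqno", "_hoodie_record_key",
        "_hoodie_partition_path", "_hoodie_file_name"
    };
    // Position of "_hoodie_file_name" within META_FIELDS.
    static final int FILE_NAME_POS = 4;

    // Hypothetical helper: write schema = meta fields followed by the data fields.
    static String[] withMetaFields(String[] dataFields) {
        String[] out = new String[META_FIELDS.length + dataFields.length];
        System.arraycopy(META_FIELDS, 0, out, 0, META_FIELDS.length);
        System.arraycopy(dataFields, 0, out, META_FIELDS.length, dataFields.length);
        return out;
    }

    // Slow path: allocate a new record and copy every field by name.
    static Object[] rewrite(Object[] old, String[] oldSchema, String[] newSchema) {
        Object[] out = new Object[newSchema.length];
        for (int i = 0; i < newSchema.length; i++) {
            for (int j = 0; j < oldSchema.length; j++) {
                if (newSchema[i].equals(oldSchema[j])) {
                    out[i] = old[j];
                }
            }
        }
        return out;
    }

    // Fast path when the layouts match: reuse the old record (mutating it in place)
    // and overwrite only the file-name column. Otherwise fall back to rewrite().
    static Object[] copyOld(Object[] old, String[] oldSchema, String[] writeSchema,
                            String newFileName) {
        Object[] out = java.util.Arrays.equals(oldSchema, writeSchema)
            ? old
            : rewrite(old, oldSchema, writeSchema);
        out[FILE_NAME_POS] = newFileName;
        return out;
    }
}
```

When the two schemas are equal, `copyOld` returns the same array it was given, so no per-field copying happens; the rewrite cost is paid only when the layouts differ.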






Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-08 Thread via GitHub


beyond1920 commented on code in PR #10980:
URL: https://github.com/apache/hudi/pull/10980#discussion_r1555662848


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -147,6 +149,13 @@ public HoodieMergeHandle(HoodieWriteConfig config, String 
instantTime, HoodieTab
 this.preserveMetadata = true;
 init(fileId, this.partitionPath, dataFileToBeMerged);
 validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
+// if the old schema equals to the new schema, avoid heavy rewriting
+if (config.populateMetaFields() && useWriterSchemaForCompaction) {
+  LOG.info("Using update instead rewriting during compaction");
+  copyOldFunc = (key, record, schema, prop) -> 
this.updateMetadataToOldRecord(key, record, schema, prop);

Review Comment:
   Not exactly.
   The behavior is consistent with the old behavior.
   https://github.com/apache/hudi/assets/1525333/2ab0c5e8-d22e-41a2-bb73-e541b64a8083
   In the picture above, if `config.populateMetaFields() && 
useWriterSchemaForCompaction` is true, `oldSchema` equals `newSchema`.






Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-08 Thread via GitHub


hudi-bot commented on PR #10980:
URL: https://github.com/apache/hudi/pull/10980#issuecomment-2042405078

   
   ## CI report:
   
   * 07e398007c1557d3e17adc3d8a36d8778ed3e976 Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23147)
 
   
   
   





Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-08 Thread via GitHub


hudi-bot commented on PR #10980:
URL: https://github.com/apache/hudi/pull/10980#issuecomment-2042297222

   
   ## CI report:
   
   * 07e398007c1557d3e17adc3d8a36d8778ed3e976 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23147)
 
   
   
   





Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-08 Thread via GitHub


hudi-bot commented on PR #10980:
URL: https://github.com/apache/hudi/pull/10980#issuecomment-2042282215

   
   ## CI report:
   
   * 07e398007c1557d3e17adc3d8a36d8778ed3e976 UNKNOWN
   
   
   





Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-08 Thread via GitHub


danny0405 commented on code in PR #10980:
URL: https://github.com/apache/hudi/pull/10980#discussion_r103225


##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -147,6 +149,13 @@ public HoodieMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTab
 this.preserveMetadata = true;
 init(fileId, this.partitionPath, dataFileToBeMerged);
 validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
+// if the old schema equals to the new schema, avoid heavy rewriting
+if (config.populateMetaFields() && useWriterSchemaForCompaction) {
+  LOG.info("Using update instead rewriting during compaction");
+  copyOldFunc = (key, record, schema, prop) -> this.updateMetadataToOldRecord(key, record, schema, prop);

Review Comment:
   So you are assuming either that the compaction does not involve schema evolution,
   or that using only the latest schema is okay because the schemas are compatible?
   How can we ensure schema compatibility?
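
   As an illustration of what such a compatibility check involves, here is a minimal sketch under strong simplifying assumptions. It is not Hudi or Avro code: `SchemaCompatSketch` and `canRead` are hypothetical names, a schema is modeled as a map from field name to type name, and any type change is treated as incompatible (no type promotion). Avro provides a `SchemaCompatibility` utility for checking real schemas; this sketch does not use it.

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a reader/writer compatibility check; not Hudi or Avro code.
final class SchemaCompatSketch {

  // Returns true if data written with writerSchema can be read with readerSchema.
  // Schemas are modeled as field name -> type name (an assumption for illustration).
  static boolean canRead(Map<String, String> writerSchema,
                         Map<String, String> readerSchema,
                         Set<String> readerFieldsWithDefaults) {
    for (Map.Entry<String, String> field : readerSchema.entrySet()) {
      String writerType = writerSchema.get(field.getKey());
      if (writerType == null) {
        // Field added after the data was written: readable only if it has a default.
        if (!readerFieldsWithDefaults.contains(field.getKey())) {
          return false;
        }
      } else if (!writerType.equals(field.getValue())) {
        // Type changed; this sketch rejects every change, including safe promotions.
        return false;
      }
    }
    // Fields that exist only in the writer schema are ignored by the reader.
    return true;
  }
}
```

   In this model, identical schemas always pass, so a skip-the-rewrite fast path would only be safe when the check above reduces to plain equality of the two schemas.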






[PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]

2024-04-08 Thread via GitHub


beyond1920 opened a new pull request, #10980:
URL: https://github.com/apache/hudi/pull/10980

   ### Change Logs
   Avoid unnecessary rewriting when copying old data from the old base file to
   the new base file, to improve compaction performance.
   See more details in
   [HUDI-7578](https://issues.apache.org/jira/browse/HUDI-7578) or
   [issue #10978](https://github.com/apache/hudi/issues/10978).
   
   ### Impact
   
   Improves compaction performance.
   
   ### Risk level (write none, low, medium, or high below)
   
   None
   
   ### Documentation Update
   
   None
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   

