Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
yihua closed pull request #10980: [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance URL: https://github.com/apache/hudi/pull/10980 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
danny0405 commented on PR #10980: URL: https://github.com/apache/hudi/pull/10980#issuecomment-2060281356 Closing because this is fixed in #11028.
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
hudi-bot commented on PR #11028: URL: https://github.com/apache/hudi/pull/11028#issuecomment-2058284585 ## CI report: * 67ca721df255223c873303aeccf7900c29f7811a Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23278) Bot commands @hudi-bot supports the following commands: - `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
hudi-bot commented on PR #11028: URL: https://github.com/apache/hudi/pull/11028#issuecomment-2058193126 ## CI report: * 67832fce75903cce3b3f66beb125f6a02fb82e11 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23277) * 67ca721df255223c873303aeccf7900c29f7811a Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23278)
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
hudi-bot commented on PR #11028: URL: https://github.com/apache/hudi/pull/11028#issuecomment-2058187793 ## CI report: * 67832fce75903cce3b3f66beb125f6a02fb82e11 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23277) * 67ca721df255223c873303aeccf7900c29f7811a UNKNOWN
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
hudi-bot commented on PR #11028: URL: https://github.com/apache/hudi/pull/11028#issuecomment-2058148444 ## CI report: * 67832fce75903cce3b3f66beb125f6a02fb82e11 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23277)
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
hudi-bot commented on PR #11028: URL: https://github.com/apache/hudi/pull/11028#issuecomment-2058142670 ## CI report: * 8fc55507a82ee1295f14c1125876b8395cfc27df Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23276) * 67832fce75903cce3b3f66beb125f6a02fb82e11 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23277)
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
hudi-bot commented on PR #11028: URL: https://github.com/apache/hudi/pull/11028#issuecomment-2058137317 ## CI report: * 8fc55507a82ee1295f14c1125876b8395cfc27df Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23276) * 67832fce75903cce3b3f66beb125f6a02fb82e11 UNKNOWN
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
hudi-bot commented on PR #11028: URL: https://github.com/apache/hudi/pull/11028#issuecomment-2058104791 ## CI report: * 8fc55507a82ee1295f14c1125876b8395cfc27df Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23276)
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
hudi-bot commented on PR #11028: URL: https://github.com/apache/hudi/pull/11028#issuecomment-2058098918 ## CI report: * 8fc55507a82ee1295f14c1125876b8395cfc27df UNKNOWN
[PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
danny0405 opened a new pull request, #11028:
URL: https://github.com/apache/hudi/pull/11028

### Change Logs

There is no need to copy for most of the use cases.

### Impact

no impact.

### Risk level (write none, low medium or high below)

none

### Documentation Update

none

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
danny0405 commented on code in PR #10980:
URL: https://github.com/apache/hudi/pull/10980#discussion_r1557147035

##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -147,6 +149,13 @@ public HoodieMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTab
     this.preserveMetadata = true;
     init(fileId, this.partitionPath, dataFileToBeMerged);
     validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
+    // The compactor avoids heavy rewriting when copy the old record from old base file into new base file
+    if (config.populateMetaFields()) {
+      LOG.info("Using update instead of rewriting during compaction");

Review Comment:
`config.populateMetaFields()` is always true for the data table (in contrast to the metadata table). If schema evolution is handled correctly, I guess we can always set the metadata fields directly? Also, please change the log level to DEBUG.
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
danny0405 commented on code in PR #10980:
URL: https://github.com/apache/hudi/pull/10980#discussion_r1557137910

##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -147,6 +149,13 @@ public HoodieMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTab
     this.preserveMetadata = true;
     init(fileId, this.partitionPath, dataFileToBeMerged);
     validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
+    // The compactor avoids heavy rewriting when copy the old record from old base file into new base file

Review Comment:
`The compactor avoids heavy rewriting when copy the old record from old base file into new base file` -> `The compactor avoids costly rewriting while copying the old record from the old base file into a new one`
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
hudi-bot commented on PR #10980: URL: https://github.com/apache/hudi/pull/10980#issuecomment-2044184717 ## CI report: * c382de2b71540404831449de82e40d9488a38575 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23155)
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
hudi-bot commented on PR #10980: URL: https://github.com/apache/hudi/pull/10980#issuecomment-2044130880 ## CI report: * 36b0e8f8e5e00096b9844f8db6cc51cbc114f42c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23148) * c382de2b71540404831449de82e40d9488a38575 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23155)
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
hudi-bot commented on PR #10980: URL: https://github.com/apache/hudi/pull/10980#issuecomment-2044125667 ## CI report: * 36b0e8f8e5e00096b9844f8db6cc51cbc114f42c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23148) * c382de2b71540404831449de82e40d9488a38575 UNKNOWN
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
beyond1920 commented on code in PR #10980:
URL: https://github.com/apache/hudi/pull/10980#discussion_r1556818304

##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -147,6 +149,13 @@ public HoodieMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTab
     this.preserveMetadata = true;
     init(fileId, this.partitionPath, dataFileToBeMerged);
     validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
+    // The compactor avoids heavy rewriting when copy the old record from old base file into new base file
+    if (config.populateMetaFields()) {
+      LOG.info("Using update instead rewriting during compaction");

Review Comment:
> Set the log as debug level

Using info level here does not cost much, right? The message is logged once in the class constructor, not for each input record.

> "instead" -> "instead of".

Done.
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
beyond1920 commented on code in PR #10980:
URL: https://github.com/apache/hudi/pull/10980#discussion_r1556817370

##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -147,6 +149,13 @@ public HoodieMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTab
     this.preserveMetadata = true;
     init(fileId, this.partitionPath, dataFileToBeMerged);
     validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
+    // if the old schema equals to the new schema, avoid heavy rewriting
+    if (config.populateMetaFields() && useWriterSchemaForCompaction) {
+      LOG.info("Using update instead rewriting during compaction");
+      copyOldFunc = (key, record, schema, prop) -> this.updateMetadataToOldRecord(key, record, schema, prop);

Review Comment:
Good question. This method is only responsible for merging the base record with the incremental record; it does not handle schema evolution. Schema evolution is handled before the `HoodieMergeHandle#write` method is called.

https://github.com/apache/hudi/assets/1525333/3a03e08b-fe2e-4da6-a788-07cbb6feeadd
https://github.com/apache/hudi/assets/1525333/def1f2ee-ed97-47f8-92b6-76d45500bea7
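The division of responsibility described in this comment (reconcile schemas first, merge afterwards) can be illustrated with a minimal sketch. Everything in it is hypothetical and is not Hudi's API: records are plain maps, a schema is a list of field names, and the rule that non-null incoming values win is invented purely for illustration.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EvolveThenMergeSketch {

  /** Stage 1 (hypothetical): project a record onto the latest schema; absent fields become null. */
  static Map<String, Object> evolve(Map<String, Object> record, List<String> latestSchema) {
    Map<String, Object> out = new LinkedHashMap<>();
    for (String field : latestSchema) {
      out.put(field, record.get(field));
    }
    return out;
  }

  /** Stage 2 (hypothetical): merge only; inputs with different fields are rejected, not adapted. */
  static Map<String, Object> merge(Map<String, Object> base, Map<String, Object> incoming) {
    if (!base.keySet().equals(incoming.keySet())) {
      throw new IllegalArgumentException("schemas must be reconciled before merging");
    }
    Map<String, Object> merged = new LinkedHashMap<>(base);
    for (Map.Entry<String, Object> e : incoming.entrySet()) {
      if (e.getValue() != null) {
        merged.put(e.getKey(), e.getValue()); // illustrative rule: non-null incoming values win
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    List<String> latest = Arrays.asList("id", "name", "city");

    Map<String, Object> base = new LinkedHashMap<>();
    base.put("id", 1);
    base.put("name", "a"); // written before the "city" field existed

    Map<String, Object> incoming = new LinkedHashMap<>();
    incoming.put("id", 1);
    incoming.put("city", "x");

    // Both records are brought to the latest schema before the merge step sees them.
    System.out.println(merge(evolve(base, latest), evolve(incoming, latest)));
    // prints {id=1, name=a, city=x}
  }
}
```

Because the merge step refuses mismatched inputs instead of adapting them, any schema difference has to be resolved by the caller, which is the ordering the comment describes.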
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
danny0405 commented on code in PR #10980:
URL: https://github.com/apache/hudi/pull/10980#discussion_r1556687381

##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -147,6 +149,13 @@ public HoodieMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTab
     this.preserveMetadata = true;
     init(fileId, this.partitionPath, dataFileToBeMerged);
     validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
+    // if the old schema equals to the new schema, avoid heavy rewriting
+    if (config.populateMetaFields() && useWriterSchemaForCompaction) {
+      LOG.info("Using update instead rewriting during compaction");
+      copyOldFunc = (key, record, schema, prop) -> this.updateMetadataToOldRecord(key, record, schema, prop);

Review Comment:
But it still uses the latest schema as the write schema. What if the schema has already evolved?
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
danny0405 commented on code in PR #10980:
URL: https://github.com/apache/hudi/pull/10980#discussion_r1556687736

##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -147,6 +149,13 @@ public HoodieMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTab
     this.preserveMetadata = true;
     init(fileId, this.partitionPath, dataFileToBeMerged);
     validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
+    // The compactor avoids heavy rewriting when copy the old record from old base file into new base file
+    if (config.populateMetaFields()) {
+      LOG.info("Using update instead rewriting during compaction");

Review Comment:
Set the log to debug level, and change "instead" to "instead of".
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
hudi-bot commented on PR #10980: URL: https://github.com/apache/hudi/pull/10980#issuecomment-2042891233 ## CI report: * 36b0e8f8e5e00096b9844f8db6cc51cbc114f42c Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23148)
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
hudi-bot commented on PR #10980: URL: https://github.com/apache/hudi/pull/10980#issuecomment-2042643487 ## CI report: * 07e398007c1557d3e17adc3d8a36d8778ed3e976 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23147) * 36b0e8f8e5e00096b9844f8db6cc51cbc114f42c Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23148)
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
hudi-bot commented on PR #10980: URL: https://github.com/apache/hudi/pull/10980#issuecomment-2042627787 ## CI report: * 07e398007c1557d3e17adc3d8a36d8778ed3e976 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23147) * 36b0e8f8e5e00096b9844f8db6cc51cbc114f42c UNKNOWN
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
beyond1920 commented on code in PR #10980:
URL: https://github.com/apache/hudi/pull/10980#discussion_r1555662848

##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -147,6 +149,13 @@ public HoodieMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTab
     this.preserveMetadata = true;
     init(fileId, this.partitionPath, dataFileToBeMerged);
     validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
+    // if the old schema equals to the new schema, avoid heavy rewriting
+    if (config.populateMetaFields() && useWriterSchemaForCompaction) {
+      LOG.info("Using update instead rewriting during compaction");
+      copyOldFunc = (key, record, schema, prop) -> this.updateMetadataToOldRecord(key, record, schema, prop);

Review Comment:
Not exactly. The behavior is consistent with the old behavior.

https://github.com/apache/hudi/assets/1525333/e254eab0-9c22-4658-a4a5-cc8faae9d2af
https://github.com/apache/hudi/assets/1525333/438c9ee9-1189-4928-9c48-e102625c5967

In the pictures above, if `config.populateMetaFields()` is true for the compaction job, `oldSchema` is equal to `writeSchemaWithMetaFields`.
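The idea under review, choosing once in the constructor between rewriting each old record and updating it in place when the old and new schemas are equal, can be sketched as follows. This is only an illustration under stated assumptions: records are plain maps, a schema is a list of field names, and `_commit_time`, `rewrite`, `updateMetadata` and `chooseCopyFunc` are hypothetical names rather than Hudi's API. What "update" does in the actual PR is not shown in this thread.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

public class CopyStrategySketch {
  // Hypothetical stand-in for a meta field name; not taken from Hudi.
  static final String COMMIT_TIME_FIELD = "_commit_time";

  // Counts full rewrites so the difference between the two paths is observable.
  static int rewriteCount = 0;

  /** Slow path: build a new record field by field against the target schema. */
  static Map<String, Object> rewrite(Map<String, Object> oldRecord, List<String> newSchema) {
    rewriteCount++;
    Map<String, Object> copy = new LinkedHashMap<>();
    for (String field : newSchema) {
      copy.put(field, oldRecord.get(field)); // fields missing from the old record become null
    }
    return copy;
  }

  /** Fast path: the schemas already match, so only the meta field is updated in place. */
  static Map<String, Object> updateMetadata(Map<String, Object> oldRecord, String instantTime) {
    oldRecord.put(COMMIT_TIME_FIELD, instantTime);
    return oldRecord;
  }

  /** Picks the strategy once, mirroring a decision made at construction time, not per record. */
  static BiFunction<Map<String, Object>, String, Map<String, Object>> chooseCopyFunc(
      List<String> oldSchema, List<String> newSchema) {
    if (oldSchema.equals(newSchema)) {
      return CopyStrategySketch::updateMetadata;
    }
    return (record, instantTime) -> updateMetadata(rewrite(record, newSchema), instantTime);
  }

  public static void main(String[] args) {
    List<String> schema = Arrays.asList(COMMIT_TIME_FIELD, "id");
    Map<String, Object> record = new LinkedHashMap<>();
    record.put(COMMIT_TIME_FIELD, "001");
    record.put("id", 42);

    Map<String, Object> copied = chooseCopyFunc(schema, schema).apply(record, "002");
    System.out.println("same instance reused: " + (copied == record));
    System.out.println("full rewrites performed: " + rewriteCount);
  }
}
```

Note that the fast path mutates the record it is given, so it is only safe when the caller no longer needs the original values; whether that holds in the real merge handle is outside what this sketch can show.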
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
beyond1920 commented on code in PR #10980:
URL: https://github.com/apache/hudi/pull/10980#discussion_r1555662848

##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -147,6 +149,13 @@ public HoodieMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTab
     this.preserveMetadata = true;
     init(fileId, this.partitionPath, dataFileToBeMerged);
     validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
+    // if the old schema equals to the new schema, avoid heavy rewriting
+    if (config.populateMetaFields() && useWriterSchemaForCompaction) {
+      LOG.info("Using update instead rewriting during compaction");
+      copyOldFunc = (key, record, schema, prop) -> this.updateMetadataToOldRecord(key, record, schema, prop);

Review Comment:
Not exactly. The behavior is consistent with the old behavior.

https://github.com/apache/hudi/assets/1525333/2ab0c5e8-d22e-41a2-bb73-e541b64a8083

In the picture above, if `config.populateMetaFields() && useWriterSchemaForCompaction` is true, `oldSchema` is equal to `newSchema`.
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
hudi-bot commented on PR #10980: URL: https://github.com/apache/hudi/pull/10980#issuecomment-2042405078 ## CI report: * 07e398007c1557d3e17adc3d8a36d8778ed3e976 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23147)
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
hudi-bot commented on PR #10980: URL: https://github.com/apache/hudi/pull/10980#issuecomment-2042297222 ## CI report: * 07e398007c1557d3e17adc3d8a36d8778ed3e976 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=23147)
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
hudi-bot commented on PR #10980: URL: https://github.com/apache/hudi/pull/10980#issuecomment-2042282215 ## CI report: * 07e398007c1557d3e17adc3d8a36d8778ed3e976 UNKNOWN
Re: [PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
danny0405 commented on code in PR #10980:
URL: https://github.com/apache/hudi/pull/10980#discussion_r103225

##
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandle.java:
##
@@ -147,6 +149,13 @@ public HoodieMergeHandle(HoodieWriteConfig config, String instantTime, HoodieTab
     this.preserveMetadata = true;
     init(fileId, this.partitionPath, dataFileToBeMerged);
     validateAndSetAndKeyGenProps(keyGeneratorOpt, config.populateMetaFields());
+    // if the old schema equals to the new schema, avoid heavy rewriting
+    if (config.populateMetaFields() && useWriterSchemaForCompaction) {
+      LOG.info("Using update instead rewriting during compaction");
+      copyOldFunc = (key, record, schema, prop) -> this.updateMetadataToOldRecord(key, record, schema, prop);

Review Comment:
So are you assuming that compaction does not involve schema evolution, or that using the latest schema is acceptable because the schemas are compatible? How can we ensure schema compatibility?
[PR] [HUDI-7578] Avoid unnecessary rewriting when copy old data from old base to new base file to improve compaction performance [hudi]
beyond1920 opened a new pull request, #10980:
URL: https://github.com/apache/hudi/pull/10980

### Change Logs

Avoid unnecessary rewriting when copying old data from the old base file to the new base file, to improve compaction performance. See more detail in [HUDI-7578](https://issues.apache.org/jira/browse/HUDI-7578) or [issue#10978](https://github.com/apache/hudi/issues/10978).

### Impact

Improves compaction performance.

### Risk level (write none, low medium or high below)

None

### Documentation Update

None

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed