[GitHub] [hudi] danny0405 commented on a diff in pull request #8900: [HUDI-6334] Integrate logcompaction table service to metadata table and provides various bugfixes to metadata table
danny0405 commented on code in PR #8900:
URL: https://github.com/apache/hudi/pull/8900#discussion_r1224329290

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleFactory.java:

@@ -47,6 +50,7 @@ public static HoodieMergeHandle create(
       String fileId,
       TaskContextSupplier taskContextSupplier,
       Option keyGeneratorOpt) {
+    LOG.info("Create update handle for fileId {} and partition path {} at commit {}", fileId, partitionPath, instantTime);
     if (table.requireSortedRecords()) {

Review Comment:
   Can we just remove this log? It's verbose.

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleFactory.java:

@@ -79,6 +83,7 @@ public static HoodieMergeHandle create(
       HoodieBaseFile dataFileToBeMerged,
       TaskContextSupplier taskContextSupplier,
       Option keyGeneratorOpt) {
+    LOG.info("Get updateHandle for fileId {} and partitionPath {} at commit {}", fileId, partitionPath, instantTime);
     if (table.requireSortedRecords()) {

Review Comment:
   Can we just remove this log? It's verbose.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
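The verbosity concern above is about an INFO-level log on a per-handle code path. As a minimal, self-contained sketch of one alternative to deleting the line outright — demoting it to a lazily formatted debug level — the following uses `java.util.logging` (not the SLF4J-style `LOG` field in the PR, so it runs standalone):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class MergeHandleLogging {
  static final Logger LOG = Logger.getLogger("MergeHandleLogging");

  // Builds the message the PR logs; extracted so it can be tested directly.
  static String handleCreationMessage(String fileId, String partitionPath, String instantTime) {
    return String.format("Create update handle for fileId %s and partition path %s at commit %s",
        fileId, partitionPath, instantTime);
  }

  static void logHandleCreation(String fileId, String partitionPath, String instantTime) {
    // The Supplier overload defers formatting until FINE (debug) is actually enabled,
    // so the hot path pays nothing when debug logging is off.
    LOG.log(Level.FINE, () -> handleCreationMessage(fileId, partitionPath, instantTime));
  }

  public static void main(String[] args) {
    System.out.println(handleCreationMessage("f1", "p1", "t1"));
  }
}
```

Whether to demote or delete is the reviewer's call; the sketch only illustrates the cost argument behind "it's verbose".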
danny0405 commented on code in PR #8900:
URL: https://github.com/apache/hudi/pull/8900#discussion_r1223977198

## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestStreamWriteOperatorCoordinator.java:

@@ -233,48 +233,49 @@ void testSyncMetadataTable() throws Exception {
     assertThat(completedTimeline.lastInstant().get().getTimestamp(), startsWith(HoodieTableMetadata.SOLO_COMMIT_TIMESTAMP));
     // test metadata table compaction
-    // write another 4 commits
-    for (int i = 1; i < 5; i++) {
+    // write another 9 commits to trigger compaction twice. Since default clean version to retain is 2.

Review Comment:
   > The reason for making the change is to support restore

   First of all, I'm confused about why this change is related to restore. The change is for MDT log compaction, right? Can we address the restore issue in another PR?

   > then next cleaner job after compaction would remove previous file slice, thereby blocking restore on metadata table or losing data.

   Does the new cleaning strategy solve the issue? Even if we keep at least 2 versions for each file group, it does not ensure we can restore to a long-history commit.
danny0405 commented on code in PR #8900:
URL: https://github.com/apache/hudi/pull/8900#discussion_r1223968227

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/config/HoodieCompactionConfig.java:

@@ -61,6 +61,12 @@ public class HoodieCompactionConfig extends HoodieConfig {
       + "but users are expected to trigger async job for execution. If `hoodie.compact.inline` is set to true, regular writers will do both scheduling and "
       + "execution inline for compaction");

+  public static final ConfigProperty ENABLE_LOG_COMPACTION = ConfigProperty
+      .key("hoodie.log.compaction.enable")
+      .defaultValue("false")
+      .sinceVersion("0.14")
+      .withDocumentation("By enabling log compaction through this config, log compaction will also get enabled for the metadata table.");

Review Comment:
   It seems only MDT uses the log compaction. If that is true, I kind of think this belongs in an MDT config.
danny0405 commented on code in PR #8900:
URL: https://github.com/apache/hudi/pull/8900#discussion_r1223970182

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataWriter.java:

@@ -97,6 +98,11 @@ public interface HoodieTableMetadataWriter extends Serializable, AutoCloseable {
    */
   BaseHoodieWriteClient getWriteClient();

+  /**
+   * It returns write client for metadata table.
+   */
+  HoodieTableMetaClient getMetadataMetaClient();

Review Comment:
   Just remove unrelated changes.
danny0405 commented on code in PR #8900:
URL: https://github.com/apache/hudi/pull/8900#discussion_r1223901787

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/plan/generators/BaseHoodieCompactionPlanGenerator.java:

@@ -158,7 +165,8 @@ protected List filterPartitionPathsByStrategy(HoodieWriteConfig writeCon
     return partitionPaths;
   }

-  protected boolean filterFileSlice(FileSlice fileSlice, String lastCompletedInstantTime, Set pendingFileGroupIds) {
+  protected boolean filterFileSlice(FileSlice fileSlice, String lastCompletedInstantTime,
+      Set pendingFileGroupIds, Option instantRange) {
     return fileSlice.getLogFiles().count() > 0 && !pendingFileGroupIds.contains(fileSlice.getFileGroupId());

Review Comment:
   Okay, it is overridden; a little obscure.
danny0405 commented on code in PR #8900:
URL: https://github.com/apache/hudi/pull/8900#discussion_r1223899704

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/table/action/compact/plan/generators/BaseHoodieCompactionPlanGenerator.java:

@@ -158,7 +165,8 @@ protected List filterPartitionPathsByStrategy(HoodieWriteConfig writeCon
     return partitionPaths;
   }

-  protected boolean filterFileSlice(FileSlice fileSlice, String lastCompletedInstantTime, Set pendingFileGroupIds) {
+  protected boolean filterFileSlice(FileSlice fileSlice, String lastCompletedInstantTime,
+      Set pendingFileGroupIds, Option instantRange) {
     return fileSlice.getLogFiles().count() > 0 && !pendingFileGroupIds.contains(fileSlice.getFileGroupId());

Review Comment:
   Has the `instantRange` been used anywhere?
danny0405 commented on code in PR #8900:
URL: https://github.com/apache/hudi/pull/8900#discussion_r1223894630

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:

@@ -1021,17 +1023,46 @@ private void runPendingTableServicesOperations(BaseHoodieWriteClient writeClient
    * deltacommit.
    */
   protected void compactIfNecessary(BaseHoodieWriteClient writeClient, String latestDeltacommitTime) {
+
+    // Check if there are any pending compaction or log compaction instants in the timeline.
+    // If pending compaction/log compaction operations are found, abort scheduling new compaction/log compaction operations.
+    Option pendingLogCompactionInstant =
+        metadataMetaClient.getActiveTimeline().filterPendingLogCompactionTimeline().firstInstant();
+    Option pendingCompactionInstant =
+        metadataMetaClient.getActiveTimeline().filterPendingCompactionTimeline().firstInstant();
+    if (pendingLogCompactionInstant.isPresent() || pendingCompactionInstant.isPresent()) {
+      LOG.info(String.format("Not scheduling compaction or logcompaction, since a pending compaction instant %s or logcompaction %s instant is present",
+          pendingCompactionInstant, pendingLogCompactionInstant));
+      return;
+    }
+
     // Trigger compaction with suffixes based on the same instant time. This ensures that any future
     // delta commits synced over will not have an instant time lesser than the last completed instant on the
     // metadata table.
     final String compactionInstantTime = HoodieTableMetadataUtil.createCompactionTimestamp(latestDeltacommitTime);
+
     // we need to avoid checking compaction w/ same instant again.
     // let's say we trigger compaction after C5 in MDT and so compaction completes with C4001. but C5 crashed before completing in MDT.
     // and again w/ C6, we will re-attempt compaction at which point latest delta commit is C4 in MDT.
     // and so we try compaction w/ instant C4001. So, we can avoid compaction if we already have compaction w/ same instant time.
-    if (!metadataMetaClient.getActiveTimeline().filterCompletedInstants().containsInstant(compactionInstantTime)
-        && writeClient.scheduleCompactionAtInstant(compactionInstantTime, Option.empty())) {
+    if (metadataMetaClient.getActiveTimeline().filterCompletedInstants().containsInstant(compactionInstantTime)) {
+      LOG.info(String.format("Compaction with same %s time is already present in the timeline.", compactionInstantTime));
+      return;
+    } else if (writeClient.scheduleCompactionAtInstant(compactionInstantTime, Option.empty())) {
+      LOG.info("Compaction is scheduled for timestamp " + compactionInstantTime);
       writeClient.compact(compactionInstantTime);
+    } else if (metadataWriteConfig.isLogCompactionEnabled()) {
+      // Schedule and execute log compaction with suffixes based on the same instant time. This ensures that any future
+      // delta commits synced over will not have an instant time lesser than the last completed instant on the
+      // metadata table.
+      final String logCompactionInstantTime = HoodieTableMetadataUtil.createLogCompactionTimestamp(latestDeltacommitTime);
+      if (metadataMetaClient.getActiveTimeline().filterCompletedInstants().containsInstant(logCompactionInstantTime)) {
+        LOG.info(String.format("Log compaction with same %s time is already present in the timeline.", logCompactionInstantTime));
+        return;

Review Comment:
   This `return` can be eliminated.
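The crash-recovery reasoning in the code comment above (compaction after C5 completes as C4001; C5 crashes; the re-attempt under C6 sees latest deltacommit C4 and derives the same instant C4001, so it must be skipped) can be sketched as a toy model. This is not Hudi's actual `HoodieTableMetadataUtil` implementation; the "001" suffix is taken from the C4 → C4001 example in the thread:

```java
import java.util.HashSet;
import java.util.Set;

public class CompactionInstantSketch {
  // Derive the compaction instant from the latest deltacommit time plus a fixed
  // suffix, so the same deltacommit always yields the same compaction instant.
  static String createCompactionTimestamp(String latestDeltacommitTime) {
    return latestDeltacommitTime + "001";
  }

  // Scheduling is idempotent: skip if a compaction with this instant already completed.
  static boolean shouldScheduleCompaction(Set<String> completedInstants, String latestDeltacommitTime) {
    return !completedInstants.contains(createCompactionTimestamp(latestDeltacommitTime));
  }

  public static void main(String[] args) {
    Set<String> completed = new HashSet<>();
    String instant = createCompactionTimestamp("C4");
    System.out.println(instant); // C4001
    System.out.println(shouldScheduleCompaction(completed, "C4")); // true: first attempt
    completed.add(instant); // compaction C4001 completes
    // Re-attempt after a crash: same deltacommit time yields the same instant, so it is skipped.
    System.out.println(shouldScheduleCompaction(completed, "C4")); // false
  }
}
```

Deterministic instant derivation is what makes the containsInstant check in the diff sufficient to deduplicate re-attempts.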
danny0405 commented on code in PR #8900:
URL: https://github.com/apache/hudi/pull/8900#discussion_r1223872493

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:

@@ -985,9 +985,10 @@ public void performTableServices(Option inFlightInstantTimestamp) {
     // Do timeline validation before scheduling compaction/logcompaction operations.
     if (validateTimelineBeforeSchedulingCompaction(inFlightInstantTimestamp, latestDeltacommitTime)) {
+      LOG.info("No inflight commits present in either of the timelines. "
+          + "Proceeding to check compaction.");

Review Comment:
   Do we need a log like this? The compaction would emit many log lines anyway.

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:

@@ -1021,17 +1023,46 @@ private void runPendingTableServicesOperations(BaseHoodieWriteClient writeClient
    * deltacommit.
    */
   protected void compactIfNecessary(BaseHoodieWriteClient writeClient, String latestDeltacommitTime) {
+
+    // Check if there are any pending compaction or log compaction instants in the timeline.
+    // If pending compaction/log compaction operations are found, abort scheduling new compaction/log compaction operations.
+    Option pendingLogCompactionInstant =
+        metadataMetaClient.getActiveTimeline().filterPendingLogCompactionTimeline().firstInstant();

Review Comment:
   Wondering how this could happen; all the compactions are inline for MDT, right?

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieMetadataWriteUtils.java:

@@ -111,6 +112,10 @@ public static HoodieWriteConfig createMetadataWriteConfig(
     // deltacommits having corresponding completed commits. Therefore, we need to compact all fileslices of all
     // partitions together requiring UnBoundedCompactionStrategy.
     .withCompactionStrategy(new UnBoundedCompactionStrategy())
+    // Check if log compaction is enabled, this is needed for tables with lot of records.
+    .withLogCompactionEnabled(writeConfig.isLogCompactionEnabled())
+    // This config is only used if enableLogCompactionForMetadata is set.

Review Comment:
   Yeah, I think log compaction for MDT and DT should not share the same config option.

## hudi-flink-datasource/hudi-flink/src/test/java/org/apache/hudi/sink/TestStreamWriteOperatorCoordinator.java:

@@ -233,48 +233,49 @@ void testSyncMetadataTable() throws Exception {
     assertThat(completedTimeline.lastInstant().get().getTimestamp(), startsWith(HoodieTableMetadata.SOLO_COMMIT_TIMESTAMP));
     // test metadata table compaction
-    // write another 4 commits
-    for (int i = 1; i < 5; i++) {
+    // write another 9 commits to trigger compaction twice. Since default clean version to retain is 2.

Review Comment:
   Why do we need to change the clean strategy then?
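The suggestion that MDT and DT should not share one config option can be sketched as follows. This is a hypothetical illustration only: the dedicated key name `hoodie.metadata.log.compaction.enable` is an assumption, not something confirmed by this thread, and the lookup is modeled with a plain map rather than Hudi's config classes:

```java
import java.util.HashMap;
import java.util.Map;

public class MetadataLogCompactionConfig {
  // Existing data-table key, as shown in the diff.
  static final String DT_KEY = "hoodie.log.compaction.enable";
  // Hypothetical dedicated metadata-table key (illustrative name).
  static final String MDT_KEY = "hoodie.metadata.log.compaction.enable";

  // MDT log compaction reads its own key, defaulting to disabled, instead of
  // inheriting the data table's setting.
  static boolean isMetadataLogCompactionEnabled(Map<String, String> props) {
    return Boolean.parseBoolean(props.getOrDefault(MDT_KEY, "false"));
  }

  public static void main(String[] args) {
    Map<String, String> props = new HashMap<>();
    props.put(DT_KEY, "true"); // enabling DT log compaction alone does not affect MDT
    System.out.println(isMetadataLogCompactionEnabled(props)); // false
    props.put(MDT_KEY, "true");
    System.out.println(isMetadataLogCompactionEnabled(props)); // true
  }
}
```

Separating the keys lets operators tune MDT log compaction (small, internally managed table) independently of data-table behavior, which is the crux of the review comment.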
danny0405 commented on code in PR #8900:
URL: https://github.com/apache/hudi/pull/8900#discussion_r1223869176

## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieMergeHandleFactory.java:

@@ -47,6 +50,7 @@ public static HoodieMergeHandle create(
       String fileId,
       TaskContextSupplier taskContextSupplier,
       Option keyGeneratorOpt) {
+    LOG.info("Get updateHandle for fileId " + fileId + " and partitionPath " + partitionPath + " at commit " + instantTime);

Review Comment:
   `Create update handle for fileId ... and partition path ...`