prashantwason commented on code in PR #8684: URL: https://github.com/apache/hudi/pull/8684#discussion_r1203532474
########## hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java: ########## @@ -1052,51 +1091,81 @@ protected HoodieData<HoodieRecord> prepRecords(Map<MetadataPartitionType, } /** - * Perform a compaction on the Metadata Table. - * - * Cases to be handled: - * 1. We cannot perform compaction if there are previous inflight operations on the dataset. This is because - * a compacted metadata base file at time Tx should represent all the actions on the dataset till time Tx. - * - * 2. In multi-writer scenario, a parallel operation with a greater instantTime may have completed creating a - * deltacommit. + * Optimize the metadata table by running compaction, clean and archive as required. + * <p> + * Don't perform optimization if there are inflight operations on the dataset. This is for two reasons: + * - The compaction will contain the correct data as all failed operations have been rolled back. + * - Clean/compaction etc. will have the highest timestamp on the MDT and we won't be adding new operations + * with smaller timestamps to metadata table (makes for easier debugging) + * <p> + * This adds the limitations that long-running async operations (clustering, etc.) may cause delay in such MDT + * optimizations. We will relax this after MDT code has been hardened. */ - protected void compactIfNecessary(BaseHoodieWriteClient writeClient, String instantTime) { - // finish off any pending compactions if any from previous attempt. - writeClient.runAnyPendingCompactions(); - - String latestDeltaCommitTimeInMetadataTable = metadataMetaClient.reloadActiveTimeline() - .getDeltaCommitTimeline() - .filterCompletedInstants() - .lastInstant().orElseThrow(() -> new HoodieMetadataException("No completed deltacommit in metadata table")) - .getTimestamp(); - // we need to find if there are any inflights in data table timeline before or equal to the latest delta commit in metadata table. - // Whenever you want to change this logic, please ensure all below scenarios are considered. - // a. There could be a chance that latest delta commit in MDT is committed in MDT, but failed in DT. And so findInstantsBeforeOrEquals() should be employed - // b. There could be DT inflights after latest delta commit in MDT and we are ok with it. bcoz, the contract is, latest compaction instant time in MDT represents - // any instants before that is already synced with metadata table. - // c. Do consider out of order commits. For eg, c4 from DT could complete before c3. and we can't trigger compaction in MDT with c4 as base instant time, until every - // instant before c4 is synced with metadata table. - List<HoodieInstant> pendingInstants = dataMetaClient.reloadActiveTimeline().filterInflightsAndRequested() - .findInstantsBeforeOrEquals(latestDeltaCommitTimeInMetadataTable).getInstants(); + @Override + public void performTableServices(Option<String> inFlightInstantTimestamp) { + HoodieTimer metadataTableServicesTimer = HoodieTimer.start(); + boolean allTableServicesExecutedSuccessfullyOrSkipped = true; + try { + BaseHoodieWriteClient writeClient = getWriteClient(); + // Run any pending table services operations. + runPendingTableServicesOperations(writeClient); + + // Check and run clean operations. + String latestDeltacommitTime = metadataMetaClient.reloadActiveTimeline().getDeltaCommitTimeline() + .filterCompletedInstants() + .lastInstant().get() + .getTimestamp(); + LOG.info("Latest deltacommit time found is " + latestDeltacommitTime + ", running clean operations."); + cleanIfNecessary(writeClient, latestDeltacommitTime); + + // Do timeline validation before scheduling compaction/logcompaction operations. + if (!validateTimelineBeforeSchedulingCompaction(inFlightInstantTimestamp, latestDeltacommitTime)) { + return; Review Comment: Done -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org