yihua commented on issue #6686: URL: https://github.com/apache/hudi/issues/6686#issuecomment-1254272868
@asankadarshana007 The consistency check, when enabled, runs when invalid data files are removed: (1) check that all paths to delete exist, (2) delete them, (3) wait for all paths to disappear once the storage has become consistent. Note that this logic is not needed on strongly consistent storage. Because the invalid data files are now determined from the markers, a marker can exist for a data file whose write has not started yet, so check (1) fails for a case that is actually okay. Given that there is no use case for eventual consistency at the moment, we don't maintain this logic. Let me know if turning off `hoodie.consistency.check.enabled` solves your problem. You can close the ticket if all is good.

```
if (!invalidDataPaths.isEmpty()) {
  LOG.info("Removing duplicate data files created due to task retries before committing. Paths=" + invalidDataPaths);
  Map<String, List<Pair<String, String>>> invalidPathsByPartition = invalidDataPaths.stream()
      .map(dp -> Pair.of(new Path(basePath, dp).getParent().toString(), new Path(basePath, dp).toString()))
      .collect(Collectors.groupingBy(Pair::getKey));

  // Ensure all files in delete list is actually present. This is mandatory for an eventually consistent FS.
  // Otherwise, we may miss deleting such files. If files are not found even after retries, fail the commit
  if (consistencyCheckEnabled) {
    // This will either ensure all files to be deleted are present.
    waitForAllFiles(context, invalidPathsByPartition, FileVisibility.APPEAR);
  }

  // Now delete partially written files
  context.setJobStatus(this.getClass().getSimpleName(), "Delete all partially written files: " + config.getTableName());
  deleteInvalidFilesByPartitions(context, invalidPathsByPartition);

  // Now ensure the deleted files disappear
  if (consistencyCheckEnabled) {
    // This will either ensure all files to be deleted are absent.
    waitForAllFiles(context, invalidPathsByPartition, FileVisibility.DISAPPEAR);
  }
}
```

-- This is an automated message from the Apache Git Service.
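To illustrate why check (1) can fail when a marker exists but its data file was never written, here is a minimal, self-contained sketch of the three-step removal described in the comment above. It is not Hudi code: the class `ConsistencyCheckSketch`, the in-memory `Set<String>` standing in for storage, and the boolean return value are all assumptions made for illustration, and the real `waitForAllFiles` retries rather than checking once.

```java
import java.util.List;
import java.util.Set;

// Hypothetical sketch; names and signatures are illustrative, not Hudi's API.
final class ConsistencyCheckSketch {

    enum FileVisibility { APPEAR, DISAPPEAR }

    // Returns true if every path has the expected visibility in the
    // simulated storage. A real implementation would retry with backoff;
    // this sketch checks once because the in-memory set never changes on its own.
    static boolean waitForAllFiles(Set<String> storage, List<String> paths, FileVisibility expected) {
        for (String path : paths) {
            boolean exists = storage.contains(path);
            if (exists != (expected == FileVisibility.APPEAR)) {
                return false;
            }
        }
        return true;
    }

    // Mirrors the three steps: (1) check presence, (2) delete, (3) check absence.
    // Returns false where the sketch treats the removal as failed.
    static boolean removeInvalidFiles(Set<String> storage, List<String> invalidPaths,
                                      boolean consistencyCheckEnabled) {
        if (invalidPaths.isEmpty()) {
            return true;
        }
        // Step (1): fails if a marker listed a file that was never written.
        if (consistencyCheckEnabled
                && !waitForAllFiles(storage, invalidPaths, FileVisibility.APPEAR)) {
            return false;
        }
        // Step (2): delete whatever is actually there; missing paths are ignored.
        storage.removeAll(invalidPaths);
        // Step (3): confirm the deleted paths are gone.
        if (consistencyCheckEnabled
                && !waitForAllFiles(storage, invalidPaths, FileVisibility.DISAPPEAR)) {
            return false;
        }
        return true;
    }
}
```

In this sketch, if the invalid list contains a path that is absent from storage, `removeInvalidFiles` returns `false` with the check enabled and `true` with it disabled, which corresponds to the effect of turning off `hoodie.consistency.check.enabled`.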