prashantwason commented on code in PR #18182:
URL: https://github.com/apache/hudi/pull/18182#discussion_r3230492746
##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##########
@@ -2281,6 +2298,78 @@ public static String
deleteMetadataTablePartition(HoodieTableMetaClient dataMeta
return null;
}
+ /**
+ * Returns true if bucketing is enabled for the metadata table.
+ *
+ * @param metaClient MetaClient for the metadata table
+ * @return true if bucketing is enabled
+ */
+ private static boolean isBucketingEnabledForMDT(HoodieTableMetaClient
metaClient) {
+ // Bucketing status is saved within the main dataset's properties.
+ String basePath =
HoodieTableMetadata.getDataTableBasePathFromMetadataTable(metaClient.getBasePath().toString());
+ return HoodieTableMetaClient.builder()
+ .setBasePath(basePath)
+ .setConf(metaClient.getStorageConf().newInstance())
+ .setLoadActiveTimelineOnLoad(false)
+ .build()
+ .getTableConfig()
+ .isMetadataTablePartitionBucketingEnabled();
+ }
+
+ /**
+ * Returns the list of partitions to list for file slices.
Review Comment:
Same root cause as your earlier comment on the rebuild — fixed in 39755a7e
by persisting the bucketing flag on the MDT's own `hoodie.properties` and
reading it directly from the MDT metaClient's table config in
`isBucketingEnabledForMDT`. No more per-call disk read or HoodieTableMetaClient
construction.
##########
hudi-client/hudi-spark-client/src/test/java/org/apache/hudi/testutils/HoodieSparkClientTestHarness.java:
##########
@@ -647,7 +650,21 @@ private void runFullValidation(HoodieMetadataConfig
metadataConfig,
// Metadata table has a fixed number of partitions
// Cannot use FSUtils.getAllFoldersWithPartitionMetaFile for this as that
function filters all directory
// in the .hoodie folder.
- List<String> metadataTablePartitions =
FSUtils.getAllPartitionPaths(engineContext, metadataMetaClient, false);
+ List<String> metadataTablePartitions =
FSUtils.getAllPartitionPaths(engineContext,
HoodieTableMetadata.getMetadataTableBasePath(basePath),
+ false, false);
+
+ List<MetadataPartitionType> enabledPartitionTypes =
metadataWriter.getEnabledPartitionTypes();
+ if (writeConfig.getMetadataConfig().isFileGroupBucketingEnabled()) {
+ // First bucket should be present
+
assertTrue(metadataTablePartitions.contains(HoodieTableMetadataUtil.getBucketRelativePath(MetadataPartitionType.FILES,
0)));
+ } else {
+ // partition path is the only partition
+ assertEquals(enabledPartitionTypes.size(),
metadataTablePartitions.size());
+
assertTrue(metadataTablePartitions.contains(MetadataPartitionType.FILES.getPartitionPath()));
+ }
Review Comment:
Fixed in 39755a7e. The bucketing branch now iterates over
`enabledPartitionTypes` and asserts each partition has its bucket-0
sub-directory present, with the partition path included in the failure message
for easier debugging. RECORD_INDEX, COLUMN_STATS, etc. are now actually
verified.
##########
hudi-spark-datasource/hudi-spark/src/test/java/org/apache/hudi/functional/TestHoodieBackedMetadata.java:
##########
@@ -3760,7 +3762,7 @@ public void testDeleteWithRecordIndex() throws Exception {
}
Review Comment:
Restored the 2-space indent on `validateMetadata(SparkRDDWriteClient)` (line
3762), `verifyMetadataRawRecords` (line 1466), and a third de-indented line in
`validateMetadata(...)` around line 3989 in 39755a7e.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -1060,37 +1101,63 @@ private void initializeFileGroups(HoodieTableMetaClient
dataMetaClient, Metadata
// during initial commit, then the fileGroup would still be recognized (as
a FileSlice with no baseFiles but a
// valid logFile). Since these log files being created have no content, it
is safe to add them here before
// the bulkInsert.
- final String msg = String.format("Creating %d file groups for partition %s
with base fileId %s at instant time %s",
- fileGroupCount, relativePartitionPath,
metadataPartition.getFileIdPrefix(), instantTime);
+ final int bucketCount = (int) Math.ceil((float) fileGroupCount /
metadataWriteConfig.getMetadataConfig().getFileGroupBucketSize());
+ List<Pair<String, String>> fileGroupIdAndPathPairList;
+ final String msg;
+ if (bucketingEnabled) {
+ msg = String.format("Creating bucketed file groups for MDT partition %s:
#buckets=%d, #fileGroups=%d, fileId=%s, instant_time=%s",
+ relativePartitionPath, bucketCount, fileGroupCount,
metadataPartition.getFileIdPrefix(), instantTime);
+ fileGroupIdAndPathPairList = IntStream.range(0, fileGroupCount)
+ // Since bucketing is enabled, the filegroups will be located in the
bucket directory inside metadata partition directory
+ .mapToObj(i -> {
+ final int bucketIndex = i /
metadataWriteConfig.getMetadataConfig().getFileGroupBucketSize();
+ final String bucketPath =
HoodieTableMetadataUtil.getBucketRelativePath(relativePartitionPath,
bucketIndex);
+ return
Pair.of(HoodieTableMetadataUtil.getFileIDForFileGroup(metadataPartition, i,
relativePartitionPath, dataPartitionName), bucketPath);
+ })
+ .collect(Collectors.toList());
+ } else {
+ msg = String.format("Creating file groups for MDT partition %s:
#fileGroups=%d, fileId=%s, instant_time=%s",
+ relativePartitionPath, fileGroupCount,
metadataPartition.getFileIdPrefix(), instantTime);
+ fileGroupIdAndPathPairList = IntStream.range(0, fileGroupCount)
+ // Since bucketing is disabled, the filegroups will be located
directly under the metadata partition directory
+ .mapToObj(i ->
Pair.of(HoodieTableMetadataUtil.getFileIDForFileGroup(metadataPartition, i,
relativePartitionPath, dataPartitionName), relativePartitionPath))
+ .collect(Collectors.toList());
+ }
+
+ ValidationUtils.checkArgument(fileGroupIdAndPathPairList.size() ==
fileGroupCount);
LOG.info(msg);
- final List<String> fileGroupFileIds = IntStream.range(0, fileGroupCount)
- .mapToObj(i ->
HoodieTableMetadataUtil.getFileIDForFileGroup(metadataPartition, i,
relativePartitionPath, dataPartitionName))
- .collect(Collectors.toList());
- ValidationUtils.checkArgument(fileGroupFileIds.size() == fileGroupCount);
engineContext.setJobStatus(this.getClass().getSimpleName(), msg);
- engineContext.foreach(fileGroupFileIds, fileGroupFileId -> {
+ engineContext.foreach(fileGroupIdAndPathPairList, p -> {
+ final String fileGroupFileId = p.getKey();
+ final String relativePath = p.getValue();
+
try {
final Map<HeaderMetadataType, String> blockHeader =
Collections.singletonMap(HeaderMetadataType.INSTANT_TIME, instantTime);
final HoodieDeleteBlock block = new
HoodieDeleteBlock(Collections.emptyList(), blockHeader);
try (HoodieLogFormat.Writer writer = HoodieLogFormat.newWriterBuilder()
-
.onParentPath(FSUtils.constructAbsolutePath(metadataWriteConfig.getBasePath(),
relativePartitionPath))
+
.onParentPath(FSUtils.constructAbsolutePath(metadataWriteConfig.getBasePath(),
relativePath))
.withFileId(fileGroupFileId)
.withInstantTime(instantTime)
.withLogVersion(HoodieLogFile.LOGFILE_BASE_VERSION)
.withFileSize(0L)
.withSizeThreshold(metadataWriteConfig.getLogFileMaxSize())
- .withStorage(dataMetaClient.getStorage())
+ .withStorage(storage)
.withLogWriteToken(HoodieLogFormat.DEFAULT_WRITE_TOKEN)
.withTableVersion(metadataWriteConfig.getWriteVersion())
.withFileExtension(HoodieLogFile.DELTA_EXTENSION).build()) {
writer.appendBlock(block);
}
} catch (InterruptedException e) {
- throw new HoodieException(String.format("Failed to created fileGroup
%s for partition %s", fileGroupFileId, relativePartitionPath), e);
+ throw new HoodieException(String.format("Failed to created fileGroup
%s for partition %s", fileGroupFileId, relativePath), e);
}
- }, fileGroupFileIds.size());
+ }, fileGroupIdAndPathPairList.size());
+
+ if (bucketingEnabled &&
!dataMetaClient.getTableConfig().isMetadataTablePartitionBucketingEnabled()) {
+ // Bucketing has been enabled so set it in the table config
+ dataMetaClient =
HoodieTableMetadataUtil.setMetadataTablePartitionBucketing(dataMetaClient,
bucketingEnabled);
Review Comment:
Fixed in 39755a7e: assigns to `this.dataMetaClient` (the writer's instance
field) instead of the shadowing local parameter. Added a comment explaining the
shadowing hazard.
##########
hudi-common/src/main/java/org/apache/hudi/common/config/HoodieMetadataConfig.java:
##########
@@ -937,6 +947,14 @@ public boolean isDropMetadataIndex(String indexName) {
return subIndexNameToDrop.contains(indexName);
}
+ public boolean isFileGroupBucketingEnabled() {
+ return getBoolean(FILE_GROUP_BUCKETING_ENABLE);
+ }
+
+ public int getFileGroupBucketSize() {
+ return getInt(FILE_GROUP_BUCKET_SIZE);
+ }
Review Comment:
Added `checkArgument(bucketSize > 0, ...)` in `getFileGroupBucketSize()` in
39755a7e. Misconfigurations now fail at the config-getter rather than as an
`ArithmeticException` inside `initializeFileGroups`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]