nsivabalan commented on code in PR #19045:
URL: https://github.com/apache/hudi/pull/19045#discussion_r3498253089
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/metadata/HoodieBackedTableMetadataWriter.java:
##########
@@ -2081,6 +2192,14 @@ private SerializableFunction<HoodieRecord, HoodieRecord>
getRecordTagger(String
FileSlice slice = fileSlices.get(mappingFunction.apply(r.getRecordKey(),
fileGroupCount));
r.unseal();
r.setCurrentLocation(new
HoodieRecordLocation(slice.getBaseInstantTime(), slice.getFileId()));
+ // Under layouts that place file groups in sub-directories (e.g.
bucketing), the file slice's
+ // physical partition path may differ from the logical MDT partition
name carried on the
+ // record. Realign the record's partition path with where the file slice
actually lives so
+ // downstream HoodieAppendHandle's partition-path consistency check
passes.
+ String physical = slice.getPartitionPath();
Review Comment:
Acknowledged — this is one of the two non-trivial items we're deferring to a
follow-up. Compaction (`BaseHoodieCompactionPlanGenerator`) and cleaning's
full-listing path go through the normal FS view under the logical partition
name, which under bucketing will return zero file slices. cshuo flagged the
same in https://github.com/apache/hudi/pull/19045#discussion_r3492391550.
Tracking: the realignment here makes the per-record write stats land at the
bucket sub-path so an incremental-stats-driven plan can find them, but the
listing-driven fallback paths still need a layout-aware fan-out. Will be
addressed in the bucketing-table-services follow-up alongside the
partitioned-RLI work.
##########
hudi-client/hudi-client-common/src/main/java/org/apache/hudi/io/HoodieAppendHandle.java:
##########
@@ -268,6 +275,37 @@ private void init(HoodieRecord record) {
doInit = false;
}
+ /**
+ * Returns true when this append handle is writing to a layout sub-path of
an MDT partition
+ * (e.g. {@code record_index/0004} under sub-directory bucketing). In that
case, the
+ * {@code .hoodie_partition_metadata} marker must NOT be created at the
bucket level; it is
+ * written once at the logical partition root by the MDT initialization path.
+ *
+ * <p>Heuristic: an MDT partition path of the form {@code
<known-mdt-partition>/<NNNN>} where
+ * {@code NNNN} is the standard 4-digit bucket name produced by the
sub-directory bucketing
+ * layout. Third-party layouts using a different sub-path naming scheme can
ship their own
+ * append-handle integration; the OSS-shipped layouts use this convention.
+ */
+ private boolean isMDTLayoutSubPath(String physicalPartitionPath) {
Review Comment:
Partially addressed and partially deferred.
Fixed in `fcf090641055` + `8a7c60f5c54f`:
- Layering: moved out of the engine-agnostic append handle into
`HoodieTableMetadataUtil.isMDTBucketSubPath`; the AppendHandle is now a 3-line
delegate.
- Spurious matches: the heuristic now hard-gates on a non-flat MDT layout
being configured (`getMetadataLayoutClass()` present and != `FlatMDTLayout`),
so a 4-digit data-table partition value (e.g. `price=1000`) on a flat-default
table cannot trigger a false positive.
Still pending: the bucket-index >= 10000 case still falls back to false
(5-digit suffix), which would let a marker be written inside a bucket dir at
scale. Mitigation in code: `SubDirBucketedMDTLayout.bucketRelativePath` uses
`%04d` so any bucket index that overflows 4 digits already produces a path the
heuristic doesn't recognize. Real fix: have the layout itself answer
`isLayoutSubPath(path)` via the SPI rather than relying on a string heuristic.
Deferring to the same follow-up that addresses the compaction/cleaning fan-out,
since both want the layout to be the source of truth for sub-path
identification.
##########
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##########
Review Comment:
Acknowledged — same root cause as the hudi-agent comment at
https://github.com/apache/hudi/pull/19045#discussion_r3464757989. The
realignment in `getRecordTagger` makes incremental-stats-driven plans usable,
but compaction's
`BaseHoodieCompactionPlanGenerator.getLatestFileSlicesStateless(logical)` and
cleaning's `getPartitionPathsForFullCleaning()` still go through the
file-system view under logical names with no layout-aware fan-out. Deferring to
a bucketing-table-services follow-up patch; happy to chat on the design before
that lands.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]