nsivabalan opened a new pull request, #18865:
URL: https://github.com/apache/hudi/pull/18865
### Describe the issue this Pull Request addresses
Follow-up to #18353, which added
`hoodie.metadata.record.level.index.defer.init` to defer Record Level Index
(RLI) bootstrap to the 2nd commit on a fresh table. The original change relied
on `dataWriteConfig.getWriteSchema()` to resolve the read/data schema when
initializing the RLI partition. That schema is not always populated:
- When the metadata writer is constructed on the 2nd commit (after a
deferred first commit), the write config used to build the metadata writer may
not carry the Avro schema string.
- When the metadata writer is constructed outside an active write (e.g. via
`metadataWriter(writeConfig)` for reads), the same gap exists.
In those cases `HoodieSchema.parse(dataWriteConfig.getWriteSchema())` fails,
which blocks RLI from initializing on commit #2. The deferred path with
`bulk_insert` hit this bug.
### Summary and Changelog
**Core fix (`HoodieBackedTableMetadataWriter.java`)**
- Plumb a resolved `HoodieSchema` argument through the RLI init chain:
`initializeFilegroupsAndCommitToRecordIndexPartition` →
`initializeFilegroupsAndCommitToPartitionedRecordIndexPartition` →
`initializeRecordIndexPartition` → `readRecordKeysFromFileSliceSnapshot`.
Replaces the inline `HoodieSchema.parse(dataWriteConfig.getWriteSchema())`
previously evaluated inside the executor closure.
- New `resolveRecordIndexInitSchema(...)` helper: prefer
`dataWriteConfig.getWriteSchema()`; on empty, fall back to
`HoodieTableMetadataUtil.tryResolveSchemaForTable(dataMetaClient)` (the latest
committed schema). Throws a clear `HoodieMetadataException` when neither is
resolvable.
- Renamed the local `Lazy<Option<HoodieSchema>> tableSchema` →
`tableSchemaLazy` at the call site for clarity; javadoc on
`readRecordKeysFromFileSliceSnapshot` updated.
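
A minimal sketch of the fallback order described for `resolveRecordIndexInitSchema(...)`, not the code in this PR. It assumes schemas are plain strings, and every name in it (`RecordIndexSchemaFallbackSketch`, `resolveInitSchema`, `SchemaResolutionException`) is a hypothetical stand-in for the Hudi types: the first argument plays the role of `dataWriteConfig.getWriteSchema()`, the supplier plays the role of `HoodieTableMetadataUtil.tryResolveSchemaForTable(dataMetaClient)`, and the exception stands in for `HoodieMetadataException`.

```java
import java.util.Optional;
import java.util.function.Supplier;

// Illustrative only: schemas are modelled as strings, not HoodieSchema.
final class RecordIndexSchemaFallbackSketch {

  // Hypothetical stand-in for HoodieMetadataException.
  static final class SchemaResolutionException extends RuntimeException {
    SchemaResolutionException(String message) {
      super(message);
    }
  }

  // Prefer the schema carried by the write config; if it is null or blank,
  // fall back to the latest committed table schema; fail loudly otherwise.
  static String resolveInitSchema(String configuredSchema,
                                  Supplier<Optional<String>> latestCommittedSchema) {
    if (configuredSchema != null && !configuredSchema.trim().isEmpty()) {
      return configuredSchema;
    }
    // The supplier is only invoked when the config schema is missing,
    // so the table lookup is skipped on the common path.
    return latestCommittedSchema.get()
        .filter(schema -> !schema.trim().isEmpty())
        .orElseThrow(() -> new SchemaResolutionException(
            "Cannot resolve a schema for record index initialization: "
                + "write config has none and the table has no committed schema"));
  }

  public static void main(String[] args) {
    // Config schema present: it is used as-is.
    System.out.println(resolveInitSchema("schema-from-config",
        () -> Optional.of("schema-from-table")));
    // Config schema empty: the committed table schema is used instead.
    System.out.println(resolveInitSchema("",
        () -> Optional.of("schema-from-table")));
  }
}
```

In this sketch the schema is resolved once, before any per-file work starts, and the resolved value is what would be passed down the init chain instead of being re-parsed inside the executor closure.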
**Tests (`TestRecordLevelIndex.scala`)**
- Extended `testRecordLevelIndex` with a `deferRLIInit` parameter. When set,
the test asserts that after the first save the RLI partition is NOT yet present
in the metadata table config. It then proceeds through the existing assertion
flow, which builds the metadata writer (triggering deferred init on the 2nd
entry).
- Added `testPartitionedRecordLevelIndexDefer(streamingWriteEnabled)` which
drives the deferred path via the existing helper and then verifies compaction.
- Added
`testPartitionedRecordLevelIndexDeferWithBulkInsert(streamingWriteEnabled)`:
commit #1 and commit #2 are both `bulk_insert` against a fresh table with defer
enabled. Validates:
  - After commit #1 the RLI metadata partition is not initialized.
  - After commit #2 the deferred RLI bootstrap completes (partition present,
    partitioned RLI type).
  - Record-key → location mapping is correct across all data partitions for
    both batches, including cross-partition negative lookups.
### Impact
User-facing changes: none beyond what was introduced in #18353. This is a
follow-up bug fix that makes the opt-in deferred RLI init flow work on the 2nd
commit (including `bulk_insert`).
Performance impact: none.
### Risk Level
low
### Documentation Update
none
### Contributor's checklist
- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [x] Adequate tests were added if applicable
🤖 Generated with [Claude Code](https://claude.com/claude-code)