nsivabalan opened a new issue, #18866: URL: https://github.com/apache/hudi/issues/18866
### Tips before filing an issue

- Have you gone through our [FAQs](https://hudi.apache.org/learn/faq)?
- Join the mailing list to engage in conversations and get faster support at [email protected].
- If you have triaged this as a bug, then file an [issue](https://issues.apache.org/jira/projects/HUDI/issues) directly.

### Describe the problem you faced

Follow-up to #18353, which introduced `hoodie.metadata.record.level.index.defer.init` to defer Record Level Index (RLI) bootstrap to the 2nd commit on a fresh table.

The current implementation resolves the data schema used to read base/log files during RLI init via `HoodieSchema.parse(dataWriteConfig.getWriteSchema())`. The write config schema is not always populated when the metadata writer is constructed in the deferred path:

- On the 2nd commit (after a deferred first commit), the write config used to build the metadata writer may not carry the avro schema string.
- The metadata writer may be constructed outside an active write (e.g. read-side helpers building a `metadataWriter(writeConfig)`).

In those cases the inline parse fails (NPE / empty schema parse) and the deferred RLI bootstrap on commit #2 breaks. The bulk_insert path on a fresh table with defer enabled hits this consistently.

### To Reproduce

Steps to reproduce the behavior:

1. Create a fresh Hudi MoR table with `hoodie.metadata.record.level.index.defer.init=true` and RLI enabled (partitioned).
2. Issue commit #1 as `bulk_insert` (Overwrite). RLI is correctly deferred: the RLI partition is not initialized.
3. Issue commit #2 as `bulk_insert` (Append). On the 2nd metadata-writer entry, deferred RLI init runs, but `dataWriteConfig.getWriteSchema()` returns an empty schema in this path, and RLI bootstrap fails.
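For anyone trying the steps above, the writer options might look roughly like the sketch below. Only `hoodie.metadata.record.level.index.defer.init` and the `bulk_insert` operation come from this issue. The table name and the `id`/`dt` field names are hypothetical, and the remaining keys (including `hoodie.metadata.record.index.enable` for turning RLI on) are my understanding of the usual Hudi datasource options and should be checked against the build under test. The Spark write itself is shown only as a comment so the snippet runs without Spark on the classpath.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of the writer options for the two bulk_insert commits in the repro. */
final class DeferredRliReproOptions {

    /** Options shared by both commits. Table and field names are hypothetical. */
    static Map<String, String> writeOptions() {
        Map<String, String> opts = new LinkedHashMap<>();
        opts.put("hoodie.table.name", "rli_defer_repro");                  // hypothetical name
        opts.put("hoodie.datasource.write.table.type", "MERGE_ON_READ");   // MoR table (step 1)
        opts.put("hoodie.datasource.write.operation", "bulk_insert");      // steps 2 and 3
        opts.put("hoodie.datasource.write.recordkey.field", "id");         // hypothetical field
        opts.put("hoodie.datasource.write.partitionpath.field", "dt");     // hypothetical field
        opts.put("hoodie.metadata.enable", "true");
        opts.put("hoodie.metadata.record.index.enable", "true");           // assumed RLI switch
        opts.put("hoodie.metadata.record.level.index.defer.init", "true"); // flag from #18353
        return opts;
    }

    /** Save modes for commit #1 and commit #2, in order. */
    static String[] saveModes() {
        return new String[] {"Overwrite", "Append"};
    }

    public static void main(String[] args) {
        String[] modes = saveModes();
        for (int i = 0; i < modes.length; i++) {
            System.out.println("commit #" + (i + 1) + " mode=" + modes[i]);
            writeOptions().forEach((k, v) -> System.out.println("  " + k + "=" + v));
            // With Spark and Hudi available, each commit would be roughly:
            // df.write().format("hudi").options(writeOptions()).mode(modes[i]).save(basePath);
        }
    }
}
```

Both commits use the same options; only the save mode differs, so the failure on commit #2 would stem from the deferred init path rather than from a configuration change between the writes.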
### Expected behavior

The deferred RLI partition should bootstrap successfully on the 2nd commit (or on any subsequent metadata writer construction once there is at least one completed instant on the data table), regardless of whether the write config carries an explicit avro schema string. The latest committed table schema is a safe fallback.

### Environment Description

- Hudi version: master (1.2.0-rc2 line)
- Spark version: 3.x
- Storage (HDFS/S3/GCS..): n/a
- Running on Docker? (yes/no): no

### Additional context

- Original feature PR: #18353
- Affected code path: `HoodieBackedTableMetadataWriter#initializeFromFilesystem` → `initializeFilegroupsAndCommitToRecordIndexPartition` → `readRecordKeysFromFileSliceSnapshot`, where the data schema is parsed from `dataWriteConfig.getWriteSchema()`.

### Stacktrace

`Cannot initialize record level index` / NPE from `HoodieSchema.parse(null)` inside the executor closure of `readRecordKeysFromFileSliceSnapshot` when the deferred RLI bootstrap is triggered with an unset write schema.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
