voonhous opened a new issue, #18996: URL: https://github.com/apache/hudi/issues/18996
### Describe the problem

Two per-record costs on the metadata-table read path, both in `MetadataPartitionType` / `HoodieMetadataPayload`:

1. `MetadataPartitionType.get(int)` iterates `values()`, which clones the enum constant array on every call. It runs once per record materialized from the metadata table: the `HoodieMetadataPayload(Option<GenericRecord>)` constructor calls `MetadataPartitionType.get(type).constructMetadataPayload(...)` for every record returned by RLI / secondary-index / column-stats lookups and full scans, and `preCombine` repeats it for every key merged from MDT log files.
2. `RECORD_INDEX.constructMetadataPayload` decodes the numeric record-index fields with `Long.parseLong(record.get(field).toString())` / `Integer.parseInt(...)`, even though the Avro generic record already holds boxed `Long` / `Integer` values matching the `long` / `int` field types in `HoodieMetadata.avsc`. That is five `String` allocations plus five parses per materialized RLI record.

For upsert tagging that reads millions of RLI entries, this is pure per-record garbage and CPU on the index-lookup hot path.

### Proposed fix

1. Cache `values()` once in a `private static final MetadataPartitionType[]` and iterate that in `get(int)`. The linear scan and the `IllegalArgumentException` for unknown types are unchanged. A direct-index lookup table is avoided because `EXPRESSION_INDEX` has record type `-1`.
2. Read the numeric record-index fields directly via `((Number) record.get(field)).longValue()` / `.intValue()` instead of `toString` + parse. The avsc declares these fields `long` / `int`, so the values are always `Long` / `Integer`, and RLI records always populate them (UUID encoding sets the bits, raw encoding sets `-1` sentinels). The string fields (`partition`, `fileId`) keep `.toString()`. Reconstructed records are identical.

Behavior-preserving; verified with an avro write/read round-trip over both fileId encodings.

Will raise a PR for this.
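For illustration, the two changes can be sketched roughly as below. This is a standalone sketch and not the actual Hudi code: `PartitionType` is a hypothetical stand-in for `MetadataPartitionType`, the record-type numbers other than `-1` for `EXPRESSION_INDEX` are made up, a plain `Map<String, Object>` stands in for the Avro `GenericRecord`, and the field names `highBits` / `fileIndex` are placeholders rather than the names in `HoodieMetadata.avsc`.

```java
import java.util.HashMap;
import java.util.Map;

final class MdtReadPathSketch {

  // Hypothetical stand-in for MetadataPartitionType. Only the -1 record type
  // of EXPRESSION_INDEX comes from the issue; the other numbers are made up.
  enum PartitionType {
    EXPRESSION_INDEX(-1),
    FILES(2),
    RECORD_INDEX(5);

    // values() returns a fresh copy of the constant array on every call,
    // so take a single copy at class-initialization time and reuse it.
    private static final PartitionType[] CACHED_VALUES = values();

    private final int recordType;

    PartitionType(int recordType) {
      this.recordType = recordType;
    }

    // Same linear scan and same exception for unknown types as before;
    // the only difference is that no array is allocated per call.
    static PartitionType get(int type) {
      for (PartitionType candidate : CACHED_VALUES) {
        if (candidate.recordType == type) {
          return candidate;
        }
      }
      throw new IllegalArgumentException("No partition type for record type: " + type);
    }
  }

  // Old style: allocate a String, then parse it back into a number.
  static long readLongViaString(Map<String, Object> record, String field) {
    return Long.parseLong(record.get(field).toString());
  }

  // Proposed style: the value is already a boxed Long/Integer, so unbox it.
  static long readLong(Map<String, Object> record, String field) {
    return ((Number) record.get(field)).longValue();
  }

  static int readInt(Map<String, Object> record, String field) {
    return ((Number) record.get(field)).intValue();
  }

  public static void main(String[] args) {
    // Placeholder record; a real RLI record would be an Avro GenericRecord.
    Map<String, Object> record = new HashMap<>();
    record.put("highBits", Long.valueOf(-1L)); // sentinel, as in the raw encoding
    record.put("fileIndex", Integer.valueOf(7));

    System.out.println(PartitionType.get(-1));
    System.out.println(readLong(record, "highBits") == readLongViaString(record, "highBits"));
    System.out.println(readInt(record, "fileIndex"));
  }
}
```

Casting to `Number` rather than to `Long` / `Integer` keeps the read tolerant of either boxed type, which is why one helper shape covers both the `long` and the `int` fields. Both the old and the new read fail on a missing (null) field, so the change relies on RLI records always populating these fields, as noted above.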
