nsivabalan opened a new pull request, #18990:
URL: https://github.com/apache/hudi/pull/18990
### Describe the issue this Pull Request addresses
Closes #18989.
### Summary and Changelog
`HoodieDatasetBulkInsertHelper.prepareForBulkInsert` was routing every
key-generator through `df.queryExecution.toRdd.mapPartitions(...)`, forcing an
RDD round-trip and per-row reflection-based keygen invocation even for the
common keygens where the record-key and partition-path values can be sourced
directly from input columns.
This patch restores tiered dispatch:
- **Tier 1 — `NonpartitionedKeyGenerator`** (single record-key field): emits
`col(rk).cast(String)` + `lit("")` as Catalyst columns. No UDF, no toRdd
round-trip.
- **Tier 2 — `SimpleKeyGenerator`** (single record-key + single
partition-path field, URL-encoding off, slash-separated dates off): emits
`col(rk).cast(String)` and a partition-path expression mirroring
`PartitionPathFormatterBase#combine`, including the `handleEmpty ->
__HIVE_DEFAULT_PARTITION__` substitution and hive-style `<field>=` prefixing.
- **Tier 3 — everything else** (multi-field keys, `ComplexKeyGenerator`,
`TimestampBasedKeyGenerator`, `CustomKeyGenerator`, `SimpleKeyGenerator` with
URL-encode or slash-separated dates): anonymous `functions.udf(...)` over a
struct of input columns calling the canonical
`BuiltinKeyGenerator.getRecordKey(Row)` / `getPartitionPath(Row)`. The UDFs are
not registered against the `SparkSession`, so nothing leaks across writes.
- **Auto-record-key generation** keeps the existing RDD path; it needs
`TaskContext.partitionId` and a stateful per-task counter, which can't be
expressed cleanly as a driver-side closure.
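
The dispatch rules listed above can be sketched as a small pure-Scala decision function. This is only an illustration of the tiering described here, not the code in the patch: `KeyGenSpec`, `chooseTier`, and the tier names are hypothetical, the key generator is matched by simple class name for brevity, and checking auto-record-key generation first is an assumption.

```scala
object KeyGenTierSketch {
  // Hypothetical labels for the four paths described in the list above.
  sealed trait Tier
  case object Tier1Columns extends Tier // NonpartitionedKeyGenerator, plain Catalyst columns
  case object Tier2Columns extends Tier // SimpleKeyGenerator, default or hive-style only
  case object Tier3Udf extends Tier     // anonymous UDF calling the canonical keygen
  case object AutoKeyRdd extends Tier   // auto-record-key generation stays on the RDD path

  // Hypothetical summary of the write config; the real helper reads these
  // values from the Hudi write config rather than from a case class.
  final case class KeyGenSpec(
      keyGenClass: String,
      recordKeyFields: Seq[String],
      partitionPathFields: Seq[String],
      urlEncode: Boolean = false,
      slashSeparatedDates: Boolean = false,
      autoGenerateRecordKeys: Boolean = false)

  def chooseTier(spec: KeyGenSpec): Tier = {
    val singleKey = spec.recordKeyFields.size == 1
    val singlePartition = spec.partitionPathFields.size == 1
    if (spec.autoGenerateRecordKeys) {
      // Assumption: auto-key generation is checked before any fast path.
      AutoKeyRdd
    } else spec.keyGenClass match {
      case "NonpartitionedKeyGenerator" if singleKey =>
        Tier1Columns
      case "SimpleKeyGenerator"
          if singleKey && singlePartition && !spec.urlEncode && !spec.slashSeparatedDates =>
        Tier2Columns
      case _ =>
        // Multi-field keys, Complex/TimestampBased/Custom keygens, and
        // SimpleKeyGenerator with URL encoding or slash-separated dates.
        Tier3Udf
    }
  }
}
```

Under these assumptions, a `SimpleKeyGenerator` config with URL encoding enabled falls through to the UDF tier, while the same config with only hive-style partitioning stays on the column fast path (hive-style is not part of the tier decision in this sketch).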
The Tier 3 UDF goes through the `Row`-overload keygen API, which uses the
canonical `String` formatter, so all three partition-formatter flags
(hive-style, URL encode, slash-separated dates) remain honored for the keygens
that fall through. The Tier 2 fast-path encodes only the default and hive-style
flag subset (URL encoding has no efficient pure-Catalyst equivalent; the 1.2.0+
slash-separated branch exercises a separate code path we'd rather not encode
twice).
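
As a rough model of the flag subset the Tier 2 fast-path encodes, the partition-path rule can be written as a plain Scala function. This is a sketch of the semantics described above (empty value replaced by `__HIVE_DEFAULT_PARTITION__`, optional hive-style `<field>=` prefix), not the Catalyst expression from the patch; the `partitionPath` helper is hypothetical, and treating `null` the same as an empty string is an assumption.

```scala
object Tier2PartitionPathSketch {
  val HiveDefaultPartition = "__HIVE_DEFAULT_PARTITION__"

  // Hypothetical helper modelling the default and hive-style cases only.
  // URL encoding and slash-separated dates are deliberately not handled,
  // mirroring the fact that those configurations fall through to Tier 3.
  def partitionPath(field: String, value: String, hiveStyle: Boolean): String = {
    // Assumption: null is treated like an empty partition value.
    val resolved =
      if (value == null || value.isEmpty) HiveDefaultPartition else value
    if (hiveStyle) s"$field=$resolved" else resolved
  }
}
```

For example, `partitionPath("region", "", hiveStyle = true)` yields `region=__HIVE_DEFAULT_PARTITION__`, which is the behaviour `testTier2EmptyPartitionValueSubstitutedWithHiveDefault` is described as checking under the hive-style flag.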
New tests in `TestHoodieDatasetBulkInsertHelper`:
- `testKeyGenParityAgainstAvroGroundTruth` (parameterized, 11 cases) — every
supported keygen class plus the `SimpleKeyGen` flag combos (default / hive /
slash / hive+slash / URL / hive+URL / Complex single+multi / TimestampBased /
Custom). Each case asserts the helper's record-key and partition-path output
matches `BuiltinKeyGenerator`'s Avro path byte-for-byte.
- `testFastPathCastsNonStringRecordKey` — Tier 1/2 must materialize the
string form of a non-string record-key column (uses `ts: long`).
- `testFastPathAvoidsUdf` — Tier 1/2 analyzed logical plans must not contain
a `ScalaUDF` node (i.e. they actually benefit from Catalyst codegen).
- `testTier2EmptyPartitionValueSubstitutedWithHiveDefault` — empty partition
values resolve to `__HIVE_DEFAULT_PARTITION__` under both default and
hive-style flags.
- `testUdfPathRespectsDriverSessionTimezone` — Tier 3 UDF picks up the
driver's `spark.sql.session.timeZone` (guards against executor JVM default
leakage on `TimestampBasedKeyGenerator`).
### Impact
Performance: restores per-row Catalyst codegen for bulk inserts that use
`NonpartitionedKeyGenerator` or `SimpleKeyGenerator` (with default or
hive-style partitioning) — the most common configurations in practice. No
behaviour change for the keygens that fall through to Tier 3; their output is
byte-identical to the prior RDD path (and to the Avro ground truth, which the
parity test now enforces).
No public API change. No config change. No storage format change.
### Risk Level
Low. The change is contained to
`HoodieDatasetBulkInsertHelper.prepareForBulkInsert` (Scala helper, no public
API surface) and the parity test exhaustively checks every keygen + formatter
combination against the canonical Avro keygen output. The Tier 3 fallback is
a UDF that replaces the prior RDD path and calls the same canonical keygen
API, so any keygen the fast paths don't claim continues to use the same
canonical formatter.
### Documentation Update
None.
### Contributor's checklist
- [x] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [x] Enough context is provided in the sections above
- [x] Adequate tests were added
🤖 Generated with [Claude Code](https://claude.com/claude-code)