Tongwei created SPARK-56745:
-------------------------------
Summary: Cache foldable ZoneId in ConvertTimezone to avoid per-row lookup
Key: SPARK-56745
URL: https://issues.apache.org/jira/browse/SPARK-56745
Project: Spark
Issue Type: Improvement
Components: SQL
Affects Versions: 4.1.1
Reporter: Tongwei
The `ConvertTimezone` expression resolves both source and target timezone
arguments via `DateTimeUtils.getZoneId` on every row, even when the timezone
arguments are constant literals -- which is the typical usage:
    convert_timezone('UTC', 'America/Los_Angeles', ts_col)
Each `getZoneId` call performs a regex normalization
(`Pattern.matcher().replaceFirst()`) followed by a
`ZoneId.of(..., ZoneId.SHORT_IDS)` lookup, which goes through
`ZoneRulesProvider`'s internal map. Doing this twice per row is wasteful when
the result is the same for the entire query.
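To make the per-row cost concrete, here is a minimal plain-Java sketch of the two steps described above: a regex normalization followed by `ZoneId.of` with `ZoneId.SHORT_IDS`. It is not Spark's actual `getZoneId`. The class name `ZoneIdLookupSketch`, the method `resolveZone`, and the single-digit-hour pattern are assumptions made for illustration only.

```java
import java.time.ZoneId;
import java.util.regex.Pattern;

// Illustrative only: a stand-in for the kind of work done on each lookup,
// not a copy of Spark's DateTimeUtils.getZoneId.
final class ZoneIdLookupSketch {
    // Hypothetical pattern: a sign followed by a single-digit hour, e.g. "+8:00".
    private static final Pattern SINGLE_DIGIT_HOUR =
        Pattern.compile("(?<sign>[+-])(?<hour>\\d):");

    // Step 1: regex normalization (pads "+8:00" to "+08:00").
    // Step 2: lookup that also accepts short aliases such as "PST".
    static ZoneId resolveZone(String id) {
        String normalized =
            SINGLE_DIGIT_HOUR.matcher(id).replaceFirst("${sign}0${hour}:");
        return ZoneId.of(normalized, ZoneId.SHORT_IDS);
    }

    public static void main(String[] args) {
        long[] rows = {0L, 1L, 2L};

        // Per-row resolution: the same constant string is re-parsed for every row.
        for (long row : rows) {
            ZoneId perRow = resolveZone("America/Los_Angeles");
            System.out.println(row + " -> " + perRow);
        }

        // Hoisted resolution: parse once, reuse for every row.
        ZoneId cached = resolveZone("America/Los_Angeles");
        for (long row : rows) {
            System.out.println(row + " -> " + cached);
        }
    }
}
```

Both loops print the same zone for every row; the second simply does the normalization and lookup once instead of once per row, which is the saving this issue is after.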
The codegen paths of sibling expressions `FromUTCTimestamp` and `ToUTCTimestamp`
already cache the foldable `ZoneId` via `addMutableState` (see
`datetimeExpressions.scala:1810-1844`). This proposal brings `ConvertTimezone`
in line:
* Add a `@transient lazy val` for foldable source/target zones (interpreted
path).
* Generate `addMutableState`-cached `ZoneId` terms when timezone args are
foldable (codegen path); fall back to per-row resolution otherwise.
* Add a `convertTimestampNtzToAnotherTz(ZoneId, ZoneId, Long)` overload in
`DateTimeUtils` so callers can pass pre-resolved zones.
* Short-circuit to NULL when a foldable timezone literal is null.
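The shape of the proposed change can be sketched in plain Java, under stated assumptions. This is not the Spark patch: the class `ConvertTimezoneSketch` and the methods `convertNtz` and `eval` are hypothetical names, the timestamp is assumed to be microseconds since 1970-01-01T00:00 with no zone attached, and the constructor models only the case where both timezone arguments are constant.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.temporal.ChronoUnit;

// Illustrative only: models the foldable-arguments case outside of Spark.
final class ConvertTimezoneSketch {
    private static final LocalDateTime EPOCH = LocalDateTime.of(1970, 1, 1, 0, 0);

    // Zones resolved once at construction time; null when a literal was null.
    private final ZoneId source;
    private final ZoneId target;
    private final boolean nullLiteral;

    ConvertTimezoneSketch(String constSource, String constTarget) {
        this.nullLiteral = constSource == null || constTarget == null;
        this.source = nullLiteral ? null : ZoneId.of(constSource, ZoneId.SHORT_IDS);
        this.target = nullLiteral ? null : ZoneId.of(constTarget, ZoneId.SHORT_IDS);
    }

    // Overload taking pre-resolved zones: no string parsing on this path.
    static long convertNtz(ZoneId source, ZoneId target, long micros) {
        LocalDateTime local = EPOCH.plus(micros, ChronoUnit.MICROS);
        LocalDateTime shifted =
            local.atZone(source).withZoneSameInstant(target).toLocalDateTime();
        return ChronoUnit.MICROS.between(EPOCH, shifted);
    }

    // Fallback for non-constant arguments: resolves both zones on every call.
    static long convertNtz(String source, String target, long micros) {
        return convertNtz(
            ZoneId.of(source, ZoneId.SHORT_IDS),
            ZoneId.of(target, ZoneId.SHORT_IDS),
            micros);
    }

    // Per-row evaluation: short-circuits to null, otherwise reuses cached zones.
    Long eval(Long micros) {
        if (nullLiteral || micros == null) {
            return null;
        }
        return convertNtz(source, target, micros);
    }

    public static void main(String[] args) {
        ConvertTimezoneSketch conv =
            new ConvertTimezoneSketch("UTC", "America/Los_Angeles");
        // Midnight on 1970-01-01 in UTC, expressed as wall-clock time in Los Angeles.
        System.out.println(conv.eval(0L));
    }
}
```

In this sketch the constructor plays the role of the `@transient lazy val` (interpreted path) or the `addMutableState`-cached term (codegen path), while the string overload corresponds to the per-row fallback for non-foldable arguments.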
Expected impact: ~1.3-2x speedup of the `convert_timezone` function in the
common foldable-arguments case; meaningful (single-digit to low double-digit
percentage) end-to-end speedup for ETL workloads where `convert_timezone` is
on the hot path.