[ https://issues.apache.org/jira/browse/SPARK-56745?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

ASF GitHub Bot updated SPARK-56745:
-----------------------------------
    Labels: pull-request-available  (was: )

> Cache foldable ZoneId in ConvertTimezone to avoid per-row lookup
> ----------------------------------------------------------------
>
>                 Key: SPARK-56745
>                 URL: https://issues.apache.org/jira/browse/SPARK-56745
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 4.1.1
>            Reporter: Tongwei
>            Priority: Major
>              Labels: pull-request-available
>
> The `ConvertTimezone` expression resolves both source and target timezone
>   arguments via `DateTimeUtils.getZoneId` on every row, even when the timezone
>   arguments are constant literals -- which is the typical usage:
>       convert_timezone('UTC', 'America/Los_Angeles', ts_col)
>   Each `getZoneId` call performs a regex normalization
>   (`Pattern.matcher().replaceFirst()`) followed by a
>   `ZoneId.of(..., ZoneId.SHORT_IDS)` lookup, which goes through
>   `ZoneRulesProvider`'s internal map. Doing this twice per row is wasteful
>   when the result is the same for the entire query.
>   The codegen paths of sibling expressions `FromUTCTimestamp` and
>   `ToUTCTimestamp` already cache the foldable `ZoneId` via `addMutableState`
>   (see `datetimeExpressions.scala:1810-1844`). This proposal brings
>   `ConvertTimezone` in line:
>     * Add a `@transient lazy val` for foldable source/target zones
>       (interpreted path).
>     * Generate `addMutableState`-cached `ZoneId` terms when timezone args are
>       foldable (codegen path); fall back to per-row resolution otherwise.
>     * Add a `convertTimestampNtzToAnotherTz(ZoneId, ZoneId, Long)` overload in
>       `DateTimeUtils` so callers can pass pre-resolved zones.
>     * Short-circuit to NULL when a foldable timezone literal is null.
>   Expected impact: ~1.3-2x speedup of the `convert_timezone` function in the
>   common foldable-arguments case; meaningful (single-digit to low double-digit
>   percentage) end-to-end speedup for ETL workloads where `convert_timezone` is
>   on the hot path.
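
For illustration only, the difference between per-row and cached zone
resolution described in the issue can be sketched in plain Java. This is not
Spark's code: the class `ZoneCacheSketch`, its methods, the `lookups` counter,
and the normalization pattern are all assumptions made for the example. Only
`ZoneId.of(..., ZoneId.SHORT_IDS)` and the regex `replaceFirst` step come from
the description above.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.util.regex.Pattern;

// Hypothetical sketch of per-row versus cached zone resolution.
// Class, method, and pattern names are illustrative, not Spark's.
final class ZoneCacheSketch {
    // Assumed stand-in for the regex normalization: pads "+8:00" to "+08:00".
    private static final Pattern SINGLE_DIGIT_HOUR = Pattern.compile("([+-])(\\d):");

    // Counts resolutions so the two strategies can be compared.
    static int lookups = 0;

    // Regex normalization followed by a ZoneId lookup, as the issue describes.
    static ZoneId resolveZone(String id) {
        lookups++;
        String normalized = SINGLE_DIGIT_HOUR.matcher(id).replaceFirst("$10$2:");
        return ZoneId.of(normalized, ZoneId.SHORT_IDS);
    }

    // Reinterprets a zone-less timestamp from the source zone in the target zone.
    static LocalDateTime shift(ZoneId source, ZoneId target, LocalDateTime ts) {
        return ts.atZone(source).withZoneSameInstant(target).toLocalDateTime();
    }

    // Per-row strategy: both zones are resolved again for every row.
    static LocalDateTime[] convertPerRow(String src, String dst, LocalDateTime[] rows) {
        LocalDateTime[] out = new LocalDateTime[rows.length];
        for (int i = 0; i < rows.length; i++) {
            out[i] = shift(resolveZone(src), resolveZone(dst), rows[i]);
        }
        return out;
    }

    // Cached strategy: constant zone arguments are resolved once, then reused.
    static LocalDateTime[] convertCached(String src, String dst, LocalDateTime[] rows) {
        ZoneId source = resolveZone(src);
        ZoneId target = resolveZone(dst);
        LocalDateTime[] out = new LocalDateTime[rows.length];
        for (int i = 0; i < rows.length; i++) {
            out[i] = shift(source, target, rows[i]);
        }
        return out;
    }

    public static void main(String[] args) {
        LocalDateTime[] rows = {
            LocalDateTime.of(2024, 1, 15, 12, 0),
            LocalDateTime.of(2024, 7, 15, 12, 0)
        };
        convertPerRow("UTC", "America/Los_Angeles", rows);
        System.out.println("per-row lookups: " + lookups);
        lookups = 0;
        convertCached("UTC", "America/Los_Angeles", rows);
        System.out.println("cached lookups: " + lookups);
    }
}
```

In this sketch the per-row variant performs two resolutions per row, while the
cached variant performs two in total regardless of row count. In Spark the
cached value would presumably live in the `@transient lazy val` or in
`addMutableState` state, as the proposal lists.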

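The last two bullets (an overload taking pre-resolved zones, and a NULL
short-circuit for null timezone literals) might look roughly like the
following. This is a sketch under stated assumptions, not the actual patch:
`PreResolvedConvertSketch`, `convertNtz`, and `eval` are hypothetical names,
and the timestamp is assumed to be a zone-less value encoded as microseconds
since the epoch.

```java
import java.time.LocalDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

// Hypothetical sketch; names and signatures are illustrative, not Spark's.
final class PreResolvedConvertSketch {
    // Resolved once at construction; null when the literal itself was NULL.
    private final ZoneId source;
    private final ZoneId target;

    PreResolvedConvertSketch(String sourceLiteral, String targetLiteral) {
        this.source = sourceLiteral == null ? null : ZoneId.of(sourceLiteral, ZoneId.SHORT_IDS);
        this.target = targetLiteral == null ? null : ZoneId.of(targetLiteral, ZoneId.SHORT_IDS);
    }

    // Overload taking pre-resolved zones: no string parsing or zone lookup here.
    static long convertNtz(ZoneId source, ZoneId target, long micros) {
        long seconds = Math.floorDiv(micros, 1_000_000L);
        int nanos = (int) (Math.floorMod(micros, 1_000_000L) * 1_000L);
        LocalDateTime local = LocalDateTime.ofEpochSecond(seconds, nanos, ZoneOffset.UTC);
        LocalDateTime shifted =
            local.atZone(source).withZoneSameInstant(target).toLocalDateTime();
        return shifted.toEpochSecond(ZoneOffset.UTC) * 1_000_000L + shifted.getNano() / 1_000;
    }

    // String overload for non-constant arguments: resolves zones on each call.
    static long convertNtz(String source, String target, long micros) {
        return convertNtz(
            ZoneId.of(source, ZoneId.SHORT_IDS), ZoneId.of(target, ZoneId.SHORT_IDS), micros);
    }

    // Per-row evaluation: a null zone literal yields NULL for every row.
    Long eval(long micros) {
        if (source == null || target == null) {
            return null;
        }
        return convertNtz(source, target, micros);
    }

    public static void main(String[] args) {
        long noon = LocalDateTime.of(2024, 1, 15, 12, 0).toEpochSecond(ZoneOffset.UTC) * 1_000_000L;
        System.out.println(new PreResolvedConvertSketch("UTC", "America/Los_Angeles").eval(noon));
        System.out.println(new PreResolvedConvertSketch(null, "UTC").eval(noon));
    }
}
```

Keeping the string-based overload alongside the `ZoneId`-based one mirrors the
proposal's fallback to per-row resolution when the timezone arguments are not
foldable.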


--
This message was sent by Atlassian Jira
(v8.20.10#820010)

