Tongwei created SPARK-56745:
-------------------------------

             Summary: Cache foldable ZoneId in ConvertTimezone to avoid per-row lookup
                 Key: SPARK-56745
                 URL: https://issues.apache.org/jira/browse/SPARK-56745
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 4.1.1
            Reporter: Tongwei


The `ConvertTimezone` expression resolves both source and target timezone
  arguments via `DateTimeUtils.getZoneId` on every row, even when the timezone
  arguments are constant literals -- which is the typical usage:

      convert_timezone('UTC', 'America/Los_Angeles', ts_col)

  Each `getZoneId` call performs a regex normalization
  (`Pattern.matcher().replaceFirst()`) followed by a
  `ZoneId.of(..., ZoneId.SHORT_IDS)` lookup, which goes through
  `ZoneRulesProvider`'s internal map. Doing this twice per row is wasteful when
  the result is the same for the entire query.
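
  The cost difference can be sketched outside Spark with plain `java.time`.
  This is only an illustration under stated assumptions: `resolveZone` is a
  hypothetical stand-in for `DateTimeUtils.getZoneId` (the regex is an
  approximation, not Spark's exact pattern), and `perRow` / `hoisted` are
  made-up names for the current and proposed behavior.

```scala
import java.time.{LocalDateTime, ZoneId}
import java.util.regex.Pattern

object ZoneLookupSketch {
  // Hypothetical normalization: pads a single-digit hour offset, e.g. "+8:00" -> "+08:00".
  private val singleDigitHour: Pattern = Pattern.compile("(\\+|\\-)(\\d):")

  // Stand-in for DateTimeUtils.getZoneId: regex normalization, then a ZoneId lookup.
  def resolveZone(id: String): ZoneId = {
    val normalized = singleDigitHour.matcher(id).replaceFirst("$10$2:")
    ZoneId.of(normalized, ZoneId.SHORT_IDS)
  }

  // Reinterprets a zone-less timestamp from the source zone into the target zone.
  def convert(src: ZoneId, dst: ZoneId, ts: LocalDateTime): LocalDateTime =
    ts.atZone(src).withZoneSameInstant(dst).toLocalDateTime

  // Current behavior as described above: both zones are resolved again for every row.
  def perRow(srcId: String, dstId: String, rows: Seq[LocalDateTime]): Seq[LocalDateTime] =
    rows.map(ts => convert(resolveZone(srcId), resolveZone(dstId), ts))

  // Proposed behavior for constant arguments: resolve once, reuse for all rows.
  def hoisted(srcId: String, dstId: String, rows: Seq[LocalDateTime]): Seq[LocalDateTime] = {
    val src = resolveZone(srcId)
    val dst = resolveZone(dstId)
    rows.map(ts => convert(src, dst, ts))
  }
}
```

  Both functions return the same values; `hoisted` simply performs two zone
  lookups in total instead of two per row.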

  The codegen paths of sibling expressions `FromUTCTimestamp` and
  `ToUTCTimestamp` already cache the foldable `ZoneId` via `addMutableState`
  (see `datetimeExpressions.scala:1810-1844`). This proposal brings
  `ConvertTimezone` in line:

    * Add a `@transient lazy val` for foldable source/target zones
      (interpreted path).
    * Generate `addMutableState`-cached `ZoneId` terms when timezone args are
      foldable (codegen path); fall back to per-row resolution otherwise.
    * Add a `convertTimestampNtzToAnotherTz(ZoneId, ZoneId, Long)` overload in
      `DateTimeUtils` so callers can pass pre-resolved zones.
    * Short-circuit to NULL when a foldable timezone literal is null.
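
  The interpreted-path part of the list above might look roughly like the
  following. This is a simplified, Spark-free sketch rather than the actual
  patch: `ZoneArg`, `Constant`, `FromRow`, and `Converter` are hypothetical
  types standing in for Catalyst expressions, `None` models SQL NULL, and a
  `LocalDateTime` stands in for the `Long` microsecond value.

```scala
import java.time.{LocalDateTime, ZoneId}

object ConvertTimezoneSketch {
  // Hypothetical model of a timezone argument.
  sealed trait ZoneArg
  final case class Constant(id: Option[String]) extends ZoneArg // foldable literal; None = NULL
  final case class FromRow(index: Int) extends ZoneArg          // read from the row each time

  final class Converter(source: ZoneArg, target: ZoneArg) extends Serializable {
    // Resolved at most once, on first use, and only for non-null constant arguments.
    @transient private lazy val sourceZone: Option[ZoneId] = constantZone(source)
    @transient private lazy val targetZone: Option[ZoneId] = constantZone(target)

    private def constantZone(arg: ZoneArg): Option[ZoneId] = arg match {
      case Constant(Some(id)) => Some(ZoneId.of(id, ZoneId.SHORT_IDS))
      case _                  => None
    }

    // Picks the cached zone for constants and falls back to per-row resolution otherwise.
    private def zoneFor(arg: ZoneArg, cached: => Option[ZoneId], row: Array[String]): Option[ZoneId] =
      arg match {
        case Constant(None) => None // null literal: the result is NULL
        case Constant(_)    => cached
        case FromRow(i)     => Option(row(i)).map(ZoneId.of(_, ZoneId.SHORT_IDS))
      }

    // Returns None (NULL) when either zone is null; otherwise converts using resolved zones.
    def eval(row: Array[String], ts: LocalDateTime): Option[LocalDateTime] =
      for {
        src <- zoneFor(source, sourceZone, row)
        dst <- zoneFor(target, targetZone, row)
      } yield ts.atZone(src).withZoneSameInstant(dst).toLocalDateTime
  }
}
```

  The final conversion takes already-resolved `ZoneId` values, which is the
  role the proposed `(ZoneId, ZoneId, Long)` overload would play for callers.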

  Expected impact: ~1.3-2x speedup of the `convert_timezone` function in the
  common foldable-arguments case; meaningful (single-digit to low double-digit
  percentage) end-to-end speedup for ETL workloads where `convert_timezone` is
  on the hot path.



