[
https://issues.apache.org/jira/browse/SPARK-56908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18085859#comment-18085859
]
Gengliang Wang commented on SPARK-56908:
----------------------------------------
Triage note for the remaining open subtasks: several recently-filed ones
eliminate statically-dead null branches for non-nullable / constant inputs
(e.g. NaNvl, If, AtLeastNNonNulls, StringLocate, CreateNamedStruct, MakeDecimal
/ CheckOverflow). Per the "Scope guidance" just added to the description, these
are source-only changes: Janino already constant-folds {{if (true)}} /
{{if (false)}}, so the generated bytecode does not change. For infrequent
patterns the compile-time saving is below the run-to-run noise floor (~0.7%,
measured on TPC-DS whole-stage codegen, where Janino compile time is
~0.36 ms per KB of generated source).
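As a rough worked example of that arithmetic: if compile time is linear in
source size, the relative saving from deleting text equals the fraction of the
stage's source that disappears. The sketch below uses the ~0.36 ms/KB and ~0.7%
figures from the description. The class and method names, the 40 KB stage size,
and the 120-byte branch are made-up illustrative values, not measurements, and
comparing the saving directly against the noise fraction is only one simple way
to read the guidance.

```java
class DeadBranchPayoff {
    // Figures quoted in the SPARK-56908 description (TPC-DS whole-stage codegen).
    static final double MS_PER_KB = 0.36;        // Janino compile time per KB of source
    static final double NOISE_FRACTION = 0.007;  // run-to-run noise, ~0.7%

    /** Estimated compile time in ms for the given amount of generated source. */
    static double compileMs(double sourceBytes) {
        return sourceBytes / 1024.0 * MS_PER_KB;
    }

    /** True when the estimated saving exceeds the noise on the stage's compile time. */
    static boolean aboveNoise(double stageBytes, double removedBytes) {
        double savedMs = compileMs(removedBytes);
        double noiseMs = compileMs(stageBytes) * NOISE_FRACTION;
        return savedMs > noiseMs;
    }

    public static void main(String[] args) {
        double stageBytes = 40 * 1024;  // hypothetical 40 KB stage
        double branchBytes = 120;       // hypothetical size of one dead branch
        System.out.println("one occurrence above noise:  "
            + aboveNoise(stageBytes, branchBytes));
        System.out.println("ten occurrences above noise: "
            + aboveNoise(stageBytes, 10 * branchBytes));
    }
}
```

With these assumed sizes, one occurrence removes about 0.3% of the stage's
source (below the ~0.7% noise), while ten occurrences remove about 2.9%.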
Unless a subtask's pattern is frequent in real generated code or removes more
than source text (e.g. a constant-pool / {{references[]}} entry), it adds
codegen complexity for negligible benefit and should be dropped rather than
merged.
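The "frequent in real generated code" check can be done mechanically on
generated-source dumps. A minimal sketch, assuming the dumps are already
available as strings (one per query); the class name, method names, and the
three tiny sample dumps are hypothetical, and only the {{if (100.0D == 0)}}
pattern comes from the description:

```java
import java.util.List;

class DeadPatternCounter {
    /** Counts non-overlapping literal occurrences of needle in one dump. */
    static int count(String source, String needle) {
        if (needle.isEmpty()) {
            throw new IllegalArgumentException("needle must not be empty");
        }
        int n = 0;
        int from = source.indexOf(needle);
        while (from >= 0) {
            n++;
            from = source.indexOf(needle, from + needle.length());
        }
        return n;
    }

    /** Total occurrences of needle across all dumps. */
    static int total(List<String> dumps, String needle) {
        int sum = 0;
        for (String dump : dumps) {
            sum += count(dump, needle);
        }
        return sum;
    }

    /** Number of dumps that contain needle at least once. */
    static long dumpsHit(List<String> dumps, String needle) {
        return dumps.stream().filter(d -> d.contains(needle)).count();
    }

    public static void main(String[] args) {
        String needle = "if (100.0D == 0)";
        // Hand-written stand-ins for generated source, not real Spark output.
        List<String> dumps = List.of(
            "if (100.0D == 0) { a(); } else { b(); }\nif (100.0D == 0) { c(); }",
            "int x = 1;",
            "if (100.0D == 0) { d(); }");
        System.out.println(total(dumps, needle) + " occurrences in "
            + dumpsHit(dumps, needle) + " dumps");
    }
}
```

Reporting both the total and the number of dumps hit mirrors how the
description summarizes SPARK-57198 (occurrences across queries).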
> Reduce generated Java size in whole-stage codegen
> -------------------------------------------------
>
> Key: SPARK-56908
> URL: https://issues.apache.org/jira/browse/SPARK-56908
> Project: Spark
> Issue Type: Umbrella
> Components: SQL
> Affects Versions: 4.3.0
> Reporter: Gengliang Wang
> Priority: Major
> Labels: pull-request-available
>
> Whole-stage codegen generates a fresh Java class per stage. Across many
> operators the generated source contains (a) boilerplate that is
> type-independent across stages and can be deduplicated into static Java
> helpers, and (b) branches or variables that are statically dead at codegen
> time but emitted anyway.
> These patterns cost us in three places:
> - JVM 64KB method-size and constant-pool limits, which force interpreted
> fallback on deep query plans.
> - Janino compile time per stage.
> - JIT compile work (each stage class has its own bodies).
> This umbrella tracks small, behavior-preserving cleanups across the generated
> Java to address these issues. Each subtask is independently PR-able; behavior
> is preserved end-to-end and verified by the relevant operator's existing test
> suite with {{spark.sql.codegen.wholeStage}} forced both on and off.
> h3. Scope guidance (when is a dead-branch / simplification subtask worth it?)
> Not every statically-dead branch is worth eliminating. We measured the payoff
> on real generated code (TPC-DS whole-stage codegen): Janino compile time is
> roughly linear in generated source size (~0.36 ms/KB), and Janino
> constant-folds {{if (true)}} / {{if (false)}}, so there is *no bytecode change* and no
> JIT/runtime benefit. For a pattern that occurs only a handful of times, the
> source-only saving is below the run-to-run compile-time noise floor (~0.7%).
> Therefore a subtask that adds branching/complexity to the codegen logic to
> skip a dead branch is only justified when:
> - (a) the pattern is *frequent* in real generated code -- verify against e.g.
> the TPC-DS whole-stage codegen dumps, as SPARK-57198 did (the dead
> {{if (100.0D == 0)}} check appeared 56 times across 14 TPC-DS queries); or
> - (b) it removes more than source text -- e.g. an unreachable
> {{references[]}} / constant-pool entry (SPARK-57198), or it keeps generated
> methods under the 64KB / constant-pool limits.
> Trivial, infrequent dead-branch removals add codegen-logic complexity for
> negligible benefit and should be dropped rather than merged.
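
For readers unfamiliar with the pattern under discussion, here is a
hand-written illustration of a statically-dead null branch. It is not actual
Spark output; the class and method names are hypothetical. The two methods
differ only in source text, which is the case the scope guidance describes as
having no bytecode or runtime benefit under Janino's constant folding.

```java
class DeadNullBranchDemo {
    // Shape of code emitted for a non-nullable input: the null flag has
    // already been folded to the literal false at codegen time.
    static int withDeadBranch(int input) {
        int result;
        if (false) {
            result = -1;         // null-handling path, never executed
        } else {
            result = input + 1;  // the only live path
        }
        return result;
    }

    // The same logic with the dead branch skipped by the code generator.
    static int withoutDeadBranch(int input) {
        return input + 1;
    }

    public static void main(String[] args) {
        for (int i = -2; i <= 2; i++) {
            System.out.println(i + ": " + withDeadBranch(i) + " / " + withoutDeadBranch(i));
        }
    }
}
```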