[ https://issues.apache.org/jira/browse/SPARK-56908?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gengliang Wang updated SPARK-56908:
-----------------------------------
    Description: 
Whole-stage codegen generates a fresh Java class per stage. Across many 
operators the generated source contains (a) boilerplate that is 
type-independent across stages and can be deduplicated into static Java 
helpers, and (b) branches or variables that are statically dead at codegen time 
but emitted anyway.

These patterns cost us in three places:
- JVM 64KB method-size and constant-pool limits, which force interpreted 
fallback on deep query plans.
- Janino compile time per stage.
- JIT compile work (each stage class has its own bodies).

This umbrella tracks small, behavior-preserving cleanups across the generated 
Java to address these issues. Each subtask is independently PR-able; behavior 
is preserved end-to-end and verified by the relevant operator's existing test 
suite with {{spark.sql.codegen.wholeStage}} forced both on and off.

h3. Scope guidance (when is a dead-branch / simplification subtask worth it?)

Not every statically dead branch is worth eliminating. We measured the payoff 
on real generated code (TPC-DS whole-stage codegen). Janino constant-folds 
{{if (true)}} / {{if (false)}}, so skipping such a branch produces *no bytecode 
change* and no JIT or runtime benefit; the only gain is smaller source. Janino 
compile time is roughly linear in source size (~0.36 ms/KB), and even the most 
frequent source-only patterns we measured ({{if (true)}} x445 and a dead 
null-write branch x358, together only ~0.4% of generated source) saved less 
compile time than the run-to-run noise floor (~0.7%).
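
The arithmetic behind that conclusion can be sketched as follows. This is a minimal illustration, not a measurement harness: the class and method names are hypothetical, the 10,000 KB corpus size is an assumed figure chosen only to make the numbers concrete, and the model assumes compile time is exactly linear at the ~0.36 ms/KB rate quoted above.

```java
public final class CompileTimeEstimate {
    // Figures quoted in the description above.
    static final double MS_PER_KB = 0.36;        // Janino compile cost per KB of source
    static final double NOISE_FRACTION = 0.007;  // ~0.7% run-to-run noise floor

    /** Estimated Janino compile time for sourceKb of generated source (linear model). */
    public static double compileMs(double sourceKb) {
        return sourceKb * MS_PER_KB;
    }

    /** Estimated compile time saved by deleting removedFraction of the source. */
    public static double savedMs(double sourceKb, double removedFraction) {
        return compileMs(sourceKb * removedFraction);
    }

    /** True only if the estimated saving exceeds the run-to-run noise floor. */
    public static boolean aboveNoise(double sourceKb, double removedFraction) {
        return savedMs(sourceKb, removedFraction) > compileMs(sourceKb) * NOISE_FRACTION;
    }

    public static void main(String[] args) {
        double corpusKb = 10_000.0; // ASSUMPTION: hypothetical corpus size, not from the issue
        System.out.println("saved ms: " + savedMs(corpusKb, 0.004));
        System.out.println("above noise: " + aboveNoise(corpusKb, 0.004));
    }
}
```

Because the model is linear, the corpus size cancels out: a source-only cleanup is measurable only if the fraction of source it removes exceeds the noise fraction, and ~0.4% is below ~0.7%.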

Therefore a subtask that adds branching or complexity to the codegen logic to 
skip a dead branch is only justified by criterion (b); criterion (a) alone is 
rarely enough:
- *(b) it removes more than source text -- this is the real bar.* For example, 
SPARK-57198: skipping the divide-by-zero guard for a non-zero literal also 
stops registering the unreachable {{errCtx}} entry in the {{references[]}} / 
constant pool, which Janino cannot fold away (unlike the {{if (false)}} text 
itself). Likewise, SPARK-57199 moved the {{AGGREGATE_OUT_OF_MEMORY}} string and 
map constructor out of the constant pools of all 142 generated aggregate 
classes into one compiled method. Keeping a large method under the 64KB 
method-size limit or the JIT huge-method limit also qualifies.
- *(a) raw frequency rarely suffices on its own.* As measured above, even 
patterns occurring 358-445 times stayed below the compile-time noise floor with 
no bytecode change. Frequency matters only when the dead code is a large 
fraction of a _single_ method (one approaching the 64KB limit), not when it is 
merely numerous across the corpus.
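
To make criterion (b) concrete, here is a toy sketch of the difference between a source-only dead branch and one that also registers a reference. It is not Spark's codegen API: {{ToyContext}}, {{addReference}}, {{genDivideNaive}} and {{genDivideSkippingGuard}} are hypothetical names invented for illustration, and the emitted strings only loosely imitate generated Java.

```java
import java.util.ArrayList;
import java.util.List;

public final class DivideGuardSketch {
    /** Toy stand-in for a codegen context: every registered object ships with the generated class. */
    public static final class ToyContext {
        public final List<Object> references = new ArrayList<>();

        /** Registers obj and returns the expression the generated code would use to read it. */
        public String addReference(Object obj) {
            references.add(obj);
            return "references[" + (references.size() - 1) + "]";
        }
    }

    /** Always emits the guard. For a non-zero literal the branch is dead, yet errCtx is still registered. */
    public static String genDivideNaive(ToyContext ctx, String dividend, double divisorLiteral) {
        String errCtx = ctx.addReference("errCtx for " + dividend + " / " + divisorLiteral);
        return "if (" + divisorLiteral + "D == 0) { throw divideByZero(" + errCtx + "); }\n"
            + "result = " + dividend + " / " + divisorLiteral + "D;";
    }

    /** Skips both the guard text and the reference registration when the literal is known non-zero. */
    public static String genDivideSkippingGuard(ToyContext ctx, String dividend, double divisorLiteral) {
        if (divisorLiteral != 0.0) {
            return "result = " + dividend + " / " + divisorLiteral + "D;";
        }
        return genDivideNaive(ctx, dividend, divisorLiteral);
    }

    public static void main(String[] args) {
        ToyContext naive = new ToyContext();
        System.out.println(genDivideNaive(naive, "x", 100.0));
        System.out.println("naive references: " + naive.references.size());

        ToyContext lean = new ToyContext();
        System.out.println(genDivideSkippingGuard(lean, "x", 100.0));
        System.out.println("lean references: " + lean.references.size());
    }
}
```

In the naive variant a compiler can fold away the {{if (100.0D == 0)}} text, but the registered object still exists because it was added while generating the code, before compilation. Only the second variant avoids it, which is why such a cleanup removes more than source text.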

Trivial, infrequent (and even frequent-but-source-only) dead-branch removals 
add codegen-logic complexity for negligible benefit and should be dropped 
rather than merged.

  was:
Whole-stage codegen generates a fresh Java class per stage. Across many 
operators the generated source contains (a) boilerplate that is 
type-independent across stages and can be deduplicated into static Java 
helpers, and (b) branches or variables that are statically dead at codegen time 
but emitted anyway.

These patterns cost us in three places:
- JVM 64KB method-size and constant-pool limits, which force interpreted 
fallback on deep query plans.
- Janino compile time per stage.
- JIT compile work (each stage class has its own bodies).

This umbrella tracks small, behavior-preserving cleanups across the generated 
Java to address these issues. Each subtask is independently PR-able; behavior 
is preserved end-to-end and verified by the relevant operator's existing test 
suite with {{spark.sql.codegen.wholeStage}} forced both on and off.

h3. Scope guidance (when is a dead-branch / simplification subtask worth it?)

Not every statically-dead branch is worth eliminating. We measured the payoff 
on real generated code (TPC-DS whole-stage codegen): Janino compile time is 
~linear in generated source (~0.36 ms/KB), and Janino constant-folds {{if 
(true)}} / {{if (false)}}, so there is *no bytecode change* and no JIT/runtime 
benefit. For a pattern that occurs only a handful of times, the source-only 
saving is below the run-to-run compile-time noise floor (~0.7%).

Therefore a subtask that adds branching/complexity to the codegen logic to skip 
a dead branch is only justified when:
- (a) the pattern is *frequent* in real generated code -- verify against e.g. 
the TPC-DS whole-stage codegen dumps, as SPARK-57198 did (the dead {{if (100.0D 
== 0)}} check appeared 56 times across 14 TPC-DS queries); or
- (b) it removes more than source text -- e.g. an unreachable {{references[]}} 
/ constant-pool entry (SPARK-57198), or it keeps generated methods under the 
64KB / constant-pool limits.

Trivial, infrequent dead-branch removals add codegen-logic complexity for 
negligible benefit and should be dropped rather than merged.


> Reduce generated Java size in whole-stage codegen
> -------------------------------------------------
>
>                 Key: SPARK-56908
>                 URL: https://issues.apache.org/jira/browse/SPARK-56908
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>    Affects Versions: 4.3.0
>            Reporter: Gengliang Wang
>            Priority: Major
>              Labels: pull-request-available


