[ 
https://issues.apache.org/jira/browse/SPARK-56908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18085859#comment-18085859
 ] 

Gengliang Wang commented on SPARK-56908:
----------------------------------------

Triage note for the remaining open subtasks: several recently filed ones 
eliminate statically dead null branches for non-nullable or constant inputs 
(e.g. NaNvl, If, AtLeastNNonNulls, StringLocate, CreateNamedStruct, MakeDecimal 
/ CheckOverflow). Per the "Scope guidance" just added to the description, these 
are source-only changes: Janino constant-folds {{if (true)}} / {{if (false)}}, 
so the bytecode does not change, and for infrequent patterns the compile-time 
saving is below the run-to-run noise floor (~0.7%; measured on TPC-DS 
whole-stage codegen, where Janino compile time is ~0.36 ms per KB of generated 
source).
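
To make the pattern concrete, here is a minimal Java sketch of a statically 
dead null branch. It is not actual Spark output: the class name, method names, 
and the doubling expression are hypothetical, chosen only to show that removing 
the branch changes source text and not behavior.

```java
public class DeadNullBranchSketch {

    // Hypothetical shape of code emitted for a non-nullable input: the null
    // check has already been resolved to the literal `false`, so the first
    // branch can never execute.
    static double withDeadBranch(double input) {
        boolean isNull = false;
        double value = -1.0;
        if (false) {              // statically dead branch
            isNull = true;
        } else {
            value = input * 2.0;  // the only path that ever runs
        }
        return isNull ? Double.NaN : value;
    }

    // The same computation with the dead branch removed at codegen time.
    static double withoutDeadBranch(double input) {
        return input * 2.0;
    }

    public static void main(String[] args) {
        // Both forms must agree on every input; only the source size differs.
        for (double x : new double[] {0.0, 1.5, -3.0}) {
            if (withDeadBranch(x) != withoutDeadBranch(x)) {
                throw new AssertionError("results differ for input " + x);
            }
        }
        System.out.println("both forms agree");
    }
}
```

The second form is shorter, but producing it means the (hypothetical) 
generator has to branch on nullability itself, which is the codegen-logic 
complexity the scope guidance weighs against the saving.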

Unless a subtask targets a pattern that is frequent in real generated code, or 
removes more than source text (e.g. a constant-pool / {{references[]}} entry), 
it adds codegen complexity for negligible benefit and should be dropped rather 
than merged.

> Reduce generated Java size in whole-stage codegen
> -------------------------------------------------
>
>                 Key: SPARK-56908
>                 URL: https://issues.apache.org/jira/browse/SPARK-56908
>             Project: Spark
>          Issue Type: Umbrella
>          Components: SQL
>    Affects Versions: 4.3.0
>            Reporter: Gengliang Wang
>            Priority: Major
>              Labels: pull-request-available
>
> Whole-stage codegen generates a fresh Java class per stage. Across many 
> operators the generated source contains (a) boilerplate that is 
> type-independent across stages and can be deduplicated into static Java 
> helpers, and (b) branches or variables that are statically dead at codegen 
> time but emitted anyway.
> These patterns cost us in three places:
> - JVM 64KB method-size and constant-pool limits, which force interpreted 
> fallback on deep query plans.
> - Janino compile time per stage.
> - JIT compile work (each stage class has its own bodies).
> This umbrella tracks small, behavior-preserving cleanups across the generated 
> Java to address these issues. Each subtask is independently PR-able; behavior 
> is preserved end-to-end and verified by the relevant operator's existing test 
> suite with {{spark.sql.codegen.wholeStage}} forced both on and off.
> h3. Scope guidance (when is a dead-branch / simplification subtask worth it?)
> Not every statically-dead branch is worth eliminating. We measured the payoff 
> on real generated code (TPC-DS whole-stage codegen): Janino compile time is 
> roughly linear in generated source size (~0.36 ms/KB), and Janino 
> constant-folds {{if (true)}} / {{if (false)}}, so there is 
> *no bytecode change* and no JIT/runtime benefit. For a pattern that occurs 
> only a handful of times, the source-only saving is below the run-to-run 
> compile-time noise floor (~0.7%).
> Therefore a subtask that adds branching/complexity to the codegen logic to 
> skip a dead branch is only justified when:
> - (a) the pattern is *frequent* in real generated code -- verify against e.g. 
> the TPC-DS whole-stage codegen dumps, as SPARK-57198 did (the dead 
> {{if (100.0D == 0)}} check appeared 56 times across 14 TPC-DS queries); or
> - (b) it removes more than source text -- e.g. an unreachable 
> {{references[]}} / constant-pool entry (SPARK-57198), or it keeps generated 
> methods under the 64KB / constant-pool limits.
> Trivial, infrequent dead-branch removals add codegen-logic complexity for 
> negligible benefit and should be dropped rather than merged.
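
As a rough worked example of the guidance above, the following Java sketch 
compares a source-only saving with the noise floor. It assumes compile time is 
exactly linear at the quoted ~0.36 ms/KB and that the ~0.7% noise floor is a 
fraction of the stage's total compile time. The stage size (200 KB), the bytes 
per occurrence (80), and the occurrence counts are made-up inputs, not 
measurements.

```java
public class CompileSavingEstimate {

    // Figures quoted in the scope guidance.
    static final double MS_PER_KB = 0.36;        // Janino compile cost per KB of source
    static final double NOISE_FRACTION = 0.007;  // ~0.7% run-to-run noise

    // Estimated compile time saved by deleting the pattern's source text.
    static double savedMs(int occurrences, int bytesPerOccurrence) {
        return occurrences * bytesPerOccurrence / 1024.0 * MS_PER_KB;
    }

    // Noise floor in ms, ASSUMING it scales with the stage's total compile time.
    static double noiseMs(double stageSourceKb) {
        return stageSourceKb * MS_PER_KB * NOISE_FRACTION;
    }

    // True when the estimated saving exceeds the assumed noise floor.
    static boolean measurable(int occurrences, int bytesPerOccurrence,
                              double stageSourceKb) {
        return savedMs(occurrences, bytesPerOccurrence) > noiseMs(stageSourceKb);
    }

    public static void main(String[] args) {
        double stageKb = 200.0;   // hypothetical generated source size
        int bytesPerHit = 80;     // hypothetical size of one dead branch

        // A handful of occurrences versus a frequent pattern.
        System.out.println("3 occurrences measurable:  "
                + measurable(3, bytesPerHit, stageKb));
        System.out.println("30 occurrences measurable: "
                + measurable(30, bytesPerHit, stageKb));
    }
}
```

Under these assumptions the ms/KB rate cancels out of the comparison, so the 
saving is measurable only when the removed text exceeds about 0.7% of the 
stage's source. That is why criterion (a) asks for frequency evidence from real 
codegen dumps.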



