Github user rednaxelafx commented on the issue: https://github.com/apache/spark/pull/20224

Thanks for your comments and questions, @kiszk and @maropu! Let me address them in a couple of separate points.

**tl;dr** On top of my original proposal in the PR description / JIRA ticket, I'd like to further add:

a. A config option to choose whether or not to include the `codegenStageId` in the generated class name. The default should be "off", meaning the ID is not included in the class name.

b. Reserving the `[0]` element in the `references` array of the WSC generated class as a special value that records the codegen stage ID. That way, if we need to throw an exception from the generated code, we can include the codegen stage ID when constructing the exception message string. This doesn't add any new IDs to the generated code, so @kiszk's concerns about the codegen cache are addressed. This can always be turned on.

Side note: even if we only embed the codegen stage ID into comments, because the codegen cache uses `CodeAndComment` as the key, differences in comments will still affect cache hit effectiveness. Putting the codegen stage ID into the `references` array, on the other hand, solves this problem perfectly -- this array is Spark SQL's way of expressing a "runtime constant pool" anyway. The idea is somewhat similar to how the HotSpot VM's "LambdaForm bytecode sharing" works.

**Detail Discussions**

My proposal and PR currently do 3 things:
1. Add a per-query `codegenStageId` to `WholeStageCodegenExec`;
2. Include the ID as part of the explain output for physical plans;
3. Include the ID as part of the generated class name for WSC.

Of the above, (1) is the foundation, while (2) and (3) are separate applications of the information from (1). Would you (@kiszk and @maropu) agree that at least having both (1) and (2) is a good idea?
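To make point (b) concrete, here's a minimal sketch (in Java, since WSC emits Java source) of what reserving `references[0]` for the stage ID could look like. The class name, field names, and error-message format are all illustrative assumptions, not Spark's actual generated-code shape:

```java
// Hypothetical sketch: the WSC generated class receives a `references` array
// at construction time, and slot [0] is reserved for the codegen stage ID.
// The ID then never appears in the generated source text itself, only in
// this runtime "constant pool".
public class GeneratedIteratorSketch {
    private final Object[] references;
    private final int codegenStageId;

    public GeneratedIteratorSketch(Object[] references) {
        this.references = references;
        // Reserved slot: the codegen stage ID lives in references[0].
        this.codegenStageId = (Integer) references[0];
    }

    // When the generated code must fail, the message can cite the stage ID
    // without the ID ever being embedded in the generated source string.
    public RuntimeException stageError(String msg) {
        return new RuntimeException(
            "Codegen stage " + codegenStageId + ": " + msg);
    }
}
```

Because two stages with identical logic would share byte-identical source and differ only in the `references` array passed in at runtime, the codegen cache keying stays unaffected.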
They don't interact with anything else at runtime, so there is no behavioral change or performance implication from them. They can always be turned on with minimal overhead. @rxin did point out that our current explain output for physical plans is already pretty cluttered and not user-friendly enough, so it makes sense to have a "verbose mode" in the future and then make the default mode less cluttered. But that's out of scope for this change.

For (3), @kiszk does point out that there's an interaction between the generated code (in source string + comments form) and the codegen cache (from `CodeAndComment` -> generated class, in `org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator#cache`). I do know about this cache and took this interaction into consideration when I sketched out the proposal.

This PR proposes an ID unique within a query. If the same query is run multiple times, it'll generate the exact same code (with the IDs included), so at least with the current implementation, we can guarantee that there won't be redundant compilation for multiple runs of the same query. I mentioned this in the PR description:

> The reason why this proposal uses a per-query ID is because it's stable within a query, so that multiple runs of the same query will see the same resulting IDs. This both benefits understandability for users, and also it plays well with the codegen cache in Spark SQL which uses the generated source code as the key.

This kind of codegen cache hit is fundamental, and this PR keeps it working. Within a query, though, before this change there could be codegen stages that happen to have the exact same source code, and thus benefit from the codegen cache. After this change, such cases would generate code with different IDs embedded in the class name, so they'd have different source code, miss the codegen cache, and have to be compiled separately.
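The cache-hit guarantee across runs of the same query follows directly from keying the cache on the source text. The following is an illustrative stand-in, not Spark's real cache (which keys on `CodeAndComment` objects and lives in `CodeGenerator#cache`); it only shows the keying behavior:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of a codegen cache keyed by generated source text.
public class SourceKeyedCacheSketch {
    private final Map<String, Object> cache = new HashMap<>();
    public int compilations = 0;

    // Returns the "compiled class" for the given source, compiling at most
    // once per distinct source string.
    public Object compile(String source) {
        return cache.computeIfAbsent(source, s -> {
            compilations++;
            return new Object(); // stand-in for the compiled class
        });
    }
}
```

Because the per-query stage ID is stable, two runs of the same query produce byte-identical source strings, so the second run is a cache hit and `compilations` stays at one.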
Here's an example that would hit this case:

```scala
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 1)
val df1 = spark.range(5).select('id % 2 as 'x)
val df2 = spark.range(5).select('id % 2 as 'y)
val query = df1.join(df2, 'x === 'y)
```

With this change, you can see the different codegen stages as follows:

```
scala> query.explain
== Physical Plan ==
*(5) SortMergeJoin [x#3L], [y#9L], Inner
:- *(2) Sort [x#3L ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(x#3L, 200)
:     +- *(1) Project [(id#0L % 2) AS x#3L]
:        +- *(1) Filter isnotnull((id#0L % 2))
:           +- *(1) Range (0, 5, step=1, splits=8)
+- *(4) Sort [y#9L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(y#9L, 200)
      +- *(3) Project [(id#6L % 2) AS y#9L]
         +- *(3) Filter isnotnull((id#6L % 2))
            +- *(3) Range (0, 5, step=1, splits=8)
```

The generated code for codegen stages (1) and (3) is actually identical, so the codegen cache saves us from redundantly compiling one of the stages. I would argue that this kind of codegen cache hit is accidental, brittle, and shouldn't be considered a must-have. For example, if `df1` were declared as `spark.range(3)...` instead of `spark.range(5)...`, the generated code for codegen stages (1) and (3) would have been different (see [here](https://www.diffchecker.com/9KYDBULt) for details). In more realistic scenarios, the generated code shape is very sensitive to the surrounding operators and to the constants we embed directly into the code, so such generated code wouldn't reliably hit the codegen cache anyway.

On the other hand, having such IDs in the class name is very useful for all kinds of diagnosis, and not just for Spark developers. For example, it can tell users where an exception or profile tick happened, or exactly which generated method went above a certain bytecode size threshold.
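The trade-off being discussed can be shown in a tiny sketch (in Java, the language WSC emits). The source template below is purely illustrative, not Spark's real codegen output; it only demonstrates why embedding the stage ID in the class name turns two otherwise identical stages into distinct source strings, and hence distinct keys for a source-keyed cache:

```java
// Hypothetical sketch of the intra-query cache-key effect of embedding the
// codegen stage ID in the generated class name.
public class ClassNameKeySketch {
    public static String generatedSource(int stageId, boolean includeId) {
        String cls = includeId
            ? "GeneratedIteratorForCodegenStage" + stageId  // ID in name
            : "GeneratedIterator";                          // shared name
        // Identical body either way; only the class name can differ.
        return "final class " + cls + " { /* identical stage body */ }";
    }
}
```

Without the ID, stages (1) and (3) produce byte-identical source and deduplicate; with the ID, the strings differ and both get compiled -- which is exactly the regression the proposed config option would let users opt out of.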
But to avoid having any regressions whatsoever, I do agree we should have a config option to choose whether or not to embed this codegen stage ID into the generated class name. WDYT, @kiszk and @maropu?