Github user rednaxelafx commented on the issue: https://github.com/apache/spark/pull/20224

Thanks for your comments and questions, @kiszk and @maropu! Let me address them in a couple of separate points.

**tl;dr** On top of my original proposal in the PR description / JIRA ticket, I'd like to further add:

a. A config option to choose whether or not to include the `codegenStageId` in the generated class name. The default should be "off", meaning the ID is not included in the class name.

b. Reserving the `[0]` element in the `references` array of the WSC generated class as a special value that records the codegen stage ID. That way, if we need to throw an exception from the generated code, we can include the codegen stage ID when constructing the exception message string. This doesn't add any new IDs to the generated code, so @kiszk's concerns about the codegen cache are addressed. This can always be turned on.

Side note: even if we only embed the codegen stage ID into comments, because the codegen cache uses `CodeAndComment` as the key, differences in comments will still affect cache hit effectiveness. Putting the codegen stage ID into the `references` array, on the other hand, solves this problem perfectly -- this array is Spark SQL's way of expressing a "runtime constant pool" anyway. The idea is somewhat similar to how the HotSpot VM's "LambdaForm bytecode sharing" works.

**Detail Discussions**

My proposal and PR currently do 3 things:
1. Add a per-query `codegenStageId` to `WholeStageCodegenExec`;
2. Include the ID as part of the explain output for physical plans;
3. Include the ID as part of the generated class name for WSC.

Of the above, (1) is the foundation, while (2) and (3) are separate applications of the information from (1). Would you (@kiszk and @maropu) agree that at least having both (1) and (2) is a good idea?
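To make point (b) concrete, here's a minimal sketch (in Java, since WSC emits Java source) of what reserving `references[0]` for the stage ID could look like. The class name, field names, and error-message format are all illustrative assumptions, not Spark's actual generated-code shape:

```java
// Hypothetical sketch: the WSC generated class receives a `references` array
// at construction time, and slot [0] is reserved for the codegen stage ID.
// The ID then never appears in the generated source text itself, only in
// this runtime "constant pool".
public class GeneratedIteratorSketch {
    private final Object[] references;
    private final int codegenStageId;

    public GeneratedIteratorSketch(Object[] references) {
        this.references = references;
        // Reserved slot: the codegen stage ID lives in references[0].
        this.codegenStageId = (Integer) references[0];
    }

    // When the generated code must fail, the message can cite the stage ID
    // without the ID ever being embedded in the generated source string.
    public RuntimeException stageError(String msg) {
        return new RuntimeException(
            "Codegen stage " + codegenStageId + ": " + msg);
    }
}
```

Because two stages with identical logic would share byte-identical source and differ only in the `references` array passed in at runtime, the codegen cache keying stays unaffected.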
They don't interact with anything else at runtime, so there is no behavioral change or performance implication from them. They can always be turned on with minimal overhead. @rxin did point out that our current explain output for physical plans is already pretty cluttered and not user-friendly enough, so it makes sense to have a "verbose mode" in the future and then make the default mode less cluttered. But that's out of scope for this change.

For (3), @kiszk does point out that there's an interaction between the generated code (in source string + comments form) and the codegen cache (from `CodeAndComment` -> generated class, in `org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator#cache`). I do know about this cache and took this interaction into consideration when I sketched out the proposal.

This PR proposes an ID unique within a query. If the same query is run multiple times, it'll generate the exact same code (with the IDs included), so at least with the current implementation, we can guarantee that there won't be redundant compilation for multiple runs of the same query. I mentioned this in the PR description:

> The reason why this proposal uses a per-query ID is because it's stable within a query, so that multiple runs of the same query will see the same resulting IDs. This both benefits understandability for users, and also it plays well with the codegen cache in Spark SQL which uses the generated source code as the key.

This kind of codegen cache hit is fundamental, and this PR keeps it working. Within a query, though, before this change there could be codegen stages that happen to have the exact same source code, and thus benefit from the codegen cache. After this change, such cases would generate code with different IDs embedded in the class name, so they'd have different source code, miss the codegen cache, and have to be compiled separately.
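The cache-hit guarantee across runs of the same query follows directly from keying the cache on the source text. The following is an illustrative stand-in, not Spark's real cache (which keys on `CodeAndComment` objects and lives in `CodeGenerator#cache`); it only shows the keying behavior:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of a codegen cache keyed by generated source text.
public class SourceKeyedCacheSketch {
    private final Map<String, Object> cache = new HashMap<>();
    public int compilations = 0;

    // Returns the "compiled class" for the given source, compiling at most
    // once per distinct source string.
    public Object compile(String source) {
        return cache.computeIfAbsent(source, s -> {
            compilations++;
            return new Object(); // stand-in for the compiled class
        });
    }
}
```

Because the per-query stage ID is stable, two runs of the same query produce byte-identical source strings, so the second run is a cache hit and `compilations` stays at one.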
Here's an example that would hit this case:

```scala
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 1)
val df1 = spark.range(5).select('id % 2 as 'x)
val df2 = spark.range(5).select('id % 2 as 'y)
val query = df1.join(df2, 'x === 'y)
```

With this change, you can see the different codegen stages as follows:

```
scala> query.explain
== Physical Plan ==
*(5) SortMergeJoin [x#3L], [y#9L], Inner
:- *(2) Sort [x#3L ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(x#3L, 200)
:     +- *(1) Project [(id#0L % 2) AS x#3L]
:        +- *(1) Filter isnotnull((id#0L % 2))
:           +- *(1) Range (0, 5, step=1, splits=8)
+- *(4) Sort [y#9L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(y#9L, 200)
      +- *(3) Project [(id#6L % 2) AS y#9L]
         +- *(3) Filter isnotnull((id#6L % 2))
            +- *(3) Range (0, 5, step=1, splits=8)
```

The generated code for codegen stages (1) and (3) is actually identical, so the codegen cache saves us from redundantly compiling one of the stages. I would argue that this kind of codegen cache hit is accidental, brittle, and shouldn't be considered a must-have. For example, if `df1` were declared as `spark.range(3)...` instead of `spark.range(5)...`, the generated code for codegen stages (1) and (3) would have been different (see [here](https://www.diffchecker.com/9KYDBULt) for details). In more realistic scenarios, the generated code shape is very sensitive to the surrounding operators and to the constants we embed directly into the code, so such generated code wouldn't reliably hit the codegen cache anyway.

On the other hand, having such IDs in the class name is very useful for all kinds of diagnosis, and not just for Spark developers. For example, it can tell users where an exception or profile tick happened, or exactly which generated method went above a certain bytecode size threshold.
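The trade-off being discussed can be shown in a tiny sketch (in Java, the language WSC emits). The source template below is purely illustrative, not Spark's real codegen output; it only demonstrates why embedding the stage ID in the class name turns two otherwise identical stages into distinct source strings, and hence distinct keys for a source-keyed cache:

```java
// Hypothetical sketch of the intra-query cache-key effect of embedding the
// codegen stage ID in the generated class name.
public class ClassNameKeySketch {
    public static String generatedSource(int stageId, boolean includeId) {
        String cls = includeId
            ? "GeneratedIteratorForCodegenStage" + stageId  // ID in name
            : "GeneratedIterator";                          // shared name
        // Identical body either way; only the class name can differ.
        return "final class " + cls + " { /* identical stage body */ }";
    }
}
```

Without the ID, stages (1) and (3) produce byte-identical source and deduplicate; with the ID, the strings differ and both get compiled -- which is exactly the regression the proposed config option would let users opt out of.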
But to avoid having any regressions whatsoever, I do agree we should have a config option to choose whether or not to embed this codegen stage ID into the generated class name. WDYT, @kiszk and @maropu?