Github user rednaxelafx commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20224#discussion_r161980740

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala ---
@@ -575,7 +609,10 @@ case class CollapseCodegenStages(conf: SQLConf) extends Rule[SparkPlan] {

   def apply(plan: SparkPlan): SparkPlan = {
     if (conf.wholeStageEnabled) {
-      insertWholeStageCodegen(plan)
+      WholeStageCodegenExec.initializeCodegenStageCounterPerQuery()
+      val newPlan = insertWholeStageCodegen(plan)
+      WholeStageCodegenExec.resetCodegenStageCounter()
--- End diff --

It's exactly the concern that @kiszk brought up: the codegen cache uses the generated source code as the key, so any difference in the source code text breaks the cache hit.

Imagine the counter were global (or Spark-session-local) and atomically incrementing: if the same query were run twice, the codegen stages in the two runs would get different sets of IDs, resulting in different source code text (everything identical except the ID). That would render the codegen cache useless -- basically nothing would ever hit the cache, because the IDs would always differ.

In fact, you'll find that the IDs in the output of `df.explain()` differ from the ones you see in the treeString on Spark UI's SQL tab, because explain is actually "one query with the `ExplainCommand` as the root". I hit this exact problem in my early prototype and soon realized it wasn't going to be user-friendly.

By making the ID increment only within a query, we ensure the codegen cache works across multiple runs of the same (or identically structured) query, while still being able to differentiate the codegen stages within a query.
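To make the caching behavior concrete, here's a minimal, self-contained Scala sketch of the idea -- hypothetical names, not Spark's actual implementation -- with a cache keyed on generated source text and the two counter strategies side by side:

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.mutable

object CodegenCacheSketch {
  // The codegen cache is keyed on the generated source text:
  // identical text -> cache hit, any textual difference -> miss.
  private val cache = mutable.Map.empty[String, String]

  private def compile(source: String): String =
    cache.getOrElseUpdate(source, s"<compiled:${source.hashCode}>")

  // Strategy 1: a global, ever-increasing counter. Every generation
  // embeds a fresh ID, so the source text is never the same twice.
  private val globalCounter = new AtomicInteger(0)
  def generateWithGlobalId(): String =
    compile(s"/* codegenStageId=${globalCounter.incrementAndGet()} */ class GeneratedIterator")

  // Strategy 2: a counter scoped to a single query. Re-running an
  // identically structured query reproduces the same IDs, hence the
  // same source text.
  def generateWithPerQueryIds(numStages: Int): Seq[String] = {
    val perQueryCounter = new AtomicInteger(0)
    (1 to numStages).map { _ =>
      compile(s"/* codegenStageId=${perQueryCounter.incrementAndGet()} */ class GeneratedIterator")
    }
  }

  def main(args: Array[String]): Unit = {
    // Same "query" run twice with a global counter: different IDs,
    // different source text, so the second run misses the cache.
    generateWithGlobalId(); generateWithGlobalId()
    println(s"entries after two global-counter runs: ${cache.size}") // 2

    cache.clear()
    // Same "query" run twice with per-query IDs: identical source
    // text, so the second run hits the cache.
    generateWithPerQueryIds(1); generateWithPerQueryIds(1)
    println(s"entries after two per-query runs: ${cache.size}") // 1
  }
}
```

The sketch also shows why the IDs stay useful: within one per-query run, each stage still gets a distinct ID, so stages remain distinguishable in the generated code even though repeated runs reuse cached compilations.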