Github user rednaxelafx commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20224#discussion_r161980740
  
    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala ---
    @@ -575,7 +609,10 @@ case class CollapseCodegenStages(conf: SQLConf) extends Rule[SparkPlan] {
     
       def apply(plan: SparkPlan): SparkPlan = {
         if (conf.wholeStageEnabled) {
    -      insertWholeStageCodegen(plan)
    +      WholeStageCodegenExec.initializeCodegenStageCounterPerQuery()
    +      val newPlan = insertWholeStageCodegen(plan)
    +      WholeStageCodegenExec.resetCodegenStageCounter()
    --- End diff ---
    
    It's exactly the concern that @kiszk brought up: the codegen cache uses the generated source code as the key, so any difference in the source code text will prevent a cache hit.
    Imagine the counter were globally (or Spark-session-locally) atomically incrementing: if the same query were run twice, the codegen stages in those two runs would get different sets of IDs, resulting in different source code text (everything the same except for the IDs). That would render the codegen cache useless -- basically nothing would ever hit the cache, since the IDs would always differ.
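
    To make this concrete, here's a minimal sketch of what those counter methods could look like. Their bodies aren't shown in this diff, so the details are assumptions, and `nextCodegenStageId` is a hypothetical helper illustrating how each codegen stage would pick up its ID:

        import java.util.concurrent.atomic.AtomicInteger

        object WholeStageCodegenExec {
          private val codegenStageCounter = new AtomicInteger(0)

          // Called right before inserting whole-stage codegen for a query,
          // so stage numbering starts from 1 for every query.
          def initializeCodegenStageCounterPerQuery(): Unit =
            codegenStageCounter.set(1)

          // Called once planning is done, so no stale value leaks into
          // the next query.
          def resetCodegenStageCounter(): Unit =
            codegenStageCounter.set(0)

          // Each codegen stage takes the current value and bumps the counter.
          def nextCodegenStageId(): Int =
            codegenStageCounter.getAndIncrement()
        }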
    
    In fact, you'd find that the IDs in the output of `df.explain()` differ from the ones you see in the tree string in the Spark UI's SQL tab, because explain actually runs as its own query with `ExplainCommand` as the root. I hit this exact problem in my early prototype and soon realized it wasn't going to be user-friendly.
    
    By making the ID increment only within a query, we can make sure the codegen cache works across multiple runs of the same (or an identically structured) query, while still being able to differentiate the codegen stages within a single query.
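
    As a toy illustration (not Spark code) of why the per-query reset matters for the source-text cache key:

        import java.util.concurrent.atomic.AtomicInteger

        object CacheKeyDemo {
          private val counter = new AtomicInteger(1)

          // "Generate" the source text for a plan with two codegen stages.
          private def generateSource(): String = {
            val id1 = counter.getAndIncrement()
            val id2 = counter.getAndIncrement()
            s"/* codegenStageId=$id1 */ ... /* codegenStageId=$id2 */ ..."
          }

          def main(args: Array[String]): Unit = {
            // Per-query reset: two runs of the same query yield identical
            // source text, so the second run hits the codegen cache.
            counter.set(1); val run1 = generateSource()
            counter.set(1); val run2 = generateSource()
            println(run1 == run2) // true

            // Globally incrementing counter (no reset): the text differs,
            // so every run misses the cache.
            val run3 = generateSource()
            println(run2 == run3) // false
          }
        }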


---
