Github user rednaxelafx commented on the issue:

    https://github.com/apache/spark/pull/20224

BTW, inspired by @cloud-fan 's comment, here's an example of the codegen stage IDs when scalar subqueries are involved:

```scala
val sub = "(select sum(id) from range(5))"
val df = spark.sql(s"select $sub as a, $sub as b")
df.explain(true)
```

would give:

```
== Parsed Logical Plan ==
'Project [scalar-subquery#0 [] AS a#1, scalar-subquery#2 [] AS b#3]
:  :- 'Project [unresolvedalias('sum('id), None)]
:  :  +- 'UnresolvedTableValuedFunction range, [5]
:  +- 'Project [unresolvedalias('sum('id), None)]
:     +- 'UnresolvedTableValuedFunction range, [5]
+- OneRowRelation

== Analyzed Logical Plan ==
a: bigint, b: bigint
Project [scalar-subquery#0 [] AS a#1L, scalar-subquery#2 [] AS b#3L]
:  :- Aggregate [sum(id#14L) AS sum(id)#16L]
:  :  +- Range (0, 5, step=1, splits=None)
:  +- Aggregate [sum(id#17L) AS sum(id)#19L]
:     +- Range (0, 5, step=1, splits=None)
+- OneRowRelation

== Optimized Logical Plan ==
Project [scalar-subquery#0 [] AS a#1L, scalar-subquery#2 [] AS b#3L]
:  :- Aggregate [sum(id#14L) AS sum(id)#16L]
:  :  +- Range (0, 5, step=1, splits=None)
:  +- Aggregate [sum(id#17L) AS sum(id)#19L]
:     +- Range (0, 5, step=1, splits=None)
+- OneRowRelation

== Physical Plan ==
*(1) Project [Subquery subquery0 AS a#1L, Subquery subquery2 AS b#3L]
:  :- Subquery subquery0
:  :  +- *(2) HashAggregate(keys=[], functions=[sum(id#14L)], output=[sum(id)#16L])
:  :     +- Exchange SinglePartition
:  :        +- *(1) HashAggregate(keys=[], functions=[partial_sum(id#14L)], output=[sum#21L])
:  :           +- *(1) Range (0, 5, step=1, splits=8)
:  +- Subquery subquery2
:     +- *(2) HashAggregate(keys=[], functions=[sum(id#17L)], output=[sum(id)#19L])
:        +- Exchange SinglePartition
:           +- *(1) HashAggregate(keys=[], functions=[partial_sum(id#17L)], output=[sum#23L])
:              +- *(1) Range (0, 5, step=1, splits=8)
+- Scan OneRowRelation[]
```

The reason the IDs look a bit "odd" (there are three separate codegen stages all with ID 1) is that the main "spine" query and each individual subquery are "planned" separately, so each of them runs `CollapseCodegenStages` on its own and counts up from 1 afresh. I would consider this behavior acceptable, but I wonder what others think in this case. If this behavior for subqueries is not acceptable, I'll have to find alternative places to put the initialization and reset of the thread-local ID counter.
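To make the "counting up from 1 afresh" behavior concrete, here is a minimal standalone sketch of a per-thread stage-ID counter that is reset at the start of each planning pass. The names (`CodegenStageCounter`, `reset`, `next`) are hypothetical illustrations, not Spark's actual API; the point is only that a reset per plan traversal gives each plan (spine query and each subquery) its own 1-based ID sequence.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical sketch: a thread-local counter reset once per planning pass.
// Because the spine query and each subquery are planned separately, each
// pass would call reset() and then hand out IDs 1, 2, 3, ... independently.
object CodegenStageCounter {
  private val counter = new ThreadLocal[AtomicInteger] {
    override def initialValue(): AtomicInteger = new AtomicInteger(0)
  }

  /** Called at the start of planning one query (spine or subquery). */
  def reset(): Unit = counter.get().set(0)

  /** Assign the next codegen stage ID within the current plan. */
  def next(): Int = counter.get().incrementAndGet()
}
```

Resetting between plans is exactly why the physical plan above shows three distinct stages all labeled `*(1)`: each belongs to a separately planned (sub)query.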