Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21291#discussion_r188979880
  
    --- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/ConfigBehaviorSuite.scala ---
    @@ -39,7 +39,9 @@ class ConfigBehaviorSuite extends QueryTest with 
SharedSQLContext {
         def computeChiSquareTest(): Double = {
           val n = 10000
           // Trigger a sort
    -      val data = spark.range(0, n, 1, 1).sort('id.desc)
    +      // Range has range partitioning in its output now. To have a range 
shuffle, we
    +      // need to run a repartition first.
    +      val data = spark.range(0, n, 1, 1).repartition(10).sort('id.desc)
    --- End diff --
    
    This is a good point.
    
    This is the query plan and the partition sizes for `spark.range(0, n, 1, 
1).repartition(10).sort('id.desc)`, when we set 
`SQLConf.RANGE_EXCHANGE_SAMPLE_SIZE_PER_PARTITION` to 1:
    
    ```
    == Physical Plan ==
    *(2) Sort [id#15L DESC NULLS LAST], true, 0
    +- Exchange rangepartitioning(id#15L DESC NULLS LAST, 4)
       +- Exchange RoundRobinPartitioning(10)
          +- *(1) Range (0, 10000, step=1, splits=1)
    
    1666, 3766, 2003, 2565
    ```
    
    `spark.range(0, n, 1, 10).sort('id.desc)`:
    
    ```
    == Physical Plan ==
    *(2) Sort [id#13L DESC NULLS LAST], true, 0
    +- Exchange rangepartitioning(id#13L DESC NULLS LAST, 4)
       +- *(1) Range (0, 10000, step=1, splits=10)
    
    (2835, 2469, 2362, 2334)
    ```
    
    Because `repartition` shuffles data with `RoundRobinPartitioning`, I guess 
it degrades the sampling done by the range exchange. Without `repartition`, 
`Range`'s output is already range-partitioned, so sampling it yields better 
range boundaries.
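
    To illustrate the intuition, here is a rough, plain-Scala simulation (not 
Spark's actual `RangePartitioner`, which uses reservoir sampling): with one 
sample per input partition, contiguous (range-like) input partitions guarantee 
the samples are spread across the whole value range, while round-robin 
partitions each span the full range, so samples can cluster and skew the 
estimated boundaries. The `boundaries` helper below is hypothetical, just a 
sketch of boundary estimation:

    ```scala
    import scala.util.Random

    object SamplingSketch {
      // Sample one element per input partition (mimicking a sample size of 1
      // per partition) and pick numOutput - 1 boundaries from the sorted samples.
      def boundaries(partitions: Seq[Seq[Int]], numOutput: Int, rng: Random): Seq[Int] = {
        val samples = partitions.map(p => p(rng.nextInt(p.size))).sorted
        (1 until numOutput).map(i => samples(i * samples.size / numOutput))
      }

      def main(args: Array[String]): Unit = {
        val n = 10000
        val rng = new Random(42)
        // Round-robin-like input: every partition covers the full range [0, n).
        val roundRobin = (0 until 10).map(i => (i until n by 10).toSeq)
        // Range-like input: partition i covers the contiguous sub-range
        // [i * n / 10, (i + 1) * n / 10).
        val contiguous = (0 until 10).map(i => (i * n / 10 until (i + 1) * n / 10).toSeq)
        // Contiguous input always yields strictly increasing, well-spread
        // boundaries; round-robin input may not.
        println(boundaries(roundRobin, 4, rng))
        println(boundaries(contiguous, 4, rng))
      }
    }
    ```

    With contiguous input, each sample is confined to its partition's 
sub-range, so the boundaries land near the ideal quartiles no matter which 
element is sampled; with round-robin input, the 10 samples are effectively 
uniform draws over [0, n), which matches the more uneven partition sizes 
observed above.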

