Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21291#discussion_r189149095

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/ConfigBehaviorSuite.scala ---
    @@ -39,7 +39,9 @@ class ConfigBehaviorSuite extends QueryTest with SharedSQLContext {
         def computeChiSquareTest(): Double = {
           val n = 10000
           // Trigger a sort
    -      val data = spark.range(0, n, 1, 1).sort('id.desc)
    +      // Range has range partitioning in its output now. To have a range shuffle, we
    +      // need to run a repartition first.
    +      val data = spark.range(0, n, 1, 1).repartition(10).sort('id.desc)
    --- End diff --

    With `spark.range(0, n, 1, 10).sort('id.desc)`, there is no 3x linear relation between `a` and `b`. As shown above, the data is also evenly distributed in that case, so the chi-sq value is also under `100`. Here we need to redistribute the data to make sampling difficult. Previously, a repartition was added automatically before `sort`. Now `range` has correct output partitioning info, so the repartition must be added manually.
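One way to see the difference is to print the physical plans of both variants side by side. The following is only a minimal sketch, assuming a local `SparkSession`; the object name `RangeSortPlans`, the helper `build`, and the `local[2]` master are illustrative and not part of the PR. The plan text depends on the Spark version, so no expected output is shown.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical helper object, used only to compare the two query shapes.
object RangeSortPlans {
  // Returns (sort directly on range, sort after an explicit repartition).
  def build(spark: SparkSession, n: Long = 10000L)
      : (Dataset[java.lang.Long], Dataset[java.lang.Long]) = {
    val direct = spark.range(0, n, 1, 1).sort(col("id").desc)
    val reshuffled = spark.range(0, n, 1, 1).repartition(10).sort(col("id").desc)
    (direct, reshuffled)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]") // assumption: local mode, for inspection only
      .appName("range-sort-plans")
      .getOrCreate()
    val (direct, reshuffled) = build(spark)
    // Compare where (and whether) an exchange appears before the sort.
    direct.explain()
    reshuffled.explain()
    spark.stop()
  }
}
```

Both queries return the same rows in the same order; only the way the data reaches the sort differs, and that is the part the test depends on.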