Github user sddyljsx commented on the issue:

https://github.com/apache/spark/pull/21859

This optimization is only for SQL, but other places also use RangePartitioner. Could it affect those other call sites?

The failed UTs were caused by:

```
else if (sampleCacheEnabled && numItems == numSampled) {
  // get the sampled data
  sampledArray = sketched.foldLeft(Array.empty[K])((total, sample) => {
    total ++ sample._3
  })
  Array.empty
}
```

Here the RangePartitioner's rangeBounds ends up empty. I assumed rangeBounds would have no use once the optimization works, so an empty array was returned. But that may cause errors, so I changed the code to always compute rangeBounds. This way, the only difference in the new RangePartitioner is that it stores an extra sampledArray, which won't affect other places.

What's more:

```
class RangePartitioner[K : Ordering : ClassTag, V](
    partitions: Int,
    rdd: RDD[_ <: Product2[K, V]],
    private var ascending: Boolean = true,
    val samplePointsPerPartitionHint: Int = 20,
    val sampleCacheEnabled: Boolean = false)
```

The default value of sampleCacheEnabled is false; it is set to true only in this one place:

```
ShuffleExchangeExec.scala

new RangePartitioner(
  numPartitions,
  rddForSampling,
  ascending = true,
  samplePointsPerPartitionHint = SQLConf.get.rangeExchangeSampleSizePerPartition,
  sampleCacheEnabled = SQLConf.get.rangeExchangeSampleCacheEnabled)
```

So it will be safe.
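To see why the empty rangeBounds broke the UTs: a RangePartitioner creates rangeBounds.length + 1 partitions, so an empty bounds array collapses everything into a single partition 0. A minimal sketch (not Spark's actual implementation, just a simplified stand-in for getPartition on Ints) illustrating this:

```
// Hypothetical simplified version of range partitioning, for illustration only.
object RangeBoundsSketch {
  // A key's partition is the index of the first bound it does not exceed.
  def getPartition(rangeBounds: Array[Int], key: Int): Int = {
    var p = 0
    while (p < rangeBounds.length && key > rangeBounds(p)) p += 1
    p
  }

  def main(args: Array[String]): Unit = {
    val bounds = Array(10, 20) // 3 partitions: (-inf,10], (10,20], (20,+inf)
    println(getPartition(bounds, 5))   // partition 0
    println(getPartition(bounds, 15))  // partition 1
    println(getPartition(bounds, 99))  // partition 2

    // With the buggy early-return path, rangeBounds is empty, so
    // numPartitions = rangeBounds.length + 1 = 1 and every key maps to 0:
    println(getPartition(Array.empty[Int], 99)) // always partition 0
  }
}
```

Always computing rangeBounds (while only additionally caching sampledArray) keeps this partition-index behavior identical for all other RangePartitioner users.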