Github user sddyljsx commented on the issue:

https://github.com/apache/spark/pull/21859

This optimization is only for SQL, but other places also use RangePartitioner. Could it affect those other call sites?

The failed UTs were caused by:

```
else if (sampleCacheEnabled && numItems == numSampled) {
  // get the sampled data
  sampledArray = sketched.foldLeft(Array.empty[K])((total, sample) => {
    total ++ sample._3
  })
  Array.empty
}
```

Here the RangePartitioner's rangeBounds ends up empty. I assumed rangeBounds would have no use once the optimization works, so an empty array was returned. But that may cause errors, so I changed the code to always compute rangeBounds. This way, the only difference in the new RangePartitioner is that it stores an extra sampledArray, which won't affect other places.

What's more:

```
class RangePartitioner[K : Ordering : ClassTag, V](
    partitions: Int,
    rdd: RDD[_ <: Product2[K, V]],
    private var ascending: Boolean = true,
    val samplePointsPerPartitionHint: Int = 20,
    val sampleCacheEnabled: Boolean = false)
```

The default value of sampleCacheEnabled is false; it is set to true only in this one place:

```
ShuffleExchangeExec.scala

new RangePartitioner(
  numPartitions,
  rddForSampling,
  ascending = true,
  samplePointsPerPartitionHint = SQLConf.get.rangeExchangeSampleSizePerPartition,
  sampleCacheEnabled = SQLConf.get.rangeExchangeSampleCacheEnabled)
```

So it will be safe.
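To see why the empty rangeBounds broke the UTs: a RangePartitioner creates rangeBounds.length + 1 partitions, so an empty bounds array collapses everything into a single partition 0. A minimal sketch (not Spark's actual implementation, just a simplified stand-in for getPartition on Ints) illustrating this:

```
// Hypothetical simplified version of range partitioning, for illustration only.
object RangeBoundsSketch {
  // A key's partition is the index of the first bound it does not exceed.
  def getPartition(rangeBounds: Array[Int], key: Int): Int = {
    var p = 0
    while (p < rangeBounds.length && key > rangeBounds(p)) p += 1
    p
  }

  def main(args: Array[String]): Unit = {
    val bounds = Array(10, 20) // 3 partitions: (-inf,10], (10,20], (20,+inf)
    println(getPartition(bounds, 5))   // partition 0
    println(getPartition(bounds, 15))  // partition 1
    println(getPartition(bounds, 99))  // partition 2

    // With the buggy early-return path, rangeBounds is empty, so
    // numPartitions = rangeBounds.length + 1 = 1 and every key maps to 0:
    println(getPartition(Array.empty[Int], 99)) // always partition 0
  }
}
```

Always computing rangeBounds (while only additionally caching sampledArray) keeps this partition-index behavior identical for all other RangePartitioner users.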