Github user xuchuanyin commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/1632#discussion_r157112148
  
    --- Diff: integration/spark-common/src/main/scala/org/apache/carbondata/spark/load/DataLoadProcessBuilderOnSpark.scala ---
    @@ -121,17 +121,18 @@ object DataLoadProcessBuilderOnSpark {
             CarbonProperties.getInstance().getGlobalSortRddStorageLevel()))
         }
     
    +    val sortStepRowConverter: SortStepRowHandler = new SortStepRowHandler(sortParameters)
         import scala.reflect.classTag
    +
    +    // 3. sort
         val sortRDD = convertRDD
    -      .sortBy(_.getData, numPartitions = numPartitions)(RowOrdering, classTag[Array[AnyRef]])
    -      .mapPartitionsWithIndex { case (index, rows) =>
    -        DataLoadProcessorStepOnSpark.convertTo3Parts(rows, index, modelBroadcast,
    -          sortStepRowCounter)
    -      }
    +      .map(r => DataLoadProcessorStepOnSpark.convertTo3Parts(r, TaskContext.getPartitionId(),
    +        modelBroadcast, sortStepRowConverter, sortStepRowCounter))
    +      .sortBy(r => r.getData, numPartitions = numPartitions)(RowOrdering, classTag[Array[AnyRef]])
    --- End diff ---
    
    This code change is not related to the sort temp file; I changed it because the interface and the internal load procedure have changed.
    
    After `convertRDD`, each row is still a raw row; in the sort phase, rows are converted to 3-parts; in the write phase, rows are encoded and written.
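    A minimal sketch of these phases (`inputRDD`, `toRawRow`, `toThreeParts`, and `encodeAndWrite` are hypothetical stand-ins for this sketch, not actual CarbonData methods; the other names come from the diff above):
    
    ```scala
    // Illustrative pipeline only; hypothetical helpers are noted in the lead-in.
    val convertRDD = inputRDD.map(toRawRow)   // convert phase: rows are still raw here
    val sortRDD = convertRDD
      .map(toThreeParts)                      // sort phase: raw row -> 3-parts row,
      .sortBy(_.getData,                      // then globally sorted
        numPartitions = numPartitions)(RowOrdering, classTag[Array[AnyRef]])
    sortRDD.foreachPartition(encodeAndWrite)  // write phase: encode and write out
    ```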
    
    In the previous implementation, Carbondata sorted these raw rows and then converted each row to 3-parts in batches.
    
    In the current implementation, Carbondata first converts each row to 3-parts and then sorts these converted rows.
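    Side by side, reconstructed from the diff above (not standalone runnable code):
    
    ```scala
    // Previous: global sort on raw rows, then batch conversion per sorted partition.
    val previousSortRDD = convertRDD
      .sortBy(_.getData, numPartitions = numPartitions)(RowOrdering, classTag[Array[AnyRef]])
      .mapPartitionsWithIndex { case (index, rows) =>
        DataLoadProcessorStepOnSpark.convertTo3Parts(rows, index, modelBroadcast,
          sortStepRowCounter)
      }
    
    // Current: per-row conversion to 3-parts first, then global sort on converted rows.
    val currentSortRDD = convertRDD
      .map(r => DataLoadProcessorStepOnSpark.convertTo3Parts(r, TaskContext.getPartitionId(),
        modelBroadcast, sortStepRowConverter, sortStepRowCounter))
      .sortBy(_.getData, numPartitions = numPartitions)(RowOrdering, classTag[Array[AnyRef]])
    ```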
    
    While converting a raw row to a 3-parts row, the interface (`DataLoadProcessorStepOnSpark.convertTo3Parts`) has changed: it previously processed a batch of rows, but now it processes one row at a time.
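    Roughly, the signature change looks like this (parameter and return types are my guesses from the diff, not the exact declarations):
    
    ```scala
    // Before: batch-oriented, called once per partition from mapPartitionsWithIndex.
    def convertTo3Parts(
        rows: Iterator[CarbonRow],
        index: Int,
        modelBroadcast: Broadcast[CarbonLoadModel],
        rowCounter: Accumulator[Int]): Iterator[CarbonRow]
    
    // After: row-oriented, called once per row from map, so it can run before sortBy.
    def convertTo3Parts(
        row: CarbonRow,
        index: Int,
        modelBroadcast: Broadcast[CarbonLoadModel],
        rowConverter: SortStepRowHandler,
        rowCounter: Accumulator[Int]): CarbonRow
    ```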

