Github user QiangCai commented on a diff in the pull request:

    https://github.com/apache/carbondata/pull/2971#discussion_r238507037

    --- Diff: integration/spark-common/src/main/scala/org/apache/carbondata/spark/load/DataLoadProcessBuilderOnSpark.scala ---
    @@ -156,4 +158,132 @@ object DataLoadProcessBuilderOnSpark {
           Array((uniqueLoadStatusId, (loadMetadataDetails, executionErrors)))
         }
       }
    +
    +  /**
    +   * 1. range partition the whole input data
    +   * 2. for each range, sort the data and writ it to CarbonData files
    +   */
    +  def loadDataUsingRangeSort(
    +      sparkSession: SparkSession,
    +      dataFrame: Option[DataFrame],
    +      model: CarbonLoadModel,
    +      hadoopConf: Configuration): Array[(String, (LoadMetadataDetails, ExecutionErrors))] = {
    +    val originRDD = if (dataFrame.isDefined) {
    --- End diff --

    better, but after refactoring, the code logic is not clear.
    Now, these two flows already reuse the process steps.
---