Zhichao Zhang created CARBONDATA-1366:
------------------------------------------

Summary: When sort_scope=global_sort, use 'StorageLevel.MEMORY_AND_DISK_SER' instead of 'StorageLevel.MEMORY_AND_DISK' when persisting 'convertRDD' to improve loading performance
Key: CARBONDATA-1366
URL: https://issues.apache.org/jira/browse/CARBONDATA-1366
Project: CarbonData
Issue Type: Bug
Components: data-load, spark-integration
Affects Versions: 1.2.0
Reporter: Zhichao Zhang
Assignee: Zhichao Zhang
Priority: Minor
Fix For: 1.2.0

My test environment and configs are as follows:

Env: 6 executors, 9G memory + 6 cores per executor

Configs:
SINGLE_PASS=true
SORT_SCOPE=GLOBAL_SORT
spark.memory.fraction=0.5

With 'convertRDD.persist(StorageLevel.MEMORY_AND_DISK_SER)' in the method 'org.apache.carbondata.spark.load.DataLoadProcessBuilderOnSpark.loadDataUsingGlobalSort', it takes about 7.2 min to load 144136697 lines (10.9 G of parquet files). With 'convertRDD.persist(StorageLevel.MEMORY_AND_DISK)', it takes about 9.5 min to load the same 144136697 lines.
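
For illustration, the proposed change amounts to swapping the storage level passed to 'persist'. The following is a minimal standalone sketch, not the CarbonData code itself: the object 'PersistSketch', the helper 'persistSerialized', the local master and the toy RDD are all assumptions made for the example. Only 'RDD.persist', 'StorageLevel.MEMORY_AND_DISK_SER' and 'spark.memory.fraction' come from the issue and from the Spark API. A plausible reason for the speedup, which the issue does not state, is that serialized storage keeps each cached partition as a compact byte array, so more of the data fits in memory and less is spilled to disk, at the cost of extra CPU for deserialization.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Hypothetical helper (not part of CarbonData): cache an RDD in serialized
  // form, spilling partitions that do not fit in memory to disk.
  def persistSerialized[T](rdd: RDD[T]): RDD[T] =
    rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions for the sketch;
    // spark.memory.fraction mirrors the value reported in the issue.
    val conf = new SparkConf()
      .setAppName("persist-sketch")
      .setMaster("local[2]")
      .set("spark.memory.fraction", "0.5")
    val sc = new SparkContext(conf)
    try {
      // Toy stand-in for the converted rows that the load step would cache.
      val convertRDD = sc.parallelize(1 to 100000).map(i => (i, s"row-$i"))
      val cached = persistSerialized(convertRDD)
      cached.count() // the first action materializes the cache
      println(cached.getStorageLevel.description)
      cached.unpersist()
    } finally {
      sc.stop()
    }
  }
}
```

Whether the serialized level wins in a given deployment depends on row size, serializer and available executor memory, so the two levels are worth timing on the actual workload, as was done above.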