Hi all! For the past few days I have been trying to tune the performance parameters to improve the data loading speed (CarbonData with Spark) for a new customer. During the course of tuning I noticed a few points and got stuck, since this appears to be base behavior. Could you help me understand the rationale behind it, and suggest anything better if you have it?
1. When loading data from a folder with many small files, we distribute splits so as to use all available nodes in the cluster (based on file availability on each node) and launch a task on an executor on each node. On the executor, a split contains multiple blocks (i.e., multiple files, and therefore multiple record readers/input iterators to read from). We create output iterators based on the /carbon.number.of.cores.while.loading/ or /spark.executor.cores/ property. If it is configured as 4, then 4 carbondata files are written in the no_sort case. So each executor writes as many carbondata files as /carbon.number.of.cores.while.loading/ or /spark.executor.cores/ specifies, which leads to the generation of many small carbondata files for no_sort loads.

*My questions are: 1. Why do we distribute tasks among many nodes even when the amount of data to load is small? Why not take the file sizes into account as well and launch a minimal number of tasks? 2. Also, on the executor, why not control the number of output iterators (and in turn the number of carbondata files written) based on the number of records being processed? I understand that we can set the /carbon.number.of.cores.while.loading/ dynamic property to 1 to make each executor write a single carbondata file, but it is difficult to decide manually when to reduce it in such cases.*

2. When we do /insert into table select from/ or /create table as select from/, we launch a single task per node (with CoalescedRDDPartition) based on data locality, irrespective of the sort_scope of the target table. It has record readers/input iterators (with LazyRDDInternalRowIterator) for each file to read. Whereas when we run a simple /select * from table/ query, the number of tasks launched equals the number of carbondata files (with CARBON_TASK_DISTRIBUTION_BLOCK).

*My questions are: 1. Why do we launch only one task per node (i.e., on each node holding data) during data load?
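(As an aside on point 1, the manual workaround I mentioned looks like the following in a Spark SQL session. The table name and input path are just placeholders; only the SET line is the actual knob, and it applies globally to the session rather than adapting per load:)

```sql
-- Session-level dynamic property: force a single output iterator per
-- executor, so a no_sort load writes one carbondata file per task
-- instead of one per configured core.
SET carbon.number.of.cores.while.loading=1;

-- Hypothetical table/path, only to show where the property takes effect.
LOAD DATA INPATH 'hdfs://namenode/data/small_files' INTO TABLE t;
```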
If we have a cluster where more executor instances are available per node, why not launch multiple tasks (of course, based on data availability on the node)? Perhaps that could improve load performance?*

I would appreciate your help in understanding the rationale behind this, along with any better suggestions, or whether this is planned as future work. Thank you in advance.

Regards,
Venu

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
