Hi all! For the past few days I have been trying to tune the performance parameters to improve the data loading speed (CarbonData with Spark) for a new customer. During the course of tuning I noticed a few points and got stuck, since this appears to be base behavior. Could you help me understand the rationale behind it, and suggest anything better if you have it?
1. When loading data from a folder with many small files, we distribute splits so as to use all available nodes in the cluster (based on file availability on each node) and launch a task on an executor on each node. On the executor, a split contains multiple blocks (i.e., multiple files, and therefore multiple record readers/input iterators to read from). We create output iterators based on the /carbon.number.of.cores.while.loading/ or /spark.executor.cores/ property. If it is configured as 4, then 4 carbondata files are written in the no_sort case. So each executor writes as many carbondata files as /carbon.number.of.cores.while.loading/ or /spark.executor.cores/ specifies, which leads to the generation of many small carbondata files for no_sort loads.

*My questions are: 1. Why do we distribute tasks among many nodes even when the amount of data to load is small? Why not take the file sizes into account as well and launch a minimal number of tasks? 2. Also, on the executor, why not control the number of output iterators (and in turn the number of carbondata files written) based on the number of records being processed? I understand that we can set the /carbon.number.of.cores.while.loading/ dynamic property to 1 to make each executor write a single carbondata file, but it is difficult to decide manually when to reduce it in such cases.*

2. When we do /insert into table select from/ or /create table as select from/, we launch a single task per node (with CoalescedRDDPartition) based on data locality, irrespective of the sort_scope of the target table. It has record readers/input iterators (with LazyRDDInternalRowIterator) for each file to read. Whereas when we run a simple /select * from table/ query, the number of tasks launched equals the number of carbondata files (with CARBON_TASK_DISTRIBUTION_BLOCK).

*My questions are: 1. Why do we launch only one task per node (i.e., on each node holding data) during data load?
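(As an aside on point 1, the manual workaround I mentioned looks like the following in a Spark SQL session. The table name and input path are just placeholders; only the SET line is the actual knob, and it applies globally to the session rather than adapting per load:)

```sql
-- Session-level dynamic property: force a single output iterator per
-- executor, so a no_sort load writes one carbondata file per task
-- instead of one per configured core.
SET carbon.number.of.cores.while.loading=1;

-- Hypothetical table/path, only to show where the property takes effect.
LOAD DATA INPATH 'hdfs://namenode/data/small_files' INTO TABLE t;
```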
If we have a cluster where more executor instances are available per node, why not launch multiple tasks (of course, based on data availability on the node)? Perhaps that could improve load performance?*

I would appreciate your help in understanding the rationale behind this, along with any better suggestions, or whether this is planned as future work. Thank you in advance.

Regards,
Venu

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
