Hi Vishal,

Thank you for the response. 
Configuring `load_min_size_inmb` has helped to control the number of tasks
launched for a load from CSV, and it eventually reduces the number of
carbondata files as well.
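
For reference, this is roughly how the load is being run (a minimal sketch;
the path, table name and the 512 MB value are placeholders, and it assumes a
Carbon-enabled SparkSession):

    // Sketch of the CSV load with load_min_size_inmb; path, table name and
    // size value are illustrative, and a Carbon-enabled SparkSession is assumed.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("carbon-csv-load").getOrCreate()

    spark.sql(
      """
        |LOAD DATA INPATH 'hdfs://.../input_csv'
        |INTO TABLE sample_table
        |OPTIONS('load_min_size_inmb'='512')
      """.stripMargin)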


But in the case of the insert-into-table-select-from flow (`loadDataFrame()`),
the problem is not resolved, since the task launching approach there is
completely different from the one in `loadDataFile()`. Do you have suggestions
on any parameter to fine tune in the insert flow?
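
To be concrete, the flow I mean is the plain insert-select path (the table
names below are placeholders):

    // The insert flow in question; target_table and source_table are placeholder names.
    spark.sql("INSERT INTO TABLE target_table SELECT * FROM source_table")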

1. Is there any way to launch more than one task per node?
 
2. Is there any way to control the number of output carbondata files for the
target table when there are too many small carbondata files to read/select
from in the source table? Otherwise it generates as many output files as
there are input files.
    -> I tried the carbon property
`carbon.task.distribution`=`merge_small_files`, which could reduce the number
of files generated for the target table. The scan RDD with
CARBON_TASK_DISTRIBUTION_MERGE_FILES uses a mechanism similar to the global
partition load (it considers filesMaxPartitionBytes, filesOpenCostInBytes and
defaultParallelism for the split size), roughly as sketched below.
        But this property is not dynamically configurable, probably for some
reason? So I am not sure whether using that property is a good option in this
scenario.
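
For context, my understanding of the split sizing under merge_small_files
(mirroring the style of Spark's FilePartition.maxSplitBytes calculation; all
numbers below are made up) is roughly:

    // Rough sketch of how the split size seems to be derived when
    // carbon.task.distribution=merge_small_files is set; values are illustrative.
    val filesMaxPartitionBytes = 128L * 1024 * 1024   // spark.sql.files.maxPartitionBytes
    val filesOpenCostInBytes   = 4L * 1024 * 1024     // spark.sql.files.openCostInBytes
    val defaultParallelism     = 16L                   // sparkContext.defaultParallelism

    // e.g. 2000 small carbondata files of ~1 MB each in the source table
    val fileSizes: Seq[Long] = Seq.fill(2000)(1L * 1024 * 1024)

    val totalBytes   = fileSizes.map(_ + filesOpenCostInBytes).sum
    val bytesPerCore = totalBytes / defaultParallelism

    // Small files are packed together until a split reaches this size,
    // which is why far fewer tasks (and output files) are produced.
    val maxSplitBytes = math.min(filesMaxPartitionBytes, math.max(filesOpenCostInBytes, bytesPerCore))

With these example numbers maxSplitBytes works out to 128 MB, so the ~2000
small files would be packed into well under a hundred splits rather than one
task per file.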

Any suggestions would be very helpful.

regards,
Venu



