[DISCUSSION] Improve insert process

cenyuhai11 Thu, 26 Oct 2017 20:14:15 -0700

When I insert data into CarbonData from a table, I have to do the
following:
1. select count(1) from table1
and then
2. insert into table table1 select * from table1


Why do I have to execute "select count(1) from table1" first?
Because the number of tasks is computed by CarbonData, and it depends on how
many executor hosts are available at that moment!

I don't think this is the right approach. We should let Spark control the
number of tasks.
Setting the parameter "mapred.max.splits.size" is a common way to adjust the
number of tasks.
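
To make the relationship between split size and task count concrete, here is a
minimal sketch. It assumes the split-size rule used by Hadoop's
FileInputFormat, max(minSize, min(maxSize, blockSize)), and it ignores details
such as the slop factor on the last split. The function name and the file sizes
are hypothetical, and this is not CarbonData's own task planning.

```python
MB = 1024 * 1024


def estimate_task_count(file_sizes, block_size, max_split_size, min_split_size=1):
    """Estimate the number of input splits (one task per split).

    Simplified model: every file is cut into pieces of the effective
    split size, and files are never combined into one split.
    """
    # Effective split size, following the FileInputFormat-style rule.
    split_size = max(min_split_size, min(max_split_size, block_size))
    # Integer ceiling division per file; empty files produce no split.
    return sum((size + split_size - 1) // split_size for size in file_sizes if size > 0)


# Hypothetical input: one 1 GB file and one 300 MB file, 128 MB blocks.
files = [1024 * MB, 300 * MB]
print(estimate_task_count(files, block_size=128 * MB, max_split_size=256 * MB))  # 11
print(estimate_task_count(files, block_size=128 * MB, max_split_size=64 * MB))   # 21
```

In this model, lowering the maximum split size below the block size raises the
task count, while raising it above the block size changes nothing unless the
minimum split size is raised too.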

Even when I run step 2, some tasks still fail, which increases the
insert time.

So I suggest that we don't adjust the number of tasks and just use the
default behavior of Spark.
Then, if there are small files, add a fast merge job (merging data at the
blocklet level, just as )

We would then also need to set the default value of
"carbon.number.of.cores.while.loading" to 1.
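
For the merge job suggested above, one possible way to decide which small files
go together is a simple greedy grouping by size. This is only an illustration:
the function name, the target size, and the file sizes are all hypothetical, and
it does not describe an existing CarbonData feature or how a blocklet-level
merge would actually be implemented.

```python
def plan_merges(file_sizes, target_size):
    """Group file sizes into merge batches.

    Files are taken smallest first and added to the current batch until
    adding the next one would exceed target_size. A file that is already
    larger than target_size ends up in a batch of its own.
    """
    groups = []
    current, current_total = [], 0
    for size in sorted(file_sizes):
        # Close the current batch if this file would push it over the target.
        if current and current_total + size > target_size:
            groups.append(current)
            current, current_total = [], 0
        current.append(size)
        current_total += size
    if current:
        groups.append(current)
    return groups


# Hypothetical file sizes in MB, merged toward 100 MB outputs.
print(plan_merges([10, 20, 30, 200, 40], target_size=100))
# [[10, 20, 30, 40], [200]]
```

Batches with a single file need no merge, so only the multi-file batches would
be handed to the merge job.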






