When I insert data into CarbonData from one table, I have to do the following:
1. select count(1) from table1
2. insert into table table1 select * from table1

Why do I have to execute "select count(1) from table1" first? Because the number of tasks is computed by CarbonData, and it depends on how many executor hosts we have at that moment. I don't think this is the right way. We should let Spark control the number of tasks; setting the parameter "mapred.max.splits.size" is a common way to adjust the number of tasks. Even when I do step 2, some tasks still fail, which increases the insert time.

So I suggest that we don't adjust the number of tasks and just use Spark's default behavior. Then, if there are small files, we can add a fast merge job (merging data at the blocklet level, just as ). We would also need to set the default value of "carbon.number.of.cores.while.loading" to 1.
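
To make the point about split size concrete, here is a rough sketch of how a maximum split size translates into a task count. It is a simplification, not how Spark, Hadoop, or CarbonData actually plan splits: it assumes one task per split, ignores block boundaries and the minimum split size, and the function name and file sizes are made up for illustration.

```python
# Rough illustration only: one task per split, each file cut independently.
# Real split planning also considers block size, minimum split size, etc.

MB = 1024 * 1024


def estimate_task_count(file_sizes_bytes, max_split_bytes):
    """Estimate the number of tasks as the sum of ceil(size / max_split) per file."""
    if max_split_bytes <= 0:
        raise ValueError("max_split_bytes must be positive")
    # -(-a // b) is integer ceiling division; empty files produce no split.
    return sum(-(-size // max_split_bytes) for size in file_sizes_bytes if size > 0)


# Hypothetical input: one large file, one medium file, one small file.
files = [1024 * MB, 300 * MB, 10 * MB]

if __name__ == "__main__":
    for split_mb in (64, 128, 256):
        tasks = estimate_task_count(files, split_mb * MB)
        print(f"max split {split_mb} MB -> about {tasks} tasks")
```

Under these assumptions the task count depends only on the input sizes and the split size, not on how many executor hosts happen to be available when the query starts.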
