In carbon.properties, set enable.query.statistics=true; the query execution details will then be printed, and you can check them there.
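For reference, the two properties mentioned in this thread would go into carbon.properties roughly like this (a minimal sketch based only on the property names used below; defaults and exact behavior may differ by CarbonData version):

```properties
# Print per-query statistics (e.g. rows scanned, time spent per phase)
enable.query.statistics=true

# Distribute scan work at blocklet rather than block granularity
# (suggested later in this thread together with higher parallelism)
enable.blocklet.distribution=true
```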
Regards,
Jay

------------------ Original Message ------------------
From: <beido...@gmail.com>
Date: Jan 3, 2017 (Tue) 12:15 PM
To: "dev" <dev@carbondata.incubator.apache.org>
Subject: Re: how to make carbon run faster

Is there a way to show Carbon's index usage info: which indexes were used, the number of rows scanned, and the time spent? I use a multi-column index filter that should be quicker, e.g. a_id = 1 and b_id = 2 and c_id = 3 and day = '2017-01-01'. The same SQL on ORC, where day is the partition column but the *_id columns have no index, is faster or nearly equal. I think Carbon should do better in this case, so I want to know how to see Carbon's index usage, or whether my CSV load into Carbon was wrong, so the index is not used.

2017-01-03 11:08 GMT+08:00 Jay <2550062...@qq.com>:
> hi, beidou
>
> 1. The amount of your data is 36 GB; at 1 GB per block, 40 cores is
> enough, but I think every task may take too long, so I suggest increasing
> parallelism (for example, change --executor-cores 1 to 5). Then
> enable.blocklet.distribution=true may have more effect.
> 2. Try not to use the date function: change "date(a.create_time) >=
> '2016-11-01'" to "a.create_time >= '2016-11-01 00:00:00'", something like
> this.
>
> Regards
> Jay
>
> ------------------ Original Message ------------------
> From: <beido...@gmail.com>
> Date: Jan 2, 2017 (Mon) 9:35 PM
> To: "dev" <dev@carbondata.incubator.apache.org>
> Subject: Re: how to make carbon run faster
>
> > 1. You can add the date as a filter condition also, for example:
> > select * from test_carbon where status = xx (give a specific value)
> > and date(a.create_time) >= '2016-11-01' and
> > date(a.create_time) <= '2016-12-26'.
>
> I tested this case before; it was slower than ORC.
>
> > What are your exact business cases? Partitions and indexes are both
> > good ways to improve performance; I suggest increasing the data set to
> > more than 1 billion rows and trying again.
> >
> > 2. Each machine only has one CPU core?
> ------------------------------
> Yes, for easier debugging and to avoid CPU contention, I use one executor
> with one core on each machine. Each machine has 32 cores.
>
> 2017-01-02 20:35 GMT+08:00 Liang Chen <chenliang6...@gmail.com>:
> >
> > 2017-01-02 12:06 GMT+08:00 <beido...@gmail.com>:
> > >
> > > > 1. Can you tell how many rows of data the SQL generated?
> > >
> > > As in the SQL, most ids are related, so the result is small: about
> > > 10-20 rows returned.
> > >
> > > > 2. You can try more SQL queries, for example: select * from
> > > > test_carbon where status = xx (give a specific value); this will
> > > > use the left-most column to filter the query (to check the indexes'
> > > > effectiveness).
> > >
> > > So in this case, with no partitioning as in the Hive ORC SQL, Carbon
> > > should be faster.
> > >
> > > > 3. How many machines (nodes) did you use?
> > > > Because one executor will generate one index B+ tree; to fully
> > > > utilize the index, please try to reduce the number of executors.
> > > > Suggestion: launch one executor per machine/node (and increase the
> > > > executor's memory).
> > >
> > > Yes, for easier debugging and to avoid CPU contention, I use one
> > > executor with one core on each machine, but the query run time is
> > > still slower than the ORC SQL.
> > >
> > > Thanks
> > >
> > > 2017-01-02 11:24 GMT+08:00 Liang Chen <chenliang6...@gmail.com>:
> > > >
> > > > Hi
> > > >
> > > > Thanks for starting to try the Apache CarbonData project.
> > > >
> > > > There may be various reasons for the test result. I assume you made
> > > > a time-based partition for the ORC data, right?
> > > >
> > > > 1. Can you tell how many rows of data the SQL generated?
> > > >
> > > > 2. You can try more SQL queries, for example: select * from
> > > > test_carbon where status = xx (give a specific value); this will
> > > > use the left-most column to filter the query (to check the indexes'
> > > > effectiveness).
> > > >
> > > > 3. How many machines (nodes) did you use? Because one executor will
> > > > generate one index B+ tree; to fully utilize the index, please try
> > > > to reduce the number of executors. Suggestion: launch one executor
> > > > per machine/node (and increase the executor's memory).
> > > >
> > > > Regards
> > > > Liang
> > > >
> > > > geda wrote
> > > > > Hello:
> > > > > I tested the same data and the same SQL in two formats:
> > > > > 1. CarbonData, 2. Hive ORC, but the Carbon format runs slower
> > > > > than ORC.
> > > > > I use CarbonData with the index order as in the create table.
> > > > > Hive SQL (dt is the partition dir):
> > > > >
> > > > >   select count(1) as total, status, d_id from test_orc
> > > > >   where status != 17
> > > > >     and v_id in (91532, 91533, 91534, 91535, 91536, 91537, 10001)
> > > > >     and dt >= '2016-11-01' and dt <= '2016-12-26'
> > > > >   group by status, d_id order by total desc
> > > > >
> > > > > Carbon SQL (create_time is a timestamp):
> > > > >
> > > > >   select count(1) as total, status, d_id from test_carbon
> > > > >   where status != 17
> > > > >     and v_id in (91532, 91533, 91534, 91535, 91536, 91537, 10001)
> > > > >     and date(a.create_time) >= '2016-11-01'
> > > > >     and date(a.create_time) <= '2016-12-26'
> > > > >   group by status, d_id order by total desc
> > > > >
> > > > > The CarbonData table is created like:
> > > > >
> > > > >   CREATE TABLE test_carbon (
> > > > >     status int, v_id bigint, d_id bigint, create_time timestamp
> > > > >   ...
> > > > >   ...
> > > > >   'DICTIONARY_INCLUDE'='status,d_id,v_id,create_time')
> > > > >
> > > > > Run with spark-shell on 40 nodes, Spark 1.6.1, Carbon 0.2.0,
> > > > > Hadoop 2.6.3. Data: 2 months (60 days), about 300,000 rows per
> > > > > day, 600 MB of CSV per day.
> > > > >
> > > > >   $SPARK_HOME/bin/spark-shell --verbose --name "test" \
> > > > >     --master yarn-client --driver-memory 10G \
> > > > >     --executor-memory 16G --num-executors 40 --executor-cores 1
> > > > >
> > > > > I tested many cases:
> > > > > 1. GC tuning: no full GC.
> > > > > 2.
spark.sql.shuffle.partitions set so that all
> > > > >    tasks run at the same time.
> > > > > 3. carbon.conf: set enable.blocklet.distribution=true.
> > > > >
> > > > > I use this code to measure SQL run time:
> > > > >
> > > > >   val start = System.nanoTime()
> > > > >   body
> > > > >   (System.nanoTime() - start) / 1000 / 1000
> > > > >
> > > > > where body is sqlContext.sql(...).show().
> > > > > I find ORC comes back faster than Carbon.
> > > > >
> > > > > In the UI, Carbon and ORC sometimes run in more or less the same
> > > > > time (I think Carbon using the index should be faster, unless a
> > > > > sequential read is faster than an index scan), but ORC is more
> > > > > stable. The UI shows 5 s spent, but the return time is 8 s for
> > > > > ORC and 12 s for Carbon (I don't know how to find out where the
> > > > > time goes).
> > > > >
> > > > > Here are some screenshots (from many runs):
> > > > > Carbon runs:
> > > > > <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/carbon-slowest-job-run1.png>
> > > > > <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/carbon-slowest-job-run2.png>
> > > > > <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/carbon-slowest-job-total-run1.png>
> > > > > <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/carbon-slowest-job-total-run2.png>
> > > > > <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/carbon-slowest-run2.png>
> > > > >
> > > > > ORC runs:
> > > > > <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/hiveconext-slowest-job-total-run1.png>
> > > > > <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/hiveconext-slowest-total-run1.png>
> > > > >
> > > > > So my questions are:
> > > > > 1. Why, in spark-shell with sql.show(), does the ORC SQL return
> > > > >    faster than Carbon?
> > > > > 2.
In the Spark UI, Carbon should use the index to skip
> > > > >    more data; the scan sometimes takes 4 s, 2 s, or 0.2 s. How
> > > > >    can I make the slowest task faster?
> > > > > 3. As in the SQL, I filter on the left-most index column, so I
> > > > >    think it should run faster than the ORC test in this case,
> > > > >    but it does not. Why?
> > > > > 4. If that is the answer to question 3, is the explanation that
> > > > >    my data is too small, so a sequential read is faster than an
> > > > >    index scan?
> > > > >
> > > > > Sorry for my poor English. Help, thanks!
> > > >
> > > > --
> > > > View this message in context: http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/how-to-make-carbon-run-faster-tp5305p5322.html
> > > > Sent from the Apache CarbonData Mailing List archive mailing list archive at Nabble.com.
> >
> > --
> > Regards
> > Liang
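As a side-by-side sketch of Jay's date-function suggestion in this thread, applied to the query being benchmarked (table and column names as in the thread; note the rewritten upper bound uses an exclusive `< '2016-12-27 00:00:00'` rather than Jay's literal wording, so that rows from any time on Dec 26 are still included):

```sql
-- Original: date() is evaluated per row, so the timestamp index cannot
-- be used to prune data
SELECT count(1) AS total, status, d_id
FROM test_carbon
WHERE status != 17
  AND v_id IN (91532, 91533, 91534, 91535, 91536, 91537, 10001)
  AND date(create_time) >= '2016-11-01'
  AND date(create_time) <= '2016-12-26'
GROUP BY status, d_id
ORDER BY total DESC;

-- Rewritten: plain range predicates directly on the timestamp column
SELECT count(1) AS total, status, d_id
FROM test_carbon
WHERE status != 17
  AND v_id IN (91532, 91533, 91534, 91535, 91536, 91537, 10001)
  AND create_time >= '2016-11-01 00:00:00'
  AND create_time <  '2016-12-27 00:00:00'
GROUP BY status, d_id
ORDER BY total DESC;
```

Both forms select the same rows; the second leaves the filter column untransformed, which is what allows index-based pruning to apply.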