how to make carbon run faster

geda Fri, 30 Dec 2016 23:39:52 -0800

Hello:
I tested the same data with the same SQL in two formats: 1. CarbonData, 2. Hive ORC.
The CarbonData format runs slower than ORC.
In CarbonData, the index order I use is the column order of the CREATE TABLE statement.
Hive SQL (dt is the partition directory):
select count(1) as total, status, d_id from test_orc where status != 17 and
v_id in (91532,91533,91534,91535,91536,91537,10001) and dt >=
'2016-11-01' and dt <= '2016-12-26' group by status, d_id order by total
desc
CarbonData SQL (create_time is of timestamp type):

select count(1) as total, status, d_id from test_carbon a where status != 17 and
v_id in (91532,91533,91534,91535,91536,91537,10001) and
date(a.create_time) >= '2016-11-01' and date(a.create_time) <= '2016-12-26'
group by status, d_id order by total desc

The CarbonData table was created like this:
CREATE TABLE test_carbon ( status int, v_id bigint, d_id bigint, create_time
timestamp
...
...
'DICTIONARY_INCLUDE'='status,d_id,v_id,create_time')

I run it with spark-shell on 40 nodes, using Spark 1.6.1, CarbonData 0.20 and Hadoop 2.6.3.
The data covers 2 months (60 days), with 300,000 rows (30w) and about 600 MB of CSV per day.
The shell is started like this:
 $SPARK_HOME/bin/spark-shell --verbose --name "test"   --master yarn-client 
--driver-memory 10G   --executor-memory 16G --num-executors 40
--executor-cores 1 
I tested several cases:
 1. GC tuning: there is no full GC.
 2. spark.sql.shuffle.partitions: set so that all tasks run at the same time.
 3. In the CarbonData configuration (carbon.conf) I set
enable.blocklet.distribution=true

I use this code to measure the SQL run time (in milliseconds):
val start = System.nanoTime()
body
(System.nanoTime() - start)/1000/1000

where body is sqlContext.sql(sql).show()
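
The timing snippet above can be wrapped in a small reusable helper so that both formats are measured the same way, with warm-up runs and several repetitions. This is only a sketch in plain Scala; the names QueryTimer, timeMs, benchmark and median are made up for illustration and are not part of Spark or CarbonData.

```scala
object QueryTimer {
  /** Runs `body` once and returns its result together with the elapsed time in ms. */
  def timeMs[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    (result, elapsedMs)
  }

  /** Runs `body` `warmups` times untimed, then `runs` times timed; returns the timings in ms. */
  def benchmark[A](warmups: Int, runs: Int)(body: => A): Seq[Long] = {
    (1 to warmups).foreach(_ => body)        // warm-up runs are not measured
    (1 to runs).map(_ => timeMs(body)._2)    // keep only the elapsed times
  }

  /** Middle element of the sorted timings (upper median for even-length input). */
  def median(xs: Seq[Long]): Long = xs.sorted.apply(xs.length / 2)
}
```

In spark-shell this could be used as QueryTimer.benchmark(1, 5) { sqlContext.sql(query).show() }, where query is a placeholder for one of the SQL strings above. Timing the sql(...) call and the show() call separately with timeMs may also help to see how much of the return time falls outside the stages shown in the Spark UI.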
I find that ORC returns faster than CarbonData.

In the Spark UI, CarbonData and ORC sometimes take more or less the same time (I
thought CarbonData should be faster because it uses an index; or is a sequential
read faster than an index scan?), but ORC is more stable.
The UI shows about 5 s spent, but the measured return time is 8 s for ORC and 12 s
for CarbonData. (I don't know how to find out where the time is spent.)

Here are some screenshots from my runs (I ran the queries many times).
CarbonData runs:
<http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/carbon-slowest-job-run1.png>
 
<http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/carbon-slowest-job-run2.png>
 
<http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/carbon-slowest-job-total-run1.png>
 
<http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/carbon-slowest-job-total-run2.png>
 

<http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/carbon-slowest-run2.png>
 

ORC runs:
<http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/hiveconext-slowest-job-total-run1.png>
 


<http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/hiveconext-slowest-total-run1.png>
 


So my questions are:
1. Why does the ORC SQL return faster than the CarbonData SQL when run with
sql.show() in spark-shell?
2. In the Spark UI, CarbonData should use the index to skip more data, yet the
scan tasks take anywhere from 0.2 s to 2 s to 4 s. How can I make the slowest task faster?
3. The SQL filters on the leftmost index column, so I think it should run
faster than ORC in this case, but it does not. Why?
4. If the explanation for question 3 is that my data is too small, is a
sequential read then faster than an index scan?
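
For question 4, a rough sizing based on the numbers given above may help to judge how small the data set is. The sketch below only multiplies those figures out. It assumes that "30w" means 300,000 rows per day and that the load is spread evenly over the 40 executors. The sizes are raw CSV sizes, so the on-disk ORC and CarbonData sizes will differ. The object name DataVolume is made up for illustration.

```scala
object DataVolume {
  val days = 60                 // 2 months of data
  val rowsPerDay = 300000L      // "30w" read as 300,000 rows (assumption)
  val csvMbPerDay = 600L        // raw CSV size per day, in MB
  val executors = 40            // from --num-executors 40

  val totalRows: Long = days * rowsPerDay              // 18,000,000 rows
  val totalCsvMb: Long = days * csvMbPerDay            // 36,000 MB of CSV
  val rowsPerExecutor: Long = totalRows / executors    // 450,000 rows
  val csvMbPerExecutor: Long = totalCsvMb / executors  // 900 MB of CSV

  def main(args: Array[String]): Unit = {
    println(s"total rows:        $totalRows")
    println(s"total CSV (MB):    $totalCsvMb")
    println(s"rows per executor: $rowsPerExecutor")
    println(s"CSV MB per exec.:  $csvMbPerExecutor")
  }
}
```

Under these assumptions each executor handles roughly 450,000 rows (about 900 MB of raw CSV) before any index or partition pruning.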

Sorry for my poor English. Please help, thanks!








