Re: how to make carbon run faster

Liang Chen Mon, 02 Jan 2017 04:36:08 -0800

1.
You can also add the date as a filter condition, for example: select * from
test_carbon where
status = xx (give a specific value) and date(a.create_time) >= '2016-11-01'
and date(a.create_time) <= '2016-12-26'.


What are your exact business cases? Partitioning and indexes are both good
ways to improve performance. I suggest you increase the data set to more than
1 billion rows and try again.
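
As a rough sanity check on the data size, here is a small sketch in Scala (it
can be pasted into spark-shell). It uses only the figures quoted below, reading
"30w" as 300,000 rows per day; the object and value names are made up for
illustration.

```scala
// Rough data-volume estimate from the numbers quoted in this thread.
// Object and value names are illustrative only.
object DataSize {
  val days: Long = 60L            // two months of data
  val rowsPerDay: Long = 300000L  // "30w" rows per day
  val csvMbPerDay: Long = 600L    // about 600 MB of CSV per day

  // Total rows and CSV volume in the current test.
  val totalRows: Long = days * rowsPerDay
  val totalCsvMb: Long = days * csvMbPerDay

  // The suggested data set size of 1 billion rows.
  val targetRows: Long = 1000000000L

  // How many times larger the suggested data set is than the current one.
  val scaleFactor: Double = targetRows.toDouble / totalRows
}
```

So the current test covers about 18 million rows (roughly 36 GB of CSV), which
is about 56 times smaller than the suggested 1 billion rows.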

2. Does each machine have only one CPU core?
------------------------------
Yes. For easier debugging and to avoid CPU contention, I use one executor with
one core on each machine.
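
When you run the comparison again, it may help to time each query several
times and compare medians rather than single runs, since the results reported
below vary between runs. This is a minimal sketch built on the
System.nanoTime() snippet quoted below; QueryTimer, timeMs, repeat and median
are hypothetical helper names, not part of Spark or CarbonData.

```scala
// Hypothetical timing helpers for repeated query runs.
object QueryTimer {
  // Runs `body` once and returns the elapsed wall-clock time in milliseconds.
  def timeMs[A](body: => A): Long = {
    val start = System.nanoTime()
    body
    (System.nanoTime() - start) / 1000 / 1000
  }

  // Runs `body` `warmups` times untimed, then `runs` times timed,
  // and returns the timings in ascending order.
  def repeat[A](runs: Int, warmups: Int = 1)(body: => A): Seq[Long] = {
    require(runs > 0, "runs must be positive")
    (1 to warmups).foreach(_ => body)
    (1 to runs).map(_ => timeMs(body)).sorted
  }

  // Median of already-sorted timings (the upper median for even lengths).
  def median(sorted: Seq[Long]): Long = {
    require(sorted.nonEmpty, "need at least one timing")
    sorted(sorted.length / 2)
  }
}
```

In spark-shell this could be used as
QueryTimer.repeat(5) { sqlContext.sql(query).show() }, assuming sqlContext is
the context the tables are registered in and query holds the SQL text.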

Regards
Liang


2017-01-02 12:06 GMT+08:00 北斗七 <beido...@gmail.com>:

> 1. Can you tell how many rows of data the SQL returns?
>
> For this SQL most ids are related, so the result is small: 10~20 rows are
> returned.
>
> 2. You can try more SQL queries, for example: select * from test_carbon where
> status = xx (give a specific value). This example uses the leftmost
> column to filter the query (to check the effectiveness of the indexes).
>
> So in this case the Hive ORC SQL may not use any partition, so Carbon should
> be faster.
>
> 3. How many machines (nodes) did you use? Because each executor generates
> one B+ tree index, please try to reduce the number of executors to fully
> utilize the index. Suggestion: launch one executor per machine/node (and
> increase the executor's memory).
>
> Yes. For easier debugging and to avoid CPU contention, I use one executor with
> one core on each machine,
> but the query still runs slower than the ORC SQL.
>
>
> thanks
>
> 2017-01-02 11:24 GMT+08:00 Liang Chen <chenliang6...@gmail.com>:
>
> > Hi
> >
> > Thanks for trying the Apache CarbonData project.
> >
> > There may be various reasons for the test result. I assume that you
> > made a time-based partition for the ORC data, right?
> > 1. Can you tell how many rows of data the SQL returns?
> >
> > 2. You can try more SQL queries, for example: select * from test_carbon
> > where status = xx (give a specific value). This example uses the leftmost
> > column to filter the query (to check the effectiveness of the indexes).
> >
> > 3. How many machines (nodes) did you use? Because each executor generates
> > one B+ tree index, please try to reduce the number of executors to fully
> > utilize the index. Suggestion: launch one executor per machine/node (and
> > increase the executor's memory).
> >
> > Regards
> > Liang
> >
> >
> > geda wrote
> > > Hello:
> > > I tested the same data with the same SQL in two formats: 1. CarbonData,
> > > 2. Hive ORC, but the Carbon format runs slower than ORC.
> > > I use CarbonData with the index order following the column order of the
> > > create table statement.
> > > Hive SQL (dt is the partition directory):
> > > select count(1) as total, status, d_id from test_orc where status != 17
> > > and v_id in (91532,91533,91534,91535,91536,91537,10001) and dt >=
> > > '2016-11-01' and dt <= '2016-12-26' group by status, d_id order by total
> > > desc
> > > Carbon SQL (create_time is of timestamp type):
> > >
> > > select count(1) as total, status, d_id from test_carbon where status != 17
> > > and v_id in (91532,91533,91534,91535,91536,91537,10001) and
> > > date(a.create_time) >= '2016-11-01' and date(a.create_time) <= '2016-12-26'
> > > group by status, d_id order by total desc
> > >
> > > The CarbonData table is created like this:
> > > CREATE TABLE test_carbon ( status int, v_id bigint, d_id bigint,
> > > create_time timestamp
> > > ...
> > > ...
> > > 'DICTIONARY_INCLUDE'='status,d_id,v_id,create_time')
> > >
> > > Run with spark-shell on 40 nodes: Spark 1.6.1, Carbon 0.20, hadoop-2.6.3.
> > > Data: 2 months (60 days), 300,000 rows (30w) per day, 600 MB of CSV per day.
> > > The shell is launched like this:
> > >  $SPARK_HOME/bin/spark-shell --verbose --name "test" --master
> > > yarn-client --driver-memory 10G --executor-memory 16G --num-executors
> > > 40 --executor-cores 1
> > >  I tested many cases:
> > >  1. GC tuning: no full GC.
> > >  2. spark.sql.shuffle.partitions: all tasks run at the same time.
> > >  3. carbon.conf setting:
> > > enable.blocklet.distribution=true
> > >
> > > I use this code to measure the SQL run time:
> > > val start = System.nanoTime()
> > >   body
> > >   (System.nanoTime() - start)/1000/1000
> > >
> > > where body is sqlcontext(sql).show()
> > > I find that ORC returns faster than Carbon.
> > >
> > > In the UI, Carbon and ORC sometimes take more or less the same time (I
> > > think Carbon should be faster because it uses the index; or is a
> > > sequential scan faster than an index scan?), but ORC is more stable.
> > > The UI shows 5s spent, but the return time is 8s for ORC and 12s for
> > > Carbon. (I don't know how to find out where the time is spent.)
> > >
> > > Here are some pictures of my runs (I ran many times).
> > > Carbon run:
> > > <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/carbon-slowest-job-run1.png>
> > > <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/carbon-slowest-job-run2.png>
> > > <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/carbon-slowest-job-total-run1.png>
> > > <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/carbon-slowest-job-total-run2.png>
> > > <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/carbon-slowest-run2.png>
> > >
> > > ORC run:
> > > <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/hiveconext-slowest-job-total-run1.png>
> > > <http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/file/n5305/hiveconext-slowest-total-run1.png>
> > >
> > >
> > > So my questions are:
> > > 1. Why, in spark-shell with sql.show(), does the ORC SQL return faster
> > > than Carbon?
> > > 2. In the Spark UI, Carbon should use the index to skip more data, yet
> > > scan tasks sometimes take 4s, 2s or 0.2s. How can I make the slowest
> > > task faster?
> > > 3. In this SQL I use the leftmost index column for the scan, so I think it
> > > should run faster than ORC in this case, but it does not. Why?
> > > 4. Is the explanation for question 3 that my data is too small, so a
> > > sequential read is faster than an index scan?
> > >
> > > Sorry for my poor English. Please help, thanks!
> >
> >
> >
> >
> >
> > --
> > View this message in context:
> > http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/how-to-make-carbon-run-faster-tp5305p5322.html
> > Sent from the Apache CarbonData Mailing List archive mailing list archive
> > at Nabble.com.
> >
>



-- 
Regards
Liang
