Re: Presto+CarbonData optimization work discussion

2017-09-01 Thread Liang Chen
Hi

Based on pull request 1307, the latest test result is below; performance is
improved by 3x.

presto:default> select province,sum(age),count(*) from presto_carbon_dict
group by province order by province;
 province |  _col1   |  _col2
----------+----------+---------
 AB   | 57442740 | 1385010
 BC   | 57488826 | 1385580
 MB   | 57564702 | 1386510
 NB   | 57599520 | 1386960
 NL   | 57446592 | 1383774
 NS   | 57448734 | 1384272
 NT   | 57534228 | 1386936
 NU   | 57506844 | 1385346
 ON   | 57484956 | 1384470
 PE   | 57325164 | 1379802
 QC   | 57467886 | 1385076
 SK   | 57385152 | 1382364
 YT   | 57377556 | 1383900
(13 rows)

Query 20170902_033821_6_h6g24, FINISHED, 1 node
Splits: 50 total, 50 done (100.00%)
0:03 [18M rows, 0B] [6.62M rows/s, 0B/s]
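For context, the kind of optimization being measured in this thread, lazy decoding of the dictionary during aggregation, can be sketched as follows. This is a simplified illustration, not CarbonData's or the pull request's actual code; the data values are made up:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of dictionary lazy decode: aggregation runs on cheap
// integer surrogate keys, and the dictionary value is looked up only once
// per group instead of once per row.
public class LazyDecodeSketch {
    public static void main(String[] args) {
        String[] dictionary = {"AB", "BC", "MB"};   // surrogate key -> value
        int[] provinceKeys = {0, 1, 1, 2, 0, 2, 2}; // dictionary-encoded column
        int[] ages = {30, 25, 40, 35, 20, 50, 45};

        // Aggregate on the integer keys, never touching the strings.
        Map<Integer, Long> sumAge = new HashMap<>();
        Map<Integer, Long> count = new HashMap<>();
        for (int i = 0; i < provinceKeys.length; i++) {
            sumAge.merge(provinceKeys[i], (long) ages[i], Long::sum);
            count.merge(provinceKeys[i], 1L, Long::sum);
        }

        // Decode lazily: one dictionary lookup per group.
        for (int key : sumAge.keySet()) {
            System.out.println(dictionary[key] + " " + sumAge.get(key)
                    + " " + count.get(key));
        }
    }
}
```

With 18M rows and 13 provinces, eager decoding pays the string-materialization cost 18M times, while lazy decoding pays it 13 times, which is why it matters most for aggregation queries.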


Regards
Liang


Liang Chen wrote
> Hi
> 
> For -- 4) Lazy decoding of the dictionary: I tested 180 million rows of
> data with the query:
> "select province,sum(age),count(*) from presto_carbondata group by
> province order by province"
> 
> The Spark integration module has dictionary lazy decode and the Presto
> integration does not; the performance differs by 4.5x, so dictionary lazy
> decode could help aggregation performance considerably.
> 
> The detailed test results are below:
> *1. Presto+CarbonData: 9 seconds*
> presto:default> select province,sum(age),count(*) from presto_carbondata
> group by province order by province;
>  province |  _col1   |  _col2
> ----------+----------+---------
>  AB   | 57442740 | 1385010
>  BC   | 57488826 | 1385580
>  MB   | 57564702 | 1386510
>  NB   | 57599520 | 1386960
>  NL   | 57446592 | 1383774
>  NS   | 57448734 | 1384272
>  NT   | 57534228 | 1386936
>  NU   | 57506844 | 1385346
>  ON   | 57484956 | 1384470
>  PE   | 57325164 | 1379802
>  QC   | 57467886 | 1385076
>  SK   | 57385152 | 1382364
>  YT   | 57377556 | 1383900
> (13 rows)
> 
> Query 20170720_022833_4_c9ky2, FINISHED, 1 node
> Splits: 55 total, 55 done (100.00%)
> 0:09 [18M rows, 34.3MB] [1.92M rows/s, 3.65MB/s]
> *2. Spark+CarbonData: 2 seconds*
> scala> benchmark { carbon.sql("select province,sum(age),count(*) from
> presto_carbondata group by province order by province").show }
> ++++
> |province|sum(age)|count(1)|
> ++++
> |  AB|57442740| 1385010|
> |  BC|57488826| 1385580|
> |  MB|57564702| 1386510|
> |  NB|57599520| 1386960|
> |  NL|57446592| 1383774|
> |  NS|57448734| 1384272|
> |  NT|57534228| 1386936|
> |  NU|57506844| 1385346|
> |  ON|57484956| 1384470|
> |  PE|57325164| 1379802|
> |  QC|57467886| 1385076|
> |  SK|57385152| 1382364|
> |  YT|57377556| 1383900|
> ++++
> 
> 2109.346231ms
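The `benchmark { ... }` wrapper used in the quoted Spark session is not defined anywhere in the thread. A minimal equivalent timing helper (sketched here in Java; the original was presumably a small Scala function) might look like:

```java
import java.util.function.Supplier;

// Minimal timing helper analogous to the `benchmark { ... }` wrapper used in
// the Scala session above; runs the body once and prints elapsed milliseconds.
public class Benchmark {
    static <T> T benchmark(Supplier<T> body) {
        long t0 = System.nanoTime();
        T result = body.get();
        long t1 = System.nanoTime();
        System.out.printf("%.6fms%n", (t1 - t0) / 1e6);
        return result;
    }

    public static void main(String[] args) {
        // Example usage with a trivial workload in place of a query.
        int sum = benchmark(() ->
                java.util.stream.IntStream.rangeClosed(1, 1000).sum());
        System.out.println(sum);
    }
}
```

Note that a single-run timer like this includes warm-up cost (JIT, caches); for fairer comparisons the body is usually run several times and the best or median time reported.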





--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: Presto+CarbonData optimization work discussion

2017-07-28 Thread rui qin
Hi,
   Based on the TPC-H test, I ran the 10 GB dataset; the results are attached
to this email. We see a little improvement. However, with increased
concurrency, SQL execution performance drops; compared with the previous
test, it has not improved. The block size is 1024 MB and we used your create
statement.





prestoTest.xlsx
<http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/n19002/prestoTest.xlsx>
  





Re: Presto+CarbonData optimization work discussion

2017-07-26 Thread rui qin
Please provide the statement used to create the table in CarbonData. And is
the block size 1024 MB?
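For reference, the kind of create statement being requested would, in CarbonData 1.x, look roughly like the following. This is a hypothetical sketch; the actual column list and property values used in the tests are not shown in the thread:

```sql
-- Hypothetical sketch of a CarbonData 1.x create statement.
-- TABLE_BLOCKSIZE is specified in MB; DICTIONARY_INCLUDE requests
-- dictionary encoding for the listed columns.
CREATE TABLE presto_carbondata (
  province STRING,
  age INT
)
STORED BY 'carbondata'
TBLPROPERTIES ('TABLE_BLOCKSIZE'='1024',
               'DICTIONARY_INCLUDE'='province')
```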





Re: Presto+CarbonData optimization work discussion

2017-07-25 Thread Bhavya Aggarwal
I have created pull request 1190 for Presto optimization, where we have made
the following changes to improve performance:

1. Removed unnecessary loops from the integration code to make it more
efficient.
2. Implemented lazy blocks, as used for ORC.
3. Improved dictionary decoding to get better results.

I ran this on my local machine with 2 GB of data; the results are attached
to this email. We see an improvement in almost all of the TPC-H queries we
ran.
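The lazy-block idea in point 2 can be sketched as follows. This is a simplified illustration of the concept, not Presto's actual `LazyBlock` implementation: a block's values are decoded only when a downstream operator actually loads them, so data that is filtered out or never touched is never materialized.

```java
import java.util.function.Supplier;

// Simplified sketch of a lazy block: decoding is deferred until first access
// and the decoded values are cached for subsequent accesses.
public class LazyBlockSketch {
    static class LazyBlock {
        private final Supplier<int[]> loader;
        private int[] values;  // null until first access

        LazyBlock(Supplier<int[]> loader) { this.loader = loader; }

        int[] load() {
            if (values == null) {
                System.out.println("decoding block"); // happens at most once
                values = loader.get();
            }
            return values;
        }
    }

    public static void main(String[] args) {
        LazyBlock block = new LazyBlock(() -> new int[]{1, 2, 3});
        System.out.println("block created");     // no decoding cost yet
        System.out.println(block.load().length); // first access decodes
        System.out.println(block.load().length); // second access reuses cache
    }
}
```

The win comes from queries that prune rows or columns early: blocks that are skipped pay only the cost of constructing the (cheap) lazy wrapper.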

Thanks and regards
Bhavya

On Thu, Jul 20, 2017 at 12:21 PM, rui qin <qr7...@gmail.com> wrote:

> For -- 6) Spark has the vectorized feature, but Presto does not. How do we
> implement it?
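The vectorized (batch-at-a-time) reading that the quoted question refers to can be sketched, in very simplified form, as below. This is only an illustration of the idea, not connector code; Presto's SPI is in fact batch-oriented, passing pages of column blocks between operators rather than single rows:

```java
// Simplified sketch of vectorized reading: the reader hands back a batch of
// column values per call instead of one row, so per-call overhead is
// amortized and the inner loop stays tight over primitive arrays.
public class VectorizedSketch {
    public static void main(String[] args) {
        int[] data = new int[10_000];
        for (int i = 0; i < data.length; i++) data[i] = 1;

        final int BATCH = 1024;
        long sum = 0;
        // Process BATCH values per "read" call instead of one value per call.
        for (int start = 0; start < data.length; start += BATCH) {
            int end = Math.min(start + BATCH, data.length);
            for (int i = start; i < end; i++) {
                sum += data[i]; // tight, easily optimized inner loop
            }
        }
        System.out.println(sum);
    }
}
```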


PrestoQueryResults.xlsx
Description: MS-Excel 2007 spreadsheet


Re: Presto+CarbonData optimization work discussion

2017-07-19 Thread Liang Chen
Hi Ravi

Thanks for your comment.

I tested again, excluding province from the dictionary. In Spark the query
time is around 3 seconds; in Presto it is still 9 seconds. So for this query
case (a short string column), dictionary lazy decode might not be the key
factor.

Regards
Liang



Re: Presto+CarbonData optimization work discussion

2017-07-19 Thread Ravindra Pesala
Hi Liang,

I see that the province column data is not big, so I guess it hardly makes
any impact with lazy decoding in this scenario. Can you do one more test,
excluding province from the dictionary in both the Presto and Spark
integrations? It will tell us whether it is really a lazy decoding issue or
not.
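In CarbonData 1.x, excluding a column from the dictionary is done at table-creation time. A hedged sketch, using hypothetical table and column definitions since the real create statement is not shown in the thread:

```sql
-- Hypothetical sketch: exclude `province` from dictionary encoding so both
-- engines read it as a plain string column (CarbonData 1.x table property).
CREATE TABLE presto_carbondata_nodict (
  province STRING,
  age INT
)
STORED BY 'carbondata'
TBLPROPERTIES ('DICTIONARY_EXCLUDE'='province')
```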

Regards,
Ravindra




-- 
Thanks & Regards,
Ravi