Re: Presto+CarbonData optimization work discussion
Hi

Based on pull request 1307, the latest test result is as below; the performance is improved 3 times.

presto:default> select province,sum(age),count(*) from presto_carbon_dict group by province order by province;
 province |  _col1   |  _col2
----------+----------+---------
 AB       | 57442740 | 1385010
 BC       | 57488826 | 1385580
 MB       | 57564702 | 1386510
 NB       | 57599520 | 1386960
 NL       | 57446592 | 1383774
 NS       | 57448734 | 1384272
 NT       | 57534228 | 1386936
 NU       | 57506844 | 1385346
 ON       | 57484956 | 1384470
 PE       | 57325164 | 1379802
 QC       | 57467886 | 1385076
 SK       | 57385152 | 1382364
 YT       | 57377556 | 1383900
(13 rows)

Query 20170902_033821_6_h6g24, FINISHED, 1 node
Splits: 50 total, 50 done (100.00%)
0:03 [18M rows, 0B] [6.62M rows/s, 0B/s]

Regards
Liang

Liang Chen wrote
> Hi
>
> For -- 4) Lazy decoding of the dictionary, I just tested 180 million rows
> of data with the script:
> "select province,sum(age),count(*) from presto_carbondata group by
> province order by province"
>
> The Spark integration module has "dictionary lazy decode"; Presto doesn't.
> The performance difference is 4.5 times, so "dictionary lazy decode" might
> help much to improve aggregation performance.
>
> The detailed test results are as below:
>
> *1. Presto+CarbonData is 9 seconds:*
> presto:default> select province,sum(age),count(*) from presto_carbondata
> group by province order by province;
>  province |  _col1   |  _col2
> ----------+----------+---------
>  AB       | 57442740 | 1385010
>  BC       | 57488826 | 1385580
>  MB       | 57564702 | 1386510
>  NB       | 57599520 | 1386960
>  NL       | 57446592 | 1383774
>  NS       | 57448734 | 1384272
>  NT       | 57534228 | 1386936
>  NU       | 57506844 | 1385346
>  ON       | 57484956 | 1384470
>  PE       | 57325164 | 1379802
>  QC       | 57467886 | 1385076
>  SK       | 57385152 | 1382364
>  YT       | 57377556 | 1383900
> (13 rows)
>
> Query 20170720_022833_4_c9ky2, FINISHED, 1 node
> Splits: 55 total, 55 done (100.00%)
> 0:09 [18M rows, 34.3MB] [1.92M rows/s, 3.65MB/s]
>
> *2. Spark+CarbonData is 2 seconds:*
> scala> benchmark { carbon.sql("select province,sum(age),count(*) from
> presto_carbondata group by province order by province").show }
> +--------+--------+--------+
> |province|sum(age)|count(1)|
> +--------+--------+--------+
> |      AB|57442740| 1385010|
> |      BC|57488826| 1385580|
> |      MB|57564702| 1386510|
> |      NB|57599520| 1386960|
> |      NL|57446592| 1383774|
> |      NS|57448734| 1384272|
> |      NT|57534228| 1386936|
> |      NU|57506844| 1385346|
> |      ON|57484956| 1384470|
> |      PE|57325164| 1379802|
> |      QC|57467886| 1385076|
> |      SK|57385152| 1382364|
> |      YT|57377556| 1383900|
> +--------+--------+--------+
>
> 2109.346231ms

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
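The "dictionary lazy decode" technique discussed above can be sketched as follows: the group-by aggregation runs entirely on the integer surrogate keys, and each distinct key is decoded to its string value only once at the end, instead of once per row. This is a minimal illustrative sketch with hypothetical names and data, not the actual CarbonData or Spark implementation:

```scala
// Illustrative sketch of dictionary lazy decode (hypothetical names/data):
// aggregate on integer surrogate keys, decode once per distinct key.

// dictionary: surrogate key -> province value
val dictionary = Map(1 -> "AB", 2 -> "BC", 3 -> "MB")

// encoded rows as (provinceKey, age) -- no strings materialised per row
val rows = Seq((1, 30), (2, 25), (1, 40), (3, 35), (2, 20))

// group and aggregate purely on the integer keys
val aggregated = rows.groupBy(_._1).map {
  case (key, group) => (key, group.map(_._2).sum, group.size)
}

// decode each distinct group key exactly once, at the very end
val result = aggregated.map {
  case (key, sumAge, cnt) => (dictionary(key), sumAge, cnt)
}.toSeq.sortBy(_._1)

result.foreach(println)
```

With 18M rows and 13 provinces, this turns 18M string decodes into 13, which is also why lazy decode mainly pays off when the decoded values are large or the row count dwarfs the group count (see Ravindra's point about short `province` strings later in the thread).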
Re: Presto+CarbonData optimization work discussion
Hi,

Based on a TPC-H test, I have run the 10 GB data and the results are attached with this email. We see a little improvement. However, with the increase in concurrency, the SQL execution time will drop. Compared with the previous test, it has not been improved. The block size is 1024 MB and we use your create statement.

prestoTest.xlsx <http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/file/n19002/prestoTest.xlsx>

--
View this message in context: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Presto-CarbonData-optimization-work-discussion-tp18509p19002.html
Sent from the Apache CarbonData Dev Mailing List archive mailing list archive at Nabble.com.
Re: Presto+CarbonData optimization work discussion
Please provide the statement used to create the table in CarbonData. And is the block size 1024 MB?

--
View this message in context: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Presto-CarbonData-optimization-work-discussion-tp18509p18803.html
Sent from the Apache CarbonData Dev Mailing List archive mailing list archive at Nabble.com.
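For reference, a CarbonData table of the shape used in these benchmarks would typically be created along the following lines. The column list and property values here are illustrative assumptions, not the actual statement being requested:

```sql
-- Illustrative only: columns and property values are assumptions.
CREATE TABLE presto_carbondata (
  id INT,
  province STRING,
  age INT
)
STORED BY 'carbondata'
TBLPROPERTIES (
  'DICTIONARY_INCLUDE'='province',  -- dictionary-encode the group-by column
  'TABLE_BLOCKSIZE'='1024'          -- block size in MB
)
```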
Re: Presto+CarbonData optimization work discussion
I have created pull request 1190 for Presto optimization, where we have made the following changes to improve the performance:

1. Removed unnecessary loops from the integration code to make it more efficient.
2. Implemented lazy blocks, as is used in the case of ORC.
3. Improved dictionary decoding to get better results.

I have run this on my local machine for 2 GB data and the results are attached with this email; we see an improvement in almost all TPC-H queries that we have run.

Thanks and regards
Bhavya

On Thu, Jul 20, 2017 at 12:21 PM, rui qin <qr7...@gmail.com> wrote:
> For -- 6) Spark has the vectorized feature, but Presto does not. How to
> implement it?
>
> --
> View this message in context: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Presto-CarbonData-optimization-work-discussion-tp18509p18548.html
> Sent from the Apache CarbonData Dev Mailing List archive mailing list
> archive at Nabble.com.

PrestoQueryResults.xlsx
Description: MS-Excel 2007 spreadsheet
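The lazy-block idea in point 2 mirrors what Presto does for ORC: a block defers materialising its column data until the first read, so columns that a query never touches (for example, columns pruned by projection) are never loaded from storage. A minimal sketch of the pattern, with illustrative names rather than the real Presto SPI:

```scala
// Minimal sketch of a lazy block (illustrative, not the real Presto SPI):
// the loader runs only on first access, so untouched columns cost nothing.
class LazyBlock(loader: () => Array[Int]) {
  private var data: Array[Int] = null
  var loadCount = 0 // how many times the loader actually ran

  def getInt(position: Int): Int = {
    if (data == null) { data = loader(); loadCount += 1 }
    data(position)
  }
}

val block = new LazyBlock(() => Array(10, 20, 30))
val before = block.loadCount   // creating the block reads nothing
val v = block.getInt(1)        // first access triggers the single load
val w = block.getInt(2)        // later reads reuse the loaded array
```

The design choice is the same as in the ORC reader: construction is free, and the cost of reading a column is paid only by queries that actually consume it.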
Re: Presto+CarbonData optimization work discussion
Hi Ravi

Thanks for your comment. I tested again with province excluded from the dictionary. In Spark the query time is around 3 seconds; in Presto the same query takes 9 seconds. So for this query case (short strings), dictionary lazy decode might not be the key factor.

Regards
Liang

2017-07-20 10:56 GMT+08:00 Ravindra Pesala <ravi.pes...@gmail.com>:
> Hi Liang,
>
> I see that the province column data is not big, so I guess it hardly makes
> any impact with lazy decoding in this scenario. Can you do one more test by
> excluding province from the dictionary in both the Presto and Spark
> integrations? It will tell whether it is really a lazy decoding issue or
> not.
>
> Regards,
> Ravindra
>
> On 20 July 2017 at 08:04, Liang Chen <chenliang6...@gmail.com> wrote:
> > Hi
> >
> > For -- 4) Lazy decoding of the dictionary, I just tested 180 million
> > rows of data with the script:
> > "select province,sum(age),count(*) from presto_carbondata group by
> > province order by province"
> >
> > The Spark integration module has "dictionary lazy decode"; Presto
> > doesn't. The performance difference is 4.5 times, so "dictionary lazy
> > decode" might help much to improve aggregation performance.
> >
> > The detailed test results are as below:
> >
> > *1. Presto+CarbonData is 9 seconds:*
> > [...]
> > Query 20170720_022833_4_c9ky2, FINISHED, 1 node
> > Splits: 55 total, 55 done (100.00%)
> > 0:09 [18M rows, 34.3MB] [1.92M rows/s, 3.65MB/s]
> >
> > *2. Spark+CarbonData is 2 seconds:*
> > [...]
> > 2109.346231ms
> >
> > --
> > View this message in context: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Presto-CarbonData-optimization-work-discussion-tp18509p18522.html
> > Sent from the Apache CarbonData Dev Mailing List archive mailing list
> > archive at Nabble.com.

--
Thanks & Regards,
Ravi
Re: Presto+CarbonData optimization work discussion
Hi Liang,

I see that the province column data is not big, so I guess it hardly makes any impact with lazy decoding in this scenario. Can you do one more test by excluding province from the dictionary in both the Presto and Spark integrations? It will tell whether it is really a lazy decoding issue or not.

Regards,
Ravindra

On 20 July 2017 at 08:04, Liang Chen <chenliang6...@gmail.com> wrote:
> Hi
>
> For -- 4) Lazy decoding of the dictionary, I just tested 180 million rows
> of data with the script:
> "select province,sum(age),count(*) from presto_carbondata group by
> province order by province"
>
> The Spark integration module has "dictionary lazy decode"; Presto doesn't.
> The performance difference is 4.5 times, so "dictionary lazy decode" might
> help much to improve aggregation performance.
>
> The detailed test results are as below:
>
> *1. Presto+CarbonData is 9 seconds:*
> [...]
> Query 20170720_022833_4_c9ky2, FINISHED, 1 node
> Splits: 55 total, 55 done (100.00%)
> 0:09 [18M rows, 34.3MB] [1.92M rows/s, 3.65MB/s]
>
> *2. Spark+CarbonData is 2 seconds:*
> [...]
> 2109.346231ms
>
> --
> View this message in context: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Presto-CarbonData-optimization-work-discussion-tp18509p18522.html
> Sent from the Apache CarbonData Dev Mailing List archive mailing list
> archive at Nabble.com.

--
Thanks & Regards,
Ravi