Re: CarbonData Performance Optimization
So excited. Good optimization.

--
Sent from: http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/
Re: CarbonData Performance Optimization

+1

Regards,
Manish Gupta

On Thu, 27 Sep 2018 at 11:36 AM, Kumar Vishal wrote:
> +1
>
> Regards,
> Kumar Vishal
>
> On Thu, Sep 27, 2018 at 8:57 AM, Jacky Li wrote:
>> +1
>>
>> On 21 Sep 2018, at 10:20 AM, Ravindra Pesala wrote:
>>>
>>> Hi,
>>>
>>> When querying data through Spark or Presto, CarbonData is not well
>>> optimized for reading data and filling the vectors. The major issues
>>> are as follows:
>>> 1. CarbonData has a long method stack for reading data and filling it
>>> into the vector.
>>> 2. Many conditions and checks are evaluated before filling the data
>>> into the vector.
>>> 3. Intermediate copies of the data are maintained, which raises CPU
>>> utilization.
>>> Because of these issues, there is a high chance of missing the CPU
>>> cache during processing, which leads to poor performance.
>>>
>>> So I am proposing an optimization that fills the vector with a short
>>> method stack, few condition checks, and no intermediate copies, so as
>>> to make better use of the CPU cache.
>>>
>>> *Full scan queries:*
>>> After decompressing a page in our V3 reader, we can immediately fill
>>> the data into a vector without any condition checks inside the loops.
>>> The complete column page data is set into the column vector in a
>>> single batch and handed back to Spark/Presto.
>>>
>>> *Filter queries:*
>>> First, apply page-level pruning using the min/max of each page to get
>>> the valid pages of the blocklet. Decompress only the valid pages and
>>> fill the vector directly, as in the full scan scenario.
>>>
>>> With this approach we also avoid filtering twice, since Spark/Presto
>>> apply the filter again even though we return already-filtered data.
>>>
>>> Please find the *TPCH performance report of the updated carbon* with
>>> the changes mentioned above. Please note that the changes are of POC
>>> quality, so it will take some time to stabilize them.
>>>
>>> *Configuration:*
>>> Laptop with an i7 processor and 16 GB RAM.
>>> TPCH data scale: 100 GB
>>> No-sort data with no inverted index.
>>> Total CarbonData size: 32 GB
>>> Total Parquet size: 31 GB
>>>
>>> Query             Parquet  Carbon New  Carbon Old  Old vs New  New vs Parquet  Old vs Parquet
>>> Q1                101      96          128         25.00%      4.95%           -26.73%
>>> Q2                85       82          85          3.53%       3.53%           0.00%
>>> Q3                118      112         135         17.04%      5.08%           -14.41%
>>> Q4                473      424         486         12.76%      10.36%          -2.75%
>>> Q5                228      201         205         1.95%       11.84%          10.09%
>>> Q6                19.2     19.2        48          60.00%      0.00%           -150.00%
>>> Q7                194      181         198         8.59%       6.70%           -2.06%
>>> Q8                285      263         275         4.36%       7.72%           3.51%
>>> Q9                362      345         363         4.96%       4.70%           -0.28%
>>> Q10               101      92          93          1.08%       8.91%           7.92%
>>> Q11               64       61          62          1.61%       4.69%           3.13%
>>> Q12               41.4     44          63          30.16%      -6.28%          -52.17%
>>> Q13               43.4     43.6        43.7        0.23%       -0.46%          -0.69%
>>> Q14               36.9     31.5        41          23.17%      14.63%          -11.11%
>>> Q15               70       59          80          26.25%      15.71%          -14.29%
>>> Q16               64       60          64          6.25%       6.25%           0.00%
>>> Q17               426      418         432         3.24%       1.88%           -1.41%
>>> Q18               1015     921         1001        7.99%       9.26%           1.38%
>>> Q19               62       53          59          10.17%      14.52%          4.84%
>>> Q20               406      326         426         23.47%      19.70%          -4.93%
>>> Full Scan Query*  140      116         164         29.27%      17.14%          -17.14%
>>>
>>> *Full Scan Query is a count of every column of lineitem; this is how
>>> we check full scan query performance.
>>>
>>> The above optimization is not limited to the file format and Presto
>>> integrations; it also improves the CarbonSession integration.
>>> We can further optimize carbon with additional tasks (Vishal is
>>> already working on them), such as adaptive encoding for all column
>>> types and, for the string datatype, storing lengths and values in
>>> separate pages. Please refer to
>>> http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Carbondata-Store-size-optimization-td62283.html
>>>
>>> --
>>> Thanks & Regards,
>>> Ravi
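The filter-query path in the proposal above (page-level min/max pruning followed by a single batch fill of the column vector, with no per-value condition checks inside the copy loop) can be sketched roughly as follows. This is an illustrative Python sketch of the idea only; the `Page` class and `scan` function are hypothetical and not CarbonData's actual classes or API:

```python
class Page:
    """A decoded column page with precomputed min/max statistics."""
    def __init__(self, values):
        self.values = values
        self.min = min(values)
        self.max = max(values)

def scan(pages, predicate_min, predicate_max):
    """Return one flat vector holding all values from pages whose
    [min, max] range overlaps the predicate range. Pages whose stats
    rule out any match are skipped without being read or copied."""
    vector = []
    for page in pages:
        # Page-level pruning: the page cannot contain a matching value.
        if page.max < predicate_min or page.min > predicate_max:
            continue
        # Batch fill: append the whole page in one call, no per-row checks.
        vector.extend(page.values)
    return vector

pages = [Page([1, 2, 3]), Page([10, 11, 12]), Page([20, 21, 22])]
# Only the middle page overlaps [9, 15]; the other two are pruned.
print(scan(pages, 9, 15))  # → [10, 11, 12]
```

The full-scan path is the degenerate case of the same loop with no pruning check at all: every page is filled into the vector in one batch.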
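The follow-up idea mentioned at the end of the proposal, storing string lengths and string values in separate pages, lets a reader walk the data with plain offset arithmetic instead of parsing length prefixes interleaved with the bytes. A hypothetical sketch of such a layout (the `encode`/`decode` helpers are illustrative, not CarbonData's on-disk format):

```python
def encode(strings):
    """Split strings into a lengths page and one concatenated data page."""
    data = b"".join(s.encode("utf-8") for s in strings)
    lengths = [len(s.encode("utf-8")) for s in strings]
    return lengths, data

def decode(lengths, data):
    """Rebuild the strings by walking the lengths page over the data page."""
    out, offset = [], 0
    for n in lengths:
        out.append(data[offset:offset + n].decode("utf-8"))
        offset += n
    return out

lengths, data = encode(["ship", "mail", "rail"])
print(lengths)                # → [4, 4, 4]
print(decode(lengths, data))  # → ['ship', 'mail', 'rail']
```

Keeping the small fixed-width lengths apart from the variable-width bytes also leaves each page more uniform, which is what makes adaptive encoding applicable to both.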
Re: CarbonData Performance Optimization

+1

Regards,
Kumar Vishal
Re: CarbonData Performance Optimization

+1
Re: CarbonData Performance Optimization

Hi,

+1, great proposal. Looking forward to seeing your pull request.

Regards,
Liang