Re: CarbonData Performance Optimization

2018-09-27 Thread xm_zzc
So excited. Good optimization.



--
Sent from: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/


Re: CarbonData Performance Optimization

2018-09-27 Thread manish gupta
+1

Regards
Manish Gupta

On Thu, 27 Sep 2018 at 11:36 AM, Kumar Vishal wrote:

> +1
> Regards
> Kumar Vishal
>
> On Thu, Sep 27, 2018 at 8:57 AM Jacky Li  wrote:
>
> > +1
> >
> > > On 21 Sep 2018, at 10:20 AM, Ravindra Pesala wrote:
> > >
> > > Hi,
> > >
> > > When querying data using Spark or Presto, CarbonData is not well
> > > optimized for reading data and filling the vector. The major issues are as
> > > follows:
> > > 1. CarbonData has a long method stack for reading and filling the data
> > > into the vector.
> > > 2. Many conditions and checks are evaluated before filling the data into
> > > the vector.
> > > 3. Maintaining intermediate copies of the data increases CPU utilization.
> > > Because of the above issues, there is a high chance of missing the CPU
> > > cache while processing, which leads to poor performance.
> > >
> > > So here I am proposing an optimization that fills the vector without a
> > > deep method stack or condition checks, and with no intermediate copies,
> > > so that the CPU cache is used more effectively.
> > >
> > > *Full Scan queries:*
> > >  After decompressing a page in our V3 reader, we can immediately fill the
> > > data into the vector without any condition checks inside loops. The
> > > complete column page data is set into the column vector in a single batch
> > > and handed back to Spark/Presto.
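The batch fill described above can be sketched as follows. This is a minimal, self-contained illustration with hypothetical stand-in types (not CarbonData's or Spark's actual classes): once a page is decompressed, the whole page is handed to the vector in one bulk copy instead of a per-row loop of method calls and checks.

```java
import java.util.Arrays;

// Sketch of batch-filling a column vector from a decompressed page.
public class BatchVectorFill {

    // Hypothetical stand-in for a Spark/Presto writable column vector.
    static class IntColumnVector {
        final int[] data;
        IntColumnVector(int capacity) { data = new int[capacity]; }

        // Per-row fill: one method call per value (the slow path).
        void putInt(int rowId, int value) { data[rowId] = value; }

        // Batch fill: a single bulk copy for the whole page (the proposed path).
        void putInts(int rowId, int[] src, int srcIndex, int count) {
            System.arraycopy(src, srcIndex, data, rowId, count);
        }
    }

    // The whole decompressed page goes into the vector in one call,
    // with no condition checks inside a loop.
    static IntColumnVector fillFromPage(int[] decompressedPage) {
        IntColumnVector vector = new IntColumnVector(decompressedPage.length);
        vector.putInts(0, decompressedPage, 0, decompressedPage.length);
        return vector;
    }

    public static void main(String[] args) {
        int[] page = {7, 3, 9, 1};
        IntColumnVector vector = fillFromPage(page);
        System.out.println(Arrays.toString(vector.data)); // prints [7, 3, 9, 1]
    }
}
```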
> > > *Filter Queries:*
> > >  First, apply page-level pruning using the min/max of each page to get
> > > the valid pages of the blocklet. Decompress only the valid pages and fill
> > > the vector directly, as in the full scan scenario.
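The page-level pruning step can be sketched like this (hypothetical types and field names, not CarbonData's actual min/max index API): only pages whose [min, max] range can contain the filter value survive, and only those pages are then decompressed and batch-filled.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of min/max page pruning inside a blocklet.
public class PagePruning {

    // Hypothetical per-page statistics, as kept in the page-level index.
    static class PageStats {
        final int min, max;
        PageStats(int min, int max) { this.min = min; this.max = max; }
    }

    // Returns indexes of pages that may contain rows equal to filterValue;
    // pages whose [min, max] range excludes the value are skipped entirely.
    static List<Integer> prunePages(PageStats[] pages, int filterValue) {
        List<Integer> validPages = new ArrayList<>();
        for (int i = 0; i < pages.length; i++) {
            if (filterValue >= pages[i].min && filterValue <= pages[i].max) {
                validPages.add(i);
            }
        }
        return validPages;
    }

    public static void main(String[] args) {
        PageStats[] blocklet = {
            new PageStats(0, 99), new PageStats(100, 199), new PageStats(200, 299)
        };
        // Only the middle page can contain the value 150.
        System.out.println(prunePages(blocklet, 150)); // prints [1]
    }
}
```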
> > >
> > > With this method we also avoid filtering twice in Spark/Presto: they
> > > apply the filter again even though we already return filtered data.
> > >
> > > Please find the *TPCH performance report of updated carbon* as per the
> > > changes mentioned above. Please note that the changes are of POC quality,
> > > so it will take some time to stabilize them.
> > >
> > > *Configurations*
> > > Laptop with an i7 processor and 16 GB RAM.
> > > TPCH data scale: 100 GB
> > > No sort, with no inverted index data.
> > > Total CarbonData size: 32 GB
> > > Total Parquet size: 31 GB
> > >
> > >
> > > Query             Parquet  Carbon New  Carbon Old  Old vs New  New vs Parquet  Old vs Parquet
> > > Q1                  101       96         128         25.00%        4.95%        -26.73%
> > > Q2                   85       82          85          3.53%        3.53%          0.00%
> > > Q3                  118      112         135         17.04%        5.08%        -14.41%
> > > Q4                  473      424         486         12.76%       10.36%         -2.75%
> > > Q5                  228      201         205          1.95%       11.84%         10.09%
> > > Q6                 19.2     19.2          48         60.00%        0.00%       -150.00%
> > > Q7                  194      181         198          8.59%        6.70%         -2.06%
> > > Q8                  285      263         275          4.36%        7.72%          3.51%
> > > Q9                  362      345         363          4.96%        4.70%         -0.28%
> > > Q10                 101       92          93          1.08%        8.91%          7.92%
> > > Q11                  64       61          62          1.61%        4.69%          3.13%
> > > Q12                41.4       44          63         30.16%       -6.28%        -52.17%
> > > Q13                43.4     43.6        43.7          0.23%       -0.46%         -0.69%
> > > Q14                36.9     31.5          41         23.17%       14.63%        -11.11%
> > > Q15                  70       59          80         26.25%       15.71%        -14.29%
> > > Q16                  64       60          64          6.25%        6.25%          0.00%
> > > Q17                 426      418         432          3.24%        1.88%         -1.41%
> > > Q18                1015      921        1001          7.99%        9.26%          1.38%
> > > Q19                  62       53          59         10.17%       14.52%          4.84%
> > > Q20                 406      326         426         23.47%       19.70%         -4.93%
> > > Full Scan Query*    140      116         164         29.27%       17.14%        -17.14%
> > > * Full Scan Query means a count of every column of lineitem; this way we
> > > can check the full scan query performance.
> > >
> > > The above optimization is not limited to the file format and Presto
> > > integrations; it also improves the CarbonSession integration.
> > > We can further optimize carbon with additional tasks (Vishal is already
> > > working on them), such as adaptive encoding for all column types and
> > > storing lengths and values in separate pages for the string datatype.
> > > Please refer to:
> > > http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Carbondata-Store-size-optimization-td62283.html
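The separate length/value pages mentioned above can be sketched as follows. This is a hypothetical, self-contained illustration of the layout idea (not CarbonData's actual page format): instead of interleaving each string's length with its bytes, all lengths go into one page and all bytes into another, so each page can be encoded, compressed, and scanned more uniformly.

```java
import java.nio.charset.StandardCharsets;

// Sketch of splitting a string column chunk into a length page and a value page.
public class SeparateLengthValuePages {

    static class StringPage {
        final int[] lengths;  // length page: one entry per row
        final byte[] values;  // value page: all UTF-8 bytes back to back
        StringPage(int[] lengths, byte[] values) {
            this.lengths = lengths;
            this.values = values;
        }
    }

    // Build the two pages from a batch of string rows.
    static StringPage encode(String[] rows) {
        int[] lengths = new int[rows.length];
        int total = 0;
        for (int i = 0; i < rows.length; i++) {
            lengths[i] = rows[i].getBytes(StandardCharsets.UTF_8).length;
            total += lengths[i];
        }
        byte[] values = new byte[total];
        int offset = 0;
        for (String row : rows) {
            byte[] bytes = row.getBytes(StandardCharsets.UTF_8);
            System.arraycopy(bytes, 0, values, offset, bytes.length);
            offset += bytes.length;
        }
        return new StringPage(lengths, values);
    }

    // Read one row back by summing the preceding lengths to find its offset.
    static String decode(StringPage page, int rowId) {
        int offset = 0;
        for (int i = 0; i < rowId; i++) offset += page.lengths[i];
        return new String(page.values, offset, page.lengths[rowId], StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        StringPage page = encode(new String[]{"abc", "de"});
        System.out.println(decode(page, 1)); // prints de
    }
}
```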
> > >
> > > --
> > > Thanks & Regards,
> > > Ravi
> > >
> >
> >
> >
> >
>


Re: CarbonData Performance Optimization

2018-09-27 Thread Kumar Vishal
+1
Regards
Kumar Vishal



Re: CarbonData Performance Optimization

2018-09-26 Thread Jacky Li
+1


Re: CarbonData Performance Optimization

2018-09-24 Thread Liang Chen
Hi

+1, great proposal; really looking forward to your pull request.

Regards
Liang




