Hi,

When querying data using Spark or Presto, CarbonData is not well
optimized for reading data and filling the vector. The major issues are as
follows.
1. CarbonData has a long method stack for reading data and filling it into
the vector.
2. There are many conditions and checks before the data is filled into the
vector.
3. Intermediate copies of the data are maintained, which increases CPU
utilization.
Because of the above issues, there is a high chance of CPU cache misses
during processing, which leads to poor performance.

So here I am proposing an optimization that fills the vector with a short
method stack, no condition checks inside the loops, and no intermediate
copies, so that the CPU cache is utilized better.

*Full Scan queries:*
  After decompressing a page in our V3 reader, we can immediately fill the
data into the vector without any condition checks inside the loops. The
complete column page data is set to the column vector in a single batch and
handed back to Spark/Presto, as in the sketch below.
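
For illustration only, here is a minimal Java sketch of what the single-batch
fill could look like on the Spark side, assuming the page is already decoded
into a primitive int array and using Spark's WritableColumnVector API
(fillIntPage is a hypothetical helper, not the actual CarbonData code):

import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector;
import org.apache.spark.sql.execution.vectorized.WritableColumnVector;
import org.apache.spark.sql.types.DataTypes;

public class DirectVectorFillSketch {

  // Hypothetical helper: writes a whole decoded page into the Spark vector in
  // one call, with no per-row condition checks and no intermediate copies.
  static void fillIntPage(int[] decodedPage, int pageSize, WritableColumnVector vector) {
    vector.putInts(0, pageSize, decodedPage, 0);   // single batch copy of the page
  }

  public static void main(String[] args) {
    int pageSize = 32000;                          // rows in one column page (example value)
    int[] decodedPage = new int[pageSize];         // page already decompressed and decoded
    for (int i = 0; i < pageSize; i++) {
      decodedPage[i] = i;
    }
    WritableColumnVector vector = new OnHeapColumnVector(pageSize, DataTypes.IntegerType);
    fillIntPage(decodedPage, pageSize, vector);
    System.out.println("first=" + vector.getInt(0) + ", last=" + vector.getInt(pageSize - 1));
  }
}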
*Filter Queries:*
  First, apply page-level pruning using the min/max of each page and get
the valid pages of the blocklet. Decompress only the valid pages and fill
the vector directly, as in the full scan scenario. A rough sketch of the
pruning step follows.
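
The sketch below only illustrates the min/max pruning idea; PageStats and
prunePages are made-up names, not CarbonData's actual classes:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PagePruningSketch {

  // Per-page min/max statistics (illustrative only).
  static class PageStats {
    final int pageId;
    final long min;
    final long max;
    PageStats(int pageId, long min, long max) {
      this.pageId = pageId;
      this.min = min;
      this.max = max;
    }
  }

  // Returns the pages of a blocklet that may contain rows in [lower, upper];
  // only these pages need to be decompressed and filled into the vector.
  static List<Integer> prunePages(List<PageStats> blockletPages, long lower, long upper) {
    List<Integer> validPages = new ArrayList<>();
    for (PageStats stats : blockletPages) {
      if (stats.max >= lower && stats.min <= upper) {
        validPages.add(stats.pageId);
      }
    }
    return validPages;
  }

  public static void main(String[] args) {
    List<PageStats> pages = Arrays.asList(
        new PageStats(0, 1, 100),
        new PageStats(1, 101, 200),
        new PageStats(2, 201, 300));
    // Only page 1 overlaps the range [150, 180]; pages 0 and 2 are skipped.
    System.out.println(prunePages(pages, 150, 180));
  }
}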

With this approach, we can also get the advantage of avoiding double
filtering in Spark/Presto: today they apply the filter again even though we
already return filtered data.
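
As a hypothetical illustration of returning already-filtered batches, the
sketch below copies only the rows selected by the filter into the output
vector; the helper and the row-id representation are assumptions, not the
actual reader code:

import org.apache.spark.sql.execution.vectorized.OnHeapColumnVector;
import org.apache.spark.sql.execution.vectorized.WritableColumnVector;
import org.apache.spark.sql.types.DataTypes;

public class FilteredVectorFillSketch {

  // Copies only the rows that passed the filter into the output vector, so the
  // batch handed to the engine is already filtered.
  static int fillSelectedRows(int[] decodedPage, int[] selectedRowIds, int selectedCount,
      WritableColumnVector vector) {
    for (int i = 0; i < selectedCount; i++) {
      vector.putInt(i, decodedPage[selectedRowIds[i]]);
    }
    return selectedCount;                          // rows returned to Spark/Presto
  }

  public static void main(String[] args) {
    int[] page = {10, 20, 30, 40, 50};
    int[] selected = {1, 3};                       // row ids that passed the filter
    WritableColumnVector vector = new OnHeapColumnVector(selected.length, DataTypes.IntegerType);
    int rows = fillSelectedRows(page, selected, selected.length, vector);
    System.out.println(rows + " rows: " + vector.getInt(0) + ", " + vector.getInt(1));
  }
}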

Please find the *TPCH performance report of the updated carbon* as per the
changes mentioned above. Please note that the changes are POC quality, so
it will take some time to stabilize them.

*Configurations*
Laptop with an i7 processor and 16 GB RAM
TPCH data scale: 100 GB
No-sort data with no inverted index
Total CarbonData size: 32 GB
Total Parquet size: 31 GB


Queries           Parquet  Carbon New  Carbon Old  Carbon Old vs Carbon New  Carbon New vs Parquet  Carbon Old vs Parquet
Q1                101      96          128         25.00%                    4.95%                  -26.73%
Q2                85       82          85          3.53%                     3.53%                  0.00%
Q3                118      112         135         17.04%                    5.08%                  -14.41%
Q4                473      424         486         12.76%                    10.36%                 -2.75%
Q5                228      201         205         1.95%                     11.84%                 10.09%
Q6                19.2     19.2        48          60.00%                    0.00%                  -150.00%
Q7                194      181         198         8.59%                     6.70%                  -2.06%
Q8                285      263         275         4.36%                     7.72%                  3.51%
Q9                362      345         363         4.96%                     4.70%                  -0.28%
Q10               101      92          93          1.08%                     8.91%                  7.92%
Q11               64       61          62          1.61%                     4.69%                  3.13%
Q12               41.4     44          63          30.16%                    -6.28%                 -52.17%
Q13               43.4     43.6        43.7        0.23%                     -0.46%                 -0.69%
Q14               36.9     31.5        41          23.17%                    14.63%                 -11.11%
Q15               70       59          80          26.25%                    15.71%                 -14.29%
Q16               64       60          64          6.25%                     6.25%                  0.00%
Q17               426      418         432         3.24%                     1.88%                  -1.41%
Q18               1015     921         1001        7.99%                     9.26%                  1.38%
Q19               62       53          59          10.17%                    14.52%                 4.84%
Q20               406      326         426         23.47%                    19.70%                 -4.93%
Full Scan Query*  140      116         164         29.27%                    17.14%                 -17.14%
*Full Scan Query means counting every column of lineitem; this way we can
check the full scan query performance. A query of that shape is sketched below.
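
For reference, such a query could look like the following Spark sketch; the
column names follow the standard TPCH lineitem schema and the table name is
assumed:

import org.apache.spark.sql.SparkSession;

public class FullScanQuerySketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("full-scan-check").getOrCreate();
    // Counting every column forces a read of all columns of lineitem.
    spark.sql(
        "SELECT count(l_orderkey), count(l_partkey), count(l_suppkey), "
      + "count(l_linenumber), count(l_quantity), count(l_extendedprice), "
      + "count(l_discount), count(l_tax), count(l_returnflag), count(l_linestatus), "
      + "count(l_shipdate), count(l_commitdate), count(l_receiptdate), "
      + "count(l_shipinstruct), count(l_shipmode), count(l_comment) "
      + "FROM lineitem").show();
    spark.stop();
  }
}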

The above optimization is not limited to the fileformat and Presto
integrations; it also improves the CarbonSession integration.
We can further optimize carbon with the tasks Vishal is already working on,
such as adaptive encoding for all types of columns and storing lengths and
values in separate pages for the string datatype (a rough sketch of that
layout follows). Please refer to
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discussion-Carbondata-Store-size-optimization-td62283.html
.
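
The sketch below only illustrates the separate length-page/value-page idea
for string columns; the layout details and names are assumptions, not the
actual design:

import java.nio.charset.StandardCharsets;

public class StringPageLayoutSketch {
  public static void main(String[] args) {
    String[] values = {"apple", "banana", "cherry"};

    // Length page: one entry per row, holding the byte length of each value.
    int[] lengthPage = new int[values.length];
    int totalBytes = 0;
    for (int i = 0; i < values.length; i++) {
      lengthPage[i] = values[i].getBytes(StandardCharsets.UTF_8).length;
      totalBytes += lengthPage[i];
    }

    // Value page: all string bytes stored back to back, with no per-row length prefix.
    byte[] valuePage = new byte[totalBytes];
    int offset = 0;
    for (String v : values) {
      byte[] bytes = v.getBytes(StandardCharsets.UTF_8);
      System.arraycopy(bytes, 0, valuePage, offset, bytes.length);
      offset += bytes.length;
    }

    // Reading row 1 ("banana"): the length page gives its offset and size.
    int start = lengthPage[0];
    System.out.println(new String(valuePage, start, lengthPage[1], StandardCharsets.UTF_8));
  }
}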

-- 
Thanks & Regards,
Ravi
