Ravindra Pesala created CARBONDATA-726:
------------------------------------------

             Summary: Update with V3 format for better IO and processing 
optimization.
                 Key: CARBONDATA-726
                 URL: https://issues.apache.org/jira/browse/CARBONDATA-726
             Project: CarbonData
          Issue Type: Improvement
            Reporter: Ravindra Pesala


Problems in the current format:
1. IO reads are slow because reading column blocklets takes multiple seeks on 
the file. The current blocklet size is 120000, so scanning a column means 
reading from the file many times. We could increase the blocklet size instead, 
but filter queries would suffer because each filter would then have to process 
a big blocklet.
2. Decompression is slow. We use an inverted index for faster filter queries 
and compress it with NumberCompressor using bit-wise packing. Unpacking it is 
slow, so we should avoid NumberCompressor. One alternative is to keep the 
blocklet size within 32000 so that the inverted index can be written as short 
values, but then IO reads suffer a lot.
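
The arithmetic behind the short limit in point 2 can be sketched as follows. 
This is only an illustration, assuming the sizes above are row counts and that 
"short" means a signed 16-bit integer; bits_needed and the constant names are 
hypothetical and are not CarbonData code.

```python
from array import array


def bits_needed(row_count):
    """Bits required to address row ids 0 .. row_count - 1."""
    return max(1, (row_count - 1).bit_length())


# Assumption: both sizes are row counts, taken from the description above.
OLD_BLOCKLET_ROWS = 120000
PAGE_ROWS = 32000

old_bits = bits_needed(OLD_BLOCKLET_ROWS)  # wider than 16 bits
page_bits = bits_needed(PAGE_ROWS)         # fits in a signed 16-bit short

# An inverted index for one 32000-row page stored as plain shorts,
# with no bit packing step to undo at read time.
inverted_index = array("h", range(PAGE_ROWS))
index_bytes = inverted_index.itemsize * len(inverted_index)

print(old_bits, page_bits, index_bytes)
```

A 120000-entry index needs more than 16 bits per row id, which is why it is 
either stored in a wider type or bit packed, while row ids below 32000 all fit 
under the signed short maximum of 32767.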

To overcome the above two issues we are introducing a new format, V3.
Here each blocklet has multiple pages of size 32000, and the number of pages 
per blocklet is configurable. Since each page stays within the short limit, 
there is no need to compress the inverted index.
We also maintain the max/min for each page to further prune filter queries.
A blocklet is read together with all of its pages at once and kept in offheap 
memory.
During a filter, first check the max/min range of a page, and only if the 
filter value can fall within it decompress the page to filter further.
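
A minimal sketch of the page-level pruning described above, under stated 
assumptions: zlib stands in for whatever compressor the format uses, a Python 
list of dicts stands in for a blocklet held in offheap memory, and 
build_blocklet and filter_equals are hypothetical names. It is not the V3 
implementation, only an illustration of checking max/min before decompressing.

```python
import zlib
from array import array

PAGE_SIZE = 32000  # rows per page, from the description above


def build_blocklet(values, page_size=PAGE_SIZE):
    """Split one column into pages, keeping min/max beside each compressed page."""
    pages = []
    for start in range(0, len(values), page_size):
        chunk = array("q", values[start:start + page_size])
        pages.append({
            "min": min(chunk),
            "max": max(chunk),
            "rows": len(chunk),
            # zlib is an assumption; any page compressor would do here.
            "data": zlib.compress(chunk.tobytes()),
        })
    return pages


def filter_equals(pages, target):
    """Return (matching row ids, number of pages that had to be decompressed)."""
    matches = []
    decompressed = 0
    row_offset = 0
    for page in pages:
        # Cheap check first: skip the page if target is outside [min, max].
        if page["min"] <= target <= page["max"]:
            chunk = array("q")
            chunk.frombytes(zlib.decompress(page["data"]))
            decompressed += 1
            matches.extend(row_offset + i for i, v in enumerate(chunk) if v == target)
        row_offset += page["rows"]
    return matches, decompressed


values = list(range(100_000))      # sorted data, so page ranges do not overlap
pages = build_blocklet(values)     # the whole blocklet is held in memory at once
rows, touched = filter_equals(pages, 70_000)
print(len(pages), rows, touched)
```

With sorted data only one of the pages needs decompressing for an equality 
filter. With unsorted data the page ranges overlap more and fewer pages can be 
skipped.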




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
