Hi Dev,

I have worked on the design document. Please find the link to the design document below and share your feedback.
https://drive.google.com/open?id=1lN06Pj5tBiBIPSxOBIjK9bpbFVhlUoQA

I have also raised a JIRA issue and uploaded the design document there. Please find the JIRA link below.
https://issues.apache.org/jira/browse/CARBONDATA-2638

Regards
Manish Gupta

On Sat, Jun 23, 2018 at 7:40 PM, manish gupta <tomanishgupt...@gmail.com> wrote:

> Thanks for the feedback, Jacky.
>
> As of now we have min/max at the block and blocklet levels, and while
> loading the metadata cache we compute the task-level min/max. Segment-level
> min/max is not considered as of now, but this solution can certainly be
> enhanced to consider segment-level min/max.
>
> We can discuss this further in detail and decide whether to consider it
> now or enhance it in the near future.
>
> Regards
> Manish Gupta
>
> On Fri, Jun 22, 2018 at 8:34 PM, Jacky Li <jacky.li...@qq.com> wrote:
>
>> Hi Manish,
>>
>> +1 for solution 1 for the next carbon version. Solution 2 should also
>> be considered, but for a future version after the next one.
>>
>> In my observation, in many scenarios users filter on a time range, and
>> since each Carbon segment corresponds to one incremental load, segments
>> normally correlate with time. So if we can have min/max for sort_columns
>> at the segment level, I think it will further help keep the driver index
>> minimal. Will you also consider this?
>>
>> Regards,
>> Jacky
>>
>>
>> > On Jun 21, 2018, at 5:24 PM, manish gupta <tomanishgupt...@gmail.com> wrote:
>> >
>> > Hi Dev,
>> >
>> > The current implementation of Blocklet DataMap caching in the driver
>> > caches the min and max values of all the columns in the schema by
>> > default.
>> >
>> > The problem with this implementation is that as the number of loads
>> > increases, the memory required to hold the min and max values also
>> > increases considerably. In most scenarios there is a single driver,
>> > and the memory configured for the driver is small compared to the
>> > executors. With a continuous increase in memory requirements, the
>> > driver can even go out of memory, which makes the situation worse.
>> >
>> > *Proposed solutions to the above problem:*
>> >
>> > CarbonData uses min and max values for blocklet-level pruning. The
>> > user may not have filters on all the columns in the schema; often
>> > only a few columns have filters applied on them in a query.
>> >
>> > 1. Provide the user an option to cache the min and max values of only
>> > the required columns. Caching only the required columns optimizes the
>> > Blocklet DataMap memory usage and solves the driver memory problem to
>> > a great extent.
>> >
>> > 2. Use external storage/DB to cache the min and max values. We can
>> > implement a solution that creates a table in an external DB and stores
>> > the min and max values of all the columns in that table. This would
>> > not use any driver memory, so driver memory usage would be optimized
>> > even further than with solution 1.
>> >
>> > *Solution 1* will not have any performance impact, since the user
>> > caches the required filter columns and query execution has no external
>> > dependency.
>> > *Solution 2* will degrade query performance, since it involves
>> > querying the external DB for the min and max values required for
>> > Blocklet pruning.
>> >
>> > *So from my point of view we should go with solution 1 and, in the
>> > near future, propose a design for solution 2. The user can then have
>> > an option to select between the two.* Kindly share your suggestions.
>> >
>> > Regards
>> > Manish Gupta
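
To make solution 1 concrete, here is a minimal Scala sketch of the idea: a driver-side index that retains min/max only for a user-configured set of columns and uses it to prune blocklets for an equality filter. All names here (MinMax, BlockletIndexEntry, BlockletDataMap, cachedColumns) are illustrative assumptions for this sketch, not CarbonData's actual classes or API.

// Minimal sketch of solution 1: keep min/max only for user-specified
// columns; a column that was not cached simply cannot be pruned on.
// All types and names are illustrative, not CarbonData's real API.

case class MinMax(min: Long, max: Long)

// Per-blocklet index entry: min/max retained only for cached columns.
case class BlockletIndexEntry(blockletId: Int, minMax: Map[String, MinMax])

class BlockletDataMap(cachedColumns: Set[String]) {
  private var entries = Vector.empty[BlockletIndexEntry]

  // While loading footer metadata, drop min/max of all uncached columns,
  // so driver memory scales with the configured columns, not the schema.
  def addBlocklet(id: Int, allColumnsMinMax: Map[String, MinMax]): Unit =
    entries :+= BlockletIndexEntry(id, allColumnsMinMax.filter {
      case (col, _) => cachedColumns.contains(col)
    })

  // Equality-filter pruning: a blocklet is kept unless its cached min/max
  // proves the value cannot be present. Option.forall returns true when
  // the column was not cached, so those blocklets are always scanned.
  def prune(filterCol: String, value: Long): Seq[Int] =
    entries.collect {
      case e if e.minMax.get(filterCol)
                 .forall(mm => value >= mm.min && value <= mm.max) =>
        e.blockletId
    }
}

object Demo extends App {
  val dm = new BlockletDataMap(cachedColumns = Set("ts"))
  dm.addBlocklet(0, Map("ts" -> MinMax(100L, 200L), "id" -> MinMax(1L, 9L)))
  dm.addBlocklet(1, Map("ts" -> MinMax(300L, 400L), "id" -> MinMax(10L, 20L)))
  println(dm.prune("ts", 150L)) // Vector(0): blocklet 1 pruned via cached min/max
  println(dm.prune("id", 5L))   // Vector(0, 1): "id" not cached, nothing pruned
}

The second prune call shows the trade-off solution 1 proposes: columns left out of the cache cost no driver memory but give up min/max pruning, so their blocklets are always scanned.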