Hi Dev,

I have worked on the design document. Please find the link to the design document below and share your feedback.
https://drive.google.com/open?id=1lN06Pj5tBiBIPSxOBIjK9bpbFVhlUoQA

I have also raised a JIRA issue and uploaded the design document there. Please find the JIRA link below.
https://issues.apache.org/jira/browse/CARBONDATA-2638

Regards
Manish Gupta

On Sat, Jun 23, 2018 at 7:40 PM, manish gupta <tomanishgupt...@gmail.com> wrote:

> Thanks for the feedback, Jacky.
>
> As of now we have min/max at the block and blocklet levels, and while
> loading the metadata cache we compute the task-level min/max. Segment-level
> min/max is not considered as of now, but this solution can certainly be
> enhanced to consider segment-level min/max.
>
> We can discuss this further in detail and decide whether to consider it
> now or enhance it in the near future.
>
> Regards
> Manish Gupta
>
> On Fri, Jun 22, 2018 at 8:34 PM, Jacky Li <jacky.li...@qq.com> wrote:
>
>> Hi Manish,
>>
>> +1 for solution 1 for the next carbon version. Solution 2 should also
>> be considered, but for a future version after the next one.
>>
>> In my observation, in many scenarios users filter on a time range, and
>> since each Carbon segment corresponds to one incremental load, segments
>> normally correlate with time. So if we can have min/max for sort_columns
>> at the segment level, I think it will further help keep the driver index
>> minimal. Will you also consider this?
>>
>> Regards,
>> Jacky
>>
>>
>> > On Jun 21, 2018, at 5:24 PM, manish gupta <tomanishgupt...@gmail.com> wrote:
>> >
>> > Hi Dev,
>> >
>> > The current implementation of Blocklet DataMap caching in the driver
>> > caches the min and max values of all the columns in the schema by
>> > default.
>> >
>> > The problem with this implementation is that as the number of loads
>> > increases, the memory required to hold the min and max values also
>> > increases considerably. In most scenarios there is a single driver,
>> > and the memory configured for the driver is small compared to the
>> > executors. With a continuous increase in memory requirements, the
>> > driver can even go out of memory, which makes the situation worse.
>> >
>> > *Proposed solutions to the above problem:*
>> >
>> > CarbonData uses min and max values for blocklet-level pruning. The
>> > user may not have filters on all the columns in the schema; often
>> > only a few columns have filters applied on them in a query.
>> >
>> > 1. Provide the user an option to cache the min and max values of only
>> > the required columns. Caching only the required columns optimizes the
>> > Blocklet DataMap memory usage and solves the driver memory problem to
>> > a great extent.
>> >
>> > 2. Use external storage/DB to cache the min and max values. We can
>> > implement a solution that creates a table in an external DB and stores
>> > the min and max values of all the columns in that table. This would
>> > not use any driver memory, so driver memory usage would be optimized
>> > even further than with solution 1.
>> >
>> > *Solution 1* will not have any performance impact, since the user
>> > caches the required filter columns and query execution has no external
>> > dependency.
>> > *Solution 2* will degrade query performance, since it involves
>> > querying the external DB for the min and max values required for
>> > Blocklet pruning.
>> >
>> > *So from my point of view we should go with solution 1 and, in the
>> > near future, propose a design for solution 2. The user can then have
>> > an option to select between the two.* Kindly share your suggestions.
>> >
>> > Regards
>> > Manish Gupta
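
To make solution 1 concrete, here is a minimal Scala sketch of the idea: a driver-side index that retains min/max only for a user-configured set of columns and uses it to prune blocklets for an equality filter. All names here (MinMax, BlockletIndexEntry, BlockletDataMap, cachedColumns) are illustrative assumptions for this sketch, not CarbonData's actual classes or API.

// Minimal sketch of solution 1: keep min/max only for user-specified
// columns; a column that was not cached simply cannot be pruned on.
// All types and names are illustrative, not CarbonData's real API.

case class MinMax(min: Long, max: Long)

// Per-blocklet index entry: min/max retained only for cached columns.
case class BlockletIndexEntry(blockletId: Int, minMax: Map[String, MinMax])

class BlockletDataMap(cachedColumns: Set[String]) {
  private var entries = Vector.empty[BlockletIndexEntry]

  // While loading footer metadata, drop min/max of all uncached columns,
  // so driver memory scales with the configured columns, not the schema.
  def addBlocklet(id: Int, allColumnsMinMax: Map[String, MinMax]): Unit =
    entries :+= BlockletIndexEntry(id, allColumnsMinMax.filter {
      case (col, _) => cachedColumns.contains(col)
    })

  // Equality-filter pruning: a blocklet is kept unless its cached min/max
  // proves the value cannot be present. Option.forall returns true when
  // the column was not cached, so those blocklets are always scanned.
  def prune(filterCol: String, value: Long): Seq[Int] =
    entries.collect {
      case e if e.minMax.get(filterCol)
                 .forall(mm => value >= mm.min && value <= mm.max) =>
        e.blockletId
    }
}

object Demo extends App {
  val dm = new BlockletDataMap(cachedColumns = Set("ts"))
  dm.addBlocklet(0, Map("ts" -> MinMax(100L, 200L), "id" -> MinMax(1L, 9L)))
  dm.addBlocklet(1, Map("ts" -> MinMax(300L, 400L), "id" -> MinMax(10L, 20L)))
  println(dm.prune("ts", 150L)) // Vector(0): blocklet 1 pruned via cached min/max
  println(dm.prune("id", 5L))   // Vector(0, 1): "id" not cached, nothing pruned
}

The second prune call shows the trade-off solution 1 proposes: columns left out of the cache cost no driver memory but give up min/max pruning, so their blocklets are always scanned.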