Re: Improving Non-dictionary storage & performance.

Jacky Li Wed, 01 Mar 2017 17:19:14 -0800

Hi Ravindra & Vishal,

Yes, I think this work needs to be done before switching to no-dictionary as 
the default. So for now, we should keep dictionary as the default. 
I think we can suggest that users load data as follows:
1. First load: use 2-pass mode. The first scan should discover the 
cardinality and check it against the user-specified options. We should define 
rules to pass or fail the validation, and finalize the load options for 
subsequent loads.
2. Subsequent loads: use single-pass mode with the options defined by the 
first load.
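
A rough sketch of the first-load check, just to make the flow concrete. This is not CarbonData code: the names (first_load, LoadOptions) and the cardinality threshold are hypothetical, and the real pass/fail rules are still to be defined.

```python
from dataclasses import dataclass

# Hypothetical limit; the actual validation rules have not been defined yet.
MAX_DICTIONARY_CARDINALITY = 1_000_000


@dataclass
class LoadOptions:
    """Options finalized by the first load and reused by later loads."""
    dictionary_columns: set
    no_dictionary_columns: set


def first_load(rows, columns, user_dictionary_columns,
               max_cardinality=MAX_DICTIONARY_CARDINALITY):
    """First scan of a 2-pass load: discover cardinality, validate, finalize."""
    # Scan once and collect the distinct values of every column.
    distinct = {c: set() for c in columns}
    for row in rows:
        for c in columns:
            distinct[c].add(row[c])

    dictionary, no_dictionary = set(), set()
    for c in columns:
        if c in user_dictionary_columns:
            # Fail validation when the user asked for dictionary encoding
            # on a column whose cardinality exceeds the limit.
            if len(distinct[c]) > max_cardinality:
                raise ValueError(
                    f"column {c!r}: cardinality {len(distinct[c])} "
                    f"exceeds {max_cardinality}"
                )
            dictionary.add(c)
        else:
            no_dictionary.add(c)
    return LoadOptions(dictionary, no_dictionary)
```

Subsequent single-pass loads would then take the returned LoadOptions as-is instead of scanning for cardinality again.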


What do you think?

Regards,
Jacky

> On 1 March 2017, at 11:31 PM, Ravindra Pesala <ravi.pes...@gmail.com> wrote:
> 
> Hi Vishal,
> 
> You are right; that's why we can do no-dictionary only for the String datatype.
> Please look at my first point. We can always use direct dictionary for
> data types like short, int, long, double & float in sort_columns.
> 
> Regards,
> Ravindra.
> 
> On 1 March 2017 at 18:18, Kumar Vishal <kumarvishal1...@gmail.com> wrote:
> 
>> Hi Ravi,
>> Sorting of data for no-dictionary columns should be based on data type, and
>> the same applies to filters. Please add this point.
>> 
>> -Regards
>> Kumar Vishal
>> 
>> On Wed, Mar 1, 2017 at 8:34 PM, Ravindra Pesala <ravi.pes...@gmail.com>
>> wrote:
>> 
>>> Hi,
>>> 
>>> In order to make non-dictionary column storage and performance more
>>> efficient, I am suggesting the following improvements.
>>> 
>>> 1. Always make SHORT, INT, BIGINT, DOUBLE & FLOAT direct dictionary.
>>>   Right now only date and timestamp are direct dictionary columns. We can
>>> make SHORT, INT, BIGINT, DOUBLE & FLOAT direct dictionary if these columns
>>> are included in SORT_COLUMNS.
>>> 
>>> 2. Consider delta/value compression while storing direct dictionary values.
>>> Right now it always uses the INT datatype to store direct dictionary values,
>>> so we can consider value/delta compression to compact the storage.
>>> 
>>> 3. Use a separator instead of LV format to store String values in
>>> no-dictionary format.
>>> Currently String datatypes for non-dictionary columns are stored in
>>> LV (length-value) format, where we always use a Short (2 bytes) as the length.
>>> To keep storage compact we can use a separator (a 0 byte), which takes
>>> just a single byte. While reading, we can traverse the data
>>> and get the offsets as we do now.
>>> 
>>> 4. Add range filters for no-dictionary columns.
>>> Currently range filters like greater-than/less-than are not implemented
>>> for no-dictionary columns. We should implement them to avoid row-level
>>> filtering and improve performance.
>>> 
>>> Regards,
>>> Ravindra.
>>> 
>> 
> 
> 
> -- 
> Thanks & Regards,
> Ravi
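
To make point 3 of the quoted proposal concrete, here is a minimal sketch comparing LV storage with 0-byte-separated storage. It is not CarbonData code. The function names are hypothetical, the big-endian 2-byte length is an assumption, and the sketch assumes UTF-8 values that never contain a 0 byte themselves.

```python
import struct


def encode_lv(values):
    """LV format: a 2-byte length (assumed big-endian) before each value."""
    out = bytearray()
    for v in values:
        data = v.encode("utf-8")
        out += struct.pack(">H", len(data)) + data
    return bytes(out)


def encode_separated(values):
    """Separator format: each value is followed by a single 0 byte."""
    out = bytearray()
    for v in values:
        data = v.encode("utf-8")
        if b"\x00" in data:
            # Assumption of this sketch: values never contain the separator.
            raise ValueError("value contains the separator byte")
        out += data + b"\x00"
    return bytes(out)


def decode_separated(buf):
    """Traverse the buffer once and split at each separator to find offsets."""
    values, start = [], 0
    for i, b in enumerate(buf):
        if b == 0:
            values.append(buf[start:i].decode("utf-8"))
            start = i + 1
    return values
```

With these two encoders the separator form saves one byte per value, since a 2-byte length is replaced by a 1-byte terminator.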
