[ 
https://issues.apache.org/jira/browse/HIVE-4244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616547#comment-13616547
 ] 

Kevin Wilfong commented on HIVE-4244:
-------------------------------------

Some initial thoughts based on a few experiments.

Dictionary encoding seems to be less effective than just Zlib at compressing 
values when the number of distinct values is > ~80% of the total number of 
values.  This threshold could be made configurable.  The dictionary is still 
smaller in memory, though, so we may be able to keep it while buffering and, 
when writing the stripe, write the data out directly instead.  This should be 
comparable in performance to the conversion of the dictionary index that is 
already done.
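
To make that first heuristic concrete, here is a rough sketch of the 
distinct-value-ratio check; the class, method, and constant names are made up 
for illustration and are not the actual ORC writer code:

public class DictionaryEncodingDecision {

  // Rough, configurable cutoff: above ~80% distinct values, direct encoding
  // plus Zlib tends to beat dictionary encoding.
  private static final double DISTINCT_RATIO_THRESHOLD = 0.8;

  // Returns true if dictionary encoding looks worthwhile for this column,
  // based on the fraction of distinct values seen so far.
  public static boolean useDictionary(long distinctValues, long totalValues) {
    if (totalValues == 0) {
      return true; // nothing written yet; keep the dictionary by default
    }
    return (double) distinctValues / totalValues <= DISTINCT_RATIO_THRESHOLD;
  }

  public static void main(String[] args) {
    System.out.println(useDictionary(90_000, 100_000)); // false -> direct encoding
    System.out.println(useDictionary(5_000, 100_000));  // true  -> dictionary encoding
  }
}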

Also, if the uncompressed (but encoded) size of the dictionary + index (data 
stream) is greater than the uncompressed size of the original data, the 
compressed data tends to be larger as well despite the sorting.  This will be 
more expensive to determine, since we don't know the size of the index until it 
has been run length encoded.
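
A rough sketch of that second check follows; the flat byte-per-run cost model 
for the index stream is purely illustrative and far cruder than ORC's real 
run length encoder:

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class EncodedSizeCheck {

  // Compares the uncompressed size of dictionary + index against the
  // uncompressed size of the original values.
  public static boolean dictionaryIsSmaller(List<String> values) {
    long originalBytes = 0;
    Map<String, Integer> dictionary = new HashMap<>();
    List<Integer> index = new ArrayList<>();

    for (String v : values) {
      originalBytes += v.length();          // cost of writing the value directly
      Integer id = dictionary.get(v);
      if (id == null) {
        id = dictionary.size();
        dictionary.put(v, id);
      }
      index.add(id);
    }

    long dictionaryBytes = 0;
    for (String key : dictionary.keySet()) {
      dictionaryBytes += key.length();
    }

    // The index size is only known after run-length encoding it, which is why
    // this check is more expensive than the distinct-value-ratio check.
    long indexBytes = rleEncodedLength(index);

    return dictionaryBytes + indexBytes < originalBytes;
  }

  // Toy cost model: each run of equal dictionary ids costs a flat 5 bytes.
  private static long rleEncodedLength(List<Integer> ids) {
    long runs = ids.isEmpty() ? 0 : 1;
    for (int i = 1; i < ids.size(); i++) {
      if (!ids.get(i).equals(ids.get(i - 1))) {
        runs++;
      }
    }
    return runs * 5;
  }

  public static void main(String[] args) {
    // Highly repetitive column -> dictionary + index is far smaller.
    System.out.println(dictionaryIsSmaller(
        Collections.nCopies(100_000, "some_repeated_value")));
  }
}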
                
> Make string dictionaries adaptive in ORC
> ----------------------------------------
>
>                 Key: HIVE-4244
>                 URL: https://issues.apache.org/jira/browse/HIVE-4244
>             Project: Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>            Reporter: Owen O'Malley
>            Assignee: Kevin Wilfong
>
> The ORC writer should adaptively switch between dictionary and direct 
> encoding. I'd propose looking at the first 100,000 values in each column and 
> decide whether there is sufficient loading in the dictionary to use 
> dictionary encoding.
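
As a minimal sketch of the adaptive switch proposed above, assuming a sample of 
the first 100,000 values and a configurable distinct-value ratio (the class and 
field names are hypothetical, not the actual ORC writer API):

import java.util.HashSet;
import java.util.Set;

public class AdaptiveStringColumnWriter {

  private static final int SAMPLE_SIZE = 100_000;
  private static final double DISTINCT_RATIO_THRESHOLD = 0.8;

  private final Set<String> distinct = new HashSet<>();
  private long rowCount = 0;
  private Boolean useDictionary = null;   // undecided until the sample is full

  public void write(String value) {
    rowCount++;
    if (useDictionary == null) {
      distinct.add(value);
      if (rowCount >= SAMPLE_SIZE) {
        useDictionary = (double) distinct.size() / rowCount <= DISTINCT_RATIO_THRESHOLD;
        // A real writer would now either keep building the dictionary or flush
        // the buffered values with direct encoding.
      }
    }
    // ... append the value to whichever stream was chosen ...
  }
}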

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira