[ https://issues.apache.org/jira/browse/HIVE-4244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616657#comment-13616657 ]
Owen O'Malley commented on HIVE-4244:
-------------------------------------
We should experiment with different values, but my guess is that the right
cutover point for the heuristic is a loading of 2 to 3 values per dictionary
entry (50% to 33% distinct values). We can't really know whether the heuristic
chose correctly unless we compare both encodings, which is much too expensive.
By making a good guess after looking at the start of the stripe, we can get
good performance most of the time.
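
To make the arithmetic concrete, here is a minimal Python sketch, not ORC
writer code. It assumes "loading" means the number of values per distinct
dictionary entry, which is the reading under which a loading of 2 corresponds
to 50% distinct values. Both function names are made up for illustration.

```python
def loading_from_counts(total_values, distinct_values):
    """Loading = values per distinct dictionary entry (assumed definition)."""
    return total_values / distinct_values


def distinct_fraction(loading):
    """Fraction of values that are distinct at a given loading."""
    return 1.0 / loading


# The proposed cutover range of 2 to 3, expressed as a distinct fraction.
for candidate in (2, 3):
    print(candidate, f"{distinct_fraction(candidate):.0%}")
```

So a column whose first 100,000 values contain 50,000 distinct strings has a
loading of 2, and one with about 33,333 distinct strings has a loading of 3.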
> Make string dictionaries adaptive in ORC
> ----------------------------------------
>
> Key: HIVE-4244
> URL: https://issues.apache.org/jira/browse/HIVE-4244
> Project: Hive
> Issue Type: Bug
> Components: Serializers/Deserializers
> Reporter: Owen O'Malley
> Assignee: Kevin Wilfong
>
> The ORC writer should adaptively switch between dictionary and direct
> encoding. I'd propose looking at the first 100,000 values in each column and
> decide whether there is sufficient loading in the dictionary to use
> dictionary encoding.
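
The proposal above could be sketched roughly as follows. This is a
hypothetical Python illustration rather than the ORC writer's implementation:
the function name, the "dictionary"/"direct" labels, and the min_loading
default of 2.0 are assumptions, and only the 100,000-value sample size comes
from the issue description.

```python
from itertools import islice

SAMPLE_SIZE = 100_000  # sample size proposed in the issue description


def choose_encoding(column_values, sample_size=SAMPLE_SIZE, min_loading=2.0):
    """Pick an encoding after inspecting only the leading values of a column.

    min_loading is an assumed, tunable threshold (values per distinct entry).
    """
    # Look only at the start of the column; later values are never examined.
    sample = list(islice(column_values, sample_size))
    if not sample:
        return "direct"
    distinct = len(set(sample))
    # Enough repetition in the sample: guess that a dictionary will pay off.
    if len(sample) / distinct >= min_loading:
        return "dictionary"
    return "direct"
```

Because the decision uses only the sample, it can be wrong for columns whose
value distribution changes later in the stripe; that is the cost of avoiding
a full comparison of both encodings.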