[jira] [Commented] (HIVE-4244) Make string dictionaries adaptive in ORC

2013-03-28 Thread Kevin Wilfong (JIRA)

[
https://issues.apache.org/jira/browse/HIVE-4244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616547#comment-13616547
]

Kevin Wilfong commented on HIVE-4244:
-

Some initial thoughts based on some experiments.

Dictionary encoding seems to be less effective than plain Zlib at compressing
values once the number of distinct values reaches ~80% of the total number of
values. This threshold could be made configurable. The dictionary is still
smaller in memory, so we may be able to get away with keeping it until the
stripe is written and writing the data out directly at that point. This should
be comparable in performance to the conversion of the dictionary index that is
already done.
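
As a rough illustration of the ratio check described above, here is a minimal
Python sketch. It is not the ORC writer's code: the function name and the
max_distinct_ratio parameter are hypothetical, and the 0.8 default simply
mirrors the ~80% figure from the experiments.

```python
def prefer_dictionary(values, max_distinct_ratio=0.8):
    """Return True if dictionary encoding looks worthwhile for these values.

    Hypothetical helper: max_distinct_ratio is an assumed, configurable
    cutoff that defaults to the ~80% observed in the experiments.
    """
    if not values:
        # Assumption: with nothing to encode, fall back to direct encoding.
        return False
    distinct_ratio = len(set(values)) / len(values)
    return distinct_ratio < max_distinct_ratio
```

A column with heavy repetition passes the check, while a column of mostly
unique strings does not.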

Also, if the uncompressed (but encoded) size of the dictionary + index (data
stream) is greater than the uncompressed size of the original data, the
compressed data tends to be larger as well, despite the sorting. This will
be more expensive to figure out, as we don't know the size of the index until
it has been run-length encoded.
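
The size comparison above could be approximated along these lines. This is
only a sketch under stated assumptions: strings are measured as UTF-8 bytes,
and each index entry is charged a fixed index_width bytes as a stand-in for
the run-length-encoded index, whose real size is not known until it has been
encoded. The function name and parameter are hypothetical.

```python
def encoded_sizes(values, index_width=4):
    """Return (direct_bytes, dictionary_bytes) before any Zlib compression.

    Assumptions: strings are UTF-8, and every index entry costs a fixed
    index_width bytes. The real index is run-length encoded, so this is
    only a rough stand-in for its size.
    """
    direct = sum(len(v.encode("utf-8")) for v in values)
    dictionary = sorted(set(values))  # sorted distinct values
    dict_bytes = sum(len(v.encode("utf-8")) for v in dictionary)
    index_bytes = index_width * len(values)  # one index entry per value
    return direct, dict_bytes + index_bytes
```

If the second number exceeds the first, the observation above suggests the
compressed dictionary form is likely to end up larger as well.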

Make string dictionaries adaptive in ORC

Key: HIVE-4244
URL: https://issues.apache.org/jira/browse/HIVE-4244
Project: Hive
Issue Type: Bug
Components: Serializers/Deserializers
Reporter: Owen O'Malley
Assignee: Kevin Wilfong

The ORC writer should adaptively switch between dictionary and direct
encoding. I'd propose looking at the first 100,000 values in each column and
deciding whether there is sufficient loading in the dictionary to use
dictionary encoding.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-4244) Make string dictionaries adaptive in ORC

2013-03-28 Thread Owen O'Malley (JIRA)


[ 
https://issues.apache.org/jira/browse/HIVE-4244?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13616657#comment-13616657
 ] 

Owen O'Malley commented on HIVE-4244:
-

We should play with different values, but I was guessing the right cutover 
point for the heuristic was at a loading of 2 to 3 (50% to 33% distinct values).

We aren't really going to know whether the heuristic is right or wrong unless 
we compare both encodings, which is much too expensive. By taking a good guess 
after looking at the start of the stripe, we can get good performance most of 
the time.
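
A minimal sketch of that guess, assuming "loading" means the number of values
seen divided by the number of distinct values. The function name, the
min_loading parameter, and the returned labels are hypothetical; the
100,000-value sample comes from the issue description, and the default
cutover of 2 is the low end of the 2 to 3 range suggested above.

```python
from itertools import islice


def choose_encoding(values, sample_size=100_000, min_loading=2.0):
    """Guess an encoding from the first sample_size values of a column.

    Hypothetical helper: loading = values seen / distinct values seen.
    A loading of 2 corresponds to 50% distinct values, 3 to about 33%.
    """
    sample = list(islice(values, sample_size))
    if not sample:
        # Assumption: an empty column defaults to direct encoding.
        return "direct"
    loading = len(sample) / len(set(sample))
    return "dictionary" if loading >= min_loading else "direct"
```

Only the prefix of the column is examined, so the decision costs one pass
over at most sample_size values rather than a comparison of both encodings.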

