[ 
https://issues.apache.org/jira/browse/HIVE-4123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13710342#comment-13710342
 ] 

Prasanth J commented on HIVE-4123:
----------------------------------

This patch improves upon the existing run length encoding for integers. As 
mentioned in the description, it uses bit packing for tighter compression, 
adds improved run length and delta encoding, and supports longer runs. 

This patch supports the following lightweight compression techniques:

*SHORT_REPEAT*
*DIRECT*
*PATCHED_BASE*
*DELTA*


The description and format of each type are given below:

*SHORT_REPEAT:* Used for short repeated integer sequences.
* 1 byte header
** 2 bits for encoding type
** 3 bits for bytes required for repeating value
** 3 bits for repeat count (MIN_REPEAT + run length)
* Blob - repeat value (fixed bytes)
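
As a rough illustration of the SHORT_REPEAT layout above, here is a minimal 
Python sketch. The layout does not spell out the bit order, the tag value, or 
MIN_REPEAT, so the sketch assumes that the fields are packed most significant 
bit first in the order listed, that the encoding tag is 0, that MIN_REPEAT is 
3, that the width field stores (bytes - 1), and that the repeat value is 
written big-endian. It handles non-negative values only. The function names 
are hypothetical and are not taken from the patch.

```python
MIN_REPEAT = 3     # assumption: minimum run length, so the 3-bit field covers 3..10
SHORT_REPEAT = 0   # assumption: 2-bit encoding-type tag


def encode_short_repeat(value, count):
    """Encode `count` copies of a non-negative `value` as header + value bytes."""
    if value < 0:
        raise ValueError("this sketch handles non-negative values only")
    if not MIN_REPEAT <= count <= MIN_REPEAT + 7:
        raise ValueError("repeat count does not fit in 3 bits")
    width = max(1, (value.bit_length() + 7) // 8)  # bytes needed for the value
    if width > 8:
        raise ValueError("value needs more than 8 bytes")
    # Header byte: 2 bits type | 3 bits (width - 1) | 3 bits (count - MIN_REPEAT)
    header = (SHORT_REPEAT << 6) | ((width - 1) << 3) | (count - MIN_REPEAT)
    return bytes([header]) + value.to_bytes(width, "big")


def decode_short_repeat(buf):
    """Inverse of encode_short_repeat under the same assumptions."""
    header = buf[0]
    width = ((header >> 3) & 0x7) + 1
    count = (header & 0x7) + MIN_REPEAT
    value = int.from_bytes(buf[1:1 + width], "big")
    return [value] * count
```

Under these assumptions, five repeats of 10000 take three bytes: the header 
byte 0x0a followed by the value bytes 0x27 0x10.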

*DIRECT:* Used for random integer sequences whose bit width requirements do 
not vary much.
* 2 bytes header
** 1st byte
*** 2 bits for encoding type
*** 5 bits for fixed bit width of values in blob
*** 1 bit for storing MSB of run length
** 2nd byte
*** 8 bits for lower run length bits
* Blob - fixed width * run length bits long
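
The DIRECT layout can be sketched in the same way. The example below is only 
an illustration under stated assumptions: the encoding tag is taken to be 1, 
the 5-bit width field is assumed to store (bit width - 1), the 9-bit run 
length field is assumed to store (run length - 1), and values are packed most 
significant bit first. The actual patch may map widths and lengths 
differently. `pack_bits` and `encode_direct` are hypothetical names.

```python
DIRECT = 1  # assumption: 2-bit encoding-type tag


def pack_bits(values, width):
    """Pack each non-negative value into `width` bits, MSB first, zero padded."""
    acc = 0
    for v in values:
        if v < 0 or v >> width:
            raise ValueError("value does not fit in the fixed bit width")
        acc = (acc << width) | v
    total_bits = width * len(values)
    pad = (-total_bits) % 8  # pad the final byte with zero bits
    return (acc << pad).to_bytes((total_bits + pad) // 8, "big")


def encode_direct(values, width):
    """Build the 2-byte header followed by the bit-packed blob."""
    if not 1 <= width <= 32:
        raise ValueError("width must fit the assumed 5-bit (width - 1) field")
    if not 1 <= len(values) <= 512:
        raise ValueError("run length must fit the assumed 9-bit field")
    length_field = len(values) - 1
    # 1st byte: 2 bits type | 5 bits width field | 1 bit MSB of run length
    byte0 = (DIRECT << 6) | ((width - 1) << 1) | (length_field >> 8)
    # 2nd byte: lower 8 bits of run length
    byte1 = length_field & 0xFF
    return bytes([byte0, byte1]) + pack_bits(values, width)
```

With a 16-bit width, the four values 23713, 43806, 57005 and 48879 produce the 
header bytes 0x5e 0x03 followed by an 8-byte blob, which matches the "fixed 
width * run length bits" size given above (16 * 4 = 64 bits).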

*PATCHED_BASE:* Used for random integer sequences whose bit width requirements 
vary beyond a threshold.
* 4 bytes header
** 1st byte
*** 2 bits for encoding type
*** 5 bits for fixed bit width of values in blob
*** 1 bit for storing MSB of run length
** 2nd byte
*** 8 bits for lower run length bits
** 3rd byte
*** 3 bits for bytes required for base value
*** 5 bits for patch width
** 4th byte
*** 3 bits for patch gap width
*** 5 bits for patch length
* Base value - base width * 8 bits
* Data blob - fixed width * run length
* Patch blob - (patch width + patch gap width) * patch length
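
To make the 4-byte PATCHED_BASE header easier to follow, here is a small Python 
sketch that splits it into its raw fields and computes the payload size from 
the three formulas above. It assumes the fields are packed most significant bit 
first in the order listed. It returns the raw field values only, since the 
layout does not say how they map to actual widths and lengths. The names 
`PatchedBaseHeader`, `parse_patched_base_header` and `payload_bits` are 
hypothetical.

```python
from typing import NamedTuple


class PatchedBaseHeader(NamedTuple):
    encoding: int               # 2 bits
    width_field: int            # 5 bits, fixed bit width of values in blob
    run_length_field: int       # 9 bits (1 MSB bit + 8 lower bits)
    base_bytes_field: int       # 3 bits, bytes required for base value
    patch_width_field: int      # 5 bits
    patch_gap_width_field: int  # 3 bits
    patch_length_field: int     # 5 bits


def parse_patched_base_header(buf):
    """Split the first 4 bytes into raw header fields (no width mapping)."""
    if len(buf) < 4:
        raise ValueError("PATCHED_BASE header needs 4 bytes")
    b0, b1, b2, b3 = buf[:4]
    return PatchedBaseHeader(
        encoding=b0 >> 6,
        width_field=(b0 >> 1) & 0x1F,
        run_length_field=((b0 & 0x1) << 8) | b1,
        base_bytes_field=b2 >> 5,
        patch_width_field=b2 & 0x1F,
        patch_gap_width_field=b3 >> 5,
        patch_length_field=b3 & 0x1F,
    )


def payload_bits(fixed_width, run_length, base_bytes,
                 patch_width, patch_gap_width, patch_length):
    """Bits after the header, given actual (already decoded) widths and lengths."""
    base = base_bytes * 8
    data = fixed_width * run_length
    patch = (patch_width + patch_gap_width) * patch_length
    return base + data + patch
```

For example, a run of 20 values at 8 bits each, with a 2-byte base and a single 
patch entry of 12 + 2 bits, needs 16 + 160 + 14 = 190 bits after the header.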

*DELTA:* Used for monotonically increasing or decreasing sequences, sequences 
with fixed delta values or long repeated sequences.
* 2 bytes header
** 1st byte
*** 2 bits for encoding type
*** 5 bits for fixed bit width of values in blob
*** 1 bit for storing MSB of run length
** 2nd byte
*** 8 bits for lower run length bits
* Base value - encoded as varint
* Delta base (only long fixed delta runs) - zigzag encoded
* Delta blob (variable delta runs) - zigzag encoded
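
The DELTA layout relies on two standard primitives, zigzag encoding and 
varints. The Python sketch below shows both, assuming 64-bit signed values and 
the usual base-128 varint with the least significant group first. The helper 
`fixed_delta_payload` is hypothetical: it assumes that, for a fixed delta run, 
the base and the delta are each written as a zigzag varint after the header, 
which may differ from what the patch does for unsigned columns.

```python
def zigzag_encode(n):
    """Map signed to unsigned: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ..."""
    return (n << 1) ^ (n >> 63)  # assumes -2**63 <= n < 2**63


def zigzag_decode(z):
    """Inverse of zigzag_encode."""
    return (z >> 1) ^ -(z & 1)


def varint_encode(u):
    """Base-128 varint of a non-negative integer, low 7-bit group first."""
    if u < 0:
        raise ValueError("varint_encode expects a non-negative integer")
    out = bytearray()
    while True:
        group = u & 0x7F
        u >>= 7
        if u:
            out.append(group | 0x80)  # continuation bit: more groups follow
        else:
            out.append(group)
            return bytes(out)


def varint_decode(buf, pos=0):
    """Return (value, next position) for the varint starting at buf[pos]."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return value, pos
        shift += 7


def fixed_delta_payload(base, delta):
    """Hypothetical payload for a fixed delta run: zigzag varint base + delta."""
    return varint_encode(zigzag_encode(base)) + varint_encode(zigzag_encode(delta))
```

Zigzag keeps small negative deltas small (-1 becomes 1, -2 becomes 3), so a 
slowly decreasing sequence costs about as many bits as a slowly increasing one.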

I have compared the new implementation against the current implementation on 
various real-world datasets; the compression ratios are shown in the attached 
Excel sheet. As the comparison sheet shows, the new implementation gives a 
significant improvement in compression ratio over the existing implementation 
in most cases. 

NOTE: This patch is generated against trunk after applying the HIVE-4724 patch. 

[~owen.omalley] could you please review this patch and let me know your 
comments? Also let me know if I need to upload this patch to Phabricator.


                
> The RLE encoding for ORC can be improved
> ----------------------------------------
>
>                 Key: HIVE-4123
>                 URL: https://issues.apache.org/jira/browse/HIVE-4123
>             Project: Hive
>          Issue Type: New Feature
>          Components: File Formats
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>
> The run length encoding of integers can be improved:
> * tighter bit packing
> * allow delta encoding
> * allow longer runs
