[ https://issues.apache.org/jira/browse/HIVE-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160726#comment-13160726 ]

alex gemini commented on HIVE-2097:
-----------------------------------

Another issue is efficient serialization/deserialization. For the same example 
above, assume each gender, age, and region value has 100 messages, stored evenly 
in one DFS block. In the gender column we store values like this: 
{'male'}[1-60k] {'female'}[60k+1 - 120k]; the age column looks like this: 
{21}[1-3k] {22}[3k+1 - 6k] {23}[6k+1 - 9k]; and the region column looks like 
this: {'LA'}[1-300] {'NY'}[301-600].
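
To make that layout concrete, here is a minimal sketch of one possible in-memory representation, where each column is a list of {value}[firstRow-lastRow] runs. The ValueRun class and the row numbers are just taken from the example above for illustration; this is not the actual RCFile format.

{code:java}
import java.util.List;

// Hypothetical illustration of the {value}[start-end] layout described above;
// not the actual RCFile on-disk format.
public class RunLayoutExample {

    /** One run: a distinct value and the contiguous row range it covers. */
    static final class ValueRun {
        final String value;
        final long firstRow;   // 1-based, inclusive
        final long lastRow;    // inclusive

        ValueRun(String value, long firstRow, long lastRow) {
            this.value = value;
            this.firstRow = firstRow;
            this.lastRow = lastRow;
        }

        @Override
        public String toString() {
            return "{" + value + "}[" + firstRow + "-" + lastRow + "]";
        }
    }

    public static void main(String[] args) {
        // gender column: {'male'}[1-60k] {'female'}[60k+1 - 120k]
        List<ValueRun> gender = List.of(
            new ValueRun("male", 1, 60_000),
            new ValueRun("female", 60_001, 120_000));

        // age column: {21}[1-3k] {22}[3k+1 - 6k] {23}[6k+1 - 9k]
        List<ValueRun> age = List.of(
            new ValueRun("21", 1, 3_000),
            new ValueRun("22", 3_001, 6_000),
            new ValueRun("23", 6_001, 9_000));

        // region column: {'LA'}[1-300] {'NY'}[301-600]
        List<ValueRun> region = List.of(
            new ValueRun("LA", 1, 300),
            new ValueRun("NY", 301, 600));

        System.out.println(gender);
        System.out.println(age);
        System.out.println(region);
    }
}
{code}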
When we issue a query on a single table, such as: select sum(age) from logs 
where region='LA' and age=30, we count every column referenced in the 
select/where/group clauses, so we know the last column has the lowest 
selectivity (in this example, region). We find the region values 
{'LA'}[(1-300), (30k+1 - 30k+300), (60k+1 - 60k+300), ...] and 
{'NY'}[(301-600), (30k+301 - 30k+600), (60k+301 - 60k+600), ...]. 
We only need to deserialize it; we don't need to decompress it, because we 
know the lowest-selectivity column. We then organize the InputSplit's key as 
{[age='21'][region='LA']} and its value as {(1-300), (30k+1 - 30k+300), 
(60k+1 - 60k+300), ...}. This InputSplit key/value pair is unique per DFS 
block because we already sort the columns by selectivity, so the 
lowest-selectivity column referenced in (select, where, group) must be unique.
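
As a rough illustration of the range bookkeeping only, the sketch below takes the row ranges where region='LA' inside one DFS block and intersects them with one run of the age column to form the {[age='21'][region='LA']} key and its range list. The class, method names and split-key string are assumptions for illustration, not Hive or RCFile APIs.

{code:java}
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of building the split key/value from row ranges;
// not actual Hive/RCFile code.
public class SplitKeyExample {

    /** Intersect a list of [first,last] row ranges with a single [first,last] run. */
    static List<long[]> intersect(List<long[]> ranges, long first, long last) {
        List<long[]> out = new ArrayList<>();
        for (long[] r : ranges) {
            long lo = Math.max(r[0], first);
            long hi = Math.min(r[1], last);
            if (lo <= hi) {
                out.add(new long[] {lo, hi});
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Row ranges where region='LA' inside one DFS block (from the example above).
        List<long[]> laRanges = List.of(
            new long[] {1, 300},
            new long[] {30_001, 30_300},
            new long[] {60_001, 60_300});

        // One run of the age column: age=21 covers rows 1-3000.
        long ageFirst = 1, ageLast = 3_000;

        // Value part of the hypothetical InputSplit entry: ranges matching both columns.
        List<long[]> value = intersect(laRanges, ageFirst, ageLast);

        // Key part: unique per DFS block in this scheme.
        StringBuilder sb = new StringBuilder("{[age='21'][region='LA']} -> ");
        for (long[] r : value) {
            sb.append("(").append(r[0]).append("-").append(r[1]).append(") ");
        }
        System.out.println(sb);   // {[age='21'][region='LA']} -> (1-300)
    }
}
{code}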
                
> Explore mechanisms for better compression with RC Files
> -------------------------------------------------------
>
>                 Key: HIVE-2097
>                 URL: https://issues.apache.org/jira/browse/HIVE-2097
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor, Serializers/Deserializers
>            Reporter: Krishna Kumar
>            Assignee: Krishna Kumar
>            Priority: Minor
>
> Optimization of the compression mechanisms used by RC File to be explored.
> Some initial ideas
>  
> 1. More efficient serialization/deserialization based on type-specific and 
> storage-specific knowledge.
>  
>    For instance, storing sorted numeric values efficiently using some delta 
> coding techniques
> 2. More efficient compression based on type-specific and storage-specific 
> knowledge
>    Enable compression codecs to be specified based on types or individual 
> columns
> 3. Reordering the on-disk storage for better compression efficiency.
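
For the delta coding mentioned in item 1 of the description above, a minimal sketch could store a sorted numeric column as a base value followed by non-negative gaps, which are usually small and compress well. The scheme and names below are only one plausible illustration, not what RCFile actually does; a real implementation would also apply a variable-length or bit-packed encoding to the gaps.

{code:java}
// Hypothetical sketch of delta coding a sorted numeric column (idea 1 above).
public class DeltaCodingExample {

    /** Encode sorted values as [base, gap1, gap2, ...]; gaps of a sorted column are >= 0. */
    static long[] deltaEncode(long[] sorted) {
        long[] out = new long[sorted.length];
        if (sorted.length == 0) {
            return out;
        }
        out[0] = sorted[0];
        for (int i = 1; i < sorted.length; i++) {
            out[i] = sorted[i] - sorted[i - 1];
        }
        return out;
    }

    /** Reverse the encoding by accumulating the gaps. */
    static long[] deltaDecode(long[] encoded) {
        long[] out = new long[encoded.length];
        long running = 0;
        for (int i = 0; i < encoded.length; i++) {
            running += encoded[i];
            out[i] = running;
        }
        return out;
    }

    public static void main(String[] args) {
        long[] ages = {21, 21, 22, 23, 25, 30, 30};                  // sorted column values
        long[] encoded = deltaEncode(ages);                          // {21, 0, 1, 1, 2, 5, 0}
        long[] decoded = deltaDecode(encoded);                       // round-trips to the original
        System.out.println(java.util.Arrays.equals(ages, decoded));  // true
    }
}
{code}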


        
