[jira] Commented: (HIVE-1602) List Partitioning

Ning Zhang (JIRA) Fri, 27 Aug 2010 16:26:23 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903708#action_12903708
 ]


Ning Zhang commented on HIVE-1602:
----------------------------------

I agree this will be a big change, and we are just tossing ideas around here. We 
don't have a final plan yet. 

HAR is one idea, and we should definitely try it once HIVE-1467 is done. But as 
you said, it won't change the number of partitions. Some of our tables have 
more than 240 partitions per day, and with dynamic partitioning it is very 
easy to push that number even higher. 

Another idea Namit and I were talking about is to map a list of values such as 
{'s', 'm', 'l'} to the actual partition location and store this 
mapping in the metastore. This essentially separates the logical concept of a 
partition from its physical storage location (HDFS directories). This could be 
a big change and could break the assumptions of users who rely on the reverse 
mapping (figuring out the partition from the HDFS directory). 
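To make the mapping idea concrete, here is a minimal sketch in Python. It is 
not Hive or metastore code: the type alias, the function name, and the 
/warehouse/... paths are all hypothetical, and a plain dictionary stands in 
for whatever the metastore would actually store. 

```python
# Hypothetical sketch: none of these names or paths are Hive APIs.
from typing import Dict, FrozenSet

# Each list partition covers a set of partition-column values and has
# exactly one physical location (an HDFS directory in the real system).
ListPartitionMap = Dict[FrozenSet[str], str]


def build_value_index(mapping: ListPartitionMap) -> Dict[str, str]:
    """Flatten {value set -> location} into {value -> location}.

    Raises ValueError if a value is claimed by two list partitions,
    since a row must map to exactly one directory.
    """
    index: Dict[str, str] = {}
    for values, location in mapping.items():
        for value in values:
            if value in index:
                raise ValueError(f"value {value!r} appears in two partitions")
            index[value] = location
    return index


# Assumed example: three small values share one directory, 'xl' gets its own.
mapping: ListPartitionMap = {
    frozenset({"s", "m", "l"}): "/warehouse/t/list_part_0",
    frozenset({"xl"}): "/warehouse/t/list_part_1",
}
value_index = build_value_index(mapping)
```

Note that the directory name no longer encodes a single column value, which 
is exactly why the reverse lookup from directory to partition value stops 
working. 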

If we decide to go this route, inserting is easy: we get the mapping from the 
metastore and decide which directory to write to for each output row. Querying 
is a little more complicated, as the partition pruning phase needs to figure out 
which physical directory a partition corresponds to and get the partition column 
value from the data file itself rather than from the directory name. The 
overhead, of course, is that the partition column values need extra storage in 
the data file. But if we sort on the partitioning column and use RCFile with 
column-level run-length compression (which we already support), the 
storage overhead is very small. 
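The insert and query paths described above can be sketched roughly as follows. 
This is an illustration in Python under stated assumptions, not Hive code: 
VALUE_TO_DIR stands in for the mapping fetched from the metastore, rows are 
plain dictionaries, and every function name and path is hypothetical. 

```python
# Hypothetical sketch: VALUE_TO_DIR and all function names are assumptions.
from collections import defaultdict

# Stand-in for the value -> directory mapping fetched from the metastore.
VALUE_TO_DIR = {
    "s": "/warehouse/t/list_part_0",
    "m": "/warehouse/t/list_part_0",
    "l": "/warehouse/t/list_part_0",
    "xl": "/warehouse/t/list_part_1",
}


def route_rows(rows, part_col):
    """Insert side: pick a directory per row; the column value stays in the row."""
    out = defaultdict(list)
    for row in rows:
        out[VALUE_TO_DIR[row[part_col]]].append(row)
    return dict(out)


def prune(wanted_values):
    """Query side: resolve the requested values to the directories to scan."""
    return sorted({VALUE_TO_DIR[v] for v in wanted_values if v in VALUE_TO_DIR})


def scan(files, part_col, wanted_values):
    """Read pruned directories and filter on the value stored in the data,
    because one directory can now hold several partition-column values."""
    return [
        row
        for directory in prune(wanted_values)
        for row in files.get(directory, [])
        if row[part_col] in wanted_values
    ]


def run_length_encode(values):
    """Collapse consecutive repeats into (value, count) pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return [tuple(run) for run in runs]


rows = [{"size": "s", "id": 1}, {"size": "xl", "id": 2}, {"size": "m", "id": 3}]
files = route_rows(rows, "size")
hits = scan(files, "size", {"m"})
runs = run_length_encode(sorted(row["size"] for row in rows))
```

The last helper illustrates the storage argument: once the partition column 
is sorted, each distinct value collapses into a single run, so storing it in 
the data file costs roughly one entry per distinct value per file. 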

> List Partitioning
> -----------------
>
>                 Key: HIVE-1602
>                 URL: https://issues.apache.org/jira/browse/HIVE-1602
>             Project: Hadoop Hive
>          Issue Type: New Feature
>    Affects Versions: 0.7.0
>            Reporter: Ning Zhang
>
> Dynamic partition inserts create partitions based on the dynamic partition 
> column values. Currently it creates one partition for each distinct DP column 
> value. This could result in skew in the created dynamic partitions, in that 
> some partitions are large but there could be a large number of small partitions 
> as well. This puts a burden on HDFS as well as the metastore. A list 
> partitioning scheme that aggregates a number of small partitions into one big 
> one is preferable for skewed partitions. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
