[ 
https://issues.apache.org/jira/browse/HIVE-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903664#action_12903664
 ] 

Ning Zhang commented on HIVE-1602:
----------------------------------

@joydeep, this is intended to be an open-ended discussion about how to tackle 
partition skew. Combining small partitions into one large partition seems to 
be a natural way. Maybe the name "list partition" is not so obvious, but I 
meant mapping a list of values from the DP column to one partition rather than a 
1-to-1 mapping.
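
To make the list-of-values idea concrete, here is a minimal sketch in Python. 
It is not Hive code: the partition name "event_small", the helper names, and 
the fallback naming "event=<value>" are all assumptions for illustration.

```python
# Hypothetical list-partition spec: one partition name -> the DP column
# values it covers. Values not listed keep the current 1-to-1 behaviour.
LIST_PARTITIONS = {
    "event_small": ["s", "m"],
}


def build_lookup(list_partitions):
    """Invert the spec into value -> partition name, rejecting duplicates."""
    lookup = {}
    for name, values in list_partitions.items():
        for value in values:
            if value in lookup:
                raise ValueError(f"value {value!r} listed in two partitions")
            lookup[value] = name
    return lookup


def partition_for(value, lookup, column="event"):
    """Return the list partition for a value, else a 1-to-1 partition."""
    return lookup.get(value, f"{column}={value}")


lookup = build_lookup(LIST_PARTITIONS)
```

With this spec, partition_for("s", lookup) and partition_for("m", lookup) 
resolve to the same partition, while an unlisted value such as "g" still gets 
its own.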

HAR is one option and we can keep the partition spec as part of the file name 
so that the actual column is not stored. 

Another way is to store the partition column value in the data file itself if 
the partition corresponds to a list of values. 
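
A rough sketch of this second option, again in Python rather than Hive code: 
rows routed to a list partition keep the DP column, since the directory name 
alone no longer identifies the value, while rows in a 1-to-1 partition drop 
it. The function name, the row layout, and the partition names are 
hypothetical.

```python
def split_rows(rows, lookup, column="event"):
    """Group rows by target partition.

    lookup maps a DP column value to a list-partition name (an assumed
    structure). Rows in a list partition retain the DP column; rows in a
    1-to-1 partition drop it because the partition spec already encodes it.
    """
    out = {}
    for row in rows:
        value = row[column]
        if value in lookup:
            partition = lookup[value]
            stored = dict(row)  # keep the DP column in the data file
        else:
            partition = f"{column}={value}"
            stored = {k: v for k, v in row.items() if k != column}
        out.setdefault(partition, []).append(stored)
    return out


example_rows = [
    {"event": "s", "uid": 1},
    {"event": "m", "uid": 2},
    {"event": "g", "uid": 3},
]
example_lookup = {"s": "event_small", "m": "event_small"}
example_out = split_rows(example_rows, example_lookup)
```

The cost is some redundancy inside the combined partition, but a filter on 
the DP column can still be evaluated against the stored value.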

> the user can do a one time analysis of the data (for size distribution on 
> different partitioning columns) and then generate the clumping logic manually.

The problem is that there is no way for the user to manually cluster data 
with different partition column values. For example, if event is a DP column 
and you find a couple of large partitions event = {'l', 'g'}, and three small 
partitions event = {'s', 'm', 'l'}, how can the user manually cluster event=s, 
event=m, event=l into one? If there are a lot of these small partitions, they 
cause a lot of problems in HDFS, the metastore, and the Hive client side. 
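
For discussion, the clumping step itself could look something like the 
following Python sketch, assuming we already have a per-value size estimate. 
The greedy strategy, the size threshold, the sample sizes, and the generated 
partition names are my assumptions here, not an agreed design.

```python
def clump_partitions(sizes, target_bytes, column="event"):
    """Plan partitions from per-value sizes.

    Values at or above target_bytes keep a 1-to-1 partition. Smaller values
    are packed greedily, largest first, into list partitions that are closed
    once they reach target_bytes. Returns partition name -> list of values.
    """
    plan = {}
    bucket, bucket_bytes, bucket_id = [], 0, 0
    by_size = sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)
    for value, size in by_size:
        if size >= target_bytes:
            plan[f"{column}={value}"] = [value]
            continue
        bucket.append(value)
        bucket_bytes += size
        if bucket_bytes >= target_bytes:
            plan[f"{column}_list_{bucket_id}"] = bucket
            bucket, bucket_bytes, bucket_id = [], 0, bucket_id + 1
    if bucket:
        # Leftover small values form a final, possibly undersized, partition.
        plan[f"{column}_list_{bucket_id}"] = bucket
    return plan


# Made-up sizes: one large value and three small ones.
sample_plan = clump_partitions({"g": 900, "s": 10, "m": 20, "x": 5}, 100)
```

The point is that the output is a value-to-partition plan, which the user 
currently has no way to hand to a dynamic partition insert.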

> List Partitioning
> -----------------
>
>                 Key: HIVE-1602
>                 URL: https://issues.apache.org/jira/browse/HIVE-1602
>             Project: Hadoop Hive
>          Issue Type: New Feature
>    Affects Versions: 0.7.0
>            Reporter: Ning Zhang
>
> Dynamic partition inserts create partitions based on the dynamic partition 
> column values. Currently it creates one partition for each distinct DP column 
> value. This could result in skew in the created dynamic partitions in that 
> some partitions are large but there could be a large number of small partitions 
> as well. This burdens HDFS as well as the metastore. A list 
> partitioning scheme that aggregates a number of small partitions into one big 
> one is preferable for skewed partitions. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
