[ https://issues.apache.org/jira/browse/PIG-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905220#action_12905220 ]
Dmitriy V. Ryaboy commented on PIG-1592: ---------------------------------------- One proposal is to simply change the default weighted range partitioner to take into account the record size. If record size is uniform, or uniformly distributed, or non-uniformly distributed but independent of the order key, this change shouldn't materially affect the distributions created for data sets not covered by this issue. > ORDER BY distribution is uneven when record size is correlated with order key > ----------------------------------------------------------------------------- > > Key: PIG-1592 > URL: https://issues.apache.org/jira/browse/PIG-1592 > Project: Pig > Issue Type: Improvement > Reporter: Dmitriy V. Ryaboy > Fix For: 0.9.0 > > > The partitioner contributed in PIG-545 distributes the order key space > between partitions so that each partition gets approximately the same number > of keys, even when the keys have a non-uniform distribution over the key > space. > Unfortunately this still allows for severe partition imbalance when record > size is correlated with the order key. By way of motivating example, consider > this script which attempts to produce a list of genuses based on how many > species each genus contains: > {code} > set default_parallel 60; > critters = load 'biodata'' as (genus, species); > genus_counts = foreach (group critters by genus) generate group as genus, > COUNT(critters) as num_species, critters; > ordered_genuses = order genus_counts by num_species desc; > store ordered_genuses.... > {code} > The higher the value of genus_counts, the more species tuples will be > contained in the critters bag, the wider the row. This can cause a severe > processing imbalance, as the partitioner processing the records with the > highest values of genus_counts will have the same number of *records* as the > partitioner processing the lowest number, but it will have far more actual > *bytes* to work on. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.