[ https://issues.apache.org/jira/browse/PIG-1592?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12905220#action_12905220 ]
Dmitriy V. Ryaboy commented on PIG-1592:
----------------------------------------
One proposal is to simply change the default weighted range partitioner to take
into account the record size. If record size is uniform, or uniformly
distributed, or non-uniformly distributed but independent of the order key,
this change shouldn't materially affect the distributions created for data sets
not covered by this issue.
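
To make the proposal concrete, here is a rough sketch of choosing range
boundaries from a sample weighted by bytes rather than by record count. It is
not Pig's partitioner code: pick_cuts, bytes_per_partition and the sample data
are hypothetical, keys are assumed distinct, the sort is ascending for
simplicity, and the existing partitioner's handling of repeated keys is left
out.

```python
from bisect import bisect_left

def pick_cuts(sample, num_partitions, weight):
    """Return num_partitions - 1 cut keys so that each key range carries
    roughly the same total weight. sample holds (key, size_bytes) pairs."""
    ordered = sorted(sample, key=lambda rec: rec[0])
    total = sum(weight(rec) for rec in ordered)
    cuts, running = [], 0
    for rec in ordered:
        if len(cuts) == num_partitions - 1:
            break
        running += weight(rec)
        # Cut once the running weight reaches the next equal share of the total.
        if running >= total * (len(cuts) + 1) / num_partitions:
            cuts.append(rec[0])
    return cuts

def bytes_per_partition(records, cuts):
    """Sum record sizes per partition; a key equal to a cut goes to the lower one."""
    loads = [0] * (len(cuts) + 1)
    for key, size in records:
        loads[bisect_left(cuts, key)] += size
    return loads

# Hypothetical sample: record size grows with the order key.
sample = [(n, n * 100) for n in range(1, 9)]
by_count = pick_cuts(sample, 2, weight=lambda rec: 1)       # weight by record count
by_bytes = pick_cuts(sample, 2, weight=lambda rec: rec[1])  # weight by record size
print(by_count, bytes_per_partition(sample, by_count))  # [4] [1000, 2600]
print(by_bytes, bytes_per_partition(sample, by_bytes))  # [6] [2100, 1500]

# Uniform record sizes: both weightings choose the same cut.
uniform = [(n, 100) for n in range(1, 9)]
print(pick_cuts(uniform, 2, lambda rec: 1) == pick_cuts(uniform, 2, lambda rec: rec[1]))  # True
```

In this toy sample the byte-weighted cut moves from key 4 to key 6, which
narrows the gap between the two partitions from 1000 vs 2600 bytes to 2100 vs
1500 bytes. When every record has the same size, the cut does not move.
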
> ORDER BY distribution is uneven when record size is correlated with order key
> -----------------------------------------------------------------------------
>
> Key: PIG-1592
> URL: https://issues.apache.org/jira/browse/PIG-1592
> Project: Pig
> Issue Type: Improvement
> Reporter: Dmitriy V. Ryaboy
> Fix For: 0.9.0
>
>
> The partitioner contributed in PIG-545 distributes the order key space
> between partitions so that each partition gets approximately the same number
> of keys, even when the keys have a non-uniform distribution over the key
> space.
> Unfortunately this still allows for severe partition imbalance when record
> size is correlated with the order key. As a motivating example, consider
> this script, which attempts to produce a list of genera ordered by how many
> species each genus contains:
> {code}
> set default_parallel 60;
> critters = load 'biodata' as (genus, species);
> genus_counts = foreach (group critters by genus) generate group as genus,
> COUNT(critters) as num_species, critters;
> ordered_genuses = order genus_counts by num_species desc;
> store ordered_genuses....
> {code}
> The higher the value of num_species, the more species tuples the critters
> bag contains, and the wider the row. This can cause a severe processing
> imbalance: the partition holding the records with the highest values of
> num_species will have the same number of *records* as the partition holding
> the lowest values, but far more actual *bytes* to work on.