[ https://issues.apache.org/jira/browse/PIG-1306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850942#action_12850942 ]

Yan Zhou commented on PIG-1306:
-------------------------------

Committed to the trunk and 0.7 branch.

> [zebra] Support of locally sorted input splits
> ----------------------------------------------
>
>                 Key: PIG-1306
>                 URL: https://issues.apache.org/jira/browse/PIG-1306
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Yan Zhou
>            Assignee: Yan Zhou
>             Fix For: 0.7.0
>
>         Attachments: PIG-1306.patch, PIG-1306.patch, PIG-1306.patch, PIG-1306.patch, PIG-1306.patch
>
>
> Currently, Zebra supports sorted or unsorted input splits on sorted tables or
> sorted table unions. The sorted input splits are based on key ranges that do
> not overlap. The splits are therefore globally sorted: each split is locally
> sorted, and the key ranges of different splits do not overlap.
> The biggest problem with key-range splits is the performance hit suffered
> when data skew is present, particularly when a key range contains only a
> single duplicated key. The chunk of data holding the duplicate keys is then
> virtually unsplittable regardless of how many mappers are available: it has
> to be processed by a single mapper.
> On the other hand, there are scenarios where globally sorted splits are
> overkill and locally sorted splits are good enough. One example is the use
> of Zebra sorted tables as the probe table in a map-side merge inner join.
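
The skew problem described in the issue can be illustrated with a small sketch. This is not Zebra code: the function names key_range_splits and locally_sorted_splits, the splitting heuristics, and the sample keys are all hypothetical, chosen only to show why a heavily duplicated key ends up in one split under non-overlapping key ranges, while row-count splits stay balanced and are still sorted internally.

```python
from itertools import groupby


def key_range_splits(sorted_keys, num_splits):
    """Hypothetical key-range splitter: key ranges never overlap,
    so all rows sharing a key must land in the same split."""
    target = max(1, len(sorted_keys) // num_splits)
    splits, current = [], []
    for _, group in groupby(sorted_keys):
        rows = list(group)
        # Close the current split only at a key boundary.
        if current and len(current) + len(rows) > target and len(splits) < num_splits - 1:
            splits.append(current)
            current = []
        current.extend(rows)
    if current:
        splits.append(current)
    return splits


def locally_sorted_splits(sorted_keys, num_splits):
    """Hypothetical row-count splitter: each split is sorted,
    but neighbouring splits may share a key."""
    size = max(1, -(-len(sorted_keys) // num_splits))  # ceiling division
    return [sorted_keys[i:i + size] for i in range(0, len(sorted_keys), size)]


# Sorted input with heavy skew on key 5 (illustrative data only).
keys = [1] + [5] * 9 + [7, 8]

print([len(s) for s in key_range_splits(keys, 4)])
print([len(s) for s in locally_sorted_splits(keys, 4)])
```

With this sample input, the key-range version places all nine rows for key 5 into a single split, whereas the row-count version yields four equal splits whose key ranges overlap at the boundaries.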