[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12897085#action_12897085 ]
Yan Zhou commented on PIG-1518:
-------------------------------

The pseudo code of the combination op is as follows:

    for each node of the nodes (sorted in the order of ascending sizes) {
      while the node's split list (sorted in the order of descending sizes) is not empty {
        find the biggest splits that can be combined with the first split of the list;
        if the accumulated split size is >= half of the limit {
          generate a combined split;
          remove the accumulated splits from the node's split list;
          clear the accumulated split list;
        } else {
          break;
        }
      }
    }

    // leftover combination
    for each node of the nodes {
      for each split of the node's split list {
        add the split to a leftover list;
      }
    }

    for each split in the leftover list {
      if accumulated split size is >= limit {
        generate a combined split;
        remove the accumulated splits from the node's split list;
        clear the accumulated split list;
      }
      if it is the last split in the leftover list {
        try to see if it can be added with an existing combined split;
        if not, generate a combined split on the accumulated splits;
      }
    }

The complexity is n*log(n), with n being the number of original splits that are smaller than the limit.

> multi file input format for loaders
> -----------------------------------
>
>                 Key: PIG-1518
>                 URL: https://issues.apache.org/jira/browse/PIG-1518
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>
> We frequently run into the situation where Pig needs to deal with small files
> in the input. In this case a separate map is created for each file, which
> could be very inefficient.
> It would be great to have an umbrella input format that can take multiple
> files and use them in a single split. We would like to see this working with
> different data formats if possible.
> There are already a couple of input formats doing a similar thing:
> MultifileInputFormat as well as CombinedInputFormat; however, neither works
> with the new Hadoop 20 API.
> We at least want to do a feasibility study for Pig 0.8.0.
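
For illustration, the pseudo code in the comment can be sketched in Python as below. This is only one reading of it, not the Pig implementation. Assumptions that are not in the comment: each split is a hypothetical (node, size) pair living on a single node, a node's "size" is the total size of its splits, "can be combined" means the accumulated size stays within the limit, and the leftover pass adds a split before checking the limit. The function name combine_splits is made up for the example, and the sketch does not try to reach the stated n*log(n) bound.

```python
from collections import defaultdict


def combine_splits(splits, limit):
    """Group small (node, size) splits into combined splits.

    Assumes every split is smaller than `limit` and lives on one node.
    Returns a list of groups, each group a list of (node, size) pairs.
    """
    by_node = defaultdict(list)
    for node, size in splits:
        by_node[node].append((node, size))

    def group_size(group):
        return sum(size for _, size in group)

    combined = []
    # Nodes in ascending order of total split size (assumed meaning of "size").
    nodes = sorted(by_node, key=lambda n: group_size(by_node[n]))
    for node in nodes:
        # The node's splits, largest first.
        pending = sorted(by_node[node], key=lambda s: s[1], reverse=True)
        while pending:
            group, total, rest = [pending[0]], pending[0][1], []
            # Greedily add the biggest remaining splits that still fit.
            for split in pending[1:]:
                if total + split[1] <= limit:
                    group.append(split)
                    total += split[1]
                else:
                    rest.append(split)
            if 2 * total >= limit:
                combined.append(group)
                pending = rest
            else:
                break  # too small to stand alone; leave for the leftover pass
        by_node[node] = pending

    # Leftover combination: pool what remains across all nodes.
    leftovers = [split for node in nodes for split in by_node[node]]
    group, total = [], 0
    for split in leftovers:
        group.append(split)
        total += split[1]
        if total >= limit:
            combined.append(group)
            group, total = [], 0
    if group:
        # Try to fold the final remainder into an existing combined split.
        for existing in combined:
            if group_size(existing) + total <= limit:
                existing.extend(group)
                break
        else:
            combined.append(group)
    return combined
```

Calling combine_splits([("a", 60), ("a", 50), ("a", 10), ("b", 20), ("c", 25)], 100) under these assumptions yields two combined splits: node "a" produces two local groups, and the small splits from "b" and "c" are folded into the one that still has room.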