[ https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903423#action_12903423 ]
Yan Zhou commented on PIG-1518: ------------------------------- MergeJoinIndexer and IndexableLoadFunc are both not combinable. Regarding orderedLoadFunc, the story is a bit more complex. First of all, it's only non-overriden method, getSplitComparable, is only used in MergeJoinIndexer which is already not combinable. The big issue is FileInputLoadFunc which is extended by BinStorage, PigStorage and InterStorage. Semantically, I agree OrderedLoadFunc should not be combinable. However, FileInputFormat's implementation of OrderedLoadFunc makes little sense in that its ordering is based on the (path, offset) pair. This is an ordering but just an arbitrary ordering. Mathematically one can establish any arbitrary ordering over a discrete set of data. But the point is how is the ordering used. For our purpose, the ordering should be related to some keys used in data manipulation for which (path, offset) does not serve the purpose. Or implicitly a FileInputLoadFunc still requires the storage gives out splits in some key ordering. If that storage ordering does not actually exist, FileInputLoadFunc as an OrderedLoadFunc will have no use of its "sortness" because the ordering is just, well, arbitray. The three extensions of FileInputLoadFunc work on generic data storage. Unless they work on sorted data in general, they should not be an OrderedLoadFunc. The other use of OrderedLoadFunc, not its non-overriden method, getSplitComparable, is by map-side cogroup. But it does not check if the sort key is the join key which is critical for correctness. It also requires to be a CollectableLoadFunc to work properly. Since we do not want to break backward compatibility, and the only use of OrderLoadFunc in Pig, except for MergeJinIndexer which is already excluded from combining, is in map side cogroup with CollectableLoadFunc, I mark "CollectableLoadFunc AND an OrderedLoadFunc" as non-combinable. In the future, we should really clean up the the OrderedLoadFunc from FileInputLoadFunc and let the getSplitComparable method provide key-related info and not the (path, offset) pair. Backward compatibility may need to be addressed too. Only then will the water become clearer and I be ok to adjust the noncombinable setting accordingly. > multi file input format for loaders > ----------------------------------- > > Key: PIG-1518 > URL: https://issues.apache.org/jira/browse/PIG-1518 > Project: Pig > Issue Type: Improvement > Reporter: Olga Natkovich > Assignee: Yan Zhou > Fix For: 0.8.0 > > Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, > PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch > > > We frequently run in the situation where Pig needs to deal with small files > in the input. In this case a separate map is created for each file which > could be very inefficient. > It would be greate to have an umbrella input format that can take multiple > files and use them in a single split. We would like to see this working with > different data formats if possible. > There are already a couple of input formats doing similar thing: > MultifileInputFormat as well as CombinedInputFormat; howevere, neither works > with ne Hadoop 20 API. > We at least want to do a feasibility study for Pig 0.8.0. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.