[ 
https://issues.apache.org/jira/browse/PIG-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903423#action_12903423
 ] 

Yan Zhou commented on PIG-1518:
-------------------------------

Neither MergeJoinIndexer nor IndexableLoadFunc is combinable.

Regarding OrderedLoadFunc, the story is a bit more complex. First of all, its 
only non-overridden method, getSplitComparable, is used only in MergeJoinIndexer, 
which is already not combinable. 

The big issue is FileInputLoadFunc, which is extended by BinStorage, PigStorage 
and InterStorage. Semantically, I agree OrderedLoadFunc should not be 
combinable. However, FileInputLoadFunc's implementation of OrderedLoadFunc makes 
little sense in that its ordering is based on the (path, offset) pair. This is 
an ordering, but just an arbitrary one. Mathematically one can establish 
an arbitrary ordering over any discrete set of data. The point is how the 
ordering is used. For our purpose, the ordering should be related to some keys 
used in data manipulation, and (path, offset) does not serve that purpose. 
Or, implicitly, a FileInputLoadFunc still requires that the storage give out 
splits in some key ordering. If that storage ordering does not actually exist, 
FileInputLoadFunc as an OrderedLoadFunc will have no use for its "sortedness" 
because the ordering is just, well, arbitrary. The three extensions of 
FileInputLoadFunc work on generic data storage. Unless they work on sorted data 
in general, they should not be an OrderedLoadFunc.
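
To illustrate the point about (path, offset), here is a minimal sketch, not 
Pig code. PathOffset and its firstKey field are hypothetical stand-ins for a 
split position and the first record key found in that split. Sorting by 
(path, offset) gives a perfectly valid total order over the splits, yet the 
record keys come out in no particular order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a split position; this is NOT Pig's actual class.
final class PathOffset implements Comparable<PathOffset> {
    final String path;
    final long offset;
    final int firstKey; // first record key in the split (illustrative only)

    PathOffset(String path, long offset, int firstKey) {
        this.path = path;
        this.offset = offset;
        this.firstKey = firstKey;
    }

    // Orders by path first, then by offset. Record keys play no part.
    @Override
    public int compareTo(PathOffset other) {
        int byPath = path.compareTo(other.path);
        return byPath != 0 ? byPath : Long.compare(offset, other.offset);
    }
}

final class PathOffsetDemo {
    // Returns each split's first key after sorting the splits by (path, offset).
    static List<Integer> keysInSplitOrder(List<PathOffset> splits) {
        List<PathOffset> sorted = new ArrayList<>(splits);
        Collections.sort(sorted);
        List<Integer> keys = new ArrayList<>();
        for (PathOffset split : sorted) {
            keys.add(split.firstKey);
        }
        return keys;
    }

    // True if the keys never decrease from one element to the next.
    static boolean isNonDecreasing(List<Integer> keys) {
        for (int i = 1; i < keys.size(); i++) {
            if (keys.get(i - 1) > keys.get(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<PathOffset> splits = Arrays.asList(
            new PathOffset("part-00001", 0L, 10),
            new PathOffset("part-00000", 64L, 7),
            new PathOffset("part-00000", 0L, 42));
        List<Integer> keys = keysInSplitOrder(splits);
        System.out.println(keys + " ordered by key? " + isNonDecreasing(keys));
    }
}
```

The splits are well ordered by position, but unless the underlying files 
happen to be sorted on the key, that order says nothing about the keys.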

The other use of OrderedLoadFunc, apart from its non-overridden method 
getSplitComparable, is by map-side cogroup. But map-side cogroup does not check 
whether the sort key is the join key, which is critical for correctness. It 
also requires the loader to be a CollectableLoadFunc to work properly.

Since we do not want to break backward compatibility, and the only use of 
OrderedLoadFunc in Pig, except for MergeJoinIndexer which is already excluded 
from combining, is in map-side cogroup with CollectableLoadFunc, I mark 
"CollectableLoadFunc AND an OrderedLoadFunc" as non-combinable.

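The resulting rule can be sketched roughly as follows. This is only an 
illustration of the decision described above, not the code in the patch. All 
the "...Like" types, PlainOrderedLoader, CogroupLoader and isCombinable are 
hypothetical names standing in for the real Pig classes and for whatever 
check the patch actually performs.

```java
// Hypothetical marker types standing in for Pig's loader interfaces.
// They are NOT the real org.apache.pig types.
interface LoadFuncLike {}
interface IndexableLoadFuncLike extends LoadFuncLike {}
interface OrderedLoadFuncLike extends LoadFuncLike {}
interface CollectableLoadFuncLike extends LoadFuncLike {}

// Stand-in for MergeJoinIndexer, which is never combinable.
final class MergeJoinIndexerLike implements LoadFuncLike {}

// Stand-in for a loader that is ordered only by (path, offset),
// in the way the FileInputLoadFunc extensions are.
final class PlainOrderedLoader implements OrderedLoadFuncLike {}

// Stand-in for a loader usable in map-side cogroup.
final class CogroupLoader implements OrderedLoadFuncLike, CollectableLoadFuncLike {}

final class CombinableCheck {
    // Returns true if splits produced for this loader may be combined.
    static boolean isCombinable(LoadFuncLike loader) {
        if (loader instanceof MergeJoinIndexerLike) {
            return false;
        }
        if (loader instanceof IndexableLoadFuncLike) {
            return false;
        }
        // Only the combination used by map-side cogroup is excluded, so a
        // loader that is merely ordered stays combinable.
        if (loader instanceof CollectableLoadFuncLike
                && loader instanceof OrderedLoadFuncLike) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("plain ordered loader: " + isCombinable(new PlainOrderedLoader()));
        System.out.println("cogroup loader:       " + isCombinable(new CogroupLoader()));
        System.out.println("merge join indexer:   " + isCombinable(new MergeJoinIndexerLike()));
    }
}
```

Under this sketch a loader that is only an OrderedLoadFunc remains 
combinable, which is what keeps existing storage functions eligible for 
split combination.
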
In the future, we should really clean up the OrderedLoadFunc from 
FileInputLoadFunc and let the getSplitComparable method provide key-related 
info and not the (path, offset) pair. Backward compatibility may need to be 
addressed too. Only then will the water become clearer, and I will be OK with 
adjusting the noncombinable setting accordingly.

> multi file input format for loaders
> -----------------------------------
>
>                 Key: PIG-1518
>                 URL: https://issues.apache.org/jira/browse/PIG-1518
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Yan Zhou
>             Fix For: 0.8.0
>
>         Attachments: PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, 
> PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch, PIG-1518.patch
>
>
> We frequently run into the situation where Pig needs to deal with small files 
> in the input. In this case a separate map is created for each file, which 
> could be very inefficient. 
> It would be great to have an umbrella input format that can take multiple 
> files and use them in a single split. We would like to see this working with 
> different data formats if possible.
> There are already a couple of input formats doing a similar thing: 
> MultifileInputFormat as well as CombinedInputFormat; however, neither works 
> with the new Hadoop 20 API. 
> We at least want to do a feasibility study for Pig 0.8.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
