[ 
https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903895#action_12903895
 ] 

Thejas M Nair commented on PIG-1458:
------------------------------------

Another comment about the patch -
- The test testUnknownNumMaps2 is the same as testUnknownNumMaps; it should be 
removed.


A note about the 2nd case described in the first comment -
bq. 2.  The right input is a map-only job and input files do not exist at the 
compile time.

When the input of the map-only job does not exist at compile time, in most 
(if not all) cases it would be possible to determine the number of files by 
looking at the previous MR operator (or the ones before it).
Also, with the current implementation, since the checks for the number of 
files are done before the MR jobs are merged together, there will be cases 
where the final plan has only one MR job, with existing input for the 
replicated input, and Pig still treats it as case 2.
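
To make the first point concrete, here is a rough sketch of walking back through the preceding MR operators to estimate the file count. It is not Pig code: the MROper class, its fields, and estimate_output_files are hypothetical, and it assumes that a map-only job writes one file per map and that the number of maps is roughly the number of input files.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MROper:
    """Hypothetical stand-in for one MR operator in a compiled plan."""
    name: str
    map_only: bool
    # Number of input files if they exist at compile time, else None.
    known_input_files: Optional[int] = None
    predecessors: List["MROper"] = field(default_factory=list)


def estimate_output_files(op: MROper) -> Optional[int]:
    """Estimate how many files `op` writes; None if it cannot be determined.

    Simplifying assumption: a map-only job writes one file per map, and the
    number of maps is approximated by the number of input files.
    """
    if not op.map_only:
        # Output count would depend on the number of reducers; not modeled.
        return None
    if op.known_input_files is not None:
        return op.known_input_files
    if not op.predecessors:
        return None
    # The input of `op` is the output of its predecessors, so walk back.
    total = 0
    for pred in op.predecessors:
        n = estimate_output_files(pred)
        if n is None:
            return None
        total += n
    return total


load_job = MROper("load", map_only=True, known_input_files=12)
filter_job = MROper("filter", map_only=True, predecessors=[load_job])
estimate = estimate_output_files(filter_job)
```

In this toy plan the input of filter_job does not exist yet, but the estimate can still be derived from load_job, whose input does exist at compile time.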

The example used in testUnknownNumMaps() has only one input MR job, with 
inputs that exist at compile time, but if 
pig.frjoin.merge.files.optimistic=false, Pig will create an additional MR job 
that combines the input -
{code}
A = LOAD '" + INPUT_FILE + "' as (x:int,y:int);
B = Filter A by x < 50;
C = join A by $0, B by $0 using 'repl';
{code}


> aggregate files for replicated join
> -----------------------------------
>
>                 Key: PIG-1458
>                 URL: https://issues.apache.org/jira/browse/PIG-1458
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Richard Ding
>             Fix For: 0.8.0
>
>         Attachments: PIG-1458.patch
>
>
> We have noticed that if the smaller data in a replicated join has many files, 
> this puts an unneeded burden on the name node. Pre-aggregating the files can 
> improve the situation.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
