[
https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903895#action_12903895
]
Thejas M Nair commented on PIG-1458:
------------------------------------
Another comment about the patch:
- The test testUnknownNumMaps2 is the same as testUnknownNumMaps; it should be
removed.
A note about the 2nd case described in the first comment:
bq. 2. The right input is a map-only job and input files do not exist at the
compile time.
When the input for the map-only job does not exist at compile time, in most
(possibly all) cases it should be possible to determine the number of files by
looking at the previous MR operator (or the ones before it).
Also, with the current implementation, the checks on the number of files are
done before the MR jobs are merged together, so there will be cases where the
final plan has only one MR job, with existing input for the replicated input,
and Pig still treats it as case 2.
The example used in testUnknownNumMaps() has only one input MR job, with inputs
that exist at compile time, but if pig.frjoin.merge.files.optimistic=false, Pig
will create an additional MR job that combines the input:
{code}
A = LOAD '" + INPUT_FILE + "' as (x:int,y:int);
B = Filter A by x < 50;
C = join A by $0, B by $0 using 'repl';
{code}
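The suggestion above, of finding the file count for a map-only input whose files do not exist yet by looking at earlier MR operators, could be sketched roughly as follows. This is only an illustration, not code from the patch: the class FileCountSketch, the type MROp, and the method estimateOutputFiles are hypothetical names, and the sketch assumes one map task (and so one output file) per input file for a map-only job, and one output file per reducer otherwise.

```java
import java.util.ArrayList;
import java.util.List;

public class FileCountSketch {

    /** Hypothetical stand-in for an MR operator in the compiled plan. */
    public static final class MROp {
        public final List<MROp> predecessors = new ArrayList<>();
        /** Number of input files, or -1 if the input does not exist at compile time. */
        public final int existingInputFiles;
        /** Number of reducers; 0 means the job is map-only. */
        public final int reducers;

        public MROp(int existingInputFiles, int reducers) {
            this.existingInputFiles = existingInputFiles;
            this.reducers = reducers;
        }
    }

    /**
     * Estimates how many files the given operator will write.
     * Returns -1 when the count cannot be determined from the plan.
     */
    public static int estimateOutputFiles(MROp op) {
        // Assumption: a job with a reduce phase writes one file per reducer.
        if (op.reducers > 0) {
            return op.reducers;
        }
        // Map-only job with existing input: assume one output file per input file.
        if (op.existingInputFiles >= 0) {
            return op.existingInputFiles;
        }
        // Map-only job whose input is produced by earlier jobs: walk back the plan.
        if (op.predecessors.isEmpty()) {
            return -1;
        }
        int total = 0;
        for (MROp pred : op.predecessors) {
            int n = estimateOutputFiles(pred);
            if (n < 0) {
                return -1; // unknown somewhere upstream
            }
            total += n;
        }
        return total;
    }

    public static void main(String[] args) {
        MROp first = new MROp(40, 0);   // map-only job over 40 existing files
        MROp second = new MROp(-1, 0);  // map-only job reading the first job's output
        second.predecessors.add(first);
        System.out.println(estimateOutputFiles(second));
    }
}
```

A real implementation would have to account for input splits rather than whole files, so the estimate here is at best an approximation of the number of maps.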
> aggregate files for replicated join
> -----------------------------------
>
> Key: PIG-1458
> URL: https://issues.apache.org/jira/browse/PIG-1458
> Project: Pig
> Issue Type: Improvement
> Reporter: Olga Natkovich
> Assignee: Richard Ding
> Fix For: 0.8.0
>
> Attachments: PIG-1458.patch
>
>
> We have noticed that if the smaller data in replicated join has many files,
> this puts unneeded burden on the name node. pre-aggregating the files can
> improve the situation