[ https://issues.apache.org/jira/browse/PIG-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12903895#action_12903895 ]
Thejas M Nair commented on PIG-1458:
------------------------------------

Another comment about the patch -
- The test testUnknownNumMaps2 is the same as testUnknownNumMaps; it should be removed.

A note about the 2nd case described in the first comment -
bq. 2. The right input is a map-only job and input files do not exist at the compile time.

When the input does not exist for the input map-only job, in most (or all?) cases it would be possible to determine the number of files by looking at the previous MR operator (or the ones before that).

Also, with the current implementation, since the checks for the number of files are done before the MR jobs are merged together, there will be cases where the final plan has only one MR job, with existing input for the replicated input, and Pig still treats it as case 2.

The example used in testUnknownNumMaps() has only one input MR job, with inputs that exist at compile time, but if pig.frjoin.merge.files.optimistic=false, it will create an additional MR job that combines the input -
{code}
A = LOAD '" + INPUT_FILE + "' as (x:int,y:int);
B = Filter A by x < 50;
C = join A by $0, B by $0 using 'repl';
{code}

> aggregate files for replicated join
> -----------------------------------
>
>                 Key: PIG-1458
>                 URL: https://issues.apache.org/jira/browse/PIG-1458
>             Project: Pig
>          Issue Type: Improvement
>            Reporter: Olga Natkovich
>            Assignee: Richard Ding
>             Fix For: 0.8.0
>
>         Attachments: PIG-1458.patch
>
>
> We have noticed that if the smaller data in replicated join has many files,
> this puts unneeded burden on the name node. Pre-aggregating the files can
> improve the situation.