[ 
https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12868893#action_12868893
 ] 

Thejas M Nair commented on PIG-1249:
------------------------------------

If default_parallel has not been set, the patch sets a new default number of 
reducers based on input file sizes.
If the 'input' specified in the load statement is not an HDFS file, the patch 
fails to find the file size, and the default of 1 reducer will be used.
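
As a rough illustration of the behaviour described above (not the code in the 
patch), a size-based estimate with a fallback of 1 could look like the sketch 
below. The class name, the method name, the bytes-per-reducer threshold and the 
upper cap are all assumptions made for this example; a non-positive size stands 
in for "file size could not be found".

```java
public class ReducerEstimateSketch {
    // Hypothetical tuning values chosen for illustration, not taken from the patch.
    static final long BYTES_PER_REDUCER = 1_000_000_000L;
    static final int MAX_REDUCERS = 999;

    /**
     * Estimates a reducer count from the total input size in bytes.
     * A non-positive size models the case where the input is not an
     * HDFS file and its size is unknown; the result is then 1.
     */
    static int estimateReducers(long totalInputBytes) {
        if (totalInputBytes <= 0) {
            return 1; // size unknown: fall back to a single reducer
        }
        // Ceiling division written so that it cannot overflow for large inputs.
        long needed = (totalInputBytes - 1) / BYTES_PER_REDUCER + 1;
        return (int) Math.min(MAX_REDUCERS, needed);
    }

    public static void main(String[] args) {
        System.out.println(estimateReducers(-1L));             // unknown size
        System.out.println(estimateReducers(2_500_000_000L));  // about 2.5 GB
    }
}
```

With these assumed values, an input of about 2.5 GB maps to 3 reducers and an 
unknown size maps to 1.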

The next steps in automatically determining the number of reducers (which can 
be addressed in separate jiras) are -
1. Determining a different number of reducers for each MR job of a pig-query, 
based on the input size for that MR job.
2. Extending this functionality to load functions that don't take HDFS files as 
input. We can look at using LoadMetadata.getStatistics().

Comments on the patch -
If default_parallel is specified, the number of reducers doesn't need to be 
estimated, so the call to estimateNumberOfReducers can be skipped in that case.
{code}
estimateNumberOfReducers(conf, mro);
if (pigContext.defaultParallel > 0)
    conf.set("mapred.reduce.tasks", "" + pigContext.defaultParallel);
{code}
can be changed to 
{code}
if (pigContext.defaultParallel > 0)
    conf.set("mapred.reduce.tasks", "" + pigContext.defaultParallel);
else
    estimateNumberOfReducers(conf, mro);
{code}
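
To make the intent of the suggested change easier to check in isolation, here 
is a small self-contained sketch of the same precedence rule. It uses a plain 
Map as a stand-in for the job configuration, and the estimate() helper with its 
fixed return value is hypothetical; it only marks the place where 
estimateNumberOfReducers would run. A non-positive defaultParallel is assumed 
to mean "not set", matching the > 0 check above.

```java
import java.util.HashMap;
import java.util.Map;

public class ReducerChoiceSketch {
    static final String REDUCE_TASKS_KEY = "mapred.reduce.tasks";

    // Hypothetical stand-in for the size-based estimate; the value 7 is arbitrary.
    static int estimate() {
        return 7;
    }

    /**
     * Applies the precedence rule: an explicit default_parallel wins,
     * otherwise the estimate is used. Returns true if the estimate ran.
     */
    static boolean configure(Map<String, String> conf, int defaultParallel) {
        if (defaultParallel > 0) {
            conf.put(REDUCE_TASKS_KEY, "" + defaultParallel);
            return false; // estimation skipped entirely
        }
        conf.put(REDUCE_TASKS_KEY, "" + estimate());
        return true;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        boolean estimated = configure(conf, 20);
        System.out.println(conf.get(REDUCE_TASKS_KEY) + " estimated=" + estimated);
    }
}
```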

Everything else looks good.

Hudson still seems to be having problems. I am currently running unit tests 
with this patch.



> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
>                 Key: PIG-1249
>                 URL: https://issues.apache.org/jira/browse/PIG-1249
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Arun C Murthy
>            Assignee: Jeff Zhang
>            Priority: Critical
>             Fix For: 0.8.0
>
>         Attachments: PIG-1249.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts 
> which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge 
> data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
