[
https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jeff Zhang updated PIG-1249:
----------------------------
Attachment: PIG-1249.patch
The current idea is borrowed from Hive: use the input file size to estimate the
number of reducers.
Two parameters can be set for this purpose:
pig.exec.reducers.bytes.per.reducer // the number of bytes of input per reducer
pig.exec.reducers.max // the maximum number of reducers
This only works for HDFS; it won't work for other data sources such as HBase or
Cassandra.
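A minimal sketch of the heuristic described above (not the actual patch code;
the class and method names are illustrative, and the defaults shown are
placeholders that the two properties would override):

public class ReducerEstimator {
    // Placeholder defaults; real values come from the two properties above.
    static final long DEFAULT_BYTES_PER_REDUCER = 1000L * 1000L * 1000L; // ~1 GB
    static final int DEFAULT_MAX_REDUCERS = 999;

    // Estimate the reducer count from the total size of the HDFS input:
    // one reducer per bytesPerReducer bytes, at least 1, capped at maxReducers.
    public static int estimateReducers(long totalInputBytes,
                                       long bytesPerReducer,
                                       int maxReducers) {
        int reducers = (int) Math.ceil((double) totalInputBytes / bytesPerReducer);
        reducers = Math.max(1, reducers);
        return Math.min(reducers, maxReducers);
    }
}

In a Pig script the two properties would presumably be tuned with the set
command, e.g. "set pig.exec.reducers.max 100;".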
> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
> Key: PIG-1249
> URL: https://issues.apache.org/jira/browse/PIG-1249
> Project: Pig
> Issue Type: Improvement
> Reporter: Arun C Murthy
> Assignee: Jeff Zhang
> Priority: Critical
> Fix For: 0.8.0
>
> Attachments: PIG-1249.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts
> which process a *lot* of data without the use of the PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge
> data-sets (>10TB) with a badly misconfigured number of reduces, e.g. 1 reduce.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.