[
https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12893446#action_12893446
]
Olga Natkovich commented on PIG-1249:
-------------------------------------
Comments for the documentation:
+ /**
+ * Currently, the estimation of the number of reducers applies only to HDFS; the
+ * estimate is based on the size of the input data stored on HDFS.
+ * Two parameters can be configured for the estimation: pig.exec.reducers.max,
+ * which constrains the maximum number of reducer tasks (default 999), and
+ * pig.exec.reducers.bytes.per.reducer (default 1000*1000*1000), which specifies
+ * how much data each reducer can handle.
+ * For example, given the following Pig script:
+ * a = load '/data/a';
+ * b = load '/data/b';
+ * c = join a by $0, b by $0;
+ * store c into '/tmp';
+ *
+ * If the size of /data/a is 1000*1000*1000 and the size of /data/b is
+ * 2*1000*1000*1000, then the estimated number of reducers is
+ * (1000*1000*1000 + 2*1000*1000*1000) / (1000*1000*1000) = 3.
+ */
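The arithmetic above can be sketched as a small standalone method. This is a minimal illustration of the formula described in the comment (total input bytes divided by pig.exec.reducers.bytes.per.reducer, rounded up and capped at pig.exec.reducers.max), not the actual Pig source; the class and method names are hypothetical.

```java
// Hypothetical sketch of the reducer-count estimation described above.
public class ReducerEstimator {
    // Defaults named in the comment (assumed, not copied from Pig source):
    static final long BYTES_PER_REDUCER_DEFAULT = 1000L * 1000 * 1000; // pig.exec.reducers.bytes.per.reducer
    static final int MAX_REDUCERS_DEFAULT = 999;                       // pig.exec.reducers.max

    static int estimateReducers(long totalInputBytes, long bytesPerReducer, int maxReducers) {
        // Round up so any remainder of input still gets a reducer,
        // then clamp to the configured maximum (and at least 1).
        int estimated = (int) Math.ceil((double) totalInputBytes / (double) bytesPerReducer);
        return Math.max(1, Math.min(estimated, maxReducers));
    }

    public static void main(String[] args) {
        // The example from the comment: /data/a is 1 GB, /data/b is 2 GB.
        long total = 1000L * 1000 * 1000 + 2L * 1000 * 1000 * 1000;
        System.out.println(estimateReducers(total, BYTES_PER_REDUCER_DEFAULT, MAX_REDUCERS_DEFAULT));
    }
}
```

With the defaults, the 3 GB of input in the example yields 3 reducers, well under the 999-reducer cap.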
> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
> Key: PIG-1249
> URL: https://issues.apache.org/jira/browse/PIG-1249
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.8.0
> Reporter: Arun C Murthy
> Assignee: Jeff Zhang
> Priority: Critical
> Fix For: 0.8.0
>
> Attachments: PIG-1249-4.patch, PIG-1249.patch, PIG-1249_5.patch,
> PIG_1249_2.patch, PIG_1249_3.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts
> which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge
> data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.