[
https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jeff Zhang updated PIG-1249:
----------------------------
Attachment: PIG-1249.patch
The current idea is borrowed from Hive: use the input file size to estimate the
number of reducers.
Two parameters can be set for this purpose:
pig.exec.reducers.bytes.per.reducer // the number of bytes of input per reducer
pig.exec.reducers.max // the maximum number of reducers
This only works for HDFS; it won't work for other data sources such as HBase or
Cassandra.
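A minimal sketch of the heuristic described above (not the actual patch code;
the class and method names are illustrative, and the defaults shown are
placeholders that the two properties would override):

public class ReducerEstimator {
    // Placeholder defaults; real values come from the two properties above.
    static final long DEFAULT_BYTES_PER_REDUCER = 1000L * 1000L * 1000L; // ~1 GB
    static final int DEFAULT_MAX_REDUCERS = 999;

    // Estimate the reducer count from the total size of the HDFS input:
    // one reducer per bytesPerReducer bytes, at least 1, capped at maxReducers.
    public static int estimateReducers(long totalInputBytes,
                                       long bytesPerReducer,
                                       int maxReducers) {
        int reducers = (int) Math.ceil((double) totalInputBytes / bytesPerReducer);
        reducers = Math.max(1, reducers);
        return Math.min(reducers, maxReducers);
    }
}

In a Pig script the two properties would presumably be tuned with the set
command, e.g. "set pig.exec.reducers.max 100;".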
> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
> Key: PIG-1249
> URL: https://issues.apache.org/jira/browse/PIG-1249
> Project: Pig
> Issue Type: Improvement
> Reporter: Arun C Murthy
> Assignee: Jeff Zhang
> Priority: Critical
> Fix For: 0.8.0
>
> Attachments: PIG-1249.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts
> which process a *lot* of data without the use of the PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge
> data-sets (>10TB) with a badly misconfigured number of reduces, e.g. 1 reduce.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.