[ https://issues.apache.org/jira/browse/PIG-1249?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12870948#action_12870948 ]
Jeff Zhang commented on PIG-1249:
---------------------------------

Response to Alan's questions:

1. In this code, what happens if a loader is not loading from a file (like an HBase loader)? It looks to me like it will end up throwing an IOException when it tries to stat the 'file' which won't exist and that will cause Pig to die. Ideally in this case it should decide that it cannot make a rational estimate and not try to estimate.
{color:blue}
It won't throw an IOException when the file doesn't exist. getTotalInputFileSize will return 0 if the loader is not loading from a file or the file doesn't exist, and the final estimated number of reducers will then be 1.
{color}

2. I'm curious where the values of ~1GB per reducer and 999 reducers came from.
{color:blue}
These two numbers are what Hive uses. I'm not sure where they came from; maybe from their experience.
{color}

3. Does this estimate apply only to the first job or to all jobs?
{color:blue}
It will apply to all the jobs.
{color}

4. How does this work in the case of joins, where there are multiple inputs to a job?
{color:blue}
It will estimate the number of reducers from the total size of all the input files.
{color}

> Safe-guards against misconfigured Pig scripts without PARALLEL keyword
> ----------------------------------------------------------------------
>
> Key: PIG-1249
> URL: https://issues.apache.org/jira/browse/PIG-1249
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.8.0
> Reporter: Arun C Murthy
> Assignee: Jeff Zhang
> Priority: Critical
> Fix For: 0.8.0
>
> Attachments: PIG-1249.patch, PIG_1249_2.patch
>
>
> It would be *very* useful for Pig to have safe-guards against naive scripts
> which process a *lot* of data without the use of PARALLEL keyword.
> We've seen a fair number of instances where naive users process huge
> data-sets (>10TB) with badly mis-configured #reduces e.g. 1 reduce.
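
For readers following the thread, the estimate described in the answers above can be sketched roughly as follows. This is only an illustration in Python, not the code from the attached patches: the function name, the constant names, the reading of "~1GB" as 10^9 bytes, and the round-up behaviour are all assumptions. Only the ~1GB-per-reducer target, the 999 cap, the summing over all inputs, and the fallback to 1 reducer when the measured size is 0 come from the comment.

```python
# Hypothetical sketch of the reducer estimate discussed in PIG-1249.
# Names and rounding are assumptions; only the two constants' rough
# values and the overall behaviour come from the thread.

BYTES_PER_REDUCER = 1_000_000_000  # "~1GB per reducer" (exact value assumed)
MAX_REDUCERS = 999                 # upper bound mentioned in the thread


def estimate_reducers(input_sizes,
                      bytes_per_reducer=BYTES_PER_REDUCER,
                      max_reducers=MAX_REDUCERS):
    """Estimate a reducer count from the sizes (in bytes) of all job inputs.

    Inputs that are not files (or do not exist) are expected to be
    reported as size 0 by the caller, so they add nothing to the total.
    """
    total_bytes = sum(input_sizes)
    # Integer ceiling division: one reducer per started chunk of input.
    wanted = -(-total_bytes // bytes_per_reducer)
    # Never go below 1 reducer, never above the cap.
    return max(1, min(max_reducers, wanted))


if __name__ == "__main__":
    print(estimate_reducers([0]))                         # non-file loader
    print(estimate_reducers([600_000_000, 600_000_000]))  # join of two inputs
    print(estimate_reducers([10 * 10**12]))               # 10 TB input
```

Under these assumptions, a non-file input yields 1 reducer, a join sums both inputs before dividing, and a 10TB input is clamped to the 999 cap.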