[ https://issues.apache.org/jira/browse/PIG-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Daniel Dai closed PIG-1198.
---------------------------

> [zebra] performance improvements
> --------------------------------
>
>                 Key: PIG-1198
>                 URL: https://issues.apache.org/jira/browse/PIG-1198
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.6.0
>            Reporter: Yan Zhou
>            Assignee: Yan Zhou
>             Fix For: 0.7.0
>
>         Attachments: PIG-1198.patch, PIG-1198.patch, PIG-1198.patch
>
>
> Current input split generation is a row-based split on individual TFiles.
> As an undesirable consequence, one split is generated for every TFile, even
> for TFiles smaller than one block. Many mappers, and many waves, are then
> needed to handle the many small TFiles produced by the equally many
> mappers/reducers that wrote the data. This issue can be addressed by
> generating input splits that can include multiple TFiles.
>
> For sorted tables, the key distribution generated per table, which is used
> to generate proper input splits, includes key distributions from column
> groups even if they are not in the projection. This incurs the extra cost of
> unnecessary computation and, worse, produces unreasonable input splits.
>
> For unsorted tables, when row splits are generated on a union of tables,
> FileSplits are generated for each table and then lumped together to form the
> final list of splits passed to Map/Reduce. As a result, the number of splits
> depends on the number of tables in the union and is not controlled solely by
> the number of splits used by the Map/Reduce framework.
>
> The input split's goal size is calculated over all column groups, even if
> some of them are not in the projection.
>
> For input splits that span multiple files in one column group, all files are
> opened at startup. This is unnecessary and holds resources needlessly from
> start to end. Files should be opened when needed and closed when no longer
> needed.
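
As a rough illustration of three of the ideas in the description above (a goal size computed only from projected column groups, splits that may contain several small files, and files opened only when they are reached), here is a minimal Python sketch. It is not the code from PIG-1198.patch and does not use Zebra's API. Every name in it (TFileInfo, goal_size, pack_splits, read_split, open_fn) is hypothetical, and the greedy packing strategy is an assumption chosen for brevity. It does not break up files larger than the goal size, so the number of splits it returns is only approximately the requested number.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List, Set


@dataclass
class TFileInfo:
    """Hypothetical descriptor: a file path plus its byte size per column group."""
    path: str
    bytes_by_group: Dict[str, int]

    def projected_size(self, projection: Set[str]) -> int:
        # Count only the column groups that are part of the projection.
        return sum(n for g, n in self.bytes_by_group.items() if g in projection)


def goal_size(files: List[TFileInfo], projection: Set[str], num_splits: int) -> int:
    """Target bytes per split, based on projected column groups only."""
    total = sum(f.projected_size(projection) for f in files)
    return max(1, -(-total // max(1, num_splits)))  # ceiling division


def pack_splits(files: List[TFileInfo], projection: Set[str],
                num_splits: int) -> List[List[TFileInfo]]:
    """Greedily group consecutive files so each split stays within the goal size."""
    goal = goal_size(files, projection, num_splits)
    splits: List[List[TFileInfo]] = []
    current: List[TFileInfo] = []
    current_size = 0
    for f in files:
        size = f.projected_size(projection)
        # Start a new split once adding this file would exceed the goal.
        if current and current_size + size > goal:
            splits.append(current)
            current, current_size = [], 0
        current.append(f)
        current_size += size
    if current:
        splits.append(current)
    return splits


def read_split(split: List[TFileInfo], open_fn: Callable = open) -> Iterator[bytes]:
    """Yield lines from each file in turn, holding at most one file open."""
    for f in split:
        # The file is opened only when the reader reaches it and is
        # closed before the next one is opened.
        with open_fn(f.path, "rb") as handle:
            yield from handle
```

With six small files and a request for two splits, pack_splits returns two splits of three files each instead of six single-file splits. Because read_split is a generator, no file is opened until the first record is requested.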