[ 
https://issues.apache.org/jira/browse/PIG-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Zhou updated PIG-1198:
--------------------------

    Attachment: PIG-1198.patch

This patch is based on the load-store-redesign branch and may therefore differ 
slightly, because of the different code base, from the final patch to be applied 
to the trunk. It is posted for review purposes only; no submission is 
intended. 

> [zebra] performance improvements
> --------------------------------
>
>                 Key: PIG-1198
>                 URL: https://issues.apache.org/jira/browse/PIG-1198
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.6.0
>            Reporter: Yan Zhou
>            Assignee: Yan Zhou
>             Fix For: 0.7.0
>
>         Attachments: PIG-1198.patch
>
>
> Current input split generation is a row-based split on individual TFiles. This 
> has the undesirable effect that even for TFiles smaller than one block, one 
> split is still generated for each. Consequently, many mappers, and many 
> waves, are needed to handle the many small TFiles generated by as many 
> mappers/reducers that wrote the data. This issue can be addressed by 
> generating input splits that can include multiple TFiles. 
> For sorted tables, key distribution generation by table, which is used to 
> generate proper input splits, includes key distributions from column groups 
> even if they are not in the projection. This incurs the extra cost of 
> unnecessary computation and, worse, produces unreasonable input splits. 
> For unsorted tables, when a row split is generated on a union of tables, the 
> FileSplits are generated for each table and then lumped together to form the 
> final list of splits passed to Map/Reduce. This has the undesirable effect 
> that the number of splits depends on the number of tables in the union and is 
> not controlled solely by the number of splits used by the Map/Reduce framework. 
> The input split's goal size is calculated over all column groups, even if 
> some of them are not in the projection. 
> For input splits of multiple files in one column group, all files are opened 
> at startup. This is unnecessary and holds resources needlessly from start to 
> end. The files should be opened when needed and closed when no longer needed. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
