[jira] Commented: (PIG-1198) [zebra] performance improvements

Hadoop QA (JIRA) Fri, 26 Feb 2010 00:18:52 -0800

    [ 
https://issues.apache.org/jira/browse/PIG-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838747#action_12838747
 ]


Hadoop QA commented on PIG-1198:
--------------------------------

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12437084/PIG-1198.patch
  against trunk revision 916429.

    +1 @author.  The patch does not contain any @author tags.

    +1 tests included.  The patch appears to include 3 new or modified tests.

    +1 javadoc.  The javadoc tool did not generate any warning messages.

    +1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

    +1 findbugs.  The patch does not introduce any new Findbugs warnings.

    +1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

    -1 core tests.  The patch failed core unit tests.

    +1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/225/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/225/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/225/console

This message is automatically generated.

> [zebra] performance improvements
> --------------------------------
>
>                 Key: PIG-1198
>                 URL: https://issues.apache.org/jira/browse/PIG-1198
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.6.0
>            Reporter: Yan Zhou
>            Assignee: Yan Zhou
>             Fix For: 0.7.0
>
>         Attachments: PIG-1198.patch, PIG-1198.patch, PIG-1198.patch
>
>
> Current input split generation is a row-based split on individual TFiles. 
> As a result, one split is still generated for each TFile, even for TFiles 
> smaller than one block. Consequently, many mappers, and many waves, are 
> needed to handle the many small TFiles generated by the equally many 
> mappers/reducers that wrote the data. This issue can be addressed by 
> generating input splits that can include multiple TFiles. 
> For sorted tables, the per-table key distribution generation, which is used 
> to generate proper input splits, includes key distributions from column 
> groups even if they are not in the projection. This incurs the extra cost 
> of unnecessary computation and, worse, produces unreasonable input splits. 
> For unsorted tables, when a row split is generated on a union of tables, 
> FileSplits are generated for each table and then lumped together to form 
> the final list of splits passed to Map/Reduce. As an undesirable 
> consequence, the number of splits depends on the number of tables in the 
> union and is not controlled solely by the number of splits used by the 
> Map/Reduce framework. 
> The input split's goal size is calculated over all column groups, even if 
> some of them are not in the projection. 
> For input splits of multiple files in one column group, all files are opened 
> at startup. This is unnecessary and holds resources from start to end. 
> The files should be opened when needed and closed when no longer needed. 
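
For illustration only, the first point in the description above (one split 
spanning several small TFiles) can be sketched as a greedy packing of file 
lengths against a goal size. This is a minimal sketch, not the code in the 
attached patch: the class name CombinedSplitSketch, the method pack, and the 
sample lengths are all hypothetical, and real splits would also carry paths, 
offsets and host locations.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: NOT the Zebra implementation from PIG-1198.patch.
public class CombinedSplitSketch {

    /**
     * Greedily groups files, in order, into splits whose total length stays
     * at or below goalSize. A single file larger than goalSize still gets a
     * split of its own. Each inner list holds indexes into fileLengths.
     */
    public static List<List<Integer>> pack(long[] fileLengths, long goalSize) {
        List<List<Integer>> splits = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        long currentSize = 0;
        for (int i = 0; i < fileLengths.length; i++) {
            // Close the current split if adding this file would exceed the goal.
            if (!current.isEmpty() && currentSize + fileLengths[i] > goalSize) {
                splits.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(i);
            currentSize += fileLengths[i];
        }
        // Emit the last, possibly partially filled, split.
        if (!current.isEmpty()) {
            splits.add(current);
        }
        return splits;
    }

    public static void main(String[] args) {
        // Six small files and a goal size of 64 (arbitrary units).
        long[] lengths = {10, 20, 30, 64, 5, 5};
        System.out.println(pack(lengths, 64)); // prints [[0, 1, 2], [3], [4, 5]]
    }
}
```

With one split per file, the six files in this example would need six 
mappers; packed against the goal size they need three.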

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
