[
https://issues.apache.org/jira/browse/PIG-1198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12838747#action_12838747
]
Hadoop QA commented on PIG-1198:
--------------------------------
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12437084/PIG-1198.patch
against trunk revision 916429.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 3 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac
compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of
release audit warnings.
-1 core tests. The patch failed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results:
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/225/testReport/
Findbugs warnings:
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/225/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output:
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/225/console
This message is automatically generated.
> [zebra] performance improvements
> --------------------------------
>
> Key: PIG-1198
> URL: https://issues.apache.org/jira/browse/PIG-1198
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.6.0
> Reporter: Yan Zhou
> Assignee: Yan Zhou
> Fix For: 0.7.0
>
> Attachments: PIG-1198.patch, PIG-1198.patch, PIG-1198.patch
>
>
> Input splits are currently generated row-based on individual TFiles. As a
> result, one split is generated for every TFile, even for TFiles smaller than
> one block. Consequently, many mappers, and many waves, are needed to handle
> the many small TFiles generated by as many mappers/reducers that wrote the
> data. This issue can be addressed by generating input splits that can
> include multiple TFiles.
> For sorted tables, the key distribution generated per table, which is used
> to generate proper input splits, includes key distributions from column
> groups even if they are not in the projection. This incurs the extra cost of
> unnecessary computation and, worse, produces unreasonable input splits.
> For unsorted tables, when row splits are generated on a union of tables,
> FileSplits are generated for each table and then lumped together to form the
> final list of splits passed to Map/Reduce. The undesirable consequence is
> that the number of splits depends on the number of tables in the union
> instead of being controlled only by the number of splits used by the
> Map/Reduce framework.
> The input split's goal size is calculated over all column groups, even if
> some of them are not in the projection.
> For input splits that span multiple files in one column group, all files are
> opened at startup. This is unnecessary and holds resources needlessly from
> start to end. The files should be opened when needed and closed when no
> longer needed.
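
The first point of the description, generating input splits that can include
multiple TFiles, can be illustrated with a minimal sketch. It is written in
Python for brevity and is not taken from the PIG-1198 patch. The names Split
and pack_files_into_splits, the greedy grouping policy, and the file names and
sizes are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Split:
    """A hypothetical input split covering one or more whole files."""
    files: list = field(default_factory=list)
    size: int = 0


def pack_files_into_splits(file_sizes, goal_size):
    """Greedily group (name, size) pairs into splits of about goal_size bytes.

    Files are taken in the given order. A new split is started whenever
    adding the next file would push the current split past goal_size.
    A file larger than goal_size gets a split of its own; this sketch
    does not divide a single file across several splits.
    """
    if goal_size <= 0:
        raise ValueError("goal_size must be positive")
    splits = []
    current = Split()
    for name, size in file_sizes:
        if current.files and current.size + size > goal_size:
            splits.append(current)
            current = Split()
        current.files.append(name)
        current.size += size
    if current.files:
        splits.append(current)
    return splits


# Ten 8 MB files with a 64 MB goal size (illustrative numbers only).
MB = 1024 * 1024
small_files = [(f"part-{i:05d}", 8 * MB) for i in range(10)]
splits = pack_files_into_splits(small_files, goal_size=64 * MB)
print([len(s.files) for s in splits])  # prints [8, 2]
```

With one split per file, the ten files above would need ten mappers. Grouping
by goal size reduces that to two splits in this example.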