[ https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12886281#action_12886281 ]
Hadoop QA commented on PIG-1472:
--------------------------------
-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12448937/PIG-1472.2.patch
against trunk revision 960062.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 69 new or modified tests.
-1 javadoc. The javadoc tool appears to have generated 1 warning messages.
-1 javac. The applied patch generated 148 javac compiler warnings (more
than the trunk's current 145 warnings).
-1 findbugs. The patch appears to introduce 2 new Findbugs warnings.
-1 release audit. The applied patch generated 400 release audit warnings
(more than the trunk's current 399 warnings).
-1 core tests. The patch failed core unit tests.
-1 contrib tests. The patch failed contrib unit tests.
Test results:
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/362/testReport/
Release audit warnings:
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/362/artifact/trunk/patchprocess/releaseAuditDiffWarnings.txt
Findbugs warnings:
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/362/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output:
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/362/console
This message is automatically generated.
> Optimize serialization/deserialization between Map and Reduce and between MR
> jobs
> ---------------------------------------------------------------------------------
>
> Key: PIG-1472
> URL: https://issues.apache.org/jira/browse/PIG-1472
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.8.0
> Reporter: Thejas M Nair
> Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1472.2.patch, PIG-1472.patch
>
>
> In certain types of Pig queries, most of the execution time is spent
> serializing/deserializing (sedes) records between Map and Reduce and between
> MR jobs.
> For example, if PigMix queries are modified to specify types for all the
> fields in the load statement schema, some of the queries (L2, L3, L9, L10 in
> PigMix v1) that have records with bags and maps being transmitted across map
> or reduce boundaries run a lot longer (a runtime increase of a few times has
> been seen).
> There are a few optimizations that have been shown to improve the
> performance of sedes in my tests:
> 1. Use a smaller number of bytes to store the length of the column. For
> example, if a bytearray is smaller than 255 bytes, a single byte can be used
> to store the length instead of the integer that is currently used.
> 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and
> DataInput.readUTF. This reduces the cost of serialization by more than half.
> Zebra and BinStorage are known to use DefaultTuple sedes functionality. The
> serialization format that these loaders use cannot change, so after the
> optimization their format is going to be different from the format used
> between M/R boundaries.
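
The two optimizations listed in the description can be sketched roughly as
follows. This is only an illustration under stated assumptions, not the code
in the attached patch: the class name, the method names and the type tag
values (TINY_BYTEARRAY, BYTEARRAY) are hypothetical and are not Pig's actual
constants. The sketch assumes a one-byte type tag is written before each
value so the reader knows how wide the length field is.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class CompactSedesSketch {
    // Hypothetical type tags (assumption, not Pig's real DataType values).
    static final byte TINY_BYTEARRAY = 1; // length stored in one unsigned byte
    static final byte BYTEARRAY = 2;      // length stored in a 4-byte int

    // Optimization 1: pick the narrowest length field that fits.
    static void writeBytes(DataOutput out, byte[] value) throws IOException {
        if (value.length <= 255) {        // one unsigned byte holds 0..255
            out.writeByte(TINY_BYTEARRAY);
            out.writeByte(value.length);  // only the low 8 bits are written
        } else {
            out.writeByte(BYTEARRAY);
            out.writeInt(value.length);
        }
        out.write(value);
    }

    static byte[] readBytes(DataInput in) throws IOException {
        byte tag = in.readByte();
        int length;
        if (tag == TINY_BYTEARRAY) {
            length = in.readUnsignedByte();
        } else if (tag == BYTEARRAY) {
            length = in.readInt();
        } else {
            throw new IOException("Unknown type tag: " + tag);
        }
        byte[] value = new byte[length];
        in.readFully(value);
        return value;
    }

    // Optimization 2: delegate String sedes to the JDK.
    // writeUTF uses a 2-byte length prefix, so the encoded string
    // must not exceed 65535 bytes.
    static void writeString(DataOutput out, String s) throws IOException {
        out.writeUTF(s);
    }

    static String readString(DataInput in) throws IOException {
        return in.readUTF();
    }

    // Convenience wrappers over in-memory buffers, for experimentation.
    static byte[] encodeBytes(byte[] value) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            writeBytes(new DataOutputStream(buf), value);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static byte[] decodeBytes(byte[] encoded) {
        try {
            return readBytes(new DataInputStream(new ByteArrayInputStream(encoded)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static byte[] encodeString(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            writeString(new DataOutputStream(buf), s);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String decodeString(byte[] encoded) {
        try {
            return readString(new DataInputStream(new ByteArrayInputStream(encoded)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodeBytes(new byte[10]).length);   // tag + 1-byte length + payload
        System.out.println(encodeBytes(new byte[300]).length);  // tag + 4-byte length + payload
        System.out.println(decodeString(encodeString("pig")));
    }
}
```

In this sketch a short bytearray costs two bytes of overhead (tag plus
length) instead of five. Because the tag changes the on-disk layout, readers
of the older format would not understand it, which is consistent with the
note above that Zebra and BinStorage have to stay on the existing format.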