[ 
https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Thejas M Nair updated PIG-1472:
-------------------------------

    Attachment: PIG-1472.patch

Summary of changes in the patch:
1. The default TupleFactory is now BinSedesTupleFactory. It returns tuples 
implemented by the BinSedesTuple class. This changes the serialization format 
between Map and Reduce.
2. (De)serialization in BinSedesTuple and DefaultAbstractBag uses an 
implementation of a new InterSedes interface, which is returned by 
InterSedesFactory.getInterSedesInstance().
3. A new load function, InterStorage, is used for serializing data between MR 
jobs. It should not be used like a regular load/store function to store 
persistent data.
4. DefaultTupleFactory has been retained so that any external UDFs that use 
it still compile. DefaultTupleFactory is a subclass of BinSedesTupleFactory 
that does not override any of its methods.
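
The delegation described in item 2 could look roughly like the sketch below. 
It is only an illustration: SedesSketch, UtfSedesSketch, SedesFactorySketch 
and TupleSketch are hypothetical stand-ins, not the classes in the patch, and 
the real InterSedes method signatures are not shown in this message. The 
point is that the tuple holds data only, while a single factory decides which 
wire format is used.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the InterSedes interface named in the summary.
interface SedesSketch {
    void writeFields(DataOutput out, List<String> fields) throws IOException;
    List<String> readFields(DataInput in) throws IOException;
}

// One concrete wire format: a field count, then each field via writeUTF.
class UtfSedesSketch implements SedesSketch {
    public void writeFields(DataOutput out, List<String> fields) throws IOException {
        out.writeInt(fields.size());
        for (String f : fields) {
            out.writeUTF(f);
        }
    }

    public List<String> readFields(DataInput in) throws IOException {
        int n = in.readInt();
        List<String> fields = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            fields.add(in.readUTF());
        }
        return fields;
    }
}

// Hypothetical stand-in for InterSedesFactory: the one place that picks the format.
class SedesFactorySketch {
    private static final SedesSketch INSTANCE = new UtfSedesSketch();

    static SedesSketch getInstance() {
        return INSTANCE;
    }
}

// Hypothetical tuple: it stores fields and delegates all (de)serialization.
class TupleSketch {
    final List<String> fields = new ArrayList<>();

    void write(DataOutput out) throws IOException {
        SedesFactorySketch.getInstance().writeFields(out, fields);
    }

    void readFrom(DataInput in) throws IOException {
        fields.clear();
        fields.addAll(SedesFactorySketch.getInstance().readFields(in));
    }

    // Serializes a tuple to bytes and reads it back into a fresh tuple.
    static TupleSketch roundTrip(TupleSketch t) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            t.write(new DataOutputStream(buf));
            TupleSketch copy = new TupleSketch();
            copy.readFrom(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch, swapping the wire format means changing only what 
SedesFactorySketch.getInstance() returns; TupleSketch itself stays untouched.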

I think the serialization format should not be tied to the Tuple 
implementation. A load function can return any tuple implementation; if we 
happen to call the write function of such a tuple, the reduce side will not 
be able to read it using the default tuple. I think the 
InterSedes/InterSedesFactory classes should be used instead. With this patch, 
the InterSedes/InterSedesFactory classes are used only when BinSedesTuple is 
the default tuple.
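
To make the concern concrete, here is a minimal sketch of a wire format that 
does not depend on the tuple implementation. All names (FieldTuple, 
ListTuple, ArrayTuple, SharedFormat) are hypothetical and are not part of the 
patch; string-only fields are an assumption made to keep the example short. 
Because the writer only uses the tuple's accessors, two unrelated tuple 
implementations produce identical bytes, and the reader can always 
materialize the default implementation.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal tuple contract: data access only, no write() method.
interface FieldTuple {
    int size();
    String get(int i);
}

// Stand-in for the default tuple implementation.
class ListTuple implements FieldTuple {
    private final List<String> fields;

    ListTuple(List<String> fields) {
        this.fields = new ArrayList<>(fields);
    }

    public int size() { return fields.size(); }
    public String get(int i) { return fields.get(i); }
}

// Stand-in for a tuple implementation returned by some load function.
class ArrayTuple implements FieldTuple {
    private final String[] fields;

    ArrayTuple(String... fields) {
        this.fields = fields.clone();
    }

    public int size() { return fields.length; }
    public String get(int i) { return fields[i]; }
}

// The wire format lives here, not in any tuple class.
class SharedFormat {
    static byte[] write(FieldTuple t) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(t.size());
            for (int i = 0; i < t.size(); i++) {
                out.writeUTF(t.get(i));
            }
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // The reader always builds the default implementation, whatever was written.
    static ListTuple read(byte[] wire) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire));
            int n = in.readInt();
            List<String> fields = new ArrayList<>(n);
            for (int i = 0; i < n; i++) {
                fields.add(in.readUTF());
            }
            return new ListTuple(fields);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```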


> Optimize serialization/deserialization between Map and Reduce and between MR 
> jobs
> ---------------------------------------------------------------------------------
>
>                 Key: PIG-1472
>                 URL: https://issues.apache.org/jira/browse/PIG-1472
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0
>            Reporter: Thejas M Nair
>            Assignee: Thejas M Nair
>             Fix For: 0.8.0
>
>         Attachments: PIG-1472.patch
>
>
> In certain types of Pig queries, most of the execution time is spent 
> serializing/deserializing (sedes) records between Map and Reduce and between 
> MR jobs. 
> For example, if PigMix queries are modified to specify types for all the 
> fields in the load statement schema, some of the queries (L2, L3, L9, L10 in 
> PigMix v1) that have records with bags and maps being transmitted across map 
> or reduce boundaries run a lot longer (a runtime increase of a few times has 
> been seen).
> There are a few optimizations that have been shown to improve the 
> performance of sedes in my tests:
> 1. Use a smaller number of bytes to store the length of a column. For 
> example, if a bytearray is smaller than 255 bytes, a single byte can be used 
> to store the length instead of the integer that is currently used.
> 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and 
> DataInput.readUTF. This reduces the cost of serialization by more than half. 
> Zebra and BinStorage are known to use DefaultTuple sedes functionality. The 
> serialization format that these loaders use cannot change, so after the 
> optimization their format is going to be different from the format used 
> between M/R boundaries.
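
The two optimizations listed in the issue description can be sketched as 
follows. This is an illustration under assumptions, not the code in the 
patch: the class name, the type tags TINY_BYTEARRAY and BYTEARRAY, and the 
helper methods are all hypothetical. A type tag records which length width 
was used, so the reader knows whether to read one byte or a 4-byte int. 
Strings go through DataOutput.writeUTF, which writes a 2-byte length followed 
by modified UTF-8 and therefore cannot encode strings longer than 65535 bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch; the tags and method names are not taken from the patch.
class CompactSedesSketch {
    static final byte TINY_BYTEARRAY = 1; // length stored in one unsigned byte
    static final byte BYTEARRAY = 2;      // length stored in a 4-byte int

    // Optimization 1: pick the smallest length field that fits.
    static void writeByteArray(DataOutput out, byte[] data) throws IOException {
        if (data.length <= 255) {
            out.writeByte(TINY_BYTEARRAY);
            out.writeByte(data.length); // only the low 8 bits are written
        } else {
            out.writeByte(BYTEARRAY);
            out.writeInt(data.length);
        }
        out.write(data);
    }

    static byte[] readByteArray(DataInput in) throws IOException {
        byte tag = in.readByte();
        int len;
        if (tag == TINY_BYTEARRAY) {
            len = in.readUnsignedByte();
        } else if (tag == BYTEARRAY) {
            len = in.readInt();
        } else {
            throw new IOException("unexpected type tag " + tag);
        }
        byte[] data = new byte[len];
        in.readFully(data);
        return data;
    }

    // Optimization 2: use the built-in UTF routines instead of custom code.
    static void writeString(DataOutput out, String s) throws IOException {
        out.writeUTF(s);
    }

    static String readString(DataInput in) throws IOException {
        return in.readUTF();
    }

    // Convenience wrappers that work on in-memory buffers.
    static byte[] encode(byte[] data) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            writeByteArray(new DataOutputStream(buf), data);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static byte[] decode(byte[] wire) {
        try {
            return readByteArray(new DataInputStream(new ByteArrayInputStream(wire)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static byte[] encodeString(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            writeString(new DataOutputStream(buf), s);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String decodeString(byte[] wire) {
        try {
            return readString(new DataInputStream(new ByteArrayInputStream(wire)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this layout a 10-byte bytearray takes 12 bytes on the wire (tag, 1-byte 
length, payload), whereas a 300-byte bytearray takes 305 bytes (tag, 4-byte 
length, payload).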

