[jira] [Commented] (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

2015-08-13 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14695549#comment-14695549
 ] 

Rohini Palaniswamy commented on PIG-1472:
-

Actually, I did some more verification, and it looks like the TINY/SMALL 
approach is better than WritableUtils.writeVInt. The current code uses the 
unsigned maximum values for byte and short (255 and 65535), so it represents 
values up to 65535 in fewer bytes than WritableUtils.writeVInt, and most of the 
data will fall into that range. WritableUtils.writeVInt needs 2 bytes and 3 
bytes where the current approach needs a byte and a short respectively, because 
one byte is taken up by the length for values above 127. For example, 32767 
uses 3 bytes and not 2. So it is better to leave the current approach as it is. 
One thing that might be advantageous, though, is to use 
WritableUtils.writeVLong to serialize LONG instead of out.writeLong(). Although 
it uses 9 bytes for values >= Math.pow(2, 56), for val > Math.pow(2, 32) and 
val < Math.pow(2, 48) it uses 6 to 7 bytes, which is good. Timestamps, which 
are the most commonly used longs, take 7 bytes instead of 8. Apart from the 
byte savings, we need to measure the time taken to serialize and deserialize to 
see whether it is really advantageous.
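
The byte counts discussed above can be checked with a small standalone sketch. It does not call Hadoop; `vlongSize` re-implements the size rule of the zero-compressed encoding as I understand it (one header byte, plus one byte per significant data byte for values outside -112..127), and `tinySmallSize` is a hypothetical stand-in for the TINY/SMALL size fields. The class and method names are made up for illustration.

```java
// Sketch only: mirrors the assumed size rule of Hadoop's zero-compressed
// integers so it can run without Hadoop on the classpath.
final class VIntSizeSketch {

    /** Bytes a zero-compressed long would occupy under the assumed rule. */
    static int vlongSize(long v) {
        if (v >= -112 && v <= 127) {
            return 1;                      // value fits in the single header byte
        }
        if (v < 0) {
            v = ~v;                        // negatives are stored as one's complement
        }
        int dataBits = Long.SIZE - Long.numberOfLeadingZeros(v);
        return (dataBits + 7) / 8 + 1;     // data bytes plus one length byte
    }

    /** Size-field bytes in a TINY/SMALL/full scheme (hypothetical helper). */
    static int tinySmallSize(int size) {
        if (size <= 255) {
            return 1;                      // unsigned byte
        }
        if (size <= 65535) {
            return 2;                      // unsigned short
        }
        return 4;                          // plain int
    }

    public static void main(String[] args) {
        long[] samples = {127L, 255L, 32767L, 65535L, 1L << 32, 1439424000000L, 1L << 56};
        for (long s : samples) {
            System.out.println(s + " -> " + vlongSize(s) + " byte(s) zero-compressed");
        }
    }
}
```

Both schemes still need a type tag in front of the size, so the comparison here is only about the size field itself. The sample 1439424000000 is an arbitrary millisecond timestamp from August 2015.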



> Optimize serialization/deserialization between Map and Reduce and between MR 
> jobs
> -
>
> Key: PIG-1472
> URL: https://issues.apache.org/jira/browse/PIG-1472
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.8.0
> Reporter: Thejas M Nair
> Assignee: Thejas M Nair
> Fix For: 0.8.0
>
> Attachments: PIG-1472.2.patch, PIG-1472.3.patch, PIG-1472.4.patch, 
> PIG-1472.patch
>
>
> In certain types of Pig queries, most of the execution time is spent 
> serializing/deserializing (sedes) records between Map and Reduce and between 
> MR jobs. 
> For example, if PigMix queries are modified to specify types for all the 
> fields in the load statement schema, some of the queries (L2, L3, L9, L10 in 
> PigMix v1) that have records with bags and maps being transmitted across map 
> or reduce boundaries run a lot longer (a runtime increase of a few times has 
> been seen).
> There are a few optimizations that have been shown to improve the 
> performance of sedes in my tests:
> 1. Use a smaller number of bytes to store the length of the column. For 
> example, if a bytearray is smaller than 255 bytes, a byte can be used to 
> store the length instead of the integer that is currently used.
> 2. Instead of custom code to do sedes on Strings, use DataOutput.writeUTF and 
> DataInput.readUTF. This reduces the cost of serialization by more than half. 
> Zebra and BinStorage are known to use DefaultTuple sedes functionality. The 
> serialization format that these loaders use cannot change, so after the 
> optimization their format is going to be different from the format used 
> between M/R boundaries.
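
A minimal sketch of the two optimizations listed in the description, using only java.io. The type tags `TINY_BYTES`, `SMALL_BYTES` and `BYTES` and all method names are hypothetical; the actual constants used by Pig are not given in this thread.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class LengthPrefixSketch {
    // Hypothetical type tags; each one also tells the reader how wide the length field is.
    static final byte TINY_BYTES = 1;
    static final byte SMALL_BYTES = 2;
    static final byte BYTES = 3;

    /** Optimization 1: pick the narrowest length field that can hold data.length. */
    static void writeBytes(DataOutputStream out, byte[] data) throws IOException {
        int len = data.length;
        if (len <= 255) {
            out.writeByte(TINY_BYTES);
            out.writeByte(len);            // read back with readUnsignedByte()
        } else if (len <= 65535) {
            out.writeByte(SMALL_BYTES);
            out.writeShort(len);           // read back with readUnsignedShort()
        } else {
            out.writeByte(BYTES);
            out.writeInt(len);
        }
        out.write(data);
    }

    /** Total encoded size of a byte array, tag and length field included. */
    static int encodedSize(byte[] data) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            writeBytes(out, data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size();
    }

    /** Optimization 2: DataOutput.writeUTF emits a 2-byte length plus modified UTF-8. */
    static int utfSize(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size();
    }

    public static void main(String[] args) {
        System.out.println("10-byte array  -> " + encodedSize(new byte[10]) + " bytes");
        System.out.println("300-byte array -> " + encodedSize(new byte[300]) + " bytes");
        System.out.println("\"abc\" via writeUTF -> " + utfSize("abc") + " bytes");
    }
}
```

Note that writeUTF can only encode strings whose modified UTF-8 form is at most 65535 bytes, so a real implementation would need a fallback for longer strings.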



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

2015-08-12 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694035#comment-14694035
 ] 

Rohini Palaniswamy commented on PIG-1472:
-

Thanks [~thejas]. Created PIG-4656 to move to WritableUtils.writeVInt. Where 
the type also denotes the size, I will keep it as is. For example, TUPLE_0 to 
TUPLE_9 will stay, as that packs type and size into one byte. But of 
TINYTUPLE, SMALLTUPLE and TUPLE, only TUPLE will be retained, with its size 
written using WritableUtils.writeVInt.
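
To illustrate why TUPLE_0 to TUPLE_9 are worth keeping, here is a sketch of a tuple header where small tuples fold the field count into the tag byte and larger ones use a single TUPLE tag followed by a variable-length count. The tag values are made up; only the names TUPLE_0, TUPLE_9 and TUPLE come from this thread, and `vintSize` is a standalone approximation of the zero-compressed size rule, not a Hadoop call.

```java
final class TupleHeaderSketch {
    // Hypothetical tag values: ten consecutive codes for tuples of 0..9 fields.
    static final byte TUPLE_0 = 30;
    static final byte TUPLE_9 = 39;
    static final byte TUPLE = 40;          // general tuple, field count follows

    /** Assumed size rule for a zero-compressed non-negative int. */
    static int vintSize(int v) {
        if (v <= 127) {
            return 1;
        }
        int dataBits = Integer.SIZE - Integer.numberOfLeadingZeros(v);
        return (dataBits + 7) / 8 + 1;     // data bytes plus one length byte
    }

    /** Tag byte for a tuple with the given (non-negative) number of fields. */
    static byte tagFor(int fieldCount) {
        return fieldCount <= 9 ? (byte) (TUPLE_0 + fieldCount) : TUPLE;
    }

    /** Field count recovered from a packed tag, or -1 if the count is stored separately. */
    static int fieldCountFromTag(byte tag) {
        return (tag >= TUPLE_0 && tag <= TUPLE_9) ? tag - TUPLE_0 : -1;
    }

    /** Header bytes: the tag alone for small tuples, tag plus count otherwise. */
    static int headerBytes(int fieldCount) {
        return fieldCount <= 9 ? 1 : 1 + vintSize(fieldCount);
    }

    public static void main(String[] args) {
        for (int n : new int[] {0, 3, 9, 10, 100, 300}) {
            System.out.println(n + " fields -> " + headerBytes(n) + " header byte(s)");
        }
    }
}
```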






[jira] [Commented] (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

2015-08-12 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14694026#comment-14694026
 ] 

Thejas M Nair commented on PIG-1472:


I don't remember whether I had looked into WritableUtils.writeVInt back then, 
or whether it was available with the Pig version being used at the time (it's 
been 5 years! :) )
Would using WritableUtils.writeVInt mean that an extra byte needs to be used 
for storing the type, i.e. bag vs. map vs. tuple?
For complex types, the savings are more noticeable for smaller sizes. For a 
bag of size 32768, a one-byte saving won't be significant. However, for an int 
with the value 32768, the saving of one byte is significant.







[jira] [Commented] (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

2015-08-11 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692416#comment-14692416
 ] 

Rohini Palaniswamy commented on PIG-1472:
-

With WritableUtils.writeVInt there is no difference for byte and short 
compared to the current approach. But with int it could be beneficial, since a 
lot of numbers could be written with 3 bytes instead of 4. For example, 32768 
is written using 3 bytes, whereas currently 4 bytes (int) are used. It would 
also simplify the code and reduce the number of types.
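
The 3-byte figure for 32768 can be reproduced with a standalone round-trip sketch. The encoder and decoder below follow the zero-compressed format as I understand it from Hadoop's WritableUtils (a header byte that either holds the value itself or encodes sign and byte count, followed by big-endian data bytes); they are an illustration written against java.io, not the Hadoop source, and the class and helper names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class VIntRoundTripSketch {

    static void writeVLong(DataOutput out, long i) throws IOException {
        if (i >= -112 && i <= 127) {
            out.writeByte((int) i);        // small values live in the header byte
            return;
        }
        int len = -112;
        if (i < 0) {
            i = ~i;                        // store one's complement of negatives
            len = -120;
        }
        for (long tmp = i; tmp != 0; tmp >>= 8) {
            len--;                         // one step per significant data byte
        }
        out.writeByte(len);
        int dataBytes = (len < -120) ? -(len + 120) : -(len + 112);
        for (int idx = dataBytes; idx != 0; idx--) {
            out.writeByte((int) ((i >> ((idx - 1) * 8)) & 0xFF));
        }
    }

    static long readVLong(DataInput in) throws IOException {
        byte first = in.readByte();
        if (first >= -112) {
            return first;                  // header byte is the value
        }
        boolean negative = first < -120;
        int dataBytes = negative ? -(first + 120) : -(first + 112);
        long v = 0;
        for (int k = 0; k < dataBytes; k++) {
            v = (v << 8) | (in.readByte() & 0xFF);
        }
        return negative ? ~v : v;
    }

    /** Convenience wrapper: encode one value to a byte array. */
    static byte[] encode(long value) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            writeVLong(out, value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Convenience wrapper: decode one value from a byte array. */
    static long decode(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            return readVLong(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (long v : new long[] {100L, 32768L, 1L << 24, -200L}) {
            System.out.println(v + " -> " + encode(v).length + " byte(s), decodes to " + decode(encode(v)));
        }
    }
}
```

Under this rule an int of 2^24 or more takes 5 bytes, one more than a plain writeInt, so the benefit depends on how the values are distributed.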






[jira] [Commented] (PIG-1472) Optimize serialization/deserialization between Map and Reduce and between MR jobs

2015-08-11 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14692394#comment-14692394
 ] 

Rohini Palaniswamy commented on PIG-1472:
-

[~thejas],
   Is there any reason multiple datatypes (TINY, SMALL) were introduced, with 
the size stored as a byte or short, instead of using 
org.apache.hadoop.io.WritableUtils.writeVInt, which Hadoop uses to store sizes 
efficiently with as few bytes as possible? If there is no particular reason to 
avoid it, I will create a jira to switch to that. 



