Avoid serialization/deserialization costs for PigStorage data - Use custom Map
and Bag implementation
-----------------------------------------------------------------------------------------------------
Key: PIG-1473
URL: https://issues.apache.org/jira/browse/PIG-1473
Project: Pig
Issue Type: Improvement
Affects Versions: 0.8.0
Reporter: Thejas M Nair
Fix For: 0.8.0
The cost of serialization/deserialization (sedes) can be very high, and
avoiding it will improve performance.
Avoid sedes where possible by implementing approach #3 proposed in
http://wiki.apache.org/pig/AvoidingSedes.
The load function uses subclasses of Map and DataBag that hold the serialized
copy of the data. The load function delays deserialization of map and bag
types until a member function of java.util.Map or DataBag is called.
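The lazy-deserialization idea described above could look roughly like the
following sketch. It is only an illustration, not the implementation in Pig:
the class name LazyMap, its methods getSerialized() and isDeserialized(), and
the simplified "[key#value,key#value]" text parsing (no nesting, no escaping,
all values kept as strings) are assumptions made for this example. The map is
read-only because it builds on java.util.AbstractMap.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical class, not part of Pig: a Map that keeps the raw bytes read by
// the load function and parses them only when a Map method needs the entries.
final class LazyMap extends AbstractMap<String, Object> {
    private final byte[] serialized;     // serialized copy held by the loader
    private Map<String, Object> parsed;  // stays null until first access

    LazyMap(byte[] serialized) {
        this.serialized = serialized;
    }

    // Lets a writer emit the field as-is without ever parsing it.
    byte[] getSerialized() {
        return serialized;
    }

    boolean isDeserialized() {
        return parsed != null;
    }

    // AbstractMap implements get(), size(), containsKey(), etc. on top of
    // entrySet(), so every read goes through this single lazy entry point.
    @Override
    public Set<Map.Entry<String, Object>> entrySet() {
        if (parsed == null) {
            parsed = parse(serialized);
        }
        return parsed.entrySet();
    }

    // Simplified parser (assumption): flat "[k1#v1,k2#v2]" text, string values.
    private static Map<String, Object> parse(byte[] raw) {
        String text = new String(raw, StandardCharsets.UTF_8).trim();
        Map<String, Object> result = new LinkedHashMap<>();
        if (text.length() < 2) {
            return result;
        }
        String body = text.substring(1, text.length() - 1);
        if (body.isEmpty()) {
            return result;
        }
        for (String pair : body.split(",")) {
            int hash = pair.indexOf('#');
            if (hash < 0) {
                continue; // skip malformed entries in this sketch
            }
            result.put(pair.substring(0, hash), pair.substring(hash + 1));
        }
        return result;
    }

    public static void main(String[] args) {
        LazyMap b = new LazyMap("[k1#v1,k2#v2]".getBytes(StandardCharsets.UTF_8));
        System.out.println(b.isDeserialized()); // no Map method called yet
        System.out.println(b.get("k1"));        // first access triggers parsing
        System.out.println(b.isDeserialized());
    }
}
```

If a field wrapped this way is only passed through the pipeline, a store
function could write getSerialized() directly and the parsing step would
never run. A lazy DataBag subclass would follow the same pattern.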
Example of a query where this will help:
{CODE}
l = LOAD 'file1' AS (a : int, b : map [ ]);
f = FOREACH l GENERATE udf1(a), b;
fil = FILTER f BY $0 > 5;
dump fil; -- Deserialization of column b can be delayed until here using this
          -- approach.
{CODE}