[ https://issues.apache.org/jira/browse/PIG-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12883382#action_12883382 ]
Jeff Zhang commented on PIG-1473:
---------------------------------
This sounds like the lazy deserialization in Hive. Great!
> Avoid serialization/deserialization costs for PigStorage data - Use custom
> Map and Bag implementation
> -----------------------------------------------------------------------------------------------------
>
> Key: PIG-1473
> URL: https://issues.apache.org/jira/browse/PIG-1473
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.8.0
> Reporter: Thejas M Nair
> Assignee: Thejas M Nair
> Fix For: 0.8.0
>
>
> The cost of serialization/deserialization (sedes) can be very high, and
> avoiding it will improve performance.
> Avoid sedes when possible by implementing approach #3 proposed in
> http://wiki.apache.org/pig/AvoidingSedes .
> The load function uses subclasses of Map and DataBag that hold the serialized
> copy. The load function delays deserialization of map and bag types until a
> member function of java.util.Map or DataBag is called.
> Example of a query where this will help:
> {CODE}
> l = LOAD 'file1' AS (a : int, b : map [ ]);
> f = FOREACH l GENERATE udf1(a), b;
> fil = FILTER f BY $0 > 5;
> dump fil; -- Deserialization of column b can be delayed until here using this approach.
> {CODE}
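
To make the idea in the description concrete, here is a minimal Java sketch of a Map subclass that keeps the serialized bytes and parses them only when a java.util.Map method is first called. It is an illustration under stated assumptions, not the code from the PIG-1473 patch: the class name LazyTextMap and the methods getSerialized() and isDeserialized() are hypothetical, and the parser assumes a simplified [key#value,key#value] text form with no nesting or escaping.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration of lazy deserialization; not a class from Pig.
class LazyTextMap extends AbstractMap<String, Object> {
    private final byte[] serialized;     // raw bytes exactly as read from the input
    private Map<String, Object> parsed;  // stays null until a Map method needs it

    LazyTextMap(byte[] serialized) {
        this.serialized = serialized;
    }

    // Lets a writer pass the original bytes through without ever parsing them.
    byte[] getSerialized() {
        return serialized;
    }

    // True once the bytes have been parsed into key/value pairs.
    boolean isDeserialized() {
        return parsed != null;
    }

    // Parses the bytes on first use and caches the result.
    private Map<String, Object> materialize() {
        if (parsed == null) {
            Map<String, Object> m = new LinkedHashMap<>();
            String text = new String(serialized, StandardCharsets.UTF_8).trim();
            // Assumed format: [key#value,key#value]; nesting and escaping are ignored.
            if (text.length() > 2) {
                String body = text.substring(1, text.length() - 1);
                for (String pair : body.split(",")) {
                    int hash = pair.indexOf('#');
                    if (hash < 0) {
                        continue; // skip malformed pairs in this sketch
                    }
                    m.put(pair.substring(0, hash), pair.substring(hash + 1));
                }
            }
            // Read-only view, so the cached map cannot drift from the raw bytes.
            parsed = Collections.unmodifiableMap(m);
        }
        return parsed;
    }

    @Override
    public Set<Map.Entry<String, Object>> entrySet() {
        return materialize().entrySet();
    }

    @Override
    public Object get(Object key) {
        return materialize().get(key);
    }
}
```

In the query above, column b would be wrapped in such an object at load time. Since neither udf1 nor the FILTER touches b, no Map method is called on it and the parse cost is not paid for those steps; a storer could write getSerialized() back out directly.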