[ 
https://issues.apache.org/jira/browse/PIG-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127315#comment-13127315
 ] 

Thejas M Nair commented on PIG-2234:
------------------------------------

Yes, adding support for bags of primitive types would break a lot of udfs and 
load/store functions that make assumptions about bag structure, so it is going 
to be a major, disruptive change. But I think it is useful enough to think of 
ways to do it less disruptively (supporting multiple udf/load/store 
interfaces?).

There are a few ways of supporting this without adding a full-fledged new bag 
type:
1. The init and intermediate calls return primitive types. Pig then wraps each 
primitive value in a tuple and adds the tuples to a bag before the bag is given 
as input to the udf intermediate or final calls. This will save on the tuple 
objects being created in the map task and serialized at the combiner boundary. 
Udfs will not have to be exposed to a new type. My concern is that 
conditionally adding the extra tuple is inconsistent (but a similar change was 
committed yesterday for the TOBAG udf).

2. The PrimitiveBag approach you suggested above. This approach will mean 
exposing udfs to a new bag type. It would introduce a data type that exists 
only for EvalFunc (specifically the Algebraic interface), and a udf-specific 
data type can also be confusing.
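
To make option 1 concrete, here is a rough sketch of the wrapping step. It is 
not Pig code: a "tuple" is modeled as a plain List<Object> and a "bag" as a 
list of such tuples, and the class and method names (PrimitiveWrapSketch, 
initialCount, wrapInBag, sumBag) are hypothetical names made up for 
illustration. It only shows that the initial stage can hand back a bare Long 
while the later stage still receives a bag of single-field tuples.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stand-ins only: these are NOT Pig's Tuple/DataBag classes.
final class PrimitiveWrapSketch {

    // Hypothetical initial stage for a COUNT-like udf: returns a bare Long
    // per input record instead of a one-field tuple.
    static Long initialCount(Object record) {
        return record == null ? 0L : 1L;
    }

    // Framework-side step from option 1: wrap each primitive in a
    // single-field tuple and collect the tuples into a bag.
    static List<List<Object>> wrapInBag(List<Long> partials) {
        List<List<Object>> bag = new ArrayList<>();
        for (Long partial : partials) {
            List<Object> tuple = new ArrayList<>();
            tuple.add(partial);
            bag.add(tuple);
        }
        return bag;
    }

    // Intermediate/final stage: still consumes a bag of tuples, so the udf
    // is not exposed to a new type.
    static long sumBag(List<List<Object>> bag) {
        long total = 0;
        for (List<Object> tuple : bag) {
            total += (Long) tuple.get(0);
        }
        return total;
    }

    public static void main(String[] args) {
        List<Long> partials = new ArrayList<>();
        for (Object record : Arrays.asList("a", "b", "c")) {
            partials.add(initialCount(record));
        }
        System.out.println(sumBag(wrapInBag(partials)));
    }
}
```

In this model the tuples are only created where the bag is assembled, which is 
the saving described in option 1: nothing tuple-shaped has to be created in 
the map task or serialized at the combiner boundary.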

I think we should give some thought to adding support for bags of primitive 
types before investing in these alternatives.
                
> Alebraic udf Init and Intermediate functions should be able to return non 
> tuple data types
> ------------------------------------------------------------------------------------------
>
>                 Key: PIG-2234
>                 URL: https://issues.apache.org/jira/browse/PIG-2234
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0, 0.9.0
>            Reporter: Thejas M Nair
>
> The exec() call to Algebraic UDF initial and intermediate classes are 
> required to return a Tuple. This has been done because the output is 
> collected in a DataBag and passed to Intermediate.exec() and Final.exec() 
> calls, and DataBag in pig needs to contain a Tuple. But this results in 
> additional Tuple objects getting created and also adds additional 
> (de)serialization costs. Functions such as COUNT, SUM are also having to wrap 
> the initial and intermediate results in Tuples.
> The Algebraic interface needs to change to reduce the costs for udfs that 
> don't need an intermediate tuple.


        
