[ https://issues.apache.org/jira/browse/PIG-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127315#comment-13127315 ]
Thejas M Nair commented on PIG-2234:
------------------------------------

Yes, adding support for bags of primitive type would break a lot of udfs and load/store functions that make assumptions about bag structure. So it's going to be a major disruptive change. But I think it is useful enough to think of ways to do it in a less disruptive way (supporting multiple udf/load/store interfaces?).

There are a few ways of supporting this without adding support for a full-fledged new bag type:

1. The init and intermediate calls return primitive types. Pig then wraps each primitive value in a tuple and adds the tuples to a bag before the bag is given as input to the udf intermediate or final calls. This will save on the tuple objects being created in the map task and serialized at the combiner boundary. Udfs will not have to be exposed to a new type. My concern is that conditionally adding the extra tuple is inconsistent (but a similar change was committed yesterday for the TOBAG udf).

2. The PrimitiveBag approach you suggested above. This approach will mean exposing udfs to a new bag type. It would introduce a data type used only by EvalFunc (specifically, the Algebraic interface). A udf-specific data type can also be confusing.

I think we should give some thought to adding support for bags of primitive type before investing in these alternatives.

> Algebraic udf Init and Intermediate functions should be able to return non
> tuple data types
> ------------------------------------------------------------------------------------------
>
>                 Key: PIG-2234
>                 URL: https://issues.apache.org/jira/browse/PIG-2234
>             Project: Pig
>          Issue Type: Improvement
>    Affects Versions: 0.8.0, 0.9.0
>            Reporter: Thejas M Nair
>
> The exec() calls of the Algebraic UDF initial and intermediate classes are
> required to return a Tuple. This has been done because the output is
> collected in a DataBag and passed to Intermediate.exec() and Final.exec()
> calls, and a DataBag in pig needs to contain Tuples.
> But this results in additional Tuple objects getting created and also adds
> additional (de)serialization costs. Functions such as COUNT and SUM also
> have to wrap the initial and intermediate results in Tuples.
> The Algebraic interface needs to change to reduce the costs for udfs that
> don't need an intermediate tuple.
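
For illustration, option 1 from the comment (init returns a primitive, and the framework wraps it in a tuple before the intermediate/final call sees it) could be sketched roughly as below. This is not Pig code and does not use Pig's API: OneFieldTuple, initial, wrapIntoBag and combine are hypothetical stand-ins, and a COUNT-like udf is assumed purely as an example.

```java
import java.util.ArrayList;
import java.util.List;

public class PrimitiveWrapSketch {

    /** Hypothetical stand-in for a one-field tuple (not Pig's Tuple class). */
    static final class OneFieldTuple {
        final Object field;

        OneFieldTuple(Object field) {
            this.field = field;
        }
    }

    /** Initial stage of a COUNT-like udf: returns a bare Long, no tuple allocated here. */
    static Long initial(Object record) {
        return record == null ? 0L : 1L;
    }

    /**
     * Framework-side step assumed by option 1: wrap each primitive partial result
     * in a tuple and collect the tuples into a bag (a plain List in this sketch).
     */
    static List<OneFieldTuple> wrapIntoBag(List<Long> partials) {
        List<OneFieldTuple> bag = new ArrayList<>();
        for (Long partial : partials) {
            bag.add(new OneFieldTuple(partial));
        }
        return bag;
    }

    /** Intermediate/final stage: still receives a bag of tuples, so udf code is unchanged. */
    static Long combine(List<OneFieldTuple> bag) {
        long total = 0L;
        for (OneFieldTuple tuple : bag) {
            total += (Long) tuple.field;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Long> partials = new ArrayList<>();
        for (Object record : new Object[] {"a", "b", null, "c"}) {
            partials.add(initial(record));
        }
        // Three non-null records were counted.
        System.out.println(combine(wrapIntoBag(partials))); // prints 3
    }
}
```

In this sketch the wrapping happens only just before the combine step, which is where the saving in map-side tuple creation and serialization described in option 1 would come from.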