[ https://issues.apache.org/jira/browse/PIG-2234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13127315#comment-13127315 ]
Thejas M Nair commented on PIG-2234:
------------------------------------
Yes, adding support for bags of primitive types would break a lot of udfs and
load/store functions that make assumptions about bag structure. So it's going
to be a major disruptive change. But I think it is useful enough to think of
ways to do it in a less disruptive way (supporting multiple udf/load/store
interfaces?).
There are a few ways of supporting this without adding support for a
full-fledged new bag type:
1. The init and intermediate calls return primitive types. Pig then wraps each
primitive value in a tuple and adds the tuples to a bag before the bag is given
as input to the udf intermediate or final calls. This will save on the tuple
objects being created in the map task and serialized at the combiner boundary.
Udfs will not have to be exposed to a new type. My concern is that the
conditional adding of an extra tuple is inconsistent (but a similar change was
committed yesterday for the TOBAG udf).
2. The PrimitiveBag approach you suggested above. This approach will mean
exposing udfs to a new bag type. This would introduce a data type that exists
only for EvalFunc (the Algebraic interface specifically). A udf-specific data
type can also be confusing.
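To make option 1 concrete, here is a rough sketch of the data flow for a
COUNT-like udf. It deliberately does not use Pig's classes: a "tuple" is
modelled as a List<Object> and a "bag" as a list of such tuples, and the names
WrapSketch, initial, wrapInBag and combine are made up for illustration. It is
only meant to show where the wrapping would happen, not how Pig would
implement it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins, not Pig classes: a "tuple" is a List<Object>
// and a "bag" is a List of such tuples.
public class WrapSketch {

    // Udf side (option 1): the initial stage returns a bare primitive,
    // here a per-record count of 1, instead of a single-field tuple.
    static Long initial(Object record) {
        return 1L;
    }

    // Framework side: wrap each primitive in a single-field tuple and
    // collect the tuples into a bag, so the next stage still receives
    // a bag of tuples.
    static List<List<Object>> wrapInBag(List<Long> partials) {
        List<List<Object>> bag = new ArrayList<>();
        for (Long partial : partials) {
            List<Object> tuple = new ArrayList<>();
            tuple.add(partial);
            bag.add(tuple);
        }
        return bag;
    }

    // Udf side: the intermediate/final stage reads field 0 of each tuple,
    // so it is not exposed to any new type.
    static Long combine(List<List<Object>> bag) {
        long sum = 0;
        for (List<Object> tuple : bag) {
            sum += (Long) tuple.get(0);
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Long> partials = new ArrayList<>();
        for (String record : new String[] {"a", "b", "c"}) {
            partials.add(initial(record));
        }
        // Wrapping happens only just before the next stage is called.
        System.out.println(combine(wrapInBag(partials)));
    }
}
```

In this sketch the tuples exist only on the receiving side, which is where the
saving in map-side object creation and serialization would come from.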
I think we should give some thought to adding support for bags of primitive
types before investing in these alternatives.
> Algebraic udf Init and Intermediate functions should be able to return non
> tuple data types
> ------------------------------------------------------------------------------------------
>
> Key: PIG-2234
> URL: https://issues.apache.org/jira/browse/PIG-2234
> Project: Pig
> Issue Type: Improvement
> Affects Versions: 0.8.0, 0.9.0
> Reporter: Thejas M Nair
>
> The exec() calls of the Algebraic UDF Initial and Intermediate classes are
> required to return a Tuple. This has been done because the output is
> collected in a DataBag and passed to Intermediate.exec() and Final.exec()
> calls, and a DataBag in Pig needs to contain Tuples. But this results in
> additional Tuple objects getting created and also adds additional
> (de)serialization costs. Functions such as COUNT and SUM also have to wrap
> the initial and intermediate results in Tuples.
> The Algebraic interface needs to change to reduce the costs for udfs that
> don't need an intermediate tuple.