[
https://issues.apache.org/jira/browse/PIG-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641940#action_12641940
]
Santhosh Srinivasan commented on PIG-505:
-----------------------------------------
In the existing design, Pig treats unknown types as bytearrays. As a result, if
UDFs return complex types (map, bag, tuple) and if the outputSchema method does
not specify the schema of the complex type, Pig will treat the contents of the
complex type as bytearray. The exception to this rule is map. The contents of
map are always treated as bytearray. In the future, there are plans for
treating unknown types as unknown types.
Impact of UDFs creating bytearrays
----------------------------------
If a UDF creates bytearrays, and if the bytearray is used in comparisons or in
arithmetic operations or cast explicitly to a Pig type, the bytearray has to be
cast to the appropriate type. The knowledge of converting the bytearray to the
appropriate Pig type is best known to the UDF. In all other cases, we are
making best effort guesses at picking the appropriate (load) function to
convert bytearrays to Pig types.
While the above paragraph captures the problem of UDFs creating bytearrays and
their subsequent use, the view of treating unknown types as bytearrays leads to
another problem. The UDF could be returning the correct Pig type and not a
bytearray. In such cases, the treatment of the correct type as bytearray leads
to a failure at execution time: Pig expects DataByteArray but gets the correct
type (e.g., Integer, Double, ...), leading to a ClassCastException. This problem
has been addressed by capturing the exception, examining the type of the result
and returning the right type in POCast. In the situation where a bytearray is
returned by the UDF, Pig will never be able to convert the bytearray to the
appropriate type resulting in a valid run time exception.
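The recovery path described above can be sketched roughly as follows. This is
only an illustration and not the actual POCast code: ByteArrayValue is a
hypothetical stand-in for DataByteArray, and castToDouble is an assumed helper
name. It shows the cast failing for a UDF that returned a concrete type, the
catch block inspecting the real type, and a genuine bytearray still failing.

```java
// Hypothetical stand-in for Pig's DataByteArray; not the real class.
final class ByteArrayValue {
    final byte[] bytes;

    ByteArrayValue(byte[] bytes) {
        this.bytes = bytes;
    }
}

class CastRecoverySketch {
    // Converts a UDF result that the plan believes is a bytearray to a Double.
    static Double castToDouble(Object udfResult) {
        if (udfResult == null) {
            return null;
        }
        try {
            // The plan typed the value as bytearray, so cast accordingly.
            ByteArrayValue raw = (ByteArrayValue) udfResult;
            // A genuine bytearray from a UDF has no known load function to
            // decode it, so this is a legitimate run-time failure.
            throw new IllegalStateException(
                "cannot convert UDF bytearray of length " + raw.bytes.length);
        } catch (ClassCastException e) {
            // The UDF returned a concrete type; examine it and convert.
            if (udfResult instanceof Number) {
                return ((Number) udfResult).doubleValue();
            }
            throw e;
        }
    }

    public static void main(String[] args) {
        // An Integer result is recovered through the catch block.
        System.out.println(castToDouble(Integer.valueOf(3)));
    }
}
```

Note that the common case for a well-behaved UDF goes through the exception
handler, which is the cost mentioned in the comparison of the two approaches.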
With this background, the impact of treating unknown types as unknown types and
the impact of treating unknown types as bytearrays is listed below:
Impact of using unknown type as unknown type:
=============================================
Pros:
-------
1. UDFs that return bytearrays are affected at runtime, iff they use bytearrays
in contexts where a cast is required, i.e., arithmetic operations, comparisons
and explicit casts.
2. Lineage code is not impacted
3. Aligns with the strategy of handling Unknown types in the future
Cons:
---------
1. Finding the right match for UDFs does not handle unknown types
2. Impact of the change is huge. A getNext(Unknown) is required of every
operator, as the output of a UDF could be used anywhere an expression is
allowed in Pig
3. Cost of figuring out the right type at runtime is the cost of instanceof in
Java
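To make the instanceof cost in the last point concrete, here is a minimal
sketch of resolving an unknown value's type at run time. The PigType enum and
the resolve method are hypothetical names for illustration, and the mapping
from Java classes to Pig types is an assumption, not the actual Pig code.

```java
import java.util.Map;

class UnknownTypeSketch {
    // Hypothetical type tags; the real Pig type constants are not used here.
    enum PigType { INTEGER, LONG, FLOAT, DOUBLE, CHARARRAY, BYTEARRAY, MAP, UNKNOWN }

    // Inspects the run-time class of a value whose type was unknown at plan time.
    static PigType resolve(Object value) {
        if (value instanceof Integer) return PigType.INTEGER;
        if (value instanceof Long) return PigType.LONG;
        if (value instanceof Float) return PigType.FLOAT;
        if (value instanceof Double) return PigType.DOUBLE;
        if (value instanceof String) return PigType.CHARARRAY;
        if (value instanceof byte[]) return PigType.BYTEARRAY;
        if (value instanceof Map) return PigType.MAP;
        // Anything else (including null) stays unknown in this sketch.
        return PigType.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(resolve(Double.valueOf(2.5)));
        System.out.println(resolve("abc"));
    }
}
```

The per-value cost is a short chain of instanceof checks, with no exception
handling on the common path.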
Impact of using unknown type as bytearray:
==========================================
Pros:
-------
1. UDFs that return bytearrays are affected at runtime, iff they use bytearrays
in contexts where a cast is required, i.e., arithmetic operations, comparisons
and explicit casts.
2. Unknowns are treated as bytearrays, which is consistent with the
treatment of unknown types in the current design
3. Treatment of unknowns as bytearrays in POCast is already in place
Cons:
--------
1. Lineage code is impacted. Lineage for UDFs will have to trace the correct
load function. If there is a single load function then the choice is obvious.
If there is more than one load function to choose from, then pick one at
random. Expect the failure to occur at run time if a bytearray is returned by
the UDF. This is a hack as the UDF could create bytearrays that are not
recognized by the load functions.
2. Adds to the current practice of treating unknown types as bytearrays.
3. Cost of figuring out the right type is the cost of handling an exception in
Java
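The load function selection described in the first point could look roughly
like this. It is a sketch under assumptions: chooseLoadFunc is a hypothetical
helper, load functions are represented by plain names, and the candidate list
is assumed to have been collected already by the lineage code.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

class LoadFuncChoiceSketch {
    // Picks the load function used to convert a bytearray coming out of a UDF.
    static String chooseLoadFunc(List<String> candidates, Random rng) {
        if (candidates.isEmpty()) {
            throw new IllegalArgumentException("no load function found for the UDF's inputs");
        }
        if (candidates.size() == 1) {
            // A single load function: the choice is obvious.
            return candidates.get(0);
        }
        // Several load functions: pick one at random. If the UDF really
        // returns a bytearray this function cannot decode, the failure
        // only shows up at run time.
        return candidates.get(rng.nextInt(candidates.size()));
    }

    public static void main(String[] args) {
        Random rng = new Random();
        System.out.println(chooseLoadFunc(Arrays.asList("PigStorage"), rng));
        System.out.println(chooseLoadFunc(Arrays.asList("PigStorage", "BinStorage"), rng));
    }
}
```

The random pick is what makes this a hack: nothing ties the chosen load
function to the bytes the UDF actually produced.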
We need to pick an approach based on whether we want a short-term or a
long-term solution. Comments/questions/thoughts are welcome.
> Lineage for UDFs that do not return bytearray
> ---------------------------------------------
>
> Key: PIG-505
> URL: https://issues.apache.org/jira/browse/PIG-505
> Project: Pig
> Issue Type: Bug
> Affects Versions: types_branch
> Reporter: Santhosh Srinivasan
> Assignee: Santhosh Srinivasan
> Fix For: types_branch
>
>
> In PIG-335, the lineage design states that UDFs that return bytearrays could
> cause problems in tracing the lineage. For UDFs that do not return bytearray,
> the lineage design should pick up the right load function to use as long as
> there is no ambiguity. In the current implementation, we could have issues
> with scripts like:
> {code}
> a = load 'input' as (field1);
> b = foreach a generate myudf_to_double(field1);
> c = foreach b generate $0 + 2.0;
> {code}
> When $0 has to be cast to a double, the lineage code will complain that it
> hit a UDF and hence cannot determine the right load function to use.