[ 
https://issues.apache.org/jira/browse/PIG-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12641940#action_12641940
 ] 

Santhosh Srinivasan commented on PIG-505:
-----------------------------------------

In the existing design, Pig treats unknown types as bytearrays. As a result, if 
a UDF returns a complex type (map, bag, tuple) and its outputSchema method does 
not specify the schema of that complex type, Pig will treat the contents of the 
complex type as bytearray. The exception to this rule is map: the contents of a 
map are always treated as bytearray. In the future, there are plans for 
treating unknown types as unknown types.

Impact of UDFs creating bytearrays
------------------------------------------------

If a UDF creates a bytearray, and that bytearray is used in a comparison, in an 
arithmetic operation or in an explicit cast to a Pig type, the bytearray has to 
be converted to the appropriate type. Only the UDF really knows how to convert 
the bytearray to the appropriate Pig type. In all other cases, we are 
making best effort guesses at picking the appropriate (load) function to 
convert bytearrays to Pig types.

While the above paragraph captures the problem of UDFs creating bytearrays and 
their subsequent use, the view of treating unknown types as bytearrays leads to 
another problem. The UDF could be returning the correct Pig type and not a 
bytearray. In such cases, treating the correct type as a bytearray leads to a 
failure at execution time: Pig expects a DataByteArray but gets the actual 
type (e.g., Integer, Double, ...), leading to a ClassCastException. This problem 
has been addressed in POCast by catching the exception, examining the type of 
the result and returning the right type. In the situation where a bytearray is 
returned by the UDF, Pig will never be able to convert the bytearray to the 
appropriate type, resulting in a valid run time exception.
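To make the POCast behaviour described above concrete, here is a minimal 
sketch in plain Java. It is not the actual POCast code: ByteArrayValue is a 
hypothetical stand-in for Pig's DataByteArray, and CastOnExceptionSketch and 
castToDouble are names invented for this illustration. It only shows the 
pattern of assuming a bytearray first and recovering from the 
ClassCastException by looking at the real runtime type.

```java
// Hypothetical stand-in for Pig's DataByteArray; not the real class.
final class ByteArrayValue {
    final byte[] bytes;

    ByteArrayValue(byte[] bytes) {
        this.bytes = bytes;
    }
}

// Illustrative only: mimics the "catch the exception, examine the type"
// pattern described in the comment, outside of any real Pig operator.
final class CastOnExceptionSketch {

    static Double castToDouble(Object udfResult) {
        if (udfResult == null) {
            return null;
        }
        try {
            // The plan says "bytearray", so first assume that is what we got.
            ByteArrayValue raw = (ByteArrayValue) udfResult;
            // A genuine UDF-created bytearray: no load function is known
            // that could interpret these bytes, so this is a valid failure.
            throw new IllegalStateException(
                "cannot convert a UDF-created bytearray of "
                    + raw.bytes.length + " bytes: no load function is known");
        } catch (ClassCastException e) {
            // The UDF actually returned a proper type; use it directly.
            if (udfResult instanceof Number) {
                return ((Number) udfResult).doubleValue();
            }
            throw new IllegalStateException(
                "cannot cast " + udfResult.getClass().getName() + " to double", e);
        }
    }
}
```

With this sketch, castToDouble(Integer.valueOf(3)) goes through the catch 
branch and yields 3.0, while passing a ByteArrayValue ends in the run time 
exception, which matches the two cases distinguished above.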

With this background, the impact of treating unknown types as unknown types and 
the impact of treating unknown types as bytearrays is listed below:

Impact of using unknown type as unknown type:
====================================

Pros:
-------
1. UDFs that return bytearrays are affected at runtime, iff they use bytearrays 
in contexts where a cast is required, i.e., arithmetic operations, comparisons 
and explicit casts.

2. Lineage code is not impacted

3. Aligns with the strategy of handling Unknown types in the future

Cons:
---------
1. Finding the right match for UDFs does not handle unknown types

2. Impact of the change is huge. A getNext(Unknown) would be required of every 
operator, as the output of a UDF could be used anywhere an expression is 
allowed in Pig

3. Cost of figuring out the right type at runtime is the cost of instanceof in 
Java
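
As a rough illustration of Con 3 above, the runtime type discovery could look 
like the following plain Java sketch. The class and method names are invented 
for this example, and byte[] stands in for Pig's bytearray class; the real 
change would live in each operator rather than in one helper. The point is 
only that the cost is a short chain of instanceof checks.

```java
// Illustrative only: maps the runtime class of a UDF result to a Pig type
// name using instanceof checks. Not part of Pig; names are hypothetical.
final class UnknownTypeSketch {

    static String pigTypeOf(Object value) {
        if (value == null) {
            return "null";
        }
        if (value instanceof Integer) {
            return "int";
        }
        if (value instanceof Long) {
            return "long";
        }
        if (value instanceof Float) {
            return "float";
        }
        if (value instanceof Double) {
            return "double";
        }
        if (value instanceof String) {
            return "chararray";
        }
        if (value instanceof byte[]) {
            // byte[] stands in for a bytearray in this sketch.
            return "bytearray";
        }
        // Complex types (map, bag, tuple) are left out of this sketch.
        return "unknown";
    }
}
```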

Impact of using unknown type as bytearray:
=================================

Pros:
-------

1. UDFs that return bytearrays are affected at runtime, iff they use bytearrays 
in contexts where a cast is required, i.e., arithmetic operations, comparisons 
and explicit casts.

2. Unknowns are treated as bytearrays, which is consistent with the treatment 
of unknown types in the current design

3. Treatment of unknowns as bytearrays in POCast is already in place

Cons:
--------

1. Lineage code is impacted. Lineage for UDFs will have to trace the correct 
load function. If there is a single load function then the choice is obvious. 
If there is more than one load function to choose from, then one is picked at 
random. Expect the failure to occur at run time if a bytearray is returned by 
the UDF. This is a hack, as the UDF could create bytearrays that are not 
recognized by any of the load functions.

2. Adds to the current view of treating unknown types as bytearrays, which the 
future plan for unknown types moves away from.

3. Cost of figuring out the right type is the cost of handling an exception in 
Java
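
The load function choice described in Con 1 above can be sketched in plain 
Java as follows. This is an assumption-laden illustration, not the lineage 
code: load functions are represented by their names only, and 
LoadFuncChoiceSketch and pick are hypothetical names.

```java
import java.util.List;
import java.util.Optional;
import java.util.Random;

// Illustrative only: chooses which load function should convert a bytearray
// that passed through a UDF. Not part of Pig; names are hypothetical.
final class LoadFuncChoiceSketch {

    static Optional<String> pick(List<String> candidates, Random rng) {
        if (candidates.isEmpty()) {
            // No load function in the lineage: nothing to pick.
            return Optional.empty();
        }
        if (candidates.size() == 1) {
            // A single load function: the choice is obvious.
            return Optional.of(candidates.get(0));
        }
        // Several load functions: pick one at random (the hack). A wrong
        // pick only shows up at run time, and only if the UDF really
        // returned a bytearray.
        return Optional.of(candidates.get(rng.nextInt(candidates.size())));
    }
}
```

For a script with one load statement the result is deterministic. With two 
or more, repeated runs may pick different functions, which is part of why 
this is listed as a con.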

We need to pick an approach based on whether we want a short term or a long 
term solution. Comments/questions/thoughts are welcome.

> Lineage for UDFs that do not return bytearray
> ---------------------------------------------
>
>                 Key: PIG-505
>                 URL: https://issues.apache.org/jira/browse/PIG-505
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Santhosh Srinivasan
>            Assignee: Santhosh Srinivasan
>             Fix For: types_branch
>
>
> In PIG-335, the lineage design states that UDFs that return bytearrays could 
> cause problems in tracing the lineage. For UDFs that do not return bytearray, 
> the lineage design should pick up the right load function to use as long as 
> there is no ambiguity. In the current implementation, we could have issues 
> with scripts like:
> {code}
> a = load 'input' as (field1);
> b = foreach a generate myudf_to_double(field1);
> c =  foreach b generate $0 + 2.0;
> {code}
> When $0 has to be cast to a double, the lineage code will complain that it 
> hit a UDF and hence cannot determine the right load function to use.

