[jira] Commented: (PIG-505) Lineage for UDFs that do not return bytearray

Alan Gates (JIRA) Thu, 23 Oct 2008 10:25:06 -0700

    [ 
https://issues.apache.org/jira/browse/PIG-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12642206#action_12642206
 ]


Alan Gates commented on PIG-505:
--------------------------------

A couple of comments:

You say that the long term plan is to have a true unknown type, and then pose 
the problem as do we want to start the switch now or later.  (In fairness, I'm 
pretty sure you're quoting something I said here, so I'm about to question my 
own statement.)  I don't know if that's true or not.  In the original design 
for types we had an unknown type.  We ended up dropping it in implementation 
because it turned out to be so similar to the bytearray.  While there is some 
cost to combining byte arrays with unknown types (as you lay out), I'm not sure 
that means we should separate the two.  The long term cost of 
maintainability may be greater.

I'm a bit confused by the first con of continuing to use byte arrays as unknowns.  
Are you saying that if we do this, in the case where there is only one load 
function in the script, after a UDF returns what is really a byte array, we'll 
use the cast from that load function?  I'm not certain what the right course is 
here.  From a correctness viewpoint, we can argue that Pig doesn't know whether 
that byte array is from the load function or from the UDF.  However, this is a 
little burdensome to the user because it means any byte arrays inside complex 
types have to be dealt with before going to a UDF.  The con of using the load 
function where possible is that if the byte array really is from the UDF and not 
the load function, we may error out or, worse, silently produce wrong data.  Since 
silently producing wrong data is a mortal sin in data processing, I'd come down 
on the side of not using the load function's cast here.

A question: if we allow unknown in this one case, do we truly have to change 
code everywhere?  Instead of adding a getNext(unknown) to all operators, could 
we instead add a CastFromUnknown operator?  The entry point would still be 
getNext(ByteArray), so from all outside code's viewpoint the current type 
system should remain untouched.  And this operator would be written to 
introspect the type of object it got and either pass it on as is if it's the 
right type or cast it to the right type if it can.  It would never use a load 
function's cast (assuming we choose as indicated above), and it wouldn't incur 
the cost of throwing and catching an exception on the cast; it could use 
instanceof instead (which should be much faster).
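
A minimal sketch of the introspection step such an operator might perform, 
for the case where the target type is double.  Everything here is an 
assumption made for illustration: the class name CastFromUnknownSketch and 
the method toDouble are hypothetical, plain Java objects stand in for Pig's 
data types, and none of this is Pig's actual operator API.  A raw byte[] is 
rejected rather than decoded, since decoding it would require a load 
function's cast.

```java
/** Hypothetical sketch only; not part of Pig. */
final class CastFromUnknownSketch {
    private CastFromUnknownSketch() {}

    /**
     * Inspects the runtime type of a value of unknown type and returns it
     * as a Double. Uses instanceof checks instead of attempting a cast and
     * catching ClassCastException.
     */
    static Double toDouble(Object value) {
        if (value == null) {
            return null;                      // nulls pass through unchanged
        }
        if (value instanceof Double) {
            return (Double) value;            // already the right type: pass on as is
        }
        if (value instanceof Number) {
            return ((Number) value).doubleValue();   // other numeric types
        }
        if (value instanceof String) {
            // Throws NumberFormatException if the text is not a number.
            return Double.valueOf(((String) value).trim());
        }
        // Anything else, including a raw byte[], cannot be interpreted
        // without a load function, so fail loudly instead of guessing.
        throw new IllegalArgumentException(
            "Cannot cast " + value.getClass().getName() + " to double");
    }
}
```

Failing loudly on an uninterpretable value is a choice made in this sketch to 
avoid silently producing wrong data; a real operator could equally return 
null and log a warning.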

> Lineage for UDFs that do not return bytearray
> ---------------------------------------------
>
>                 Key: PIG-505
>                 URL: https://issues.apache.org/jira/browse/PIG-505
>             Project: Pig
>          Issue Type: Bug
>    Affects Versions: types_branch
>            Reporter: Santhosh Srinivasan
>            Assignee: Santhosh Srinivasan
>             Fix For: types_branch
>
>
> In PIG-335, the lineage design states that UDFs that return bytearrays could 
> cause problems in tracing the lineage. For UDFs that do not return bytearray, 
> the lineage design should pick up the right load function to use as long as 
> there is no ambiguity.  In the current implementation, we could have issues 
> with scripts like:
> {code}
> a = load 'input' as (field1);
> b = foreach a generate myudf_to_double(field1);
> c =  foreach b generate $0 + 2.0;
> {code}
> When $0 has to be cast to a double, the lineage code will complain that it 
> hit a UDF and hence cannot determine the right load function to use.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
