RE: Semantic cleanup: How to adding two bytearray

Olga Natkovich Fri, 14 Jan 2011 13:12:56 -0800

I think the tradeoff between fully dynamic and static typing is between 
convenience (why should I tell you what the type is if the data is properly 
typed?) on one side, and type safety (what if your data has invalid values?) 
and performance (dynamic typing would be slower) on the other.


My vote is for static typing because I believe the type-safety (and clear 
schema definition) and performance are more important.

Olga

-----Original Message-----
From: Daniel Dai [mailto:jiany...@yahoo-inc.com] 
Sent: Friday, January 14, 2011 12:12 PM
To: dev@pig.apache.org
Subject: Re: Semantic cleanup: How to adding two bytearray

Runtime detection can be done row by row. This will solve the problem in 
your sample, though it costs a little performance.
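
To make the row-by-row idea concrete, here is a minimal Java sketch. It is 
not Pig's POAdd; the class name RuntimeAddSketch and the helper add() are 
hypothetical, and the promotion rule (Integer + Integer stays Integer, 
anything else numeric becomes Double) is an assumption for illustration only.

```java
public class RuntimeAddSketch {
    // Hypothetical helper (not Pig's API): choose the addition by
    // inspecting the runtime type of each operand, one row at a time.
    static Number add(Object left, Object right) {
        if (left instanceof Integer && right instanceof Integer) {
            // Both operands are integers, so the result stays an Integer.
            return Integer.valueOf((Integer) left + (Integer) right);
        }
        if (left instanceof Number && right instanceof Number) {
            // Assumed promotion rule: any other numeric mix becomes Double.
            return Double.valueOf(((Number) left).doubleValue()
                    + ((Number) right).doubleValue());
        }
        throw new IllegalArgumentException(
                "cannot add " + left + " and " + right);
    }

    public static void main(String[] args) {
        // The two rows from the 1.txt example in this thread.
        Object[][] rows = { {1, 2}, {1.1, 2.2} };
        for (Object[] row : rows) {
            Number sum = add(row[0], row[1]);
            // Print the runtime type next to the value.
            System.out.println(sum.getClass().getSimpleName() + " " + sum);
        }
    }
}
```

With these two rows the first sum comes back as an Integer and the second as 
a Double, so the output type of a0+a1 varies from row to row. The instanceof 
checks on every call are the per-row cost mentioned above.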

Requiring a cast before adding is also clean. However, this would break 
backward compatibility.

Dmitriy Ryaboy wrote:
> How is runtime detection done? I worry that if 1.txt contains:
> 1, 2
> 1.1, 2.2
>
> We get into a situation where addition of the fields in the first tuple
> produces integers, and adding the fields of the second tuple produces
> doubles.
>
> A more invasive but perhaps easier to reason about solution might be to be
> stricter about types, and require bytearrays to be cast to whatever type
> they are supposed to be if you want to add / delete / do non-byte-things to
> them.
>
> This is a problem if UDFs that output tuples or bags don't specify schemas
> (and specifying schemas of tuples and bags is fairly onerous right now in
> Java). I am not sure what the solution here is, other than finding a clean,
> less onerous way of declaring schemas, fixing up everything in builtin and
> piggybank to only use the new clean sparkly api and document the heck out of
> it.
>
> D
>
> On Thu, Jan 13, 2011 at 8:58 PM, Daniel Dai <jiany...@yahoo-inc.com> wrote:
>
>   
>> One goal of the ongoing semantic cleanup work is to clarify the usage of
>> the unknown type.
>>
>> In Pig's schema system, users can define an output schema for a
>> LoadFunc/EvalFunc. Pig will propagate that schema through the entire script.
>> Defining a schema for a LoadFunc/EvalFunc is optional. If the user doesn't
>> define a schema, Pig will mark the fields as bytearray. However, at runtime
>> the user can feed any data type in. Previously, Pig assumed the runtime type
>> for bytearray was DataByteArray, which gave rise to several issues
>> (PIG-1277, PIG-999, PIG-1016).
>>
>> In 0.9, Pig will treat bytearray as an unknown type. Pig will inspect the
>> object to figure out what the real type is at runtime. We've done that for
>> all shuffle keys (PIG-1277). However, there are other cases. One case is
>> adding two bytearrays. For example,
>>
>> -- Assume SomeLoader does not define a schema, but actually feeds Integers
>> a = load '1.txt' using SomeLoader() as (a0, a1);
>> b = foreach a generate a0+a1;
>>
>> In Pig 0.8, the schema system marks a0 and a1 as bytearray. In the case of
>> a0+a1, Pig casts both a0 and a1 to double (in TypeCheckingVisitor), and marks
>> the output schema for a0+a1 as double. Here is something interesting:
>> SomeLoader loads Integers, and we get a Double after adding. We can change
>> this if we do the following:
>> 1. Don't cast bytearray to Double (in TypeCheckingVisitor)
>> 2. Change POAdd (and similarly all other ExpressionOperators: multiply,
>> divide, etc.) to handle bytearray. When the schema for POAdd is bytearray,
>> Pig will figure out the data type at runtime, and perform the addition
>> according to the real type
>>
>> Pro:
>> 1. Consistent with the goal of the unknown type cleanup: treat all bytearrays
>> as unknown type, and inspect the object at runtime to find the real type
>>
>> Cons:
>> 1. Slows down processing, since we need to inspect the object type at runtime
>> 2. Brings some indeterminism to the schema system. Previously a0+a1 was
>> always double, so the downstream schema was clearer.
>>
>> Any comments?
>>
>> Daniel
>>
>>
