RE: Semantic cleanup: How to adding two bytearray

Olga Natkovich Fri, 14 Jan 2011 13:17:13 -0800

Then the true type is DataByteArray so it would be used.

Olga


-----Original Message-----
From: Thejas M Nair [mailto:te...@yahoo-inc.com] 
Sent: Friday, January 14, 2011 1:01 PM
To: dev@pig.apache.org; Jianyong Dai
Subject: Re: Semantic cleanup: How to adding two bytearray

What would happen if the loader is PigStorage? The bytearray type
would actually be a DataByteArray. Will it be cast to double in that case?

-Thejas





On 1/13/11 8:58 PM, "Daniel Dai" <jiany...@yahoo-inc.com> wrote:

> One goal of the ongoing semantic cleanup work is to clarify the usage of
> the unknown type.
> 
> In the Pig schema system, users can define an output schema for a
> LoadFunc/EvalFunc, and Pig propagates that schema through the entire
> script. Defining a schema for a LoadFunc/EvalFunc is optional. If the
> user doesn't define one, Pig marks the fields as bytearray. At run time,
> however, the user can feed in any data type. Until now, Pig has assumed
> that the runtime type for bytearray is DataByteArray, which has caused
> several issues (PIG-1277, PIG-999, PIG-1016).
> 
> In 0.9, Pig will treat bytearray as an unknown type and inspect the
> object at runtime to figure out what the real type is. We've done that
> for all shuffle keys (PIG-1277). However, there are other cases. One
> case is adding two bytearrays. For example,
> 
> -- Assume SomeLoader does not define a schema, but actually feeds Integers
> a = load '1.txt' using SomeLoader() as (a0, a1);
> b = foreach a generate a0+a1;
> 
> In Pig 0.8, the schema system marks a0 and a1 as bytearray. In the case
> of a0+a1, Pig casts both a0 and a1 to double (in TypeCheckingVisitor)
> and marks the output schema for a0+a1 as double. The interesting part is
> that SomeLoader loads Integers, yet we get a Double after adding.
> We can change this if we do the following:
> 1. Don't cast bytearray to double (in TypeCheckingVisitor)
> 2. Change POAdd (and similarly all the other ExpressionOperators:
> multiply, divide, etc.) to handle bytearray. When the schema for POAdd
> is bytearray, Pig will figure out the data type at runtime and perform
> the addition according to the real type
> 
> Pro:
> 1. Consistent with the goal of the unknown type cleanup: treat all
> bytearrays as the unknown type and, at runtime, inspect the object to
> find the real type
> 
> Cons:
> 1. Slows down processing, since we need to inspect the object type at runtime
> 2. Brings some indeterminism to the schema system. Previously a0+a1 was
> always double, so the downstream schema was clearer.
> 
> Any comments?
> 
> Daniel
>
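
The difference between the 0.8 behavior and the proposal quoted above can be sketched in plain Java. This is only an illustration under stated assumptions: RuntimeAdd, addAsDouble and addByRuntimeType are hypothetical names, not Pig classes or methods, and the sketch does not use POAdd or TypeCheckingVisitor. It assumes both operands share one numeric type and that null operands give a null result. The DataByteArray case raised in the replies is left out.

```java
/** Hypothetical illustration only; not part of Pig. */
final class RuntimeAdd {
    private RuntimeAdd() {}

    /**
     * Roughly the 0.8 outcome described in the thread: both operands
     * are converted to double, so adding two Integers yields a Double.
     * (Simplification: real Pig inserts casts at type-checking time.)
     */
    static Double addAsDouble(Number left, Number right) {
        if (left == null || right == null) {
            return null; // assumption: null operands give a null result
        }
        return left.doubleValue() + right.doubleValue();
    }

    /**
     * The proposed alternative: inspect the objects at runtime and add
     * according to the real type, so Integers stay Integers.
     */
    static Object addByRuntimeType(Object left, Object right) {
        if (left == null || right == null) {
            return null; // assumption: null operands give a null result
        }
        if (left instanceof Integer && right instanceof Integer) {
            return (Integer) left + (Integer) right;
        }
        if (left instanceof Long && right instanceof Long) {
            return (Long) left + (Long) right;
        }
        if (left instanceof Float && right instanceof Float) {
            return (Float) left + (Float) right;
        }
        if (left instanceof Double && right instanceof Double) {
            return (Double) left + (Double) right;
        }
        // Mixed or unsupported types are rejected in this sketch.
        throw new IllegalArgumentException(
            "Cannot add " + left.getClass().getName()
            + " and " + right.getClass().getName());
    }
}
```

With Integer inputs 2 and 3, addAsDouble returns the Double 5.0 while addByRuntimeType returns the Integer 5. The chain of instanceof checks on every call is the per-record inspection cost listed under Cons.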
