Then the true type is DataByteArray, so it would be used.

Olga

-----Original Message-----
From: Thejas M Nair [mailto:te...@yahoo-inc.com]
Sent: Friday, January 14, 2011 1:01 PM
To: dev@pig.apache.org; Jianyong Dai
Subject: Re: Semantic cleanup: How to adding two bytearray

What would happen if the loader is PigStorage? The bytearray type would
actually be a DataByteArray. Will it be cast to double in that case?

-Thejas

On 1/13/11 8:58 PM, "Daniel Dai" <jiany...@yahoo-inc.com> wrote:

> One goal of the semantic cleanup work now underway is to clarify the
> usage of the unknown type.
>
> In the Pig schema system, users can define an output schema for
> LoadFunc/EvalFunc. Pig will propagate that schema to the entire script.
> Defining a schema for LoadFunc/EvalFunc is optional. If the user doesn't
> define a schema, Pig will mark the fields as bytearray. However, at
> runtime, the user can feed any data type in. Previously, Pig assumed the
> runtime type for bytearray was DataByteArray, which caused several
> issues (PIG-1277, PIG-999, PIG-1016).
>
> In 0.9, Pig will treat bytearray as an unknown type. Pig will inspect the
> object to figure out what the real type is at runtime. We've done that
> for all shuffle keys (PIG-1277). However, there are other cases. One
> case is adding two bytearrays. For example,
>
> -- Assume SomeLoader does not define a schema, but actually feeds Integer
> a = load '1.txt' using SomeLoader() as (a0, a1);
> b = foreach a generate a0+a1;
>
> In Pig 0.8, the schema system marks a0 and a1 as bytearray. In the case
> of a0+a1, Pig casts both a0 and a1 to double (in TypeCheckingVisitor) and
> marks the output schema for a0+a1 as double. Here is something
> interesting: SomeLoader loads Integer, and we get Double after adding.
> We can change this if we do the following:
> 1. Don't cast bytearray to Double (in TypeCheckingVisitor).
> 2. Change POAdd (and similarly all other ExpressionOperators: multiply,
>    divide, etc.) to handle bytearray. When the schema for POAdd is
>    bytearray, Pig will figure out the data type at runtime and perform
>    the addition according to the real type.
>
> Pro:
> 1. Consistent with the goal of the unknown type cleanup: treat all
>    bytearray as unknown type and, at runtime, inspect the object to
>    find the real type.
>
> Cons:
> 1. Slows down processing, since we need to inspect the object type at
>    runtime.
> 2. Brings some indeterminism to the schema system. Previously a0+a1 was
>    double, so the downstream schema was clearer.
>
> Any comments?
>
> Daniel
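
For readers following the thread, here is a minimal Java sketch of what "figure out the data type at runtime" could look like for an addition. It is not Pig's POAdd: the class RuntimeAdd and its add method are hypothetical names made up for illustration. It assumes that only Integer, Long, Float and Double operands are supported, that mixed operands widen to the larger numeric type, and that a null operand yields null. How a DataByteArray operand (the PigStorage case raised above) should be handled is not settled in this thread, so the sketch simply rejects anything that is not one of those four types.

```java
public class RuntimeAdd {
    // Hypothetical helper (not part of Pig): adds two operands whose
    // declared type is unknown by inspecting their runtime classes.
    public static Object add(Object left, Object right) {
        // Assumption: a null operand makes the whole result null.
        if (left == null || right == null) {
            return null;
        }
        if (!(left instanceof Number) || !(right instanceof Number)) {
            throw new IllegalArgumentException(
                "Cannot add " + left.getClass().getName()
                + " and " + right.getClass().getName());
        }
        Number l = (Number) left;
        Number r = (Number) right;
        // Widen to the larger of the two runtime types.
        if (l instanceof Double || r instanceof Double) {
            return l.doubleValue() + r.doubleValue();
        }
        if (l instanceof Float || r instanceof Float) {
            return l.floatValue() + r.floatValue();
        }
        if (l instanceof Long || r instanceof Long) {
            return l.longValue() + r.longValue();
        }
        if (l instanceof Integer && r instanceof Integer) {
            return l.intValue() + r.intValue();
        }
        // Other Number subclasses are outside the scope of this sketch.
        throw new IllegalArgumentException(
            "Unsupported numeric types: " + l.getClass().getName()
            + ", " + r.getClass().getName());
    }

    public static void main(String[] args) {
        System.out.println(add(1, 2));    // Integer + Integer stays Integer
        System.out.println(add(1, 2.5));  // Integer + Double widens to Double
    }
}
```

The instanceof chain runs once per call, which is the per-record cost listed under Cons, and the Object return type reflects the second con: the result type is only known once the operands have been seen.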