Semantic cleanup: How to adding two bytearray

Daniel Dai Thu, 13 Jan 2011 20:59:38 -0800

One goal of the ongoing semantic cleanup work is to clarify the usage of the unknown type.

In the Pig schema system, a user can define an output schema for a LoadFunc/EvalFunc. Pig will propagate that schema through the entire script. Defining a schema for a LoadFunc/EvalFunc is optional. If the user doesn't define a schema, Pig will mark the fields as bytearray. However, at runtime the user can feed in any data type. Previously, Pig assumed the runtime type for bytearray was DataByteArray, which gave rise to several issues (PIG-1277, PIG-999, PIG-1016).

In 0.9, Pig will treat bytearray as an unknown type. Pig will inspect the object to figure out what the real type is at runtime. We've done that for all shuffle keys (PIG-1277). However, there are other cases. One case is adding two bytearrays. For example,

a = load '1.txt' using SomeLoader() as (a0, a1); -- Assume SomeLoader does not define a schema, but actually feeds Integer

b = foreach a generate a0+a1;

In Pig 0.8, the schema system marks a0 and a1 as bytearray. In the case of a0+a1, Pig casts both a0 and a1 to double (in TypeCheckingVisitor) and marks the output schema for a0+a1 as double. Here is something interesting: SomeLoader loads Integer, and we get Double after adding. We can change this if we do the following:

1. Don't cast bytearray into Double (in TypeCheckingVisitor)

2. Change POAdd (and similarly all other ExpressionOperators: multiply, divide, etc.) to handle bytearray. When the schema for POAdd is bytearray, Pig will figure out the data type at runtime and perform the addition according to the real type
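
To make the difference concrete, here is a rough Java sketch of the two behaviours. It is not Pig's actual code: the class AddSketch and the methods addAsDouble and addByRuntimeType are hypothetical names used only for illustration, and the null handling and the widening rule for mixed operand types are assumptions rather than anything decided in this proposal.

```java
final class AddSketch {
    // Hypothetical stand-in for the 0.8 behaviour: both operands are cast
    // to double before adding, so the result is always a Double.
    static Double addAsDouble(Object left, Object right) {
        if (left == null || right == null) {
            return null; // assumption: a null operand gives a null result
        }
        return ((Number) left).doubleValue() + ((Number) right).doubleValue();
    }

    // Hypothetical stand-in for the proposed behaviour: inspect the objects
    // at runtime and add according to their real type.
    static Object addByRuntimeType(Object left, Object right) {
        if (left == null || right == null) {
            return null; // assumption: a null operand gives a null result
        }
        if (!(left instanceof Number) || !(right instanceof Number)) {
            throw new IllegalArgumentException("cannot add "
                    + left.getClass().getName() + " and " + right.getClass().getName());
        }
        Number l = (Number) left;
        Number r = (Number) right;
        // Assumption: mixed operands widen to the wider of the two types.
        if (l instanceof Double || r instanceof Double) {
            return Double.valueOf(l.doubleValue() + r.doubleValue());
        }
        if (l instanceof Float || r instanceof Float) {
            return Float.valueOf(l.floatValue() + r.floatValue());
        }
        if (l instanceof Long || r instanceof Long) {
            return Long.valueOf(l.longValue() + r.longValue());
        }
        if (l instanceof Integer && r instanceof Integer) {
            return Integer.valueOf(l.intValue() + r.intValue());
        }
        throw new IllegalArgumentException("unsupported numeric types");
    }

    private static String describe(Object value) {
        return value + " (" + value.getClass().getSimpleName() + ")";
    }

    public static void main(String[] args) {
        Object a0 = Integer.valueOf(1); // what SomeLoader actually feeds
        Object a1 = Integer.valueOf(2);
        System.out.println(describe(addAsDouble(a0, a1)));       // 3.0 (Double)
        System.out.println(describe(addByRuntimeType(a0, a1)));  // 3 (Integer)
        System.out.println(describe(addByRuntimeType(a0, 2.5))); // 3.5 (Double)
    }
}
```

The last call also shows the second con listed below: the result type depends on what the loader feeds in for each row, so it cannot be known when the schema is computed.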


Pro:

1. Consistent with the goal of the unknown type cleanup: treat all bytearrays as unknown type, and at runtime inspect the object to find the real type


Cons:
1. Slows down processing, since we need to inspect the object type at runtime

2. Brings some indeterminism to the schema system. Previously a0+a1 was double, so the downstream schema was clearer.


Any comments?

Daniel
