One goal of the ongoing semantic cleanup work is to clarify the usage of the unknown type.

In the Pig schema system, users can define an output schema for a LoadFunc/EvalFunc, and Pig propagates that schema through the entire script. Defining a schema for a LoadFunc/EvalFunc is optional. If the user doesn't define one, Pig marks the fields as bytearray. At runtime, however, the user can feed in any data type. Until now, Pig has assumed that the runtime type for bytearray is DataByteArray, which has caused several issues (PIG-1277, PIG-999, PIG-1016).

In 0.9, Pig will treat bytearray as an unknown type and inspect the object at runtime to figure out what the real type is. We've done that for all shuffle keys (PIG-1277). However, there are other cases. One case is adding two bytearrays. For example,

a = load '1.txt' using SomeLoader() as (a0, a1); -- Assume SomeLoader does not define a schema, but actually feeds Integers
b = foreach a generate a0+a1;

In Pig 0.8, the schema system marks a0 and a1 as bytearray. In the case of a0+a1, Pig casts both a0 and a1 to double (in TypeCheckingVisitor) and marks the output schema for a0+a1 as double. The interesting part is that SomeLoader loads Integers, yet we get a Double after adding. We can change that if we do the following:
1. Don't cast bytearray to double (in TypeCheckingVisitor).
2. Change POAdd (and similarly all other ExpressionOperators: multiply, divide, etc.) to handle bytearray. When the schema for POAdd is bytearray, Pig will figure out the data type at runtime and perform the addition according to the real type.
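To make step 2 concrete, here is a rough sketch of what "figure out the data type at runtime" could look like for addition. This is not the actual POAdd code; the class RuntimeAdd and its add method are hypothetical names made up for illustration, and the promotion order (Integer < Long < Float < Double) and the null handling are assumptions, not decisions.

```java
// Hypothetical sketch only: not Pig's POAdd. It shows the idea of choosing
// the addition based on the runtime class of the operands instead of a
// compile-time cast to double.
final class RuntimeAdd {

    private RuntimeAdd() {}

    // Assumption: only these four boxed numeric types are handled here.
    private static boolean isSupported(Object o) {
        return o instanceof Integer || o instanceof Long
                || o instanceof Float || o instanceof Double;
    }

    static Object add(Object left, Object right) {
        // Assumption: a null operand yields a null result.
        if (left == null || right == null) {
            return null;
        }
        if (!isSupported(left) || !isSupported(right)) {
            throw new IllegalArgumentException(
                    "Cannot add " + left.getClass().getName()
                    + " and " + right.getClass().getName());
        }
        Number l = (Number) left;
        Number r = (Number) right;
        // Pick the widest type present; the result keeps that type.
        if (l instanceof Double || r instanceof Double) {
            return l.doubleValue() + r.doubleValue();
        }
        if (l instanceof Float || r instanceof Float) {
            return l.floatValue() + r.floatValue();
        }
        if (l instanceof Long || r instanceof Long) {
            return l.longValue() + r.longValue();
        }
        // Both operands are Integer at this point.
        return l.intValue() + r.intValue();
    }

    public static void main(String[] args) {
        Object sum = add(1, 2);
        System.out.println(sum + " (" + sum.getClass().getSimpleName() + ")");
    }
}
```

With this kind of dispatch, two Integers fed by SomeLoader add up to an Integer rather than a Double. The instanceof checks on every call are the per-record cost listed under the cons.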

Pro:
1. Consistent with the goal of the unknown type cleanup: treat every bytearray as an unknown type, and inspect the object at runtime to find the real type.

Cons:
1. Slows down processing, since we need to inspect the object type at runtime.
2. Brings some nondeterminism to the schema system. Before, a0+a1 was always double, so the downstream schema was clearer.

Any comments?

Daniel
