One goal of the ongoing semantic cleanup work is to clarify the usage of
the unknown type.
In the Pig schema system, a user can define an output schema for a
LoadFunc/EvalFunc, and Pig propagates that schema through the entire
script. Defining a schema for a LoadFunc/EvalFunc is optional. If the
user doesn't define one, Pig marks the fields as bytearray. At runtime,
however, the user can feed in any data type. Until now, Pig has assumed
that the runtime type for bytearray is DataByteArray, which has caused
several issues (PIG-1277, PIG-999, PIG-1016).
In 0.9, Pig will treat bytearray as an unknown type and inspect the
object at runtime to figure out what the real type is. We've done that
for all shuffle keys (PIG-1277). However, there are other cases. One
case is adding two bytearrays. For example,
-- Assume SomeLoader does not define a schema, but actually feeds Integer
a = load '1.txt' using SomeLoader() as (a0, a1);
b = foreach a generate a0+a1;
In Pig 0.8, the schema system marks a0 and a1 as bytearray. For a0+a1,
Pig casts both a0 and a1 to double (in TypeCheckingVisitor) and marks
the output schema for a0+a1 as double. The interesting consequence is
that SomeLoader loads Integer, yet we get Double after adding. We can
change this if we do the following:
1. Don't cast bytearray to double (in TypeCheckingVisitor).
2. Change POAdd (and similarly all other ExpressionOperators: multiply,
divide, etc.) to handle bytearray. When the schema for POAdd is
bytearray, Pig will figure out the data type at runtime and perform
the addition according to the real type.
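To make the difference concrete, here is a minimal standalone sketch in
plain Java. It is not Pig's actual POAdd code: the class name
RuntimeTypedAdd, both method names, the null handling, and the widening
rule for mixed operand types are all assumptions made for illustration.
The first method only models the outcome of the 0.8 behavior (the result
is always a Double); the second models the proposed runtime inspection.

```java
// Hypothetical illustration only; none of these names exist in Pig.
final class RuntimeTypedAdd {

    // Models the outcome of the 0.8 behavior: both operands are
    // treated as double, so the result is always a Double.
    public static Double addAsDouble(Object left, Object right) {
        if (left == null || right == null) {
            return null; // assumption: a null operand yields null
        }
        return ((Number) left).doubleValue() + ((Number) right).doubleValue();
    }

    // Models the proposal: inspect the operands at runtime and add
    // according to their real type.
    public static Object addByRuntimeType(Object left, Object right) {
        if (left == null || right == null) {
            return null; // assumption: a null operand yields null
        }
        if (!(left instanceof Number) || !(right instanceof Number)) {
            throw new IllegalArgumentException("Cannot add non-numeric operands");
        }
        Number l = (Number) left;
        Number r = (Number) right;
        if (l instanceof Integer && r instanceof Integer) {
            return l.intValue() + r.intValue();       // Integer result
        }
        if (isIntegral(l) && isIntegral(r)) {
            return l.longValue() + r.longValue();     // Long result
        }
        if (l instanceof Float && r instanceof Float) {
            return l.floatValue() + r.floatValue();   // Float result
        }
        // Assumed widening rule: any other numeric mix becomes Double.
        return l.doubleValue() + r.doubleValue();
    }

    private static boolean isIntegral(Number n) {
        return n instanceof Integer || n instanceof Long;
    }

    public static void main(String[] args) {
        System.out.println(addAsDouble(1, 2));       // prints 3.0
        System.out.println(addByRuntimeType(1, 2));  // prints 3
    }
}
```

The instanceof checks in the second method run once per record, which is
the runtime cost listed under the cons. The result type also depends on
the data rather than on the script, which is the schema indeterminism
concern.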
Pros:
1. Consistent with the goal of the unknown type cleanup: treat all
bytearray as unknown type and inspect the object at runtime to find
the real type.
Cons:
1. Slows down processing, since we need to inspect the object type at runtime.
2. Brings some indeterminism to the schema system. With the current
behavior, a0+a1 is always double, so the downstream schema is clearer.
Any comments?
Daniel