I think the tradeoff between fully dynamic types and static types is between convenience (why should I tell you what the type is if the data is properly typed?) on one side, and type-safety (what if your data has invalid values?) and performance (dynamic typing would be slower) on the other.
My vote is for static typing because I believe the type-safety (and clear schema definition) and performance are more important.

Olga

-----Original Message-----
From: Daniel Dai [mailto:jiany...@yahoo-inc.com]
Sent: Friday, January 14, 2011 12:12 PM
To: dev@pig.apache.org
Subject: Re: Semantic cleanup: How to adding two bytearray

Runtime detection can be done row by row. This will solve the problem in your sample, though at a small performance cost. Requiring a cast before adding is also clean. However, this would break backward compatibility.

Dmitriy Ryaboy wrote:
> How is runtime detection done? I worry that if 1.txt contains:
> 1, 2
> 1.1, 2.2
>
> We get into a situation where addition of the fields in the first tuple
> produces integers, and adding the fields of the second tuple produces
> doubles.
>
> A more invasive but perhaps easier to reason about solution might be to be
> stricter about types, and require bytearrays to be cast to whatever type
> they are supposed to be if you want to add / delete / do non-byte-things to
> them.
>
> This is a problem if UDFs that output tuples or bags don't specify schemas
> (and specifying schemas of tuples and bags is fairly onerous right now in
> Java). I am not sure what the solution here is, other than finding a clean,
> less onerous way of declaring schemas, fixing up everything in builtin and
> piggybank to only use the new clean sparkly api, and documenting the heck
> out of it.
>
> D
>
> On Thu, Jan 13, 2011 at 8:58 PM, Daniel Dai <jiany...@yahoo-inc.com> wrote:
>
>> One goal of the ongoing semantic cleanup work is to clarify the usage of
>> the unknown type.
>>
>> In the Pig schema system, the user can define an output schema for a
>> LoadFunc/EvalFunc. Pig will propagate that schema through the entire
>> script. Defining a schema for a LoadFunc/EvalFunc is optional. If the user
>> doesn't define a schema, Pig will mark the fields as bytearray. However,
>> at runtime, the user can feed any data type in. Before, Pig assumed the
>> runtime type for bytearray was DataByteArray, which caused several issues
>> (PIG-1277, PIG-999, PIG-1016).
>>
>> In 0.9, Pig will treat bytearray as an unknown type. Pig will inspect the
>> object to figure out what the real type is at runtime. We've done that for
>> all shuffle keys (PIG-1277). However, there are other cases. One case is
>> adding two bytearrays. For example,
>>
>> -- Assume SomeLoader does not define a schema, but actually feeds Integer
>> a = load '1.txt' using SomeLoader() as (a0, a1);
>> b = foreach a generate a0+a1;
>>
>> In Pig 0.8, the schema system marks a0 and a1 as bytearray. In the case of
>> a0+a1, Pig casts both a0 and a1 to double (in TypeCheckingVisitor), and
>> marks the output schema for a0+a1 as double. Here is something
>> interesting: SomeLoader loads Integer, and we get Double after adding. We
>> can change it if we do the following:
>> 1. Don't cast bytearray into Double (in TypeCheckingVisitor)
>> 2. Change POAdd (and similarly all other ExpressionOperators: multiply,
>> divide, etc.) to handle bytearray. When the schema for POAdd is bytearray,
>> Pig will figure out the data type at runtime, and process the addition
>> according to the real type
>>
>> Pro:
>> 1. Consistent with the goal of the unknown type cleanup: treat all
>> bytearrays as unknown type. At runtime, inspect the object to find the
>> real type
>>
>> Cons:
>> 1. Slows down processing since we need to inspect the object type at
>> runtime
>> 2. Brings some nondeterminism to the schema system. Before, a0+a1 was
>> double, so the downstream schema was clearer.
>>
>> Any comments?
>>
>> Daniel
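
For readers following the thread, here is a small sketch of the concern Dmitriy raises, using his two sample rows. It is plain Python, not Pig code: Python int and float stand in for Java Integer and Double, and the names parse_field, add_runtime_detected and add_cast_to_double are made up for this illustration. It assumes a loader that hands back an integer when a field parses as one and a double otherwise.

```python
from typing import Union

Number = Union[int, float]


def parse_field(raw: str) -> Number:
    """Hypothetical loader step: yield an int when the field parses as one, else a float."""
    text = raw.strip()
    try:
        return int(text)
    except ValueError:
        return float(text)


def add_runtime_detected(a: str, b: str) -> Number:
    """Row-by-row detection: the result type follows the values in that row."""
    return parse_field(a) + parse_field(b)


def add_cast_to_double(a: str, b: str) -> float:
    """Behaviour the thread attributes to Pig 0.8: cast both operands to double first."""
    return float(a) + float(b)


# The two rows from Dmitriy's 1.txt example.
rows = [("1", " 2"), ("1.1", " 2.2")]

detected_types = [type(add_runtime_detected(a, b)).__name__ for a, b in rows]
cast_types = [type(add_cast_to_double(a, b)).__name__ for a, b in rows]

print(detected_types)  # ['int', 'float']
print(cast_types)      # ['float', 'float']
```

With per-row detection the same column comes out as an integer for one row and a double for the next, which is the schema nondeterminism listed under Cons. Casting up front keeps the column type fixed, at the price of turning integer input into doubles.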