IC
On 5/15/08, Alan Gates <[EMAIL PROTECTED]> wrote:
>
> I doubt you'll get the votes on setting it on by default. Pig's founders
> have been fairly adamant that pig continue to work in the no metadata case.
> Turning this on by default would break that rule.
>
> Alan.
>
> pi song wrote:
>
>> We can have that "strict typing" option in pig.properties and then make the
>> type checking validation consuming that config key. However by default I
>> want to turn it on.
>>
>> Pi
>>
>> On 5/15/08, Alan Gates <[EMAIL PROTECTED]> wrote:
>>
>>> I agree this will be somewhat surprising, perhaps we should give a warning.
>>> But we need to preserve our philosophy that "Pig's eat anything". This
>>> would seem to dictate that we allow people to use union regardless of the
>>> schemas. One open question in my mind is whether we have a "strict mode"
>>> (similar to 'use strict' in perl) where things like this cause errors
>>> instead of (possibly) warnings.
>>>
>>> Alan.
>>>
>>> pi song wrote:
>>>
>>>> Alan,
>>>>
>>>> On my second thought, union of two incompatible data streams can cause
>>>> undefined state in downstream operators, resulting in a mix of good output
>>>> and garbage. This seems to break the rule of least surprise. What do you
>>>> think?
>>>>
>>>> Pi
>>>>
>>>> On Wed, May 14, 2008 at 9:06 AM, pi song <[EMAIL PROTECTED]> wrote:
>>>>
>>>>> Ok, will follow that.
>>>>>
>>>>> On 5/14/08, Alan Gates <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> I agree that option 3 is the correct course.
>>>>>>
>>>>>> One note, you say:
>>>>>>
>>>>>> In case that schemas from all the input ports are not compatible, no
>>>>>> problem because we won't process it.
>>>>>>
>>>>>> How do you mean "won't process it"? We still have to allow a union
>>>>>> operation between two non-compatible inputs (otherwise we can only use
>>>>>> union when we have schemas). But the resulting union will not have a
>>>>>> schema (since the output no longer has a consistent schema).
>>>>>>
>>>>>> Alan.
>>>>>>
>>>>>> pi song wrote:
>>>>>>
>>>>>>> Union is an example of bag (relational) operators that can have more
>>>>>>> than one input.
>>>>>>>
>>>>>>> In case that schemas from all the input ports are the same, no problem.
>>>>>>> In case that schemas from all the input ports are not compatible, no
>>>>>>> problem because we won't process it.
>>>>>>> In case that schemas from all the input ports are not the same, but
>>>>>>> compatible, here comes a problem.
>>>>>>>
>>>>>>> Example:
>>>>>>>
>>>>>>> C = UNION A,B ;
>>>>>>>
>>>>>>> Schema(A) = < Int, Chararray >
>>>>>>> Schema(B) = < Double, Chararray >
>>>>>>>
>>>>>>> The output schema will get resolved to < Double, Chararray >. Here is
>>>>>>> the problem. The Union operator at the moment doesn't support casting
>>>>>>> in any layer. In this case if we don't cast it, the binary data of Int
>>>>>>> will get picked up as Double by the downstream operator!! There are a
>>>>>>> couple solutions for this:-
>>>>>>>
>>>>>>> 1) Implement LOUnion and POUnion to support type casting internally
>>>>>>> 2) Add casting support in LOUnion operator and let the
>>>>>>> LogicalToPhysical compiler generates LOForeach for it.
>>>>>>> 3) Explicitly insert LOForEach to do necessary casting between Union
>>>>>>> and the problematic input. This is analogous to the way we implement
>>>>>>> implicit casting for expression operators.
>>>>>>> 4) Don't support "not same but compatible" case at all.
>>>>>>>
>>>>>>> I will do (3) because it makes the most sense to me plus incurs the
>>>>>>> least impact on other modules. Does anyone have problem with it?
>>>>>>>
>>>>>>> Pi
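
For readers following the thread, here is a rough Python sketch of the rule the discussion converges on: widen "not same but compatible" input schemas, plan a cast step in front of each input that needs one (option 3), and leave the union without a schema when the inputs are incompatible. It is a model only, not Pig code. The names merge_types, resolve_union and NUMERIC_ORDER, the numeric widening order, and the choice to treat inputs with different column counts as incompatible are all assumptions made for illustration.

```python
# Hypothetical model of option 3 from the thread; none of these names exist in Pig.

# Assumed widening order for numeric types, narrowest first.
NUMERIC_ORDER = ["int", "long", "float", "double"]


def merge_types(a, b):
    """Return the common type of a and b, or None if they are incompatible."""
    if a == b:
        return a
    if a in NUMERIC_ORDER and b in NUMERIC_ORDER:
        # Pick whichever type sits later (wider) in the assumed order.
        return max(a, b, key=NUMERIC_ORDER.index)
    return None


def resolve_union(schemas):
    """Return (output_schema, casts) for a union over the given input schemas.

    output_schema is None when the inputs are incompatible: the union is still
    allowed, but its result carries no schema. casts[i] maps a column index to
    the type that input i must be cast to before reaching the union; an empty
    dict means that input needs no cast step.
    """
    no_casts = [{} for _ in schemas]
    merged = list(schemas[0])
    for schema in schemas[1:]:
        # Assumption: a different number of columns counts as incompatible.
        if len(schema) != len(merged):
            return None, no_casts
        for i, column_type in enumerate(schema):
            merged[i] = merge_types(merged[i], column_type)
            if merged[i] is None:
                return None, no_casts
    casts = [
        {i: merged[i] for i, t in enumerate(schema) if t != merged[i]}
        for schema in schemas
    ]
    return merged, casts


# The example from the thread: A = <Int, Chararray>, B = <Double, Chararray>.
output_schema, casts = resolve_union([["int", "chararray"], ["double", "chararray"]])
print(output_schema)  # ['double', 'chararray']
print(casts)          # [{0: 'double'}, {}]
```

In this model only input A gets a cast step (column 0 to double), which corresponds to inserting a foreach between A and the union while leaving B and the union operator itself untouched.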
