"FLATTEN(TOBAG(t1)), FLATTEN(TOBAG(t2)) should give you the cross"


Why?  TOBAG(t1) should give you a bag with one tuple in it.  FLATTEN on
that gives you one tuple, which is what you started with.
If you want a bag with one tuple per field of t1, FLATTEN it first:

TOBAG(FLATTEN(t1))
or reference the fields:
TOBAG(t1.$0, t1.$1, t1.$2)

I have not tested the above, but that is logically what you want, and the
two forms are equivalent if t1 has 3 fields.  You may need an extra
FOREACH ... GENERATE step to do this.

FLATTEN on a tuple 'unnests' it.  FLATTEN on a bag 'explodes' it.
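
As a rough model of that distinction in plain Python (this is not Pig; the
helper names flatten_tuple and flatten_bag are made up for illustration, a
row is modeled as a Python tuple and a bag as a list of tuples):

```python
def flatten_tuple(row, index):
    """Unnest the tuple at row[index]: its fields are spliced into the row."""
    return row[:index] + tuple(row[index]) + row[index + 1:]


def flatten_bag(row, index):
    """Explode the bag at row[index]: one output row per tuple in the bag."""
    return [row[:index] + tuple(t) + row[index + 1:] for t in row[index]]


row_with_tuple = (1, ('a', 'b'))        # second field is a tuple
row_with_bag = (1, [('a',), ('b',)])    # second field is a bag of two tuples

print(flatten_tuple(row_with_tuple, 1))  # (1, 'a', 'b')
print(flatten_bag(row_with_bag, 1))      # [(1, 'a'), (1, 'b')]
```

Unnesting keeps the row count the same and widens the row; exploding
multiplies the row count by the number of tuples in the bag.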

The first example is
('a', 'b', 'c'), ('x', 'y')
to
('a', 'x')
('a', 'y')
('b', 'x')
('b', 'y')
('c', 'x')
('c', 'y')



Which seems sane, but is not in the general case.  Consider:
('a', 3L, 2.0d), (0, {('x')})
to:
('a', 0)
('a', {('x')})
(3L, 0)
(3L, {('x')})
(2.0d, 0)
(2.0d, {('x')})


Each output column mixes types, so the output schema is undefined and
nearly unusable.
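
To make the schema problem concrete, here is a small Python sketch (again
not Pig; cross_fields and column_types are hypothetical helpers, Pig longs
and doubles are approximated by Python int and float, and a bag by a list).
It crosses the fields of both examples above and reports which types show
up in each output column:

```python
from itertools import product


def cross_fields(t1, t2):
    """Pair every field of t1 with every field of t2."""
    return list(product(t1, t2))


def column_types(rows):
    """Sorted type names seen in each output column."""
    return [sorted({type(v).__name__ for v in col}) for col in zip(*rows)]


uniform = cross_fields(('a', 'b', 'c'), ('x', 'y'))
mixed = cross_fields(('a', 3, 2.0), (0, [('x',)]))

print(column_types(uniform))  # [['str'], ['str']]
print(column_types(mixed))    # [['float', 'int', 'str'], ['int', 'list']]
```

In the first case each column has a single type, so a schema is easy to
state.  In the second, no single type describes either column.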




On 4/5/12 2:27 AM, "Gianmarco De Francisci Morales" <g...@apache.org>
wrote:

>I would say the additional nesting level is a bug.
>But we should check if we break stuff with this change.
>
>Cheers,
>--
>Gianmarco
>
>
>
>On Thu, Apr 5, 2012 at 01:36, Jonathan Coveney <jcove...@gmail.com> wrote:
>
>> Pig folks: it seems like it defies the expectation if TOBAG is run on a
>> single TUPLE and you don't get a bag. I can patch it, but does that seem
>> like a fair change?
>>
>> 2012/4/4 Eli Finkelshteyn <iefin...@gmail.com>
>>
>> > Nah, doesn't work because it doubles up the tuple, so that:
>> >
>> > TOBAG(('hello', 'howdy', 'hi'))
>> > returns
>> > {(('hello', 'howdy', 'hi'))}
>> >
>> > And so,
>> >
>> > FLATTEN(TOBAG(t1)), FLATTEN(TOBAG(t2))
>> > gets me
>> >
>> > ('hello', 'howdy', 'hi'), ('hola', 'bonjour')
>> >
>> > which is just what I started with.
>> >
>> > Anyway, to solve this problem, what I did was make a quick python udf to
>> > make a bag from a tuple without doubling up the tuple, and then ran
>> > FLATTEN on that, which looks like:
>> >
>> > bagged = FOREACH split_set GENERATE FLATTEN(py_udfs.tupleToBag(t1)),
>> > FLATTEN(py_udfs.tupleToBag(t2));
>> >
>> > Where the Python udf I'm using is:
>> >
>> > @outputSchema("b:bag{}")
>> > def tupleToBag(tup):
>> >    b = [tupify(i) for i in tupify(tup)]
>> >    return b
>> >
>> > def tupify(tup):
>> >    if isinstance(tup, tuple):
>> >        return tup
>> >    return (tup,)
>> >
>> > I'll add that into Python PiggyBank as soon as I get a chance to finish
>> > that stuff up.
>> >
>> > Eli
>> >
>> >
>> >
>> > On 4/4/12 2:43 PM, Jonathan Coveney wrote:
>> >
>> >> FLATTEN(TOBAG(t1)), FLATTEN(TOBAG(t2)) should give you the cross
>> >>
>> >> 2012/4/4 Eli Finkelshteyn <iefin...@gmail.com>
>> >>
>> >>> That's for a relation only. Unless I'm missing something, it does not
>> >>> work for tuples. What I'm doing would require a FOREACH, I'm thinking.
>> >>>
>> >>> Eli
>> >>>
>> >>>
>> >>> On 4/4/12 2:24 PM, Prashant Kommireddi wrote:
>> >>>
>> >>>> http://pig.apache.org/docs/r0.9.1/basic.html#cross
>> >>>>
>> >>>> -Prashant
>> >>>>
>> >>>> On Wed, Apr 4, 2012 at 11:18 AM, Eli Finkelshteyn <iefin...@gmail.com> wrote:
>> >>>>
>> >>>>> Hi Folks,
>> >>>>>
>> >>>>> I'm currently trying to do something I figured would be trivial, but
>> >>>>> actually wound up being a bit of work for me, so I'm wondering if I'm
>> >>>>> missing something. All I want to do is get a cross product of two
>> >>>>> tuples.
>> >>>>> So for example, given an input of:
>> >>>>>
>> >>>>> ('hello', 'howdy', 'hi'), ('hola', 'bonjour')
>> >>>>>
>> >>>>> I'd get:
>> >>>>>
>> >>>>> ('hello', 'hola')
>> >>>>> ('hello', 'bonjour')
>> >>>>> ('howdy', 'hola')
>> >>>>> ('howdy', 'bonjour')
>> >>>>> ('hi', 'hola')
>> >>>>> ('hi', 'bonjour')
>> >>>>>
>> >>>>> At first, I figured I could FLATTEN(TOBAG(tuple1, tuple2)), but that's
>> >>>>> no good because the tuples are first themselves put into new tuples.
>> >>>>> So, what I'm left with now is writing a dirty and slow python udf for
>> >>>>> this. Is there really no better way to do this? I'd think it would be
>> >>>>> a pretty standard task.
>> >>>>>
>> >>>>> Eli
>> >>>>>
>> >>>>>
>> >>>>>
>> >
>>
