It works!!!!

This confirms that Pig is better than Java MapReduce :-)

Thanks, everyone, for your help.

Input:
Toy Story|0|0|0|0|1|1|0|0|0
GoldenEye|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
SomeNewMovie|0|0|0|0|1|1|0|0|0|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1

Script:
movies = LOAD 'movies' USING PigStorage('|');
movie_and_genres = FOREACH movies GENERATE $0 AS movie_name,
        (bag{tuple()})TOBAG($1 ..) AS genres: bag{genre_bit: tuple()};
movies_sum_genres = FOREACH movie_and_genres GENERATE movie_name,
        (int)SUM(genres) AS genre_total;
DUMP movies_sum_genres;

Output:
(Toy Story,2)
(GoldenEye,3)
(SomeNewMovie,41)
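
As a sanity check outside Pig, the per-record sum can be sketched in plain
Python. This is only an illustration, not part of the Pig script: it assumes
the same '|'-delimited layout (movie name first, then any number of 0/1 genre
flags), and the function name sum_genres is made up for the example.

```python
def sum_genres(line, delimiter="|"):
    """Return (movie_name, genre_total) for one delimited record.

    Like `$1 ..` in the Pig script, this takes every field after the name,
    so records may carry different numbers of genre flags.
    """
    name, *flags = line.rstrip("\n").split(delimiter)
    return name, sum(int(flag) for flag in flags)


# The first two input records from this message.
records = [
    "Toy Story|0|0|0|0|1|1|0|0|0",
    "GoldenEye|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0",
]
totals = [sum_genres(record) for record in records]
print(totals)
```

Because the flags are collected with a starred assignment, a record with no
genre fields at all simply gets a total of 0.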



On Tue, Mar 19, 2013 at 12:43 PM, Abhinav Neelam
<abhinavroc...@gmail.com> wrote:
> Russell's code works with a little modification. (The cast to int doesn't
> work though.)
>
> movie_and_genres = FOREACH movies GENERATE $0 as movie_name,
> (bag{tuple()})TOBAG($2 ..) AS genres: bag{genre_bit: tuple()};
> foo = foreach movie_and_genres generate movie_name, (int)SUM(genres) as
> genre_total;
>
>
> Having said that, it appears from your problem description that there're a
> fixed number of genres and every movie record would contain either a 0 or 1
> corresponding to that genre. Ergo, every record has the same number of
> columns. (Is that right? I see your second example doesn't follow this
> though.) Then you could specify the detailed schema in your load statement
> and simplify matters.

The second example is the main reason for using the range: new genres could
be added arbitrarily at the end of each record. Thus movie #1 could have been
added when there were 10 known genres in the format, while movie #1000 has
11 known genres in its record.

> Secondly, it appears that order matters in your genre bitmap (you say the
> first column corresponds to action movies and so on).

That's correct: for every movie, the first field after the name says whether
or not it belongs to the 'action' genre.

> Bags are unordered,
> so it makes sense to make a tuple out of your genre bitmap first because
> the TOBAG operation will throw away all column order information.
> You need to FLATTEN your tuple before TOBAG-ging and SUM-ming it though.

Hmm, in my use case (finding the movies that belong to 2 or more genres) this
wouldn't matter, but that's a very interesting (and tricky) point to note.
Thank you very much!
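
To illustrate why ordering does not affect this particular use case, here is
a small plain-Python sketch (not Pig). The helper name in_multiple_genres and
the threshold parameter are made up for the example; the flags are the
GoldenEye record from the input shown earlier in this thread.

```python
import random


def in_multiple_genres(flags, minimum=2):
    """True if at least `minimum` genre flags are set.

    Only the count of 1s matters here, so the order of the flags is
    irrelevant for this test.
    """
    return sum(flags) >= minimum


golden_eye = [0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

# Shuffle a copy to mimic losing column order, as happens in a bag.
shuffled = golden_eye[:]
random.shuffle(shuffled)

same_answer = in_multiple_genres(golden_eye) == in_multiple_genres(shuffled)
print(same_answer)
```

A question such as "is this an action movie?" would depend on position, so it
would need the ordered form of the data rather than a bag.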

> HTH,
> Abhinav
>
>
>
>
> On 19 March 2013 07:20, Nathan Neff <nathan.n...@cloudera.com> wrote:
>
>> It seems like I'm getting closer:
>>
>> With this data:
>>
>> Toy Story|0
>> GoldenEye|0|1|0|1
>>
>> And this script:
>>
>> movies = load 'movies' USING PigStorage('|');
>> movie_and_genres = FOREACH movies GENERATE $0, TOTUPLE($1 ..);
>> DUMP movie_and_genres;
>> describe movie_and_genres;
>>
>> I get this output:
>>
>> (Toy Story,(0))
>> (GoldenEye,(0,1,0,1))
>> movie_and_genres: {bytearray,()}
>>
