Russell's code works with a little modification. (The cast to int doesn't
work though.)

movie_and_genres = FOREACH movies GENERATE $0 as movie_name,
(bag{tuple()})TOBAG($2 ..) AS genres: bag{genre_bit: tuple()};
foo = foreach movie_and_genres generate movie_name, (int)SUM(genres) as
genre_total;


Having said that, it appears from your problem description that there is a
fixed number of genres and that every movie record contains either a 0 or
a 1 for each genre. Ergo, every record has the same number of columns. (Is
that right? I see your second example doesn't follow this, though.) Then
you could specify the detailed schema in your load statement and simplify
matters.
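
As a rough sketch of what I mean by a detailed schema, assuming exactly four
genre columns per record: only "action" as the first column comes from your
description; the other column names and the genre_totals alias are made up
for illustration.

```pig
-- Hypothetical schema: four genre flags per movie, each 0 or 1.
-- Only 'action' as the first genre is taken from the problem description;
-- adventure, comedy and drama are placeholder names.
movies = LOAD 'movies' USING PigStorage('|')
         AS (movie_name:chararray, action:int, adventure:int,
             comedy:int, drama:int);

-- With typed, named columns the total is plain arithmetic;
-- no bag, tuple or cast is needed.
genre_totals = FOREACH movies GENERATE movie_name,
               action + adventure + comedy + drama AS genre_total;

DUMP genre_totals;
```

Note that a short record such as "Toy Story|0" would have its missing fields
loaded as null, and arithmetic on a null gives null, so this only helps if
every record really does carry all the genre columns.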

Secondly, it appears that order matters in your genre bitmap (you say the
first column corresponds to action movies and so on). Bags are unordered,
so it makes sense to make a tuple out of your genre bitmap first because
the TOBAG operation will throw away all column order information.
You need to FLATTEN your tuple before TOBAG-ging and SUM-ming it though.

HTH,
Abhinav




On 19 March 2013 07:20, Nathan Neff <nathan.n...@cloudera.com> wrote:

> It seems like I'm getting closer:
>
> With this data:
>
> Toy Story|0
> GoldenEye|0|1|0|1
>
> And this script:
>
> movies = load 'movies' USING PigStorage('|');
> movie_and_genres = FOREACH movies GENERATE $0, TOTUPLE($1 ..);
> DUMP movie_and_genres;
> describe movie_and_genres;
>
> I get this output:
>
> (Toy Story,(0))
> (GoldenEye,(0,1,0,1))
> movie_and_genres: {bytearray,()}
>
