It works!!!! This confirms that Pig is better than Java MapReduce :-)
Thanks everyone for their help.

Input:

Toy Story|0|0|0|0|1|1|0|0|0
GoldenEye|0|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0|1|0|0
SomeNewMovie|0|0|0|0|1|1|0|0|0|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1

Script:

movies = load 'movies' USING PigStorage('|');
movie_and_genres = FOREACH movies GENERATE $0 as movie_name, (bag{tuple()})TOBAG($1 ..) AS genres: bag{genre_bit: tuple()};
movies_sum_genres = foreach movie_and_genres generate movie_name, (int)SUM(genres) as genre_total;
DUMP movies_sum_genres;

Output:

(Toy Story,2)
(GoldenEye,3)
(SomeNewMovie,41)

On Tue, Mar 19, 2013 at 12:43 PM, Abhinav Neelam <abhinavroc...@gmail.com> wrote:

> Russell's code works with a little modification. (The cast to int doesn't
> work though.)
>
> movie_and_genres = FOREACH movies GENERATE $0 as movie_name,
> (bag{tuple()})TOBAG($2 ..) AS genres: bag{genre_bit: tuple()};
> foo = foreach movies_and_genres generate movie_name, (int)SUM(genres) as
> genre_total;
>
> Having said that, it appears from your problem description that there're a
> fixed number of genres and every movie record would contain either a 0 or 1
> corresponding to that genre. Ergo, every record has the same number of
> columns. (Is that right? I see your second example doesn't follow this
> though.) Then you could specify the detailed schema in your load statement
> and simplify matters.

The second example is the main reason for using the range, in that new
genres could be added arbitrarily at the end of each record. Thus movie#1
could have been added when there were 10 known genres in the format, but
movie#1000 has 11 known genres in the record format.

> Secondly, it appears that order matters in your genre bitmap (you say the
> first column corresponds to action movies and so on).

that's correct -- for all movies, the first zero is whether or not it
belongs to an 'action' genre.
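A minimal sketch of how the two concerns can be kept apart, assuming the relations `movies` and `movies_sum_genres` from the script at the top of this message. The aliases `multi_genre` and `action_movies` are made up for illustration, and this has not been run against the real data set:

```pig
-- Assumes `movies` and `movies_sum_genres` as defined in the script above.

-- Counting genres only needs the total, so the order of the bits
-- inside the bag does not matter here.
multi_genre = FILTER movies_sum_genres BY genre_total >= 2;

-- A question about one specific genre can go back to the positional
-- columns of `movies`, where column order is still intact
-- ($1 is the 'action' flag, per the description above).
action_movies = FILTER movies BY (int)$1 == 1;

DUMP multi_genre;
DUMP action_movies;
```

The positional filter relies on every record having at least one genre column, which holds for the sample input shown above.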
> Bags are unordered,
> so it makes sense to make a tuple out of your genre bitmap first because
> the TOBAG operation will throw away all column order information.
> You need to FLATTEN your tuple before TOBAG-ging and SUM-ming it though.

Hmm in my use-case (find the movies that belong to 2 or more genres) this
wouldn't matter, but that's a very interesting (and tricky) point to note.

Thank you very much!

> HTH,
> Abhinav
>
> On 19 March 2013 07:20, Nathan Neff <nathan.n...@cloudera.com> wrote:
>
>> It seems like I'm getting closer:
>>
>> With this data:
>>
>> Toy Story|0
>> GoldenEye|0|1|0|1
>>
>> And this script:
>>
>> movies = load 'movies' USING PigStorage('|');
>> movie_and_genres = FOREACH movies GENERATE $0, TOTUPLE($1 ..);
>> DUMP movie_and_genres;
>> describe movie_and_genres;
>>
>> I get this output:
>>
>> (Toy Story,(0))
>> (GoldenEye,(0,1,0,1))
>> movie_and_genres: {bytearray,()}
>>