Are you familiar with the CUBE operator, which was added relatively recently? This sounds like a perfect use case for it. Also, how are you splitting on activity? There is a SPLIT operator that is perfect for this, since it gives you a separate relation for each category.
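As a rough sketch of the CUBE part (field names taken from the schema quoted below; the input path and the inner DISTINCT-then-COUNT pattern are my assumptions, not your script):

```pig
-- Sketch only: 'input' is a placeholder path.
source = LOAD 'input' AS (userId:chararray, dimensionA_key:chararray,
                          dimensionB_key:chararray, dimensionC_key:chararray,
                          activity:chararray);

-- CUBE computes aggregates over all combinations of the listed
-- dimensions (A, B, C, A+B, A+C, B+C, A+B+C) in a single pass,
-- instead of re-reading the input once per combination.
cubed = CUBE source BY CUBE(dimensionA_key, dimensionB_key, dimensionC_key);

-- The grouped records land in a bag named 'cube'; DISTINCT on the
-- projected userId gives the unique-user count per combination.
uniq_counts = FOREACH cubed {
    uniq = DISTINCT cube.userId;
    GENERATE FLATTEN(group), COUNT(uniq) AS unique_users;
}
```

Run that once per split relation (see below) and you get per-category unique counts for every dimension combination without reloading the input each time.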
What I would do is use SPLIT to break the input down by activity, then write a macro that produces the counts you want and apply it to each of the split relations.

2013/5/6 Thomas Edison <justdoit.thomas.edi...@gmail.com>

> Hi there,
>
> I have a huge input on HDFS and I would like to use Pig to calculate
> several unique metrics. To help explain the problem, assume the input
> file has the following schema:
>
> userId:chararray, dimensionA_key:chararray, dimensionB_key:chararray,
> dimensionC_key:chararray, activity:chararray, ...
>
> Each record represents an activity performed by that userId.
>
> Based on the value in the activity field, each activity record is
> mapped to 1 or more categories. There are about 10 categories in total.
>
> Now I need to count the number of unique users for different dimension
> combinations (i.e. A, B, C, A+B, A+C, B+C, A+B+C) for each activity
> category.
>
> What would be the best practice for such a calculation?
>
> I have tried several approaches. Although I can get the results I want,
> they take a very long time (i.e. days). What I found is that most of the
> time is spent in the map phase. It looks like the script reloads the huge
> input file every time it calculates one unique count. Is there a way to
> improve this behavior?
>
> I also tried something similar to the script below, but it hits the
> memory cap for a single reducer and gets stuck at the last reducer step.
>
> source = LOAD ... AS (userId:chararray, dimensionA_key:chararray,
>                       dimensionB_key:chararray, dimensionC_key:chararray,
>                       activity:chararray, ...);
> a = GROUP source BY (dimensionA_key, dimensionB_key);
> b = FOREACH a {
>     userId1 = udf.newUserIdForCategory1(userId, activity);
>     -- this UDF returns the original userId if the activity should be
>     -- mapped to Category1, and None otherwise
>     userId2 = udf.newUserIdForCategory2(userId, activity);
>     userId3 = udf.newUserIdForCategory3(userId, activity);
>     ...
>     userId10 = udf.newUserIdForCategory10(userId, activity);
>     GENERATE FLATTEN(group), COUNT(userId1), COUNT(userId2),
>              COUNT(userId3), ..., COUNT(userId10);
> }
> STORE b ...;
>
> Thanks.
>
> T.E.
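To make the split-plus-macro suggestion above concrete, here is a rough sketch. The activity values, the macro name, and the choice of grouping keys are placeholders I made up for illustration; your real script would split into roughly ten category relations:

```pig
-- Hypothetical macro: unique-user count per (A, B) dimension pair.
DEFINE unique_users(rel) RETURNS counted {
    grouped = GROUP $rel BY (dimensionA_key, dimensionB_key);
    $counted = FOREACH grouped {
        uniq = DISTINCT $rel.userId;
        GENERATE FLATTEN(group), COUNT(uniq) AS n_users;
    }
};

source = LOAD 'input' AS (userId:chararray, dimensionA_key:chararray,
                          dimensionB_key:chararray, dimensionC_key:chararray,
                          activity:chararray);

-- One relation per category; the predicates here are placeholders
-- standing in for your category-mapping logic.
SPLIT source INTO cat1 IF activity == 'view',
                  cat2 IF activity == 'purchase';

cat1_counts = unique_users(cat1);
cat2_counts = unique_users(cat2);
```

This reads the input once, and each category's counts go through its own reducers rather than piling ten category counts onto a single reducer the way the nested-FOREACH script does.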