Are you setting parallel as Mirdul suggests? Or does your cluster have a default parallelism set?

Alan.

On Jan 20, 2010, at 1:58 AM, Rob Stewart wrote:

Hi again,

The results are in. I made the following improvements:
1. Removed the unnecessary "words = FOREACH myinput GENERATE
FLATTEN(TOKENIZE($0));"
2. Used PigStorage instead of the deprecated TextLoader
3. Used Pig 0.6.0 instead of 0.5.0


Here is the resulting Pig script, executed with Pig 0.6.0:
--------------
myinput = LOAD 'Inputs/WordCount/wordsx1_skewed.dat' USING PigStorage();
grouped = GROUP myinput BY $0 PARALLEL 56;
counts = FOREACH grouped GENERATE group, COUNT(myinput) AS total;
STORE counts INTO 'Outputs/WordCount/wordCountx1_skewed.pig' USING
PigStorage();
--------------
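
For readers less familiar with Pig, here is a rough Python sketch of what
the GROUP/COUNT above computes, and of why skewed input can matter when
the work is split across PARALLEL reducers. It is only an illustration:
the function names and the sample data are made up, and the CRC32-based
partitioning is a stand-in assumption, not the partitioner Hadoop or Pig
actually uses.

```python
import zlib
from collections import Counter


def group_and_count(words):
    """One total per distinct word, like GROUP ... BY $0 followed by COUNT."""
    return Counter(words)


def reducer_loads(words, parallel):
    """Rows each of `parallel` reducers would receive if rows were
    assigned by a simple hash of the key (assumption: CRC32 stand-in)."""
    loads = [0] * parallel
    for word in words:
        loads[zlib.crc32(word.encode("utf-8")) % parallel] += 1
    return loads


# Hypothetical skewed sample: one word dominates the input.
sample = ["the"] * 6 + ["pig", "hive", "pig", "hadoop"]

counts = group_and_count(sample)
loads = reducer_loads(sample, parallel=4)

print(counts)
print(loads)
```

All rows for a given key land on the same reducer, so in this simplified
model a single very frequent word puts most of the rows on one reducer,
however large the PARALLEL value is.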

The results can be found here:
www.macs.hw.ac.uk/~rs46/WordCount_ScaleUp_Results_version2.pdf

This brings Pig into the comparison, though Hive still appears to perform
best. Remember that this is just a GROUP and COUNT (for word count). I
will run other tests ("skewed JOIN" and "Group, Count and Order by"),
which may show that Pig has some optimizing properties.

Rob Stewart
