Are you setting parallel as Mirdul suggests? Or does your cluster
have a default parallelism set?
Alan.
On Jan 20, 2010, at 1:58 AM, Rob Stewart wrote:
Hi again,
The results have been produced. I can tell you that I made the
following improvements:
1. Removed the unnecessary "words = FOREACH myinput GENERATE
FLATTEN(TOKENIZE($0));"
2. Used PigStorage instead of the deprecated TextLoader
3. Used Pig 0.6.0 instead of 0.5.0
Here is the resulting Pig script, executed by pig 0.6.0:
--------------
myinput = LOAD 'Inputs/WordCount/wordsx1_skewed.dat' USING PigStorage();
grouped = GROUP myinput BY $0 PARALLEL 56;
counts = FOREACH grouped GENERATE group, COUNT(myinput) AS total;
STORE counts INTO 'Outputs/WordCount/wordCountx1_skewed.pig' USING PigStorage();
--------------
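For anyone who wants to sanity-check the output on a small sample, the
GROUP and COUNT above amounts to roughly the following. This is only a
Python sketch, not part of the benchmark. It assumes each input line
carries one word in its first tab-separated field (PigStorage() splits
on tab by default), and it does not try to reproduce Pig's null
handling. The names word_count and format_counts are made up for
illustration.

```python
from collections import Counter


def word_count(lines):
    """Count how often each first tab-separated field occurs."""
    counts = Counter()
    for line in lines:
        # Assumption: the word is the first tab-delimited field of the line,
        # mirroring a GROUP on the first column loaded by PigStorage().
        field = line.rstrip("\n").split("\t")[0]
        counts[field] += 1
    return counts


def format_counts(counts):
    """Render rows as 'word<TAB>total', sorted by word for easy diffing."""
    return [f"{word}\t{total}" for word, total in sorted(counts.items())]


if __name__ == "__main__":
    sample = ["the\n", "pig\n", "the\n", "hive\n", "the\n"]
    for row in format_counts(word_count(sample)):
        print(row)
```

Pig does not guarantee any output order without an ORDER BY, so sort
both sides before comparing.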
The results can be found here:
www.macs.hw.ac.uk/~rs46/WordCount_ScaleUp_Results_version2.pdf
It brings Pig into the equation, though Hive still seems to perform
best. Remember that this is just for a GROUP and COUNT (for word
count). I will run others ("skewed JOIN" and "Group, Count and Order
by"), which may show Pig to have some optimizing properties.
Rob Stewart