Re: WordCount Results Version 2 - Pig 0.6.0

Alan Gates Wed, 20 Jan 2010 10:07:46 -0800

Are you setting parallel as Mirdul suggests? Or does your cluster have a default parallelism set?


Alan.


On Jan 20, 2010, at 1:58 AM, Rob Stewart wrote:

Hi again,
The results have been produced. I can tell you that I made the following
improvements:
1. Removed unnecessary "words = FOREACH myinput GENERATE
FLATTEN(TOKENIZE(\$0));"
2. Using PigStorage, not the deprecated TextLoader
3. Using Pig 0.6.0, not 0.5.0


Here is the resulting Pig script, executed by pig 0.6.0:
--------------
myinput = LOAD 'Inputs/WordCount/wordsx1_skewed.dat' USING PigStorage();
grouped = GROUP myinput BY \$0 PARALLEL 56;
counts = FOREACH grouped GENERATE group, COUNT(myinput) AS total;
STORE counts INTO 'Outputs/WordCount/wordCountx1_skewed.pig' USING
PigStorage();
--------------

The results can be found here:
www.macs.hw.ac.uk/~rs46/WordCount_ScaleUp_Results_version2.pdf
It brings Pig into the equation, though Hive seems most optimal. Remember
that this is just for a GROUP and COUNT (for word count). I will run others
("skewed JOIN" and "Group, Count and Order by") which may show Pig to have
some optimizing properties.

Rob Stewart
