Not sure what the scope of the experiment is, but some useful comparisons could be against:
a) a job written directly against the mapred API.
b) Hadoop streaming.
c) Pig streaming.

It also depends on the actual script/job being run: whether it uses combiners or
multiple outputs, the 'depth' of the pipeline, how many MapReduce jobs it ends up
compiling into, etc.



If you are interested only in testing how Pig scales, then interesting parameters
to vary could be (a rough sweep-driver sketch follows the list):
a) size of the input.
b) with/without compression.
c) number of mappers.
d) number of reducers.
e) output size (depending on what you are running, I guess).
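
A minimal Python sketch of such a sweep driver. The script name, input paths,
and the INPUT/REDUCERS parameter names are assumptions for illustration, not
anything from this thread; the .pig script would have to read $INPUT and use
$REDUCERS in a PARALLEL clause for the reducer knob to take effect.

    #!/usr/bin/env python
    # Hypothetical sweep driver: times one Pig run per (input, reducers)
    # combination.  Script name, paths, and parameter names are assumed.
    import subprocess
    import time

    PIG_SCRIPT = "job.pig"                   # assumed script name
    INPUTS = ["/data/1gb", "/data/10gb"]     # assumed HDFS input paths
    REDUCERS = [1, 4, 16]                    # reducer counts to try

    def run_pig(input_path, reducers):
        """Run the script once and return wall-clock seconds."""
        cmd = ["pig",
               "-param", "INPUT=%s" % input_path,   # read as $INPUT in the script
               "-param", "REDUCERS=%d" % reducers,  # used in a PARALLEL $REDUCERS clause
               "-f", PIG_SCRIPT]
        start = time.time()
        subprocess.check_call(cmd)
        return time.time() - start

    if __name__ == "__main__":
        for inp in INPUTS:
            for red in REDUCERS:
                elapsed = run_pig(inp, red)
                print("input=%s reducers=%d time=%.1fs" % (inp, red, elapsed))

Wall-clock time around the pig invocation includes job-submission overhead,
which is usually what you want when measuring end-to-end scaling.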


Regards,
Mridul


On Thursday 21 April 2011 01:27 AM, Lai Will wrote:
Hi there,

I'm planning to do some performance measurements of my Hadoop Pig code in order
to see how it scales.
Does anyone have suggestions on how to do that?

I thought of measuring the time needed for completion on a fixed cluster size
while increasing the input data, and then fixing the input data and adding
cluster nodes. Does anyone have experience doing that? I thought of writing a
script that starts/stops a timer and executes the pig command. Maybe there's a
better way?

Best,
Will
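
For the fixed-input, growing-cluster runs described above, the numbers are
usually summarized as speedup and parallel efficiency. A small sketch; the
measured times below are placeholders, not real results:

    # Hypothetical post-processing for the fixed-input, growing-cluster runs.
    # The node counts and times are placeholder values.
    times = {1: 3600.0, 2: 1900.0, 4: 1050.0, 8: 640.0}  # nodes -> seconds

    base = times[1]
    for nodes in sorted(times):
        speedup = base / times[nodes]    # S(n) = T(1) / T(n)
        efficiency = speedup / nodes     # E(n) = S(n) / n
        print("%d nodes: speedup %.2fx, efficiency %.0f%%"
              % (nodes, speedup, efficiency * 100))

Efficiency well below 100% as nodes are added usually points at fixed per-job
overhead or skewed reducers rather than a problem with the Pig script itself.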
