On Sep 13, 2007, at 2:20 AM, Taeho Kang wrote:
> I ran the WordCount example included in the 0.14.1 release on a 1-node
> Hadoop cluster (Pentium D with 2GB of RAM).
Thanks for running the benchmark. I'm afraid that with such a small
cluster and data size, you are getting swamped by the start-up costs. I
have not done enough benchmarking of the C++ bindings.
> There were 2 input files (one 4.5MB file + one 36MB file).
> I also took the Combiner out of the Java version of WordCount, as there
> was no Combiner used in the C++ version.
Actually, the wordcount-part.cc example does have a combiner. You would,
however, want to remove the partitioner from that example, since it
forces every key to partition 0. *smile* In hindsight, using the bad
partitioner in an example wasn't a good idea; I should move it to a
test case.
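
For the archives, the relevant parts of wordcount-part.cc look roughly
like the sketch below. This is from memory of the current Pipes API, so
treat the exact signatures as approximate rather than gospel:

#include <string>
#include <vector>
#include "hadoop/Pipes.hh"
#include "hadoop/TemplateFactory.hh"
#include "hadoop/StringUtils.hh"

class WordCountMap: public HadoopPipes::Mapper {
public:
  WordCountMap(HadoopPipes::TaskContext& context) {}
  void map(HadoopPipes::MapContext& context) {
    // Split the input line on spaces and emit each word with a count of 1.
    std::vector<std::string> words =
        HadoopUtils::splitString(context.getInputValue(), " ");
    for (unsigned int i = 0; i < words.size(); ++i) {
      context.emit(words[i], "1");
    }
  }
};

class WordCountReduce: public HadoopPipes::Reducer {
public:
  WordCountReduce(HadoopPipes::TaskContext& context) {}
  void reduce(HadoopPipes::ReduceContext& context) {
    // Sum all of the counts emitted for this key.
    int sum = 0;
    while (context.nextValue()) {
      sum += HadoopUtils::toInt(context.getInputValue());
    }
    context.emit(context.getInputKey(), HadoopUtils::toString(sum));
  }
};

// The partitioner to delete before benchmarking: it forces every key
// into partition 0, so a single reduce does all of the work.
class WordCountPartitioner: public HadoopPipes::Partitioner {
public:
  WordCountPartitioner(HadoopPipes::TaskContext& context) {}
  virtual int partition(const std::string& key, int numOfReduces) {
    return 0;
  }
};

int main(int argc, char *argv[]) {
  // The fourth template argument plugs WordCountReduce in as the
  // combiner, which is why the C++ job does combine after all.
  return HadoopPipes::runTask(
      HadoopPipes::TemplateFactory<WordCountMap, WordCountReduce,
                                   WordCountPartitioner,
                                   WordCountReduce>());
}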
> The result is... as many of you have guessed, the Java version won the
> race big time. The Java version was about 4 times quicker.
I'll write a sort benchmark for C++ so that we can run a reasonably
large program. Note that for simple programs, the C++ version is
necessarily slower, since Pipes runs the C++ code as a subprocess
underneath a Java mapper and reducer, and every record has to cross
that process boundary.
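
For anyone trying to reproduce the comparison, a Pipes job is submitted
roughly as follows (the paths are hypothetical, and the exact -jobconf
keys may vary between releases); the Java framework handles the record
reading and writing while the binary named by -program runs as the
child process:

bin/hadoop fs -put wordcount bin/wordcount
bin/hadoop pipes \
    -jobconf hadoop.pipes.java.recordreader=true \
    -jobconf hadoop.pipes.java.recordwriter=true \
    -program bin/wordcount \
    -input input-dir \
    -output output-dir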
-- Owen