Hello,
we are considering using Hadoop to process some data.
We have it set up on 8 nodes (1 master + 7 slaves)
and filled the cluster with files containing tab-delimited data:
string \tab string etc.
Then we ran the grep example with a regular expression to count the
occurrences of each unique starting string.
We had 3,500 files containing 3,015,294 lines, totaling 5 GB.
To benchmark it we ran:

bin/hadoop jar hadoop-0.17.0-examples.jar grep data/* output '^[a-zA-Z]+\t'

which took 26 minutes.
Then, for comparison, we ran this bash pipeline on one of the nodes, which
produced the same output from the data:

cat * | sed -e s/\ .*// | sort | uniq -c > /tmp/out

(the character in the sed regexp is a literal tab, not a space)

which took 2.5 minutes.
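For what it's worth, the same first-field count can be written without the literal-tab escaping issue by using cut, which splits on tabs by default. A minimal sketch (the file name and sample contents below are made up for illustration):

```shell
# Hypothetical sample of the tab-delimited data described above.
printf 'foo\t1\nbar\t2\nfoo\t3\n' > /tmp/sample.tsv

# cut -f1 extracts the first tab-delimited field, so no escaped tab is
# needed in a sed expression; then count each distinct value as before.
cut -f1 /tmp/sample.tsv | sort | uniq -c
# prints a count next to each distinct first field (bar: 1, foo: 2)
```

This is equivalent to the sed|sort|uniq pipeline, just less error-prone to type.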
Then we loaded 10X the data into the cluster and reran the Hadoop job; it
took 214 minutes, which is less than 10X the time, but still not that impressive.
So we are seeing roughly a 10X performance penalty for using Hadoop vs. the
system commands. Is that expected? We were expecting Hadoop to be faster
since it is distributed. Is there too much overhead involved here, or is the
data too small?