Hello,

We were considering using Hadoop to process some data.
We have it set up on 8 nodes (1 master + 7 slaves).

We filled the cluster with files containing tab-delimited data:
string \t string etc.
Then we ran the example grep with a regular expression to count the occurrences of each unique starting string.
We had 3,500 files containing 3,015,294 lines, totaling 5 GB.
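
For reference, the input is just string<TAB>string lines; a loop like the one below (a made-up illustration with hypothetical key/value names, not our actual data) produces local files of the same shape, which could then be copied into the cluster with bin/hadoop dfs -put:

mkdir -p data
for i in $(seq 1 10); do
  # synthetic key<TAB>value lines in the same format as our input
  seq 1 1000 | awk '{printf "key%d\tvalue%d\n", $1 % 100, $1}' > data/file$i.txt
done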

To benchmark it, we ran:
bin/hadoop jar hadoop-0.17.0-examples.jar grep data/* output '^[a-zA-Z]+\t'
It took 26 minutes.
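
For anyone reproducing this, wrapping the command in time and peeking at the output should be enough; part-00000 below assumes the example's default single reduce output file, so adjust if your layout differs:

time bin/hadoop jar hadoop-0.17.0-examples.jar grep data/* output '^[a-zA-Z]+\t'
bin/hadoop dfs -cat output/part-00000 | head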

Then, to compare, we ran this bash command on one of the nodes, which produced the same output from the data:

cat * | sed -e 's/\t.*//' | sort | uniq -c > /tmp/out
(the \t in the sed expression is a tab character, not spaces)

It took 2.5 minutes.
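
An equivalent pipeline that avoids the literal-tab quoting is the one below; cut splits on tabs by default and keeps just the first field:

cat * | cut -f1 | sort | uniq -c > /tmp/out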

Then we loaded 10x the data into the cluster and reran the Hadoop job. It took 214 minutes, which is less than 10x the time, but still not that impressive.
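
In throughput terms (a back-of-envelope calculation, treating 1 GB as 1024 MB):

echo "scale=1; 5*1024/(26*60)" | bc      # Hadoop, 5 GB: about 3 MB/s
echo "scale=1; 5*1024/(2.5*60)" | bc     # single-node pipeline, 5 GB: about 34 MB/s
echo "scale=1; 50*1024/(214*60)" | bc    # Hadoop, 50 GB: about 4 MB/s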


So we are seeing roughly a 10x performance penalty for using Hadoop versus the system commands.
Is that expected?
We were expecting Hadoop to be faster, since it is distributed.
Perhaps there is too much overhead involved here?
Is the data too small?
