Hello,
we are considering using Hadoop to process some data.
We have it set up on 8 nodes (1 master + 7 slaves)
and filled the cluster with files containing tab-delimited data:
string \tab string etc.
Then we ran the grep example with a regular expression to count the
occurrences of each unique starting string.
We had 3,500 files containing 3,015,294 lines, totaling 5 GB.
To benchmark it we ran:

bin/hadoop jar hadoop-0.17.0-examples.jar grep data/* output '^[a-zA-Z]+\t'

which took 26 minutes.
Then, for comparison, we ran this bash pipeline on one of the nodes, which
produced the same output from the data:

cat * | sed -e s/\ .*// | sort | uniq -c > /tmp/out

(the character in the sed regexp is a literal tab, not a space)

which took 2.5 minutes.
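For what it's worth, the same first-field count can be written without the literal-tab escaping issue by using cut, which splits on tabs by default. A minimal sketch (the file name and sample contents below are made up for illustration):

```shell
# Hypothetical sample of the tab-delimited data described above.
printf 'foo\t1\nbar\t2\nfoo\t3\n' > /tmp/sample.tsv

# cut -f1 extracts the first tab-delimited field, so no escaped tab is
# needed in a sed expression; then count each distinct value as before.
cut -f1 /tmp/sample.tsv | sort | uniq -c
# prints a count next to each distinct first field (bar: 1, foo: 2)
```

This is equivalent to the sed|sort|uniq pipeline, just less error-prone to type.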
Then we loaded 10X the data into the cluster and reran the Hadoop job; it
took 214 minutes, which is less than 10X the time, but still not that impressive.
So we are seeing roughly a 10X performance penalty for using Hadoop vs. the
system commands. Is that expected? We were expecting Hadoop to be faster
since it is distributed. Is there too much overhead involved here, or is the
data too small?