On Jun 11, 2008, at 11:53 AM, Elia Mazzawi wrote:
We concatenated the files to bring them close to, but under, 64 MB each,
and the difference was huge without changing anything else:
we went from 214 minutes to 3 minutes!
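(For anyone repeating this: a minimal sketch of the packing step, assuming
plain local files and standard POSIX tools; the function name and paths are
illustrative, not from the original thread.)

```shell
#!/bin/sh
# pack_files SRC_DIR OUT_DIR LIMIT_BYTES
# Concatenate the files in SRC_DIR into OUT_DIR/part-N files, each at
# most LIMIT_BYTES (e.g. 67108864 for the 64 MB HDFS block size).
pack_files() {
  src=$1 out_dir=$2 limit=$3
  mkdir -p "$out_dir"
  n=0 size=0
  out="$out_dir/part-$n"
  : > "$out"
  for f in "$src"/*; do
    fsize=$(wc -c < "$f")
    # Roll over to a new part once this file would push us past the limit.
    if [ "$size" -gt 0 ] && [ $((size + fsize)) -gt "$limit" ]; then
      n=$((n + 1)); size=0
      out="$out_dir/part-$n"
      : > "$out"
    fi
    cat "$f" >> "$out"
    size=$((size + fsize))
  done
}
```

The packed parts can then be uploaded in place of the original small files,
e.g. with the 0.17-era bin/hadoop dfs -put.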
*smile*
How many reduces are you running now? 1 or more?
Arun
Elia Mazzawi wrote:
Thanks for the suggestions,
I'm going to rerun the same test with files close to (but under) 64 MB,
first with 7 and then with 14 reducers.
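(A sketch of one way to set the reducer count in this Hadoop version: the
mapred.reduce.tasks property, shown here as a conf/hadoop-site.xml entry;
the value 14 is just the experiment above, not a recommendation.)

```xml
<!-- conf/hadoop-site.xml: default number of reduce tasks per job. -->
<property>
  <name>mapred.reduce.tasks</name>
  <value>14</value>
</property>
```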
We've done another test to see whether more servers would speed up the
cluster.
With 2 nodes down it took 322 minutes on the 10X data, that's about 5.4
hours, vs 214 minutes with all nodes online.
We started the test after HDFS marked the nodes as dead, and there
were no timeouts.
322/214 ≈ 1.50, i.e. about 50% more time with 5/7 ≈ 71% of the servers.
So our conclusion is that more servers will make the cluster faster.
Ashish Thusoo wrote:
Try first just reducing the number of files and increasing the data in
each file, so you have close to 64 MB of data per file. In your case
that would amount to about 700-800 files in the 10X test case (instead
of the 35,000 that you have). See if that gives substantially better
results on your larger test case. For the smaller one, I don't think
you will be able to do better than the unix command - the data set is
too small.
Ashish
-----Original Message-----
From: Elia Mazzawi [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, June 10, 2008 5:00 PM
To: core-user@hadoop.apache.org
Subject: Re: hadoop benchmarked, too slow to use
So it would make sense for me to configure Hadoop for smaller chunks?
Elia Mazzawi wrote:
Yes, the chunk size was 64 MB, and each file has some data. The job used
7 mappers and 1 reducer.
10X the data took 214 minutes, vs 26 minutes for the smaller set.
I uploaded the same data 10 times into different directories (so more
files of the same size).
Ashish Thusoo wrote:
Apart from the setup times, the fact that you have 3500 files means
that you are going after around 220 GB of data, as each file would take
at least one chunk (this calculation assumes a chunk size of 64 MB and
that each file has at least some data). Mappers would probably need to
read this much data, and with 7 nodes you may just have 14 map slots.
I may be wrong here, but just out of curiosity, how many mappers does
your job use?
I don't know why the 10X data was not better, though, if the bad
performance of the smaller test case was due to fragmentation.
For that test, did you also increase the number of files, or did you
simply increase the amount of data in each file?
Plus, on small data sets (on the order of 2-3 GB), unix commands
can't really be beaten :)
Ashish
-----Original Message-----
From: Elia Mazzawi [mailto:[EMAIL PROTECTED] Sent:
Tuesday, June 10, 2008 3:56 PM
To: core-user@hadoop.apache.org
Subject: hadoop benchmarked, too slow to use
Hello,
We were considering using Hadoop to process some data. We have it set
up on 8 nodes (1 master + 7 slaves).
We filled the cluster with files that contain tab-delimited data:
string \tab string etc.
Then we ran the grep example with a regular expression to count the
number of each unique starting string.
We had 3500 files containing 3,015,294 lines, totaling 5 GB.
To benchmark it we ran:
bin/hadoop jar hadoop-0.17.0-examples.jar grep data/* output '^[a-zA-Z]+\t'
It took 26 minutes.
Then, to compare, we ran this bash command on one of the nodes, which
produced the same output from the data:
cat * | sed -e s/\ .*// | sort | uniq -c > /tmp/out
(the character in the sed expression is a tab, not a space)
which took 2.5 minutes.
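(As an aside, a sketch of an equivalent pipeline that sidesteps the
literal-tab-in-sed issue by using cut, whose default field delimiter is
already a tab; the function name is illustrative, not from the thread.)

```shell
#!/bin/sh
# count_first_fields FILE...
# Count occurrences of each unique first tab-delimited field across the
# given files. cut splits on tabs by default, so no literal tab is
# needed on the command line.
count_first_fields() {
  cut -f1 "$@" | sort | uniq -c
}
```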
Then we added 10X the data into the cluster and reran Hadoop. It took
214 minutes, which is less than 10X the time, but still not that
impressive.
So we are seeing a ~10X performance penalty for using Hadoop vs the
system commands. Is that expected?
We were expecting Hadoop to be faster since it is distributed.
Perhaps there is too much overhead involved here?
Is the data too small?