Re: hadoop benchmarked, too slow to use

2008-06-11 Thread Elia Mazzawi

Thanks for the suggestions,

I'm going to rerun the same test with files close to 64Mb, and with 7 and
then 14 reducers.



we've done another test to see if more servers would speed up the cluster,

with 2 nodes down it took 322 minutes on the 10X data (about 5.4 hours)
vs 214 minutes with all nodes online.
we started the test after hdfs marked the nodes as dead, and there were no
timeouts.


322/214 = ~1.5, i.e. about 50% more time with 5/7 = ~71% of the servers.

so our conclusion is that more servers will make the cluster faster.
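
As a rough sanity check on that (just arithmetic on the numbers above,
assuming the job parallelizes evenly over whichever nodes are alive):

echo "scale=2; 322/214" | bc   # ~1.50x the run time with 5/7 (~71%) of the servers
echo "214 * 7 / 5" | bc        # ideal linear scaling would predict ~300 minutes

so the run with two nodes down is close to what linear scaling would predict.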



Ashish Thusoo wrote:

Try by first just reducing the number of files and increasing the data
in each file so you have close to 64MB of data per file. So in your case
that would amount to about 700-800 files in the 10X test case (instead
of 35000 that you have). See if that gives substantially better results
on your larger test case. For the smaller one, I don't think you will be
able to do better than the unix command - the data set is too small.
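
Concretely, one way to do that merge locally before loading HDFS might look
something like this (directory names are just placeholders, and the dfs -put
at the end is only a sketch):

#!/bin/sh
# merge many small tab-delimited files into ~64 MB chunks before uploading
SRC=./data            # directory with the small input files (placeholder)
DST=./data-merged     # merged output, one file per ~64 MB of input (placeholder)
LIMIT=$((64 * 1024 * 1024))
mkdir -p "$DST"
i=0
: > "$DST/part-$i"
for f in "$SRC"/*; do
    cat "$f" >> "$DST/part-$i"
    # start a new output file once the current one reaches ~64 MB
    if [ "$(wc -c < "$DST/part-$i")" -ge "$LIMIT" ]; then
        i=$((i + 1))
        : > "$DST/part-$i"
    fi
done
# then something like: bin/hadoop dfs -put "$DST" data-merged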

Ashish 


-Original Message-
From: Elia Mazzawi [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, June 10, 2008 5:00 PM

To: core-user@hadoop.apache.org
Subject: Re: hadoop benchmarked, too slow to use

so it would make sense for me to configure hadoop for smaller chunks?

Elia Mazzawi wrote:
  

yes chunk size was 64mb, and each file has some data; it used 7 mappers
and 1 reducer.

10X the data took 214 minutes
vs 26 minutes for the smaller set

i uploaded the same data 10 times in different directories ( so more 
files, same size )



Ashish Thusoo wrote:

Apart from the setup times, the fact that you have 3500 files means
that you are going after around 220GB of data as each file would have
at least one chunk (this calculation is assuming a chunk size of 64MB
and this assumes that each file has at least some data). Mappers would
probably need to read up this amount of data and with 7 nodes you may
just have 14 map slots. I may be wrong here, but just out of curiosity,
how many mappers does your job use?

Don't know why the 10X data was not better though, if the bad
performance of the smaller test case was due to fragmentation. For
that test did you also increase the number of files, or did you
simply increase the amount of data in each file?


Plus on small sets (of the order of 2-3 GB) of data unix commands 
can't really be beaten :)


Ashish
-Original Message-
From: Elia Mazzawi [mailto:[EMAIL PROTECTED] Sent: 
Tuesday, June 10, 2008 3:56 PM

To: core-user@hadoop.apache.org
Subject: hadoop benchmarked, too slow to use

Hello,

we were considering using hadoop to process some data, we have it set
up on 8 nodes ( 1 master + 7 slaves)

we filled the cluster up with files that contain tab delimited data.
string \tab string etc
then we ran the example grep with a regular expression to count the 
number of each unique starting string.

we had 3500 files containing 3,015,294 lines totaling 5 GB.

to benchmark it we ran
bin/hadoop jar hadoop-0.17.0-examples.jar grep data/*  output 
'^[a-zA-Z]+\t'

it took 26 minutes

then to compare, we ran this bash command on one of the nodes, which 
produced the same output out of the data:


cat * | sed -e s/\  .*// |sort | uniq -c > /tmp/out (sed regexpr is
tab not spaces)


which took 2.5 minutes

Then we added 10X the data into the cluster and reran Hadoop, it took
214 minutes which is less than 10X the time, but still not that 
impressive.



so we are seeing a 10X performance penalty for using Hadoop vs the 
system commands, is that expected?

we were expecting hadoop to be faster since it is distributed?
perhaps there is too much overhead involved here?
is the data too small?
  
  


  




Re: hadoop benchmarked, too slow to use

2008-06-11 Thread Arun C Murthy


On Jun 11, 2008, at 11:53 AM, Elia Mazzawi wrote:



we concatenated the files to bring them close to and less than 64mb  
and the difference was huge without changing anything else

we went from 214 minutes to 3 minutes !



*smile*

How many reduces are you running now? 1 or more?

Arun


Elia Mazzawi wrote:

Thanks for the suggestions,

I'm going to rerun the same test with close to  64Mb files and 7  
then 14 reducers.



we've done another test to see if more servers would speed up the  
cluster,


with 2 nodes down it took 322 minutes on the 10X data (about 5.4 hours)
vs 214 minutes with all nodes online.
started the test after hdfs marked the nodes as dead, and there
were no timeouts.


322/214 = ~1.5, i.e. about 50% more time with 5/7 = ~71% of the servers.

so our conclusion is that more servers will make the cluster faster.



Ashish Thusoo wrote:
Try by first just reducing the number of files and increasing the data
in each file so you have close to 64MB of data per file. So in your case
that would amount to about 700-800 files in the 10X test case (instead
of 35000 that you have). See if that gives substantially better results
on your larger test case. For the smaller one, I don't think you will be
able to do better than the unix command - the data set is too small.


Ashish
-Original Message-
From: Elia Mazzawi [mailto:[EMAIL PROTECTED] Sent:  
Tuesday, June 10, 2008 5:00 PM

To: core-user@hadoop.apache.org
Subject: Re: hadoop benchmarked, too slow to use

so it would make sense for me to configure hadoop for smaller chunks?


Elia Mazzawi wrote:

yes chunk size was 64mb, and each file has some data; it used 7 mappers
and 1 reducer.

10X the data took 214 minutes
vs 26 minutes for the smaller set

i uploaded the same data 10 times in different directories ( so  
more files, same size )



Ashish Thusoo wrote:

Apart from the setup times, the fact that you have 3500 files means
that you are going after around 220GB of data as each file would
have at least one chunk (this calculation is assuming a chunk size of
64MB and this assumes that each file has at least some data).
Mappers would probably need to read up this amount of data and with
7 nodes you may just have 14 map slots. I may be wrong here, but just
out of curiosity, how many mappers does your job use?

Don't know why the 10X data was not better though, if the bad
performance of the smaller test case was due to fragmentation.
For that test did you also increase the number of files, or did
you simply increase the amount of data in each file?


Plus on small sets (of the order of 2-3 GB) of data unix  
commands can't really be beaten :)


Ashish
-Original Message-
From: Elia Mazzawi [mailto:[EMAIL PROTECTED] Sent:  
Tuesday, June 10, 2008 3:56 PM

To: core-user@hadoop.apache.org
Subject: hadoop benchmarked, too slow to use

Hello,

we were considering using hadoop to process some data, we have it set
up on 8 nodes ( 1 master + 7 slaves)

we filled the cluster up with files that contain tab delimited  
data.

string \tab string etc
then we ran the example grep with a regular expression to count  
the number of each unique starting string.

we had 3500 files containing 3,015,294 lines totaling 5 GB.

to benchmark it we ran
bin/hadoop jar hadoop-0.17.0-examples.jar grep data/* output '^[a-zA-Z]+\t'

it took 26 minutes

then to compare, we ran this bash command on one of the nodes,  
which produced the same output out of the data:


cat * | sed -e s/\  .*// |sort | uniq -c > /tmp/out (sed
regexpr is tab not spaces)


which took 2.5 minutes

Then we added 10X the data into the cluster and reran Hadoop,  
it took
214 minutes which is less than 10X the time, but still not that  
impressive.



so we are seeing a 10X performance penalty for using Hadoop vs  
the system commands, is that expected?

we were expecting hadoop to be faster since it is distributed?
perhaps there is too much overhead involved here?
is the data too small?












Re: hadoop benchmarked, too slow to use

2008-06-11 Thread Elia Mazzawi
that was with 7 reducers, but i meant to run it with 1. I'll re-run to 
compare.


Arun C Murthy wrote:


On Jun 11, 2008, at 11:53 AM, Elia Mazzawi wrote:



we concatenated the files to bring them close to and less than 64mb 
and the difference was huge without changing anything else

we went from 214 minutes to 3 minutes !



*smile*

How many reduces are you running now? 1 or more?

Arun


Elia Mazzawi wrote:

Thanks for the suggestions,

I'm going to rerun the same test with close to  64Mb files and 7 
then 14 reducers.



we've done another test to see if more servers would speed up the 
cluster,


with 2 nodes down it took 322 minutes on the 10X data (about 5.4 hours)
vs 214 minutes with all nodes online.
started the test after hdfs marked the nodes as dead, and there were
no timeouts.


322/214 = ~1.5, i.e. about 50% more time with 5/7 = ~71% of the servers.

so our conclusion is that more servers will make the cluster faster.



Ashish Thusoo wrote:

Try by first just reducing the number of files and increasing the data
in each file so you have close to 64MB of data per file. So in your 
case

that would amount to about 700-800 files in the 10X test case (instead
of 35000 that you have). See if that gives substantially better results
on your larger test case. For the smaller one, I don't think you 
will be

able to do better than the unix  command - the data set is too small.

Ashish
-Original Message-
From: Elia Mazzawi [mailto:[EMAIL PROTECTED] Sent: 
Tuesday, June 10, 2008 5:00 PM

To: core-user@hadoop.apache.org
Subject: Re: hadoop benchmarked, too slow to use

so it would make sense for me to configure hadoop for smaller chunks?

Elia Mazzawi wrote:

yes chunk size was 64mb, and each file has some data; it used 7 mappers
and 1 reducer.

10X the data took 214 minutes
vs 26 minutes for the smaller set

i uploaded the same data 10 times in different directories ( so 
more files, same size )



Ashish Thusoo wrote:

Apart from the setup times, the fact that you have 3500 files means
that you are going after around 220GB of data as each file would
have at least one chunk (this calculation is assuming a chunk size of
64MB and this assumes that each file has at least some data).
Mappers would probably need to read up this amount of data and with
7 nodes you may just have 14 map slots. I may be wrong here, but just
out of curiosity, how many mappers does your job use?

Don't know why the 10X data was not better though, if the bad
performance of the smaller test case was due to fragmentation.
For that test did you also increase the number of files, or did
you simply increase the amount of data in each file?


Plus on small sets (of the order of 2-3 GB) of data unix commands 
can't really be beaten :)


Ashish
-Original Message-
From: Elia Mazzawi [mailto:[EMAIL PROTECTED] Sent: 
Tuesday, June 10, 2008 3:56 PM

To: core-user@hadoop.apache.org
Subject: hadoop benchmarked, too slow to use

Hello,

we were considering using hadoop to process some data, we have it set
up on 8 nodes ( 1 master + 7 slaves)

we filled the cluster up with files that contain tab delimited data.
string \tab string etc
then we ran the example grep with a regular expression to count 
the number of each unique starting string.

we had 3500 files containing 3,015,294 lines totaling 5 GB.

to benchmark it we ran
bin/hadoop jar hadoop-0.17.0-examples.jar grep data/*  output 
'^[a-zA-Z]+\t'

it took 26 minutes

then to compare, we ran this bash command on one of the nodes, 
which produced the same output out of the data:


cat * | sed -e s/\  .*// |sort | uniq -c > /tmp/out (sed regexpr
is tab not spaces)


which took 2.5 minutes

Then we added 10X the data into the cluster and reran Hadoop, it 
took
214 minutes which is less than 10X the time, but still not that 
impressive.



so we are seeing a 10X performance penalty for using Hadoop vs 
the system commands, is that expected?

we were expecting hadoop to be faster since it is distributed?
perhaps there is too much overhead involved here?
is the data too small?














Re: hadoop benchmarked, too slow to use

2008-06-11 Thread Ted Dunning
Yes.  That does count as huge.

Congratulations!

On Wed, Jun 11, 2008 at 11:53 AM, Elia Mazzawi [EMAIL PROTECTED]
wrote:


 we concatenated the files to bring them close to and less than 64mb and the
 difference was huge without changing anything else
 we went from 214 minutes to 3 minutes !




-- 
ted


RE: hadoop benchmarked, too slow to use

2008-06-11 Thread Ashish Thusoo
good to know... this puppy does scale :) and hadoop is awesome for what
it does...

Ashish

-Original Message-
From: Elia Mazzawi [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, June 11, 2008 11:54 AM
To: core-user@hadoop.apache.org
Subject: Re: hadoop benchmarked, too slow to use


we concatenated the files to bring them close to and less than 64mb and
the difference was huge without changing anything else. we went from 214
minutes to 3 minutes !

Elia Mazzawi wrote:
 Thanks for the suggestions,

 I'm going to rerun the same test with close to  64Mb files and 7 then
 14 reducers.


 we've done another test to see if more servers would speed up the 
 cluster,

 with 2 nodes down it took 322 minutes on the 10X data (about 5.4 hours) vs
 214 minutes with all nodes online.
 started the test after hdfs marked the nodes as dead, and there were
 no timeouts.

 322/214 = ~1.5, i.e. about 50% more time with 5/7 = ~71% of the servers.

 so our conclusion is that more servers will make the cluster faster.



 Ashish Thusoo wrote:
 Try by first just reducing the number of files and increasing the
 data in each file so you have close to 64MB of data per file. So in
 your case that would amount to about 700-800 files in the 10X test
 case (instead of 35000 that you have). See if that gives substantially
 better results on your larger test case. For the smaller one, I don't
 think you will be able to do better than the unix command - the data
 set is too small.

 Ashish
 -Original Message-
 From: Elia Mazzawi [mailto:[EMAIL PROTECTED] Sent: 
 Tuesday, June 10, 2008 5:00 PM
 To: core-user@hadoop.apache.org
 Subject: Re: hadoop benchmarked, too slow to use

 so it would make sense for me to configure hadoop for smaller chunks?

 Elia Mazzawi wrote:
  
 yes chunk size was 64mb, and each file has some data; it used 7
 mappers and 1 reducer.

 10X the data took 214 minutes
 vs 26 minutes for the smaller set

 i uploaded the same data 10 times in different directories ( so more
 files, same size )


 Ashish Thusoo wrote:

 Apart from the setup times, the fact that you have 3500 files means
 that you are going after around 220GB of data as each file would
 have at least one chunk (this calculation is assuming a chunk size of
 64MB and this assumes that each file has at least some data).
 Mappers would probably need to read up this amount of data and with
 7 nodes you may just have 14 map slots. I may be wrong here, but
 just out of curiosity, how many mappers does your job use?

 Don't know why the 10X data was not better though, if the bad
 performance of the smaller test case was due to fragmentation. For
 that test did you also increase the number of files, or did you
 simply increase the amount of data in each file?

 Plus on small sets (of the order of 2-3 GB) of data unix commands 
 can't really be beaten :)

 Ashish
 -Original Message-
 From: Elia Mazzawi [mailto:[EMAIL PROTECTED] Sent: 
 Tuesday, June 10, 2008 3:56 PM
 To: core-user@hadoop.apache.org
 Subject: hadoop benchmarked, too slow to use

 Hello,

 we were considering using hadoop to process some data, we have it set
 up on 8 nodes ( 1 master + 7 slaves)

 we filled the cluster up with files that contain tab delimited
data.
 string \tab string etc
 then we ran the example grep with a regular expression to count the

 number of each unique starting string.
 we had 3500 files containing 3,015,294 lines totaling 5 GB.

 to benchmark it we ran
 bin/hadoop jar hadoop-0.17.0-examples.jar grep data/*  output 
 '^[a-zA-Z]+\t'
 it took 26 minutes

 then to compare, we ran this bash command on one of the nodes, 
 which produced the same output out of the data:

 cat * | sed -e s/\  .*// |sort | uniq -c > /tmp/out (sed regexpr is
 tab not spaces)

 which took 2.5 minutes

 Then we added 10X the data into the cluster and reran Hadoop, it 
 took
 214 minutes which is less than 10X the time, but still not that 
 impressive.


 so we are seeing a 10X performance penalty for using Hadoop vs the 
 system commands, is that expected?
 we were expecting hadoop to be faster since it is distributed?
 perhaps there is too much overhead involved here?
 is the data too small?
 

   




hadoop benchmarked, too slow to use

2008-06-10 Thread Elia Mazzawi

Hello,

we were considering using hadoop to process some data,
we have it set up on 8 nodes ( 1 master + 7 slaves)

we filled the cluster up with files that contain tab delimited data.
string \tab string etc
then we ran the example grep with a regular expression to count the 
number of each unique starting string.

we had 3500 files containing 3,015,294 lines totaling 5 GB.

to benchmark it we ran
bin/hadoop jar hadoop-0.17.0-examples.jar grep data/*  output 
'^[a-zA-Z]+\t'

it took 26 minutes

then to compare, we ran this bash command on one of the nodes, which 
produced the same output out of the data:


cat * | sed -e s/\  .*// |sort | uniq -c > /tmp/out
(sed regexpr is tab not spaces)

which took 2.5 minutes

Then we added 10X the data into the cluster and reran Hadoop, it took 
214 minutes which is less than 10X the time, but still not that impressive.



so we are seeing a 10X performance penalty for using Hadoop vs the 
system commands,

is that expected?
we were expecting hadoop to be faster since it is distributed?
perhaps there is too much overhead involved here?
is the data too small?


Re: hadoop benchmarked, too slow to use

2008-06-10 Thread Ashish Venugopal
Just a small note (it does not answer your question, but deals with your
testing command): when running the system command version below, it's
important to test with
sort -k 1 -t $TAB

where TAB is something like:

TAB=`echo \t`

to ensure that you sort by key, rather than the whole line. Sorting by the
whole line can cause your reduce code to seem to work during testing (if you
are testing on the command line), but then not work correctly via Hadoop.
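
For reference, a sketch of that on the command line (printf is used here
because a bare `echo \t` doesn't emit a real tab on every shell; input.tsv
is just a placeholder file name):

TAB=`printf '\t'`
# sort on the first tab-separated field (the key) only, the way the
# framework groups map output, rather than on the whole line:
sort -t "$TAB" -k 1,1 input.tsv > sorted-by-key.tsv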



On Tue, Jun 10, 2008 at 6:56 PM, Elia Mazzawi [EMAIL PROTECTED]
wrote:

 Hello,

 we were considering using hadoop to process some data,
 we have it set up on 8 nodes ( 1 master + 7 slaves)

 we filled the cluster up with files that contain tab delimited data.
 string \tab string etc
 then we ran the example grep with a regular expression to count the number
 of each unique starting string.
 we had 3500 files containing 3,015,294 lines totaling 5 GB.

 to benchmark it we ran
 bin/hadoop jar hadoop-0.17.0-examples.jar grep data/*  output
 '^[a-zA-Z]+\t'
 it took 26 minutes

 then to compare, we ran this bash command on one of the nodes, which
 produced the same output out of the data:

 cat * | sed -e s/\  .*// |sort | uniq -c > /tmp/out
 (sed regexpr is tab not spaces)

 which took 2.5 minutes

 Then we added 10X the data into the cluster and reran Hadoop, it took 214
 minutes which is less than 10X the time, but still not that impressive.


 so we are seeing a 10X performance penalty for using Hadoop vs the system
 commands,
 is that expected?
 we were expecting hadoop to be faster since it is distributed?
 perhaps there is too much overhead involved here?
 is the data too small?



Re: hadoop benchmarked, too slow to use

2008-06-10 Thread Miles Osborne
I suspect that many people are using Hadoop with a moderate number of nodes
and expecting to see a win over a sequential, single node version.  The
result (and I've seen this too) is typically that the single node version
wins hands-down.

Apart from speeding-up the Hadoop job (eg via compression, tweaking magic
parameters and the like), it is better to consider jobs that are simply too
large to fit into memory on a single machine. Under those circumstances
--and with many machines-- you will see Hadoop winning.  Horses for courses.

(Another dimension is redundancy; we have had a long-running Nutch setup
downloading and parsing blog posts. Only very recently was it noticed that
one of our machines was in fact dead. The show still went on and nothing
was lost.)

Miles

2008/6/10 Elia Mazzawi [EMAIL PROTECTED]:

 Hello,

 we were considering using hadoop to process some data,
 we have it set up on 8 nodes ( 1 master + 7 slaves)

 we filled the cluster up with files that contain tab delimited data.
 string \tab string etc
 then we ran the example grep with a regular expression to count the number
 of each unique starting string.
 we had 3500 files containing 3,015,294 lines totaling 5 GB.

 to benchmark it we ran
 bin/hadoop jar hadoop-0.17.0-examples.jar grep data/*  output
 '^[a-zA-Z]+\t'
 it took 26 minutes

 then to compare, we ran this bash command on one of the nodes, which
 produced the same output out of the data:

 cat * | sed -e s/\  .*// |sort | uniq -c > /tmp/out
 (sed regexpr is tab not spaces)

 which took 2.5 minutes

 Then we added 10X the data into the cluster and reran Hadoop, it took 214
 minutes which is less than 10X the time, but still not that impressive.


 so we are seeing a 10X performance penalty for using Hadoop vs the system
 commands,
 is that expected?
 we were expecting hadoop to be faster since it is distributed?
 perhaps there is too much overhead involved here?
 is the data too small?




-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.


Re: hadoop benchmarked, too slow to use

2008-06-10 Thread Elia Mazzawi

I compared the 2 results and they were the same.
For the system command, the sed before the sort is working properly: I
did ctrl-V then tab to input a tab character in the terminal, and viewed
the result; it's stripping out the rest of the data okay.




Ashish Venugopal wrote:

Just a small note (it does not answer your question, but deals with your
testing command): when running the system command version below, it's
important to test with
sort -k 1 -t $TAB

where TAB is something like:

TAB=`echo \t`

to ensure that you sort by key, rather than the whole line. Sorting by the
whole line can cause your reduce code to seem to work during testing (if you
are testing on the command line), but then not work correctly via Hadoop.



On Tue, Jun 10, 2008 at 6:56 PM, Elia Mazzawi [EMAIL PROTECTED]
wrote:

  

Hello,

we were considering using hadoop to process some data,
we have it set up on 8 nodes ( 1 master + 7 slaves)

we filled the cluster up with files that contain tab delimited data.
string \tab string etc
then we ran the example grep with a regular expression to count the number
of each unique starting string.
we had 3500 files containing 3,015,294 lines totaling 5 GB.

to benchmark it we ran
bin/hadoop jar hadoop-0.17.0-examples.jar grep data/*  output
'^[a-zA-Z]+\t'
it took 26 minutes

then to compare, we ran this bash command on one of the nodes, which
produced the same output out of the data:

cat * | sed -e s/\  .*// |sort | uniq -c > /tmp/out
(sed regexpr is tab not spaces)

which took 2.5 minutes

Then we added 10X the data into the cluster and reran Hadoop, it took 214
minutes which is less than 10X the time, but still not that impressive.


so we are seeing a 10X performance penalty for using Hadoop vs the system
commands,
is that expected?
we were expecting hadoop to be faster since it is distributed?
perhaps there is too much overhead involved here?
is the data too small?




  




RE: hadoop benchmarked, too slow to use

2008-06-10 Thread Joydeep Sen Sarma
how many reducers? Perhaps you are defaulting to one reducer.

One variable is how fast the java regex evaluation is wrt sed. One
option is to use hadoop streaming and use your sed fragment as the mapper.
That will be another way of measuring hadoop overhead that eliminates
some variables.
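
For what it's worth, the streaming version of that test might look roughly
like the following -- the streaming jar path and the -jobconf spelling are
from memory for 0.17 and may differ on your install, output-streaming is a
placeholder, and you may need a literal tab instead of \t if your sed does
not understand it:

bin/hadoop jar contrib/streaming/hadoop-0.17.0-streaming.jar \
    -input data \
    -output output-streaming \
    -mapper "sed -e 's/\t.*//'" \
    -reducer "uniq -c" \
    -jobconf mapred.reduce.tasks=7

The framework sorts map output by key before the reduce, so uniq -c in the
reducer should give per-key counts much like the local pipeline does.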

Hadoop also has quite a few variables to tune performance (check
the hadoop wiki for yahoo's sort benchmark settings, for example).

-Original Message-
From: Elia Mazzawi [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, June 10, 2008 3:56 PM
To: core-user@hadoop.apache.org
Subject: hadoop benchmarked, too slow to use

Hello,

we were considering using hadoop to process some data,
we have it set up on 8 nodes ( 1 master + 7 slaves)

we filled the cluster up with files that contain tab delimited data.
string \tab string etc
then we ran the example grep with a regular expression to count the 
number of each unique starting string.
we had 3500 files containing 3,015,294 lines totaling 5 GB.

to benchmark it we ran
bin/hadoop jar hadoop-0.17.0-examples.jar grep data/*  output 
'^[a-zA-Z]+\t'
it took 26 minutes

then to compare, we ran this bash command on one of the nodes, which 
produced the same output out of the data:

cat * | sed -e s/\  .*// |sort | uniq -c > /tmp/out
(sed regexpr is tab not spaces)

which took 2.5 minutes

Then we added 10X the data into the cluster and reran Hadoop, it took 
214 minutes which is less than 10X the time, but still not that
impressive.


so we are seeing a 10X performance penalty for using Hadoop vs the 
system commands,
is that expected?
we were expecting hadoop to be faster since it is distributed?
perhaps there is too much overhead involved here?
is the data too small?


RE: hadoop benchmarked, too slow to use

2008-06-10 Thread Ashish Thusoo
Apart from the setup times, the fact that you have 3500 files means that
you are going after around 220GB of data as each file would have at least
one chunk (this calculation is assuming a chunk size of 64MB and this
assumes that each file has at least some data). Mappers would probably
need to read up this amount of data and with 7 nodes you may just have
14 map slots. I may be wrong here, but just out of curiosity, how many
mappers does your job use?
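
The arithmetic behind that, using just the numbers in this thread (and
assuming the usual 2 map slots per node):

echo $(( 3500 * 64 / 1024 ))   # ~218 GB of block capacity if every file filled a 64 MB block
echo $(( 3500 / (7 * 2) ))     # 250 waves of map tasks to get through 14 slots

each wave pays the per-task startup cost, which is likely where most of the
26 minutes goes.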

Don't know why the 10X data was not better though, if the bad performance
of the smaller test case was due to fragmentation. For that test did you
also increase the number of files, or did you simply increase the amount
of data in each file?

Plus on small sets (of the order of 2-3 GB) of data unix commands can't
really be beaten :)

Ashish 

-Original Message-
From: Elia Mazzawi [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, June 10, 2008 3:56 PM
To: core-user@hadoop.apache.org
Subject: hadoop benchmarked, too slow to use

Hello,

we were considering using hadoop to process some data, we have it set up
on 8 nodes ( 1 master + 7 slaves)

we filled the cluster up with files that contain tab delimited data.
string \tab string etc
then we ran the example grep with a regular expression to count the
number of each unique starting string.
we had 3500 files containing 3,015,294 lines totaling 5 GB.

to benchmark it we ran
bin/hadoop jar hadoop-0.17.0-examples.jar grep data/*  output
'^[a-zA-Z]+\t'
it took 26 minutes

then to compare, we ran this bash command on one of the nodes, which
produced the same output out of the data:

cat * | sed -e s/\  .*// |sort | uniq -c > /tmp/out (sed regexpr is tab
not spaces)

which took 2.5 minutes

Then we added 10X the data into the cluster and reran Hadoop, it took
214 minutes which is less than 10X the time, but still not that
impressive.


so we are seeing a 10X performance penalty for using Hadoop vs the 
system commands,
is that expected?
we were expecting hadoop to be faster since it is distributed?
perhaps there is too much overhead involved here?
is the data too small?


Re: hadoop benchmarked, too slow to use

2008-06-10 Thread Elia Mazzawi

I could rerun the benchmark on a single node to see what happens.
My concern is that the 8-node setup was 10X slower than the bash command, so
I was starting to suspect that the cluster is not running properly, but
everything looks good in the logs: no timeouts and such.


Miles Osborne wrote:

I suspect that many people are using Hadoop with a moderate number of nodes
and expecting to see a win over a sequential, single node version.  The
result (and I've seen this too) is typically that the single node version
wins hands-down.

Apart from speeding-up the Hadoop job (eg via compression, tweaking magic
parameters and the like), it is better to consider jobs that are simply too
large to fit into memory on a single machine. Under those circumstances
--and with many machines-- you will see Hadoop winning.  Horses for courses.

(Another dimension is redundancy;  we have had a long running Nutch setup
downloading and parsing blog posts. Only very recently was it noticed that
one of our machines was in fact dead.  The show still went on and nothing
was lost)

Miles

2008/6/10 Elia Mazzawi [EMAIL PROTECTED]:

  

Hello,

we were considering using hadoop to process some data,
we have it set up on 8 nodes ( 1 master + 7 slaves)

we filled the cluster up with files that contain tab delimited data.
string \tab string etc
then we ran the example grep with a regular expression to count the number
of each unique starting string.
we had 3500 files containing 3,015,294 lines totaling 5 GB.

to benchmark it we ran
bin/hadoop jar hadoop-0.17.0-examples.jar grep data/*  output
'^[a-zA-Z]+\t'
it took 26 minutes

then to compare, we ran this bash command on one of the nodes, which
produced the same output out of the data:

cat * | sed -e s/\  .*// |sort | uniq -c > /tmp/out
(sed regexpr is tab not spaces)

which took 2.5 minutes

Then we added 10X the data into the cluster and reran Hadoop, it took 214
minutes which is less than 10X the time, but still not that impressive.


so we are seeing a 10X performance penalty for using Hadoop vs the system
commands,
is that expected?
we were expecting hadoop to be faster since it is distributed?
perhaps there is too much overhead involved here?
is the data too small?






  




Re: hadoop benchmarked, too slow to use

2008-06-10 Thread Elia Mazzawi


yes there was only 1 reducer, how many should i try ?


Joydeep Sen Sarma wrote:

how many reducers? Perhaps u are defaulting to one reducer.

One variable is how fast the java regex evaluation is wrt to sed. One
option is to use hadoop streaming and use ur sed fragment as the mapper.
That will be another way of measuring hadoop overhead that eliminates
some variables.

Hadoop also has a quite a few variables to tune performance .. (check
the hadoop wiki for yahoo's sort benchmark settings for example)

-Original Message-
From: Elia Mazzawi [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, June 10, 2008 3:56 PM

To: core-user@hadoop.apache.org
Subject: hadoop benchmarked, too slow to use

Hello,

we were considering using hadoop to process some data,
we have it set up on 8 nodes ( 1 master + 7 slaves)

we filled the cluster up with files that contain tab delimited data.
string \tab string etc
then we ran the example grep with a regular expression to count the 
number of each unique starting string.

we had 3500 files containing 3,015,294 lines totaling 5 GB.

to benchmark it we ran
bin/hadoop jar hadoop-0.17.0-examples.jar grep data/*  output 
'^[a-zA-Z]+\t'

it took 26 minutes

then to compare, we ran this bash command on one of the nodes, which 
produced the same output out of the data:


cat * | sed -e s/\  .*// |sort | uniq -c > /tmp/out
(sed regexpr is tab not spaces)

which took 2.5 minutes

Then we added 10X the data into the cluster and reran Hadoop, it took 
214 minutes which is less than 10X the time, but still not that

impressive.


so we are seeing a 10X performance penalty for using Hadoop vs the 
system commands,

is that expected?
we were expecting hadoop to be faster since it is distributed?
perhaps there is too much overhead involved here?
is the data too small?
  




Re: hadoop benchmarked, too slow to use

2008-06-10 Thread Miles Osborne
Why not do a little experiment and see what the timing results are when
using a range of reducers

eg 1, 2, 5, 7, 13
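
One way to script that sweep, reusing the streaming command sketched earlier
in the thread (again, the jar path and output names are placeholders):

for R in 1 2 5 7 13; do
    bin/hadoop jar contrib/streaming/hadoop-0.17.0-streaming.jar \
        -input data \
        -output output-r$R \
        -mapper "sed -e 's/\t.*//'" \
        -reducer "uniq -c" \
        -jobconf mapred.reduce.tasks=$R
done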

Miles

2008/6/11 Elia Mazzawi [EMAIL PROTECTED]:


 yes there was only 1 reducer, how many should i try ?



 Joydeep Sen Sarma wrote:

 how many reducers? Perhaps u are defaulting to one reducer.

 One variable is how fast the java regex evaluation is wrt to sed. One
 option is to use hadoop streaming and use ur sed fragment as the mapper.
 That will be another way of measuring hadoop overhead that eliminates
 some variables.

 Hadoop also has a quite a few variables to tune performance .. (check
 the hadoop wiki for yahoo's sort benchmark settings for example)

 -Original Message-
 From: Elia Mazzawi [mailto:[EMAIL PROTECTED] Sent: Tuesday,
 June 10, 2008 3:56 PM
 To: core-user@hadoop.apache.org
 Subject: hadoop benchmarked, too slow to use

 Hello,

 we were considering using hadoop to process some data,
 we have it set up on 8 nodes ( 1 master + 7 slaves)

 we filled the cluster up with files that contain tab delimited data.
 string \tab string etc
 then we ran the example grep with a regular expression to count the number
 of each unique starting string.
 we had 3500 files containing 3,015,294 lines totaling 5 GB.

 to benchmark it we ran
 bin/hadoop jar hadoop-0.17.0-examples.jar grep data/*  output
 '^[a-zA-Z]+\t'
 it took 26 minutes

 then to compare, we ran this bash command on one of the nodes, which
 produced the same output out of the data:

 cat * | sed -e s/\  .*// |sort | uniq -c > /tmp/out
 (sed regexpr is tab not spaces)

 which took 2.5 minutes

 Then we added 10X the data into the cluster and reran Hadoop, it took 214
 minutes which is less than 10X the time, but still not that
 impressive.


 so we are seeing a 10X performance penalty for using Hadoop vs the system
 commands,
 is that expected?
 we were expecting hadoop to be faster since it is distributed?
 perhaps there is too much overhead involved here?
 is the data too small?






-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.


Re: hadoop benchmarked, too slow to use

2008-06-10 Thread Elia Mazzawi


yes chunk size was 64mb, and each file has some data
it used 7 mappers and 1 reducer.

10X the data took 214 minutes
vs 26 minutes for the smaller set

i uploaded the same data 10 times in different directories ( so more 
files, same size )



Ashish Thusoo wrote:

Apart from the setup times, the fact that you have 3500 files means that
you are going after around 220GB of data as each file would have atleast
one chunk (this calculation is assuming a chunk size of 64MB and this
assumes that each file has atleast some data). Mappers would probably
need to read up this amount of data and with 7 nodes you may just have
14 map slots. I may be wrong here, but just out of curiosity how many
mappers does your job use.

Don't know why the 10X data was not better though if the bad performance
of the smaller test case was due to fragmentation. For that test did you
also increase the number of files, or did you simply increase the amount
of data in each file.

Plus on small sets (of the order of 2-3 GB) of data unix commands can't
really be beaten :)

Ashish 


-Original Message-
From: Elia Mazzawi [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, June 10, 2008 3:56 PM

To: core-user@hadoop.apache.org
Subject: hadoop benchmarked, too slow to use

Hello,

we were considering using hadoop to process some data, we have it set up
on 8 nodes ( 1 master + 7 slaves)

we filled the cluster up with files that contain tab delimited data.
string \tab string etc
then we ran the example grep with a regular expression to count the
number of each unique starting string.
we had 3500 files containing 3,015,294 lines totaling 5 GB.

to benchmark it we ran
bin/hadoop jar hadoop-0.17.0-examples.jar grep data/*  output
'^[a-zA-Z]+\t'
it took 26 minutes

then to compare, we ran this bash command on one of the nodes, which
produced the same output out of the data:

cat * | sed -e s/\  .*// |sort | uniq -c > /tmp/out (sed regexpr is tab
not spaces)

which took 2.5 minutes

Then we added 10X the data into the cluster and reran Hadoop, it took
214 minutes which is less than 10X the time, but still not that
impressive.


so we are seeing a 10X performance penalty for using Hadoop vs the 
system commands,

is that expected?
we were expecting hadoop to be faster since it is distributed?
perhaps there is too much overhead involved here?
is the data too small?
  




RE: hadoop benchmarked, too slow to use

2008-06-10 Thread Joydeep Sen Sarma
Perhaps something of the order of the number of cores in your system. At least
7 for sure - but perhaps 14 if you have multi-core machines (assuming
tasks per node is 2 at least).

Also - Ashish hit the nail on the head - way too many small files.
Hadoop job overhead is killing you. At this point - I wish I could say -
just use TextMultiFileInputFormat - except there isn't one (and I guess
the nearest alternative - the hadoop archival stuff is not in 17). Bad
luck.

-Original Message-
From: Elia Mazzawi [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, June 10, 2008 4:26 PM
To: core-user@hadoop.apache.org
Subject: Re: hadoop benchmarked, too slow to use


yes there was only 1 reducer, how many should i try ?


Joydeep Sen Sarma wrote:
 how many reducers? Perhaps u are defaulting to one reducer.

 One variable is how fast the java regex evaluation is wrt to sed. One
 option is to use hadoop streaming and use ur sed fragment as the
mapper.
 That will be another way of measuring hadoop overhead that eliminates
 some variables.

 Hadoop also has a quite a few variables to tune performance .. (check
 the hadoop wiki for yahoo's sort benchmark settings for example)

 -Original Message-
 From: Elia Mazzawi [mailto:[EMAIL PROTECTED] 
 Sent: Tuesday, June 10, 2008 3:56 PM
 To: core-user@hadoop.apache.org
 Subject: hadoop benchmarked, too slow to use

 Hello,

 we were considering using hadoop to process some data,
 we have it set up on 8 nodes ( 1 master + 7 slaves)

 we filled the cluster up with files that contain tab delimited data.
 string \tab string etc
 then we ran the example grep with a regular expression to count the 
 number of each unique starting string.
 we had 3500 files containing 3,015,294 lines totaling 5 GB.

 to benchmark it we ran
 bin/hadoop jar hadoop-0.17.0-examples.jar grep data/*  output 
 '^[a-zA-Z]+\t'
 it took 26 minutes

 then to compare, we ran this bash command on one of the nodes, which 
 produced the same output out of the data:

 cat * | sed -e s/\  .*// |sort | uniq -c > /tmp/out
 (sed regexpr is tab not spaces)

 which took 2.5 minutes

 Then we added 10X the data into the cluster and reran Hadoop, it took 
 214 minutes which is less than 10X the time, but still not that
 impressive.


 so we are seeing a 10X performance penalty for using Hadoop vs the 
 system commands,
 is that expected?
 we were expecting hadoop to be faster since it is distributed?
 perhaps there is too much overhead involved here?
 is the data too small?
   



Re: hadoop benchmarked, too slow to use

2008-06-10 Thread Elia Mazzawi


okay I'll try that.
thanks

Joydeep Sen Sarma wrote:

Perhaps something of the order of number of cores in ur system. At least
7 for sure - but perhaps 14 if u have multi-core machines (assuming
tasks per node is 2 at least).

Also - Ashish hit the nail on the head - way too many small files.
Hadoop job overhead is killing you. At this point - I wish I could say -
just use TextMultiFileInputFormat - except there isn't one (and I guess
the nearest alternative - the hadoop archival stuff is not in 17). Bad
luck.

-Original Message-
From: Elia Mazzawi [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, June 10, 2008 4:26 PM

To: core-user@hadoop.apache.org
Subject: Re: hadoop benchmarked, too slow to use


yes there was only 1 reducer, how many should i try ?


Joydeep Sen Sarma wrote:
  

how many reducers? Perhaps u are defaulting to one reducer.

One variable is how fast the java regex evaluation is wrt to sed. One
option is to use hadoop streaming and use ur sed fragment as the


mapper.
  

That will be another way of measuring hadoop overhead that eliminates
some variables.

Hadoop also has a quite a few variables to tune performance .. (check
the hadoop wiki for yahoo's sort benchmark settings for example)

-Original Message-
From: Elia Mazzawi [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, June 10, 2008 3:56 PM

To: core-user@hadoop.apache.org
Subject: hadoop benchmarked, too slow to use

Hello,

we were considering using hadoop to process some data,
we have it set up on 8 nodes ( 1 master + 7 slaves)

we filled the cluster up with files that contain tab delimited data.
string \tab string etc
then we ran the example grep with a regular expression to count the 
number of each unique starting string.

we had 3500 files containing 3,015,294 lines totaling 5 GB.

to benchmark it we ran
bin/hadoop jar hadoop-0.17.0-examples.jar grep data/*  output 
'^[a-zA-Z]+\t'

it took 26 minutes

then to compare, we ran this bash command on one of the nodes, which 
produced the same output out of the data:


cat * | sed -e s/\  .*// |sort | uniq -c > /tmp/out
(sed regexpr is tab not spaces)

which took 2.5 minutes

Then we added 10X the data into the cluster and reran Hadoop, it took 
214 minutes which is less than 10X the time, but still not that

impressive.


so we are seeing a 10X performance penalty for using Hadoop vs the 
system commands,

is that expected?
we were expecting hadoop to be faster since it is distributed?
perhaps there is too much overhead involved here?
is the data too small?
  



  




Re: hadoop benchmarked, too slow to use

2008-06-10 Thread Elia Mazzawi

so it would make sense for me to configure hadoop for smaller chunks?

Elia Mazzawi wrote:


yes chunk size was 64mb, and each file has some data
it used 7 mappers and 1 reducer.

10X the data took 214 minutes
vs 26 minutes for the smaller set

i uploaded the same data 10 times in different directories ( so more 
files, same size )



Ashish Thusoo wrote:

Apart from the setup times, the fact that you have 3500 files means that
you are going after around 220GB of data as each file would have atleast
one chunk (this calculation is assuming a chunk size of 64MB and this
assumes that each file has atleast some data). Mappers would probably
need to read up this amount of data and with 7 nodes you may just have
14 map slots. I may be wrong here, but just out of curiosity how many
mappers does your job use.

Don't know why the 10X data was not better though if the bad performance
of the smaller test case was due to fragmentation. For that test did you
also increase the number of files, or did you simply increase the amount
of data in each file.

Plus on small sets (of the order of 2-3 GB) of data unix commands can't
really be beaten :)

Ashish
-Original Message-
From: Elia Mazzawi [mailto:[EMAIL PROTECTED] Sent: 
Tuesday, June 10, 2008 3:56 PM

To: core-user@hadoop.apache.org
Subject: hadoop benchmarked, too slow to use

Hello,

we were considering using hadoop to process some data, we have it set up
on 8 nodes ( 1 master + 7 slaves)

we filled the cluster up with files that contain tab delimited data.
string \tab string etc
then we ran the example grep with a regular expression to count the
number of each unique starting string.
we had 3500 files containing 3,015,294 lines totaling 5 GB.

to benchmark it we ran
bin/hadoop jar hadoop-0.17.0-examples.jar grep data/*  output
'^[a-zA-Z]+\t'
it took 26 minutes

then to compare, we ran this bash command on one of the nodes, which
produced the same output out of the data:

cat * | sed -e s/\  .*// |sort | uniq -c > /tmp/out (sed regexpr is tab
not spaces)

which took 2.5 minutes

Then we added 10X the data into the cluster and reran Hadoop, it took
214 minutes which is less than 10X the time, but still not that
impressive.


so we are seeing a 10X performance penalty for using Hadoop vs the 
system commands,

is that expected?
we were expecting hadoop to be faster since it is distributed?
perhaps there is too much overhead involved here?
is the data too small?
  






Re: hadoop benchmarked, too slow to use

2008-06-10 Thread Jim R. Wilson
Hey all,

I'm not a world-class hadoop expert yet, so I won't try to advise
anything in particular except to point out that the regular expression
implementation makes a big difference[1]. It may be that Java vs. sed
is an unfair fight.

One valid test /may be/ to run a 1-mapper hadoop job and see how that
fares.  In a perfect world, you'd expect it to run about 7x slower
than in the 7-slave cluster.  How far off it is might tell you
something.

Good luck!

[1] http://swtch.com/~rsc/regexp/regexp1.html

-- Jim R. Wilson (jimbojw)

On Tue, Jun 10, 2008 at 7:14 PM, Ashish Thusoo [EMAIL PROTECTED] wrote:
 Try by first just reducing the number of files and increasing the data
 in each file so you have close to 64MB of data per file. So in your case
 that would amount to about 700-800 files in the 10X test case (instead
 of 35000 that you have). See if that gives substantially better results
 on your larger test case. For the smaller one, I don't think you will be
 able to do better than the unix  command - the data set is too small.

 Ashish

 -Original Message-
 From: Elia Mazzawi [mailto:[EMAIL PROTECTED]
 Sent: Tuesday, June 10, 2008 5:00 PM
 To: core-user@hadoop.apache.org
 Subject: Re: hadoop benchmarked, too slow to use

 so it would make sense for me to configure hadoop for smaller chunks?

 Elia Mazzawi wrote:

 yes chunk size was 64mb, and each file has some data it used 7 mappers

 and 1 reducer.

 10X the data took 214 minutes
 vs 26 minutes for the smaller set

 i uploaded the same data 10 times in different directories ( so more
 files, same size )


 Ashish Thusoo wrote:
 Apart from the setup times, the fact that you have 3500 files means
 that you are going after around 220GB of data as each file would have

 atleast one chunk (this calculation is assuming a chunk size of 64MB
 and this assumes that each file has atleast some data). Mappers would

 probably need to read up this amount of data and with 7 nodes you may

 just have
 14 map slots. I may be wrong here, but just out of curiosity how many

 mappers does your job use.

 Don't know why the 10X data was not better though if the bad
 performance of the smaller test case was due to fragmentation. For
 that test did you also increase the number of files, or did you
 simply increase the amount of data in each file.

 Plus on small sets (of the order of 2-3 GB) of data unix commands
 can't really be beaten :)

 Ashish
 -Original Message-
 From: Elia Mazzawi [mailto:[EMAIL PROTECTED] Sent:
 Tuesday, June 10, 2008 3:56 PM
 To: core-user@hadoop.apache.org
 Subject: hadoop benchmarked, too slow to use

 Hello,

 we were considering using hadoop to process some data, we have it set

 up on 8 nodes ( 1 master + 7 slaves)

 we filled the cluster up with files that contain tab delimited data.
 string \tab string etc
 then we ran the example grep with a regular expression to count the
 number of each unique starting string.
 we had 3500 files containing 3,015,294 lines totaling 5 GB.

 to benchmark it we ran
 bin/hadoop jar hadoop-0.17.0-examples.jar grep data/*  output
 '^[a-zA-Z]+\t'
 it took 26 minutes

 then to compare, we ran this bash command on one of the nodes, which
 produced the same output out of the data:

 cat * | sed -e s/\  .*// |sort | uniq -c > /tmp/out (sed regexpr is
 tab not spaces)

 which took 2.5 minutes

 Then we added 10X the data into the cluster and reran Hadoop, it took
 214 minutes which is less than 10X the time, but still not that
 impressive.


 so we are seeing a 10X performance penalty for using Hadoop vs the
 system commands, is that expected?
 we were expecting hadoop to be faster since it is distributed?
 perhaps there is too much overhead involved here?
 is the data too small?