RE: hadoop benchmarked, too slow to use

2008-06-11 Thread Ashish Thusoo
good to know... this puppy does scale :) and hadoop is awesome for what it does... Ashish

Re: hadoop benchmarked, too slow to use

2008-06-11 Thread Ted Dunning
Yes. That does count as huge. Congratulations! On Wed, Jun 11, 2008 at 11:53 AM, Elia Mazzawi wrote: > we concatenated the files to bring them close to and less than 64mb and the > difference was huge without changing anything else > we went from 214 minutes to 3 minutes!
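The concatenation step behind this speedup can be sketched as a short shell loop; the directory names, sample data, and the 64 MB target are illustrative assumptions, not the poster's actual script:

```shell
# Pack many small input files into batches close to (but not over) 64 MB,
# the default HDFS block size, so each map task processes a full block
# instead of one tiny file. Paths and the demo data are assumptions.
mkdir -p input merged
printf 'a\t1\n' > input/f1.tsv   # tiny stand-ins for the real files
printf 'b\t2\n' > input/f2.tsv
printf 'c\t3\n' > input/f3.tsv

TARGET=$((64 * 1024 * 1024))     # 64 MB in bytes
batch=0
size=0
: > "merged/part-$batch"
for f in input/*.tsv; do
  fsize=$(wc -c < "$f")
  # start a new batch if this file would push us past the target
  if [ $((size + fsize)) -gt "$TARGET" ] && [ "$size" -gt 0 ]; then
    batch=$((batch + 1))
    size=0
    : > "merged/part-$batch"
  fi
  cat "$f" >> "merged/part-$batch"
  size=$((size + fsize))
done
# then upload merged/ instead of the original small files
```

With the three tiny demo files everything lands in merged/part-0; with 3500 real files this yields a handful of block-sized inputs instead of 3500 near-empty map tasks.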

Re: hadoop benchmarked, too slow to use

2008-06-11 Thread Elia Mazzawi
so it would make sense for me to configure hadoop for smaller chunks? Elia Mazzawi wrote: yes chunk size was 64mb, and each file has some data it used 7 mappers and 1 reducer. 10X th

Re: hadoop benchmarked, too slow to use

2008-06-11 Thread Arun C Murthy
one, I don't think you will be able to do better than the unix command - the data set is too small. Ashish

Re: hadoop benchmarked, too slow to use

2008-06-11 Thread Elia Mazzawi
so it would make sense for me to configure hadoop for smaller chunks? Elia Mazzawi wrote: yes chunk size was 64mb, and each

Re: hadoop benchmarked, too slow to use

2008-06-11 Thread Elia Mazzawi
will be able to do better than the unix command - the data set is too small. Ashish ... so it would ma

Re: hadoop benchmarked, too slow to use

2008-06-10 Thread Jim R. Wilson
> so it would make sense for me to configure hadoop for smaller chunks? > Elia Mazzawi wrote: >> yes chunk size was 64mb, and each

RE: hadoop benchmarked, too slow to use

2008-06-10 Thread Ashish Thusoo
so it would make sense for me to configure hadoop for smaller chunks? Elia Mazzawi wrote: > yes chunk size was 64mb, and each file has some data it used 7 mappers > and 1 reducer. > 10X the data took 214 minutes > vs 26 minutes fo

Re: hadoop benchmarked, too slow to use

2008-06-10 Thread Arun C Murthy
On Jun 10, 2008, at 3:56 PM, Elia Mazzawi wrote: Hello, we were considering using hadoop to process some data, we have it set up on 8 nodes ( 1 master + 7 slaves) we filled the cluster up with files that contain tab delimited data. string \tab string etc then we ran the example grep with a re
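For reference, the "example grep" being benchmarked is the job bundled with the Hadoop examples jar; a typical invocation from that era looked roughly like the following. This is a command sketch rather than something runnable outside a cluster, and the jar name, HDFS paths, and regex are all assumptions:

```shell
# Upload the tab-delimited files, run the bundled grep example, and
# inspect the results. Jar name, paths, and the regex are illustrative
# assumptions.
bin/hadoop dfs -put local-data input
bin/hadoop jar hadoop-*-examples.jar grep input output 'some-regex'
bin/hadoop dfs -cat 'output/*'    # matched strings with their counts
```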

Re: hadoop benchmarked, too slow to use

2008-06-10 Thread Elia Mazzawi
so it would make sense for me to configure hadoop for smaller chunks? Elia Mazzawi wrote: yes chunk size was 64mb, and each file has some data it used 7 mappers and 1 reducer. 10X the data took 214 minutes vs 26 minutes for the smaller set i uploaded the same data 10 times in different direct

Re: hadoop benchmarked, too slow to use

2008-06-10 Thread Elia Mazzawi
yes there was only 1 reducer, how many should i try ? Joydeep Sen Sarma wrote: how many reducers? Perhaps u are defaulting to one reducer. One variable is how fast the

RE: hadoop benchmarked, too slow to use

2008-06-10 Thread Joydeep Sen Sarma
yes there was only 1 reducer, how many should i try ? Joydeep Sen Sarma wrote: > how many reducers? Perhaps u are defaulting to one reducer. > One variable is how fast the java regex evaluation is wrt to sed. One > option

Re: hadoop benchmarked, too slow to use

2008-06-10 Thread Elia Mazzawi
yes chunk size was 64mb, and each file has some data it used 7 mappers and 1 reducer. 10X the data took 214 minutes vs 26 minutes for the smaller set i uploaded the same data 10 times in different directories ( so more files, same size ) Ashish Thusoo wrote: Apart from the setup times, the

Re: hadoop benchmarked, too slow to use

2008-06-10 Thread Miles Osborne
Why not do a little experiment and see what the timing results are when using a range of reducers eg 1, 2, 5, 7, 13 Miles 2008/6/11 Elia Mazzawi: > yes there was only 1 reducer, how many should i try ? > Joydeep Sen Sarma wrote: >> how many reducers? Perhaps u are
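The experiment Miles proposes is easy to script. This is a command sketch, not runnable outside a cluster: the jar name, paths, and regex are assumptions, and `mapred.reduce.tasks` (the property name of that era) only takes effect via `-D` for jobs that parse generic options:

```shell
# Rerun the same job with a range of reducer counts and compare
# wall-clock times. Jar name, paths, and regex are illustrative
# assumptions; -D mapred.reduce.tasks requires the job to use
# GenericOptionsParser / ToolRunner.
for r in 1 2 5 7 13; do
  echo "=== reducers: $r ==="
  time bin/hadoop jar hadoop-*-examples.jar grep \
      -D mapred.reduce.tasks="$r" input "output-$r" 'some-regex'
done
```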

Re: hadoop benchmarked, too slow to use

2008-06-10 Thread Elia Mazzawi
yes there was only 1 reducer, how many should i try ? Joydeep Sen Sarma wrote: how many reducers? Perhaps u are defaulting to one reducer. One variable is how fast the java regex evaluation is wrt to sed. One option is to use hadoop streaming and use ur sed fragment as the mapper. That will b

Re: hadoop benchmarked, too slow to use

2008-06-10 Thread Elia Mazzawi
I could rerun the benchmark with a single node server to see what happens. my concern is, the 8 node setup was 10X slower than the bash command, so I was starting to suspect that the cluster is not running properly, but everything looks good in the logs. no timeouts and such. Miles Osborne wro

RE: hadoop benchmarked, too slow to use

2008-06-10 Thread Ashish Thusoo
Apart from the setup times, the fact that you have 3500 files means that you are going after around 220GB of data as each file would have at least one chunk (this calculation is assuming a chunk size of 64MB and this assumes that each file has at least some data). Mappers would probably need to read
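The arithmetic behind the ~220GB figure can be checked directly:

```shell
# 3500 files, each occupying at least one 64 MB chunk, schedules the job
# as if it covered roughly 220 GB of input, even if the files are tiny.
files=3500
chunk_mb=64
total_mb=$((files * chunk_mb))
echo "$total_mb MB = ~$((total_mb / 1024)) GB"   # prints: 224000 MB = ~218 GB
```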

RE: hadoop benchmarked, too slow to use

2008-06-10 Thread Joydeep Sen Sarma
how many reducers? Perhaps u are defaulting to one reducer. One variable is how fast the java regex evaluation is wrt to sed. One option is to use hadoop streaming and use ur sed fragment as the mapper. That will be another way of measuring hadoop overhead that eliminates some variables. Hadoop a
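Because a streaming mapper is just a program that reads stdin and writes stdout, the sed fragment can be dry-run locally with a plain pipe before submitting. The sample data, the sed expression, and the streaming-jar path below are all illustrative assumptions, not the poster's actual job:

```shell
# Local dry run of "sed as the mapper": sample.tsv stands in for the
# real tab-delimited input, and the sed expression is an assumption.
printf 'foo\tkeep1\nbar\tdrop\nfoo\tkeep2\n' > sample.tsv

# mapper | sort mimics the map and shuffle phases of the streaming job
# (the . in the pattern matches the tab after the key)
sed -n 's/^foo.//p' < sample.tsv | sort    # prints: keep1, then keep2

# equivalent cluster submission (streaming jar path is an assumption):
# bin/hadoop jar contrib/streaming/hadoop-*-streaming.jar \
#     -input input -output output \
#     -mapper "sed -n 's/^foo.//p'" -reducer 'uniq -c'
```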

Re: hadoop benchmarked, too slow to use

2008-06-10 Thread Elia Mazzawi
I compared the 2 results they were the same, for the system command the sed before the sort, is working properly, i did ctrl V then tab to input a tab character in the terminal, and viewed the result its stripping out the rest of the data okay. Ashish Venugopal wrote: Just a small note (doe

Re: hadoop benchmarked, too slow to use

2008-06-10 Thread Miles Osborne
I suspect that many people are using Hadoop with a moderate number of nodes and expecting to see a win over a sequential, single node version. The result (and I've seen this too) is typically that the single node version wins hands-down. Apart from speeding-up the Hadoop job (eg via compression,

Re: hadoop benchmarked, too slow to use

2008-06-10 Thread Ashish Venugopal
Just a small note (does not answer your question, but deals with your testing command), when running the system command version below, it's important to test with sort -k 1 -t $TAB where TAB is something like: TAB=`echo "\t"` to ensure that you sort by key, rather than the whole line. Sorting by
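One caveat on the TAB trick: `echo "\t"` only yields a real tab in shells whose echo interprets backslash escapes (bash's does not by default); `printf '\t'` is portable. A small sketch with made-up data:

```shell
# Sort tab-delimited lines by the first field only. printf '\t' yields a
# real tab in any POSIX shell, unlike echo "\t", which prints a literal
# backslash-t in bash. -s (stable sort, a GNU/BSD extension) stops sort
# from falling back to whole-line comparison on equal keys.
TAB=$(printf '\t')
printf 'b\t1\na\t2\nb\t0\n' > data.tsv
sort -s -t "$TAB" -k1,1 data.tsv   # the a line first; the b lines keep input order
```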