Hey all, I'm not a world-class Hadoop expert yet, so I won't try to advise anything in particular, except to point out that the regular expression implementation makes a big difference [1]. It may be that Java vs. sed is an unfair fight.
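To get a feel for how much the regex engine itself matters, outside of Hadoop, you could time the same extraction through two different engines on one node. This is only a rough sketch, not something from the thread: it assumes a local copy of the input under ./data and GNU sed/grep (the Java engine the grep example uses would need a separate test, but this at least separates regex cost from the sort/uniq cost):

  # the sed pipeline from below (GNU sed understands \t as a literal tab here)
  time cat data/* | sed -e 's/\t.*//' | sort | uniq -c > /dev/null

  # the same '^[a-zA-Z]+\t' pattern through grep -E, just counting matches per file
  time grep -E -c $'^[a-zA-Z]+\t' data/* > /dev/null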
One valid test /may be/ to run a 1-mapper Hadoop job and see how that fares. In a perfect world, you'd expect it to run about 7x slower than on the 7-slave cluster. How far off it is might tell you something. Good luck!

[1] http://swtch.com/~rsc/regexp/regexp1.html

-- Jim R. Wilson (jimbojw)

On Tue, Jun 10, 2008 at 7:14 PM, Ashish Thusoo <[EMAIL PROTECTED]> wrote:
> Try first just reducing the number of files and increasing the data
> in each file so you have close to 64MB of data per file. So in your case
> that would amount to about 700-800 files in the 10X test case (instead
> of the 35000 that you have). See if that gives substantially better results
> on your larger test case. For the smaller one, I don't think you will be
> able to do better than the Unix command - the data set is too small.
>
> Ashish
>
> -----Original Message-----
> From: Elia Mazzawi [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, June 10, 2008 5:00 PM
> To: core-user@hadoop.apache.org
> Subject: Re: hadoop benchmarked, too slow to use
>
> So it would make sense for me to configure Hadoop for smaller chunks?
>
> Elia Mazzawi wrote:
>>
>> Yes, the chunk size was 64MB, and each file has some data; it used 7 mappers
>> and 1 reducer.
>>
>> 10X the data took 214 minutes
>> vs 26 minutes for the smaller set.
>>
>> I uploaded the same data 10 times in different directories (so more
>> files, same size).
>>
>> Ashish Thusoo wrote:
>>> Apart from the setup times, the fact that you have 3500 files means
>>> that you are going after around 220GB of data, as each file would have
>>> at least one chunk (this calculation assumes a chunk size of 64MB
>>> and that each file has at least some data). Mappers would probably
>>> need to read up to this amount of data, and with 7 nodes you may just
>>> have 14 map slots. I may be wrong here, but just out of curiosity,
>>> how many mappers does your job use?
>>>
>>> Don't know why the 10X data was not better, though, if the bad
>>> performance of the smaller test case was due to fragmentation. For
>>> that test did you also increase the number of files, or did you
>>> simply increase the amount of data in each file?
>>>
>>> Plus, on small sets of data (on the order of 2-3 GB), Unix commands
>>> can't really be beaten :)
>>>
>>> Ashish
>>>
>>> -----Original Message-----
>>> From: Elia Mazzawi [mailto:[EMAIL PROTECTED]
>>> Sent: Tuesday, June 10, 2008 3:56 PM
>>> To: core-user@hadoop.apache.org
>>> Subject: hadoop benchmarked, too slow to use
>>>
>>> Hello,
>>>
>>> We were considering using Hadoop to process some data; we have it set
>>> up on 8 nodes (1 master + 7 slaves).
>>>
>>> We filled the cluster up with files that contain tab-delimited data:
>>> string \tab string etc.
>>> Then we ran the example grep with a regular expression to count the
>>> number of each unique starting string.
>>> We had 3500 files containing 3,015,294 lines totaling 5 GB.
>>>
>>> To benchmark it we ran
>>> bin/hadoop jar hadoop-0.17.0-examples.jar grep data/* output '^[a-zA-Z]+\t'
>>> which took 26 minutes.
>>>
>>> Then, to compare, we ran this bash command on one of the nodes, which
>>> produced the same output out of the data:
>>>
>>> cat * | sed -e s/\ .*// | sort | uniq -c > /tmp/out
>>> (the sed regexp is a tab, not spaces)
>>>
>>> which took 2.5 minutes.
>>>
>>> Then we added 10X the data into the cluster and reran Hadoop; it took
>>> 214 minutes, which is less than 10X the time, but still not that
>>> impressive.
>>>
>>> So we are seeing a 10X performance penalty for using Hadoop vs the
>>> system commands - is that expected?
>>> We were expecting Hadoop to be faster since it is distributed.
>>> Perhaps there is too much overhead involved here?
>>> Is the data too small?
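As a footnote to Ashish's suggestion above (fewer, larger files, close to one 64MB block each), one way to try it is to pack the small inputs into block-sized chunks before loading them into HDFS and rerun the same job against the consolidated copy. A rough sketch only - the paths are made up, it assumes the original files are still available locally, and the exact dfs shell usage may vary between Hadoop versions:

  # pack the small inputs into ~64MB chunks, one file per HDFS block
  # (GNU split: -C keeps whole lines together, -a 3 allows >676 output files)
  mkdir -p consolidated
  cat local-input/* | split -a 3 -C 64M - consolidated/part-

  # load the consolidated copy and rerun the same grep example against it
  bin/hadoop dfs -put consolidated data-consolidated
  bin/hadoop jar hadoop-0.17.0-examples.jar grep data-consolidated output2 '^[a-zA-Z]+\t'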