Hey all,

I'm not a world-class hadoop expert yet, so I won't try to advise
anything in particular except to point out that the regular expression
implementation makes a big difference[1]. It may be that Java vs. sed
is an unfair fight.

One valid test /may be/ to run a 1-mapper hadoop job and see how that
fares.  In a perfect world, you'd expect it to run about 7x slower
than on the 7-slave cluster.  How far off it is might tell you
something.
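
If you want to try that, one rough way (just a sketch, I haven't run
this against your setup) is to shrink the cluster to a single slave
and rerun the exact same job, which gives you the 1-node vs 7-node
comparison:

  # stop mapreduce, trim conf/slaves down to one node, restart, rerun
  bin/stop-mapred.sh
  bin/start-mapred.sh
  bin/hadoop jar hadoop-0.17.0-examples.jar grep data/* output-single '^[a-zA-Z]+\t'
  # ("output-single" is just a fresh output dir so the old results stay put)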

Good luck!

[1] http://swtch.com/~rsc/regexp/regexp1.html

-- Jim R. Wilson (jimbojw)

On Tue, Jun 10, 2008 at 7:14 PM, Ashish Thusoo <[EMAIL PROTECTED]> wrote:
> Try first just reducing the number of files and increasing the data
> in each file so you have close to 64MB of data per file. In your case
> that would amount to about 700-800 files in the 10X test case (instead
> of the 35000 that you have). See if that gives substantially better
> results on your larger test case. For the smaller one, I don't think
> you will be able to do better than the unix command - the data set is
> too small.
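>
> Something along these lines might do it (a rough sketch, untested; the
> paths here are just placeholders, and it assumes GNU split and one
> record per line):
>
>   # locally: repack the small files into ~64MB chunks, splitting on
>   # line boundaries so no record is cut in half
>   mkdir -p merged
>   cat local_data/* | split -C 64m -a 3 - merged/part-
>   # upload the repacked files instead of the originals
>   bin/hadoop dfs -put merged data-merged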
>
> Ashish
>
> -----Original Message-----
> From: Elia Mazzawi [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, June 10, 2008 5:00 PM
> To: core-user@hadoop.apache.org
> Subject: Re: hadoop benchmarked, too slow to use
>
> so it would make sense for me to configure hadoop for smaller chunks?
>
> Elia Mazzawi wrote:
>>
>> yes chunk size was 64mb, and each file has some data; it used 7 mappers
>> and 1 reducer.
>>
>> 10X the data took 214 minutes
>> vs 26 minutes for the smaller set
>>
>> i uploaded the same data 10 times into different directories (so more
>> files, same size per file)
>>
>>
>> Ashish Thusoo wrote:
>>> Apart from the setup times, the fact that you have 3500 files means
>>> that you are going after around 220GB of data, as each file would have
>>> at least one chunk (this calculation assumes a chunk size of 64MB and
>>> that each file has at least some data; 3500 x 64MB is roughly 220GB).
>>> Mappers would probably need to read up to this amount of data, and
>>> with 7 nodes you may just have 14 map slots. I may be wrong here, but
>>> just out of curiosity, how many mappers does your job use?
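>>> (To check: the jobtracker web UI, normally at http://<master>:50030/
>>> with the default ports, lists the map and reduce task counts for each
>>> job; from the command line, something like
>>>
>>>   bin/hadoop job -list
>>>   bin/hadoop job -status <job-id>
>>>
>>> should show it too. <master> and <job-id> are placeholders, and I'm
>>> going from memory on the 0.17 command set here.)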
>>>
>>> I don't know why the 10X data was not better, though, if the bad
>>> performance of the smaller test case was due to fragmentation. For
>>> that test, did you also increase the number of files, or did you
>>> simply increase the amount of data in each file?
>>>
>>> Plus, on small data sets (on the order of 2-3 GB), unix commands
>>> can't really be beaten :)
>>>
>>> Ashish
>>> -----Original Message-----
>>> From: Elia Mazzawi [mailto:[EMAIL PROTECTED]
>>> Sent: Tuesday, June 10, 2008 3:56 PM
>>> To: core-user@hadoop.apache.org
>>> Subject: hadoop benchmarked, too slow to use
>>>
>>> Hello,
>>>
>>> we were considering using hadoop to process some data; we have it set
>>> up on 8 nodes (1 master + 7 slaves).
>>>
>>> we filled the cluster up with files that contain tab-delimited data:
>>> string \tab string etc.
>>> then we ran the example grep with a regular expression to count the
>>> number of each unique starting string.
>>> we had 3500 files containing 3,015,294 lines totaling 5 GB.
>>>
>>> to benchmark it we ran
>>>   bin/hadoop jar hadoop-0.17.0-examples.jar grep data/* output '^[a-zA-Z]+\t'
>>> it took 26 minutes
>>>
>>> then to compare, we ran this bash command on one of the nodes, which
>>> produced the same output out of the data:
>>>
>>> cat * | sed -e s/\  .*// | sort | uniq -c > /tmp/out
>>> (the sed regexp is a tab, not spaces)
>>>
>>> which took 2.5 minutes
>>>
>>> Then we added 10X the data into the cluster and reran Hadoop; it took
>>> 214 minutes, which is less than 10X the time, but still not that
>>> impressive.
>>>
>>>
>>> so we are seeing a 10X performance penalty for using Hadoop vs. the
>>> system commands - is that expected?
>>> we were expecting hadoop to be faster since it is distributed.
>>> perhaps there is too much overhead involved here?
>>> is the data too small?
>>>
>>
>
>
