I don't think you can :-). Sorry, they are 100Mbps NICs... I get 95Mbit/sec from one node to another with iperf.
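For reference, the iperf measurement above amounts to something like the following (the hostname node2 is a placeholder, not from the thread):

    # On the receiving node, start iperf in server mode:
    iperf -s

    # On the sending node, run the client against the receiver (assumed
    # reachable as node2); it reports achieved bandwidth after ~10 seconds:
    iperf -c node2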
Should I still be expecting such dismal performance with just 100Mbps?

On Apr 12, 2010, at 1:31 PM, Todd Lipcon wrote:

> On Mon, Apr 12, 2010 at 1:05 PM, Andrew Nguyen
> <andrew-lists-had...@ucsfcti.org> wrote:
>
>> 5 identically spec'ed nodes, each has:
>>
>> 2 GB RAM
>> Pentium 4 3.0GHz with HT
>> 250GB HDD on PATA
>> 10Mbps NIC
>
> This is probably your issue - a 10Mbps NIC? I didn't know you could even
> get those anymore!
>
> Hadoop runs on commodity hardware, but you're not likely to get
> reasonable performance with hardware like that.
>
> -Todd
>
>> On Apr 12, 2010, at 11:58 AM, alex kamil wrote:
>>
>>> Andrew,
>>>
>>> I would also suggest running the DFSIO benchmark to isolate
>>> I/O-related issues:
>>>
>>> hadoop jar hadoop-0.20.2-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
>>> hadoop jar hadoop-0.20.2-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
>>>
>>> There are additional tests specific to MapReduce - run "hadoop jar
>>> hadoop-0.20.2-test.jar" for the complete list.
>>>
>>> 45 minutes for mapping 6GB on 5 nodes is way too high, assuming your
>>> gain/offset conversion is a simple algebraic manipulation.
>>>
>>> It takes less than 5 minutes to run a simple mapper (using streaming)
>>> on a 4-node cluster on something like 10GB; the mapper I used was an
>>> awk command extracting a <key:value> pair from a log (no reducer).
>>>
>>> Thanks
>>> Alex
>>>
>>> On Mon, Apr 12, 2010 at 1:53 PM, Todd Lipcon <t...@cloudera.com> wrote:
>>> Hi Andrew,
>>>
>>> Do you need the sorting behavior that having an identity reducer gives
>>> you? If not, set the number of reduce tasks to 0 and you'll end up
>>> with a map-only job, which should be significantly faster.
>>>
>>> -Todd
>>>
>>> On Mon, Apr 12, 2010 at 9:43 AM, Andrew Nguyen
>>> <andrew-lists-had...@ucsfcti.org> wrote:
>>>
>>>> Hello,
>>>>
>>>> I recently set up a 5-node cluster (1 master, 4 slaves) and am
>>>> looking to use it to process high volumes of patient physiologic
>>>> data. As an initial exercise to gain a better understanding, I have
>>>> attempted to run the following problem (which isn't the type of
>>>> problem that Hadoop was really designed for, as is my understanding).
>>>>
>>>> I have a 6GB data file that contains key/value pairs of <sample
>>>> number, sample value>. I'd like to convert the values, based on a
>>>> gain/offset, to their physical units. I've set up a MapReduce job
>>>> using streaming where the mapper does the conversion and the reducer
>>>> is just an identity reducer. Based on other threads on the mailing
>>>> list, my initial results are consistent in that it takes considerably
>>>> more time to process this in Hadoop than on my MacBook Pro (45
>>>> minutes vs. 13 minutes). The input is a single 6GB file, and it looks
>>>> like the file is being split into 101 map tasks. This is consistent
>>>> with the 64MB block size.
>>>>
>>>> So my questions are:
>>>>
>>>> * Would it help to increase the block size to 128MB? Or decrease the
>>>> block size? What are some key factors to think about with this
>>>> question?
>>>> * Are there any other optimizations that I could employ? I have
>>>> looked into LzoCompression, but I'd like to still work without
>>>> compression since the single-threaded job I'm comparing against
>>>> doesn't use any sort of compression. I know I'm comparing apples to
>>>> pears a little here, so please feel free to correct this assumption.
>>>> * Is Hadoop really only good for jobs where the data doesn't fit on
>>>> a single node? At some level, I assume that it can still speed up
>>>> jobs that do fit on one node, if only because you are performing
>>>> tasks in parallel.
>>>>
>>>> Thanks!
>>>>
>>>> --Andrew
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
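For reference, the map-only streaming job suggested above could look something like the sketch below. The input/output paths, the streaming jar location (matching the 0.20.2 tarball layout), and the gain/offset constants 2.5 and -512 are placeholders, not values from the thread:

    #!/bin/sh
    # convert.sh - hypothetical mapper: applies a linear gain/offset to the
    # value column of tab-separated <sample number, sample value> input.
    # 2.5 and -512 stand in for the real calibration constants.
    exec awk '{ print $1 "\t" ($2 * 2.5 - 512) }'

    # Run it map-only: setting mapred.reduce.tasks to 0 drops the identity
    # reducer and its sort/shuffle pass entirely, per Todd's suggestion.
    hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
        -D mapred.reduce.tasks=0 \
        -input /user/andrew/samples \
        -output /user/andrew/samples-physical \
        -mapper convert.sh \
        -file convert.sh

With no reduce phase, each map task writes its converted block straight back to HDFS, so the job's cost is essentially one pass of read-transform-write per 64MB block.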