Hey Andrew,

I can name three California universities (San Diego, Caltech, Santa Barbara) that 
use Hadoop at small (~20TB raw) or medium (~800TB raw) scale.  Why not go 
talk to those guys?

Otherwise, you might just be confirming that old hardware is old.  There's a good 
chance you are hard-drive limited rather than network limited anyway: 3.4MB/s 
triple-replicated is roughly 10MB/s of writes on PATA, which might approach the 
hardware's capability.  Alternatively, you can always try running on Amazon, 
which lets you test scaling at a very, very marginal cost.
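
If you want a quick read on the disk side, something like the following on one 
of the nodes (device name and output path are just examples) gives rough 
sequential write and read numbers to hold up against the DFSIO figures:

  dd if=/dev/zero of=/tmp/ddtest bs=1M count=1024 conv=fsync   # sequential write
  hdparm -t /dev/hda                                           # buffered sequential read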

Brian

On Apr 13, 2010, at 1:40 PM, Andrew Nguyen wrote:

> Good to know...  The problem is that I'm in an academic environment that
> needs a lot of convincing regarding new computational technologies.  I need
> to show proven benefit before getting the funds to actually implement
> anything.  These servers were the best I could come up with for this
> proof-of-concept.
> 
> I changed some settings on the nodes and have been experimenting, and I'm
> seeing about 3.4 MB/sec with TestDFSIO, which is pretty consistent with your
> observations below.
> 
> Given that, would increasing the block size help my performance?  This
> should result in fewer map tasks and keep the computation local for
> longer...?  I just need to show that the numbers are better than a single
> machine's, even if sacrificing redundancy (or other factors) in the current
> setup.
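> 
> For reference, my understanding is that the block size can be set cluster-wide
> via dfs.block.size in hdfs-site.xml, or per upload (128M here is just an
> example value):
> 
>   hadoop fs -D dfs.block.size=134217728 -put samples.txt /data/samples.txt
> 
> and that existing files keep whatever block size they were written with, so
> the input would have to be re-uploaded after the change.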
> 
> @alex:
> 
> Thanks for the links - they give me another bit of evidence to convince
> those controlling the money flow...
> 
> --Andrew
> 
> On Tue, 13 Apr 2010 10:29:06 -0700, Todd Lipcon <t...@cloudera.com> wrote:
>> On Mon, Apr 12, 2010 at 1:45 PM, Andrew Nguyen <
>> andrew-lists-had...@ucsfcti.org> wrote:
>> 
>>> I don't think you can :-).  Sorry, they are 100Mbps NICs...  I get
>>> 95Mbit/sec from one node to another with iperf.
>>> 
>>> Should I still be expecting such dismal performance with just 100Mbps?
>>> 
>> 
>> Yes - in my experience on gigabit, when lots of transfers are going between
>> the nodes, TCP performance actually drops to around half the network
>> capacity. In the case of 100Mbps, this is probably going to be around
>> 5MB/sec.
>> 
>> So when you're writing output at 3x replication, it's going to be very, very
>> slow on this network.
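>> 
>> Back of the envelope: 100Mbps is roughly 12.5MB/s raw, call it 5-6MB/s with
>> concurrent transfers, and the replication pipeline then has to forward every
>> block to two more nodes over those same links, so a few MB/s of effective
>> write throughput is about what I'd expect here.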
>> 
>> -Todd
>> 
>> 
>>> 
>>> On Apr 12, 2010, at 1:31 PM, Todd Lipcon wrote:
>>> 
>>>> On Mon, Apr 12, 2010 at 1:05 PM, Andrew Nguyen <
>>>> andrew-lists-had...@ucsfcti.org> wrote:
>>>> 
>>>>> 5 identically spec'ed nodes, each has:
>>>>> 
>>>>> 2 GB RAM
>>>>> Pentium 4 3.0G with HT
>>>>> 250GB HDD on PATA
>>>>> 10Mbps NIC
>>>>> 
>>>> 
>>>> This is probably your issue - a 10Mbps NIC? I didn't know you could even
>>>> get those anymore!
>>>> 
>>>> Hadoop runs on commodity hardware, but you're not likely to get reasonable
>>>> performance with hardware like that.
>>>> 
>>>> -Todd
>>>> 
>>>> 
>>>>> On Apr 12, 2010, at 11:58 AM, alex kamil wrote:
>>>>> 
>>>>>> Andrew,
>>>>>> 
>>>>>> I would also suggest running the DFSIO benchmark to isolate IO-related
>>>>>> issues:
>>>>>> 
>>>>>> hadoop jar hadoop-0.20.2-test.jar TestDFSIO -write -nrFiles 10 -fileSize 1000
>>>>>> hadoop jar hadoop-0.20.2-test.jar TestDFSIO -read -nrFiles 10 -fileSize 1000
>>>>>> 
>>>>>> There are additional tests specific to MapReduce - run "hadoop jar
>>>>>> hadoop-0.20.2-test.jar" for the complete list.
>>>>>> 
>>>>>> 45 min for mapping 6GB on 5 nodes is way too high, assuming your
>>>>>> gain/offset conversion is a simple algebraic manipulation.
>>>>>> 
>>>>>> It takes less than 5 min to run a simple mapper (using streaming) on a
>>>>>> 4-node cluster on something like 10GB; the mapper I used was an awk
>>>>>> command extracting <key:value> pairs from a log (no reducer).
>>>>>> 
>>>>>> Thanks
>>>>>> Alex
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On Mon, Apr 12, 2010 at 1:53 PM, Todd Lipcon <t...@cloudera.com>
>>> wrote:
>>>>>> Hi Andrew,
>>>>>> 
>>>>>> Do you need the sorting behavior that having an identity reducer gives
>>>>>> you?  If not, set the number of reduce tasks to 0 and you'll end up with a
>>>>>> map-only job, which should be significantly faster.
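>>>>>> 
>>>>>> With streaming that's just a matter of passing -D mapred.reduce.tasks=0 to
>>>>>> the job, roughly like this (the mapper name and paths are placeholders):
>>>>>> 
>>>>>> hadoop jar contrib/streaming/hadoop-0.20.2-streaming.jar \
>>>>>>   -D mapred.reduce.tasks=0 \
>>>>>>   -input /data/samples -output /data/converted \
>>>>>>   -mapper convert.py -file convert.py
>>>>>> 
>>>>>> The map output then goes straight to HDFS with no sort or shuffle step.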
>>>>>> 
>>>>>> -Todd
>>>>>> 
>>>>>> On Mon, Apr 12, 2010 at 9:43 AM, Andrew Nguyen <
>>>>>> andrew-lists-had...@ucsfcti.org> wrote:
>>>>>> 
>>>>>>> Hello,
>>>>>>> 
>>>>>>> I recently set up a 5-node cluster (1 master, 4 slaves) and am looking to
>>>>>>> use it to process high volumes of patient physiologic data.  As an
>>>>>>> initial exercise to gain a better understanding, I have attempted to run
>>>>>>> the following problem (which isn't the type of problem that Hadoop was
>>>>>>> really designed for, as is my understanding).
>>>>>>> 
>>>>>>> I have a 6G data file that contains key/values of <sample number, sample
>>>>>>> value>.  I'd like to convert the values based on a gain/offset to their
>>>>>>> physical units.  I've set up a MapReduce job using streaming where the
>>>>>>> mapper does the conversion, and the reducer is just an identity reducer.
>>>>>>> Based on other threads on the mailing list, my initial results are
>>>>>>> consistent in the fact that it takes considerably more time to process
>>>>>>> this in Hadoop than it does on my MacBook Pro (45 minutes vs. 13 minutes).
>>>>>>> The input is a single 6G file, and it looks like the file is being split
>>>>>>> into 101 map tasks.  This is consistent with the 64M block size.
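>>>>>>> 
>>>>>>> For concreteness, the mapper is essentially a one-liner along these
>>>>>>> lines (the gain of 0.5 and offset of 1.25 are made-up example values),
>>>>>>> fed to streaming with an identity reducer:
>>>>>>> 
>>>>>>>   awk '{ print $1 "\t" ($2 * 0.5 + 1.25) }'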
>>>>>>> 
>>>>>>> So my questions are:
>>>>>>> 
>>>>>>> * Would it help to increase the block size to 128M?  Or decrease the
>>>>>>> block size?  What are some key factors to think about with this question?
>>>>>>> * Are there any other optimizations that I could employ?  I have looked
>>>>>>> into LzoCompression, but I'd like to still work without compression since
>>>>>>> the single-threaded job that I'm comparing to doesn't use any sort of
>>>>>>> compression.  I know I'm comparing apples to pears a little here, so
>>>>>>> please feel free to correct this assumption.
>>>>>>> * Is Hadoop really only good for jobs where the data doesn't fit on a
>>>>>>> single node?  At some level, I assume that it can still speed up jobs that
>>>>>>> do fit on one node, if only because you are performing tasks in parallel.
>>>>>>> 
>>>>>>> Thanks!
>>>>>>> 
>>>>>>> --Andrew
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Todd Lipcon
>>>>>> Software Engineer, Cloudera
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Todd Lipcon
>>>> Software Engineer, Cloudera
>>> 
>>> 
