The current argument against increasing the number of mappers is that the
machines would run out of memory (OOM). Also, since a lot of the jobs are
crawlers, I need more IP addresses so I don't get banned :)

Thing is that we currently have Solr on the very same machines, and
data-nodes as well, so I can only give MapReduce about 1G of memory since I
need Solr to have 4G...
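
To make the memory constraint concrete, here is a rough sketch of the
per-node budget. The 8 GB total, 4 GB for Solr and about 1 GB for MapReduce
come from this thread; the 256 MB heap per task and the function name are
assumptions for illustration only, not measured values.

```python
# Rough per-node memory budget. Figures marked "assumed" are illustrative,
# not measurements from the cluster.

def max_task_slots(mr_budget_mb: int, task_heap_mb: int) -> int:
    """Number of concurrent task JVMs that fit in the MapReduce budget."""
    if task_heap_mb <= 0:
        raise ValueError("task_heap_mb must be positive")
    return mr_budget_mb // task_heap_mb

TOTAL_MB = 8 * 1024      # 8 GB per box (from the thread)
SOLR_MB = 4 * 1024       # reserved for Solr (from the thread)
MR_MB = 1 * 1024         # what is left for MapReduce tasks (from the thread)
TASK_HEAP_MB = 256       # assumed heap per task JVM (hypothetical)

slots = max_task_slots(MR_MB, TASK_HEAP_MB)
remaining_mb = TOTAL_MB - SOLR_MB - MR_MB  # OS, data-node, page cache

print(f"task slots that fit: {slots}")
print(f"left for OS, data-node and page cache: {remaining_mb} MB")
```

Under these assumed numbers, tripling the slot count at the same heap size
would need roughly 3 GB instead of 1 GB, which is where the OOM worry comes
from.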

Now I realize I should get some obvious and just criticism about the layout
of this architecture, but I'm a little limited in budget and so then is the
architecture :)

However, is it wise to have the MR tasks on the same nodes as the data-nodes,
or should I split the architecture? I mean, the data-nodes perhaps need more
disk I/O and the MR nodes more memory and CPU?

Trying to find a sweet-spot hardware spec for those two roles.
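
As a rough sanity check on the CPU side, the mpstat average quoted below
shows about 78.5% idle. A minimal sketch, assuming CPU use scales linearly
with the number of mappers (a simplification that ignores memory, network
and disk contention); the function name is made up for this example.

```python
# Back-of-the-envelope CPU headroom estimate.
# Assumption: CPU usage scales linearly with the mapper count.

AVG_IDLE_PCT = 78.53  # "Average: all ... %idle" from the quoted mpstat run

def projected_busy_pct(idle_pct: float, mapper_factor: float) -> float:
    """Projected CPU busy percentage after scaling mappers, capped at 100."""
    busy = 100.0 - idle_pct
    return min(100.0, busy * mapper_factor)

for factor in (1, 2, 3):
    pct = projected_busy_pct(AVG_IDLE_PCT, factor)
    print(f"{factor}x mappers -> ~{pct:.1f}% busy")
```

On that assumption CPU would probably not be the first thing to run out;
memory would.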

//Marcus



On Sat, Jun 27, 2009 at 4:24 AM, Brian Bockelman <bbock...@cse.unl.edu> wrote:

> Hey Marcus,
>
> Are you recording the data rates coming out of HDFS?  Since you have such
> low CPU utilization, I'd look at boxes utterly packed with big hard drives
> (also, why are you using RAID1 for Hadoop??).
>
> You can get 1U boxes with 4 drive bays or 2U boxes with 12 drive bays.
> Based on the data rates you see, make the call.
>
> On the other hand, what's the argument against running 3x more mappers per
> box?  It seems that your boxes still have more headroom to use -- there's no
> I/O wait.
>
> Brian
>
>
> On Jun 26, 2009, at 4:43 PM, Marcus Herou wrote:
>
>> Hi.
>>
>> We have a deployment of 10 Hadoop servers and I now need more mapping
>> capability (no, not just adding more mappers per instance) since I have so
>> many jobs running. Now I am wondering what I should aim for...
>> Memory, CPU or disk... How long is a rope, perhaps you would say?
>>
>> A typical server is currently using about 15-20% CPU on a quad-core
>> 2.4 GHz machine with 8GB RAM and two 500GB SATA disks in RAID1.
>>
>> Some specs below.
>>
>> > mpstat 2 5
>> Linux 2.6.24-19-server (mapreduce2)     06/26/2009
>>
>> 11:36:13 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
>> 11:36:15 PM  all   22.82    0.00    3.24    1.37    0.62    2.49    0.00   69.45   8572.50
>> 11:36:17 PM  all   13.56    0.00    1.74    1.99    0.62    2.61    0.00   79.48   8075.50
>> 11:36:19 PM  all   14.32    0.00    2.24    1.12    1.12    2.24    0.00   78.95   9219.00
>> 11:36:21 PM  all   14.71    0.00    0.87    1.62    0.25    1.75    0.00   80.80   8489.50
>> 11:36:23 PM  all   12.69    0.00    0.87    1.24    0.50    0.75    0.00   83.96   5495.00
>> Average:     all   15.62    0.00    1.79    1.47    0.62    1.97    0.00   78.53   7970.30
>>
>> What I am thinking is... Is it wiser to go for many of these cheap boxes
>> with 8GB of RAM, or should I for instance focus on machines which can give
>> more I/O throughput?
>>
>> I know that these things are hard, but perhaps someone has drawn some
>> conclusions before, the pragmatic way.
>>
>> Kindly
>>
>> //Marcus
>>
>>
>> --
>> Marcus Herou CTO and co-founder Tailsweep AB
>> +46702561312
>> marcus.he...@tailsweep.com
>> http://www.tailsweep.com/
>>
>
>


-- 
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/
