The current argument against increasing num-mappers is that the machines will run out of memory (OOM), and, since a lot of the jobs are crawlers, that I need more IP addresses so I don't get banned :)
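To make the memory side of that argument concrete, here is a rough sketch of how many task JVMs fit on one box. It uses the figures from this thread (8 GB RAM per machine, 4 GB reserved for Solr). The 1 GB reserve for the OS, DataNode and TaskTracker, the 512 MB heap per task, the 25% JVM overhead factor and the function name max_task_slots are all assumptions for illustration, not measured values.

```python
def max_task_slots(total_ram_mb, reserved_mb, heap_per_task_mb, overhead_factor=1.25):
    """Estimate how many concurrent task JVMs fit in the RAM left on one node.

    overhead_factor is an assumed multiplier for non-heap JVM memory
    (thread stacks, permgen, native buffers); tune it to what you observe.
    """
    available_mb = total_ram_mb - reserved_mb
    if available_mb <= 0:
        return 0
    per_task_mb = heap_per_task_mb * overhead_factor
    return int(available_mb // per_task_mb)


if __name__ == "__main__":
    total = 8 * 1024        # 8 GB box, as described in the thread
    solr = 4 * 1024         # Solr heap, as described in the thread
    daemons_and_os = 1024   # assumption: OS + DataNode + TaskTracker
    slots = max_task_slots(total, solr + daemons_and_os, heap_per_task_mb=512)
    print("task slots that fit:", slots)
```

With these assumed numbers only a handful of task slots fit next to Solr, which is why tripling the mappers on the existing boxes risks OOM; moving Solr off the nodes or shrinking the per-task heap changes the result directly.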

The thing is that we currently run Solr on the very same machines, and the data nodes as well, so I can only give the MR nodes about 1G of memory since I need Solr to have 4G... Now, I see that I should get some obvious and just critique of the layout of this arch, but I'm a little limited in budget and so, then, is the arch :)

However, is it wise to have the MR tasks on the same nodes as the data nodes, or should I split the arch? I mean, the data nodes perhaps need more disk I/O and the MR tasks more memory and CPU? I'm trying to find a sweet-spot hardware spec for those two roles.

//Marcus

On Sat, Jun 27, 2009 at 4:24 AM, Brian Bockelman <bbock...@cse.unl.edu> wrote:

> Hey Marcus,
>
> Are you recording the data rates coming out of HDFS? Since you have such
> low CPU utilization, I'd look at boxes utterly packed with big hard drives
> (also, why are you using RAID1 for Hadoop??).
>
> You can get 1U boxes with 4 drive bays or 2U boxes with 12 drive bays.
> Based on the data rates you see, make the call.
>
> On the other hand, what's the argument against running 3x more mappers per
> box? It seems that your boxes still have more headroom to use; there's no
> I/O wait.
>
> Brian
>
> On Jun 26, 2009, at 4:43 PM, Marcus Herou wrote:
>
>> Hi.
>>
>> We have a deployment of 10 Hadoop servers and I now need more mapping
>> capability (no, not just adding more mappers per instance) since I have
>> so many jobs running. Now I am wondering what I should aim for...
>> Memory, CPU or disk... How long is a rope, perhaps you would say?
>>
>> A typical server is currently using about 15-20% CPU on a quad-core
>> 2.4 GHz, 8GB RAM machine with 2 RAID1 SATA 500GB disks.
>>
>> Some specs below.
>>
>>> mpstat 2 5
>>
>> Linux 2.6.24-19-server (mapreduce2)  06/26/2009
>>
>> 11:36:13 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
>> 11:36:15 PM  all   22.82    0.00    3.24    1.37    0.62    2.49    0.00   69.45   8572.50
>> 11:36:17 PM  all   13.56    0.00    1.74    1.99    0.62    2.61    0.00   79.48   8075.50
>> 11:36:19 PM  all   14.32    0.00    2.24    1.12    1.12    2.24    0.00   78.95   9219.00
>> 11:36:21 PM  all   14.71    0.00    0.87    1.62    0.25    1.75    0.00   80.80   8489.50
>> 11:36:23 PM  all   12.69    0.00    0.87    1.24    0.50    0.75    0.00   83.96   5495.00
>> Average:     all   15.62    0.00    1.79    1.47    0.62    1.97    0.00   78.53   7970.30
>>
>> What I am thinking is... Is it wiser to go for many of these cheap boxes
>> with 8GB of RAM, or should I for instance focus on machines which can give
>> more I/O throughput?
>>
>> I know that these things are hard, but perhaps someone has drawn some
>> conclusions the pragmatic way before.
>>
>> Kindly
>>
>> //Marcus
>>
>> --
>> Marcus Herou CTO and co-founder Tailsweep AB
>> +46702561312
>> marcus.he...@tailsweep.com
>> http://www.tailsweep.com/

--
Marcus Herou CTO and co-founder Tailsweep AB
+46702561312
marcus.he...@tailsweep.com
http://www.tailsweep.com/