MapR has its own concept of storage pools and stripe width.
On Nov 28, 2012, at 5:28 PM, Jean-Marc Spaggiari <jean-m...@spaggiari.org> wrote:

> Hi Mike,
>
> Why not use LVM with MapR? Since LVM reads from two drives almost at the
> same time, shouldn't it be better than RAID0 or a single drive?
>
> 2012/11/28, Michael Segel <michael_se...@hotmail.com>:
>> Just a couple of things.
>>
>> I'm neutral on the use of LVM. Some would point out that there's some
>> overhead, but on the flip side, it can make managing the machines easier.
>> If you're using MapR, though, you don't want LVM; use raw devices.
>>
>> In terms of GC, it's going to depend on the heap size, not the total
>> memory. With respect to HBase... MSLAB is the way to go.
>>
>> On Nov 28, 2012, at 12:05 PM, Jean-Marc Spaggiari <jean-m...@spaggiari.org>
>> wrote:
>>
>>> Hi Gregory,
>>>
>>> I found this about LVM:
>>> -> http://blog.andrew.net.au/2006/08/09
>>> -> http://www.phoronix.com/scan.php?page=article&item=fedora_15_lvm&num=2
>>>
>>> It seems that performance is still acceptable with it. I will most
>>> probably give it a try and benchmark that too... I have one new hard
>>> drive which should arrive tomorrow. Perfect timing ;)
>>>
>>> JM
>>>
>>> 2012/11/28, Mohit Anchlia <mohitanch...@gmail.com>:
>>>> On Nov 28, 2012, at 9:07 AM, Adrien Mogenet <adrien.moge...@gmail.com>
>>>> wrote:
>>>>
>>>>> Does HBase really benefit from 64 GB of RAM, since allocating too
>>>>> large a heap might increase GC time?
>>>>
>>>> The benefit you get is from the OS cache.
>>>>
>>>>> Another question: why not RAID 0, in order to aggregate disk
>>>>> bandwidth? (And thus keep the 3x replication factor.)
>>>>>
>>>>> On Wed, Nov 28, 2012 at 5:58 PM, Michael Segel
>>>>> <michael_se...@hotmail.com> wrote:
>>>>>
>>>>>> Sorry, I need to clarify.
>>>>>>
>>>>>> 4GB per physical core is a good starting point.
>>>>>> So with 2 quad-core chips, that is going to be 32GB.
>>>>>>
>>>>>> IMHO that's a minimum. If you go with HBase, you will want more.
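[Editor's aside on the MSLAB pointer in Mike's reply above: MSLAB is controlled through hbase-site.xml. A minimal fragment might look like the sketch below; note that MSLAB is enabled by default from HBase 0.92 onward, so setting it explicitly matters mainly on older releases.]

```xml
<!-- hbase-site.xml: enable MemStore-Local Allocation Buffers (MSLAB)
     to reduce heap fragmentation and long stop-the-world GC pauses.
     Default is already true in HBase >= 0.92. -->
<property>
  <name>hbase.hregion.memstore.mslab.enabled</name>
  <value>true</value>
</property>
```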
>>>>>> (Actually, you will need more.) The next logical jump would be to 48
>>>>>> or 64GB.
>>>>>>
>>>>>> If we start to price out memory, then depending on the vendor and
>>>>>> your company's procurement, there really isn't much of a price
>>>>>> difference between 32, 48, or 64 GB.
>>>>>> Note that it also depends on the chips themselves. You also need to
>>>>>> see how many memory channels exist on the motherboard; you may need
>>>>>> to buy in pairs or triplets. Your hardware vendor can help you.
>>>>>> (Also, keep an eye on your hardware vendor. Sometimes they will give
>>>>>> you higher-density chips that are going to be more expensive...) ;-)
>>>>>>
>>>>>> I tend to like having extra memory from the start.
>>>>>> It gives you a bit more freedom and also protects you from 'fat' code.
>>>>>>
>>>>>> Looking at YARN... you will need more memory there too.
>>>>>>
>>>>>> With respect to the hard drives...
>>>>>>
>>>>>> The best recommendation is to keep the drives as JBOD and then use 3x
>>>>>> replication.
>>>>>> In this case, make sure that the disk controller cards can handle
>>>>>> JBOD. (Some don't support JBOD out of the box.)
>>>>>>
>>>>>> With respect to RAID...
>>>>>>
>>>>>> If you are running MapR, there is no need for RAID.
>>>>>> If you are running an Apache derivative, you could use RAID 1 and
>>>>>> then cut your replication to 2x. This makes it easier to manage drive
>>>>>> failures. (It's not the norm, but it works...) Some clusters use
>>>>>> appliances like NetApp's E-Series, where the machines see the drives
>>>>>> as locally attached storage, and I think the appliances themselves
>>>>>> use RAID. I haven't played with this configuration; however, it could
>>>>>> make sense and it's a valid design.
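[Editor's aside: the JBOD-plus-3x-replication layout described above is expressed in Hadoop's own configuration rather than at the controller level. A sketch for hdfs-site.xml follows; the mount points are made up for illustration, and `dfs.data.dir` is the Hadoop 1.x property name (renamed `dfs.datanode.data.dir` in Hadoop 2.x).]

```xml
<!-- hdfs-site.xml: one directory per physical JBOD drive (example
     mount points; adjust to your layout). The DataNode round-robins
     block writes across the listed directories. -->
<property>
  <name>dfs.data.dir</name>
  <value>/data/1/dfs/dn,/data/2/dfs/dn,/data/3/dfs/dn,/data/4/dfs/dn</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```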
>>>>>> HTH
>>>>>>
>>>>>> -Mike
>>>>>>
>>>>>> On Nov 28, 2012, at 10:33 AM, Jean-Marc Spaggiari
>>>>>> <jean-m...@spaggiari.org> wrote:
>>>>>>
>>>>>>> Hi Mike,
>>>>>>>
>>>>>>> Thanks for all those details!
>>>>>>>
>>>>>>> So to simplify the equation: for 16 virtual cores we need 48 to
>>>>>>> 64GB, which means 3 to 4GB per core. So with a quad core, 12GB to
>>>>>>> 16GB is a good start? Or have I simplified it too much?
>>>>>>>
>>>>>>> Regarding the hard drives: if you add more than one drive, do you
>>>>>>> need to build them into RAID or similar systems? Or can Hadoop/HBase
>>>>>>> be configured to use more than one drive?
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> JM
>>>>>>>
>>>>>>> 2012/11/27, Michael Segel <michael_se...@hotmail.com>:
>>>>>>>>
>>>>>>>> OK... I don't know why Cloudera is so hung up on 32GB. ;-) [It's an
>>>>>>>> inside joke...]
>>>>>>>>
>>>>>>>> So here's the problem...
>>>>>>>>
>>>>>>>> By default, the child processes in a map/reduce job get 512MB. The
>>>>>>>> majority of the time, this gets raised to 1GB.
>>>>>>>>
>>>>>>>> 8 cores (dual quad-core) show up as 16 virtual processors in Linux.
>>>>>>>> (Note: this is why, when people talk about the number of cores, you
>>>>>>>> have to specify physical cores or logical cores...)
>>>>>>>>
>>>>>>>> So if you were to oversubscribe and have, let's say, 12 mappers and
>>>>>>>> 12 reducers, that's 24 slots, which means you would need 24GB of
>>>>>>>> memory reserved just for the child processes. This would leave 8GB
>>>>>>>> for the DN, the TT, and the rest of the Linux OS processes.
>>>>>>>>
>>>>>>>> Can you live with that? Sure.
>>>>>>>> Now add in R, HBase, Impala, or some other set of tools on top of
>>>>>>>> the cluster.
>>>>>>>>
>>>>>>>> Oops! Now you are in trouble because you will swap.
>>>>>>>> Also, when adding in R, you may want to bump up those child procs
>>>>>>>> from 1GB to 2GB.
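[Editor's aside: the 12-mapper/12-reducer example above maps onto a handful of Hadoop 1.x (MRv1) settings. The values below simply mirror the example in this thread; they are not a recommendation.]

```xml
<!-- mapred-site.xml (Hadoop 1.x / MRv1): 12 map + 12 reduce slots per
     TaskTracker, with a 1GB heap per child JVM = 24GB reserved for
     task children alone. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>12</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>12</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>
</property>
```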
>>>>>>>> That means the 24 slots would now require 48GB. Now you will swap,
>>>>>>>> and if that happens you will see HBase in a cascading failure.
>>>>>>>>
>>>>>>>> So while you can do a rolling restart with a changed configuration
>>>>>>>> (reducing the number of mappers and reducers), you end up with
>>>>>>>> fewer slots, which will mean longer run times for your jobs.
>>>>>>>> (Fewer slots == less parallelism.)
>>>>>>>>
>>>>>>>> Looking at the price of memory... you can get 48GB or even 64GB for
>>>>>>>> around the same price point (8GB chips).
>>>>>>>>
>>>>>>>> And I didn't even talk about adding Solr, again a memory hog... ;-)
>>>>>>>>
>>>>>>>> Note that I matched the number of mappers with reducers. You could
>>>>>>>> go with fewer reducers if you want. I tend to recommend a ratio of
>>>>>>>> 2:1 mappers to reducers, depending on the workflow...
>>>>>>>>
>>>>>>>> As to the disks... no, 7200 RPM SATA III drives are fine. The SATA
>>>>>>>> III interface is pretty much standard in the new kit being shipped.
>>>>>>>> It's just that you don't have enough drives. 8 cores should mean 8
>>>>>>>> spindles if available.
>>>>>>>> Otherwise you end up seeing your CPU load climb on wait states as
>>>>>>>> the processes wait for the disk I/O to catch up.
>>>>>>>>
>>>>>>>> I mean, you could build out a cluster with 4 x 3.5" 2TB drives in a
>>>>>>>> 1U chassis based on price. You're making a trade-off, and you
>>>>>>>> should be aware of the performance hit you will take.
>>>>>>>>
>>>>>>>> HTH
>>>>>>>>
>>>>>>>> -Mike
>>>>>>>>
>>>>>>>> On Nov 27, 2012, at 1:52 PM, Jean-Marc Spaggiari
>>>>>>>> <jean-m...@spaggiari.org> wrote:
>>>>>>>>
>>>>>>>>> Hi Michael,
>>>>>>>>>
>>>>>>>>> So are you recommending 32GB per node?
>>>>>>>>>
>>>>>>>>> What about the disks? Are SATA drives too slow?
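[Editor's aside: the slot arithmetic Mike walks through in the messages above can be checked with a few lines. Nothing here is Hadoop-specific; it is just the back-of-the-envelope math from the thread.]

```python
# Back-of-the-envelope node memory sizing, following the slot
# arithmetic in this thread (not an official sizing formula).

def slots_memory_gb(mappers, reducers, child_heap_gb):
    """Memory reserved for map/reduce child JVMs on one node."""
    return (mappers + reducers) * child_heap_gb

# 12 mappers + 12 reducers at 1GB each -> 24GB just for child
# processes; on a 32GB node that leaves 8GB for DN, TT, and the OS.
base = slots_memory_gb(12, 12, 1)
print(base)  # 24

# Bump child heaps to 2GB (e.g. when running R on the nodes) and the
# same 24 slots now want 48GB -- more than a 32GB node can give.
bumped = slots_memory_gb(12, 12, 2)
print(bumped)  # 48
```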
>>>>>>>>>
>>>>>>>>> JM
>>>>>>>>>
>>>>>>>>> 2012/11/26, Michael Segel <michael_se...@hotmail.com>:
>>>>>>>>>> Uhm, those specs are actually out of date now.
>>>>>>>>>>
>>>>>>>>>> If you're running HBase, or want to also run R on top of Hadoop,
>>>>>>>>>> you will need to add more memory.
>>>>>>>>>> Also, forget 1GbE, go 10GbE; and with 2 SATA drives, you will be
>>>>>>>>>> disk-I/O bound way too quickly.
>>>>>>>>>>
>>>>>>>>>> On Nov 26, 2012, at 8:05 AM, Marcos Ortiz <mlor...@uci.cu> wrote:
>>>>>>>>>>
>>>>>>>>>>> Are you asking about hardware recommendations?
>>>>>>>>>>> Eric Sammer, in his "Hadoop Operations" book, did a great job on
>>>>>>>>>>> this. For mid-size clusters (up to 300 nodes):
>>>>>>>>>>> Processor: a dual quad-core 2.6 GHz
>>>>>>>>>>> RAM: 24 GB DDR3
>>>>>>>>>>> Dual 1 Gb Ethernet NICs
>>>>>>>>>>> A SAS drive controller
>>>>>>>>>>> At least two SATA II drives in a JBOD configuration
>>>>>>>>>>>
>>>>>>>>>>> The replication factor depends heavily on the primary use of
>>>>>>>>>>> your cluster.
>>>>>>>>>>>
>>>>>>>>>>> On 11/26/2012 08:53 AM, David Charle wrote:
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> What are the recommended numbers of nodes for the NN, HMaster,
>>>>>>>>>>>> and ZK for a larger cluster, let's say 50-100+ nodes?
>>>>>>>>>>>>
>>>>>>>>>>>> Also, what would be the ideal replication factor for larger
>>>>>>>>>>>> clusters when you have 3-4 racks?
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> David
>>>>>>>>>>>>
>>>>>>>>>>>> 10th ANNIVERSARY OF THE FOUNDING OF THE UNIVERSITY OF
>>>>>>>>>>>> INFORMATICS SCIENCES (UCI)...
>>>>>>>>>>>> CONNECTED TO THE FUTURE, CONNECTED TO THE REVOLUTION
>>>>>>>>>>>>
>>>>>>>>>>>> http://www.uci.cu
>>>>>>>>>>>> http://www.facebook.com/universidad.uci
>>>>>>>>>>>> http://www.flickr.com/photos/universidad_uci
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>>
>>>>>>>>>>> Marcos Luis Ortíz Valmaseda
>>>>>>>>>>> about.me/marcosortiz <http://about.me/marcosortiz>
>>>>>>>>>>> @marcosluis2186 <http://twitter.com/marcosluis2186>
>>>>>
>>>>> --
>>>>> Adrien Mogenet
>>>>> 06.59.16.64.22
>>>>> http://www.mogenet.me