Re: [ceph-users] Hardware recommendation / calculation for large cluster
On Mon, May 13, 2013 at 09:30:38PM +0200, Tim Mohlmann wrote:
> Hi,
>
> Ok, thanks for all the info.
>
> Just "fyi", I am a mechanical / electrical marine service engineer. So
> basically I think in pressure, flow, contents, voltage, (milli)amps, power
> and torque. I am just trying to relate it to the same principles. Hence my
> questions. I am certainly not a noob in Linux, open source and that kind
> of stuff.
>
> It is just that I got interested in on-line storage, and by some googling
> I came across certain "products" (most of them proprietary) and some of
> them open source (but looking unmaintained / not very modern), and one of
> them was Ceph. After reading the docs I had some questions, and in my
> opinion they are answered.
>
> I know now how to spend the money, and now it is time to start finding
> out how to make it. I've got a whole bucket of ideas about public apps
> for my storage, and all of this needs to be researched for possibilities.
> (This was just the start of my quest.)

Every journey starts with the first step. :-)

> Again, thanks for the info. If this baby is going to fly, I will keep you
> posted about my findings. Maybe (and really, really maybe) I will try to
> contribute to the source, for some features I already think I want to
> have ;).

Don't be shy about sharing your ideas for new features; maybe some are
already available, or you might be able to do it with some scripting. Maybe
someone thinks it is something they might want to have as well and will
start to work on it.

Recently there was an online developer summit to try and compile such a
list for the near term:

http://ceph.com/events/ceph-developer-summit-summary-and-session-videos/

> Regards, Tim
>
> On Monday 13 May 2013 00:25:19 Dmitri Maziuk wrote:
> > On 2013-05-12 08:34, Tim Mohlmann wrote:
> > > As for choking the backplane: That would just slow things down a bit,
> > > am I right?
> >
> > A bit, a lot, or not at all -- I think IRL you'll have to test it under
> > your workload and see.
> >
> > [ WD performance ]
> >
> > > Did not know that. Do you have any references? Does this also apply
> > > to the enterprise disks?
> >
> > Here's one write-up:
> > https://wiki.archlinux.org/index.php/Advanced_Format
> >
> > Have not tested "enterprise" disks.
> >
> > > Another question: do you use desktop or enterprise disks in your
> > > cluster? I am having trouble finding MTBFs for desktop drives. And if
> > > I find them, they are almost the same as for enterprise drives. Is
> > > there a caveat in there? Is the failure test done in different
> > > conditions? (Not that you have to know that.)
> > >
> > > If the annual failure rate were double, it would still be cheaper to
> > > use desktop drives in a large cluster, but I would just like to know
> > > for sure.
> >
> > I don't think anyone knows for sure how much of it is marketing bull.
> > One rumour is that the difference between "enterprise" and "desktop"
> > drives is very often only the firmware and the price tag. So yeah, we
> > use desktop versions because it's cheaper, but we use them in RAIDs
> > (usually 1/10 - and it's still cheaper), and we don't do super high
> > performance i/o on them. (Our requirements are size rather than speed.)
> >
> > Dima

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Hardware recommendation / calculation for large cluster
Hi,

Ok, thanks for all the info.

Just "fyi", I am a mechanical / electrical marine service engineer. So
basically I think in pressure, flow, contents, voltage, (milli)amps, power
and torque. I am just trying to relate it to the same principles. Hence my
questions. I am certainly not a noob in Linux, open source and that kind of
stuff.

It is just that I got interested in on-line storage, and by some googling I
came across certain "products" (most of them proprietary) and some of them
open source (but looking unmaintained / not very modern), and one of them
was Ceph. After reading the docs I had some questions, and in my opinion
they are answered.

I know now how to spend the money, and now it is time to start finding out
how to make it. I've got a whole bucket of ideas about public apps for my
storage, and all of this needs to be researched for possibilities. (This
was just the start of my quest.)

Again, thanks for the info. If this baby is going to fly, I will keep you
posted about my findings. Maybe (and really, really maybe) I will try to
contribute to the source, for some features I already think I want to
have ;).

Regards, Tim

On Monday 13 May 2013 00:25:19 Dmitri Maziuk wrote:
> On 2013-05-12 08:34, Tim Mohlmann wrote:
> > As for choking the backplane: That would just slow things down a bit,
> > am I right?
>
> A bit, a lot, or not at all -- I think IRL you'll have to test it under
> your workload and see.
>
> [ WD performance ]
>
> > Did not know that. Do you have any references? Does this also apply to
> > the enterprise disks?
>
> Here's one write-up: https://wiki.archlinux.org/index.php/Advanced_Format
>
> Have not tested "enterprise" disks.
>
> > Another question: do you use desktop or enterprise disks in your
> > cluster? I am having trouble finding MTBFs for desktop drives. And if I
> > find them, they are almost the same as for enterprise drives. Is there
> > a caveat in there? Is the failure test done in different conditions?
> > (Not that you have to know that.)
> >
> > If the annual failure rate were double, it would still be cheaper to
> > use desktop drives in a large cluster, but I would just like to know
> > for sure.
>
> I don't think anyone knows for sure how much of it is marketing bull.
> One rumour is that the difference between "enterprise" and "desktop"
> drives is very often only the firmware and the price tag. So yeah, we use
> desktop versions because it's cheaper, but we use them in RAIDs (usually
> 1/10 - and it's still cheaper), and we don't do super high performance
> i/o on them. (Our requirements are size rather than speed.)
>
> Dima
Re: [ceph-users] Hardware recommendation / calculation for large cluster
On 2013-05-12 08:34, Tim Mohlmann wrote:
> As for choking the backplane: That would just slow things down a bit, am
> I right?

A bit, a lot, or not at all -- I think IRL you'll have to test it under
your workload and see.

[ WD performance ]

> Did not know that. Do you have any references? Does this also apply to
> the enterprise disks?

Here's one write-up: https://wiki.archlinux.org/index.php/Advanced_Format

Have not tested "enterprise" disks.

> Another question: do you use desktop or enterprise disks in your cluster?
> I am having trouble finding MTBFs for desktop drives. And if I find them,
> they are almost the same as for enterprise drives. Is there a caveat in
> there? Is the failure test done in different conditions? (Not that you
> have to know that.)
>
> If the annual failure rate were double, it would still be cheaper to use
> desktop drives in a large cluster, but I would just like to know for
> sure.

I don't think anyone knows for sure how much of it is marketing bull. One
rumour is that the difference between "enterprise" and "desktop" drives is
very often only the firmware and the price tag. So yeah, we use desktop
versions because it's cheaper, but we use them in RAIDs (usually 1/10 -
and it's still cheaper), and we don't do super high performance i/o on
them. (Our requirements are size rather than speed.)

Dima
Re: [ceph-users] Hardware recommendation / calculation for large cluster
On Sun, May 12, 2013 at 10:22:10PM +0200, Tim Mohlmann wrote:
> Hi,
>
> On Sunday 12 May 2013 18:05:16 Leen Besselink wrote:
> > I did see you mentioned you wanted to have many disks in the same
> > machine, not just machines with, let's say, 12 disks for example.
> >
> > Did you know you need the CPU power of a 1 GHz Xeon core per OSD for
> > the times when recovery is happening?
>
> Nope, did not know it.
>
> The current intent is to install 2x 2.4 GHz Xeon CPUs, handling 8 threads
> each. So 2*8*2.4 = 38.4 GHz for a maximum of 38 OSDs. It should be fine.
>
> If I would go for the 72-disk option, I have to consider doubling that
> power. The current maximum I can select from the dealer I am looking at,
> for the socket housed in the SuperMicro 72x 3.5" version, is 2x a Xeon
> X5680, utilizing 12 threads each at 3.33 GHz. So 2*12*3.33 = 79.92 GHz
> for a maximum of 79 OSDs. This should also be fine.
>
> What will happen if the CPU is maxed out anyway? Slowing things down or
> crashing things? In my opinion it is not a bad thing if a system is maxed
> out in such a massive migration, which should not occur on a daily basis.
> Sure, a disk that fails every two weeks, no problem. What are we talking
> about? 0.3% of the complete storage cluster. Even 0.15% if I would take
> the 72x 3.5" servers.

Even if one disk/OSD fails, it would need to recheck where each placement
group should be stored and move stuff around if needed. If during this
action your CPUs are maxed out, you might start to lose connections between
OSDs and the process will need to start over. At least that is how I
understand it; I've done a few test installations, but not yet deployed it
in production.

The Inktank people said in the presentations I've seen (and looking at the
picture in the video from DreamHost, I have a feeling that is what they've
deployed): 12 HDDs == 12 OSDs per machine is ideal, maybe with 2 or 3 SSDs
for journaling if you want more performance.

> If a complete server stops working, that is something else. But as I said
> in a different split of this thread: if that happens I have got different
> things to worry about than a slow migration of data. As long as there is
> no data lost, I don't really care if it takes a bit longer.
>
> Thanks for the advice.
>
> Tim
Re: [ceph-users] Hardware recommendation / calculation for large cluster
Hi,

On Sunday 12 May 2013 18:05:16 Leen Besselink wrote:
> I did see you mentioned you wanted to have many disks in the same
> machine, not just machines with, let's say, 12 disks for example.
>
> Did you know you need the CPU power of a 1 GHz Xeon core per OSD for the
> times when recovery is happening?

Nope, did not know it.

The current intent is to install 2x 2.4 GHz Xeon CPUs, handling 8 threads
each. So 2*8*2.4 = 38.4 GHz for a maximum of 38 OSDs. It should be fine.

If I would go for the 72-disk option, I have to consider doubling that
power. The current maximum I can select from the dealer I am looking at,
for the socket housed in the SuperMicro 72x 3.5" version, is 2x a Xeon
X5680, utilizing 12 threads each at 3.33 GHz. So 2*12*3.33 = 79.92 GHz for
a maximum of 79 OSDs. This should also be fine.

What will happen if the CPU is maxed out anyway? Slowing things down or
crashing things? In my opinion it is not a bad thing if a system is maxed
out in such a massive migration, which should not occur on a daily basis.
Sure, a disk that fails every two weeks, no problem. What are we talking
about? 0.3% of the complete storage cluster. Even 0.15% if I would take the
72x 3.5" servers.

If a complete server stops working, that is something else. But as I said
in a different split of this thread: if that happens I have got different
things to worry about than a slow migration of data. As long as there is no
data lost, I don't really care if it takes a bit longer.

Thanks for the advice.

Tim
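[Editor's note: the rule of thumb discussed above (roughly 1 GHz of Xeon
core per OSD during recovery) and the per-host arithmetic can be wrapped in
a quick back-of-the-envelope check. This is just the thread's own math in a
function; the hardware figures are the ones quoted in the post, not a
recommendation.]

```python
# Back-of-the-envelope check of the "1 GHz of Xeon core per OSD during
# recovery" rule of thumb from this thread. Figures below are the ones
# mentioned in the post; adjust for your own hardware.

def max_osds_for_recovery(sockets: int, threads_per_cpu: int,
                          clock_ghz: float, ghz_per_osd: float = 1.0) -> int:
    """Total nominal GHz divided by the per-OSD recovery budget."""
    total_ghz = sockets * threads_per_cpu * clock_ghz
    return int(total_ghz // ghz_per_osd)

# 2x Xeon E5620: 8 threads each at 2.4 GHz -> 38.4 GHz budget
print(max_osds_for_recovery(2, 8, 2.4))    # 38, enough for a 36-bay chassis

# 2x Xeon X5680: 12 threads each at 3.33 GHz -> 79.92 GHz budget
print(max_osds_for_recovery(2, 12, 3.33))  # 79, enough for a 72-bay chassis
```

Note this counts hardware threads, as the post does; counting physical
cores instead would halve the budget and is the more conservative reading
of the rule.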
Re: [ceph-users] Hardware recommendation / calculation for large cluster
On Sun, May 12, 2013 at 03:14:15PM +0200, Tim Mohlmann wrote:
> Hi,
>
> On Saturday 11 May 2013 16:04:27 Leen Besselink wrote:
> > Someone is going to correct me if I'm wrong, but I think you misread
> > something.
> >
> > The Mon-daemon doesn't need that much RAM:
> >
> > The 'RAM: 1 GB per daemon' is per Mon-daemon, not per OSD-daemon.
>
> Gosh, I feel embarrassed. This actually was my main concern / bottleneck.
> Thanks for pointing this out. Seems Ceph really rocks for deploying
> affordable data clusters.

I did see you mentioned you wanted to have many disks in the same machine,
not just machines with, let's say, 12 disks for example.

Did you know you need the CPU power of a 1 GHz Xeon core per OSD for the
times when recovery is happening?

> Regards, Tim
>
> > On Sat, May 11, 2013 at 03:42:59PM +0200, Tim Mohlmann wrote:
> > > Hi,
> > >
> > > First of all I am new to Ceph and this mailing list. At this moment
> > > I am looking into the possibilities to get involved in the storage
> > > business. [...]
Re: [ceph-users] Hardware recommendation / calculation for large cluster
Hi,

On Saturday 11 May 2013 16:22:15 Dimitri Maziuk wrote:
> SuperMicro has a new 4U chassis w/ 72x 3.5" drives (2/canister). You can
> double the number of drives. (With faster drives you may be getting close
> to choking the expander backplane, though.)

Just checked their site and those are awesome. Did not run into them before
because they are not available / advertised yet in the Netherlands.
Probably requesting a quote from a distributor would still be possible.

As for choking the backplane: that would just slow things down a bit, am I
right?

I intend to write some management scripting, so as not to keep all the
disks in the cluster all the time. When storage grows, I will let the
script add disks / OSDs to the cluster. The unused disks will be in
stand-by mode / spun down. Probably well before the last disks are put in
the cluster, I should consider re-investment and adding servers anyway.

> WD 3+TB drives don't have the option to turn off "advanced format" or
> whatever it's called: the part where they lie to the OS about sector size
> because they ran out of bits for some other counter (will they ever
> learn). In my tests iostat shows 10x i/o wait on "desktop" wd drives
> compared to seagates. Aligning partitions to 4096, 16384, or any other
> sector boundary didn't seem to make any difference.

Did not know that. Do you have any references? Does this also apply to the
enterprise disks?

> So we quit buying wds. Consider seagates, they go to 4TB in both
> "enterprise" and desktop lines, too.

Pricing is about the same, so why not?

Another question: do you use desktop or enterprise disks in your cluster?
I am having trouble finding MTBFs for desktop drives. And if I find them,
they are almost the same as for enterprise drives. Is there a caveat in
there? Is the failure test done in different conditions? (Not that you have
to know that.)

If the annual failure rate were double, it would still be cheaper to use
desktop drives in a large cluster, but I would just like to know for sure.

Thanks and regards,

Tim
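[Editor's note: the "double the AFR is still cheaper" claim above is easy
to sanity-check with a small cost model. The drive prices and failure rates
below are made-up placeholders for illustration, not figures from this
thread or from any vendor; plug in your own quotes.]

```python
# Rough 3-year cost model for a drive population: purchase price plus the
# expected number of replacement drives implied by the annual failure rate
# (AFR). All prices and AFRs here are hypothetical placeholders.

def three_year_cost(n_drives: int, price_per_drive: float, afr: float,
                    years: int = 3) -> float:
    """Purchase cost plus expected replacement-drive cost over the period."""
    expected_failures = n_drives * afr * years
    return n_drives * price_per_drive + expected_failures * price_per_drive

# 324 drives; hypothetical prices: desktop 150/drive with AFR doubled to
# 4%, enterprise 250/drive at 2% AFR.
desktop = three_year_cost(324, 150.0, 0.04)
enterprise = three_year_cost(324, 250.0, 0.02)
print(f"desktop: {desktop:.0f}, enterprise: {enterprise:.0f}")
# On this simple model the cheaper drive still wins despite double the AFR.
assert desktop < enterprise
```

The model deliberately ignores replacement labour, rebuild i/o load, and
any data-loss risk during rebuilds, which is where the "enterprise" premium
is usually argued to pay off.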
Re: [ceph-users] Hardware recommendation / calculation for large cluster
Hi,

On Saturday 11 May 2013 16:04:27 Leen Besselink wrote:
> Someone is going to correct me if I'm wrong, but I think you misread
> something.
>
> The Mon-daemon doesn't need that much RAM:
>
> The 'RAM: 1 GB per daemon' is per Mon-daemon, not per OSD-daemon.

Gosh, I feel embarrassed. This actually was my main concern / bottleneck.
Thanks for pointing this out. Seems Ceph really rocks for deploying
affordable data clusters.

Regards, Tim

> On Sat, May 11, 2013 at 03:42:59PM +0200, Tim Mohlmann wrote:
> > Hi,
> >
> > First of all I am new to Ceph and this mailing list. At this moment I
> > am looking into the possibilities to get involved in the storage
> > business. [...]
Re: [ceph-users] Hardware recommendation / calculation for large cluster
On 05/11/2013 08:42 AM, Tim Mohlmann wrote:
> Each OSD server uses 4U and can take 36x 3.5" drives. So in 36U I can put
> 36/4 = 9 OSD servers, containing 9*36 = 324 HDDs.

SuperMicro has a new 4U chassis w/ 72x 3.5" drives (2/canister). You can
double the number of drives. (With faster drives you may be getting close
to choking the expander backplane, though.)

> HARD DISK DRIVES
>
> I have been looking at the Western Digital RE and RED series. RE is more
> expensive per GB, but has a larger MTBF and offers a 4TB model. RED is
> (really) cheap per GB, but only goes as far as 3TB.

WD 3+TB drives don't have the option to turn off "advanced format" or
whatever it's called: the part where they lie to the OS about sector size
because they ran out of bits for some other counter (will they ever learn).
In my tests iostat shows 10x i/o wait on "desktop" wd drives compared to
seagates. Aligning partitions to 4096, 16384, or any other sector boundary
didn't seem to make any difference.

So we quit buying wds. Consider seagates, they go to 4TB in both
"enterprise" and desktop lines, too.

HTH
--
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu
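[Editor's note: "Advanced Format" drives have 4096-byte physical sectors
but report 512-byte logical sectors, so a partition's starting LBA must be
a multiple of 8 to land on a physical sector boundary. A minimal sketch of
that check; the start sectors are common examples, not values from this
thread.]

```python
# Check whether a partition start (given in 512-byte logical sectors) is
# aligned to the 4096-byte physical sectors of an "Advanced Format" drive
# that exposes 512-byte logical sectors.

def is_4k_aligned(start_sector: int, logical_bytes: int = 512,
                  physical_bytes: int = 4096) -> bool:
    """True if the partition's byte offset is a multiple of the
    physical sector size."""
    return (start_sector * logical_bytes) % physical_bytes == 0

print(is_4k_aligned(63))    # False: the classic misaligned DOS default
print(is_4k_aligned(2048))  # True: the modern 1 MiB-aligned default
```

As Dimitri notes above, in his tests alignment alone did not remove the
extra i/o wait on those WD drives, so treat alignment as necessary rather
than sufficient.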
Re: [ceph-users] Hardware recommendation / calculation for large cluster
Hi,

Someone is going to correct me if I'm wrong, but I think you misread
something.

The Mon-daemon doesn't need that much RAM:

The 'RAM: 1 GB per daemon' is per Mon-daemon, not per OSD-daemon. The same
goes for disk space.

You should read this page again:
http://ceph.com/docs/master/install/hardware-recommendations/

Some of the other questions are answered there as well, like how much
memory an OSD-daemon needs and why/when.

On Sat, May 11, 2013 at 03:42:59PM +0200, Tim Mohlmann wrote:
> Hi,
>
> First of all I am new to Ceph and this mailing list. At this moment I am
> looking into the possibilities to get involved in the storage business.
> I am trying to get an estimate of the costs, and after that I will start
> to determine how to get sufficient income. [...]
[ceph-users] Hardware recommendation / calculation for large cluster
Hi,

First of all I am new to Ceph and this mailing list. At this moment I am
looking into the possibilities to get involved in the storage business. I
am trying to get an estimate of the costs, and after that I will start to
determine how to get sufficient income.

First I will describe my case; at the bottom you will find my questions.

GENERAL LAYOUT:

Part of this cost calculation is of course hardware. For the larger part
I've already figured it out. In my plans I will be leasing a full rack
(46U). Depending on the domestic needs I will be using 36 or 40U for OSD
storage servers. (I will assume 36U from here on, to keep a solid value for
calculation and have enough spare space for extra devices.)

Each OSD server uses 4U and can take 36x 3.5" drives. So in 36U I can put
36/4 = 9 OSD servers, containing 9*36 = 324 HDDs.

HARD DISK DRIVES

I have been looking at the Western Digital RE and RED series. RE is more
expensive per GB, but has a larger MTBF and offers a 4TB model. RED is
(really) cheap per GB, but only goes as far as 3TB.

At my current calculations it does not matter much whether I put in
expensive WD RE 4TB disks or cheaper WD RED 3TB disks; the price per GB
over the complete cluster expense and 3 years of running costs (including
AFR) is almost the same.

So basically, if I can reduce the costs of all the other components used in
the cluster, I would go for the 3TB disk, and if the costs will be higher
than my first calculation, I would use the 4TB disk.

Let's assume 4TB from now on. So 4*324 = 1296TB. So let's go petabyte ;).

NETWORK

I will use a redundant 2x 10GbE network connection for each node. Two
independent 10GbE switches will be used and I will use bonding between the
interfaces on each node. (Thanks to some guy on the #Ceph IRC for pointing
this option out.) I will use VLANs to split the front-side, back-side and
Internet networks.

OSD SERVER

SuperMicro based, 36 HDD hot-swap. Dual-socket mainboard, 16x DIMM sockets.
It is advertised they can take up to 512GB of RAM. I will install 2x Intel
Xeon E5620 2.40GHz processors, having 4 cores and 8 threads each. For the
RAM I am in doubt (see below). I am looking into running 1 OSD per disk.

MON AND MDS SERVERS

Now comes the big question: what specs are required? At first I had the
plan to use 4 SuperMicro superservers, with 4-socket mainboards that can
take up to the new 16-core AMD processors and up to 1TB of RAM.

I want all 4 of the servers to run a MON service, MDS service and customer
/ public services. Probably I would use VMs (kvm) to separate them. I will
compile my own kernel to enable Kernel Samepage Merging, hugepage support
and memory compaction to make RAM use more efficient. The requirements for
my public services will be added up once I know what I need for MON and
MDS.

RAM FOR ALL SERVERS

So what would you estimate to be the RAM usage?
http://ceph.com/docs/master/install/hardware-recommendations/#minimum-hardware-recommendations

Sounds OK for the OSD part. 500 MB per daemon would put the minimum RAM
requirement for my OSD server at 18GB; 32GB should be more than enough.
Although I would like to see if it is possible to use btrfs compression?
In that case I'd need more RAM in there.

What I really want to know: how much RAM do I need for MON and MDS
servers? 1GB per daemon sounds pretty steep. As everybody knows, RAM is
expensive!

In my case I would need at least 324 GB of RAM for each of them. Initially
I was planning to use 4 servers, each of them running both. Joining those
in a single system, with the other duties the system has to perform, I
would need the full 1TB of RAM. I would need to use 32GB modules which are
really expensive per GB and difficult to find (not many server hardware
vendors in the Netherlands have them).

QUESTIONS

Question 1: Is it really the number of OSDs that counts for MON and MDS
RAM usage, or the size of the object store?

Question 2: Can I do it with less RAM? Any statistics, or better: a
calculation? I can imagine memory pages becoming redundant as the cluster
grows, so less memory required per OSD.

Question 3: If it is the number of OSDs that counts, would it be
beneficial to combine disks in a RAID 0 (lvm or btrfs) array?

Question 4: Is it safe / possible to store MON files inside of the cluster
itself? The 10GB per daemon requirement would mean I need 3240GB of
storage for each MON, meaning I need to get some huge disks and an (lvm)
RAID 1 array for redundancy, while I have a huge redundant file system at
hand already.

Question 5: Is it possible to enable btrfs compression? I know btrfs is
not stable for production yet, but it would be nice if compression were
supported in the future, when it does become stable.

If the RAM requirement is not so steep, I am thinking about the
possibility to run the MON service from 4 OSD servers. Upgrading them to
16x 16GB of RAM would give me 256GB of RAM. (Again, 32GB modules are too
expensive and not an option.)
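[Editor's note: the sizing arithmetic scattered through this post (36-bay
4U servers in 36U of rack space, 4TB drives, one OSD per drive, 500 MB of
RAM per OSD daemon, 1 GB per mon daemon) can be pulled together in one
sketch. The per-daemon figures are the ones quoted in the thread from the
Ceph hardware recommendations page; the rest is just the post's own
arithmetic.]

```python
# Pull together the rack-sizing arithmetic from this post: 36-bay 4U OSD
# servers in 36U of rack space, 4 TB drives, one OSD per drive, plus the
# per-daemon RAM figures quoted in the thread (500 MB per OSD daemon,
# 1 GB per mon daemon).

RACK_U = 36            # rack units reserved for OSD servers
SERVER_U = 4           # 4U per 36-bay chassis
DRIVES_PER_SERVER = 36
DRIVE_TB = 4

servers = RACK_U // SERVER_U                  # 9 OSD servers
total_drives = servers * DRIVES_PER_SERVER    # 324 HDDs / OSDs
raw_capacity_tb = total_drives * DRIVE_TB     # 1296 TB raw

ram_per_osd_server_gb = DRIVES_PER_SERVER * 0.5   # 18 GB minimum per server
ram_per_mon_gb = 1                                # per mon daemon, not per OSD

print(servers, total_drives, raw_capacity_tb, ram_per_osd_server_gb)
```

Note that 1296TB is raw capacity; with the usual Ceph replication factor of
2 or 3 the usable capacity is correspondingly lower.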