Re: [ceph-users] Ceph Journal Disk Size
Regarding using spinning disks for journals: before I was able to put SSDs in my deployment, I came up with a somewhat novel journal setup that gave my cluster way more life than having all the journals on a single disk, or having each journal on the same disk as its OSD. I called it interleaved journals. Essentially, offset the journal location by one disk, so in a 4-disk system:

OS disk sda holds the journal for the sdb OSD
sdb OSD disk holds the journal for the sdc OSD
sdc OSD disk holds the journal for the sdd OSD
sdd OSD disk holds no journal

This limited the contention substantially. When the cluster got busy enough that multiple OSDs on the same machine were writing simultaneously it still took a hit, but it was a big upgrade from the out-of-the-box deployment. I also tried leaving the OS drive out and only interleaving the journals across the OSD drives, but that was slightly worse under load than this configuration. It seems the contention between the journals and the OSDs was stronger than the contention with OS logging.

QH

On Fri, Jul 3, 2015 at 1:23 AM, Van Leeuwen, Robert rovanleeu...@ebay.com wrote:
Another issue is performance: you'll get 4x more IOPS with 4 x 2TB drives than with one single 8TB. So if you have a performance target your money might be better spent on smaller drives.
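A minimal sketch of how such an interleaved layout could be wired up by hand, assuming a small journal partition already exists at the front of sda, sdb and sdc and the OSD data filesystems are already prepared and mounted (device names, OSD ids and paths are illustrative, not taken from Quentin's setup; in practice stable /dev/disk/by-partuuid paths are safer than /dev/sdX names):

    # osd.0 (data on sdb) journals to sda1, osd.1 (data on sdc) to sdb1,
    # osd.2 (data on sdd) to sdc1 -- sdd carries no journal partition.
    ln -s /dev/sda1 /var/lib/ceph/osd/ceph-0/journal
    ln -s /dev/sdb1 /var/lib/ceph/osd/ceph-1/journal
    ln -s /dev/sdc1 /var/lib/ceph/osd/ceph-2/journal

    # Initialise the journals before starting the OSD daemons
    ceph-osd -i 0 --mkjournal
    ceph-osd -i 1 --mkjournal
    ceph-osd -i 2 --mkjournal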
Re: [ceph-users] Ceph Journal Disk Size
The biggest thing to be careful of with this kind of deployment is that a single drive failure will now take out 2 OSDs instead of 1, which means OSD failure rates and the associated recovery traffic go up. I'm not sure that's worth the trade-off...

Mark

On 07/08/2015 11:01 AM, Quentin Hartman wrote:
Regarding using spinning disks for journals: before I was able to put SSDs in my deployment, I came up with a somewhat novel journal setup that gave my cluster way more life than having all the journals on a single disk, or having each journal on the same disk as its OSD. I called it interleaved journals.
Re: [ceph-users] Ceph Journal Disk Size
I don't see it as being any worse than having multiple journals on a single drive. If your journal drive tanks, you're out X OSDs as well. It's arguably better, since the number of affected OSDs per drive failure is lower. Admittedly, neither deployment is ideal, but it is an effective way to get from A to B for those of us with limited hardware options.

QH

On Wed, Jul 8, 2015 at 10:32 AM, Mark Nelson mnel...@redhat.com wrote:
The biggest thing to be careful of with this kind of deployment is that a single drive failure will now take out 2 OSDs instead of 1, which means OSD failure rates and the associated recovery traffic go up.
Re: [ceph-users] Ceph Journal Disk Size
Another issue is performance: you'll get 4x more IOPS with 4 x 2TB drives than with one single 8TB. So if you have a performance target, your money might be better spent on smaller drives.

Regardless of the discussion of whether it is smart to have very large spinners: be aware that some of the bigger drives use SMR technology. Quoting Wikipedia on SMR: "Shingled recording writes new tracks that overlap part of the previously written magnetic track, leaving the previous track thinner and allowing for higher track density" and "The overlapping-tracks architecture may slow down the writing process since writing to one track overwrites adjacent tracks, and requires them to be rewritten as well."

Usually these disks are marketed for archival use. Generally speaking you really should not use them unless you know exactly which write workload is hitting the disk and it is just very big sequential writes.

Cheers,
Robert van Leeuwen
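A rough back-of-the-envelope illustration of the IOPS point (the ~100 random IOPS per 7200 rpm spindle is the figure German quotes elsewhere in this thread, not a measured value):

    # Random IOPS scale with spindle count, not capacity
    echo $(( 4 * 100 ))   # 4 x 2TB drives -> ~400 IOPS for 8TB of raw capacity
    echo $(( 1 * 100 ))   # 1 x 8TB drive  -> ~100 IOPS for the same capacity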
Re: [ceph-users] Ceph Journal Disk Size
Lionel - thanks for the feedback ... inline below ...

On 7/2/15, 9:58 AM, Lionel Bouton lionel+c...@bouton.name wrote:

Ouch. These spinning disks are probably a bottleneck: there is regular advice on this list to use one DC SSD for 4 OSDs. You would probably be better off with a dedicated partition at the beginning of each OSD disk, or worse, one file on the filesystem, but it should still be better than a shared spinning disk.

I understand the benefit of journals on SSDs - but if you don't have them, you don't have them. With that in mind, I'm completely open to any ideas on the best way to structure journal/OSD roles across 7200 rpm disks, and I'm open to playing around with performance testing various scenarios. Again - we realize this is less than optimal, but I would like to explore tweaking and tuning this setup for the best possible performance you can get out of it.

Anyway, given that you get to use 720 disks (12 disks on 60 servers), I'd still prefer your setup to mine (24 OSDs); even with what I consider a bottleneck, your setup probably has far more bandwidth ;-)

My understanding from reading the Ceph docs was that putting the journal on the OSD disk was strongly considered a very bad idea, due to the I/O between the journal and the OSD data itself creating contention. Like I said - I'm open to testing this configuration ... and probably will. We're finalizing our build/deployment harness right now to be able to modify the architecture of the OSDs with a fresh build fairly easily.

A reaction to one of your earlier mails: you said you are going to 8TB drives. The problem isn't so much the time needed to create new replicas when an OSD fails, but the time to fill one freshly installed. The rebalancing is much faster when you add 4 x 2TB drives than 1 x 8TB drive.

Why should it matter how long it takes a single drive to fill? Please note that I'm very, very new to operating Ceph, so I am working to understand these details - and I'm certain my understanding is still a bit ... simplistic ... :-) If a drive fails, wouldn't the replica copies on that drive be re-replicated across other OSD devices when the appropriate timers/triggers cause those data migrations to kick off? Subsequently, you add a new OSD and bring it online. It's now ready to be used - and, depending on your CRUSH map policies, it will start to fill. Yes, filling an entire 8TB drive certainly would take a while, but that shouldn't block or degrade the entire cluster - since we have a replica count of 3, there are two other replica copies to service read requests. If a replica copy that is currently in flight in the rebalance to that new OSD is updated, yes, I can see where there would be latency/delays/issues. As the drive is rebalanced, is it marked available for new writes? That would certainly cause significant latency for a new write request - I'd hope that during a rebalance operation the OSD disk is not marked available for new writes.

Which brings me to a question ... are there any good documents out there that detail (preferably via a flow chart/diagram or similar) how the various failure/recovery scenarios change or impact the cluster? I've seen very little in this regard, but I may be digging in the wrong places?

Thank you for any follow-up information that helps illuminate my understanding (or lack thereof) of how Ceph failure/recovery situations should impact a cluster...
~~shane
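A minimal sketch of how the failure/recovery behaviour Shane is asking about can be observed on a test cluster (the OSD id is a placeholder; these are standard Ceph CLI commands, and during recovery the cluster does keep serving reads and writes from the surviving replicas, subject to the pool's min_size):

    # Watch cluster state and recovery/backfill progress in real time
    ceph -w
    # Simulate losing a drive: mark its OSD out so its PGs remap and backfill elsewhere
    ceph osd out 12
    # PG states (degraded, backfilling, recovering) show what the cluster is doing
    ceph pg stat
    ceph health detail
    # After replacing the drive, bring the OSD back in and watch it fill
    ceph osd in 12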
Re: [ceph-users] Ceph Journal Disk Size
I'd def be happy to share what numbers I can get out of it. I'm still a neophyte w/ Ceph, and learning how to operate it, set it up ... etc. My limited performance testing to date has been with the stock XFS ceph-disk built filesystem for the OSDs, basic PG/CRUSH map stuff - and using dd across RBD mounted volumes ... I'm learning how to scale it up, and starting to tweak and tune.

If anyone on the list is interested in specific tests and can provide specific detailed instructions on configuration, test patterns, etc., I'm happy to run them if I can ... We're baking in automation around the Ceph deployment from fresh build using the Open Crowbar deployment tooling, with a Ceph workload on it. Right now we're modifying the Ceph workload to work across multiple L3 rack boundaries in the cluster.

Physical servers are Dell R720xd platforms, with 12 spinning (4TB 7200 rpm) data disks and 2x 10k 600 GB mirrored OS disks. Memory is 128 GB, and dual 6-core HT CPUs.

~~shane

On 7/1/15, 5:24 PM, German Anders gand...@despegar.com wrote:
I'm interested in such a configuration, can you share some performance tests/numbers?
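For the kind of numbers German is asking about, a minimal benchmarking sketch (the pool name, run times and mount path are placeholders; rados bench ships with Ceph, and dd against a mounted RBD volume is what Shane describes using):

    # Raw RADOS write throughput from one client: 60s, 16 concurrent 4MB ops
    rados bench -p rbd 60 write -t 16 --no-cleanup
    # Sequential reads of the objects written above, then clean them up
    rados bench -p rbd 60 seq -t 16
    rados -p rbd cleanup
    # Simple streaming write to a filesystem on a mapped RBD volume
    dd if=/dev/zero of=/mnt/rbdvol/testfile bs=4M count=2048 oflag=direct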
Re: [ceph-users] Ceph Journal Disk Size
Are you using the 4TB disks for the journal?

Nate Curry
IT Manager ISSM
Mosaic ATM
mobile: 240.285.7341
office: 571.223.7036 x226
cu...@mosaicatm.com

On Thu, Jul 2, 2015 at 12:16 PM, Shane Gibson shane_gib...@symantec.com wrote:
I'd def be happy to share what numbers I can get out of it. I'm still a neophyte w/ Ceph, and learning how to operate it, set it up ... etc.
[ceph-users] Ceph Journal Disk Size
I would like to get some clarification on the size of the journal disks that I should get for the new Ceph cluster I am planning. I read about the journal settings at http://ceph.com/docs/master/rados/configuration/osd-config-ref/#journal-settings but that didn't really clarify it for me, or I just didn't get it. I found that the Learning Ceph Packt book states you should have one disk for journalling for every 4 OSDs. Using that as a reference, I was planning on getting multiple systems with 8 x 6TB nearline SAS drives for OSDs, with two SSDs for journalling per host, as well as 2 hot spares for the 6TB drives and 2 drives for the OS. I was thinking of 400GB SSD drives but am wondering if that is too much. Any informed opinions would be appreciated.

Thanks,

Nate Curry
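One rule of thumb from the Ceph docs linked above: the journal should be at least twice the product of the expected throughput and the filestore max sync interval. A minimal sketch of that arithmetic (the throughput and sync-interval figures below are illustrative assumptions, not recommendations):

    # osd journal size >= 2 * (expected throughput * filestore max sync interval)
    # Example: ~500 MB/s of expected throughput into the journal device,
    # filestore max sync interval of 5 seconds (illustrative values)
    echo $(( 2 * 500 * 5 ))   # -> 5000 MB, i.e. roughly 5 GB per journal
    # Four such journals per SSD is only ~20 GB, so a 400 GB SSD is about
    # write endurance and bandwidth far more than it is about capacity.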
Re: [ceph-users] Ceph Journal Disk Size
I would probably go with smaller OSD disks - 4TB is too much to lose when a disk breaks - so maybe more OSD daemons with smaller disks, say 1TB or 2TB in size. A 4:1 OSD-to-journal relationship is good enough, and I also think a 200G disk for the journals would be OK, so you can save some money there. Configure the OSDs as JBOD, of course - don't use any RAID under them - and use two different networks for the public and cluster traffic.

German

2015-07-01 18:49 GMT-03:00 Nate Curry cu...@mosaicatm.com:
I would like to get some clarification on the size of the journal disks that I should get for the new Ceph cluster I am planning.
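A minimal sketch of the public/cluster network split German recommends, as it would appear in ceph.conf (the subnets are placeholders):

    [global]
    # client-facing traffic
    public network = 192.168.10.0/24
    # OSD replication, heartbeat and recovery traffic
    cluster network = 192.168.20.0/24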
Re: [ceph-users] Ceph Journal Disk Size
It also depends a lot on the size of your cluster ... I have a test cluster I'm standing up right now with 60 nodes - a total of 600 OSDs, each at 4 TB ... If I lose 4 TB, that's a very small fraction of the data. My replicas are going to be spread out across a lot of spindles, and replicating that missing 4 TB isn't much of an issue across 3 racks, each with 80 gbit/sec ToR uplinks to the spine. Each node has 20 gbit/sec to the ToR in a bond.

On the other hand ... if you only have 4 .. or 8 ... or 10 servers ... and a smaller number of OSDs - you have fewer spindles replicating that loss, and it might be more of an issue. It just depends on the size/scale of your environment.

We're going to 8 TB drives - and those will ultimately be spread over 100 or more physical servers w/ 10 OSD disks per server. This will be across 7 to 10 racks (same network topology) ... so an 8 TB drive loss isn't too big of an issue. Now, that assumes that replication actually works well in a cluster that size. We're still sussing out this part of the PoC engagement.

~~shane

On 7/1/15, 5:05 PM, ceph-users on behalf of German Anders gand...@despegar.com wrote:

Ask the other guys on the list, but for me, losing 4TB of data is too much. The cluster will keep running fine, but at some point you need to recover that disk, and if you lose one server with all its 4TB disks, then yes, it will hurt the cluster. Also take into account that with that kind of disk you will get no more than 100-110 IOPS per disk.

German Anders
Storage System Engineer Leader
Despegar | IT Team

2015-07-01 20:54 GMT-03:00 Nate Curry cu...@mosaicatm.com:
4TB is too much to lose? Why would it matter if you lost one 4TB drive, given the redundancy? Won't it auto-recover from the disk failure?

Nate Curry
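A rough back-of-the-envelope calculation of why the same 4 TB loss hurts less as the cluster grows (the link speed is the one Shane quotes; the per-spindle throughput is an illustrative figure, and real recovery throttling would stretch both numbers out considerably):

    # Best case, network-limited: 4 TB ~= 4096 GB, and a 20 Gbit/s bond
    # moves roughly 2.5 GB/s
    echo $(( 4096 * 10 / 25 ))   # ~1638 seconds, i.e. under half an hour
    # Small cluster, spindle-limited: e.g. 8 surviving 7200 rpm disks at
    # ~100 MB/s sustained is ~0.8 GB/s aggregate
    echo $(( 4096 * 10 / 8 ))    # ~5120 seconds, and that ignores client I/O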
Re: [ceph-users] Ceph Journal Disk Size
I'm interested in such a configuration - can you share some performance tests/numbers?

Thanks in advance, best regards,

German

2015-07-01 21:16 GMT-03:00 Shane Gibson shane_gib...@symantec.com:
It also depends a lot on the size of your cluster ... I have a test cluster I'm standing up right now with 60 nodes - a total of 600 OSDs, each at 4 TB.