Re: [ceph-users] OSD and Journal Files
Excellent overview Mike!

Mark

On 09/18/2013 10:03 AM, Mike Dawson wrote:

Ian,

There are two schools of thought here.

Some people say: run the journal on a separate partition on the spinner alongside the OSD partition, and don't mess with SSDs for journals. This may be the best practice for an architecture of high-density chassis.

The other design is to use SSDs for journals, but design with an appropriate ratio of journals per SSD. You also need to understand that losing an SSD will cause the loss of ALL of the OSDs that had their journal on the failed SSD.

For now, I'll assume you want to use SSDs and offer some suggestions.

First, you probably don't want RAID1 for the journal SSDs. It isn't particularly needed for resiliency and certainly isn't beneficial from a throughput perspective.

Next, the best practice is to have enough throughput in the journals (SSDs) so your OSDs (spinners) aren't starved. Let's assume your SSDs sustain writes at 450MB/s and the spinners can do 120MB/s: 450MB/s divided by 120MB/s = 3.75, which I would round to a ratio of four OSD journals on each SSD.

Since it appears you are using 24-drive chassis and the first two drives are taken by the RAID1 set for the OS, you have 22 drives left. You could do:

- 4 SSDs, each with 4 journals
- 16 spinners, each running an OSD process
- 2 RAID1 OS
- 2 empty

Or, if you want to push the ratio a bit farther (6 OSD journals on an SSD):

- 3 SSDs, each with 6 journals
- 18 spinners, each running an OSD process
- 1 spinner for OS (no RAID1)

Because your 10Gb network will peak at 1,250MB/s, the 6:1 ratio shown above should be fine (you're limited to ~70MB/s for each OSD by the network anyway).

I think you'll be OK on CPU and RAM.

Journals are small (default of 1GB; I run 10GB). Create a 10GB unformatted partition for each journal and leave the rest of the SSD unallocated (it will be used for wear-leveling). If you use high-endurance SSDs, you could certainly consider smaller drives as long as they maintain sufficient performance characteristics.

Thanks,
Mike Dawson
Co-Founder, Director of Cloud Architecture
Cloudapt LLC

On 9/18/2013 9:52 AM, ian_m_por...@dell.com wrote:

*Dell - Internal Use - Confidential*

Hi,

I read in the Ceph documentation that one of the main performance snags in Ceph was running the OSDs and journal files on the same disks, and that you should consider at a minimum running the journals on SSDs. Given that I am looking to design a 150 TB cluster, I'm considering the following configuration for the storage nodes.

No of replicas: 3

Each node:
- 18 x 1 TB for storage (1 OSD per disk; journals for each OSD are stored on a volume on SSD)
- 2 x 512 GB SSD drives configured as RAID 1 to store the journal files (assuming journal files are not replicated, correct me if I'm wrong)
- 2 x 300 GB drives for OS/software (RAID 1)
- 48 GB RAM
- 2 x 10 Gb for public and storage network
- 1 x 1 Gb for management network
- Dual E5-2660 CPUs

No of nodes required for 150 TB = 150*3/(18*1) = 25

Unfortunately I don't have any metrics on the throughput into the cluster, so I can't tell whether 512 GB for journal files will be sufficient; it's a best guess and may be overkill.

Also, any concerns regarding the number of OSDs running on each node? I've seen some articles on the web saying the sweet spot is around 8 OSDs per node.

Thanks
Ian
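The ratios Mike works through above can be sanity-checked with a few lines of arithmetic. A minimal Python sketch using the same assumed figures (450 MB/s sustained SSD writes, 120 MB/s sustained spinner writes, a single 10 GbE link, 18 OSDs per node); the numbers are illustrative, not benchmarks:

    # Back-of-the-envelope check of the journal ratios discussed above.
    ssd_write_mb_s = 450.0       # assumed sustained journal SSD write speed
    spinner_write_mb_s = 120.0   # assumed sustained OSD spinner write speed
    nic_mb_s = 1250.0            # 10 Gb/s is roughly 1250 MB/s
    osds_per_node = 18

    # Journals one SSD can feed before it, rather than the spinners, is the bottleneck.
    journals_per_ssd = ssd_write_mb_s / spinner_write_mb_s            # 3.75 -> ~4

    # Once the NIC saturates, each OSD only sees a fraction of its spinner speed.
    per_osd_network_ceiling = nic_mb_s / osds_per_node                # ~69 MB/s

    # Journals one SSD can feed at that network-limited rate.
    network_limited_ratio = ssd_write_mb_s / per_osd_network_ceiling  # ~6.5

    print(f"journals per SSD (spinner-limited): {journals_per_ssd:.2f}")
    print(f"per-OSD ceiling on 10 GbE: {per_osd_network_ceiling:.0f} MB/s")
    print(f"journals per SSD (network-limited): {network_limited_ratio:.1f}")

Under those assumptions one SSD keeps about four journals busy at full spinner speed, and once the 10 GbE link is the bottleneck (~69 MB/s per OSD) it can feed roughly six, which is why the 6:1 layout is acceptable on this network.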
Re: [ceph-users] OSD and Journal Files
Dell - Internal Use - Confidential

Thanks Mike, great info!

-----Original Message-----
From: Mike Dawson [mailto:mike.daw...@cloudapt.com]
Sent: 18 September 2013 16:04
To: Porter, Ian M; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] OSD and Journal Files

[Mike's reply, quoted in full above, trimmed]
Re: [ceph-users] OSD and Journal Files
On 18.09.2013 17:03, Mike Dawson wrote:
> I think you'll be OK on CPU and RAM.

I'm running the latest Dumpling here, and with default settings each OSD consumes more than 3 GB of RAM at peak. So with 48 GB of RAM it would not be possible to run the desired 18 OSDs (18 x 3 GB is already 54 GB). I filed a bug report for this here: http://tracker.ceph.com/issues/5700

Corin
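Corin's concern amounts to a quick RAM budget check. A minimal sketch, assuming the roughly 3 GB peak per OSD he reports (actual usage depends on Ceph version, PG count, and recovery state):

    # Per-node RAM budget for the proposed 18-OSD layout.
    ram_per_osd_gb = 3.0    # assumed peak RSS per ceph-osd process (Corin's observation)
    node_ram_gb = 48.0
    osds_per_node = 18

    needed_gb = ram_per_osd_gb * osds_per_node      # 54 GB, over the 48 GB budget
    max_osds = int(node_ram_gb // ram_per_osd_gb)   # 16 OSDs fit at that peak figure

    print(f"peak RAM needed: {needed_gb:.0f} GB of {node_ram_gb:.0f} GB available")
    print(f"OSDs that fit at {ram_per_osd_gb:.0f} GB each: {max_osds}")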
Re: [ceph-users] OSD and Journal Files
-----Original Message-----
From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mike Dawson

> you need to understand losing an SSD will cause the loss of ALL of the
> OSDs which had their journal on the failed SSD.
>
> First, you probably don't want RAID1 for the journal SSDs. It isn't
> particularly needed for resiliency and certainly isn't beneficial from a
> throughput perspective.

Sorry, can you clarify this further for me? If losing the SSD would cause losing all the OSDs journaling on it, why would you not want to RAID it?
Re: [ceph-users] OSD and Journal Files
FWIW, we ran into this same issue and cannot get a good enough SSD-to-spinner ratio, so we decided on simply running the journals on each (spinning) drive for hosts that have 24 slots. The problem gets even worse when we're talking about some of the newer boxes.

Warren

On Wed, Sep 18, 2013 at 1:56 PM, Mike Dawson <mike.daw...@cloudapt.com> wrote:

Joseph,

With properly architected failure domains and replication in a Ceph cluster, RAID1 has diminishing returns. A well-designed CRUSH map should allow for failures at any level of your hierarchy (OSDs, hosts, racks, rows, etc.) while protecting the data with a configurable number of copies.

That being said, losing a series of six OSDs is certainly a hassle, and journals on a RAID1 set could help prevent that scenario. But where do you stop? 3 monitors, 5, 7? RAID1 for OSDs, too? 3x replication, 4x, 10x? I suppose each operator gets to decide how far to chase the diminishing returns.

Thanks,
Mike Dawson
Co-Founder, Director of Cloud Architecture
Cloudapt LLC

On 9/18/2013 1:27 PM, Gruher, Joseph R wrote:
> Sorry, can you clarify this further for me? If losing the SSD would cause
> losing all the OSDs journaling on it, why would you not want to RAID it?
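The hassle Mike mentions when one journal SSD takes several OSDs with it can be roughly quantified. A minimal sketch with hypothetical numbers (the 60% fill level is an assumption, not from the thread), using the 6-journal layout and 1 TB spinners discussed earlier:

    # Rough blast radius when one journal SSD fails and takes its OSDs down.
    osds_lost = 6           # OSDs journaling on the failed SSD
    osd_size_tb = 1.0       # 1 TB spinner behind each OSD
    fill_ratio = 0.60       # hypothetical average utilisation

    data_to_rereplicate_tb = osds_lost * osd_size_tb * fill_ratio   # 3.6 TB
    print(f"data Ceph must backfill onto other OSDs: {data_to_rereplicate_tb:.1f} TB")

Whether avoiding that backfill is worth dedicating twice as many SSDs to journals (RAID1) is exactly the diminishing-returns trade-off Mike describes.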