Re: [ceph-users] OSD and Journal Files

2013-09-18 Thread Mark Nelson

Excellent overview, Mike!

Mark

On 09/18/2013 10:03 AM, Mike Dawson wrote:

Ian,

There are two schools of thought here. Some people say to run the journal
on a separate partition on the spinner alongside the OSD partition, and
don't mess with SSDs for journals. This may be the best practice for
high-density chassis architectures.

The other design is to use SSDs for journals, but with an appropriate
ratio of journals per SSD. Plus, you need to understand that losing an SSD
will cause the loss of ALL of the OSDs which had their journal on the
failed SSD.

For now, I'll assume you want to use SSDs and offer some suggestions.

First, you probably don't want RAID1 for the journal SSDs. It isn't
particularly needed for resiliency and certainly isn't beneficial from a
throughput perspective.

Next, the best practice is to have enough throughput in the Journals
(SSDs) so your OSDs (spinners) aren't starved. Let's assume your SSDs
sustain writes at 450MB/s and the spinners can do 120MB/s.

450MB/s divided by 120MB/s = 3.75

Which I would round to a ratio of four OSD Journals on each SSD.
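A quick back-of-the-envelope check of that ratio, sketched in Python (the
450MB/s and 120MB/s figures are the same assumptions as above; substitute
your own measured numbers):

# Journal-to-OSD ratio estimate. Throughput figures are assumptions,
# not measurements; plug in numbers from your own hardware.
ssd_write_mbs = 450.0      # sustained sequential write of one journal SSD
spinner_write_mbs = 120.0  # sustained sequential write of one OSD spinner

raw_ratio = ssd_write_mbs / spinner_write_mbs   # 3.75
journals_per_ssd = round(raw_ratio)             # rounds to 4, as suggested above

print(f"raw ratio: {raw_ratio:.2f}")
print(f"journals per SSD: {journals_per_ssd}")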

Since it appears you are using 24-drive chassis and the first two drives
are taken by the RAID1 set for the OS, you have 22 drives left. You
could do:

- 4 SSDs, each with 4 Journals
- 16 spinners, each running an OSD process
- 2 RAID1 OS
- 2 Empty

Or, if you want to push the ratio a bit farther (6 OSD journals on an SSD):

- 3 SSDs, each with 6 Journals
- 18 spinners, each running an OSD process
- 1 spinner for OS (no RAID1)

Because your 10Gb network will peak at 1,250MB/s, the 6:1 ratio shown
above should be fine (as you're limited to ~70MB/s for each OSD by the
network anyway).
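
To make that concrete, here is a rough Python sketch of the math (NIC line
rate and SSD throughput are assumptions; protocol overhead is ignored):

# Per-OSD bandwidth budget for the 18-spinner / 3-SSD layout above.
nic_mbs = 10 * 1000 / 8          # 10Gb link ~= 1250 MB/s, ignoring overhead
osds = 18                        # one OSD per spinner
per_osd_network_mbs = nic_mbs / osds          # ~69 MB/s if all OSDs write at once

ssd_write_mbs = 450.0            # assumed SSD sustained write
journals_per_ssd = 6
per_osd_journal_mbs = ssd_write_mbs / journals_per_ssd   # 75 MB/s

print(f"network budget per OSD: {per_osd_network_mbs:.0f} MB/s")
print(f"journal budget per OSD: {per_osd_journal_mbs:.0f} MB/s")
# 75 MB/s of journal bandwidth per OSD exceeds the ~69 MB/s the network
# can deliver per OSD, so the SSDs shouldn't be the bottleneck here.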

I think you'll be OK on CPU and RAM.

Journals are small (default of 1GB, I run 10GB). Create a 10GB
unformatted partition for each journal and leave the rest of the SSD
unallocated (it will be used for wear-leveling). If you use
high-endurance SSDs, you could certainly consider smaller drives as long
as they maintain sufficient performance characteristics.
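
For reference, the usual journal sizing rule of thumb is roughly 2 x
expected throughput x filestore max sync interval. A small Python sketch
(inputs are assumptions, including the default 5s sync interval):

# Journal size rule of thumb: 2 * expected throughput * filestore max sync interval.
# Inputs are assumptions; use your own drive throughput and config values.
spinner_throughput_mbs = 120          # expected sustained write of one OSD disk
filestore_max_sync_interval_s = 5     # assumed default sync interval

journal_size_mb = 2 * spinner_throughput_mbs * filestore_max_sync_interval_s
print(f"minimum journal size: {journal_size_mb} MB")   # 1200 MB

# A 10GB partition, as suggested above, gives comfortable headroom while
# leaving most of the SSD unallocated for wear-leveling.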

Thanks,

Mike Dawson
Co-Founder & Director of Cloud Architecture
Cloudapt LLC


On 9/18/2013 9:52 AM, ian_m_por...@dell.com wrote:

*Dell - Internal Use - Confidential *

Hi,

I read in the Ceph documentation that one of the main performance snags
in Ceph was running the OSDs and journal files on the same disks, and that
you should consider, at a minimum, running the journals on SSDs.

Given I am looking to design a 150 TB cluster, I’m considering the
following configuration for the storage nodes

No of replicas: 3

Each node

·18 x 1 TB for storage (1 OSD per disk; journals for each OSD are stored
on a volume on SSD)

·2 x 512 GB SSD drives configured as RAID 1 to store the journal files
(assuming journal files are not replicated, correct me if I'm wrong)

·2 x 300 GB drives for OS/software (RAID 1)

·48 GB RAM

·2 x 10 Gb for public and storage network

·1 x 1 Gb for management network

·Dual E2660 CPU

No of nodes required for 150 TB = 150*3/(18*1) = 25
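
The same estimate as a small Python sketch (all inputs are the figures
above; adjust drive count, drive size, and replica count as needed):

usable_tb = 150          # target usable capacity
replicas = 3             # replication factor
drives_per_node = 18     # OSD spinners per node
drive_tb = 1             # capacity of each spinner

raw_tb = usable_tb * replicas                     # 450 TB of raw capacity
nodes = raw_tb / (drives_per_node * drive_tb)     # 25 nodes

print(f"raw capacity needed: {raw_tb} TB")
print(f"nodes required: {nodes:.0f}")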

Unfortunately I don't have any metrics on the throughput into the
cluster, so I can't tell whether 512 GB for journal files will be
sufficient; it's a best guess and may be overkill. Also, are there any
concerns regarding the number of OSDs running on each node? I've seen some
articles on the web saying the sweet spot is around 8 OSDs per node.

Thanks

Ian

Dell Corporation Limited is registered in England and Wales. Company
Registration Number: 2081369
Registered address: Dell House, The Boulevard, Cain Road, Bracknell,
Berkshire, RG12 1LF, UK.
Company details for other Dell UK entities can be found on
www.dell.co.uk.





Re: [ceph-users] OSD and Journal Files

2013-09-18 Thread Ian_M_Porter
Dell - Internal Use - Confidential
Thanks Mike, great info!



Re: [ceph-users] OSD and Journal Files

2013-09-18 Thread Corin Langosch

On 18.09.2013 17:03, Mike Dawson wrote:


I think you'll be OK on CPU and RAM.



I'm running the latest dumpling here, and with default settings each OSD
consumes more than 3 GB of RAM at peak. So with 48 GB of RAM it would not be
possible to run the desired 18 OSDs. I filed a bug report for this here:
http://tracker.ceph.com/issues/5700
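
A quick check of the arithmetic (the 3 GB peak is the observation above;
the 48 GB and 18 OSDs are from Ian's proposed spec):

ram_gb = 48
osds_per_node = 18
peak_ram_per_osd_gb = 3    # observed peak per OSD on dumpling with defaults

needed_gb = osds_per_node * peak_ram_per_osd_gb   # 54 GB
print(f"peak OSD RAM demand: {needed_gb} GB vs {ram_gb} GB installed")
# 54 GB > 48 GB, so 18 OSDs per node would be tight at that peak usage.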


Corin



Re: [ceph-users] OSD and Journal Files

2013-09-18 Thread Gruher, Joseph R


-Original Message-
From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-
boun...@lists.ceph.com] On Behalf Of Mike Dawson
 
 you need to understand losing an SSD will cause
the loss of ALL of the OSDs which had their journal on the failed SSD.

First, you probably don't want RAID1 for the journal SSDs. It isn't 
particularly
needed for resiliency and certainly isn't beneficial from a throughput
perspective.

Sorry, can you clarify this further for me? If losing the SSD would cause 
losing all the OSDs journaling on it, why would you not want to RAID it?


Re: [ceph-users] OSD and Journal Files

2013-09-18 Thread Warren Wang
FWIW, we ran into this same issue and could not get a good enough
SSD-to-spinner ratio, so we decided on simply running the journals on each
(spinning) drive for hosts that have 24 slots. The problem gets even
worse when we're talking about some of the newer boxes.

Warren



On Wed, Sep 18, 2013 at 1:56 PM, Mike Dawson mike.daw...@cloudapt.com wrote:

 Joseph,

 With properly architected failure domains and replication in a Ceph
 cluster, RAID1 has diminishing returns.

 A well-designed CRUSH map should allow for failures at any level of your
 hierarchy (OSDs, hosts, racks, rows, etc) while protecting the data with a
 configurable number of copies.

 That being said, losing a series of six OSDs is certainly a hassle, and
 journals on a RAID1 set could help prevent that scenario.

 But where do you stop? 3 monitors, 5, 7? RAID1 for OSDs, too? 3x
 replication, 4x, 10x? I suppose each operator gets to decide how far to
 chase the diminishing returns.


 Thanks,

 Mike Dawson
 Co-Founder & Director of Cloud Architecture
 Cloudapt LLC

