Re: [ceph-users] OSD and Journal Files

2013-09-18 Thread Warren Wang
FWIW, we ran into this same issue, couldn't get a good enough SSD-to-spinner
ratio, and decided on simply running the journals on each (spinning) drive
for hosts that have 24 slots.  The problem gets even worse when we're talking
about some of the newer boxes.

Warren


On Wed, Sep 18, 2013 at 1:56 PM, Mike Dawson wrote:

> Joseph,
>
> With properly architected failure domains and replication in a Ceph
> cluster, RAID1 has diminishing returns.
>
> A well-designed CRUSH map should allow for failures at any level of your
> hierarchy (OSDs, hosts, racks, rows, etc) while protecting the data with a
> configurable number of copies.
>
> That being said, losing a series of six OSDs is certainly a hassle and
> journals on a RAID1 set could help prevent that scenario.
>
> But where do you stop? 3 monitors, 5, 7? RAID1 for OSDs, too? 3x
> replication, 4x, 10x? I suppose each operator gets to decide how far to
> chase the diminishing returns.
>
>
> Thanks,
>
> Mike Dawson
> Co-Founder & Director of Cloud Architecture
> Cloudapt LLC
>
> On 9/18/2013 1:27 PM, Gruher, Joseph R wrote:
>
>>
>>
>>> -Original Message-
>>> From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-boun...@lists.ceph.com]
>>> On Behalf Of Mike Dawson
>>>
>>> you need to understand losing an SSD will cause
>>> the loss of ALL of the OSDs which had their journal on the failed SSD.
>>>
>>> First, you probably don't want RAID1 for the journal SSDs. It isn't
>>> particularly
>>> needed for resiliency and certainly isn't beneficial from a throughput
>>> perspective.
>>>
>>
>> Sorry, can you clarify this further for me?  If losing the SSD would
>> cause losing all the OSDs journaling on it, why would you not want to RAID
>> it?
>>


Re: [ceph-users] OSD and Journal Files

2013-09-18 Thread Mike Dawson

Joseph,

With properly architected failure domains and replication in a Ceph 
cluster, RAID1 has diminishing returns.


A well-designed CRUSH map should allow for failures at any level of your 
hierarchy (OSDs, hosts, racks, rows, etc) while protecting the data with 
a configurable number of copies.


That being said, losing a series of six OSDs is certainly a hassle and 
journals on a RAID1 set could help prevent that scenario.
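
To make that concrete, here is a small, hypothetical Python sketch (not Ceph
code; the three-host layout and OSD names are invented for illustration) of why
host-level replica placement already covers the failed-SSD case:

# Hypothetical layout: 3 hosts, each with 6 OSDs journaling on one shared SSD,
# so losing that SSD takes down all 6 OSDs on that host at once.
hosts = {
    "host-a": {f"osd.{i}" for i in range(0, 6)},
    "host-b": {f"osd.{i}" for i in range(6, 12)},
    "host-c": {f"osd.{i}" for i in range(12, 18)},
}

# CRUSH-style placement of one object with 3 copies, one OSD per host.
replicas = ["osd.0", "osd.7", "osd.14"]

failed_host = "host-a"              # its journal SSD died
lost = hosts[failed_host]           # all six OSDs behind that SSD are gone
surviving = [osd for osd in replicas if osd not in lost]

print(f"Copies lost: {len(replicas) - len(surviving)}, copies still online: {surviving}")
# -> 1 copy lost, 2 still online, so the data stays available and re-replicates.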


But where do you stop? 3 monitors, 5, 7? RAID1 for OSDs, too? 3x 
replication, 4x, 10x? I suppose each operator gets to decide how far to 
chase the diminishing returns.


Thanks,

Mike Dawson
Co-Founder & Director of Cloud Architecture
Cloudapt LLC

On 9/18/2013 1:27 PM, Gruher, Joseph R wrote:




-Original Message-
From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-
boun...@lists.ceph.com] On Behalf Of Mike Dawson

you need to understand losing an SSD will cause
the loss of ALL of the OSDs which had their journal on the failed SSD.

First, you probably don't want RAID1 for the journal SSDs. It isn't particularly
needed for resiliency and certainly isn't beneficial from a throughput
perspective.


Sorry, can you clarify this further for me?  If losing the SSD would cause 
losing all the OSDs journaling on it, why would you not want to RAID it?




Re: [ceph-users] OSD and Journal Files

2013-09-18 Thread Gruher, Joseph R


>-Original Message-
>From: ceph-users-boun...@lists.ceph.com [mailto:ceph-users-
>boun...@lists.ceph.com] On Behalf Of Mike Dawson
> 
> you need to understand losing an SSD will cause
>the loss of ALL of the OSDs which had their journal on the failed SSD.
>
>First, you probably don't want RAID1 for the journal SSDs. It isn't 
>particularly
>needed for resiliency and certainly isn't beneficial from a throughput
>perspective.

Sorry, can you clarify this further for me?  If losing the SSD would cause 
losing all the OSDs journaling on it, why would you not want to RAID it?


Re: [ceph-users] OSD and Journal Files

2013-09-18 Thread Corin Langosch

On 18.09.2013 17:03, Mike Dawson wrote:


I think you'll be OK on CPU and RAM.



I'm running the latest dumpling release here, and with default settings each OSD 
consumes more than 3 GB of RAM at peak. So with 48 GB of RAM it would not be 
possible to run the desired 18 OSDs. I filed a bug report for this here: 
http://tracker.ceph.com/issues/5700
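
As a rough back-of-the-envelope check in Python (a sketch only; the ~3 GB peak
per OSD is just the figure observed above, not a guaranteed number):

# RAM headroom check for 18 OSDs on a 48 GB node.
per_osd_peak_gb = 3.0   # observed peak per-OSD usage with dumpling defaults (see above)
osds_per_node = 18
installed_gb = 48

needed_gb = per_osd_peak_gb * osds_per_node
print(f"Peak OSD demand ~{needed_gb:.0f} GB vs {installed_gb} GB installed")
# -> ~54 GB of peak OSD demand alone, before the OS, page cache and any
#    recovery/backfill overhead, so 48 GB looks too tight for 18 OSDs.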


Corin



Re: [ceph-users] OSD and Journal Files

2013-09-18 Thread Ian_M_Porter
Dell - Internal Use - Confidential
Thanks Mike, great info!

-Original Message-
From: Mike Dawson [mailto:mike.daw...@cloudapt.com]
Sent: 18 September 2013 16:04
To: Porter, Ian M; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] OSD and Journal Files

Ian,

There are two schools of thought here. Some people say, run the journal on a 
separate partition on the spinner alongside the OSD partition, and don't mess 
with SSDs for journals. This may be the best practice for an architecture of 
high-density chassis.

The other design is to use SSDs for journals, but design with an appropriate 
ratio of journals per SSD. Plus, you need to understand that losing an SSD will 
cause the loss of ALL of the OSDs that had their journals on the failed SSD.

For now, I'll assume you want to use SSDs and offer some suggestions.

First, you probably don't want RAID1 for the journal SSDs. It isn't 
particularly needed for resiliency and certainly isn't beneficial from a 
throughput perspective.

Next, the best practice is to have enough throughput in the Journals
(SSDs) so your OSDs (spinners) aren't starved. Let's assume your SSDs sustain 
writes at 450MB/s and the spinners can do 120MB/s.

450MB/s divided by 120MB/s = 3.75

Which I would round to a ratio of four OSD Journals on each SSD.

Since it appears you are using 24-drive chassis and the first two drives are 
taken by the RAID1 set for the OS, you have 22 drives left. You could do:

- 4 SSDs, each with 4 Journals
- 16 spinners, each running an OSD process
- 2 RAID1 OS
- 2 Empty

Or, if you want to push the ratio a bit farther (6 OSD journals on an SSD):

- 3 SSDs, each with 6 Journals
- 18 spinners, each running an OSD process
- 1 spinner for OS (no RAID1)

Because your 10Gb network will peak at 1,250MB/s the 6:1 ratio shown above 
should be fine (as you're limited to ~70MB/s for each OSD by the network 
anyway).

I think you'll be OK on CPU and RAM.

Journals are small (default of 1GB, I run 10GB). Create a 10GB unformatted 
partition for each journal and leave the rest of the SSD unallocated (it will 
be used for wear-leveling). If you use high-endurance SSDs, you could certainly 
consider smaller drives as long as they maintain sufficient performance 
characteristics.

Thanks,

Mike Dawson
Co-Founder & Director of Cloud Architecture
Cloudapt LLC


On 9/18/2013 9:52 AM, ian_m_por...@dell.com wrote:
> *Dell - Internal Use - Confidential *
>
> Hi,
>
> I read in the ceph documentation that one of the main performance
> snags in ceph was running the OSDs and journal files on the same disks
> and you should consider at a minimum running the journals on SSDs.
>
> Given I am looking to design a 150 TB cluster, I'm considering the
> following configuration for the storage nodes
>
> No of replicas: 3
>
> Each node
>
> *18 x 1 TB for storage (1 OSD per disk, journals for each OSD are
> stored to a volume on SSD)
>
> *2  x 512 GB SSD drives configured as RAID 1  to store the journal
> files (assuming journal files are not replicated, correct me if I'm
> wrong)
>
> *2 x 300 GB drives for OS/software (RAID 1)
>
> *48 GB RAM
>
> *2 x 10 Gb for public and storage network
>
> *1 x 1 Gb for management network
>
> *Dual E2660 CPU
>
> No of nodes required for 150 TB = 150*3/(18*1) = 25
>
> Unfortunately I don't have any metrics on the throughput into the
> cluster so I can't tell whether 512 GB for journal files will be
> sufficient, so it's a best guess and may be overkill. Also, are there any
> concerns regarding the number of OSDs running on each node? I've seen some
> articles on the web saying the sweet spot is around 8 OSDs per node.
>
> Thanks
>
> Ian
>
Dell Corporation Limited is registered in England and Wales. Company 
Registration Number: 2081369
Registered address: Dell House, The Boulevard, Cain Road, Bracknell,  
Berkshire, RG12 1LF, UK.
Company details for other Dell UK entities can be found on  www.dell.co.uk.


Re: [ceph-users] OSD and Journal Files

2013-09-18 Thread Mark Nelson

Excellent overview Mike!

Mark

On 09/18/2013 10:03 AM, Mike Dawson wrote:

Ian,

There are two schools of thought here. Some people say, run the journal
on a separate partition on the spinner alongside the OSD partition, and
don't mess with SSDs for journals. This may be the best practice for an
architecture of high-density chassis.

The other design is to use SSDs for journals, but design with an
appropriate ratio of journals per SSD. Plus, you need to understand that
losing an SSD will cause the loss of ALL of the OSDs that had their
journals on the failed SSD.

For now, I'll assume you want to use SSDs and offer some suggestions.

First, you probably don't want RAID1 for the journal SSDs. It isn't
particularly needed for resiliency and certainly isn't beneficial from a
throughput perspective.

Next, the best practice is to have enough throughput in the Journals
(SSDs) so your OSDs (spinners) aren't starved. Let's assume your SSDs
sustain writes at 450MB/s and the spinners can do 120MB/s.

450MB/s divided by 120MB/s = 3.75

Which I would round to a ratio of four OSD Journals on each SSD.

Since it appears you are using 24-drive chassis and the first two drives
are taken by the RAID1 set for the OS, you have 22 drives left. You
could do:

- 4 SSDs, each with 4 Journals
- 16 spinners, each running an OSD process
- 2 RAID1 OS
- 2 Empty

Or, if you want to push the ratio a bit farther (6 OSD journals on an SSD):

- 3 SSDs, each with 6 Journals
- 18 spinners, each running an OSD process
- 1 spinner for OS (no RAID1)

Because your 10Gb network will peak at 1,250MB/s the 6:1 ratio shown
above should be fine (as you're limited to ~70MB/s for each OSD by the
network anyway).

I think you'll be OK on CPU and RAM.

Journals are small (default of 1GB, I run 10GB). Create a 10GB
unformatted partition for each journal and leave the rest of the SSD
unallocated (it will be used for wear-leveling). If you use
high-endurance SSDs, you could certainly consider smaller drives as long
as they maintain sufficient performance characteristics.

Thanks,

Mike Dawson
Co-Founder & Director of Cloud Architecture
Cloudapt LLC


On 9/18/2013 9:52 AM, ian_m_por...@dell.com wrote:

*Dell - Internal Use - Confidential *

Hi,

I read in the ceph documentation that one of the main performance snags
in ceph was running the OSDs and journal files on the same disks and you
should consider at a minimum running the journals on SSDs.

Given I am looking to design a 150 TB cluster, I’m considering the
following configuration for the storage nodes

No of replicas: 3

Each node

·18 x 1 TB for storage (1 OSD per disk, journals for each OSD are stored
to a volume on SSD)

·2  x 512 GB SSD drives configured as RAID 1  to store the journal files
(assuming journal files are not replicated, correct me if I’m wrong)

·2 x 300 GB drives for OS/software (RAID 1)

·48 GB RAM

·2 x 10 Gb for public and storage network

·1 x 1 Gb for management network

·Dual E2660 CPU

No of nodes required for 150 TB = 150*3/(18*1) = 25

Unfortunately I don’t have any metrics on the throughput into the
cluster so I can’t tell whether 512 GB for journal files will be
sufficient, so it’s a best guess and may be overkill. Also, are there any concerns
regarding the number of OSDs running on each node? I’ve seen some articles on
the web saying the sweet spot is around 8 OSDs per node.

Thanks

Ian



Re: [ceph-users] OSD and Journal Files

2013-09-18 Thread Mike Dawson

Ian,

There are two schools of thought here. Some people say, run the journal 
on a separate partition on the spinner alongside the OSD partition, and 
don't mess with SSDs for journals. This may be the best practice for an 
architecture of high-density chassis.


The other design is to use SSDs for journals, but design with an 
appropriate ratio of journals per SSD. Plus, you need to understand that 
losing an SSD will cause the loss of ALL of the OSDs that had their 
journals on the failed SSD.


For now, I'll assume you want to use SSDs and offer some suggestions.

First, you probably don't want RAID1 for the journal SSDs. It isn't 
particularly needed for resiliency and certainly isn't beneficial from a 
throughput perspective.


Next, the best practice is to have enough throughput in the Journals 
(SSDs) so your OSDs (spinners) aren't starved. Let's assume your SSDs 
sustain writes at 450MB/s and the spinners can do 120MB/s.


450MB/s divided by 120MB/s = 3.75

Which I would round to a ratio of four OSD Journals on each SSD.
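
As a tiny Python sketch of that sizing rule (the 450MB/s and 120MB/s figures
are the assumptions above, not measurements):

# Journals-per-SSD ratio from sustained write throughput.
ssd_write_mb_s = 450       # assumed sustained write of one journal SSD
spinner_write_mb_s = 120   # assumed sustained write of one OSD spinner

ratio = ssd_write_mb_s / spinner_write_mb_s
print(f"Raw ratio {ratio:.2f} -> about {round(ratio)} OSD journals per SSD")
# -> Raw ratio 3.75 -> about 4 OSD journals per SSD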

Since it appears you are using 24-drive chassis and the first two drives 
are taken by the RAID1 set for the OS, you have 22 drives left. You 
could do:


- 4 SSDs, each with 4 Journals
- 16 spinners, each running an OSD process
- 2 RAID1 OS
- 2 Empty

Or, if you want to push the ratio a bit farther (6 OSD journals on an SSD):

- 3 SSDs, each with 6 Journals
- 18 spinners, each running an OSD process
- 1 spinner for OS (no RAID1)

Because your 10Gb network will peak at 1,250MB/s the 6:1 ratio shown 
above should be fine (as you're limited to ~70MB/s for each OSD by the 
network anyway).
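
Again just a sketch, using the same assumed numbers (18 OSDs sharing one 10Gb
link, 450MB/s per journal SSD):

# Per-OSD share of a 10Gb NIC and what the 6:1 journal ratio then demands.
nic_mb_s = 1250          # 10Gb/s expressed as MB/s
osds_per_node = 18
ssd_write_mb_s = 450

per_osd_cap = nic_mb_s / osds_per_node
demand_on_one_ssd = 6 * per_osd_cap       # six journals sharing one SSD
print(f"~{per_osd_cap:.0f} MB/s per OSD; worst case ~{demand_on_one_ssd:.0f} MB/s "
      f"per SSD vs {ssd_write_mb_s} MB/s available")
# -> ~69 MB/s per OSD, so six journals ask for roughly 417 MB/s, within reach
#    of a 450 MB/s SSD.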


I think you'll be OK on CPU and RAM.

Journals are small (default of 1GB, I run 10GB). Create a 10GB 
unformatted partition for each journal and leave the rest of the SSD 
unallocated (it will be used for wear-leveling). If you use 
high-endurance SSDs, you could certainly consider smaller drives as long 
as they maintain sufficient performance characteristics.
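
And a quick capacity sanity check on the journal partitions (sizes here are the
ones proposed in this thread; treat it as a sketch):

# How much of a journal SSD the journal partitions actually consume.
ssd_size_gb = 512        # proposed SSD size from the spec below
journal_gb = 10          # 10 GB unformatted partition per journal
journals_per_ssd = 6

used_gb = journals_per_ssd * journal_gb
print(f"{used_gb} GB of journals, ~{ssd_size_gb - used_gb} GB left unallocated "
      "for wear-leveling/over-provisioning")
# -> 60 GB used, ~452 GB spare, which is why smaller high-endurance SSDs are
#    worth considering if they keep the same throughput.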


Thanks,

Mike Dawson
Co-Founder & Director of Cloud Architecture
Cloudapt LLC


On 9/18/2013 9:52 AM, ian_m_por...@dell.com wrote:

*Dell - Internal Use - Confidential *

Hi,

I read in the ceph documentation that one of the main performance snags
in ceph was running the OSDs and journal files on the same disks and you
should consider at a minimum running the journals on SSDs.

Given I am looking to design a 150 TB cluster, I’m considering the
following configuration for the storage nodes

No of replicas: 3

Each node

·18 x 1 TB for storage (1 OSD per disk, journals for each OSD are stored
to a volume on SSD)

·2  x 512 GB SSD drives configured as RAID 1  to store the journal files
(assuming journal files are not replicated, correct me if I’m wrong)

·2 x 300 GB drives for OS/software (RAID 1)

·48 GB RAM

·2 x 10 Gb for public and storage network

·1 x 1 Gb for management network

·Dual E2660 CPU

No of nodes required for 150 TB = 150*3/(18*1) = 25

Unfortunately I don’t have any metrics on the throughput into the
cluster so I can’t tell whether 512 GB for journal files will be
sufficient, so it’s a best guess and may be overkill. Also, are there any concerns
regarding the number of OSDs running on each node? I’ve seen some articles on
the web saying the sweet spot is around 8 OSDs per node.

Thanks

Ian



[ceph-users] OSD and Journal Files

2013-09-18 Thread Ian_M_Porter
Dell - Internal Use - Confidential
Hi,

I read in the ceph documentation that one of the main performance snags in ceph 
was running the OSDs and journal files on the same disks and you should 
consider at a minimum running the journals on SSDs.

Given I am looking to design a 150 TB cluster, I'm considering the following 
configuration for the storage nodes

No of replicas: 3

Each node

* 18 x 1 TB for storage (1 OSD per disk, journals for each OSD are 
stored to a volume on SSD)

* 2  x 512 GB SSD drives configured as RAID 1  to store the journal 
files (assuming journal files are not replicated, correct me if I'm wrong)

* 2 x 300 GB drives for OS/software (RAID 1)

* 48 GB RAM

* 2 x 10 Gb for public and storage network

* 1 x 1 Gb for management network

* Dual E2660 CPU


No of nodes required for 150 TB = 150*3/(18*1) = 25
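
A quick Python sketch of that arithmetic (it ignores per-node OS/journal drives
and any free-space headroom, just as the estimate above does):

import math

usable_tb = 150       # target usable capacity
replicas = 3
osds_per_node = 18
tb_per_osd = 1

raw_tb = usable_tb * replicas
nodes = math.ceil(raw_tb / (osds_per_node * tb_per_osd))
print(f"{raw_tb} TB raw -> {nodes} nodes of 18 x 1 TB OSDs")
# -> 450 TB raw and 25 nodes, before leaving headroom for near-full ratios.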

Unfortunately I don't have any metrics on the throughput into the cluster, so I 
can't tell whether 512 GB for journal files will be sufficient; it's a best 
guess and may be overkill. Also, are there any concerns regarding the number of 
OSDs running on each node? I've seen some articles on the web saying the sweet 
spot is around 8 OSDs per node.

Thanks

Ian

Dell Corporation Limited is registered in England and Wales. Company 
Registration Number: 2081369
Registered address: Dell House, The Boulevard, Cain Road, Bracknell,  
Berkshire, RG12 1LF, UK.
Company details for other Dell UK entities can be found on  www.dell.co.uk.