Re: [ceph-users] Ceph Journal Disk Size

2015-07-08 Thread Quentin Hartman
Regarding using spinning disks for journals: before I was able to put SSDs
in my deployment, I came up with a somewhat novel journal setup that gave my
cluster way more life than having all the journals on a single disk, or
having the journal on the same disk as its OSD. I called it interleaved
journals. Essentially, offset the journal location by one disk, so in a
4-disk system:

OS disk sda has journal for sdb OSD
sdb OSD disk has journal for sdc OSD
sdc OSD disk has journal for sdd OSD
sdd OSD disk has no journal on it

This limited the contention substantially. When the cluster got busy enough
that multiple OSDs on the same machine were writing simultaneously it still
took a hit, but it was a big upgrade from the out-of-the-box deployment. I
also tried leaving the OS drive out and only interleaving the journals across
the OSD drives, but that was slightly worse under load than this
configuration. It seems that the contention between the journals and the OSDs
was stronger than the contention with OS logging.
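
For reference, the layout maps onto ceph-disk calls roughly like the sketch
below. This assumes each OSD disk was pre-partitioned by hand into a small
journal partition (partition 1) plus a large data partition (partition 2),
and that the OS disk has a spare partition for the first journal; device and
partition names are purely illustrative:

    # journal for the sdb OSD lives on a spare partition of the OS disk (sda)
    ceph-disk prepare /dev/sdb2 /dev/sda4
    # journal for the sdc OSD lives on sdb's small journal partition
    ceph-disk prepare /dev/sdc2 /dev/sdb1
    # journal for the sdd OSD lives on sdc's small journal partition
    ceph-disk prepare /dev/sdd2 /dev/sdc1
    # sdd carries no journal for any other OSD
    # (follow each prepare with "ceph-disk activate" on the data partition)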

QH

On Fri, Jul 3, 2015 at 1:23 AM, Van Leeuwen, Robert rovanleeu...@ebay.com
wrote:



Re: [ceph-users] Ceph Journal Disk Size

2015-07-08 Thread Mark Nelson
The biggest thing to be careful of with this kind of deployment is that
now a single drive failure will take out 2 OSDs instead of 1, which means
OSD failure rates and the associated recovery traffic go up.  I'm not sure
that's worth the trade-off...


Mark

On 07/08/2015 11:01 AM, Quentin Hartman wrote:



Re: [ceph-users] Ceph Journal Disk Size

2015-07-08 Thread Quentin Hartman
I don't see it as being any worse than having multiple journals on a single
drive. If your journal drive tanks, you're out X OSDs as well. It's
arguably better, since the number of affected OSDs per drive failure is
lower. Admittedly, neither deployment is ideal, but it is an effective way to
get from A to B for those of us with limited hardware options.
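
Back-of-envelope for the 4-disk box above, counting worst-case OSD loss per
single drive failure:

    # all journals on one shared disk:  that disk fails -> 3 OSDs lost
    # interleaved journals:             one OSD disk fails -> at most 2 OSDs lost
    #                                   (its own OSD plus the neighbour journaling on it)
    # journal colocated with its OSD:   one disk fails -> 1 OSD lost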

QH

On Wed, Jul 8, 2015 at 10:32 AM, Mark Nelson mnel...@redhat.com wrote:



Re: [ceph-users] Ceph Journal Disk Size

2015-07-03 Thread Van Leeuwen, Robert
 Another issue is performance: you'll get 4x more IOPS with 4 x 2TB drives
 than with one single 8TB.
 So if you have a performance target, your money might be better spent on
 smaller drives.

Regardless of the discussion about whether it is smart to have very large
spinners: be aware that some of the bigger drives use SMR technology.
Quoting Wikipedia on SMR: "shingled recording writes new tracks that overlap
part of the previously written magnetic track, leaving the previous track
thinner and allowing for higher track density" and "the overlapping-tracks
architecture may slow down the writing process since writing to one track
overwrites adjacent tracks, and requires them to be rewritten as well."

Usually these disks are marketed for archival use.
Generally speaking, you really should not use these unless you know exactly
which write workload is hitting the disk and it consists of just very big
sequential writes.

Cheers,
Robert van Leeuwen


Re: [ceph-users] Ceph Journal Disk Size

2015-07-02 Thread Shane Gibson

Lionel - thanks for the feedback ... inline below ...

On 7/2/15, 9:58 AM, Lionel Bouton lionel+c...@bouton.name wrote:

Ouch. These spinning disks are probably a bottleneck: there is regular advice
on this list to use one DC SSD for 4 OSDs. You would probably be better off
with a dedicated partition at the beginning of each OSD disk, or worse, one
file on the filesystem, but it should still be better than a shared spinning disk.

I understand the benefit of journals on SSDs - but if you don't have them, you
don't have them.  With that in mind, I'm completely open to any ideas on the
best way to structure journals and OSDs on 7200 rpm disks.
I'm open to playing around with performance testing various scenarios.  Again -
we realize this is less than optimal, but I would like to explore tweaking
and tuning this setup for the best possible performance you can get out of it.


Anyway, given that you get to use 720 disks (12 disks on 60 servers), I'd still
prefer your setup to mine (24 OSDs); even with what I consider a bottleneck,
your setup probably has far more bandwidth ;-)

My understanding from reading the Ceph docs was that putting the journal on the
OSD disk was strongly discouraged as a very bad idea, due to the I/O contention
between the journal and the OSD data on the same disk.  Like I said - I'm
open to testing this configuration ... and probably will.  We're finalizing our
build/deployment harness right now to be able to modify the architecture of the
OSDs with a fresh build fairly easily.
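
For concreteness, the variants I'd be comparing would be prepared roughly like
the sketch below (device names are illustrative; when no journal device is
given, ceph-disk simply carves the journal out as a second partition on the
data disk):

    # journal colocated on the OSD disk itself (the ceph-disk default layout)
    ceph-disk prepare /dev/sdb
    # journal on a pre-created partition of a different spinner
    ceph-disk prepare /dev/sdb /dev/sdc1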


A reaction to one of your earlier mails:
You said you are going to 8TB drives. The problem isn't so much the time needed
to create new replicas when an OSD fails, but the time to fill a freshly
installed one. The rebalancing is much faster when you add 4 x 2TB drives
than 1 x 8TB drive.

Why should it matter how long it takes a single drive to fill??  Please note
that I'm very, very new to operating Ceph, so I am working to understand these
details - and I'm certain my understanding is still a bit ... simplistic ... :-)

If a drive fails, wouldn't the replica copies on that drive be re-replicated
across other OSD devices when the appropriate timers/triggers cause those data
migrations/re-replications to kick off?

Subsequently, you add a new OSD and bring it online.  It's now ready to be used
- and depending on your CRUSH map policies, it will start to fill.  Yes, this
process ... filling an entire 8TB drive ... certainly would take a while, but
it shouldn't block or degrade the entire cluster - since we have a replica
set of 3, there are two other replica copies to service read requests.  If a
replica copy that is currently in flight to that new OSD gets updated
mid-rebalance, yes, I can see where there would be latency/delays/issues.
As the drive is rebalanced, is it marked available for new writes?  That would
certainly cause significant latency for a new write request - I'd hope that
during a rebalance operation, that OSD disk is not marked available for new
writes.

Which brings me to a question ...

Are there any good documents out there that detail (preferably via a flow
chart/diagram or similar) how the various failure/recovery scenarios change or
impact the cluster?   I've seen very little in regard to this, but I may be
digging in the wrong places?

Thank you for any follow-up information that helps illuminate my understanding
(or lack thereof) of how Ceph failure/recovery situations should impact a
cluster...

~~shane





Re: [ceph-users] Ceph Journal Disk Size

2015-07-02 Thread Shane Gibson

I'd def be happy to share what numbers I can get out of it.  I'm still a 
neophyte w/ Ceph, and learning how to operate it, set it up ... etc...

My limited performance testing to date has been with the stock XFS filesystems
built by ceph-disk for the OSDs, basic PG/CRUSH map stuff - and using dd across
RBD-mounted volumes ...  I'm learning how to scale it up, and starting to tweak
and tune.
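
The dd runs have been plain streaming writes against a mapped RBD image, along
the lines of the sketch below - pool, image name, block size and run length are
illustrative, nothing scientific yet:

    rbd create bench-img --size 102400 --pool rbd        # 100 GB test image
    rbd map rbd/bench-img
    mkfs.xfs /dev/rbd0
    mkdir -p /mnt/rbdtest && mount /dev/rbd0 /mnt/rbdtest
    dd if=/dev/zero of=/mnt/rbdtest/ddfile bs=4M count=4096 oflag=direct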

If anyone on the list is interested in specific tests and can provide specific,
detailed instructions on configuration, test patterns, etc ... I'm happy to run
them if I can ...  We're baking in automation around the Ceph deployment from
fresh build using the Open Crowbar deployment tooling, with a Ceph workload on
it.  Right now we're modifying the Ceph workload to work across multiple L3 rack
boundaries in the cluster.

Physical servers are Dell R720xd platforms, with 12 spinning (4TB 7200 rpm)
data disks and 2x 600 GB 10k mirrored OS disks.  Memory is 128 GB, with dual
6-core HT CPUs.

~~shane



On 7/1/15, 5:24 PM, German Anders gand...@despegar.com wrote:




Re: [ceph-users] Ceph Journal Disk Size

2015-07-02 Thread Nate Curry
Are you using the 4TB disks for the journal?

*Nate Curry*
IT Manager
ISSM
*Mosaic ATM*
mobile: 240.285.7341
office: 571.223.7036 x226
cu...@mosaicatm.com

On Thu, Jul 2, 2015 at 12:16 PM, Shane Gibson shane_gib...@symantec.com
wrote:



[ceph-users] Ceph Journal Disk Size

2015-07-01 Thread Nate Curry
I would like to get some clarification on the size of the journal disks
that I should get for the new Ceph cluster I am planning.  I read about the
journal settings on
http://ceph.com/docs/master/rados/configuration/osd-config-ref/#journal-settings
but that didn't really clarify it for me, or I just didn't get it.  I found
that the Learning Ceph Packt book states you should have one disk for
journalling for every 4 OSDs.  Using that as a reference, I was planning on
getting multiple systems with 8 x 6TB inline SAS drives for OSDs, with two SSDs
for journalling per host, as well as 2 hot spares for the 6TB drives and 2
drives for the OS.  I was thinking of 400GB SSD drives but am wondering if that
is too much.  Any informed opinions would be appreciated.

Thanks,

*Nate Curry*


Re: [ceph-users] Ceph Journal Disk Size

2015-07-01 Thread German Anders
I would probably go with smaller OSD disks; 4TB is too much to lose in case of
a broken disk, so maybe more OSD daemons of a smaller size, say 1TB or 2TB
each. A 4:1 OSD-to-journal relationship is good enough, and I also think a 200G
disk for the journals would be OK, so you can save some money there. Configure
the OSDs as JBOD of course, don't use any RAID under them, and use two
different networks for the public and cluster nets.
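
For sizing, the rule of thumb from the osd-config-ref page Nate linked works
out pretty small per OSD - the throughput figure below is just an illustrative
assumption for a 7200 rpm spinner:

    # osd journal size = 2 * (expected throughput * filestore max sync interval)
    # expected throughput = min(disk throughput, network throughput)
    # with ~200 MB/s per disk and the default 5 s sync interval:
    #   2 * 200 MB/s * 5 s = 2000 MB, i.e. roughly 2 GB of journal per OSD
    # so even a handful of journals only needs tens of GB of space; a bigger SSD
    # buys endurance and sustained write speed, not required capacity
    # (e.g. in ceph.conf:  osd journal size = 10240   -- the value is in MB)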

*German*

2015-07-01 18:49 GMT-03:00 Nate Curry cu...@mosaicatm.com:



Re: [ceph-users] Ceph Journal Disk Size

2015-07-01 Thread Shane Gibson

It also depends a lot on the size of your cluster ... I have a test cluster I'm
standing up right now with 60 nodes - a total of 600 OSDs, each at 4 TB ... If I
lose 4 TB, that's a very small fraction of the data.  My replicas are going to
be spread out across a lot of spindles, and replicating that missing 4 TB isn't
much of an issue across 3 racks, each with 80 gbit/sec ToR uplinks to the spine.
Each node has 20 gbit/sec to the ToR in a bond.

On the other hand ... if you only have 4 .. or 8 ... or 10 servers ... and a 
smaller number of OSDs - you have fewer spindles replicating that loss, and it 
might be more of an issue.

It just depends on the size/scale of  your environment.

We're going to 8 TB drives - and that will ultimately be spread over 100 or
more physical servers w/ 10 OSD disks per server.   This will be across 7 to 10
racks (same network topology) ... so an 8 TB drive loss isn't too big of an
issue.   Now, that assumes that replication actually works well in a cluster of
that size.  We're still sussing out this part of the PoC engagement.

~~shane



On 7/1/15, 5:05 PM, ceph-users on behalf of German Anders
ceph-users-boun...@lists.ceph.com on behalf of gand...@despegar.com wrote:

Ask the other guys on the list, but for me losing 4TB of data is too much. The
cluster will still run fine, but at some point you need to recover that disk,
and if you lose one server with all the 4TB disks, yeah, that will hurt the
cluster. Also take into account that with that kind of disk you will get no
more than 100-110 IOPS per disk.


German Anders
Storage System Engineer Leader
Despegar | IT Team
office +54 11 4894 3500 x3408
mobile +54 911 3493 7262
mail gand...@despegar.com

2015-07-01 20:54 GMT-03:00 Nate Curry cu...@mosaicatm.com:

4TB is too much to lose?  Why would it matter if you lost one 4TB with the 
redundancy?  Won't it auto recover from the disk failure?

Nate Curry



Re: [ceph-users] Ceph Journal Disk Size

2015-07-01 Thread German Anders
I'm interested in such a configuration, can you share some performance
tests/numbers?

Thanks in advance,

Best regards,

*German*

2015-07-01 21:16 GMT-03:00 Shane Gibson shane_gib...@symantec.com:

