G'day Greg, thanks for the fast response.

Yes, I forgot to explicitly state that in CASE 1 the journals would go on the
SATA disks themselves, and the performance impact of that case is easy to
appreciate, as you documented nicely in your response.

Re your second point: 
> The other big advantage an SSD provides is in write latency; if you're 
> journaling on an SSD you can send things to disk and get a commit back 
> without having to wait on rotating media. How big an impact that will make 
> will depend on your other config options and use case, though.

Are you able to detail which config options tune this, and give an example use
case to illustrate?

Many thanks

Paul


-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Gregory Farnum
Sent: Tuesday, 21 February 2012 10:50 AM
To: Paul Pettigrew
Cc: Sage Weil; Wido den Hollander; ceph-devel@vger.kernel.org
Subject: Re: Which SSD method is better for performance?

On Mon, Feb 20, 2012 at 4:44 PM, Paul Pettigrew <paul.pettig...@mach.com.au> 
wrote:
> Thanks Sage
>
> So following through by two examples, to confirm my understanding........
>
> HDD SPECS:
> 8x 2TB SATA HDDs, each able to do a sustained read/write speed of 138MB/s
> 1x SSD able to do a sustained read/write speed of 475MB/s
>
> CASE 1
> (not using SSD)
> 8x OSDs, one for each SATA HDD, therefore able to parallelise IO operations.
> Sustained write sent to Ceph of a very large file, say 500GB (so all caches
> are used up and the bottleneck becomes SATA IO speed).
> Gives 8x 138MB/s = 1,104MB/s
>
> CASE 2
> (using 1x SSD)
> SSD partitioned into 8x separate partitions, one for each OSD.
> Sustained write (with the OSD journals on the SSD) sent to Ceph of a very
> large file (say 500GB).
> Write split across the 8x OSD journal partitions on the single SSD = limited
> to an aggregate of 475MB/s
>
> ANALYSIS:
> If my examples are how Ceph operates, then it is necessary not to exceed a
> ratio of 3 SATA : 1 SSD; since 475MB/s / 138MB/s is roughly 3.4, with 4 or
> more SATA drives per SSD the SSD becomes the bottleneck.
>
> Is this analysis accurate? Are there other benefits that SSDs provide
> (including in the non-sustained, peak-write-performance use case) that would
> otherwise justify their usage? What ratios are other users sticking to in
> their designs?

Well, you seem to be leaving out the journals entirely in the first case. You 
could put them on a separate partition on the SATA disks if you wanted, which 
(on a modern drive) would net you half the single-stream throughput, or 
~552MB/s aggregate.

The other big advantage an SSD provides is in write latency; if you're 
journaling on an SSD you can send things to disk and get a commit back without 
having to wait on rotating media. How big an impact that will make will depend 
on your other config options and use case, though.
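
The main knobs here are the journal size and the filestore sync intervals; a
minimal illustrative sketch (values are placeholders, not tuned
recommendations):

[osd]
        ; journal size in MB (illustrative value)
        osd journal size = 1024
        ; min/max seconds between journal->filesystem commits
        filestore min sync interval = 0.01
        filestore max sync interval = 5
        ; use direct IO when the journal is a block device
        journal dio = true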
-Greg

>
> Many thanks all - this is all being rolled up into a new "Ceph SSD" wiki page 
> I will be offering to Sage to include in the main Ceph wiki site.
>
> Paul
>
>
>
> -----Original Message-----
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Sage Weil
> Sent: Monday, 20 February 2012 1:16 PM
> To: Paul Pettigrew
> Cc: Wido den Hollander; ceph-devel@vger.kernel.org
> Subject: RE: Which SSD method is better for performance?
>
> On Mon, 20 Feb 2012, Paul Pettigrew wrote:
>> And secondly, should the SSD journal sizes be large or small? I.e., is
>> say a 1GB partition per paired 2-3TB SATA disk OK? Or as large an SSD as
>> possible? There are many forum posts that say 100-200MB will suffice.
>> A quick piece of advice will hopefully save us several days of
>> reconfiguring and benchmarking the cluster :-)
>
> ceph-osd will periodically do a 'commit' to ensure that stuff in the journal
> is written safely to the file system.  On btrfs that's a snapshot; on
> anything else it's a sync(2).  We trigger a commit when the journal hits 50%
> full, or when a timer expires (I think 30 seconds by default).  There is some
> overhead associated with the sync/snapshot, so fewer commits are generally
> better.
>
> A decent rule of thumb is probably to make the journal big enough to consume
> sustained writes for 10-30 seconds.  On modern disks, that's probably 1-3GB.
> If the journal is on the same spindle as the fs, it'll probably be half
> that...
> </hand waving>
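>
> As a worked example of that rule of thumb against the 138MB/s SATA disks
> mentioned earlier: 10-30 seconds of sustained writes is roughly 1.4-4GB of
> journal per OSD, so something like the following (the option takes MB, and
> the exact value is illustrative only):
>
> [osd]
>         ; ~15 seconds of writes at ~138MB/s (size is in MB)
>         osd journal size = 2048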
>
> sage
>
>
>
>>
>> Thanks
>>
>> Paul
>>
>>
>> -----Original Message-----
>> From: ceph-devel-ow...@vger.kernel.org 
>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Wido den 
>> Hollander
>> Sent: Tuesday, 14 February 2012 10:46 PM
>> To: Paul Pettigrew
>> Cc: ceph-devel@vger.kernel.org
>> Subject: Re: Which SSD method is better for performance?
>>
>> Hi,
>>
>> On 02/14/2012 01:39 AM, Paul Pettigrew wrote:
>> > G'day all
>> >
>> > About to commence an R&D eval of the Ceph platform, having been impressed
>> > with the momentum achieved over the past 12 months.
>> >
>> > I have one question re design before rolling out to metal........
>> >
>> > I will be using 1x SSD drive per storage server node (assume it is
>> > /dev/sdb for this discussion), and cannot readily determine the pros/cons
>> > of the two methods of using it for the OSD journal, being:
>> > #1. place it in the main [osd] stanza and reference the whole drive
>> > as a single partition; or
>>
>> That won't work. If you do that, all OSDs will try to open the same journal.
>> The journal for each OSD has to be unique.
>>
>> > #2. partition up the disk, with one partition per SATA HDD, and place
>> > each partition in its [osd.N] section
>>
>> That would be your best option.
>>
>> I'm doing the same: http://zooi.widodh.nl/ceph/ceph.conf
>>
>> the VG "data" is placed on a SSD (Intel X25-M).
>>
>> >
>> > So if I were to code #1 in the ceph.conf file, it would be:
>> > [osd]
>> > osd journal = /dev/sdb
>> >
>> > Or, #2 would be like:
>> > [osd.0]
>> >          host = ceph1
>> >          btrfs devs = /dev/sdc
>> >          osd journal = /dev/sdb5
>> > [osd.1]
>> >          host = ceph1
>> >          btrfs devs = /dev/sdd
>> >          osd journal = /dev/sdb6
>> > [osd.2]
>> >          host = ceph1
>> >          btrfs devs = /dev/sde
>> >          osd journal = /dev/sdb7
>> > [osd.3]
>> >          host = ceph1
>> >          btrfs devs = /dev/sdf
>> >          osd journal = /dev/sdb8
>> >
>> > I am asking, therefore: is the added work (and the constraints) of
>> > specifying down to individual partitions per #2 worth it in performance
>> > gains? Does it not also have a constraint, in that if I wanted to add more
>> > HDDs to the server (we buy 45-bay units and typically provision HDDs "on
>> > demand", i.e. 15 at a time as usage grows), I would have to additionally
>> > partition the SSD (taking it offline) - whereas with option #1 I would
>> > only have to add more [osd.N] sections (and not have to worry about ending
>> > up with 45 partitions on the SSD)?
>> >
>>
>> You'd still have to go with #2. However, running 45 OSDs on a single machine
>> is a bit tricky, imho.
>>
>> If that machine fails you would lose 45 OSDs at once, which will put a lot
>> of stress on the recovery of your cluster.
>>
>> You'd also need a lot of RAM to accommodate those 45 OSDs - at least 48GB, I
>> would guess.
>>
>> A last note: if you use an SSD for your journals, make sure that you align
>> your partitions with the page size of the SSD; otherwise you will run into
>> write amplification on the SSD, resulting in a performance loss.
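>>
>> For example (a sketch only - starting partitions on 1MiB boundaries keeps
>> them aligned for practically any SSD page/erase-block size):
>>
>>   parted -a optimal /dev/sdb mklabel gpt
>>   parted -a optimal /dev/sdb mkpart journal0 1MiB 2049MiB
>>   parted -a optimal /dev/sdb mkpart journal1 2049MiB 4097MiB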
>>
>> Wido
>>
>> > One final related question: if I were to use method #1 (which I would
>> > prefer if there is no material performance or other reason to use #2),
>> > then that specification (i.e. "osd journal = /dev/sdb") would have to
>> > refer to the same SSD device name on all other hardware nodes, yes (I
>> > want to use the same ceph.conf file on all servers per the doco
>> > recommendations)? What would happen if, for example, the SSD was on
>> > /dev/sde on a new node added to the cluster? References to
>> > /dev/disk/by-id etc. are clearly no help here, so should a symlink be used
>> > from the get-go? E.g. something like "ln -s /dev/sdb /srv/ssd" on one box
>> > and "ln -s /dev/sde /srv/ssd" on the other, so that in the [osd] section
>> > we could use a line that would find the SSD on all nodes: "osd
>> > journal = /srv/ssd"?
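>> >
>> > (For example, a small per-node udev rule could create such a stable
>> > symlink automatically - a sketch only, with a hypothetical serial number:
>> >
>> >   # /etc/udev/rules.d/60-ceph-journal.rules
>> >   KERNEL=="sd?", ENV{ID_SERIAL}=="INTEL_SSD_XXXXXXXX", SYMLINK+="ceph-journal-ssd"
>> >
>> > and the shared [osd] section could then use
>> > "osd journal = /dev/ceph-journal-ssd".)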
>> >
>> > Many thanks for any advice provided.
>> >
>> > Cheers
>> >
>> > Paul