I have an SSD pool for testing (only 8 drives). With 1 SSD for the journal
and 1 SSD for data I get > 300 MB/s write. When I instead have all 8 disks
house their own journals alongside the data, I get < 184 MB/s write.


On Mon, Apr 20, 2015 at 10:16 AM, Mark Nelson <mnel...@redhat.com> wrote:

> The big question is how fast these drives can do O_DSYNC writes.  The
> basic gist is that for every write to the journal, an ATA_CMD_FLUSH
> call is made to ensure that the device (or potentially the controller) knows
> that this data really needs to be stored safely before the flush is
> acknowledged.  How this gets handled is really important.
>
> 1) If devices have limited or no power loss protection, they need to flush
> the contents of any caches to non-volatile memory.  How quickly this can
> happen depends on a lot of factors, but even on SSDs it may be slow enough to
> greatly limit performance relative to how quickly writes could proceed if
> uninterrupted.
>
> * It's very important to note that some devices that lack power loss
> protection may simply *ignore* ATA_CMD_FLUSH and return immediately so as
> to appear fast, even though this means that data may become corrupt.  Be
> very careful putting journals on devices that do this!
>
> ** Some devices that have claimed to have power loss protection don't
> actually have capacitors big enough to flush data from cache.  This has
> led to huge amounts of confusion and you have to be very careful.  For a
> specific example see the section titled "The Truth About Micron's
> Power-Loss Protection" here:
> http://www.anandtech.com/show/8528/micron-m600-128gb-256gb-1tb-ssd-review-nda-placeholder
>
> 2) Devices that feature proper power loss protection, such that caches can
> be flushed in the event of power failure, can safely ignore ATA_CMD_FLUSH
> and return immediately.  This greatly improves the performance of Ceph
> journal writes and usually allows the journal to perform at or near the
> theoretical sequential write performance of the device.
>
> 3) Some controllers may be able to intercept these calls and return
> immediately on ATA_CMD_FLUSH if they have an on-board BBU that functions in
> the same way as PLP on the drives would.  Unfortunately on many controllers
> this is tied to enabling writeback cache and running the drives in some
> kind of RAID mode (single-disk RAID0 LUNs are often used for Ceph OSDs with
> this kind of setup).  In some cases the controller itself can become a
> bottleneck with SSDs, so it's important to test this out and make sure it
> works well in practice.
>
> Regarding the 840 EVO, based on user reports it sounds like it does not have
> PLP and does flush data on ATA_CMD_FLUSH, resulting in quite a bit slower
> performance when doing O_DSYNC writes.  Unfortunately we don't have any in
> the lab we can test, but this is likely why you are seeing slower write
> performance when journals are placed on these SSDs.
>
> Mark
>
> On 04/20/2015 09:48 AM, J-P Methot wrote:
>
>> My journals are on-disk, each disk being an SSD. The reason I didn't go
>> with dedicated drives for journals is that when designing the setup, I
>> was told that having dedicated journal SSDs on a full-SSD setup would
>> not give me any performance increase.
>>
>> So that makes the journal disk to data disk ratio 1:1.
>>
>> The replication size is 3, yes. The pools are replicated.
>>
>> On 4/20/2015 10:43 AM, Barclay Jameson wrote:
>>
>>> Are your journals on separate disks? What is your ratio of journal
>>> disks to data disks? Are you doing replication size 3?
>>>
>>> On Mon, Apr 20, 2015 at 9:30 AM, J-P Methot <jpmet...@gtcomm.net> wrote:
>>>
>>>     Hi,
>>>
>>>     This is similar to another thread running right now, but since our
>>>     current setup is completely different from the one described in
>>>     the other thread, I thought it may be better to start a new one.
>>>
>>>     We are running Ceph Firefly 0.80.8 (soon to be upgraded to
>>>     0.80.9). We have 6 OSD hosts with 16 OSDs each (so a total of 96
>>>     OSDs). Each OSD is a Samsung SSD 840 EVO on which I can reach
>>>     write speeds of roughly 400 MB/sec, plugged in as JBOD on a
>>>     controller that can theoretically transfer at 6 Gb/sec. All of that
>>>     is linked to OpenStack compute nodes over two bonded 10 Gbps links
>>>     (so a max transfer rate of 20 Gbps).
>>>
>>>     When I run rados bench from the compute nodes, I reach the network
>>>     cap in read speed. However, write speeds are vastly inferior,
>>>     reaching about 920 MB/sec. If I have 4 compute nodes running the
>>>     write benchmark at the same time, I see the number plummet to
>>>     350 MB/sec. For our planned usage, we find this rather slow,
>>>     considering we will run a large number of virtual machines on it.
>>>
>>>     Of course, the first thing to do would be to move the journals to
>>>     faster drives. However, these are SSDs we're talking about; we
>>>     don't really have access to faster drives, so I must find another way
>>>     to get better write speeds. Thus, I am looking for suggestions on
>>>     how to make this faster.
>>>
>>>     I have also thought of some options myself:
>>>     - Upgrading to the latest stable Hammer version (would that really
>>>     give me a big performance increase?)
>>>     - Crush map modifications (this is a long shot, but I'm still
>>>     using the default crush map; maybe there's a change there I could
>>>     make to improve performance)
>>>
>>>     Any suggestions as to anything else I could tweak would be greatly
>>>     appreciated.
>>>
>>>     For reference, here's part of my ceph.conf:
>>>
>>>     [global]
>>>     auth_service_required = cephx
>>>     filestore_xattr_use_omap = true
>>>     auth_client_required = cephx
>>>     auth_cluster_required = cephx
>>>     osd pool default size = 3
>>>
>>>
>>>     osd pg bits = 12
>>>     osd pgp bits = 12
>>>     osd pool default pg num = 800
>>>     osd pool default pgp num = 800
>>>
>>>     [client]
>>>     rbd cache = true
>>>     rbd cache writethrough until flush = true
>>>
>>>     [osd]
>>>     filestore_fd_cache_size = 1000000
>>>     filestore_omap_header_cache_size = 1000000
>>>     filestore_fd_cache_random = true
>>>     filestore_queue_max_ops = 5000
>>>     journal_queue_max_ops = 1000000
>>>     max_open_files = 1000000
>>>     osd journal size = 10000
>>>
>>>     --
>>>     ======================
>>>     Jean-Philippe Méthot
>>>     Administrateur système / System administrator
>>>     GloboTech Communications
>>>     Phone: 1-514-907-0050
>>>     Toll Free: 1-(888)-GTCOMM1
>>>     Fax: 1-(514)-907-0750
>>>     jpmet...@gtcomm.net
>>>     http://www.gtcomm.net
>>>
>>
>> --
>> ======================
>> Jean-Philippe Méthot
>> Administrateur système / System administrator
>> GloboTech Communications
>> Phone: 1-514-907-0050
>> Toll Free: 1-(888)-GTCOMM1
>> Fax: 1-(514)-907-0750
>> jpmet...@gtcomm.net
>> http://www.gtcomm.net
>>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
