I have an SSD pool for testing (only 8 drives). When I pair the SSDs so that 1 SSD holds the journal and 1 SSD holds the data, I get > 300 MB/s write. When I change all 8 disks to house their own journal, I get < 184 MB/s write.
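For reference, here's roughly the kind of client-side write test I mean, as a minimal python-rados sketch (the pool name 'ssdtest', the 4 MB object size and the object count are just placeholders from my test setup, not anything standard; rados bench gives you the same information with more options):

#!/usr/bin/env python
# Rough client-side write throughput check against a test pool.
# Assumes python-rados is installed and the pool named below exists;
# 'ssdtest', the 4 MB object size and the object count are just my
# test values, not anything standard.
import time
import rados

POOL = 'ssdtest'            # placeholder test pool name
OBJ_SIZE = 4 * 1024 * 1024  # 4 MB objects, like the rados bench default
COUNT = 256                 # 1 GB total

cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx(POOL)

payload = b'\x00' * OBJ_SIZE
start = time.time()
for i in range(COUNT):
    # write_full replaces the whole object in one shot
    ioctx.write_full('bench_obj_%d' % i, payload)
elapsed = time.time() - start

print('wrote %d MB in %.1f s -> %.1f MB/s' %
      (COUNT * OBJ_SIZE // (1024 * 1024), elapsed,
       COUNT * OBJ_SIZE / (1024.0 * 1024.0) / elapsed))

ioctx.close()
cluster.shutdown()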
On Mon, Apr 20, 2015 at 10:16 AM, Mark Nelson <mnel...@redhat.com> wrote:
> The big question is how fast these drives can do O_DSYNC writes. The
> basic gist of this is that for every write to the journal, an
> ATA_CMD_FLUSH call is made to ensure that the device (or potentially the
> controller) knows that this data really needs to be stored safely before
> the flush is acknowledged. How this gets handled is really important.
>
> 1) If devices have limited or no power loss protection, they need to
> flush the contents of any caches to non-volatile memory. How quickly this
> can happen depends on a lot of factors, but even on SSDs it may be slow
> enough to limit performance greatly relative to how quickly writes can
> proceed if uninterrupted.
>
> * It's very important to note that some devices that lack power loss
> protection may simply *ignore* ATA_CMD_FLUSH and return immediately so as
> to appear fast, even though this means that data may become corrupt. Be
> very careful putting journals on devices that do this!
>
> ** Some devices that have claimed to have power loss protection don't
> actually have capacitors big enough to flush data from cache. This has
> led to huge amounts of confusion and you have to be very careful. For a
> specific example, see the section titled "The Truth About Micron's
> Power-Loss Protection" here:
> http://www.anandtech.com/show/8528/micron-m600-128gb-256gb-1tb-ssd-review-nda-placeholder
>
> 2) Devices that feature proper power loss protection, such that caches
> can be flushed in the event of power failure, can safely ignore
> ATA_CMD_FLUSH and return immediately when it is called. This greatly
> improves the performance of Ceph journal writes and usually allows the
> journal to perform at or near the theoretical sequential write
> performance of the device.
>
> 3) Some controllers may be able to intercept these calls and return
> immediately on ATA_CMD_FLUSH if they have an on-board BBU that functions
> in the same way as PLP on the drives would. Unfortunately, on many
> controllers this is tied to enabling writeback cache and running the
> drives in some kind of RAID mode (single-disk RAID0 LUNs are often used
> for Ceph OSDs with this kind of setup). In some cases the controller
> itself can become a bottleneck with SSDs, so it's important to test this
> out and make sure it works well in practice.
>
> Regarding the 840 EVO, it sounds like, based on user reports, that it
> does not have PLP and does flush data on ATA_CMD_FLUSH, resulting in
> quite a bit slower performance when doing O_DSYNC writes. Unfortunately
> we don't have any in the lab we can test, but this is likely why you are
> seeing slower write performance when journals are placed on the SSDs.
>
> Mark
>
> On 04/20/2015 09:48 AM, J-P Methot wrote:
>
>> My journals are on-disk, each disk being an SSD. The reason I didn't go
>> with dedicated drives for journals is that when designing the setup, I
>> was told that having dedicated journal SSDs on a full-SSD setup would
>> not give me a performance increase.
>>
>> So that makes the journal disk to data disk ratio 1:1.
>>
>> The replication size is 3, yes. The pools are replicated.
>>
>> On 4/20/2015 10:43 AM, Barclay Jameson wrote:
>>
>>> Are your journals on separate disks? What is your ratio of journal
>>> disks to data disks? Are you doing replication size 3?
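If you want to see the effect Mark describes on your own drives, fio is the usual tool, but a minimal python sketch along these lines also shows the cost of O_DSYNC writes (/dev/sdX is a placeholder; point it at a scratch device or file you can safely overwrite, since its contents will be destroyed):

#!/usr/bin/env python
# Minimal sketch: time 4 KB writes on a descriptor opened with O_DSYNC,
# so every write must be durable before it returns (this is what ends up
# hitting ATA_CMD_FLUSH on drives without power loss protection).
# DEVICE is a placeholder -- use a scratch device or file only.
import os
import time

DEVICE = '/dev/sdX'        # placeholder -- data on it WILL be destroyed
BLOCK = b'\x00' * 4096
COUNT = 1000

fd = os.open(DEVICE, os.O_WRONLY | os.O_DSYNC)
start = time.time()
for _ in range(COUNT):
    os.write(fd, BLOCK)
elapsed = time.time() - start
os.close(fd)

print('%d synced 4 KB writes in %.2f s -> %.0f IOPS, %.2f MB/s' %
      (COUNT, elapsed, COUNT / elapsed,
       COUNT * 4096 / (1024.0 * 1024.0) / elapsed))

Drives with real power loss protection usually stay fast here, while consumer drives that actually honour the flush tend to drop sharply, which matches what Mark describes.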
>>>
>>> On Mon, Apr 20, 2015 at 9:30 AM, J-P Methot <jpmet...@gtcomm.net> wrote:
>>>
>>> Hi,
>>>
>>> This is similar to another thread running right now, but since our
>>> current setup is completely different from the one described in the
>>> other thread, I thought it might be better to start a new one.
>>>
>>> We are running Ceph Firefly 0.80.8 (soon to be upgraded to 0.80.9). We
>>> have 6 OSD hosts with 16 OSDs each (so a total of 96 OSDs). Each OSD is
>>> a Samsung SSD 840 EVO on which I can reach write speeds of roughly
>>> 400 MB/sec, plugged in as JBOD on a controller that can theoretically
>>> transfer at 6 Gb/sec. All of that is linked to OpenStack compute nodes
>>> over two bonded 10 Gbps links (so a maximum transfer rate of 20 Gbps).
>>>
>>> When I run rados bench from the compute nodes, I reach the network cap
>>> in read speed. However, write speeds are vastly inferior, reaching
>>> about 920 MB/sec. If I have 4 compute nodes running the write benchmark
>>> at the same time, I see the number plummet to 350 MB/sec. For our
>>> planned usage, we find this rather slow, considering we will run a high
>>> number of virtual machines on it.
>>>
>>> Of course, the first thing to do would be to move the journals to
>>> faster drives. However, these are SSDs we're talking about; we don't
>>> really have access to faster drives, so I must find another way to get
>>> better write speeds. Thus, I am looking for suggestions as to how to
>>> make it faster.
>>>
>>> I have also thought of some options myself, like:
>>> - Upgrading to the latest stable Hammer version (would that really give
>>>   me a big performance increase?)
>>> - CRUSH map modifications (this is a long shot, but I'm still using the
>>>   default CRUSH map; maybe there's a change I could make there to
>>>   improve performance)
>>>
>>> Any suggestions as to anything else I can tweak would be strongly
>>> appreciated.
>>>
>>> For reference, here's part of my ceph.conf:
>>>
>>> [global]
>>> auth_service_required = cephx
>>> filestore_xattr_use_omap = true
>>> auth_client_required = cephx
>>> auth_cluster_required = cephx
>>> osd pool default size = 3
>>>
>>> osd pg bits = 12
>>> osd pgp bits = 12
>>> osd pool default pg num = 800
>>> osd pool default pgp num = 800
>>>
>>> [client]
>>> rbd cache = true
>>> rbd cache writethrough until flush = true
>>>
>>> [osd]
>>> filestore_fd_cache_size = 1000000
>>> filestore_omap_header_cache_size = 1000000
>>> filestore_fd_cache_random = true
>>> filestore_queue_max_ops = 5000
>>> journal_queue_max_ops = 1000000
>>> max_open_files = 1000000
>>> osd journal size = 10000
>>>
>>> --
>>> ======================
>>> Jean-Philippe Méthot
>>> Administrateur système / System administrator
>>> GloboTech Communications
>>> Phone: 1-514-907-0050
>>> Toll Free: 1-(888)-GTCOMM1
>>> Fax: 1-(514)-907-0750
>>> jpmet...@gtcomm.net
>>> http://www.gtcomm.net
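One last back-of-the-envelope note on the numbers above: with osd pool default size = 3 and the filestore journal co-located on the same SSD as the data, every client byte gets written roughly six times across the cluster (3 replicas, each to journal plus data), before even counting the O_DSYNC behaviour Mark describes. A rough sanity check in python, using the figures from this thread:

#!/usr/bin/env python
# Back-of-the-envelope write amplification for this setup, using the
# numbers quoted in the thread (96 OSDs, ~400 MB/s per SSD, size=3,
# journal co-located with the data on each OSD).
osds = 96
per_ssd_mb_s = 400.0    # raw sequential write per 840 EVO (from the thread)
replication = 3         # osd pool default size = 3
journal_factor = 2      # filestore writes each byte to journal + data

raw_total = osds * per_ssd_mb_s
ceiling = raw_total / (replication * journal_factor)

print('aggregate raw write bandwidth: %.0f MB/s' % raw_total)
print('rough client-visible ceiling : %.0f MB/s' % ceiling)
# -> about 6400 MB/s, so the observed ~920 MB/s is well below what the raw
#    SSD bandwidth allows, which again points at the flush handling.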
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com