Re: [ceph-users] Ceph performances
Hello Robert,

OK. I had already tried this but, as you said, performance decreases. I just built the 10.0.0 release and there seem to be some regressions in it: I now get 3.5K IOPS instead of the 21K IOPS I had with 9.2.0 :-/

Thanks.

Rémi

On 2015-11-25 18:54, Robert LeBlanc wrote:
> I'm really surprised that you are getting 100K IOPS from the Intel S3610s. [...]
>
> Something else you can try is increasing the number of disk threads in the OSD so you get more parallelism to the drive. I've heard from someone else that increasing the thread count did not do as much as partitioning up the drive and having multiple OSDs on the same SSD. He found that the gains diminished after 4 OSDs. You can try something like that.
>
> [...]
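When comparing builds like 9.2.0 vs. 10.0.0, a quick client-independent sanity check is a small-block rados bench against the affected pool; a minimal sketch, assuming the pool name from later in this thread:

    ceph --version
    rados bench -p rbd-hot-storage 60 write -b 4096 -t 32

If rados bench shows the same drop, the regression is in the OSD path rather than in the RBD/NFS client stack.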
Re: [ceph-users] Ceph performances
I'm really surprised that you are getting 100K IOPS from the Intel S3610s. We are already in the process of ordering some to test alongside other drives, so I should be able to verify that as well. With the S3700 and S3500, I was only able to get 20K IOPS when running 8 threads (that's about where the performance tapered off). With a single thread, I was getting 5.5K on the S3500 and 4.4K on the S3700. I'll be really happy if we are seeing 100K out of the S3610s; that will make some decisions much easier.

Something else you can try is increasing the number of disk threads in the OSD so you get more parallelism to the drive. I've heard from someone else that increasing the thread count did not do as much as partitioning up the drive and having multiple OSDs on the same SSD. He found that the gains diminished after 4 OSDs. You can try something like that.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Wed, Nov 25, 2015 at 5:40 AM, Rémi BUISSON wrote:
> Hello Robert,
>
> Sorry for the late answer, and thanks for your reply. I updated to Infernalis and applied all your recommendations, but it doesn't change anything, with or without cache tiering :-/
>
> [...]
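For reference, the Hammer-era thread options and one way to carve a single SSD into several OSDs look roughly like this. This is a sketch under assumptions: Robert does not say which thread option he means, the values are illustrative, and /dev/sdX is a placeholder:

    # ceph.conf: OSD thread options available in Hammer (values are examples)
    [osd]
    osd op threads = 8
    osd disk threads = 4
    filestore op threads = 8

    # Splitting one SSD into 4 partitions, one OSD each
    parted -s /dev/sdX mklabel gpt
    parted -s /dev/sdX mkpart osd-1 0% 25%
    parted -s /dev/sdX mkpart osd-2 25% 50%
    parted -s /dev/sdX mkpart osd-3 50% 75%
    parted -s /dev/sdX mkpart osd-4 75% 100%
    for p in 1 2 3 4; do ceph-disk prepare /dev/sdX${p}; done

Per Robert's note, the gains from extra OSDs per SSD reportedly tail off around 4.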
Re: [ceph-users] Ceph performances
Hello Hugo,

Yes, you're right. With Sebastien Han's fio command I can see that my disks can actually handle 100K IOPS, so the theoretical value is then 2 x 2 x 100K / 2 = 200K. I put the journals on the SSDSC2BX016T4R data disks, which is then supposed to double my IOPS, but it's not the case.

Rémi

On 2015-11-08 07:06, Hugo Slabbert wrote:
> [...]
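For reference, the journal test from the Sebastien Han blog post cited in this thread is along these lines (the device name is a placeholder; see the post for the exact invocation). The important parts are direct=1 and sync=1, which mimic the O_DSYNC journal writes that many SSDs handle far worse than plain buffered writes:

    fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
        --numjobs=1 --iodepth=1 --runtime=60 --time_based \
        --group_reporting --name=journal-test

Re-running with increasing numjobs shows how the drive scales with parallel sync writers.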
Re: [ceph-users] Ceph performances
Hello Robert,

Sorry for the late answer, and thanks for your reply. I updated to Infernalis and applied all your recommendations, but it doesn't change anything, with or without cache tiering :-/

I also compared XFS to EXT4 and BTRFS, but it doesn't make a difference.

The fio command from Sebastien Han tells me my disks can actually do 100K IOPS, so it's really frustrating :-S

Rémi

On 2015-11-07 15:59, Robert LeBlanc wrote:
> You most likely did the wrong test to get baseline Ceph IOPS out of your SSDs. [...]
Re: [ceph-users] Ceph performances
On 11/07/2015 09:44 AM, Oliver Dzombic wrote:
> setting inode64 in osd_mount_options_xfs might help a little.

Sorry, inode64 is the default mount option with XFS.

Björn
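A quick way to confirm which options an OSD filesystem is actually mounted with, assuming the standard OSD mount point layout:

    grep /var/lib/ceph/osd /proc/mounts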
Re: [ceph-users] Ceph performances
On Sat 2015-Nov-07 09:24:06 +0100, Rémi BUISSON wrote:
> Hi guys,
>
> I would need your help to figure out performance issues on my ceph cluster. [...]
>
> 2 x OSD servers (cache): 2 x Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, 256 GB RAM, 2 x SSD INTEL SSDSC2BX200G4 (200 GB) for journal, 6 x SSD INTEL SSDSC2BX016T4R (1.4 TB) for data, 2 x 10Gbps network interfaces (bonding)
>
> [...]
>
> I benched each of the SSDs we have, and each can handle 40K IOPS. As my replication factor is 2, the theoretical performance of the cluster is (2 x 6 (OSD cache) x 40K) / 2 = 240K IOPS.

Aside from the other more detailed replies re: tuning, isn't the layout of the caching-tier journals sub-optimal in this scenario? Given the similar model numbers, I'm assuming the performance (throughput, IOPS) of the journal and data disks is comparable, but please correct me if I'm wrong there.

My understanding of Ceph's design (newer to Ceph; please excuse misunderstandings) is that writes pass through the journals, the OSD ACKs a write once it is committed to the journals of the OSDs in that PG (so one other OSD in this case, with a replication factor of 2), and journals are then flushed to the OSDs asynchronously.

Rather than "(2 x 6 (OSD cache) x 40K) / 2 = 240K IOPS", isn't the calculation actually:

  (# hosts) x (# journal disks) x (IOPS per journal disk) / (replication factor)

IOW: (2 x 2 (OSD cache journal SSDs) x 40K) / 2 = 80K.

Yes, putting journals on the same disk as the OSD's data halves your write performance, because data has to flush from the journal partition to the data partition on the same SSD. But in this case, wouldn't it be more optimal to just chuck the 2x SSDSC2BX200G4 per cache host, replace them with 2x more data disks (SSDSC2BX016T4R) for 8 total per cache OSD host, and then go with journals on the same disk?

In that case we're looking at:

  (# hosts) x (# journal disks) x (IOPS per journal disk) / (replication factor) / 2

...where the final division by 2 is the write penalty for sharing journal and data on the same disk. So, in this scenario: 2 x 8 (OSD cache SSDs) x 40K / 2 (replication factor) / 2 = 160K. Yes/no?

In a regular "SSD journals + spinners for data" setup, journals on discrete/partitioned SSDs make sense at, e.g., a 3:1 ratio, as your performance on the SSD (well, throughput; IOPS is another story) will generally be ~3x what your SAS/SATA spinners can do. So one SSD has 3 partitions and serves journals for 3 OSDs backed by spinners; the numbers are matched up so that it can absorb (write) data as fast as it can flush it down to the spinners, and it can pretty much max out the spinners' write capacity. Overload the SSD with too many journals and it will be maxed while the spinners sit waiting/idle.

But in scenarios where the performance of the journal SSDs matches the performance of the backing disks, and with a 3:1 ratio of data disks to journal disks, the data SSDs will still have write capacity to spare while the journal SSD is maxed. Don't we need something with greater throughput/IOPS in the journal than in the data partition in order for discrete journals to be of benefit?

I guess the alternative to swapping the 2x SSDSC2BX200G4 journals in the cache tier for more data disks (SSDSC2BX016T4R) would be to go PCIe/NVMe for the journals in the cache layer, at which point the discrete journals could be a net plus again.

--
Hugo
cell: 604-617-3133
h...@slabnet.com: email, xmpp/jabber
PGP fingerprint (B178313E): CF18 15FA 9FE4 0CD1 2319 1D77 9AB1 0FFD B178 313E (also on Signal)
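Restating Hugo's two estimates side by side (all figures are from his message):

    journal-bound write IOPS ~= hosts x journal disks x IOPS per journal / replication

    discrete journals (current layout):    2 x 2 x 40K / 2      =  80K
    collocated journals on 8 data SSDs:    2 x 8 x 40K / 2 / 2  = 160K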
Re: [ceph-users] Ceph performances
You most likely did the wrong test to get baseline Ceph IOPS out of your SSDs. Ceph is really hard on SSDs: it does direct sync writes, which drives handle very differently, even between models of the same brand. Start with http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/ for your base numbers, and realize that Hammer still can't use all of those IOPS.

I was able to gain 50% in SSD IOPS by: disabling transparent huge pages, LD_PRELOADing jemalloc (uses a little more RAM, but your config should be OK), enabling numad, tuning irqbalance, setting vfs_cache_pressure to 500, greatly increasing the network buffers, and disabling TCP slow start after idle. We are also using EXT4, which I've found is a bit faster, but it was recently reported that someone is having deadlocks/crashes with it. We are having an XFS log issue on one of our clusters that causes an OSD or two to fail every week.

When I tested the same workload in an SSD cache tier, the performance was only 50% of what I was able to achieve on the pure SSD tier (I'm guessing overhead of the cache tier). And this was with the entire test set in the SSD tier, so there was no spindle activity.

Short answer: you will need a lot more SSDs to hit your target with Hammer. Or, if you can wait for Jewel, you may be able to get by with only a little bit more.

Robert LeBlanc

Sent from a mobile device; please excuse any typos.

On Nov 7, 2015 1:24 AM, "Rémi BUISSON" wrote:
> Hi guys,
>
> I would need your help to figure out performance issues on my ceph cluster. [...]
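A sketch of how those host-level tunings are commonly applied. Assumptions: the jemalloc path is Debian jessie's libjemalloc1, the buffer sizes are illustrative, and net.ipv4.tcp_slow_start_after_idle is my reading of "the slow tcp startup"; Robert's actual values are not given:

    # transparent huge pages off
    echo never > /sys/kernel/mm/transparent_hugepage/enabled

    # preload jemalloc in the OSD processes' environment (path varies by distro)
    export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1

    # NUMA placement daemon
    systemctl enable numad && systemctl start numad

    # VFS cache pressure, network buffers, TCP slow start
    sysctl -w vm.vfs_cache_pressure=500
    sysctl -w net.core.rmem_max=16777216
    sysctl -w net.core.wmem_max=16777216
    sysctl -w net.ipv4.tcp_slow_start_after_idle=0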
Re: [ceph-users] Ceph performances
Hello,

I just saw the release announcement of Infernalis. I will test it in the meantime.

Rémi

On 07/11/2015 09:24, Rémi BUISSON wrote:

Hi guys,

I would need your help to figure out performance issues on my Ceph cluster. I've read pretty much every thread on the net concerning this topic, but I didn't manage to get acceptable performance. In my company, we are planning to replace the NAS of our existing virtualization infrastructure with a Ceph cluster, in order to improve the platform's overall performance, scalability and security. The current NAS handles about 50K IOPS.

For this we bought:
2 x NFS servers: 2 x Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, 32 GB RAM, 2 x 10Gbps network interfaces (bonding)
3 x MON servers: 1 x Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz, 16 GB RAM, 2 x 10Gbps network interfaces (bonding)
2 x MDS servers: 2 x Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz, 32 GB RAM, 2 x 10Gbps network interfaces (bonding)
2 x OSD servers (cache): 2 x Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, 256 GB RAM, 2 x SSD INTEL SSDSC2BX200G4 (200 GB) for journal, 6 x SSD INTEL SSDSC2BX016T4R (1.4 TB) for data, 2 x 10Gbps network interfaces (bonding)
4 x OSD servers (storage): 2 x Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz, 256 GB RAM, 4 x SSD TOSHIBA PX02SMF020 (200 GB) for journal, 18 x HGST Ultrastar HUC101818CS4204 (1.8 TB) for data, 2 x 10Gbps network interfaces (bonding)

The total of this is 84 OSDs.

I created two 4096-PG pools, one called rbd-cold-storage and the other rbd-hot-storage. As you may guess, rbd-cold-storage is composed of the 4 OSD servers with platter disks, and rbd-hot-storage is composed of the 2 OSD servers with SSDs. On rbd-cold-storage, I created an RBD device which is mapped on the NFS server.

I benched each of the SSDs we have, and each can handle 40K IOPS. As my replication factor is 2, the theoretical performance of the cluster is (2 x 6 (OSD cache) x 40K) / 2 = 240K IOPS.

I'm currently benching the cluster with the fio tool from one NFS server. Here is my fio job file:

[global]
ioengine=libaio
iodepth=32
runtime=300
direct=1
filename=/dev/rbd0
group_reporting=1
gtod_reduce=1
randrepeat=1
size=4G
numjobs=1

[4k-rand-write]
new_group
bs=4k
rw=randwrite
stonewall

The problem is I can't get more than 15K IOPS for writes. In my monitoring engine, I can see that each of the OSD (cache) SSDs is not doing more than 2.5K IOPS, which corresponds to 6 x 2.5K = 15K IOPS. I don't expect to reach the theoretical value, but reaching 100K IOPS would be perfect.

My cluster is running on Debian jessie with the Ceph Hammer v0.94.5 Debian package (compiled with the --with-jemalloc option; I also tried without). Here is my ceph.conf:

[global]
fsid = 5046f766-670f-4705-adcc-290f434c8a83

# basic settings
mon initial members = a01cepmon001,a01cepmon002,a01cepmon003
mon host = 10.10.69.254,10.10.69.253,10.10.69.252
mon osd allow primary affinity = true

# network settings
public network = 10.10.69.128/25
cluster network = 10.10.69.0/25

# auth settings
auth cluster required = cephx
auth service required = cephx
auth client required = cephx

# default pools settings
osd pool default size = 2
osd pool default min size = 1
osd pool default pg num = 8192
osd pool default pgp num = 8192
osd crush chooseleaf type = 1

# debug settings
debug lockdep = 0/0
debug context = 0/0
debug crush = 0/0
debug buffer = 0/0
debug timer = 0/0
debug journaler = 0/0
debug osd = 0/0
debug optracker = 0/0
debug objclass = 0/0
debug filestore = 0/0
debug journal = 0/0
debug ms = 0/0
debug monc = 0/0
debug tp = 0/0
debug auth = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug perfcounter = 0/0
debug asok = 0/0
debug throttle = 0/0
throttler perf counter = false
osd enable op tracker = false

## OSD settings
[osd]
# OSD FS settings
osd mkfs type = xfs
osd mkfs options xfs = -f -i size=2048
osd mount options xfs = rw,noatime,logbsize=256k,delaylog

# OSD journal settings
osd journal block align = true
osd journal aio = true
osd journal dio = true

# Performance tuning
filestore xattr use omap = true
filestore merge threshold = 40
filestore split multiple = 8
filestore max sync interval = 10
filestore queue max ops = 10
filestore queue max bytes = 1GiB
filestore op threads = 20
filestore journal writeahead = true
filestore fd cache size = 10240
osd op threads = 8

Disabling throttling doesn't change anything. So, after all I have read, I would like to know whether anyone has managed to fix this kind of problem since those months-old threads, and whether you have any ideas or thoughts on how to improve this.

Thanks.

Rémi
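For context, wiring the two pools described above into a writeback cache tier and mapping the RBD device would look roughly like this on Hammer. The pool names are from the post; the image name and the hit_set/target values are placeholders and assumptions:

    # attach the SSD pool as a writeback cache in front of the spinner pool
    ceph osd tier add rbd-cold-storage rbd-hot-storage
    ceph osd tier cache-mode rbd-hot-storage writeback
    ceph osd tier set-overlay rbd-cold-storage rbd-hot-storage

    # a hit set is required for a writeback cache tier (values are examples)
    ceph osd pool set rbd-hot-storage hit_set_type bloom
    ceph osd pool set rbd-hot-storage hit_set_count 1
    ceph osd pool set rbd-hot-storage hit_set_period 3600
    ceph osd pool set rbd-hot-storage target_max_bytes 1000000000000

    # create and map the RBD device on the NFS server ('nfs-vol' is a placeholder)
    rbd create rbd-cold-storage/nfs-vol --size 10240000
    rbd map rbd-cold-storage/nfs-vol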
Re: [ceph-users] Ceph performances
Hi Rémi,

Setting inode64 in osd_mount_options_xfs might help a little.

--
Mit freundlichen Gruessen / Best regards

Oliver Dzombic
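Concretely, that suggestion amounts to adding inode64 to the mount options already shown in Rémi's ceph.conf; note Björn's reply above that inode64 is the default with XFS on recent kernels, so this may be a no-op:

    [osd]
    osd mount options xfs = rw,noatime,inode64,logbsize=256k,delaylog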