Re: [ceph-users] slow performance: sanity check
On 04/06/2017 01:54 PM, Adam Carheden wrote:
> 60-80MB/s for what sort of setup? Is that 1GbE rather than 10GbE?

60-80MB/s per disk, assuming fairly standard 7200RPM disks, before any replication takes place, and assuming journals are on SSDs with fast O_DSYNC write performance. Any network limitations may decrease that further. The gist of it is that you take a fairly standard ~140-150MB/s per disk and assume you get half of that due to metadata writes, flushes, inode seeks, etc.

> I consistently get 80-90MB/s bandwidth as measured by `rados bench -p rbd
> 10 write` run from a ceph node on a cluster with:
>
> * 3 nodes
> * 4 OSDs/node, 600GB 15kRPM SAS disks
> * 1GB disk controller write cache shared by all disks in each node
> * No SSDs
> * 2x1GbE LACP bond for redundancy, no jumbo frames
> * 512 PGs for a cluster of 12 OSDs
> * All disks in one pool of size=3, min_size=2
>
> IOzone run on a VM using an RBD as its HD confirms that this setup maxes
> out at just under 100 MB/s in best-case scenarios, so I assumed the 1Gb
> network was the bottleneck.

The network is a good guess. With three 1GbE nodes and 3x replication you aren't going to do any better than ~110MB/s. You are a little below that, but it's in the right ballpark.

> I'm in the process of planning a hardware purchase for a larger cluster:
> more nodes, more drives, SSD journals and 10GbE. I'm assuming I'll get
> better performance.

You should, but it can be tricky to balance everything out. Figure that 80MB/s per disk (with 7200RPM disks and SSD journals) is the typical upper limit of what to expect with filestore on XFS, and any additional bottlenecks may bring that down. Some folks have started playing with things like Intel's CAS software to potentially improve those numbers through SSD caching, but it's not a typical setup.

> What's the upper bound on Ceph performance for large sequential writes
> from a single client with all the recommended bells and whistles (SSD
> journals, 10GbE)? I assume it depends on both the total number of OSDs
> and possibly on OSDs per node, if one had enough to saturate the network,
> correct?

Yep, and that's sort of tough to answer. The fastest single-client performance I've seen was a little over 4GB/s doing 4MB writes to an RBD volume on 16 NVMe OSDs over 40GbE (i.e. maxing out the client's link). If I had enough switch ports to bond, I could probably have gotten closer to 8GB/s, since the cluster was capable of it. Having said that, there are a *lot* of ways to hurt performance. Red Hat has a reference architecture team that tests various hardware and might be able to give you a better idea of what works well these days.

Mark
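As an aside, a quick way to gauge whether a journal SSD has the fast O_DSYNC writes mentioned above is a small synchronous dd run. This is only a rough sketch; /dev/sdX is a placeholder, and the command writes to the raw device, so run it only against a blank test disk:

# Hedged sketch: small synchronous writes, the pattern a filestore journal
# sees. WARNING: this writes to the raw device -- use a blank/test SSD only.
dd if=/dev/zero of=/dev/sdX bs=4k count=100000 oflag=direct,dsync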
Re: [ceph-users] slow performance: sanity check
60-80MB/s for what sort of setup? Is that 1GbE rather than 10GbE?

I consistently get 80-90MB/s bandwidth as measured by `rados bench -p rbd 10 write` run from a ceph node on a cluster with:

* 3 nodes
* 4 OSDs/node, 600GB 15kRPM SAS disks
* 1GB disk controller write cache shared by all disks in each node
* No SSDs
* 2x1GbE LACP bond for redundancy, no jumbo frames
* 512 PGs for a cluster of 12 OSDs
* All disks in one pool of size=3, min_size=2

IOzone run on a VM using an RBD as its HD confirms that this setup maxes out at just under 100 MB/s in best-case scenarios, so I assumed the 1Gb network was the bottleneck.

I'm in the process of planning a hardware purchase for a larger cluster: more nodes, more drives, SSD journals and 10GbE. I'm assuming I'll get better performance.

What's the upper bound on Ceph performance for large sequential writes from a single client with all the recommended bells and whistles (SSD journals, 10GbE)? I assume it depends on both the total number of OSDs and possibly on OSDs per node, if one had enough to saturate the network, correct?

--
Adam Carheden

On 04/06/2017 12:29 PM, Mark Nelson wrote:
> With filestore on XFS using SSD journals that have good O_DSYNC write
> performance, we typically see between 60-80MB/s per disk before
> replication for large object writes. This assumes there are no other
> bottlenecks or background activity, though (PG splitting, recovery,
> network issues, etc.). Probably the best-case scenario would be large
> writes to an RBD volume with 4MB objects and enough PGs in the pool that
> splits never need to happen.
>
> Having said that, on setups where some of the drives are slow, the
> network is misconfigured, there are too few PGs, there are too many
> drives on one controller, or there are other issues, 25-30MB/s per disk
> is certainly possible.
>
> Mark
>
> On 04/06/2017 10:05 AM, Stanislav Kopp wrote:
>> I've reduced the OSDs to 12 and moved the journals to SSD drives, and
>> now I get a "boost" with writes to ~33-35MB/s. Is that the maximum
>> without full-SSD pools?
>>
>> Best,
>> Stan
>>
>> 2017-04-06 9:34 GMT+02:00 Stanislav Kopp:
>>> [...]
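For what it's worth, a raw network test outside Ceph can confirm whether a single 1GbE link (~110MB/s) really is the ceiling. A rough sketch, with placeholder hostnames:

# On one ceph node:
iperf -s
# On the client (add -P 2 to see whether the LACP bond helps multiple
# parallel streams; a single TCP stream normally rides one link only):
iperf -c ceph01 -t 30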
Re: [ceph-users] slow performance: sanity check
With filestore on XFS using SSD journals that have good O_DSYNC write performance, we typically see between 60-80MB/s per disk before replication for large object writes. This assumes there are no other bottlenecks or background activity, though (PG splitting, recovery, network issues, etc.). Probably the best-case scenario would be large writes to an RBD volume with 4MB objects and enough PGs in the pool that splits never need to happen.

Having said that, on setups where some of the drives are slow, the network is misconfigured, there are too few PGs, there are too many drives on one controller, or there are other issues, 25-30MB/s per disk is certainly possible.

Mark

On 04/06/2017 10:05 AM, Stanislav Kopp wrote:
> I've reduced the OSDs to 12 and moved the journals to SSD drives, and now
> I get a "boost" with writes to ~33-35MB/s. Is that the maximum without
> full-SSD pools?
>
> Best,
> Stan
>
> 2017-04-06 9:34 GMT+02:00 Stanislav Kopp:
>> [...]
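To check for the background activity mentioned above while a benchmark runs, something along these lines works (a sketch; the OSD id is a placeholder, and the second command must run on the host where that OSD lives):

# Watch for recovery/backfill or other background work during a benchmark:
ceph -w
# Inspect the filestore split settings on one OSD via its admin socket:
ceph daemon osd.0 config show | grep filestore_split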
Re: [ceph-users] slow performance: sanity check
Also make sure your PGs per pool and across the entire cluster are correct: you want 50-100 PGs per OSD in total, otherwise performance can be impacted. Also, if the cluster is new, it might take a little while to rebalance and become 100% available, and speed can be affected during that time too. Those are a couple of issues I ran into just recently, so I thought I'd share them with you.

On 2017-04-06 12:40 AM, Piotr Dałek wrote:
> On 04/06/2017 09:34 AM, Stanislav Kopp wrote:
>> Hello,
>>
>> I'm evaluating a ceph cluster to see if we can use it for our
>> virtualization solution (proxmox). I'm using 3 nodes running Ubuntu
>> 16.04 with stock ceph (10.2.6); every OSD uses a separate 8 TB spinning
>> drive (XFS), the MONs are installed on the same nodes, and all nodes
>> are connected via a 10G switch.
>>
>> The problem is, on the client I get only ~25-30 MB/s with seq. writes
>> (dd with "oflag=direct"). [..]
>
> The 8TB size suggests these are some kind of "archive" drives (SMR
> drives). Is that correct? If so, you may want to use non-SMR drives,
> because Ceph is not optimized for those.
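For reference, the usual rule of thumb works out like this (a sketch using the 42-OSD cluster from this thread; check your own numbers):

# total PGs ~= (OSDs * 100) / replica size, rounded up to a power of two.
# For 42 OSDs at size=3: 42 * 100 / 3 = 1400 -> 2048.
# Verify the actual per-OSD PG spread (PGS column):
ceph osd df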
Re: [ceph-users] slow performance: sanity check
I've reduced the OSDs to 12 and moved the journals to SSD drives, and now I get a "boost" with writes to ~33-35MB/s. Is that the maximum without full-SSD pools?

Best,
Stan

2017-04-06 9:34 GMT+02:00 Stanislav Kopp:
> Hello,
>
> I'm evaluating a ceph cluster to see if we can use it for our
> virtualization solution (proxmox). I'm using 3 nodes running Ubuntu
> 16.04 with stock ceph (10.2.6); every OSD uses a separate 8 TB spinning
> drive (XFS), the MONs are installed on the same nodes, and all nodes are
> connected via a 10G switch.
>
> The problem is, on the client I get only ~25-30 MB/s with seq. writes (dd
> with "oflag=direct"). Proxmox uses Firefly, which is old, I know. But I
> get the same performance on my desktop, running the same version as the
> ceph nodes and using an rbd mount, and iperf shows full speed (1Gbit or
> 10Gbit, up to the client).
>
> I know that this setup is not optimal and that for production I will use
> separate MON nodes and SSDs for the OSDs, but I was wondering whether
> this performance is still normal. This is my cluster status:
>
>     cluster 3ea55c7e-5829-46d0-b83a-92c6798bde55
>      health HEALTH_OK
>      monmap e5: 3 mons at {ceph01=10.1.8.31:6789/0,ceph02=10.1.8.32:6789/0,ceph03=10.1.8.33:6789/0}
>             election epoch 60, quorum 0,1,2 ceph01,ceph02,ceph03
>      osdmap e570: 42 osds: 42 up, 42 in
>             flags sortbitwise,require_jewel_osds
>       pgmap v14784: 1024 pgs, 1 pools, 23964 MB data, 6047 objects
>             74743 MB used, 305 TB / 305 TB avail
>                 1024 active+clean
>
> btw, the bench on the nodes themselves looks good as far as I can see:
>
> ceph01:~# rados bench -p rbd 10 write
>
> Total time run:         10.159667
> Total writes made:      1018
> Write size:             4194304
> Object size:            4194304
> Bandwidth (MB/sec):     400.801
> Stddev Bandwidth:       38.2018
> Max bandwidth (MB/sec): 472
> Min bandwidth (MB/sec): 344
> Average IOPS:           100
> Stddev IOPS:            9
> Max IOPS:               118
> Min IOPS:               86
> Average Latency(s):     0.159395
> Stddev Latency(s):      0.110994
> Max latency(s):         1.1069
> Min latency(s):         0.0432668
>
> Thanks,
> Stan
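One quick way to verify the journals really landed on the SSDs (a sketch; the OSD id and path assume the stock filestore layout, adjust as needed):

# For a filestore OSD, the journal is a symlink in its data directory;
# it should point at an SSD partition after the move:
ls -l /var/lib/ceph/osd/ceph-0/journal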
Re: [ceph-users] slow performance: sanity check
On 04/06/2017 09:34 AM, Stanislav Kopp wrote:
> Hello,
>
> I'm evaluating a ceph cluster to see if we can use it for our
> virtualization solution (proxmox). I'm using 3 nodes running Ubuntu
> 16.04 with stock ceph (10.2.6); every OSD uses a separate 8 TB spinning
> drive (XFS), the MONs are installed on the same nodes, and all nodes are
> connected via a 10G switch.
>
> The problem is, on the client I get only ~25-30 MB/s with seq. writes (dd
> with "oflag=direct"). [..]

The 8TB size suggests these are some kind of "archive" drives (SMR drives). Is that correct? If so, you may want to use non-SMR drives, because Ceph is not optimized for those.

--
Piotr Dałek
piotr.da...@corp.ovh.com
https://www.ovh.com/us/
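A quick way to check what the drives actually are (a sketch; the device name is a placeholder):

# Print the drive's model/family, then look up whether it belongs to an
# SMR "archive" series:
smartctl -i /dev/sda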
[ceph-users] slow performance: sanity check
Hello,

I'm evaluating a ceph cluster to see if we can use it for our virtualization solution (proxmox). I'm using 3 nodes running Ubuntu 16.04 with stock ceph (10.2.6); every OSD uses a separate 8 TB spinning drive (XFS), the MONs are installed on the same nodes, and all nodes are connected via a 10G switch.

The problem is, on the client I get only ~25-30 MB/s with seq. writes (dd with "oflag=direct"). Proxmox uses Firefly, which is old, I know. But I get the same performance on my desktop, running the same version as the ceph nodes and using an rbd mount, and iperf shows full speed (1Gbit or 10Gbit, up to the client).

I know that this setup is not optimal and that for production I will use separate MON nodes and SSDs for the OSDs, but I was wondering whether this performance is still normal. This is my cluster status:

    cluster 3ea55c7e-5829-46d0-b83a-92c6798bde55
     health HEALTH_OK
     monmap e5: 3 mons at {ceph01=10.1.8.31:6789/0,ceph02=10.1.8.32:6789/0,ceph03=10.1.8.33:6789/0}
            election epoch 60, quorum 0,1,2 ceph01,ceph02,ceph03
     osdmap e570: 42 osds: 42 up, 42 in
            flags sortbitwise,require_jewel_osds
      pgmap v14784: 1024 pgs, 1 pools, 23964 MB data, 6047 objects
            74743 MB used, 305 TB / 305 TB avail
                1024 active+clean

btw, the bench on the nodes themselves looks good as far as I can see:

ceph01:~# rados bench -p rbd 10 write

Total time run:         10.159667
Total writes made:      1018
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     400.801
Stddev Bandwidth:       38.2018
Max bandwidth (MB/sec): 472
Min bandwidth (MB/sec): 344
Average IOPS:           100
Stddev IOPS:            9
Max IOPS:               118
Min IOPS:               86
Average Latency(s):     0.159395
Stddev Latency(s):      0.110994
Max latency(s):         1.1069
Min latency(s):         0.0432668

Thanks,
Stan
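For reference, a client-side sequential write test of the kind described above might look like this (a sketch; the mount point and sizes are placeholders):

# Direct-I/O sequential write against an rbd-backed filesystem. Note that
# with oflag=direct the block size matters a lot: dd's default 512-byte
# blocks would be far slower than 4M blocks.
dd if=/dev/zero of=/mnt/rbd/testfile bs=4M count=256 oflag=direct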