Re: [ceph-users] how to find the lazy egg - poor performance - interesting observations [klartext]
On Wed, Nov 13, 2019 at 10:13 AM Stefan Bauer wrote:
>
> Paul,
>
> i would like to take the chance to thank you and ask if it could not be, that
> subop_latency reports a high value (is that avgtime in seconds reported?)
> because the communication partner is slow in writing/committing?

no

Paul

> Don't want to follow the red herring :/
>
> We have the following times on our 11 OSDs. Attached image.
>
> -Original Message-
> From: Paul Emmerich
> Sent: Thursday, November 7, 2019 19:04
> To: Stefan Bauer
> CC: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] how to find the lazy egg - poor performance -
> interesting observations [klartext]
>
> You can have a look at subop_latency in "ceph daemon osd.XX perf
> dump", it tells you how long an OSD took to reply to another OSD.
> That's usually a good indicator if an OSD is dragging down others.
> Or have a look at "ceph osd perf dump", which is basically disk
> latency; simpler to acquire but with less information.
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
> On Thu, Nov 7, 2019 at 6:55 PM Stefan Bauer wrote:
> >
> > Hi folks,
> >
> > we are running a 3 node proxmox-cluster with - of course - ceph :)
> >
> > ceph version 12.2.12 (39cfebf25a7011204a9876d2950e4b28aba66d11) luminous
> > (stable)
> >
> > 10G network. iperf reports almost 10G between all nodes.
> >
> > We are using mixed standard SSDs (Crucial / Samsung). We are aware that
> > these disks cannot deliver high IOPS or great throughput, but we have
> > several of these clusters and this one is showing very poor performance.
> >
> > NOW the strange fact:
> >
> > When a specific node is rebooting, the throughput is acceptable.
> >
> > But when the specific node is back, the results dropped by almost 100%.
> >
> > 2 NODES (one rebooting)
> >
> > # rados bench -p scbench 10 write --no-cleanup
> > hints = 1
> > Maintaining 16 concurrent writes of 4194304 bytes to objects of size
> > 4194304 for up to 10 seconds or 0 objects
> > Object prefix: benchmark_data_pve3_1767693
> >   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
> >     0       0         0         0         0         0            -          0
> >     1      16        55        39   155.992       156    0.0445665   0.257988
> >     2      16       110        94    187.98       220     0.087097   0.291173
> >     3      16       156       140   186.645       184     0.462171   0.286895
> >     4      16       184       168    167.98       112    0.0235336   0.358085
> >     5      16       210       194   155.181       104     0.112401   0.347883
> >     6      16       252       236   157.314       168     0.134099   0.382159
> >     7      16       287       271   154.838       140    0.0264864    0.40092
> >     8      16       329       313   156.481       168    0.0609964   0.394753
> >     9      16       364       348   154.649       140     0.244309   0.392331
> >    10      16       416       400   159.981       208     0.277489   0.387424
> > Total time run:         10.335496
> > Total writes made:      417
> > Write size:             4194304
> > Object size:            4194304
> > Bandwidth (MB/sec):     161.386
> > Stddev Bandwidth:       37.8065
> > Max bandwidth (MB/sec): 220
> > Min bandwidth (MB/sec): 104
> > Average IOPS:           40
> > Stddev IOPS:            9
> > Max IOPS:               55
> > Min IOPS:               26
> > Average Latency(s):     0.396434
> > Stddev Latency(s):      0.428527
> > Max latency(s):         1.86968
> > Min latency(s):         0.020558
> >
> > THIRD NODE ONLINE:
> >
> > root@pve3:/# rados bench -p scbench 10 write --no-cleanup
> > hints = 1
> > Maintaining 16 concurrent writes of 4194304 bytes to objects of size
> > 4194304 for up to 10 seconds or 0 objects
> > Object prefix: benchmark_data_pve3_1771977
> >   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
> >     0       0         0         0         0         0            -          0
Re: [ceph-users] how to find the lazy egg - poor performance - interesting observations [klartext]
Paul,

i would like to take the chance to thank you and ask if it could not be, that subop_latency reports a high value (is that avgtime in seconds reported?)

    "subop_latency": {
        "avgcount": 7782673,
        "sum": 38852.140794738,
        "avgtime": 0.004992133

because the communication partner is slow in writing/committing?

Don't want to follow the red herring :/

We have the following times on our 11 OSDs. Attached image.

-Original Message-
From: Paul Emmerich
Sent: Thursday, November 7, 2019 19:04
To: Stefan Bauer
CC: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] how to find the lazy egg - poor performance - interesting observations [klartext]

You can have a look at subop_latency in "ceph daemon osd.XX perf
dump", it tells you how long an OSD took to reply to another OSD.
That's usually a good indicator if an OSD is dragging down others.
Or have a look at "ceph osd perf dump", which is basically disk
latency; simpler to acquire but with less information.

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Nov 7, 2019 at 6:55 PM Stefan Bauer wrote:
>
> Hi folks,
>
> we are running a 3 node proxmox-cluster with - of course - ceph :)
>
> ceph version 12.2.12 (39cfebf25a7011204a9876d2950e4b28aba66d11) luminous
> (stable)
>
> 10G network. iperf reports almost 10G between all nodes.
>
> We are using mixed standard SSDs (Crucial / Samsung). We are aware that
> these disks cannot deliver high IOPS or great throughput, but we have
> several of these clusters and this one is showing very poor performance.
>
> NOW the strange fact:
>
> When a specific node is rebooting, the throughput is acceptable.
>
> But when the specific node is back, the results dropped by almost 100%.
>
> 2 NODES (one rebooting)
>
> # rados bench -p scbench 10 write --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304
> for up to 10 seconds or 0 objects
> Object prefix: benchmark_data_pve3_1767693
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>     0       0         0         0         0         0            -          0
>     1      16        55        39   155.992       156    0.0445665   0.257988
>     2      16       110        94    187.98       220     0.087097   0.291173
>     3      16       156       140   186.645       184     0.462171   0.286895
>     4      16       184       168    167.98       112    0.0235336   0.358085
>     5      16       210       194   155.181       104     0.112401   0.347883
>     6      16       252       236   157.314       168     0.134099   0.382159
>     7      16       287       271   154.838       140    0.0264864    0.40092
>     8      16       329       313   156.481       168    0.0609964   0.394753
>     9      16       364       348   154.649       140     0.244309   0.392331
>    10      16       416       400   159.981       208     0.277489   0.387424
> Total time run:         10.335496
> Total writes made:      417
> Write size:             4194304
> Object size:            4194304
> Bandwidth (MB/sec):     161.386
> Stddev Bandwidth:       37.8065
> Max bandwidth (MB/sec): 220
> Min bandwidth (MB/sec): 104
> Average IOPS:           40
> Stddev IOPS:            9
> Max IOPS:               55
> Min IOPS:               26
> Average Latency(s):     0.396434
> Stddev Latency(s):      0.428527
> Max latency(s):         1.86968
> Min latency(s):         0.020558
>
> THIRD NODE ONLINE:
>
> root@pve3:/# rados bench -p scbench 10 write --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304
> for up to 10 seconds or 0 objects
> Object prefix: benchmark_data_pve3_1771977
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>     0       0         0         0         0         0            -          0
>     1      16        39        23   91.9943        92      0.21353   0.267249
>     2      16        46        30   59.9924        28      0.29527   0.268672
>     3      16        53        37   49.3271        28     0.122732   0.259731
>     4      16        53        37   36.9954         0            -   0.259731
>     5      16        53        37   29.5963         0            -   0.259731
>     6      16        87        71   47.3271       45.     0.241921    1.19831
>     7      16       106        90   51.4214        76     0.124821    1.07941
>     8      16       129       113
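To answer the in-line question about units: in the perf dump, avgtime is simply sum divided by avgcount, reported in seconds. A quick sanity check in Python against the counter values quoted above:

```python
# avgtime in "ceph daemon osd.XX perf dump" output is sum / avgcount,
# in seconds. Values taken from the subop_latency counter quoted above:
avgcount = 7782673
total_seconds = 38852.140794738

avgtime = total_seconds / avgcount
print(f"{avgtime:.9f}")  # matches the reported avgtime, ~5 ms per sub-op
```

So a subop_latency avgtime of 0.005 means the OSD takes about 5 ms on average to acknowledge replicated writes, which is unremarkable for SATA SSDs.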
[ceph-users] how to find the lazy egg - poor performance - interesting observations [klartext]
This only happens with this one specific node? Checked system logs? Checked SMART on all disks?

I mean, technically it's expected to have slower writes when the third node is there; that's by ceph design.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] how to find the lazy egg - poor performance - interesting observations [klartext]
Thank you Paul. I'm not sure if these low values will be of any help:

osd  commit_latency(ms)  apply_latency(ms)
  0                   0                 0
  1                   0                 0
  5                   0                 0
  4                   0                 0
  3                   0                 0
  2                   0                 0
  6                   0                 0
  7                   3                 3
  8                   3                 3
  9                   3                 3
 10                   3                 3
 11                   0                 0

But still, there are some higher OSDs. If i do some stress test on a VM, the values increase heavily, but I'm unsure if this is not only a peak caused by the data distribution through the crush map and part of the game.

osd  commit_latency(ms)  apply_latency(ms)
  0                   8                 8
  1                  18                18
  5                   0                 0
  4                   0                 0
  3                   0                 0
  2                   7                 7
  6                   0                 0
  7                 100               100
  8                  44                44
  9                 199               199
 10                 512               512
 11                  15                15

osd  commit_latency(ms)  apply_latency(ms)
  0                  30                30
  1                   5                 5
  5                   0                 0
  4                   0                 0
  3                   0                 0
  2                 719               719
  6                   0                 0
  7                 150               150
  8                  22                22
  9                 110               110
 10                  94                94
 11                  24                24

Stefan

From: Paul Emmerich

You can have a look at subop_latency in "ceph daemon osd.XX perf dump", it tells you how long an OSD took to reply to another OSD. That's usually a good indicator if an OSD is dragging down others. Or have a look at "ceph osd perf dump", which is basically disk latency; simpler to acquire but with less information.
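Eyeballing those tables, OSDs 2, 7, 9 and 10 stand well above the rest under load. A small, hypothetical helper (not a ceph tool; the threshold is illustrative) that flags OSDs far above the cluster median, using the second measurement above:

```python
from statistics import median

# Hypothetical helper: flag OSDs whose commit latency is well above the
# cluster median. Data is the second measurement quoted above (under load).
latencies_ms = {0: 8, 1: 18, 2: 7, 3: 0, 4: 0, 5: 0,
                6: 0, 7: 100, 8: 44, 9: 199, 10: 512, 11: 15}

def suspect_osds(lat, factor=5, floor_ms=20):
    """Return OSD ids whose latency exceeds max(factor * median, floor_ms)."""
    threshold = max(factor * median(lat.values()), floor_ms)
    return sorted(osd for osd, v in lat.items() if v > threshold)

print(suspect_osds(latencies_ms))  # → [7, 9, 10]
```

A single snapshot can be misleading, as Stefan suspects: crush distribution means a busy PG can spike one OSD briefly, so only OSDs that show up repeatedly across samples are real suspects.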
Re: [ceph-users] how to find the lazy egg - poor performance - interesting observations [klartext]
You can have a look at subop_latency in "ceph daemon osd.XX perf
dump", it tells you how long an OSD took to reply to another OSD.
That's usually a good indicator if an OSD is dragging down others.
Or have a look at "ceph osd perf dump", which is basically disk
latency; simpler to acquire but with less information.

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Nov 7, 2019 at 6:55 PM Stefan Bauer wrote:
>
> Hi folks,
>
> we are running a 3 node proxmox-cluster with - of course - ceph :)
>
> ceph version 12.2.12 (39cfebf25a7011204a9876d2950e4b28aba66d11) luminous
> (stable)
>
> 10G network. iperf reports almost 10G between all nodes.
>
> We are using mixed standard SSDs (Crucial / Samsung). We are aware that
> these disks cannot deliver high IOPS or great throughput, but we have
> several of these clusters and this one is showing very poor performance.
>
> NOW the strange fact:
>
> When a specific node is rebooting, the throughput is acceptable.
>
> But when the specific node is back, the results dropped by almost 100%.
>
> 2 NODES (one rebooting)
>
> # rados bench -p scbench 10 write --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304
> for up to 10 seconds or 0 objects
> Object prefix: benchmark_data_pve3_1767693
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>     0       0         0         0         0         0            -          0
>     1      16        55        39   155.992       156    0.0445665   0.257988
>     2      16       110        94    187.98       220     0.087097   0.291173
>     3      16       156       140   186.645       184     0.462171   0.286895
>     4      16       184       168    167.98       112    0.0235336   0.358085
>     5      16       210       194   155.181       104     0.112401   0.347883
>     6      16       252       236   157.314       168     0.134099   0.382159
>     7      16       287       271   154.838       140    0.0264864    0.40092
>     8      16       329       313   156.481       168    0.0609964   0.394753
>     9      16       364       348   154.649       140     0.244309   0.392331
>    10      16       416       400   159.981       208     0.277489   0.387424
> Total time run:         10.335496
> Total writes made:      417
> Write size:             4194304
> Object size:            4194304
> Bandwidth (MB/sec):     161.386
> Stddev Bandwidth:       37.8065
> Max bandwidth (MB/sec): 220
> Min bandwidth (MB/sec): 104
> Average IOPS:           40
> Stddev IOPS:            9
> Max IOPS:               55
> Min IOPS:               26
> Average Latency(s):     0.396434
> Stddev Latency(s):      0.428527
> Max latency(s):         1.86968
> Min latency(s):         0.020558
>
> THIRD NODE ONLINE:
>
> root@pve3:/# rados bench -p scbench 10 write --no-cleanup
> hints = 1
> Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304
> for up to 10 seconds or 0 objects
> Object prefix: benchmark_data_pve3_1771977
>   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
>     0       0         0         0         0         0            -          0
>     1      16        39        23   91.9943        92      0.21353   0.267249
>     2      16        46        30   59.9924        28      0.29527   0.268672
>     3      16        53        37   49.3271        28     0.122732   0.259731
>     4      16        53        37   36.9954         0            -   0.259731
>     5      16        53        37   29.5963         0            -   0.259731
>     6      16        87        71   47.3271       45.     0.241921    1.19831
>     7      16       106        90   51.4214        76     0.124821    1.07941
>     8      16       129       113    56.492        92    0.0314146   0.941378
>     9      16       142       126   55.9919        52     0.285536   0.871445
>    10      16       147       131   52.3925        20     0.354803   0.852074
> Total time run:         10.138312
> Total writes made:      148
> Write size:             4194304
> Object size:            4194304
> Bandwidth (MB/sec):     58.3924
> Stddev Bandwidth:       34.405
> Max bandwidth (MB/sec): 92
> Min bandwidth (MB/sec): 0
> Average IOPS:           14
> Stddev IOPS:            8
> Max IOPS:               23
> Min IOPS:               0
> Average Latency(s):     1.08818
> Stddev Latency(s):      1.55967
> Max latency(s):         5.02514
> Min latency(s):         0.0255947
>
> Is a single node faulty here?
>
> root@pve3:/# ceph status
>   cluster:
>     id:     138c857a-c4e6-4600-9320-9567011470d6
>     health: HEALTH_WARN
>             application not enabled on 1 pool(s) (that's just for benchmarking)
>
>   services:
>     mon: 3 daemons, quorum pve1,pve2,pve3
>     mgr: pve1(active), standbys:
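Paul's suggestion above can be scripted. A sketch of extracting the sub-op latency from one OSD's perf dump, assuming the counter sits under the "osd" section of the JSON as in luminous (in practice you would feed it the output of `ceph daemon osd.N perf dump` for each OSD and compare):

```python
import json

def subop_avgtime_ms(perf_dump: dict) -> float:
    """Average sub-op reply latency in ms from one OSD's perf dump."""
    c = perf_dump["osd"]["subop_latency"]
    return 1000.0 * c["sum"] / c["avgcount"] if c["avgcount"] else 0.0

# Captured sample; a real run would use the JSON printed by
# `ceph daemon osd.N perf dump` on each OSD host.
sample = json.loads("""
{"osd": {"subop_latency": {"avgcount": 7782673,
                           "sum": 38852.140794738,
                           "avgtime": 0.004992133}}}
""")
print(f"subop_latency: {subop_avgtime_ms(sample):.2f} ms")
```

Running this per OSD and sorting the results makes the "lazy egg" obvious: one OSD with an avgtime an order of magnitude above its peers is the one dragging down every write that replicates through it.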
[ceph-users] how to find the lazy egg - poor performance - interesting observations [klartext]
Hi folks,

we are running a 3 node proxmox-cluster with - of course - ceph :)

ceph version 12.2.12 (39cfebf25a7011204a9876d2950e4b28aba66d11) luminous (stable)

10G network. iperf reports almost 10G between all nodes.

We are using mixed standard SSDs (Crucial / Samsung). We are aware that these disks cannot deliver high IOPS or great throughput, but we have several of these clusters and this one is showing very poor performance.

NOW the strange fact:

When a specific node is rebooting, the throughput is acceptable.

But when the specific node is back, the results dropped by almost 100%.

2 NODES (one rebooting)

# rados bench -p scbench 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_pve3_1767693
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -          0
    1      16        55        39   155.992       156    0.0445665   0.257988
    2      16       110        94    187.98       220     0.087097   0.291173
    3      16       156       140   186.645       184     0.462171   0.286895
    4      16       184       168    167.98       112    0.0235336   0.358085
    5      16       210       194   155.181       104     0.112401   0.347883
    6      16       252       236   157.314       168     0.134099   0.382159
    7      16       287       271   154.838       140    0.0264864    0.40092
    8      16       329       313   156.481       168    0.0609964   0.394753
    9      16       364       348   154.649       140     0.244309   0.392331
   10      16       416       400   159.981       208     0.277489   0.387424
Total time run:         10.335496
Total writes made:      417
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     161.386
Stddev Bandwidth:       37.8065
Max bandwidth (MB/sec): 220
Min bandwidth (MB/sec): 104
Average IOPS:           40
Stddev IOPS:            9
Max IOPS:               55
Min IOPS:               26
Average Latency(s):     0.396434
Stddev Latency(s):      0.428527
Max latency(s):         1.86968
Min latency(s):         0.020558

THIRD NODE ONLINE:

root@pve3:/# rados bench -p scbench 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_pve3_1771977
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -          0
    1      16        39        23   91.9943        92      0.21353   0.267249
    2      16        46        30   59.9924        28      0.29527   0.268672
    3      16        53        37   49.3271        28     0.122732   0.259731
    4      16        53        37   36.9954         0            -   0.259731
    5      16        53        37   29.5963         0            -   0.259731
    6      16        87        71   47.3271       45.     0.241921    1.19831
    7      16       106        90   51.4214        76     0.124821    1.07941
    8      16       129       113    56.492        92    0.0314146   0.941378
    9      16       142       126   55.9919        52     0.285536   0.871445
   10      16       147       131   52.3925        20     0.354803   0.852074
Total time run:         10.138312
Total writes made:      148
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     58.3924
Stddev Bandwidth:       34.405
Max bandwidth (MB/sec): 92
Min bandwidth (MB/sec): 0
Average IOPS:           14
Stddev IOPS:            8
Max IOPS:               23
Min IOPS:               0
Average Latency(s):     1.08818
Stddev Latency(s):      1.55967
Max latency(s):         5.02514
Min latency(s):         0.0255947

Is a single node faulty here?

root@pve3:/# ceph status
  cluster:
    id:     138c857a-c4e6-4600-9320-9567011470d6
    health: HEALTH_WARN
            application not enabled on 1 pool(s) (that's just for benchmarking)

  services:
    mon: 3 daemons, quorum pve1,pve2,pve3
    mgr: pve1(active), standbys: pve3, pve2
    osd: 12 osds: 12 up, 12 in

  data:
    pools:   2 pools, 612 pgs
    objects: 758.52k objects, 2.89TiB
    usage:   8.62TiB used, 7.75TiB / 16.4TiB avail
    pgs:     611 active+clean
             1   active+clean+scrubbing+deep

  io:
    client: 4.99MiB/s rd, 1.36MiB/s wr, 678op/s rd, 105op/s wr

Thank you.

Stefan
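The two bench summaries are easy to cross-check: with 4 MB objects, bandwidth is total data written over run time, and average IOPS is writes over run time (rados bench truncates rather than rounds). The drop from ~161 to ~58 MB/s with the third node back is the anomaly in question:

```python
# Cross-check of the two rados bench summaries above (4 MB objects):
# bandwidth = writes * 4 / runtime, IOPS = writes / runtime (truncated).
for label, writes, runtime in [("2 nodes", 417, 10.335496),
                               ("3 nodes", 148, 10.138312)]:
    mb_s = writes * 4 / runtime
    iops = int(writes / runtime)  # rados bench truncates here
    print(f"{label}: {mb_s:.3f} MB/s, {iops} IOPS")
```

Both lines reproduce the reported Bandwidth and Average IOPS figures, so the summaries are internally consistent; the slowdown is real, not a reporting artifact.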