[ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]
Folks,

I would like to thank you again for your help regarding the performance speedup of our Ceph cluster. The customer just reported that the database is around 40% faster than before, without any hardware changes. This really kicks ass now! :)

We measured subop_latency (avgtime) on our OSDs and could reduce the latency from 2.5 ms to 0.7 ms. :p

Cheers

Stefan

-----Original Message-----
From: Виталий Филиппов
Sent: Tuesday, 14 January 2020 10:28
To: Wido den Hollander; Stefan Bauer
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

...disable signatures and rbd cache. I didn't mention it in the email to not repeat myself. But I have it in the article :-)

--
With best regards,
Vitaliy Filippov
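For reference, a minimal sketch of how subop_latency can be read for every OSD on a node, assuming jq is installed and the admin sockets sit in the default /var/run/ceph location; the loop itself is illustrative and not from this thread:

    # Illustrative only: print subop_latency avgtime (in seconds) for
    # each OSD whose admin socket lives on this host.
    for sock in /var/run/ceph/ceph-osd.*.asok; do
        echo -n "$sock: "
        ceph daemon "$sock" perf dump | jq '.osd.subop_latency.avgtime'
    done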
Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]
Thank you all, performance is indeed better now. Can now go back to sleep. ;)

KR

Stefan

-----Original Message-----
From: Виталий Филиппов
Sent: Tuesday, 14 January 2020 10:28
To: Wido den Hollander; Stefan Bauer
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

...disable signatures and rbd cache. I didn't mention it in the email to not repeat myself. But I have it in the article :-)

--
With best regards,
Vitaliy Filippov
Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]
Hi Vitaliy,

thank you for your time. Do you mean

cephx sign messages = false

with "disable signatures"?

KR

Stefan

-----Original Message-----
From: Виталий Филиппов
Sent: Tuesday, 14 January 2020 10:28
To: Wido den Hollander; Stefan Bauer
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]

...disable signatures and rbd cache. I didn't mention it in the email to not repeat myself. But I have it in the article :-)

--
With best regards,
Vitaliy Filippov
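A sketch of how both parts of the quoted advice might look in ceph.conf; the exact set of options to disable is an assumption here, and turning off message signing and the client cache only makes sense on a fully trusted network:

    # Illustrative ceph.conf snippet, not taken from the thread.
    [global]
        cephx require signatures = false
        cephx cluster require signatures = false
        cephx service require signatures = false
        cephx sign messages = false

    [client]
        # second part of the quoted advice: disable the librbd cache
        rbd cache = false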
Re: [ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]
Hi Stefan,

thank you for your time. "temporary write through" does not seem to be a legit parameter. However, write through is already set:

root@proxmox61:~# echo "temporary write through" > /sys/block/sdb/device/scsi_disk/*/cache_type
root@proxmox61:~# cat /sys/block/sdb/device/scsi_disk/2\:0\:0\:0/cache_type
write through

Is that what you meant? Thank you.

KR

Stefan

-----Original Message-----
From: Stefan Priebe - Profihost AG

This has something to do with the firmware and how the manufacturer handles syncs / flushes. Intel simply ignores sync / flush commands for drives that have a capacitor; Samsung does not. The problem is that Ceph sends a lot of flush commands, which slows down drives without a capacitor. You can make Linux ignore those userspace requests with the following command:

echo "temporary write through" > /sys/block/sdX/device/scsi_disk/*/cache_type

Greets,
Stefan Priebe
Profihost AG

> Thank you.
>
> Stefan
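For completeness, a hypothetical helper (not from the thread) that applies the same workaround to all SCSI disks on a host at once. The "temporary" prefix makes the kernel change its cached setting without writing the mode page to the drive, so it does not survive a reboot and would need a boot script or udev rule to persist:

    # Illustrative loop; adjust the glob if only the OSD disks should
    # be touched.
    for f in /sys/block/sd*/device/scsi_disk/*/cache_type; do
        echo "temporary write through" > "$f"
    done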
[ceph-users] low io with enterprise SSDs ceph luminous - can we expect more? [klartext]
Hi,

we're playing around with Ceph but are not quite happy with the IOPS.

3-node Ceph / Proxmox cluster, each node with:

LSI HBA 3008 controller
4 x MZILT960HAHQ/007 Samsung SSD, transport protocol: SAS (SPL-3)
40G fibre, Intel 520 network controller, on a Unifi switch

Ping round trip to a partner node is 0.040 ms on average.

fio on a virtual machine with

--randrepeat=1 --ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test --bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75

reports on average 5000 IOPS write and 13000 IOPS read.

We're expecting more. :( Any ideas, or is that all we can expect? Money is not a problem for this test bed; any ideas on how to gain more IOPS are greatly appreciated.

Thank you.

Stefan
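A common first check for this kind of question (illustrative, not part of the original message) is to measure the raw single-threaded sync write latency of one SSD, since Ceph's flush-heavy write path stresses exactly that. /dev/sdX is a placeholder, and the test writes directly to the device, so it must not hold data:

    fio --name=sync-test --filename=/dev/sdX --direct=1 --sync=1 \
        --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based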
Re: [ceph-users] how to find the lazy egg - poor performance - interesting observations [klartext]
Paul,

I would like to take the chance to thank you and to ask: could subop_latency report a high value (is that avgtime reported in seconds?)

"subop_latency": {
    "avgcount": 7782673,
    "sum": 38852.140794738,
    "avgtime": 0.004992133
}

because the communication partner is slow in writing/committing? I don't want to follow a red herring. :/ We have the following times on our 11 OSDs; image attached.

-----Original Message-----
From: Paul Emmerich
Sent: Thursday, 7 November 2019 19:04
To: Stefan Bauer
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] how to find the lazy egg - poor performance - interesting observations [klartext]

You can have a look at subop_latency in "ceph daemon osd.XX perf dump"; it tells you how long an OSD took to reply to another OSD. That's usually a good indicator if an OSD is dragging down others. Or have a look at "ceph osd perf", which is basically disk latency; simpler to acquire but with less information.

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Thu, Nov 7, 2019 at 6:55 PM Stefan Bauer wrote:
>
> Hi folks,
>
> we are running a 3 node proxmox-cluster with - of course - ceph :)
>
> ceph version 12.2.12 (39cfebf25a7011204a9876d2950e4b28aba66d11) luminous (stable)
>
> 10G network. iperf reports almost 10G between all nodes.
>
> We are using mixed standard SSDs (Crucial / Samsung). We are aware that these disks cannot deliver high IOPS or great throughput, but we have several of these clusters and this one is showing very poor performance.
>
> NOW the strange fact:
>
> When a specific node is rebooting, the throughput is acceptable.
>
> But when the specific node is back, the results drop by almost 100%.
> [quoted rados bench output trimmed; the full runs appear in the original post at the end of this thread]
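On the units question above: avgtime is sum divided by avgcount, in seconds. For the values shown, 38852.14 s / 7782673 ops ≈ 0.004992 s, i.e. about 5 ms per subop. A hypothetical one-liner to express it in milliseconds, assuming jq is available:

    ceph daemon osd.0 perf dump | jq '.osd.subop_latency | .sum / .avgcount * 1000'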
Re: [ceph-users] how to find the lazy egg - poor performance - interesting observations [klartext]
Thank you Paul. I'm not sure if these low values will be of any help:

osd  commit_latency(ms)  apply_latency(ms)
  0                   0                  0
  1                   0                  0
  5                   0                  0
  4                   0                  0
  3                   0                  0
  2                   0                  0
  6                   0                  0
  7                   3                  3
  8                   3                  3
  9                   3                  3
 10                   3                  3
 11                   0                  0

But still, there are some higher OSDs. If I run a stress test on a VM, the values increase heavily, but I'm unsure whether this is just a peak caused by the data distribution through the CRUSH map and part of the game.

osd  commit_latency(ms)  apply_latency(ms)
  0                   8                  8
  1                  18                 18
  5                   0                  0
  4                   0                  0
  3                   0                  0
  2                   7                  7
  6                   0                  0
  7                 100                100
  8                  44                 44
  9                 199                199
 10                 512                512
 11                  15                 15

osd  commit_latency(ms)  apply_latency(ms)
  0                  30                 30
  1                   5                  5
  5                   0                  0
  4                   0                  0
  3                   0                  0
  2                 719                719
  6                   0                  0
  7                 150                150
  8                  22                 22
  9                 110                110
 10                  94                 94
 11                  24                 24

Stefan

From: Paul Emmerich

You can have a look at subop_latency in "ceph daemon osd.XX perf dump"; it tells you how long an OSD took to reply to another OSD. That's usually a good indicator if an OSD is dragging down others. Or have a look at "ceph osd perf", which is basically disk latency; simpler to acquire but with less information.
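One illustrative way (not suggested in the thread itself) to tell a persistently slow OSD from a one-off peak caused by data placement is to keep the latency table refreshing while the stress test runs; an OSD that stays at the top across refreshes is the better suspect:

    watch -n 1 ceph osd perf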
[ceph-users] how to find the lazy egg - poor performance - interesting observations [klartext]
Hi folks,

we are running a 3 node proxmox-cluster with - of course - ceph :)

ceph version 12.2.12 (39cfebf25a7011204a9876d2950e4b28aba66d11) luminous (stable)

10G network. iperf reports almost 10G between all nodes.

We are using mixed standard SSDs (Crucial / Samsung). We are aware that these disks cannot deliver high IOPS or great throughput, but we have several of these clusters and this one is showing very poor performance.

NOW the strange fact:

When a specific node is rebooting, the throughput is acceptable.

But when the specific node is back, the results drop by almost 100%.

2 NODES (one rebooting):

# rados bench -p scbench 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_pve3_1767693
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
    1      16        55        39   155.992       156    0.0445665    0.257988
    2      16       110        94    187.98       220     0.087097    0.291173
    3      16       156       140   186.645       184     0.462171    0.286895
    4      16       184       168    167.98       112    0.0235336    0.358085
    5      16       210       194   155.181       104     0.112401    0.347883
    6      16       252       236   157.314       168     0.134099    0.382159
    7      16       287       271   154.838       140    0.0264864     0.40092
    8      16       329       313   156.481       168    0.0609964    0.394753
    9      16       364       348   154.649       140     0.244309    0.392331
   10      16       416       400   159.981       208     0.277489    0.387424
Total time run:         10.335496
Total writes made:      417
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     161.386
Stddev Bandwidth:       37.8065
Max bandwidth (MB/sec): 220
Min bandwidth (MB/sec): 104
Average IOPS:           40
Stddev IOPS:            9
Max IOPS:               55
Min IOPS:               26
Average Latency(s):     0.396434
Stddev Latency(s):      0.428527
Max latency(s):         1.86968
Min latency(s):         0.020558

THIRD NODE ONLINE:

root@pve3:/# rados bench -p scbench 10 write --no-cleanup
hints = 1
Maintaining 16 concurrent writes of 4194304 bytes to objects of size 4194304 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_pve3_1771977
  sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat(s)  avg lat(s)
    0       0         0         0         0         0            -           0
    1      16        39        23   91.9943        92      0.21353    0.267249
    2      16        46        30   59.9924        28      0.29527    0.268672
    3      16        53        37   49.3271        28     0.122732    0.259731
    4      16        53        37   36.9954         0            -    0.259731
    5      16        53        37   29.5963         0            -    0.259731
    6      16        87        71   47.3271       45.     0.241921     1.19831
    7      16       106        90   51.4214        76     0.124821     1.07941
    8      16       129       113    56.492        92    0.0314146    0.941378
    9      16       142       126   55.9919        52     0.285536    0.871445
   10      16       147       131   52.3925        20     0.354803    0.852074
Total time run:         10.138312
Total writes made:      148
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     58.3924
Stddev Bandwidth:       34.405
Max bandwidth (MB/sec): 92
Min bandwidth (MB/sec): 0
Average IOPS:           14
Stddev IOPS:            8
Max IOPS:               23
Min IOPS:               0
Average Latency(s):     1.08818
Stddev Latency(s):      1.55967
Max latency(s):         5.02514
Min latency(s):         0.0255947

Is a single node faulty here?

root@pve3:/# ceph status
  cluster:
    id:     138c857a-c4e6-4600-9320-9567011470d6
    health: HEALTH_WARN
            application not enabled on 1 pool(s)
            (that's just for benchmarking)

  services:
    mon: 3 daemons, quorum pve1,pve2,pve3
    mgr: pve1(active), standbys: pve3, pve2
    osd: 12 osds: 12 up, 12 in

  data:
    pools:   2 pools, 612 pgs
    objects: 758.52k objects, 2.89TiB
    usage:   8.62TiB used, 7.75TiB / 16.4TiB avail
    pgs:     611 active+clean
             1   active+clean+scrubbing+deep

  io:
    client: 4.99MiB/s rd, 1.36MiB/s wr, 678op/s rd, 105op/s wr

Thank you.

Stefan
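As a closing sketch of how the benchmark above could be extended (not part of the original message): because of --no-cleanup the write run leaves its objects in the pool, which allows a sequential read benchmark on the same data before removing it, using the pool name from above:

    rados bench -p scbench 10 seq
    rados -p scbench cleanup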