Re: [ceph-users] slow requests are blocked > 32 sec. Implicated osds 0, 2, 3, 4, 5 (REQUEST_SLOW)
Hello Robert,
I did not make any changes, so I'm still using the prio queue.
Regards

On Mon, Jun 10, 2019 at 5:44 PM, Robert LeBlanc wrote:
> I'm glad it's working. To be clear, did you use wpq, or is it still the
> prio queue?
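For anyone wanting to confirm which queue their OSDs are actually running, one way is to ask a running daemon over its admin socket. A minimal sketch, assuming you are on the host where osd.0 lives and the default admin socket path is in use:

  ceph daemon osd.0 config get osd_op_queue          # e.g. "prio" or "wpq"
  ceph daemon osd.0 config get osd_op_queue_cut_off  # e.g. "low" or "high"

These options are read when the OSD starts, which is why the advice later in this thread includes restarting the OSD processes after changing them.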
Re: [ceph-users] slow requests are blocked > 32 sec. Implicated osds 0, 2, 3, 4, 5 (REQUEST_SLOW)
I'm glad it's working. To be clear, did you use wpq, or is it still the prio queue?

Sent from a mobile device, please excuse any typos.

On Mon, Jun 10, 2019 at 4:45 AM, BASSAGET Cédric wrote:
> An update from 12.2.9 to 12.2.12 seems to have fixed the problem!
Re: [ceph-users] slow requests are blocked > 32 sec. Implicated osds 0, 2, 3, 4, 5 (REQUEST_SLOW)
An update from 12.2.9 to 12.2.12 seems to have fixed the problem!

On Mon, Jun 10, 2019 at 12:25 PM, BASSAGET Cédric wrote:
> Hi Robert,
> Before doing anything on my prod env, I generate r/w load on the ceph
> cluster using fio.
> [...]
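A quick way to confirm that the whole cluster is actually running the upgraded release (a sketch, assuming it is run from a node with an admin keyring):

  ceph versions            # per-daemon-type release summary
  ceph tell osd.* version  # per-OSD detail, if the summary shows a mix

ceph versions is available from Luminous onward, so it applies to both the 12.2.9 and 12.2.12 clusters discussed here.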
Re: [ceph-users] slow requests are blocked > 32 sec. Implicated osds 0, 2, 3, 4, 5 (REQUEST_SLOW)
Hi Robert,
Before doing anything on my prod env, I generate r/w load on the ceph cluster using fio.

On my newest cluster, release 12.2.12, I did not manage to get the (REQUEST_SLOW) warning, even when my OSD disk usage goes above 95% (fio ran from 4 different hosts).

On my prod cluster, release 12.2.9, as soon as I run fio on a single host, I see a lot of REQUEST_SLOW warning messages, but "iostat -xd 1" does not show me a usage of more than 5-10% on the disks...

On Mon, Jun 10, 2019 at 10:12 AM, Robert LeBlanc wrote:
> Your disk times look okay, just a lot more unbalanced than I would expect.
> I'd give wpq a try [...]
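The exact fio job used above isn't shown in the thread; a representative random-write job that would generate this kind of load might look like the following. The target device, block size, queue depth and runtime are illustrative assumptions, and writing to a raw RBD device destroys whatever data is on it:

  # illustrative only: /dev/rbd0 and all tuning values are assumptions
  fio --name=rbd-randwrite --filename=/dev/rbd0 \
      --rw=randwrite --bs=4k --ioengine=libaio --direct=1 \
      --iodepth=32 --numjobs=4 --runtime=120 --time_based \
      --group_reporting

Running the same job from one host versus four, as described above, is a reasonable way to compare how the two clusters behave under client load.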
Re: [ceph-users] slow requests are blocked > 32 sec. Implicated osds 0, 2, 3, 4, 5 (REQUEST_SLOW)
On Mon, Jun 10, 2019 at 1:00 AM, BASSAGET Cédric wrote:
> Hello Robert,
> My disks did not reach 100% on the last warning, they climb to 70-80%
> usage. But I see rrqm / wrqm counters increasing...
> [...]
> Would "osd op queue = wpq" help in this case?
> Regards

Your disk times look okay, just a lot more unbalanced than I would expect. I'd give wpq a try, I use it all the time; just be sure to also include the "osd op queue cut off" setting, or it doesn't have much effect. Let me know how it goes.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1
Re: [ceph-users] slow requests are blocked > 32 sec. Implicated osds 0, 2, 3, 4, 5 (REQUEST_SLOW)
Hello Robert,
My disks did not reach 100% on the last warning, they climb to 70-80% usage. But I see rrqm / wrqm counters increasing...

Device:  rrqm/s  wrqm/s      r/s      w/s      rkB/s     wkB/s avgrq-sz avgqu-sz  await r_await w_await  svctm  %util
sda        0.00    4.00     0.00    16.00       0.00    104.00    13.00     0.00   0.00    0.00    0.00   0.00   0.00
sdb        0.00    2.00     1.00  3456.00       8.00  25996.00    15.04     5.76   1.67    0.00    1.67   0.03   9.20
sdd        4.00    0.00 41462.00  1119.00  331272.00   7996.00    15.94    19.89   0.47    0.48    0.21   0.02  66.00
dm-0       0.00    0.00  6825.00   503.00  330856.00   7996.00    92.48     4.00   0.55    0.56    0.30   0.09  66.80
dm-1       0.00    0.00     1.00  1129.00       8.00  25996.00    46.02     1.03   0.91    0.00    0.91   0.09  10.00

sda is my system disk (SAMSUNG MZILS480HEGR/007 GXL0), sdb and sdd are my OSDs.

Would "osd op queue = wpq" help in this case?
Regards

On Sat, Jun 8, 2019 at 7:44 AM, Robert LeBlanc wrote:
> With the low number of OSDs, you are probably saturating the disks. Check
> with `iostat -xd 2` and see what the utilization of your disks is.
> [...]
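When reading iostat output like this, it helps to be sure which block device backs which OSD; a sketch of how to check, assuming a Luminous cluster (the exact metadata field names differ between FileStore and BlueStore):

  ceph osd metadata 0 | grep -E '"devices"|dev_node'   # backing device for osd.0
  ceph osd tree                                        # OSD-to-host layout

The notable point in the table above is that sdd sits at ~66% utilization while sdb stays under 10%, i.e. the load is clearly unbalanced across the two OSD disks.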
Re: [ceph-users] slow requests are blocked > 32 sec. Implicated osds 0, 2, 3, 4, 5 (REQUEST_SLOW)
With the low number of OSDs, you are probably saturating the disks. Check with `iostat -xd 2` and see what the utilization of your disks is. A lot of SSDs don't perform well with Ceph's heavy sync writes, and performance is terrible.

If some of your drives are at 100% while others are at lower utilization, you can possibly get more performance and greatly reduce the blocked I/O with the WPQ scheduler. In ceph.conf, add this to the [osd] section and restart the processes:

osd op queue = wpq
osd op queue cut off = high

This has helped our clusters with fairness between OSDs and making backfills not so disruptive.

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1

On Thu, Jun 6, 2019 at 1:43 AM, BASSAGET Cédric wrote:
> Hello,
>
> I see messages related to REQUEST_SLOW a few times per day.
>
> Here's my ceph -s:
>
> root@ceph-pa2-1:/etc/ceph# ceph -s
>   cluster:
>     id:     72d94815-f057-4127-8914-448dfd25f5bc
>     health: HEALTH_OK
>
>   services:
>     mon: 3 daemons, quorum ceph-pa2-1,ceph-pa2-2,ceph-pa2-3
>     mgr: ceph-pa2-3(active), standbys: ceph-pa2-1, ceph-pa2-2
>     osd: 6 osds: 6 up, 6 in
>
>   data:
>     pools:   1 pools, 256 pgs
>     objects: 408.79k objects, 1.49TiB
>     usage:   4.44TiB used, 37.5TiB / 41.9TiB avail
>     pgs:     256 active+clean
>
>   io:
>     client: 8.00KiB/s rd, 17.2MiB/s wr, 1 op/s rd, 546 op/s wr
>
> Running ceph version 12.2.9 (9e300932ef8a8916fb3fda78c58691a6ab0f4217)
> luminous (stable)
>
> I've checked:
> - all my network stack: OK (2*10G LAG)
> - memory usage: OK (256G on each host, about 2% used per OSD)
> - CPU usage: OK (Intel(R) Xeon(R) CPU E5-2678 v3 @ 2.50GHz)
> - disk status: OK (SAMSUNG AREA7680S5xnNTRI 3P04 => Samsung DC series)
>
> I heard on IRC that it can be related to the Samsung PM / SM series.
>
> Is anybody here facing the same problem? What can I do to solve that?
> Regards,
> Cédric
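A sketch of how the suggested change might be rolled out on a cluster like the one above, assuming a systemd-based Luminous deployment and that osd.0 runs on the local host; restart OSDs one at a time and let the cluster return to HEALTH_OK in between:

  # /etc/ceph/ceph.conf (excerpt) -- add to the existing [osd] section
  [osd]
  osd op queue = wpq
  osd op queue cut off = high

  # restart one OSD, check cluster health, then move on to the next
  systemctl restart ceph-osd@0
  ceph -s

Both settings only take effect when an OSD starts, which is why the advice above is to restart the processes after editing the config.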