Re: [ceph-users] Slow requests blocked. No rebalancing
Hi all,

after increasing mon_max_pg_per_osd, ceph started rebalancing as usual. However, the slow request warnings are still there, even after setting primary-affinity to 0 beforehand.

On the other hand, if I destroy the osd, ceph will start rebalancing unless the noout flag is set, am I right?

Thanks

Jaime

On 20/09/18 14:25, Paul Emmerich wrote:
> You can prevent creation of the PGs on the old filestore OSDs (which
> seems to be the culprit here) during replacement by replacing the disks
> the hard way:
>
> * ceph osd destroy osd.X
> * re-create with bluestore under the same id (ceph volume ... --osd-id X)
>
> it will then just backfill onto the same disk without moving any PG.
>
> Keep in mind that this means that you are running with one missing copy
> during the recovery, so that's not the recommended way to do that.
>
> Paul
>
> [...]

--
Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
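For reference, the flags discussed above can be set like this (osd.X is a placeholder; note that noout only stops down OSDs from being marked out automatically, while norebalance/nobackfill are the flags that pause data movement itself):

```shell
# Keep down OSDs from being marked out automatically
ceph osd set noout

# Pause data movement entirely while working on the cluster
ceph osd set norebalance
ceph osd set nobackfill

# ...replace or migrate the OSD here...

# Clear the flags again afterwards
ceph osd unset nobackfill
ceph osd unset norebalance
ceph osd unset noout
```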
Re: [ceph-users] Slow requests blocked. No rebalancing
You can prevent creation of the PGs on the old filestore OSDs (which seems to be the culprit here) during replacement by replacing the disks the hard way:

* ceph osd destroy osd.X
* re-create with bluestore under the same id (ceph volume ... --osd-id X)

it will then just backfill onto the same disk without moving any PG.

Keep in mind that this means that you are running with one missing copy during the recovery, so that's not the recommended way to do that.

Paul

2018-09-20 13:51 GMT+02:00 Eugen Block :
> Hi,
>
> to reduce impact on clients during migration I would set the OSD's
> primary-affinity to 0 beforehand. This should prevent the slow requests,
> at least this setting has helped us a lot with problematic OSDs.
>
> Regards
> Eugen
>
> Zitat von Jaime Ibar :
>
> [...]

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
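A sketch of the in-place replacement Paul outlines, using ceph-volume as shipped with Luminous (the osd id 12 and /dev/sdc are placeholder values; double-check the device before zapping anything):

```shell
ID=12          # id of the filestore OSD being converted (placeholder)
DEV=/dev/sdc   # its data device (placeholder)

# Mark the OSD destroyed but keep its id and CRUSH position,
# so no PGs get remapped to other OSDs.
ceph osd destroy $ID --yes-i-really-mean-it

# Wipe the old filestore disk and re-create a bluestore OSD
# under the same id; it then backfills its PGs in place.
ceph-volume lvm zap $DEV
ceph-volume lvm create --bluestore --data $DEV --osd-id $ID
```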
Re: [ceph-users] Slow requests blocked. No rebalancing
Hi,

to reduce impact on clients during migration I would set the OSD's primary-affinity to 0 beforehand. This should prevent the slow requests, at least this setting has helped us a lot with problematic OSDs.

Regards
Eugen

Zitat von Jaime Ibar :

> [...]
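Eugen's suggestion, spelled out (osd.X is a placeholder):

```shell
# Demote osd.X so it is never chosen as primary; client I/O for its PGs
# is then served by primaries on other OSDs while it is drained.
ceph osd primary-affinity osd.X 0

# ...migrate the OSD...

# Restore the default afterwards
ceph osd primary-affinity osd.X 1
```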
Re: [ceph-users] Slow requests blocked. No rebalancing
Hello,

> 2018-09-20 09:32:58.851160 mon.dri-ceph01 [WRN] Health check update: 249
> PGs pending on creation (PENDING_CREATING_PGS)

This error might indicate that you are hitting a PG-per-OSD limit. Here is some information on it: https://ceph.com/community/new-luminous-pg-overdose-protection/
You might need to increase mon_max_pg_per_osd for the OSDs to start balancing out.

On Thu, Sep 20, 2018 at 2:25 PM Jaime Ibar wrote:
>
> [...]
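The arithmetic behind that limit, as a rough sketch (the PG and pool numbers below are made-up illustrative values, not taken from Jaime's cluster; Luminous defaults mon_max_pg_per_osd to 200):

```shell
# Illustrative numbers only: pools totalling 4608 PGs at replica size 3,
# spread over 72 OSDs (6 servers x 12 OSDs as in the cluster description).
pg_replicas=$(( 4608 * 3 ))

# Average PG replicas per OSD with all 72 OSDs in:
echo $(( pg_replicas / 72 ))    # 192 -- already close to the default
                                # mon_max_pg_per_osd = 200

# Marking OSDs out concentrates those PGs on fewer OSDs; once an OSD
# would exceed the limit, new PGs on it stay "pending on creation":
echo $(( pg_replicas / 66 ))    # 209 -- over the limit with 6 OSDs out
```

The real per-OSD counts can be checked with "ceph osd df" (PGS column), and the limit raised at runtime with "ceph tell mon.* injectargs '--mon_max_pg_per_osd=300'" (then persisted in ceph.conf).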
[ceph-users] Slow requests blocked. No rebalancing
Hi all,

we recently upgraded from Jewel 10.2.10 to Luminous 12.2.7 and are now trying to migrate the OSDs to Bluestore following this document[0]. However, when I mark an OSD as out, I get warnings similar to these:

2018-09-20 09:32:46.079630 mon.dri-ceph01 [WRN] Health check failed: 2 slow requests are blocked > 32 sec. Implicated osds 16,28 (REQUEST_SLOW)
2018-09-20 09:32:52.841123 mon.dri-ceph01 [WRN] Health check update: 7 slow requests are blocked > 32 sec. Implicated osds 10,16,28,32,59 (REQUEST_SLOW)
2018-09-20 09:32:57.842230 mon.dri-ceph01 [WRN] Health check update: 15 slow requests are blocked > 32 sec. Implicated osds 10,16,28,31,32,59,78,80 (REQUEST_SLOW)
2018-09-20 09:32:58.851142 mon.dri-ceph01 [WRN] Health check update: 244944/40100780 objects misplaced (0.611%) (OBJECT_MISPLACED)
2018-09-20 09:32:58.851160 mon.dri-ceph01 [WRN] Health check update: 249 PGs pending on creation (PENDING_CREATING_PGS)

which prevent Ceph from rebalancing; the VMs running on Ceph start hanging and we have to mark the OSD back in.

I tried to reweight the OSD to 0.90 in order to minimize the impact on the cluster, but the warnings are the same.

I tried to increase these settings to

mds cache memory limit = 2147483648
rocksdb cache size = 2147483648

but with no luck, same warnings.

We also have CephFS for storing files from different projects (no directory fragmentation enabled).

The problem here is that if one OSD dies, all the services will be blocked as Ceph won't be able to start rebalancing.

The cluster is:

- 3 mons
- 3 mds (running on the same hosts as the mons), 2 active and 1 standby
- 3 mgr (running on the same hosts as the mons)
- 6 servers, 12 OSDs each
- 1GB private network

Does anyone know how to fix this, or where the problem could be?

Thanks a lot in advance.

Jaime

[0] http://docs.ceph.com/docs/luminous/rados/operations/bluestore-migration/

--
Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725
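When chasing the REQUEST_SLOW warnings themselves, it can help to ask an implicated OSD what the stuck ops were doing; a sketch using osd.16 from the log above (the daemon commands must be run on the host where that OSD lives, via its admin socket):

```shell
# Which health checks are firing and which OSDs are implicated
ceph health detail

# On osd.16's host: recent slow ops, and ops currently in flight
ceph daemon osd.16 dump_historic_ops
ceph daemon osd.16 dump_ops_in_flight
```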