Re: [ceph-users] disk controller failure
On 12/14/18 1:44 AM, Christian Balzer wrote:
> On Thu, 13 Dec 2018 19:44:30 +0100 Ronny Aasen wrote:
>> On 13.12.2018 18:19, Alex Gorbachev wrote:
>>> On Thu, Dec 13, 2018 at 10:48 AM Dietmar Rieder wrote:
>>>> Hi Cephers,
>>>>
>>>> one of our OSD nodes is experiencing a disk controller problem/failure
>>>> (frequent resetting), so the OSDs on this controller are flapping
>>>> (up/down, in/out).
>>>>
>>>> I will hopefully get the replacement part soon.
>>>>
>>>> I have some simple questions: what are the best steps to take now,
>>>> before and after replacement of the controller?
>>>>
>>>> - mark down and shut down all OSDs on that node?
>>>> - wait until the rebalance is finished
>>>> - replace the controller
>>>> - just restart the OSDs? Or redeploy them, since they still hold data?
>>>>
>>>> We are running:
>>>>
>>>> ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)
>>>> CentOS 7.5
>>>>
>>>> Sorry for my naive questions.
>>>
>>> I usually do "ceph osd set noout" first to prevent any recoveries.
>>>
>>> Then replace the hardware and make sure all OSDs come back online.
>>>
>>> Then "ceph osd unset noout".
>>>
>>> Best regards,
>>> Alex
>>
>> Setting noout prevents the OSDs from being marked out and triggering a
>> rebalance, i.e. for a short fix where you do not want rebalancing to
>> start, since you know the data will be available again shortly, e.g.
>> after a reboot or similar.
>>
>> If OSDs are flapping, you normally want them out of the cluster so they
>> do not impact performance any more.
>>
>> kind regards
>> Ronny Aasen
>
> I think in this case the question is: how soon is the new controller
> going to be there?
> If it's soon, and/or if rebalancing would severely impact cluster
> performance, I'd set noout and then shut the node down, stopping both
> the flapping and preventing data movement.
> Of course, if it's a long time until repairs and/or a small cluster (is
> there even enough space to rebalance a node's worth of data?), things
> may be different.
>
> I always set "mon_osd_down_out_subtree_limit = host" (and monitor
> things, of course) since I reckon a down node can often be brought back
> way faster than a full rebalance.
>
> Regards,
> Christian

Thanks Christian for this comment and suggestion.

I think setting noout and shutting down the node is a good option, because
rebalancing would mean that ~22 TB of data has to be moved. However, the
spare part seems to be delayed, so I'm afraid I'll not get it before Monday.

Best
Dietmar

-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web: http://www.icbi.at

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
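Christian's parenthetical question (is there even enough space to rebalance a node's worth of data?) comes down to simple arithmetic before anything else. A back-of-the-envelope sketch; the function name, the nearfull ratio and all capacity figures are illustrative assumptions, not values from this cluster, and real CRUSH placement is uneven, so individual OSDs will hit nearfull well before the cluster-wide average does:

```python
# Rough headroom check before letting a node's data rebalance.
# After the node is marked out, all raw data must still fit on the
# remaining capacity without pushing OSDs past the nearfull ratio.
def can_rebalance(total_raw_tb, used_raw_tb, node_raw_tb, nearfull=0.85):
    remaining = total_raw_tb - node_raw_tb  # raw capacity left once the node is out
    return used_raw_tb <= remaining * nearfull

# e.g. 10 nodes x 44 TB raw, 264 TB raw used, losing one 44 TB node:
print(can_rebalance(440, 264, 44))  # -> True  (264 TB fits under 396 * 0.85 = 336.6 TB)
print(can_rebalance(440, 350, 44))  # -> False (350 TB does not)
```

In practice the inputs come from `ceph df` and `ceph osd df`, and the per-OSD variance shown there matters more than the average.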
Re: [ceph-users] disk controller failure
On Thu, 13 Dec 2018 19:44:30 +0100 Ronny Aasen wrote:
> On 13.12.2018 18:19, Alex Gorbachev wrote:
>> On Thu, Dec 13, 2018 at 10:48 AM Dietmar Rieder wrote:
>>> Hi Cephers,
>>>
>>> one of our OSD nodes is experiencing a disk controller problem/failure
>>> (frequent resetting), so the OSDs on this controller are flapping
>>> (up/down, in/out).
>>>
>>> I will hopefully get the replacement part soon.
>>>
>>> I have some simple questions: what are the best steps to take now,
>>> before and after replacement of the controller?
>>>
>>> - mark down and shut down all OSDs on that node?
>>> - wait until the rebalance is finished
>>> - replace the controller
>>> - just restart the OSDs? Or redeploy them, since they still hold data?
>>>
>>> We are running:
>>>
>>> ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)
>>> CentOS 7.5
>>>
>>> Sorry for my naive questions.
>>
>> I usually do "ceph osd set noout" first to prevent any recoveries.
>>
>> Then replace the hardware and make sure all OSDs come back online.
>>
>> Then "ceph osd unset noout".
>>
>> Best regards,
>> Alex
>
> Setting noout prevents the OSDs from being marked out and triggering a
> rebalance, i.e. for a short fix where you do not want rebalancing to
> start, since you know the data will be available again shortly, e.g.
> after a reboot or similar.
>
> If OSDs are flapping, you normally want them out of the cluster so they
> do not impact performance any more.

I think in this case the question is: how soon is the new controller going
to be there?
If it's soon, and/or if rebalancing would severely impact cluster
performance, I'd set noout and then shut the node down, stopping both the
flapping and preventing data movement.
Of course, if it's a long time until repairs and/or a small cluster (is
there even enough space to rebalance a node's worth of data?), things may
be different.

I always set "mon_osd_down_out_subtree_limit = host" (and monitor things,
of course) since I reckon a down node can often be brought back way faster
than a full rebalance.

Regards,

Christian

> kind regards
> Ronny Aasen

-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Rakuten Communications
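For reference, a setting like Christian's would live in the monitors' ceph.conf; this is a sketch, and the exact deployment (section placement, restart vs. runtime injection) depends on your setup:

```ini
# ceph.conf on the monitor hosts -- illustrative fragment
[mon]
# Never automatically mark a down subtree of this size (or larger) "out":
# a dead host stays "down" but keeps its CRUSH weight, so no rebalance
# starts. Individual down OSDs are still marked out as usual.
mon_osd_down_out_subtree_limit = host
```

On Luminous it can also be changed at runtime with `ceph tell mon.* injectargs '--mon_osd_down_out_subtree_limit host'`, though injected values do not survive a monitor restart.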
Re: [ceph-users] disk controller failure
On 13.12.2018 18:19, Alex Gorbachev wrote:
> On Thu, Dec 13, 2018 at 10:48 AM Dietmar Rieder wrote:
>> Hi Cephers,
>>
>> one of our OSD nodes is experiencing a disk controller problem/failure
>> (frequent resetting), so the OSDs on this controller are flapping
>> (up/down, in/out).
>>
>> I will hopefully get the replacement part soon.
>>
>> I have some simple questions: what are the best steps to take now,
>> before and after replacement of the controller?
>>
>> - mark down and shut down all OSDs on that node?
>> - wait until the rebalance is finished
>> - replace the controller
>> - just restart the OSDs? Or redeploy them, since they still hold data?
>>
>> We are running:
>>
>> ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)
>> CentOS 7.5
>>
>> Sorry for my naive questions.
>
> I usually do "ceph osd set noout" first to prevent any recoveries.
>
> Then replace the hardware and make sure all OSDs come back online.
>
> Then "ceph osd unset noout".
>
> Best regards,
> Alex

Setting noout prevents the OSDs from being marked out and triggering a
rebalance, i.e. for a short fix where you do not want rebalancing to
start, since you know the data will be available again shortly, e.g.
after a reboot or similar.

If OSDs are flapping, you normally want them out of the cluster so they
do not impact performance any more.

kind regards

Ronny Aasen
Re: [ceph-users] disk controller failure
On Thu, Dec 13, 2018 at 10:48 AM Dietmar Rieder wrote:
>
> Hi Cephers,
>
> one of our OSD nodes is experiencing a disk controller problem/failure
> (frequent resetting), so the OSDs on this controller are flapping
> (up/down, in/out).
>
> I will hopefully get the replacement part soon.
>
> I have some simple questions: what are the best steps to take now,
> before and after replacement of the controller?
>
> - mark down and shut down all OSDs on that node?
> - wait until the rebalance is finished
> - replace the controller
> - just restart the OSDs? Or redeploy them, since they still hold data?
>
> We are running:
>
> ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)
> CentOS 7.5
>
> Sorry for my naive questions.

I usually do "ceph osd set noout" first to prevent any recoveries.

Then replace the hardware and make sure all OSDs come back online.

Then "ceph osd unset noout".

Best regards,
Alex

> Thanks for any help
> Dietmar
>
> -- 
> _
> D i e t m a r  R i e d e r, Mag.Dr.
> Innsbruck Medical University
> Biocenter - Division for Bioinformatics
> Innrain 80, 6020 Innsbruck
> Email: dietmar.rie...@i-med.ac.at
> Web: http://www.icbi.at
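Alex's three steps, sketched as a script. The hostname `osd-node` is a placeholder, and the systemd unit layout (`ceph-osd.target`, as shipped on CentOS 7 with Luminous) is assumed; the `ceph()`/`ssh()` stubs at the top turn this into a dry run that only prints the commands, so delete them to execute for real:

```shell
# Dry-run stubs: print each command instead of executing it.
# Delete these two lines to run against a real cluster.
ceph() { echo "+ ceph $*"; }
ssh()  { echo "+ ssh $*"; }

controller_swap() {
    ceph osd set noout                             # keep MONs from marking down OSDs "out"
    ssh osd-node systemctl stop ceph-osd.target    # stop the flapping OSDs on the node
    # ... power down, swap the controller, boot the node ...
    ssh osd-node systemctl start ceph-osd.target   # OSDs rejoin with their data intact
    ceph osd unset noout                           # re-enable normal out-marking
}
controller_swap
```

Note the OSDs only need restarting, not redeploying: a controller swap does not touch their data or metadata.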
Re: [ceph-users] disk controller failure
Hi,

On 13/12/2018 16:44, Dietmar Rieder wrote:
> So you say that there will be no problem when, after the rebalancing, I
> restart the stopped OSDs? I mean, they still have the data on them.
> (Sorry, I just don't want to mess something up.)

It should be fine[0]; when the OSDs come back in, Ceph will know what to
do with them.

Regards,

Matthew

[0] this consultancy worth what you paid for it ;-)

-- 
The Wellcome Sanger Institute is operated by Genome Research Limited, a
charity registered in England with number 1021457 and a company registered
in England with number 2742969, whose registered office is 215 Euston
Road, London, NW1 2BE.
Re: [ceph-users] disk controller failure
Hi Matthew,

thanks for your reply and advice, I really appreciate it.

So you say that there will be no problem when, after the rebalancing, I
restart the stopped OSDs? I mean, they still have the data on them.
(Sorry, I just don't want to mess something up.)

Best
Dietmar

On 12/13/18 5:11 PM, Matthew Vernon wrote:
> Hi,
>
> On 13/12/2018 15:48, Dietmar Rieder wrote:
>
>> one of our OSD nodes is experiencing a disk controller problem/failure
>> (frequent resetting), so the OSDs on this controller are flapping
>> (up/down, in/out).
>
> Ah, hardware...
>
>> I have some simple questions: what are the best steps to take now,
>> before and after replacement of the controller?
>
> I would stop all the OSDs on the affected node and let the cluster
> rebalance. Once you've replaced the disk controller, start them up
> again and Ceph will rebalance back again.
>
> Regards,
>
> Matthew

-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web: http://www.icbi.at
Re: [ceph-users] disk controller failure
Hi,

On 13/12/2018 15:48, Dietmar Rieder wrote:
> one of our OSD nodes is experiencing a disk controller problem/failure
> (frequent resetting), so the OSDs on this controller are flapping
> (up/down, in/out).

Ah, hardware...

> I have some simple questions: what are the best steps to take now,
> before and after replacement of the controller?

I would stop all the OSDs on the affected node and let the cluster
rebalance. Once you've replaced the disk controller, start them up again
and Ceph will rebalance back again.

Regards,

Matthew

-- 
The Wellcome Sanger Institute is operated by Genome Research Limited, a
charity registered in England with number 1021457 and a company registered
in England with number 2742969, whose registered office is 215 Euston
Road, London, NW1 2BE.
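Matthew's alternative, drain first and rebalance back afterwards, can be sketched the same way. Again `osd-node` and the systemd `ceph-osd.target` unit are assumptions, and the stubs at the top make this a dry run that only prints the commands:

```shell
# Dry-run stubs: print each command instead of executing it.
ceph() { echo "+ ceph $*"; }
ssh()  { echo "+ ssh $*"; }

drain_and_replace() {
    # Stop every OSD on the affected node; after mon_osd_down_out_interval
    # (600 s by default) they are marked out and the cluster rebalances.
    ssh osd-node systemctl stop ceph-osd.target
    ceph -s    # watch until all PGs are active+clean before touching hardware
    # ... replace the controller, boot the node ...
    ssh osd-node systemctl start ceph-osd.target   # OSDs rejoin; Ceph backfills data back
    ceph -s
}
drain_and_replace
```

The trade-off versus the noout approach is two full data movements (out and back) in exchange for keeping full redundancy while the node is down.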
[ceph-users] disk controller failure
Hi Cephers,

one of our OSD nodes is experiencing a disk controller problem/failure
(frequent resetting), so the OSDs on this controller are flapping
(up/down, in/out).

I will hopefully get the replacement part soon.

I have some simple questions: what are the best steps to take now, before
and after replacement of the controller?

- mark down and shut down all OSDs on that node?
- wait until the rebalance is finished
- replace the controller
- just restart the OSDs? Or redeploy them, since they still hold data?

We are running:

ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable)
CentOS 7.5

Sorry for my naive questions.

Thanks for any help
Dietmar

-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Email: dietmar.rie...@i-med.ac.at
Web: http://www.icbi.at