Re: [ceph-users] disk controller failure

2018-12-14 Thread Dietmar Rieder
On 12/14/18 1:44 AM, Christian Balzer wrote:
> On Thu, 13 Dec 2018 19:44:30 +0100 Ronny Aasen wrote:
> 
>> On 13.12.2018 18:19, Alex Gorbachev wrote:
>>> On Thu, Dec 13, 2018 at 10:48 AM Dietmar Rieder
>>>  wrote:  
 Hi Cephers,

 one of our OSD nodes is experiencing a Disk controller problem/failure
 (frequent resetting), so the OSDs on this controller are flapping
 (up/down in/out).

 I will hopefully get the replacement part soon.

 I have some simple questions: what are the best steps to take now, before
 and after replacement of the controller?

 - mark down and shut down all OSDs on that node?
 - wait for the rebalance to finish
 - replace the controller
 - just restart the OSDs, or redeploy them, since they still hold data?

 We are running:

 ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous
 (stable)
 CentOS 7.5

 Sorry for my naive questions.  
>>> I usually do ceph osd set noout first to prevent any recoveries
>>>
>>> Then replace the hardware and make sure all OSDs come back online
>>>
>>> Then ceph osd unset noout
>>>
>>> Best regards,
>>> Alex  
>>
>>
>> Setting noout prevents the OSDs from being marked out and a re-balance from
>> starting, i.e. it is for a short fix where you do not want re-balancing to
>> begin, since you know the data will be available again shortly, e.g. after a
>> reboot or similar.
>>
>> If OSDs are flapping, you normally want them out of the cluster so they do
>> not impact performance any more.
>>
> I think in this case the question is: how soon is the new controller going
> to be there?
> If it's soon and/or if rebalancing would severely impact cluster
> performance, I'd set noout and then shut the node down, which stops the
> flapping and prevents any data movement.
> Of course, if the repair is a long way off and/or the cluster is small (is
> there even enough space to rebalance a node's worth of data?), things may
> be different.
> 
> I always set "mon_osd_down_out_subtree_limit = host" (and monitor things
> of course) since I reckon a down node can often be brought back way faster
> than a full rebalance.


Thanks, Christian, for this comment and suggestion.

I think setting noout and shutting down the node is a good option, because
rebalancing would mean that ~22 TB of data has to be moved.
However, the spare part seems to be delayed, so I'm afraid I'll not get it
before Monday.
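In case it is useful to others: roughly how much data sits on a single host
(and hence how much a rebalance would have to move) can be checked with

  ceph osd df tree

which prints utilization aggregated per host and per OSD.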

Best
  Dietmar

> 
> Regards,
> 
> Christian
>>
>> kind regards
>>
>> Ronny Aasen
>>
>>


-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at






Re: [ceph-users] disk controller failure

2018-12-13 Thread Christian Balzer
On Thu, 13 Dec 2018 19:44:30 +0100 Ronny Aasen wrote:

> On 13.12.2018 18:19, Alex Gorbachev wrote:
> > On Thu, Dec 13, 2018 at 10:48 AM Dietmar Rieder
> >  wrote:  
> >> Hi Cephers,
> >>
> >> one of our OSD nodes is experiencing a Disk controller problem/failure
> >> (frequent resetting), so the OSDs on this controller are flapping
> >> (up/down in/out).
> >>
> >> I will hopefully get the replacement part soon.
> >>
> >> I have some simple questions: what are the best steps to take now, before
> >> and after replacement of the controller?
> >>
> >> - mark down and shut down all OSDs on that node?
> >> - wait for the rebalance to finish
> >> - replace the controller
> >> - just restart the OSDs, or redeploy them, since they still hold data?
> >>
> >> We are running:
> >>
> >> ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous
> >> (stable)
> >> CentOS 7.5
> >>
> >> Sorry for my naive questions.  
> > I usually do ceph osd set noout first to prevent any recoveries
> >
> > Then replace the hardware and make sure all OSDs come back online
> >
> > Then ceph osd unset noout
> >
> > Best regards,
> > Alex  
> 
> 
> Setting noout prevents the OSDs from being marked out and a re-balance from
> starting, i.e. it is for a short fix where you do not want re-balancing to
> begin, since you know the data will be available again shortly, e.g. after a
> reboot or similar.
> 
> If OSDs are flapping, you normally want them out of the cluster so they do
> not impact performance any more.
> 
I think in this case the question is: how soon is the new controller going
to be there?
If it's soon and/or if rebalancing would severely impact cluster
performance, I'd set noout and then shut the node down, which stops the
flapping and prevents any data movement.
Of course, if the repair is a long way off and/or the cluster is small (is
there even enough space to rebalance a node's worth of data?), things may
be different.

I always set "mon_osd_down_out_subtree_limit = host" (and monitor things
of course) since I reckon a down node can often be brought back way faster
than a full rebalance.
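For completeness, a rough sketch of both parts; the systemd target name
assumes a systemd-managed deployment such as the OP's CentOS 7:

  # in ceph.conf on the monitor nodes
  [mon]
  mon osd down out subtree limit = host

  # before the controller swap
  ceph osd set noout
  # on the affected node
  systemctl stop ceph-osd.target
  shutdown -h now
  # after the swap, once the node is back and all OSDs are up again
  ceph -s
  ceph osd unset noout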

Regards,

Christian
> 
> kind regards
> 
> Ronny Aasen
> 
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications


Re: [ceph-users] disk controller failure

2018-12-13 Thread Ronny Aasen

On 13.12.2018 18:19, Alex Gorbachev wrote:

On Thu, Dec 13, 2018 at 10:48 AM Dietmar Rieder
 wrote:

Hi Cephers,

one of our OSD nodes is experiencing a Disk controller problem/failure
(frequent resetting), so the OSDs on this controller are flapping
(up/down in/out).

I will hopefully get the replacement part soon.

I have some simple questions: what are the best steps to take now, before
and after replacement of the controller?

- mark down and shut down all OSDs on that node?
- wait for the rebalance to finish
- replace the controller
- just restart the OSDs, or redeploy them, since they still hold data?

We are running:

ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous
(stable)
CentOS 7.5

Sorry for my naive questions.

I usually do ceph osd set noout first to prevent any recoveries

Then replace the hardware and make sure all OSDs come back online

Then ceph osd unset noout

Best regards,
Alex



Setting noout prevents the OSDs from being marked out and a re-balance from
starting, i.e. it is for a short fix where you do not want re-balancing to
begin, since you know the data will be available again shortly, e.g. after a
reboot or similar.


If OSDs are flapping, you normally want them out of the cluster so they do
not impact performance any more.
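A sketch of taking flapping OSDs out by hand (the osd ids 12 and 13 are just
examples; use whatever "ceph osd tree" lists for that controller):

  ceph osd out 12
  ceph osd out 13

Once the controller is fixed and the daemons are stable again, "ceph osd in 12"
(etc.) brings them back in.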



kind regards

Ronny Aasen




Re: [ceph-users] disk controller failure

2018-12-13 Thread Alex Gorbachev
On Thu, Dec 13, 2018 at 10:48 AM Dietmar Rieder
 wrote:
>
> Hi Cephers,
>
> one of our OSD nodes is experiencing a Disk controller problem/failure
> (frequent resetting), so the OSDs on this controller are flapping
> (up/down in/out).
>
> I will hopefully get the replacement part soon.
>
> I have some simple questions: what are the best steps to take now, before
> and after replacement of the controller?
>
> - mark down and shut down all OSDs on that node?
> - wait for the rebalance to finish
> - replace the controller
> - just restart the OSDs, or redeploy them, since they still hold data?
>
> We are running:
>
> ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous
> (stable)
> CentOS 7.5
>
> Sorry for my naive questions.

I usually do ceph osd set noout first, to prevent any recovery.

Then replace the hardware and make sure all OSDs come back online.

Then ceph osd unset noout.
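Roughly, as a sketch (on a systemd-based node like yours):

  ceph osd set noout
  # replace the controller / reboot the node as needed
  ceph osd tree          # check that all OSDs show up/in again
  ceph osd unset noout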

Best regards,
Alex



>
> Thanks for any help
>   Dietmar
>
> --
> _
> D i e t m a r  R i e d e r, Mag.Dr.
> Innsbruck Medical University
> Biocenter - Division for Bioinformatics
> Innrain 80, 6020 Innsbruck
> Email: dietmar.rie...@i-med.ac.at
> Web:   http://www.icbi.at
>
>


Re: [ceph-users] disk controller failure

2018-12-13 Thread Matthew Vernon

Hi,

On 13/12/2018 16:44, Dietmar Rieder wrote:


So you say that there will be no problem when, after the rebalancing, I
restart the stopped OSDs? I mean, they still have the data on them.
(Sorry, I just don't want to mess something up)


It should be fine[0]; when the OSDs come back in, Ceph will know what to
do with them.


Regards,

Matthew

[0] this consultancy is worth what you paid for it ;-)






Re: [ceph-users] disk controller failure

2018-12-13 Thread Dietmar Rieder
Hi Matthew,

thanks for your reply and advice, I really appreciate it.

So you say that there will be no problem when, after the rebalancing, I
restart the stopped OSDs? I mean, they still have the data on them.
(Sorry, I just don't want to mess something up)

Best
  Dietmar

On 12/13/18 5:11 PM, Matthew Vernon wrote:
> Hi,
> 
> On 13/12/2018 15:48, Dietmar Rieder wrote:
> 
>> one of our OSD nodes is experiencing a Disk controller problem/failure
>> (frequent resetting), so the OSDs on this controller are flapping
>> (up/down in/out).
> 
> Ah, hardware...
> 
>> I have some simple questions: what are the best steps to take now, before
>> and after replacement of the controller?
> 
> I would stop all the OSDs on the affected node and let the cluster
> rebalance. Once you've replaced the disk controller, start them up again
> and Ceph will rebalance back again.
> 
> Regards,
> 
> Matthew
> 
> 


-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Phone: +43 512 9003 71402
Fax: +43 512 9003 73100
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at



Re: [ceph-users] disk controller failure

2018-12-13 Thread Matthew Vernon

Hi,

On 13/12/2018 15:48, Dietmar Rieder wrote:


one of our OSD nodes is experiencing a Disk controller problem/failure
(frequent resetting), so the OSDs on this controller are flapping
(up/down in/out).


Ah, hardware...


I have some simple questions: what are the best steps to take now, before
and after replacement of the controller?


I would stop all the OSDs on the affected node and let the cluster 
rebalance. Once you've replaced the disk controller, start them up again 
and Ceph will rebalance back again.
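As a sketch (the osd ids are just examples; use whatever "ceph osd tree"
lists under that host):

  # on the affected node
  systemctl stop ceph-osd@21 ceph-osd@22 ceph-osd@23
  # watch the rebalance until the cluster is healthy again
  ceph -w
  # after the controller replacement
  systemctl start ceph-osd@21 ceph-osd@22 ceph-osd@23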


Regards,

Matthew




[ceph-users] disk controller failure

2018-12-13 Thread Dietmar Rieder
Hi Cephers,

one of our OSD nodes is experiencing a Disk controller problem/failure
(frequent resetting), so the OSDs on this controller are flapping
(up/down in/out).

I will hopefully get the replacement part soon.

I have some simple questions: what are the best steps to take now, before
and after replacement of the controller?

- mark down and shut down all OSDs on that node?
- wait for the rebalance to finish
- replace the controller
- just restart the OSDs, or redeploy them, since they still hold data?

We are running:

ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous
(stable)
CentOS 7.5

Sorry for my naive questions.

Thanks for any help
  Dietmar

-- 
_
D i e t m a r  R i e d e r, Mag.Dr.
Innsbruck Medical University
Biocenter - Division for Bioinformatics
Innrain 80, 6020 Innsbruck
Email: dietmar.rie...@i-med.ac.at
Web:   http://www.icbi.at



