Re: [ceph-users] Gracefully reboot OSD node

2017-08-03 Thread Wido den Hollander

> On 3 August 2017 at 14:14, Hans van den Bogert wrote:
> 
> Thanks for answering even before I asked the questions :)
> 
> So bottom line, a HEALTH_ERR state is simply part of taking a (bunch of)
> OSDs down? Is a HEALTH_ERR period of 2-4 seconds within normal bounds? For
> context, we run one 2609v3 CPU per 4 OSDs. (I know; they're far from the
> fastest CPUs)
> 

Yes. Prior to Jewel, Ceph wouldn't go to ERR if PGs were inactive (peering 
and down are both inactive states); it would just stay in WARN, which implies 
nothing is really wrong.

You can influence this behavior with mon_pg_min_inactive. It defaults to 1 
and controls how many PGs need to be inactive before health goes to ERR. But 
raising it merely suppresses the error.
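If you only want the reporting to stay at WARN, a minimal sketch of raising 
that threshold (the value 10 is purely an example; pick what fits your 
cluster, and remember this changes reporting only, not peering speed):

    # persistent, in ceph.conf on the monitors:
    [mon]
    mon pg min inactive = 10

    # or injected at runtime, without a restart:
    ceph tell mon.* injectargs '--mon_pg_min_inactive 10'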

A 1.9 GHz CPU isn't the fastest indeed, and most of the peering work is 
single-threaded, so yes, this behavior is normal. Faster CPUs would reduce 
this time.

Still, 2 to 4 seconds isn't that bad.

Wido



Re: [ceph-users] Gracefully reboot OSD node

2017-08-03 Thread Hans van den Bogert
Thanks for answering even before I asked the questions :)

So bottom line, a HEALTH_ERR state is simply part of taking a (bunch of)
OSDs down? Is a HEALTH_ERR period of 2-4 seconds within normal bounds? For
context, we run one 2609v3 CPU per 4 OSDs. (I know; they're far from the
fastest CPUs)



Re: [ceph-users] Gracefully reboot OSD node

2017-08-03 Thread Hans van den Bogert
What are the implications of this? Because I can see a lot of blocked
requests piling up when using 'noout' and 'nodown'. That probably makes
sense though.
Another thing: now, when the OSDs come back online, I again see multiple
periods of HEALTH_ERR state. Is that to be expected?

On Thu, Aug 3, 2017 at 1:36 PM, linghucongsong wrote:

> set the osd noout nodown


Re: [ceph-users] Gracefully reboot OSD node

2017-08-03 Thread Wido den Hollander

> On 3 August 2017 at 13:36, linghucongsong wrote:
> 
> set the osd noout nodown

While noout is correct and might help in some situations, never set nodown 
unless you really need it. With nodown set, OSDs you take down are never 
marked as down, so clients keep directing I/O at them and that I/O blocks.
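For a planned reboot, the usual sequence is therefore noout only. A minimal 
sketch with the standard CLI:

    # before rebooting: prevent down OSDs from being marked out
    # (avoids needless rebalancing while the node is gone)
    ceph osd set noout

    # reboot the node, wait for its OSDs to rejoin the cluster

    # afterwards: remove the flag again
    ceph osd unset noout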

In Hans's case the 'problem' is that the HEALTH_ERR is correct. Since Jewel, 
Ceph's health will go to ERR as soon as PGs are not active.

When you take down a node, its PGs will re-peer, and during that time no I/O 
can be performed on those PGs; that is an ERR state.

Peering can be done faster by having higher-clocked CPUs, but there will 
always be a short moment where I/O blocks for a set of PGs.
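If you want to see that window for yourself during a reboot, the standard 
status commands show it:

    # overall health plus a one-line PG state summary
    ceph -s

    # list the PGs currently stuck in an inactive state (e.g. peering)
    ceph pg dump_stuck inactive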

Wido



Re: [ceph-users] Gracefully reboot OSD node

2017-08-03 Thread linghucongsong


set the osd noout nodown


[ceph-users] Gracefully reboot OSD node

2017-08-03 Thread Hans van den Bogert
Hi all,

One thing which has bothered me since I began using Ceph is that a
reboot of a single OSD causes a HEALTH_ERR state for the cluster for at
least a couple of seconds.

In the case of a planned reboot of an OSD node, should I run some extra
commands in order not to go into the HEALTH_ERR state?

Thanks,

Hans
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com