Re: [ceph-users] PG stuck inconsistent, but appears ok?

2017-07-14 Thread Dan van der Ster
You probably have osd_max_scrubs=1, so the PG just isn't getting a
scrub slot. Here's a little trick to get it going right away:

ceph osd set noscrub
ceph osd set nodeep-scrub
ceph tell osd.* injectargs -- --osd_max_scrubs 2
ceph pg deep-scrub 22.1611
... wait until it starts scrubbing ...
ceph tell osd.* injectargs -- --osd_max_scrubs 1
ceph osd unset nodeep-scrub
ceph osd unset noscrub
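
For the "wait until it starts scrubbing" step, something like this can watch
for it (a sketch, assuming the PG from this thread; it greps the state field
in ceph pg query output):

# poll until the PG reports a deep-scrub state
while ! ceph pg 22.1611 query | grep -q 'scrubbing+deep'; do sleep 5; done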

.. Dan




Re: [ceph-users] PG stuck inconsistent, but appears ok?

2017-07-14 Thread Aaron Bassett
I issued the pg deep-scrub command ~24 hours ago and nothing has changed. I see
nothing in the acting primary OSD's log about kicking off the scrub.
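
If it helps to double-check, the PG's stats should show whether a deep scrub
ever ran; a quick sketch (field names as in Jewel-era pg query output):

# compare the deep-scrub timestamp against when the command was issued
ceph pg 22.1611 query | grep -E '"state"|last_deep_scrub_stamp'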



Re: [ceph-users] PG stuck inconsistent, but appears ok?

2017-07-13 Thread David Turner
# ceph pg deep-scrub 22.1611
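
(For contrast with what was run earlier in the thread, a sketch of the two
forms; as I understand it, the pg form targets one placement group by ID,
while the osd form queues scrubs across the OSD's PGs and is throttled by
osd_max_scrubs:)

ceph pg deep-scrub 22.1611   # deep-scrub a single PG by ID
ceph osd deep-scrub 294      # ask osd.294 to deep-scrub its PGs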



Re: [ceph-users] PG stuck inconsistent, but appears ok?

2017-07-13 Thread Aaron Bassett
I'm not sure if I'm doing something wrong, but when I run this:

# ceph osd deep-scrub 294


All I get in the OSD log is:

2017-07-13 16:57:53.782841 7f40d089f700  0 log_channel(cluster) log [INF] : 21.1ae9 deep-scrub starts
2017-07-13 16:57:53.785261 7f40ce09a700  0 log_channel(cluster) log [INF] : 21.1ae9 deep-scrub ok


Each time I run it, it's the same PG.

Is there some reason it's not scrubbing all the PGs?
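
One thing worth checking when only a single scrub ever runs at a time is the
daemon's scrub limit; a sketch via the admin socket, run on the host carrying
osd.294:

# ask the running daemon for its current scrub throttle
ceph daemon osd.294 config get osd_max_scrubs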

Aaron



Re: [ceph-users] PG stuck inconsistent, but appears ok?

2017-07-13 Thread Aaron Bassett
OK, good to hear. I just kicked one off on the acting primary, so I guess I'll
be patient now...

Thanks,
Aaron



Re: [ceph-users] PG stuck inconsistent, but appears ok?

2017-07-13 Thread Dan van der Ster

It should clear up after you trigger another deep-scrub on that PG.
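
A minimal sketch of that, assuming the PG ID reported in ceph health detail:

ceph pg deep-scrub 22.1611
ceph health detail   # the inconsistent flag should clear once the scrub completes cleanly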

Cheers, Dan


[ceph-users] PG stuck inconsistent, but appears ok?

2017-07-13 Thread Aaron Bassett
Good Morning,
I have an odd situation where a PG is listed as inconsistent, but rados is
struggling to tell me about it:

# ceph health detail
HEALTH_ERR 1 pgs inconsistent; 1 requests are blocked > 32 sec; 1 osds have slow requests; 1 scrub errors
pg 22.1611 is active+clean+inconsistent, acting [294,1080,970,324,722,70,949,874,943,606,518]
1 scrub errors

# rados list-inconsistent-pg .us-smr.rgw.buckets
["22.1611"]

# rados list-inconsistent-obj 22.1611
[]
error 2: (2) No such file or directory

A little background: I got into this state because the inconsistent PG popped
up in ceph -s. I used list-inconsistent-obj to find which OSD was causing the
problem:

{
    "osd": 497,
    "missing": false,
    "read_error": true,
    "data_digest_mismatch": false,
    "omap_digest_mismatch": false,
    "size_mismatch": false,
    "size": 599488
},
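
The shard with the error can be pulled out of the full output with something
like this (a sketch; assumes jq is installed and the boolean shard fields
shown above):

rados list-inconsistent-obj 22.1611 --format=json-pretty | \
  jq '.inconsistents[].shards[] | select(.read_error == true)'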


Because it was a read error, I checked SMART stats for that OSD's disk and,
sure enough, it had some uncorrected read errors. To stop it from causing more
problems, I stopped the daemon to let Ceph recover from the other OSDs. The
cluster has now finished rebalancing, but it remains in ERR state, as it still
thinks this PG is inconsistent.

ceph pg query output is here: https://hastebin.com/mamesokexa.cpp

Thanks,
Aaron