Re: [ceph-users] Inconsistent PG's, repair ineffective

2013-05-22 Thread David Zafman

You need to find out where the third copy is, deliberately corrupt it (so repair 
will not trust it), and then let repair copy the data from a good copy.

$ ceph pg map 19.1b

You should see something like this:
osdmap e158 pg 19.1b (19.1b) -> up [13, 22, xx] acting [13, 22, xx]

The osd xx that is NOT 13 or 22 has the corrupted copy. Connect to the node 
that hosts that osd.
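As a minimal sketch, the unknown third OSD id can be pulled out of the `ceph pg map` output with standard text tools. The sample line below is illustrative only; the id 5 is a made-up stand-in for "xx", and 13/22 are the known-good OSDs from the log:

```shell
# Extract the acting set from a 'ceph pg map'-style line and drop the
# two OSDs (13 and 22) whose copies agree; what remains is the suspect.
line='osdmap e158 pg 19.1b (19.1b) -> up [13, 22, 5] acting [13, 22, 5]'
bad_osd=$(echo "$line" \
  | sed 's/.*acting \[\([^]]*\)\].*/\1/' \
  | tr ',' '\n' | tr -d ' ' \
  | grep -v -x -e 13 -e 22)
echo "$bad_osd"   # prints 5 for this sample line
```

On a live cluster you would feed the real `ceph pg map 19.1b` output through the same pipeline instead of the hard-coded sample.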

In the mount point for osd xx, find your object named 
"rb.0.6989.2ae8944a.005b":

$ find /var/lib/ceph/osd/ceph-xx -name 'rb.0.6989.2ae8944a.005b*' -ls
2013266124 -rw-r--r--   1 root root  255 May 22 14:11 
/var/lib/ceph/osd/ceph-xx/current/19.1b_head/rb.0.6989.2ae8944a.005b__head___0

I would stop osd xx first. In this case we find the file is 255 bytes long. 
To make sure this bad copy isn't used, let's make the file 1 byte longer so its 
size no longer matches the good copies.

$ truncate -s 256 
/var/lib/ceph/osd/ceph-xx/current/19.1b_head/rb.0.6989.2ae8944a.005b__head___0
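To see why this works, here is a throwaway demonstration of the truncate step on a scratch file (NOT a real OSD object): growing the file by one byte guarantees a size mismatch, so repair will overwrite this copy rather than trust it.

```shell
# Scratch-file demo of the truncate trick; paths here are temporary.
tmp=$(mktemp)
head -c 255 /dev/zero > "$tmp"   # stand-in for the 255-byte bad copy
truncate -s 256 "$tmp"           # one byte longer than any good copy
size=$(stat -c %s "$tmp")        # GNU stat; reports the new size, 256
echo "$size"
rm -f "$tmp"
```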

Restart osd xx. I'm not sure what command does that on your platform.

Verify that all OSDs are running. The output below shows all osds up and in.
$ ceph -s | grep osdmap
osdmap e6: 6 osds: 6 up, 6 in

$ ceph osd repair 19.1b
instructing pg 19.1b on osd.13 to repair
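If several PGs remain inconsistent, the repair command can be scripted. This is a hypothetical batch sketch: the `$health` text is a made-up sample standing in for `ceph health detail` output, and the commands are only echoed (a dry run) rather than actually invoking ceph.

```shell
# Build one 'ceph pg repair' command per inconsistent PG found in
# health-detail-style output; field 2 of each matching line is the PG id.
health='pg 19.1b is active+clean+inconsistent, acting [13,22,5]
pg 19.2f is active+clean+inconsistent, acting [4,9,17]'
cmds=$(echo "$health" | awk '/inconsistent/ {print "ceph pg repair " $2}')
echo "$cmds"
```

Dropping the intermediate variable and piping the awk output to `sh` would execute the repairs for real; only do that after the bad replicas have been dealt with as above.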


David Zafman
Senior Developer
http://www.inktank.com




On May 21, 2013, at 3:39 PM, John Nielsen  wrote:

> I've checked, all the disks are fine and the cluster is healthy except for 
> the inconsistent objects.
> 
> How would I go about manually repairing?
> 
> On May 21, 2013, at 3:26 PM, David Zafman  wrote:
> 
>> 
>> I can't reproduce this on v0.61-2.  Could the disks for osd.13 & osd.22 be 
>> unwritable?
>> 
>> In your case it looks like the 3rd replica is probably the bad one, since 
>> osd.13 and osd.22 are the same.  You probably want to manually repair the 
>> 3rd replica.
>> 
>> David Zafman
>> Senior Developer
>> http://www.inktank.com
>> 
>> 
>> 
>> 
>> On May 21, 2013, at 6:45 AM, John Nielsen  wrote:
>> 
>>> Cuttlefish on CentOS 6, ceph-0.61.2-0.el6.x86_64.
>>> 
>>> On May 21, 2013, at 12:13 AM, David Zafman  wrote:
>>> 
 
 What version of ceph are you running?
 
 David Zafman
 Senior Developer
 http://www.inktank.com
 
 On May 20, 2013, at 9:14 AM, John Nielsen  wrote:
 
> Some scrub errors showed up on our cluster last week. We had some issues 
> with host stability a couple weeks ago; my guess is that errors were 
> introduced at that point and a recent background scrub detected them. I 
> was able to clear most of them via "ceph pg repair", but several remain. 
> Based on some other posts, I'm guessing that they won't repair because it 
> is the primary copy that has the error. All of our pools are set to size 
> 3 so there _ought_ to be a way to verify and restore the correct data, 
> right?
> 
> Below is some log output about one of the problem PG's. Can anyone 
> suggest a way to fix the inconsistencies?
> 
> 2013-05-20 10:07:54.529582 osd.13 10.20.192.111:6818/20919 3451 : [ERR] 
> 19.1b osd.13: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 
> 4289025870 != known digest 4190506501
> 2013-05-20 10:07:54.529585 osd.13 10.20.192.111:6818/20919 3452 : [ERR] 
> 19.1b osd.22: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 
> 4289025870 != known digest 4190506501
> 2013-05-20 10:07:54.606034 osd.13 10.20.192.111:6818/20919 3453 : [ERR] 
> 19.1b repair 0 missing, 1 inconsistent objects
> 2013-05-20 10:07:54.606066 osd.13 10.20.192.111:6818/20919 3454 : [ERR] 
> 19.1b repair 2 errors, 2 fixed
> 2013-05-20 10:07:55.034221 osd.13 10.20.192.111:6818/20919 3455 : [ERR] 
> 19.1b osd.13: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 
> 4289025870 != known digest 4190506501
> 2013-05-20 10:07:55.034224 osd.13 10.20.192.111:6818/20919 3456 : [ERR] 
> 19.1b osd.22: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 
> 4289025870 != known digest 4190506501
> 2013-05-20 10:07:55.113230 osd.13 10.20.192.111:6818/20919 3457 : [ERR] 
> 19.1b deep-scrub 0 missing, 1 inconsistent objects
> 2013-05-20 10:07:55.113235 osd.13 10.20.192.111:6818/20919 3458 : [ERR] 
> 19.1b deep-scrub 2 errors
> 
> Thanks,
> 
> JN
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
 
>>> 
>> 
>> 
> 



Re: [ceph-users] Inconsistent PG's, repair ineffective

2013-05-21 Thread John Nielsen
I've checked, all the disks are fine and the cluster is healthy except for the 
inconsistent objects.

How would I go about manually repairing?

On May 21, 2013, at 3:26 PM, David Zafman  wrote:

> 
> I can't reproduce this on v0.61-2.  Could the disks for osd.13 & osd.22 be 
> unwritable?
> 
> In your case it looks like the 3rd replica is probably the bad one, since 
> osd.13 and osd.22 are the same.  You probably want to manually repair the 3rd 
> replica.
> 
> David Zafman
> Senior Developer
> http://www.inktank.com
> 
> 
> 
> 
> On May 21, 2013, at 6:45 AM, John Nielsen  wrote:
> 
>> Cuttlefish on CentOS 6, ceph-0.61.2-0.el6.x86_64.
>> 
>> On May 21, 2013, at 12:13 AM, David Zafman  wrote:
>> 
>>> 
>>> What version of ceph are you running?
>>> 
>>> David Zafman
>>> Senior Developer
>>> http://www.inktank.com
>>> 
>>> On May 20, 2013, at 9:14 AM, John Nielsen  wrote:
>>> 
 Some scrub errors showed up on our cluster last week. We had some issues 
 with host stability a couple weeks ago; my guess is that errors were 
 introduced at that point and a recent background scrub detected them. I 
 was able to clear most of them via "ceph pg repair", but several remain. 
 Based on some other posts, I'm guessing that they won't repair because it 
 is the primary copy that has the error. All of our pools are set to size 3 
 so there _ought_ to be a way to verify and restore the correct data, right?
 
 Below is some log output about one of the problem PG's. Can anyone suggest 
 a way to fix the inconsistencies?
 
 2013-05-20 10:07:54.529582 osd.13 10.20.192.111:6818/20919 3451 : [ERR] 
 19.1b osd.13: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 
 4289025870 != known digest 4190506501
 2013-05-20 10:07:54.529585 osd.13 10.20.192.111:6818/20919 3452 : [ERR] 
 19.1b osd.22: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 
 4289025870 != known digest 4190506501
 2013-05-20 10:07:54.606034 osd.13 10.20.192.111:6818/20919 3453 : [ERR] 
 19.1b repair 0 missing, 1 inconsistent objects
 2013-05-20 10:07:54.606066 osd.13 10.20.192.111:6818/20919 3454 : [ERR] 
 19.1b repair 2 errors, 2 fixed
 2013-05-20 10:07:55.034221 osd.13 10.20.192.111:6818/20919 3455 : [ERR] 
 19.1b osd.13: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 
 4289025870 != known digest 4190506501
 2013-05-20 10:07:55.034224 osd.13 10.20.192.111:6818/20919 3456 : [ERR] 
 19.1b osd.22: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 
 4289025870 != known digest 4190506501
 2013-05-20 10:07:55.113230 osd.13 10.20.192.111:6818/20919 3457 : [ERR] 
 19.1b deep-scrub 0 missing, 1 inconsistent objects
 2013-05-20 10:07:55.113235 osd.13 10.20.192.111:6818/20919 3458 : [ERR] 
 19.1b deep-scrub 2 errors
 
 Thanks,
 
 JN
 
>>> 
>>> 
>> 
> 
> 



Re: [ceph-users] Inconsistent PG's, repair ineffective

2013-05-21 Thread David Zafman

I can't reproduce this on v0.61-2.  Could the disks for osd.13 & osd.22 be 
unwritable?

In your case it looks like the 3rd replica is probably the bad one, since 
osd.13 and osd.22 are the same.  You probably want to manually repair the 3rd 
replica.

David Zafman
Senior Developer
http://www.inktank.com




On May 21, 2013, at 6:45 AM, John Nielsen  wrote:

> Cuttlefish on CentOS 6, ceph-0.61.2-0.el6.x86_64.
> 
> On May 21, 2013, at 12:13 AM, David Zafman  wrote:
> 
>> 
>> What version of ceph are you running?
>> 
>> David Zafman
>> Senior Developer
>> http://www.inktank.com
>> 
>> On May 20, 2013, at 9:14 AM, John Nielsen  wrote:
>> 
>>> Some scrub errors showed up on our cluster last week. We had some issues 
>>> with host stability a couple weeks ago; my guess is that errors were 
>>> introduced at that point and a recent background scrub detected them. I was 
>>> able to clear most of them via "ceph pg repair", but several remain. Based 
>>> on some other posts, I'm guessing that they won't repair because it is the 
>>> primary copy that has the error. All of our pools are set to size 3 so 
>>> there _ought_ to be a way to verify and restore the correct data, right?
>>> 
>>> Below is some log output about one of the problem PG's. Can anyone suggest 
>>> a way to fix the inconsistencies?
>>> 
>>> 2013-05-20 10:07:54.529582 osd.13 10.20.192.111:6818/20919 3451 : [ERR] 
>>> 19.1b osd.13: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 
>>> 4289025870 != known digest 4190506501
>>> 2013-05-20 10:07:54.529585 osd.13 10.20.192.111:6818/20919 3452 : [ERR] 
>>> 19.1b osd.22: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 
>>> 4289025870 != known digest 4190506501
>>> 2013-05-20 10:07:54.606034 osd.13 10.20.192.111:6818/20919 3453 : [ERR] 
>>> 19.1b repair 0 missing, 1 inconsistent objects
>>> 2013-05-20 10:07:54.606066 osd.13 10.20.192.111:6818/20919 3454 : [ERR] 
>>> 19.1b repair 2 errors, 2 fixed
>>> 2013-05-20 10:07:55.034221 osd.13 10.20.192.111:6818/20919 3455 : [ERR] 
>>> 19.1b osd.13: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 
>>> 4289025870 != known digest 4190506501
>>> 2013-05-20 10:07:55.034224 osd.13 10.20.192.111:6818/20919 3456 : [ERR] 
>>> 19.1b osd.22: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 
>>> 4289025870 != known digest 4190506501
>>> 2013-05-20 10:07:55.113230 osd.13 10.20.192.111:6818/20919 3457 : [ERR] 
>>> 19.1b deep-scrub 0 missing, 1 inconsistent objects
>>> 2013-05-20 10:07:55.113235 osd.13 10.20.192.111:6818/20919 3458 : [ERR] 
>>> 19.1b deep-scrub 2 errors
>>> 
>>> Thanks,
>>> 
>>> JN
>>> 
>> 
>> 
> 



Re: [ceph-users] Inconsistent PG's, repair ineffective

2013-05-21 Thread John Nielsen
Cuttlefish on CentOS 6, ceph-0.61.2-0.el6.x86_64.

On May 21, 2013, at 12:13 AM, David Zafman  wrote:

> 
> What version of ceph are you running?
> 
> David Zafman
> Senior Developer
> http://www.inktank.com
> 
> On May 20, 2013, at 9:14 AM, John Nielsen  wrote:
> 
>> Some scrub errors showed up on our cluster last week. We had some issues 
>> with host stability a couple weeks ago; my guess is that errors were 
>> introduced at that point and a recent background scrub detected them. I was 
>> able to clear most of them via "ceph pg repair", but several remain. Based 
>> on some other posts, I'm guessing that they won't repair because it is the 
>> primary copy that has the error. All of our pools are set to size 3 so there 
>> _ought_ to be a way to verify and restore the correct data, right?
>> 
>> Below is some log output about one of the problem PG's. Can anyone suggest a 
>> way to fix the inconsistencies?
>> 
>> 2013-05-20 10:07:54.529582 osd.13 10.20.192.111:6818/20919 3451 : [ERR] 
>> 19.1b osd.13: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 
>> 4289025870 != known digest 4190506501
>> 2013-05-20 10:07:54.529585 osd.13 10.20.192.111:6818/20919 3452 : [ERR] 
>> 19.1b osd.22: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 
>> 4289025870 != known digest 4190506501
>> 2013-05-20 10:07:54.606034 osd.13 10.20.192.111:6818/20919 3453 : [ERR] 
>> 19.1b repair 0 missing, 1 inconsistent objects
>> 2013-05-20 10:07:54.606066 osd.13 10.20.192.111:6818/20919 3454 : [ERR] 
>> 19.1b repair 2 errors, 2 fixed
>> 2013-05-20 10:07:55.034221 osd.13 10.20.192.111:6818/20919 3455 : [ERR] 
>> 19.1b osd.13: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 
>> 4289025870 != known digest 4190506501
>> 2013-05-20 10:07:55.034224 osd.13 10.20.192.111:6818/20919 3456 : [ERR] 
>> 19.1b osd.22: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 
>> 4289025870 != known digest 4190506501
>> 2013-05-20 10:07:55.113230 osd.13 10.20.192.111:6818/20919 3457 : [ERR] 
>> 19.1b deep-scrub 0 missing, 1 inconsistent objects
>> 2013-05-20 10:07:55.113235 osd.13 10.20.192.111:6818/20919 3458 : [ERR] 
>> 19.1b deep-scrub 2 errors
>> 
>> Thanks,
>> 
>> JN
>> 
> 
> 



Re: [ceph-users] Inconsistent PG's, repair ineffective

2013-05-20 Thread David Zafman

What version of ceph are you running?

David Zafman
Senior Developer
http://www.inktank.com

On May 20, 2013, at 9:14 AM, John Nielsen  wrote:

> Some scrub errors showed up on our cluster last week. We had some issues with 
> host stability a couple weeks ago; my guess is that errors were introduced at 
> that point and a recent background scrub detected them. I was able to clear 
> most of them via "ceph pg repair", but several remain. Based on some other 
> posts, I'm guessing that they won't repair because it is the primary copy 
> that has the error. All of our pools are set to size 3 so there _ought_ to be 
> a way to verify and restore the correct data, right?
> 
> Below is some log output about one of the problem PG's. Can anyone suggest a 
> way to fix the inconsistencies?
> 
> 2013-05-20 10:07:54.529582 osd.13 10.20.192.111:6818/20919 3451 : [ERR] 19.1b 
> osd.13: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 4289025870 
> != known digest 4190506501
> 2013-05-20 10:07:54.529585 osd.13 10.20.192.111:6818/20919 3452 : [ERR] 19.1b 
> osd.22: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 4289025870 
> != known digest 4190506501
> 2013-05-20 10:07:54.606034 osd.13 10.20.192.111:6818/20919 3453 : [ERR] 19.1b 
> repair 0 missing, 1 inconsistent objects
> 2013-05-20 10:07:54.606066 osd.13 10.20.192.111:6818/20919 3454 : [ERR] 19.1b 
> repair 2 errors, 2 fixed
> 2013-05-20 10:07:55.034221 osd.13 10.20.192.111:6818/20919 3455 : [ERR] 19.1b 
> osd.13: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 4289025870 
> != known digest 4190506501
> 2013-05-20 10:07:55.034224 osd.13 10.20.192.111:6818/20919 3456 : [ERR] 19.1b 
> osd.22: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 4289025870 
> != known digest 4190506501
> 2013-05-20 10:07:55.113230 osd.13 10.20.192.111:6818/20919 3457 : [ERR] 19.1b 
> deep-scrub 0 missing, 1 inconsistent objects
> 2013-05-20 10:07:55.113235 osd.13 10.20.192.111:6818/20919 3458 : [ERR] 19.1b 
> deep-scrub 2 errors
> 
> Thanks,
> 
> JN
> 



[ceph-users] Inconsistent PG's, repair ineffective

2013-05-20 Thread John Nielsen
Some scrub errors showed up on our cluster last week. We had some issues with 
host stability a couple weeks ago; my guess is that errors were introduced at 
that point and a recent background scrub detected them. I was able to clear 
most of them via "ceph pg repair", but several remain. Based on some other 
posts, I'm guessing that they won't repair because it is the primary copy that 
has the error. All of our pools are set to size 3 so there _ought_ to be a way 
to verify and restore the correct data, right?

Below is some log output about one of the problem PG's. Can anyone suggest a 
way to fix the inconsistencies?

2013-05-20 10:07:54.529582 osd.13 10.20.192.111:6818/20919 3451 : [ERR] 19.1b 
osd.13: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 4289025870 
!= known digest 4190506501
2013-05-20 10:07:54.529585 osd.13 10.20.192.111:6818/20919 3452 : [ERR] 19.1b 
osd.22: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 4289025870 
!= known digest 4190506501
2013-05-20 10:07:54.606034 osd.13 10.20.192.111:6818/20919 3453 : [ERR] 19.1b 
repair 0 missing, 1 inconsistent objects
2013-05-20 10:07:54.606066 osd.13 10.20.192.111:6818/20919 3454 : [ERR] 19.1b 
repair 2 errors, 2 fixed
2013-05-20 10:07:55.034221 osd.13 10.20.192.111:6818/20919 3455 : [ERR] 19.1b 
osd.13: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 4289025870 
!= known digest 4190506501
2013-05-20 10:07:55.034224 osd.13 10.20.192.111:6818/20919 3456 : [ERR] 19.1b 
osd.22: soid 507ada1b/rb.0.6989.2ae8944a.005b/5//19 digest 4289025870 
!= known digest 4190506501
2013-05-20 10:07:55.113230 osd.13 10.20.192.111:6818/20919 3457 : [ERR] 19.1b 
deep-scrub 0 missing, 1 inconsistent objects
2013-05-20 10:07:55.113235 osd.13 10.20.192.111:6818/20919 3458 : [ERR] 19.1b 
deep-scrub 2 errors

Thanks,

JN
