On 28. sep. 2017 09:27, Olivier Migeot wrote:
Greetings,

we're in the process of recovering a cluster after an electrical
disaster. It hasn't gone too badly so far: we managed to clear most of
the errors. All that prevents a return to HEALTH_OK now is a bunch (6)
of scrub errors, apparently from a PG that's marked as
active+clean+inconsistent.

Thing is, rados list-inconsistent-obj doesn't return anything but an
empty list (plus, in the most recent attempts: error 2: (2) No such
file or directory).

We're on Jewel (waiting for this to be fixed before planning an
upgrade), and the pool our PG belongs to has a replica size of 2.

No success with ceph pg repair, and I already tried removing and
importing the most recent version of said PG on both its acting OSDs:
it doesn't change a thing.

Is there anything else I could try?

Thanks,
size=2 is of course horrible, and I assume you know that... But even
more important: I hope you have min_size=2, so you avoid generating more
problems in the future, or while troubleshooting.
First of all, read this link a few times:
http://ceph.com/geen-categorie/ceph-manually-repair-object/

You need to locate the bad objects to fix them. Since
rados list-inconsistent-obj does not work, you need to manually check
the logs of the OSDs that are participating in the PG in question: grep
for ERR.
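The log search can look something like the sketch below. The PG id, OSD
id and log line are made-up stand-ins (the real logs live in
/var/log/ceph/ on each OSD host, and the log line here is a mock-up,
not verbatim Ceph output), so the commands can be run as-is:

```shell
pg=2.1f                     # hypothetical PG id; yours comes from `ceph health detail`
logdir=$(mktemp -d)         # stands in for /var/log/ceph on the OSD host

# Mock-up of the kind of line a deep-scrub writes on a mismatch
# (illustrative only, not verbatim Ceph output):
cat > "$logdir/ceph-osd.4.log" <<'EOF'
log [ERR] : 2.1f shard 4: soid ... data_digest mismatch on rbd_data.1234.00000abc
log [INF] : 2.1f deep-scrub starts
EOF

# The search itself: only the ERR lines for the problem PG.
grep 'ERR' "$logdir"/ceph-osd.*.log | grep "$pg"
```

Repeat the grep on every OSD in the PG's acting set; the ERR lines name
the object(s) involved.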
Once you find the name of the problem object, locate the object's file
on disk using find /path/of/pg -name 'objectname'.
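A runnable toy version of that find step. On a Jewel FileStore OSD the
real search root is the PG directory under the OSD's data dir; a
throwaway directory with a fake object file stands in here (real
on-disk file names also use FileStore's own escaping, which this
skips):

```shell
osd_dir=$(mktemp -d)                    # stand-in for the OSD data dir
pg_dir="$osd_dir/current/2.1f_head"     # hypothetical PG directory
mkdir -p "$pg_dir"
touch "$pg_dir/rbd_data.1234.00000abc"  # stand-in object file, made-up name

# A wildcard search finds the object's file, since the file name
# embeds the object name:
find "$osd_dir" -name '*rbd_data.1234*'
```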
Once you have the object paths you need to compare the 2 copies and
figure out which one is the bad one. This is where 3 replicas would
have helped: when one is bad, how do you tell the bad from the good...

The error message in the log may give hints. Read and understand what
the error message says, since it is critical to understanding what is
wrong with the object.

The object's type also helps when determining the wrong one: is it a
plain RADOS object, an RBD block, or a CephFS metadata or data object?
Knowing what it should be helps determine the wrong one.
Things to try:

ls -lh $path ; compare the metadata. Are there obvious problems? Refer
to the error in the log.
- Does one have size 0 when there should have been a size?
- Does one have a size greater than 0 when it should have been size 0?
- Is one significantly larger than the other? Perhaps one is truncated,
or perhaps one has garbage appended.

md5sum $path
- Perhaps a block has a read error; it would show up on this command
and be a dead giveaway for the problem object.
- Compare checksums. Do you know what sum the object should have?

Actually look at the object: use strings or hexdump to try to determine
the contents, versus what the object should contain.
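The checklist above boils down to a handful of standard tools. A
runnable toy version, with throwaway files standing in for the two
replica paths (here copy_b has garbage appended, so every check flags
it):

```shell
copy_a=$(mktemp); copy_b=$(mktemp)
printf 'expected contents'              > "$copy_a"
printf 'expected contents plus garbage' > "$copy_b"

ls -lh "$copy_a" "$copy_b"       # sizes differ -> truncation or appended junk?
md5sum "$copy_a" "$copy_b"       # checksums differ -> contents diverge
cmp "$copy_a" "$copy_b" || true  # cmp points at the first differing byte
hexdump -C "$copy_b" | tail -n 3 # eyeball the tail for garbage
```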
If you can locate the bad object: stop the OSD, flush its journal, move
the bad object away (I just mv it somewhere else), and restart the OSD.
Then run repair on the PG, tail the logs, and wait for the repair and
scrub to finish.
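That sequence as commands (not runnable outside a cluster). Everything
here is a placeholder to fill in from your own setup: osd.4, PG 2.1f
and OBJECTFILE are examples, and the paths assume a default Jewel
FileStore layout. Run it on the OSD that holds the copy you judged bad:

```shell
systemctl stop ceph-osd@4      # stop the OSD cleanly
ceph-osd -i 4 --flush-journal  # flush its journal while it is down
# park the bad object somewhere safe instead of deleting it:
mv /var/lib/ceph/osd/ceph-4/current/2.1f_head/OBJECTFILE /root/
systemctl start ceph-osd@4
ceph pg repair 2.1f            # then tail the logs until repair and scrub finish
```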
--
If you are unable to tell the good object from the bad, you can try to
determine what file it refers to in CephFS, or what block it refers to
in RBD. By overwriting that file or block through CephFS or RBD you can
indirectly overwrite both copies with new data.

If this is an RBD, you should run a filesystem check on the filesystem
on that RBD after all the Ceph problems are repaired.
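For the RBD case, a sketch of tying an object back to its image (not
runnable without a cluster, and the pool/image names are made up). RBD
data objects are named rbd_data.<prefix>.<offset>, and `rbd info`
shows each image's prefix:

```shell
rbd info rbd/myimage | grep block_name_prefix
# e.g. "block_name_prefix: rbd_data.1234" -- if that prefix matches the
# bad object's name, the object belongs to this image. Rewriting the
# matching offset of the image (or just running fsck in the guest)
# pushes fresh data through the normal write path and replaces both
# replicas.
```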
good luck
Ronny Aasen
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com