Thanks for the suggestion about restarting the OSDs, but this doesn't work either.
Anyway, I managed to fix the second unrepairable PG by getting the object from an OSD and saving it again via rados, but still no luck with the first one. I think I found the main reason why this doesn't work: the object does not actually get overwritten, even though the rados command returns no errors. I tried to delete the object, but it stays in the pool untouched. Here is an example of what I see:

# rados -p .rgw.buckets ls | grep -i "sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d"
default.142609570.87_20180203.020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d

# rados -p .rgw.buckets get default.142609570.87_20180203.020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d testfile
error getting .rgw.buckets/default.142609570.87_20180203.020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d: (2) No such file or directory

# rados -p .rgw.buckets rm default.142609570.87_20180203.020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d

# rados -p .rgw.buckets ls | grep -i "sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d"
default.142609570.87_20180203.020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d

I've never seen this in our Ceph clusters before. Should I report a bug about it? If any of you need more diagnostic info, let me know.
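Since the workaround involved pulling the object file straight out of each OSD's filestore and re-uploading it via rados, it can be worth first confirming that the two replica copies really are byte-identical (the scrub report says both shards carry the same digest, so they should be). A small stdlib-only helper; the file paths in the usage comment are hypothetical, not from this thread:

```python
import hashlib

def file_digest(path: str, algo: str = "sha256", chunk: int = 1 << 20) -> str:
    """Hash a file in chunks so a multi-MB RADOS object copy need not fit in memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def replicas_match(path_a: str, path_b: str) -> bool:
    """True when two files extracted from different OSDs carry identical bytes."""
    return file_digest(path_a) == file_digest(path_b)

# Hypothetical usage, after copying the object file out of each OSD:
#   replicas_match("/tmp/obj_from_osd30", "/tmp/obj_from_osd36")
```

If the copies differ, that would tell you which replica to prefer before re-uploading; if they match (as the identical shard digests suggest), the problem is in the recorded object_info rather than the data itself.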
Thanks,
Arvydas

On Tue, Aug 7, 2018 at 5:49 PM, Brent Kennedy <bkenn...@cfl.rr.com> wrote:

> Last time I had an inconsistent PG that could not be repaired using the
> repair command, I looked at which OSDs hosted the PG, then restarted them
> one by one (usually stopping, waiting a few seconds, then starting them
> back up). You could also stop them, flush the journal, then start them
> back up.
>
> If that didn't work, it meant there was data loss and I had to use
> ceph-objectstore-tool to export the objects from a location that had the
> latest data and import them into the one that had no data.
> ceph-objectstore-tool is not a simple thing though and should not be used
> lightly. When I say data loss, I mean that Ceph thinks the last place
> written has the data, that place being the OSD that doesn't actually have
> the data (meaning it failed to write there).
>
> If you want to go that route, let me know; I wrote a how-to on it. It
> should be the last resort though. I also don't know your setup, so I
> would hate to recommend something so drastic.
>
> -Brent
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *Arvydas Opulskis
> *Sent:* Monday, August 6, 2018 4:12 AM
> *To:* ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] Inconsistent PG could not be repaired
>
> Hi again,
>
> after two weeks I've got another inconsistent PG in the same cluster. The
> OSDs are different from the first PG, and the object cannot be GET either:
>
> # rados list-inconsistent-obj 26.821 --format=json-pretty
> {
>     "epoch": 178472,
>     "inconsistents": [
>         {
>             "object": {
>                 "name": "default.122888368.52__shadow_.3ubGZwLcz0oQ55-LTb7PCOTwKkv-nQf_7",
>                 "nspace": "",
>                 "locator": "",
>                 "snap": "head",
>                 "version": 118920
>             },
>             "errors": [],
>             "union_shard_errors": [
>                 "data_digest_mismatch_oi"
>             ],
>             "selected_object_info": "26:8411bae4:::default.122888368.52__shadow_.3ubGZwLcz0oQ55-LTb7PCOTwKkv-nQf_7:head(126495'118920 client.142609570.0:41412640 dirty|data_digest|omap_digest s 4194304 uv 118920 dd cd142aaa od ffffffff alloc_hint [0 0])",
>             "shards": [
>                 {
>                     "osd": 20,
>                     "errors": [
>                         "data_digest_mismatch_oi"
>                     ],
>                     "size": 4194304,
>                     "omap_digest": "0xffffffff",
>                     "data_digest": "0x6b102e59"
>                 },
>                 {
>                     "osd": 44,
>                     "errors": [
>                         "data_digest_mismatch_oi"
>                     ],
>                     "size": 4194304,
>                     "omap_digest": "0xffffffff",
>                     "data_digest": "0x6b102e59"
>                 }
>             ]
>         }
>     ]
> }
>
> # rados -p .rgw.buckets get default.122888368.52__shadow_.3ubGZwLcz0oQ55-LTb7PCOTwKkv-nQf_7 test_2pg.file
> error getting .rgw.buckets/default.122888368.52__shadow_.3ubGZwLcz0oQ55-LTb7PCOTwKkv-nQf_7: (5) Input/output error
>
> Still struggling with how to solve it. Any ideas, guys?
>
> Thank you
>
> On Tue, Jul 24, 2018 at 10:27 AM, Arvydas Opulskis <zebedie...@gmail.com>
> wrote:
>
> Hello, Cephers,
>
> after trying different repair approaches I am out of ideas on how to
> repair this inconsistent PG. I hope someone's sharp eye will notice what
> I overlooked.
>
> Some info about the cluster:
> CentOS 7.4
> Jewel 10.2.10
> Pool size 2 (yes, I know it's a very bad choice)
> Pool with the inconsistent PG: .rgw.buckets
>
> After a routine deep-scrub I found PG 26.c3f in inconsistent status.
> While running "ceph pg repair 26.c3f" and monitoring the "ceph -w" log,
> I noticed these errors:
>
> 2018-07-24 08:28:06.517042 osd.36 [ERR] 26.c3f shard 30: soid 26:fc32a1f1:::default.142609570.87_20180206.093111%2frepositories%2fnuget-local%2fApplication%2fCompany.Application.Api%2fCompany.Application.Api.1.1.1.nupkg.artifactory-metadata%2fproperties.xml:head data_digest 0x540e4f8b != data_digest 0x49a34c1f from auth oi 26:e261561a:::default.168602061.10_team-xxx.xxx-jobs.H6.HADOOP.data-segmentation.application.131.xxx-jvm.cpu.load%2f2018-05-05T03%3a51%3a39+00%3a00.sha1:head(167828'216051 client.179334015.0:1847715760 dirty|data_digest|omap_digest s 40 uv 216051 dd 49a34c1f od ffffffff alloc_hint [0 0])
>
> 2018-07-24 08:28:06.517118 osd.36 [ERR] 26.c3f shard 36: soid 26:fc32a1f1:::default.142609570.87_20180206.093111%2frepositories%2fnuget-local%2fApplication%2fCompany.Application.Api%2fCompany.Application.Api.1.1.1.nupkg.artifactory-metadata%2fproperties.xml:head data_digest 0x540e4f8b != data_digest 0x49a34c1f from auth oi 26:e261561a:::default.168602061.10_team-xxx.xxx-jobs.H6.HADOOP.data-segmentation.application.131.xxx-jvm.cpu.load%2f2018-05-05T03%3a51%3a39+00%3a00.sha1:head(167828'216051 client.179334015.0:1847715760 dirty|data_digest|omap_digest s 40 uv 216051 dd 49a34c1f od ffffffff alloc_hint [0 0])
>
> 2018-07-24 08:28:06.517122 osd.36 [ERR] 26.c3f soid 26:fc32a1f1:::default.142609570.87_20180206.093111%2frepositories%2fnuget-local%2fApplication%2fCompany.Application.Api%2fCompany.Application.Api.1.1.1.nupkg.artifactory-metadata%2fproperties.xml:head: failed to pick suitable auth object
>
> ...and the same errors about another object in the same PG.
>
> Repair failed, so I checked the inconsistencies with "rados
> list-inconsistent-obj 26.c3f --format=json-pretty":
>
> {
>     "epoch": 178403,
>     "inconsistents": [
>         {
>             "object": {
>                 "name": "default.142609570.87_20180203.020047\/repositories\/docker-local\/yyy\/company.yyy.api.assets\/1.2.4\/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d",
>                 "nspace": "",
>                 "locator": "",
>                 "snap": "head",
>                 "version": 217749
>             },
>             "errors": [],
>             "union_shard_errors": [
>                 "data_digest_mismatch_oi"
>             ],
>             "selected_object_info": "26:f4ce1748:::default.168602061.10_team-xxx.xxx-jobs.H6.HADOOP.data-segmentation.application.131.xxx-jvm.cpu.load%2f2018-05-08T03%3a45%3a15+00%3a00.sha1:head(167944'217749 client.177936559.0:1884719302 dirty|data_digest|omap_digest s 40 uv 217749 dd 422f251b od ffffffff alloc_hint [0 0])",
>             "shards": [
>                 {
>                     "osd": 30,
>                     "errors": [
>                         "data_digest_mismatch_oi"
>                     ],
>                     "size": 40,
>                     "omap_digest": "0xffffffff",
>                     "data_digest": "0x551c282f"
>                 },
>                 {
>                     "osd": 36,
>                     "errors": [
>                         "data_digest_mismatch_oi"
>                     ],
>                     "size": 40,
>                     "omap_digest": "0xffffffff",
>                     "data_digest": "0x551c282f"
>                 }
>             ]
>         },
>         {
>             "object": {
>                 "name": "default.142609570.87_20180206.093111\/repositories\/nuget-local\/Application\/Company.Application.Api\/Company.Application.Api.1.1.1.nupkg.artifactory-metadata\/properties.xml",
>                 "nspace": "",
>                 "locator": "",
>                 "snap": "head",
>                 "version": 216051
>             },
>             "errors": [],
>             "union_shard_errors": [
>                 "data_digest_mismatch_oi"
>             ],
>             "selected_object_info": "26:e261561a:::default.168602061.10_team-xxx.xxx-jobs.H6.HADOOP.data-segmentation.application.131.xxx-jvm.cpu.load%2f2018-05-05T03%3a51%3a39+00%3a00.sha1:head(167828'216051 client.179334015.0:1847715760 dirty|data_digest|omap_digest s 40 uv 216051 dd 49a34c1f od ffffffff alloc_hint [0 0])",
>             "shards": [
>                 {
>                     "osd": 30,
>                     "errors": [
>                         "data_digest_mismatch_oi"
>                     ],
>                     "size": 40,
>                     "omap_digest": "0xffffffff",
>                     "data_digest": "0x540e4f8b"
>                 },
>                 {
>                     "osd": 36,
>                     "errors": [
>                         "data_digest_mismatch_oi"
>                     ],
>                     "size": 40,
>                     "omap_digest": "0xffffffff",
>                     "data_digest": "0x540e4f8b"
>                 }
>             ]
>         }
>     ]
> }
>
> After some reading I understood I needed the rados get/put trick to solve
> this problem. I couldn't do "rados get", because I was getting a "no such
> file" error even though the objects were listed by "rados ls", so I got
> them directly from the OSD. After putting them back into rados (the rados
> commands didn't return any errors) and running deep-scrub on the same PG,
> the problem still existed. The only thing that changed: when I try to get
> an object via rados now, I get "(5) Input/output error".
>
> I tried forcing the object size to 40 (the real size of both objects) by
> adding the "-o 40" option to "rados put", but with no luck.
>
> Guys, maybe you have other ideas what to try? Why doesn't overwriting the
> object solve this problem?
>
> Thanks a lot!
>
> Arvydas
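A note on the digests in the scrub output above: the data_digest and omap_digest values (the "dd" and "od" fields in the object_info) are CRC-32C checksums over the object payload and omap, seeded with -1, which is why an empty omap reports od ffffffff. To illustrate what scrub is comparing, here is a minimal table-driven CRC-32C sketch; this is pure Python for illustration only, and the seed convention is my reading of how Ceph uses crc32c, not something stated in this thread:

```python
# Table-driven CRC-32C (Castagnoli), reflected polynomial 0x82F63B78.
_POLY = 0x82F63B78
_TABLE = []
for n in range(256):
    c = n
    for _ in range(8):
        c = (c >> 1) ^ _POLY if c & 1 else c >> 1
    _TABLE.append(c)

def crc32c(data: bytes, seed: int = 0xFFFFFFFF) -> int:
    """Raw CRC-32C with an explicit seed and no final XOR, assumed to mirror
    how Ceph computes digests with a -1 seed (an assumption, not Ceph's code)."""
    crc = seed & 0xFFFFFFFF
    for b in data:
        crc = (crc >> 8) ^ _TABLE[(crc ^ b) & 0xFF]
    return crc

# The standard CRC-32C check value applies a final XOR:
#   crc32c(b"123456789") ^ 0xFFFFFFFF == 0xE3069283
# An empty buffer leaves the seed untouched, matching the "od ffffffff"
# reported by scrub for objects with no omap data.
```

Under that assumption, running crc32c over a replica file extracted from an OSD and comparing the hex result against the "dd" value recorded in the object_info would show which side of the data_digest_mismatch_oi is stale, the data or the recorded object_info.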
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com