Hi Arvydas,

 

The error seems to suggest this is not an issue with the object data itself, but with the expected object digest (the digest recorded in the object info). I can no longer find where I stored my very hacky diagnosis notes for this, but our eventual fix was to locate the affected bucket or files and then rename an object within it, forcing a recalculation of the digest. Depending on the size of the pool, perhaps you could rename a few of the affected files to trigger that recalculation and see if it remedies the problem?
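
Roughly what I mean, with a placeholder object name (rados has no in-place rename as far as I know, so a get/put/rm round trip to force a rewrite is the closest equivalent; treat this as a sketch rather than a tested procedure):

# rados -p .rgw.buckets get "<object-name>" /tmp/obj.bak
# rados -p .rgw.buckets put "<object-name>.tmp" /tmp/obj.bak
# rados -p .rgw.buckets rm "<object-name>"
# rados -p .rgw.buckets put "<object-name>" /tmp/obj.bak
# rados -p .rgw.buckets rm "<object-name>.tmp"
# ceph pg deep-scrub <pgid>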

 

Kind Regards,

 

Tom

 

From: ceph-users <ceph-users-boun...@lists.ceph.com> On Behalf Of Arvydas 
Opulskis
Sent: 14 August 2018 12:33
To: Brent Kennedy <bkenn...@cfl.rr.com>
Cc: Ceph Users <ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Inconsistent PG could not be repaired

 

Thanks for the suggestion about restarting the OSDs, but that doesn't work either.

 

Anyway, I managed to fix the second unrepairable PG by getting the object from the OSD and saving it again via rados, but still no luck with the first one.

I think I found the main reason why this doesn't work. It seems the object is not actually overwritten, even though the rados command returns no errors. I tried to delete the object, but it still stays in the pool untouched. Here is an example of what I see:

 

# rados -p .rgw.buckets ls | grep -i 
"sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d"
default.142609570.87_20180203.020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d

# rados -p .rgw.buckets get 
default.142609570.87_20180203.020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d
 testfile
error getting 
.rgw.buckets/default.142609570.87_20180203.020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d:
 (2) No such file or directory

# rados -p .rgw.buckets rm 
default.142609570.87_20180203.020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d

# rados -p .rgw.buckets ls | grep -i 
"sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d"
default.142609570.87_20180203.020047/repositories/docker-local/yyy/company.yyy.api.assets/1.2.4/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d

 

I've never seen this in our Ceph clusters before. Should I report a bug about it? If any of you need more diagnostic info, let me know.

 

Thanks,

Arvydas

 

On Tue, Aug 7, 2018 at 5:49 PM, Brent Kennedy <bkenn...@cfl.rr.com> wrote:

Last time I had an inconsistent PG that could not be repaired using the repair 
command, I looked at which OSDs hosted the PG, then restarted them one by one 
(usually stopping, waiting a few seconds, then starting them back up). You 
could also stop them, flush the journal, then start them back up.
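
In your case, on a systemd setup, that would look roughly like the sketch below (using one of the two OSDs hosting the PG; the flush-journal step only applies to filestore OSDs and the daemon must be stopped first):

# systemctl stop ceph-osd@30
# ceph-osd -i 30 --flush-journal
# systemctl start ceph-osd@30
(repeat for the other OSD, then re-run "ceph pg repair 26.c3f")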

 

If that didn't work, it meant there was data loss and I had to use 
ceph-objectstore-tool to export the objects from a location that had the 
latest data and import them into the one that had no data. The 
ceph-objectstore-tool is not a simple thing though and should not be used 
lightly. When I say data loss, I mean that Ceph thinks the last place written 
to has the data, but that OSD doesn't actually have the data (meaning the 
write failed there).
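
Very roughly, that dance looks like the sketch below (PG and OSD ids match your earlier output, but the paths are illustrative and I'm going from memory, so treat it as an outline rather than exact commands; both OSDs have to be stopped and noout set first):

# ceph osd set noout
# systemctl stop ceph-osd@30 ceph-osd@36
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-30 \
    --journal-path /var/lib/ceph/osd/ceph-30/journal \
    --pgid 26.c3f --op export --file /tmp/26.c3f.export
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-36 \
    --journal-path /var/lib/ceph/osd/ceph-36/journal \
    --pgid 26.c3f --op remove
# ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-36 \
    --journal-path /var/lib/ceph/osd/ceph-36/journal \
    --pgid 26.c3f --op import --file /tmp/26.c3f.export
# systemctl start ceph-osd@30 ceph-osd@36
# ceph osd unset noout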

 

If you want to go that route, let me know; I wrote a how-to on it. It should 
be the last resort, though. I also don't know your setup, so I would hate to 
recommend something so drastic.

 

-Brent

 

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Arvydas Opulskis
Sent: Monday, August 6, 2018 4:12 AM
To: ceph-users@lists.ceph.com 
Subject: Re: [ceph-users] Inconsistent PG could not be repaired

 

Hi again,

 

after two weeks I've got another inconsistent PG in the same cluster. The OSDs 
are different from the first PG, and the object cannot be fetched via rados get either:


# rados list-inconsistent-obj 26.821 --format=json-pretty

{
    "epoch": 178472,
    "inconsistents": [
        {
            "object": {
                "name": "default.122888368.52__shadow_.3ubGZwLcz0oQ55-LTb7PCOTwKkv-nQf_7",
                "nspace": "",
                "locator": "",
                "snap": "head",
                "version": 118920
            },
            "errors": [],
            "union_shard_errors": [
                "data_digest_mismatch_oi"
            ],
            "selected_object_info": "26:8411bae4:::default.122888368.52__shadow_.3ubGZwLcz0oQ55-LTb7PCOTwKkv-nQf_7:head(126495'118920 client.142609570.0:41412640 dirty|data_digest|omap_digest s 4194304 uv 118920 dd cd142aaa od ffffffff alloc_hint [0 0])",
            "shards": [
                {
                    "osd": 20,
                    "errors": [
                        "data_digest_mismatch_oi"
                    ],
                    "size": 4194304,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0x6b102e59"
                },
                {
                    "osd": 44,
                    "errors": [
                        "data_digest_mismatch_oi"
                    ],
                    "size": 4194304,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0x6b102e59"
                }
            ]
        }
    ]
}

# rados -p .rgw.buckets get 
default.122888368.52__shadow_.3ubGZwLcz0oQ55-LTb7PCOTwKkv-nQf_7 test_2pg.file

error getting 
.rgw.buckets/default.122888368.52__shadow_.3ubGZwLcz0oQ55-LTb7PCOTwKkv-nQf_7: 
(5) Input/output error

 

 

Still struggling with how to solve it. Any ideas, guys?

 

Thank you

 

 

 

On Tue, Jul 24, 2018 at 10:27 AM, Arvydas Opulskis <zebedie...@gmail.com> wrote:

Hello, Cephers,

 

after trying different repair approaches I am out of ideas on how to repair 
this inconsistent PG. I hope someone's sharp eye will notice what I overlooked.

 

Some info about cluster:

Centos 7.4

Jewel 10.2.10 

Pool size 2 (yes, I know it's a very bad choice)

Pool with inconsistent PG: .rgw.buckets 

 

After a routine deep-scrub I found PG 26.c3f in inconsistent status. While 
running the "ceph pg repair 26.c3f" command and monitoring the "ceph -w" log, 
I noticed these errors:

2018-07-24 08:28:06.517042 osd.36 [ERR] 26.c3f shard 30: soid 
26:fc32a1f1:::default.142609570.87_20180206.093111%2frepositories%2fnuget-local%2fApplication%2fCompany.Application.Api%2fCompany.Application.Api.1.1.1.nupkg.artifactory-metadata%2fproperties.xml:head
 data_digest 0x540e4f8b != data_digest 0x49a34c1f from auth oi 
26:e261561a:::default.168602061.10_team-xxx.xxx-jobs.H6.HADOOP.data-segmentation.application.131.xxx-jvm.cpu.load%2f2018-05-05T03%3a51%3a39+00%3a00.sha1:head(167828'216051
 client.179334015.0:1847715760 dirty|data_digest|omap_digest s 40 uv 216051 dd 
49a34c1f od ffffffff alloc_hint [0 0])

 

2018-07-24 08:28:06.517118 osd.36 [ERR] 26.c3f shard 36: soid 
26:fc32a1f1:::default.142609570.87_20180206.093111%2frepositories%2fnuget-local%2fApplication%2fCompany.Application.Api%2fCompany.Application.Api.1.1.1.nupkg.artifactory-metadata%2fproperties.xml:head
 data_digest 0x540e4f8b != data_digest 0x49a34c1f from auth oi 
26:e261561a:::default.168602061.10_team-xxx.xxx-jobs.H6.HADOOP.data-segmentation.application.131.xxx-jvm.cpu.load%2f2018-05-05T03%3a51%3a39+00%3a00.sha1:head(167828'216051
 client.179334015.0:1847715760 dirty|data_digest|omap_digest s 40 uv 216051 dd 
49a34c1f od ffffffff alloc_hint [0 0])

 

2018-07-24 08:28:06.517122 osd.36 [ERR] 26.c3f soid 
26:fc32a1f1:::default.142609570.87_20180206.093111%2frepositories%2fnuget-local%2fApplication%2fCompany.Application.Api%2fCompany.Application.Api.1.1.1.nupkg.artifactory-metadata%2fproperties.xml:head:
 failed to pick suitable auth object

 

...and the same errors about another object in the same PG.

 

Repair failed, so I checked the inconsistencies with "rados list-inconsistent-obj 26.c3f 
--format=json-pretty":

 

{
    "epoch": 178403,
    "inconsistents": [
        {
            "object": {
                "name": "default.142609570.87_20180203.020047\/repositories\/docker-local\/yyy\/company.yyy.api.assets\/1.2.4\/sha256__ce41e5246ead8bddd2a2b5bbb863db250f328be9dc5c3041481d778a32f8130d",
                "nspace": "",
                "locator": "",
                "snap": "head",
                "version": 217749
            },
            "errors": [],
            "union_shard_errors": [
                "data_digest_mismatch_oi"
            ],
            "selected_object_info": "26:f4ce1748:::default.168602061.10_team-xxx.xxx-jobs.H6.HADOOP.data-segmentation.application.131.xxx-jvm.cpu.load%2f2018-05-08T03%3a45%3a15+00%3a00.sha1:head(167944'217749 client.177936559.0:1884719302 dirty|data_digest|omap_digest s 40 uv 217749 dd 422f251b od ffffffff alloc_hint [0 0])",
            "shards": [
                {
                    "osd": 30,
                    "errors": [
                        "data_digest_mismatch_oi"
                    ],
                    "size": 40,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0x551c282f"
                },
                {
                    "osd": 36,
                    "errors": [
                        "data_digest_mismatch_oi"
                    ],
                    "size": 40,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0x551c282f"
                }
            ]
        },
        {
            "object": {
                "name": "default.142609570.87_20180206.093111\/repositories\/nuget-local\/Application\/Company.Application.Api\/Company.Application.Api.1.1.1.nupkg.artifactory-metadata\/properties.xml",
                "nspace": "",
                "locator": "",
                "snap": "head",
                "version": 216051
            },
            "errors": [],
            "union_shard_errors": [
                "data_digest_mismatch_oi"
            ],
            "selected_object_info": "26:e261561a:::default.168602061.10_team-xxx.xxx-jobs.H6.HADOOP.data-segmentation.application.131.xxx-jvm.cpu.load%2f2018-05-05T03%3a51%3a39+00%3a00.sha1:head(167828'216051 client.179334015.0:1847715760 dirty|data_digest|omap_digest s 40 uv 216051 dd 49a34c1f od ffffffff alloc_hint [0 0])",
            "shards": [
                {
                    "osd": 30,
                    "errors": [
                        "data_digest_mismatch_oi"
                    ],
                    "size": 40,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0x540e4f8b"
                },
                {
                    "osd": 36,
                    "errors": [
                        "data_digest_mismatch_oi"
                    ],
                    "size": 40,
                    "omap_digest": "0xffffffff",
                    "data_digest": "0x540e4f8b"
                }
            ]
        }
    ]
}

 

 

After some reading, I understood I needed the rados get/put trick to solve this 
problem. I couldn't do a rados get, because I was getting the "no such file" 
error even though the objects were listed by the "rados ls" command, so I got 
them directly from the OSD. After putting them back via rados (the rados 
commands didn't return any errors) and doing a deep-scrub on the same PG, the 
problem still existed. The only thing that changed: when I try to get the 
object via rados now, I get "(5) Input/output error".
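
For clarity, "got them directly from the OSD" means roughly the following (filestore layout; the OSD path is one of the two hosting the PG, and the on-disk file name is the escaped object name, so the exact find pattern may need adjusting):

# find /var/lib/ceph/osd/ceph-30/current/26.c3f_head/ -name '*properties.xml*'
# cp "<file found above>" /tmp/properties.xml.recovered
# rados -p .rgw.buckets put "default.142609570.87_20180206.093111/repositories/nuget-local/Application/Company.Application.Api/Company.Application.Api.1.1.1.nupkg.artifactory-metadata/properties.xml" /tmp/properties.xml.recovered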

 

I tried to force the object size to 40 (the real size of both objects) by 
adding the "-o 40" option to the "rados put" command, but with no luck.

 

Maybe you have other ideas about what to try? Why doesn't overwriting the 
object solve this problem?

 

Thanks a lot!

 

Arvydas

 

 

 

 

 

 

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
