[ceph-users] Re: inconsistent pg after upgrade nautilus to octopus

2021-12-19 Thread Christian Rohmann

Hello Tomasz,


I observe a strange accumulation of inconsistencies for an RGW-only 
(+multisite) setup, with errors just like those you reported.
I collected some info and raised a bug ticket:  
https://tracker.ceph.com/issues/53663
Two more inconsistencies have just shown up, only hours after repairing the
previous one, which adds to the theory that something really odd is going on.
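
For anyone hitting the same thing, the recurrence is easy to spot from the
health checks; a minimal sketch (the commands are generic, no cluster-specific
values):

# scrub errors and damaged PGs show up here first
ceph health detail

# list the PGs currently flagged inconsistent
ceph pg ls inconsistent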




Did you upgrade to Octopus in the end, then? Any more issues with such 
inconsistencies on your side, Tomasz?




Regards

Christian



On 20/10/2021 10:33, Tomasz Płaza wrote:
As the upgrade process states, the RGWs are the last to be upgraded, so 
they are still on Nautilus (CentOS 7). Those errors showed up after the 
upgrade of the first OSD host. It is a multisite setup, so I am a little 
afraid of upgrading the RGWs now.


Etienne:

Sorry for answering in this thread, but somehow I do not get messages 
directed only to the ceph-users list. I ran "rados list-inconsistent-pg" 
and got many entries like:


{
  "object": {
    "name": ".dir.99a07ed8-2112-429b-9f94-81383220a95b.7104621.23.7",
    "nspace": "",
    "locator": "",
    "snap": "head",
    "version": 82561410
  },
  "errors": [
    "omap_digest_mismatch"
  ],
  "union_shard_errors": [],
  "selected_object_info": {
    "oid": {
      "oid": ".dir.99a07ed8-2112-429b-9f94-81383220a95b.7104621.23.7",
      "key": "",
      "snapid": -2,
      "hash": 3316145293,
      "max": 0,
      "pool": 230,
      "namespace": ""
    },
    "version": "107760'82561410",
    "prior_version": "106468'82554595",
    "last_reqid": "client.392341383.0:2027385771",
    "user_version": 82561410,
    "size": 0,
    "mtime": "2021-10-19T16:32:25.699134+0200",
    "local_mtime": "2021-10-19T16:32:25.699073+0200",
    "lost": 0,
    "flags": [
      "dirty",
      "omap",
      "data_digest"
    ],
    "truncate_seq": 0,
    "truncate_size": 0,
    "data_digest": "0x",
    "omap_digest": "0x",
    "expected_object_size": 0,
    "expected_write_size": 0,
    "alloc_hint_flags": 0,
    "manifest": {
      "type": 0
    },
    "watchers": {}
  },
  "shards": [
    {
      "osd": 56,
      "primary": true,
      "errors": [],
      "size": 0,
      "omap_digest": "0xf4cf0e1c",
      "data_digest": "0x"
    },
    {
      "osd": 58,
      "primary": false,
      "errors": [],
      "size": 0,
      "omap_digest": "0xf4cf0e1c",
      "data_digest": "0x"
    },
    {
      "osd": 62,
      "primary": false,
      "errors": [],
      "size": 0,
      "omap_digest": "0x4bd5703a",
      "data_digest": "0x"
    }
  ]
}
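
For completeness: per-object output in this format is what
"rados list-inconsistent-obj" prints for a single PG, so the usual sequence is
roughly the following (the pool name is the one mentioned later in the thread;
the PG ID 230.2a is only a placeholder, not taken from this thread):

# list the PGs that scrub flagged as inconsistent in the affected pool
rados list-inconsistent-pg default.rgw.buckets.index

# dump per-object detail (like the JSON above) for one of the returned PG IDs
rados list-inconsistent-obj 230.2a --format=json-pretty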


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: inconsistent pg after upgrade nautilus to octopus

2021-10-20 Thread Szabo, Istvan (Agoda)
Have you tried to repair pg?

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---
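
For reference, repairing a PG that scrub flagged as inconsistent is a
one-liner; a sketch with a placeholder PG ID:

# ask the primary OSD to repair the PG (230.2a is a placeholder)
ceph pg repair 230.2a

# then watch the cluster and confirm the scrub errors clear
ceph -w
ceph health detail

Repair lets the OSDs pick what they consider the authoritative copy, so it is
worth understanding the root cause first, as suggested elsewhere in the thread.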

On 2021. Oct 20., at 9:04, Glaza  wrote:



Hi everyone,

I am in the process of upgrading Nautilus (14.2.22) to Octopus (15.2.14) on
CentOS 7 (Mon/Mgr were additionally migrated to CentOS 8 beforehand). Each day
I upgraded one host, and after all OSDs were up I manually compacted them one
by one. Today (8 hosts upgraded, 7 still to go) I started getting errors like
"Possible data damage: 1 pg inconsistent". The first time the acting set was
[56,58,62], but I thought: OK, in the osd.62 logs there are many lines like
"osd.62 39892 class rgw_gc open got (1) Operation not permitted"; maybe rgw
did not clean up some omaps properly and Ceph did not notice it until a scrub
happened. But now I have got acting [56,57,58] and none of these OSDs has
those rgw_gc errors in its logs. All affected OSDs are Octopus 15.2.14 on
NVMe, hosting the default.rgw.buckets.index pool. Has anyone experienced this
problem? Any help appreciated.
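
The manual per-OSD compaction mentioned above can be done online or offline; a
rough sketch (the OSD ID and data path are placeholders):

# online: ask a running OSD to compact its RocksDB
ceph tell osd.56 compact

# offline: with the OSD stopped, compact through ceph-kvstore-tool
ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-56 compact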

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: inconsistent pg after upgrade nautilus to octopus

2021-10-20 Thread Tomasz Płaza

Sorry Marc, I didn't see the second question.

As the upgrade process states, the RGWs are the last to be upgraded, so 
they are still on Nautilus (CentOS 7). Those errors showed up after the 
upgrade of the first OSD host. It is a multisite setup, so I am a little 
afraid of upgrading the RGWs now.


Etienne:

Sorry for answering in this thread, but somehow I do not get messages 
directed only to the ceph-users list. I ran "rados list-inconsistent-pg" 
and got many entries like:


{
  "object": {
    "name": ".dir.99a07ed8-2112-429b-9f94-81383220a95b.7104621.23.7",
    "nspace": "",
    "locator": "",
    "snap": "head",
    "version": 82561410
  },
  "errors": [
    "omap_digest_mismatch"
  ],
  "union_shard_errors": [],
  "selected_object_info": {
    "oid": {
      "oid": ".dir.99a07ed8-2112-429b-9f94-81383220a95b.7104621.23.7",
      "key": "",
      "snapid": -2,
      "hash": 3316145293,
      "max": 0,
      "pool": 230,
      "namespace": ""
    },
    "version": "107760'82561410",
    "prior_version": "106468'82554595",
    "last_reqid": "client.392341383.0:2027385771",
    "user_version": 82561410,
    "size": 0,
    "mtime": "2021-10-19T16:32:25.699134+0200",
    "local_mtime": "2021-10-19T16:32:25.699073+0200",
    "lost": 0,
    "flags": [
      "dirty",
      "omap",
      "data_digest"
    ],
    "truncate_seq": 0,
    "truncate_size": 0,
    "data_digest": "0x",
    "omap_digest": "0x",
    "expected_object_size": 0,
    "expected_write_size": 0,
    "alloc_hint_flags": 0,
    "manifest": {
      "type": 0
    },
    "watchers": {}
  },
  "shards": [
    {
      "osd": 56,
      "primary": true,
      "errors": [],
      "size": 0,
      "omap_digest": "0xf4cf0e1c",
      "data_digest": "0x"
    },
    {
      "osd": 58,
      "primary": false,
      "errors": [],
      "size": 0,
      "omap_digest": "0xf4cf0e1c",
      "data_digest": "0x"
    },
    {
      "osd": 62,
      "primary": false,
      "errors": [],
      "size": 0,
      "omap_digest": "0x4bd5703a",
      "data_digest": "0x"
    }
  ]
}


On 20.10.2021 at 09:51, Marc wrote:

Is the rgw still Nautilus? What about trying with an Octopus rgw?


and Ceph did not notice it until a scrub happened. But now I have got
acting [56,57,58] and none of these OSDs has those rgw_gc errors in its
logs. All affected OSDs are Octopus 15.2.14 on NVMe, hosting the
default.rgw.buckets.index pool. Has anyone experienced this problem?
Any help appreciated.



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: inconsistent pg after upgrade nautilus to octopus

2021-10-20 Thread Tomasz Płaza

I did it only on the MON servers; the OSDs are still on CentOS 7. Per mon, the process was (a rough shell sketch follows the list):
1. stop the mon
2. back up /var/lib/ceph
3. reinstall the server as CentOS 8 and install Ceph Nautilus
4. restore /var/lib/ceph and start the mon
5. wait a few days
6. upgrade the mon to Octopus
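
In shell form, the per-mon sequence is roughly the following; it is only a
sketch of the steps above, with the mon ID "a" and the archive path as
placeholders:

# on the mon host about to be reinstalled ("a" is a placeholder mon ID)
systemctl stop ceph-mon@a
tar czf /root/var-lib-ceph.tar.gz /var/lib/ceph
# copy the archive off the host before wiping it

# ... reinstall the host as CentOS 8 and install the same Nautilus packages ...

tar xzf /root/var-lib-ceph.tar.gz -C /
systemctl start ceph-mon@a
ceph -s   # confirm the mon rejoins quorum, then let it settle before the next one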

On 20.10.2021 at 09:51, Marc wrote:


How did you do the upgrade from CentOS 7 to CentOS 8? I assume you kept the 
OSD configs etc.?


upgrading Nautilus (14.2.22) to Octopus (15.2.14) on CentOS 7 (Mon/Mgr
were additionally migrated to CentOS 8 beforehand). Each day I upgraded
one host, and after all OSDs were up I manually compacted them one by
one. Today (8 hosts upgraded, 7 still to go) I started getting errors
like "Possible data damage: 1 pg inconsistent". The first time the
acting set was [56,58,62], but I thought: OK, in the osd.62 logs there
are many lines like "osd.62 39892 class rgw_gc open got (1) Operation
not permitted"; maybe rgw did not clean up some omaps properly,

Is the rgw still Nautilus? What about trying with an Octopus rgw?


and Ceph did not notice it until a scrub happened. But now I have got
acting [56,57,58] and none of these OSDs has those rgw_gc errors in its
logs. All affected OSDs are Octopus 15.2.14 on NVMe, hosting the
default.rgw.buckets.index pool. Has anyone experienced this problem?
Any help appreciated.



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: inconsistent pg after upgrade nautilus to octopus

2021-10-20 Thread Tomasz Płaza
Yes I did, and despite "Too many repaired reads on 1 OSDs" the cluster health 
is back to HEALTH_OK.
But this is the second time it has happened and I do not know whether I should 
go forward with the upgrade or hold off. Or maybe it is a bad move to compact 
right after the migration to 15.2.14.
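
For reference, that warning is the OSD_TOO_MANY_REPAIRS health check. A hedged
sketch for inspecting it; the threshold option name (mon_osd_warn_num_repaired,
default 10) is quoted from memory, so please verify it exists on your release
before relying on it:

# see which OSDs tripped the OSD_TOO_MANY_REPAIRS warning
ceph health detail

# optionally raise the threshold if the repairs are understood and the warning is noise
# (assumes the option mon_osd_warn_num_repaired is available on this release)
ceph config set osd mon_osd_warn_num_repaired 50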


On 20.10.2021 at 09:21, Szabo, Istvan (Agoda) wrote:

Have you tried to repair pg?

Istvan Szabo
Senior Infrastructure Engineer
---
Agoda Services Co., Ltd.
e: istvan.sz...@agoda.com
---


On 2021. Oct 20., at 9:04, Glaza  wrote:


Hi everyone, I am in the process of upgrading Nautilus (14.2.22) to Octopus
(15.2.14) on CentOS 7 (Mon/Mgr were additionally migrated to CentOS 8
beforehand). Each day I upgraded one host, and after all OSDs were up I
manually compacted them one by one. Today (8 hosts upgraded, 7 still to go)
I started getting errors like "Possible data damage: 1 pg inconsistent". The
first time the acting set was [56,58,62], but I thought: OK, in the osd.62
logs there are many lines like "osd.62 39892 class rgw_gc open got (1)
Operation not permitted"; maybe rgw did not clean up some omaps properly and
Ceph did not notice it until a scrub happened. But now I have got acting
[56,57,58] and none of these OSDs has those rgw_gc errors in its logs. All
affected OSDs are Octopus 15.2.14 on NVMe, hosting the
default.rgw.buckets.index pool. Has anyone experienced this problem? Any help
appreciated.





___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: inconsistent pg after upgrade nautilus to octopus

2021-10-20 Thread Marc



How did you do the upgrade from CentOS 7 to CentOS 8? I assume you kept the 
OSD configs etc.?

> upgrading Nautilus (14.2.22) to Octopus (15.2.14) on CentOS 7 (Mon/Mgr
> were additionally migrated to CentOS 8 beforehand). Each day I upgraded
> one host, and after all OSDs were up I manually compacted them one by
> one. Today (8 hosts upgraded, 7 still to go) I started getting errors
> like "Possible data damage: 1 pg inconsistent". The first time the
> acting set was [56,58,62], but I thought: OK, in the osd.62 logs there
> are many lines like "osd.62 39892 class rgw_gc open got (1) Operation
> not permitted"; maybe rgw did not clean up some omaps properly,

Is the rgw still Nautilus? What about trying with an Octopus rgw?

> and Ceph did not notice it until a scrub happened. But now I have got
> acting [56,57,58] and none of these OSDs has those rgw_gc errors in its
> logs. All affected OSDs are Octopus 15.2.14 on NVMe, hosting the
> default.rgw.buckets.index pool. Has anyone experienced this problem?
> Any help appreciated.
> 

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: inconsistent pg after upgrade nautilus to octopus

2021-10-20 Thread Etienne Menguy
Hi,

You should check for the root cause of the inconsistency:
https://docs.ceph.com/en/latest/rados/troubleshooting/troubleshooting-pg/#pgs-inconsistent
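
The flow described there is basically: find the damaged PG, confirm the
inconsistency with a fresh deep scrub, inspect it, and only then repair. A
sketch with a placeholder PG ID (230.2a is not from this thread):

# find the damaged PG and the scrub error count
ceph health detail

# confirm the inconsistency reproduces and inspect it
ceph pg deep-scrub 230.2a
rados list-inconsistent-obj 230.2a --format=json-pretty

# repair once the root cause is understood
ceph pg repair 230.2a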
 

 

-
Etienne Menguy
etienne.men...@croit.io




> On 20 Oct 2021, at 09:21, Szabo, Istvan (Agoda)  wrote:
> 
> Have you tried to repair pg?
> 
> Istvan Szabo
> Senior Infrastructure Engineer
> ---
> Agoda Services Co., Ltd.
> e: istvan.sz...@agoda.com
> ---
> 
> On 2021. Oct 20., at 9:04, Glaza  wrote:
> 
> 
> 
> Hi everyone, I am in the process of upgrading Nautilus (14.2.22) to
> Octopus (15.2.14) on CentOS 7 (Mon/Mgr were additionally migrated to
> CentOS 8 beforehand). Each day I upgraded one host, and after all OSDs
> were up I manually compacted them one by one. Today (8 hosts upgraded,
> 7 still to go) I started getting errors like "Possible data damage: 1
> pg inconsistent". The first time the acting set was [56,58,62], but I
> thought: OK, in the osd.62 logs there are many lines like "osd.62 39892
> class rgw_gc open got (1) Operation not permitted"; maybe rgw did not
> clean up some omaps properly and Ceph did not notice it until a scrub
> happened. But now I have got acting [56,57,58] and none of these OSDs
> has those rgw_gc errors in its logs. All affected OSDs are Octopus
> 15.2.14 on NVMe, hosting the default.rgw.buckets.index pool. Has anyone
> experienced this problem? Any help appreciated.
> 
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io