Hello Niklas,

The explanation looks plausible.

What you can do is try extracting the PG from the dead OSD's disk
(please make absolutely sure that the OSD daemon is stopped!!!) and
re-injecting it into some other OSD (again, keep that daemon stopped
during the procedure). The extra copy should then act as an arbiter.
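
Before that, it is worth confirming which OSDs the PG currently maps to
and which other OSD on the same host you will import into; for example
(assuming the PG id 2.87 from your logs):

ceph pg map 2.87
ceph osd tree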

The relevant commands are:

systemctl stop ceph-osd@2
systemctl stop ceph-osd@3  # or whatever other OSD exists on the same host
systemctl mask ceph-osd@2
systemctl mask ceph-osd@3
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-2/ \
    --pgid 2.87 --op export --file /some/local/storage/pg-2.87.exp
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3/ --type bluestore \
    --pgid 2.87 --op import --file /some/local/storage/pg-2.87.exp
systemctl unmask ceph-osd@3
systemctl start ceph-osd@3
systemctl unmask ceph-osd@2
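
After the import, once the OSDs are running again, something along these
lines should show whether the PG can now be repaired (just a sketch,
please sanity-check against your Ceph version first):

ceph pg deep-scrub 2.87
rados list-inconsistent-obj 2.87 --format=json-pretty
ceph pg repair 2.87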

On Wed, Jun 28, 2023 at 8:31 AM Niklas Hambüchen <m...@nh2.me> wrote:
>
> Hi Alvaro,
>
> > Can you post the entire Ceph status output?
>
> Pasting here since it is short:
>
>      cluster:
>        id:     d9000ec0-93c2-479f-bd5d-94ae9673e347
>        health: HEALTH_ERR
>                1 scrub errors
>                Possible data damage: 1 pg inconsistent
>
>      services:
>        mon: 3 daemons, quorum node-4,node-5,node-6 (age 52m)
>        mgr: node-5(active, since 7d), standbys: node-6, node-4
>        mds: 1/1 daemons up, 2 standby
>        osd: 36 osds: 36 up (since 5d), 36 in (since 6d)
>
>      data:
>        volumes: 1/1 healthy
>        pools:   3 pools, 832 pgs
>        objects: 506.83M objects, 67 TiB
>        usage:   207 TiB used, 232 TiB / 439 TiB avail
>        pgs:     826 active+clean
>                5   active+clean+scrubbing+deep
>                1   active+clean+inconsistent
>
>      io:
>        client:   18 MiB/s wr, 0 op/s rd, 5 op/s wr
>
>
> > sometimes list-inconsistent-obj throws that error if a scrub job is still 
> > running.
>
> This would surprise me, because I already replaced the disk of the broken
> OSD "2" 7 days ago, and "list-inconsistent-obj" has not worked at any time
> since then.
>
> > grep -Hn 'ERR' /var/log/ceph/ceph-osd.33.log
>
>      /var/log/ceph/ceph-osd.33.log:8005229:2023-06-16T16:29:57.704+0000 
> 7f9a985e5640 -1 log_channel(cluster) log [ERR] : 2.87 shard 2 soid 
> 2:e18c2025:::1001c78d046.00000000:head : candidate had a read error
>      /var/log/ceph/ceph-osd.33.log:8018716:2023-06-16T20:03:26.923+0000 
> 7f9a985e5640 -1 log_channel(cluster) log [ERR] : 2.87 deep-scrub 0 missing, 1 
> inconsistent objects
>      /var/log/ceph/ceph-osd.33.log:8018717:2023-06-16T20:03:26.923+0000 
> 7f9a985e5640 -1 log_channel(cluster) log [ERR] : 2.87 deep-scrub 1 errors
>
> The time "2023-06-16T16:29:57" above is when the disk that carried OSD "2"
> broke; its logs around that time are:
>
>      /var/log/ceph/ceph-osd.2.log:7855741:2023-06-16T16:29:57.690+0000 
> 7fbae3cf7640 -1 bdev(0x7fbaeef6c400 /var/lib/ceph/osd/ceph-2/block) 
> _aio_thread got r=-5 ((5) Input/output error)
>      /var/log/ceph/ceph-osd.2.log:7855743:2023-06-16T16:29:57.690+0000 
> 7fba62863640 -1 log_channel(cluster) log [ERR] : 2.b1 missing primary copy of 
> 2:8df449f9:::10016e7a962.00000000:head, will try copies on 19,32
>      /var/log/ceph/ceph-osd.2.log:7855747:2023-06-16T16:29:57.691+0000 
> 7fba63064640 -1 log_channel(cluster) log [ERR] : 2.a6 missing primary copy of 
> 2:65bd8cda:::10016ea4e67.00000000:head, will try copies on 17,28
>      -- note time jump by 3 days --
>      /var/log/ceph/ceph-osd.2.log:8096330:2023-06-19T06:42:48.712+0000 
> 7fba62863640 -1 log_channel(cluster) log [ERR] : 2.b1 missing primary copy of 
> 2:8d51be04:::1001d7b8447.00000334:head, will try copies on 19,32
>      /var/log/ceph/ceph-osd.2.log:8108684: -1867> 
> 2023-06-19T06:42:48.712+0000 7fba62863640 -1 log_channel(cluster) log [ERR] : 
> 2.b1 missing primary copy of 2:8d51be04:::1001d7b8447.00000334:head, will try 
> copies on 19,32
>      /var/log/ceph/ceph-osd.2.log:8108766: -1785> 
> 2023-06-19T06:42:49.035+0000 7fba6d879640 10 log_client  will send 
> 2023-06-19T06:42:48.713712+0000 osd.2 (osd.2) 179 : cluster [ERR] 2.b1 
> missing primary copy of 2:8d51be04:::1001d7b8447.00000334:head, will try 
> copies on 19,32
>      /var/log/ceph/ceph-osd.2.log:8108770: -1781> 
> 2023-06-19T06:42:49.525+0000 7fba7787f640 10 log_client  logged 
> 2023-06-19T06:42:48.713712+0000 osd.2 (osd.2) 179 : cluster [ERR] 2.b1 
> missing primary copy of 2:8d51be04:::1001d7b8447.00000334:head, will try 
> copies on 19,32
>      /var/log/ceph/ceph-osd.2.log:8111339:2023-06-19T06:51:13.940+0000 
> 7fb1518126c0 -1  ** ERROR: osd init failed: (5) Input/output error
>
> Does "candidate had a read error" on OSD "33" mean that a BlueStore checksum 
> error was detected on OSD "33" at the same time as the OSD "2" disk failed?
> If yes, maybe that is the explanation:
>
> * pg 2.87 is backed by OSDs [33,2,20]; OSD 2's hardware broke during the 
> scrub, OSD 33 detected a checksum error during the scrub, and thus we have 2 
> OSDs left (33 and 20) whose checksums disagree.
>
> I am just guessing this, though.
> Also, if this is correct, the next question would be: What about OSD 20?
> Since there is no error reported at all for OSD 20, I assume that its 
> checksum agrees with its data.
> Now, can I find out whether OSD 20's checksum agrees with OSD 33's data?
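>
> (I imagine one way to answer this would be to compare the two surviving
> copies of the affected object directly, by stopping each of the two OSDs
> in turn and dumping the object with ceph-objectstore-tool on its host.
> Roughly like the following, with the object name taken from the scrub
> error above and "<OBJECT_JSON>" being a placeholder for the JSON spec
> that "--op list" prints; I have not actually tried this:
>
>      # with ceph-osd@33 stopped, on its host:
>      ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-33 \
>          --op list 1001c78d046.00000000
>      # paste the JSON printed above in place of <OBJECT_JSON>
>      ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-33 \
>          --pgid 2.87 '<OBJECT_JSON>' get-bytes > /tmp/obj.osd33
>      # likewise with ceph-osd@20 stopped, on its host:
>      ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-20 \
>          --pgid 2.87 '<OBJECT_JSON>' get-bytes > /tmp/obj.osd20
>      md5sum /tmp/obj.osd33 /tmp/obj.osd20
>
> Corrections welcome if this is the wrong tool for the job.)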
>
> (Side note: The disk of OSD 33 looks fine in smartctl.)
>
> Thanks,
> Niklas
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io



-- 
Alexander E. Patrakov
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
