Hi Alison,

I have observed exactly that with OSDs "converted" from ceph-disk to 
ceph-volume. Someone thought it would be a great idea to store the /dev-device 
name in the config instead of the uuid or any other stable device path:

# cat /etc/ceph/osd/287-2eaf591b-bced-4097-9499-5fda071c6161.json
{
...
    "block": {
        "path": "/dev/disk/by-partuuid/0c8a9f89-efa7-4c75-87ad-2f0d5aa2d649",
        "uuid": "0c8a9f89-efa7-4c75-87ad-2f0d5aa2d649"
    },
...
    "data": {
        "path": "/dev/sdm1",
        "uuid": "2eaf591b-bced-4097-9499-5fda071c6161"
    },
...
}

Funnily enough, it has the UUID (and for the block device even the 
by-partuuid path) stored as well, but the /dev path is what is actually used 
during activation. My "fix" is to re-generate the OSD JSON just before every 
ceph-disk OSD start.
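
A rough sketch of how such files could be checked for this problem: the
helper below lists every "path" entry that does not live under /dev/disk/.
The function name unstable_paths is made up for illustration, and the
grep/sed parsing assumes the one-key-per-line layout shown in the excerpt
above; it is not a real JSON parser.

```shell
#!/bin/sh
# Hypothetical helper (not part of ceph-volume): print every "path" value in
# an OSD JSON file that is a kernel device name such as /dev/sdm1 rather
# than a stable /dev/disk/by-* link.
unstable_paths() {
    # $1: path to the OSD JSON file, e.g. a file under /etc/ceph/osd/
    grep -o '"path": *"[^"]*"' "$1" \
        | sed 's/^"path": *"\(.*\)"$/\1/' \
        | grep -v '^/dev/disk/' || true   # no match is not an error
}
```

For the file shown above this prints only /dev/sdm1. The JSON itself can
presumably be re-generated with "ceph-volume simple scan --force <device>";
the message does not say which command is actually used for that.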

You seem to be using LVM OSDs already, so this is a bit weird (it can't be the 
exact same issue). Still, I would not be surprised if you are bitten by 
something similar: some stored config (cache) overriding the actual drive 
location. It is really a blessing that the developers implemented a check that 
a partition actually points to the data with the correct OSD ID; otherwise our 
cluster would be wrecked by now.

I would start by using low-level commands (ceph-volume) directly to see if the 
issue is low-level or sits in some higher-level interface. Log in to the OSD 
node and check what "ceph-volume inventory" says and whether you can manually 
activate/deactivate the OSD on disk (be careful to include the --no-systemd 
option everywhere to avoid unintended changes to persistent configuration).
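
A minimal sketch of that sequence for an LVM OSD, written as a dry run that
only prints the commands unless RUN=1 is set. The wrapper names cv and
check_osd are made up for illustration, and the OSD id and fsid are
placeholders to be taken from the "ceph-volume lvm list" output. Only the
activate step is shown; on a containerized deployment the commands would
have to be run inside the appropriate container.

```shell
#!/bin/sh
# Dry run by default: print the ceph-volume commands instead of running them.
# Set RUN=1 to actually execute them (as root, on the OSD node).
cv() {
    if [ "${RUN:-0}" = "1" ]; then
        ceph-volume "$@"
    else
        echo "ceph-volume $*"
    fi
}

check_osd() {
    # $1: OSD id, $2: OSD fsid (both as reported by "ceph-volume lvm list")
    cv inventory                              # what ceph-volume sees on the host
    cv lvm list                               # which LV/device each OSD maps to
    cv lvm activate --no-systemd "$1" "$2"    # activate without touching systemd units
}
```

Example: check_osd 665 <osd-fsid> prints the three commands for review
before anything is executed.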

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: apeis...@fnal.gov <apeis...@fnal.gov>
Sent: Friday, August 25, 2023 10:29 PM
To: ceph-users@ceph.io
Subject: [ceph-users] Re: A couple OSDs not starting after host reboot

Hi,

Thank you for your reply. I don’t think the device names changed, but ceph 
seems to be confused about which device the OSD is on. It’s reporting that 
there are 2 OSDs on the same device although this is not true.

ceph device ls-by-host <osd-node> | grep sdu
ATA_HGST_HUH728080ALN600_VJH4GLUX sdu  osd.665
ATA_HGST_HUH728080ALN600_VJH60MAX sdu  osd.657

The osd.665 is actually on device sdm. Could this be the cause of the issue? Is 
there a way to correct it?
Thanks,
Alison
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io