[ceph-users] Re: Question about erasure coding on cephfs
Hi Erich, about a similar problem I asked some months ago, Frank Schilder published this on the list (December 6, 2023) and it may be helpful for your setup. I've not tested it yet, my cluster is still in deployment state.

To provide some first-hand experience, I was operating a pool with a 6+2 EC profile on 4 hosts for a while (until we got more hosts) and the "subdivide a physical host into 2 crush-buckets" approach is actually working best (I basically tried all the approaches described in the linked post and they all had pitfalls). Procedure is more or less (see the worked example after this message):

- add a second (logical) host bucket for each physical host by suffixing the host name with "-B" (ceph osd crush add-bucket <hostname>-B host)
- move half the OSDs per host to this new host bucket (ceph osd crush move osd.ID host=HOSTNAME-B)
- make this location persist across reboots of the OSDs (ceph config set osd.ID crush_location "host=HOSTNAME-B")

This will allow you to move OSDs back easily when you get more hosts and can afford the recommended 1 shard per host. It will also show which OSDs are moved, and where, with a simple "ceph config dump | grep crush_location". Best of all, you don't have to fiddle around with crush maps and hope they do what you want. Just use failure domain host and you are good. No more than 2 host buckets per physical host means no more than 2 shards per physical host with default placement rules.

I was operating this set-up with min_size=6 and feeling bad about it due to the reduced maintainability (risk of data loss during maintenance). It's not great really, but sometimes there is no way around it. I was happy when I got the extra hosts.

Patrick

Le 02/03/2024 à 16:37, Erich Weiler a écrit :
Hi Y'all,

We have a new ceph cluster online that looks like this:

md-01 : monitor, manager, mds
md-02 : monitor, manager, mds
md-03 : monitor, manager
store-01 : twenty 30TB NVMe OSDs
store-02 : twenty 30TB NVMe OSDs

The cephfs storage is using erasure coding at 4:2. The crush domain is set to "osd". (I know that's not optimal, but let me get to that in a minute.)

We have a current regular single NFS server (nfs-01) with the same storage as the OSD servers above (twenty 30TB NVMe disks). We want to wipe the NFS server and integrate it into the above ceph cluster as "store-03". When we do that, we would then have three OSD servers. We would then switch the crush domain to "host".

My question is this: given that we have 4:2 erasure coding, would the data rebalance evenly across the three OSD servers after we add store-03, such that if a single OSD server went down, the other two would be enough to keep the system online? Like, with 4:2 erasure coding, would 2 shards go on store-01, then 2 shards on store-02, and then 2 shards on store-03? Is that how I understand it?

Thanks for any insight!

-erich
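For illustration, the full sequence for one hypothetical physical host "store-01" carrying OSDs 0-3 would look roughly like this (host name and OSD ids are examples, not taken from this cluster, so adapt before use):

  ceph osd crush add-bucket store-01-B host              # second logical host bucket
  ceph osd crush move store-01-B root=default            # hang it under the default root
  ceph osd crush move osd.2 host=store-01-B              # move half the OSDs over
  ceph osd crush move osd.3 host=store-01-B
  ceph config set osd.2 crush_location "host=store-01-B" # persist across OSD restarts
  ceph config set osd.3 crush_location "host=store-01-B"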
[ceph-users] Re: cephadm bootstrap on 3 network clusters
Hi Sebastian, as you say "more than 3 public networks": did you manage to get Ceph daemons listening on multiple public interfaces? I'm looking for such a possibility, as the daemons seem to be bound to one interface only, but I cannot find any how-to.

Thanks

Patrick

Le 03/01/2024 à 21:31, Sebastian a écrit :
Hi,
check the routing table and default gateway and eventually fix it. Use an IP instead of a DNS name.
I have a more complicated situation :D I have more than 3 public networks and cluster networks…
BR,
Sebastian

On Jan 3, 2024, at 16:40, Luis Domingues wrote:
Why? The public network should not have any restrictions between the Ceph nodes. Same with the cluster network.
Internal policies and network rules.
Luis Domingues
Proton AG

On Wednesday, 3 January 2024 at 16:15, Robert Sander wrote:
Hi Luis,
On 1/3/24 16:12, Luis Domingues wrote:
My issue is that mon1 cannot connect via SSH to itself using the pub network, and bootstrap fails at the end when cephadm tries to add mon1 to the list of hosts.
Why? The public network should not have any restrictions between the Ceph nodes. Same with the cluster network.
Regards
--
Robert Sander
Heinlein Consulting GmbH
Schwedter Str. 8/9b, 10119 Berlin
https://www.heinlein-support.de
Tel: 030 / 405051-43
Fax: 030 / 405051-19
Amtsgericht Berlin-Charlottenburg - HRB 220009 B
Geschäftsführer: Peer Heinlein - Sitz: Berlin
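One avenue I plan to explore (an untested assumption on my side, not something confirmed in this thread): public_network accepts a comma-separated list of subnets, so several public networks can at least be declared; whether each daemon then actually binds to more than one interface is another matter. The subnets below are placeholders:

  ceph config set global public_network "192.168.1.0/24,10.10.0.0/24"
  ceph config set global cluster_network "172.16.0.0/24"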
[ceph-users] Re: EC Profiles & DR
Le 06/12/2023 à 16:21, Frank Schilder a écrit :
Hi, the post linked in the previous message is a good source for different approaches.

To provide some first-hand experience, I was operating a pool with a 6+2 EC profile on 4 hosts for a while (until we got more hosts) and the "subdivide a physical host into 2 crush-buckets" approach is actually working best (I basically tried all the approaches described in the linked post and they all had pitfalls). Procedure is more or less:

- add a second (logical) host bucket for each physical host by suffixing the host name with "-B" (ceph osd crush add-bucket <hostname>-B host)
- move half the OSDs per host to this new host bucket (ceph osd crush move osd.ID host=HOSTNAME-B)
- make this location persist across reboots of the OSDs (ceph config set osd.ID crush_location "host=HOSTNAME-B")

This will allow you to move OSDs back easily when you get more hosts and can afford the recommended 1 shard per host. It will also show which OSDs are moved, and where, with a simple "ceph config dump | grep crush_location". Best of all, you don't have to fiddle around with crush maps and hope they do what you want. Just use failure domain host and you are good. No more than 2 host buckets per physical host means no more than 2 shards per physical host with default placement rules.

I was operating this set-up with min_size=6 and feeling bad about it due to the reduced maintainability (risk of data loss during maintenance). It's not great really, but sometimes there is no way around it. I was happy when I got the extra hosts.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

From: Curt
Sent: Wednesday, December 6, 2023 3:56 PM
To: Patrick Begou
Cc: ceph-users@ceph.io
Subject: [ceph-users] Re: EC Profiles & DR

Hi Patrick,
Yes, K and M are chunks, but the default crush map is a chunk per host, which is probably the best way to do it, but I'm no expert. I'm not sure why you would want to do a crush map with 2 chunks per host and min size 4, as it's just asking for trouble at some point, in my opinion. Anyway, take a look at this post if you're interested in doing 2 chunks per host; it will give you an idea of the crushmap setup: https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/NB3M22GNAC7VNWW7YBVYTH6TBZOYLTWA/
Regards,
Curt

Thanks all for these details, which clarify many things for me.

Rich, yes I'm starting with 5 nodes and 4 HDD/node to set up the first Ceph cluster in the laboratory, and my goal is to increase this cluster (maybe up to 10 nodes) and to add storage in the nodes (up to 12 OSDs per node). It is a starting point for capacity storage connected to my two clusters (400 cores + 256 cores).

Thanks Frank for these details; as a newbie I would never have thought of this strategy. In my mind, this is the best way for starting the first setup and moving to a more standard configuration later. I have the full template now, I just have to dive deeper into the details to build it.

Patrick
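For the archives: the crushmap setup described in the linked thread boils down to a rule of roughly this shape (a sketch reconstructed from the discussion, not copied from the post; the id and name are placeholders). With a 6+2 profile it selects 4 hosts and 2 OSDs on each:

  rule ec62_two_per_host {
      id 2
      type erasure
      step set_chooseleaf_tries 5
      step set_choose_tries 100
      step take default
      step choose indep 4 type host        # pick 4 distinct hosts
      step chooseleaf indep 2 type osd     # 2 OSDs (chunks) on each host
      step emit
  }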
[ceph-users] Re: EC Profiles & DR
Le 06/12/2023 à 00:11, Rich Freeman a écrit :
On Tue, Dec 5, 2023 at 6:35 AM Patrick Begou wrote:
Ok, so I've misunderstood the meaning of failure domain. If there is no way to request using 2 OSDs/node with node as failure domain, then with 5 nodes k=3+m=1 is not secure enough and I will have to use k=2+m=2, so like a RAID1 setup. A little bit better than replication from the point of view of global storage capacity.

I'm not sure what you mean by requesting 2 OSDs/node. If the failure domain is set to the host, then by default k/m refer to hosts, and the PGs will be spread across all OSDs on all hosts, but with any particular PG only being present on one OSD on each host. You can get fancy with device classes and crush rules and such and be more specific with how they're allocated, but that would be the typical behavior.

Since k/m refer to hosts, k+m must be less than or equal to the number of hosts, or you'll have a degraded pool because there won't be enough hosts to allocate them all. It won't ever stack them across multiple OSDs on the same host with that configuration.

k=2, m=2 with min_size=3 would require at least 4 hosts (k+m), and would allow you to operate degraded with a single host down; with two hosts down the PGs would become inactive but would still be recoverable. While strictly speaking only 4 hosts are required, you'd do better to have more than that, since then the cluster can immediately recover from a loss, assuming you have sufficient space. As you say, it is no more space-efficient than RAID1 or size=2, and it suffers write amplification for modifications, but it does allow recovery after the loss of up to two hosts, and you can operate degraded with one host down, which allows for somewhat high availability.

Hi Rich,

My understanding was that k and m were EC chunks, not hosts. Of course, if k and m are hosts, the best choice would be k=2 and m=2.

When Christian wrote:
"For example if you run an EC=4+2 profile on 3 hosts you can structure your crushmap so that you have 2 chunks per host. This means even if one host is down you are still guaranteed to have 4 chunks available."
this is what I had thought before (using 5 nodes instead of 3 as in Christian's example). But it does not match what you explain if k and m are nodes. I'm a little bit confused with crushmap settings.

Patrick
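To pin down the terminology in this exchange: k and m always count chunks; the crush-failure-domain of the EC profile decides what the chunks are spread across, and with failure domain host the default rule places at most one chunk per host. A minimal sketch, with made-up profile and pool names:

  # 4+2 chunks, at most one chunk per host (the usual recommendation)
  ceph osd erasure-code-profile set ec42host k=4 m=2 crush-failure-domain=host
  # 4+2 chunks, spread over OSDs regardless of host
  ceph osd erasure-code-profile set ec42osd k=4 m=2 crush-failure-domain=osd
  # create a pool using the host-based profile
  ceph osd pool create ecpool 32 32 erasure ec42host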
[ceph-users] Re: EC Profiles & DR
Ok, so I've misunderstood the meaning of failure domain. If there is no way to request using 2 OSDs/node with node as failure domain, then with 5 nodes k=3+m=1 is not secure enough and I will have to use k=2+m=2, so like a RAID1 setup. A little bit better than replication from the point of view of global storage capacity.

Patrick

Le 05/12/2023 à 12:19, David C. a écrit :
Hi,
To return to my comparison with SANs: on a SAN you have spare disks to repair a failed disk. On Ceph, you therefore need at least one more host (k+m+1). If we take into consideration the formalities and delivery times of a new server, k+m+2 is not a luxury (depending on the growth of your volume).
Cordialement,
David CASIER

Le mar. 5 déc. 2023 à 11:17, Patrick Begou a écrit :
Hi Robert,
Le 05/12/2023 à 10:05, Robert Sander a écrit :
> On 12/5/23 10:01, duluxoz wrote:
>> Thanks David, I knew I had something wrong :-)
>>
>> Just for my own edification: why is k=2, m=1 not recommended for
>> production? Considered too "fragile", or something else?
>
> It is the same as a replicated pool with size=2. Only one host can go
> down. After that you risk losing data.
>
> Erasure coding is possible with a cluster size of 10 nodes or more.
> With smaller clusters you have to go with replicated pools.
>
Could you explain why 10 nodes are required for EC?
On my side, I'm working on building my first (small) Ceph cluster using E.C. and I was thinking about 5 nodes and k=4 m=2. With a failure domain of host and several OSDs per node, in my mind this setup may run degraded with 3 nodes using 2 distinct OSDs per node, with the ultimate possibility of losing an additional node without losing data. Of course with sufficient free storage available.
Am I totally wrong in my first ceph approach?

Patrick
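As a concrete cross-check of the k=2+m=2 numbers above (pool name hypothetical): min_size for an EC pool defaults to k+1 = 3, so the pool stays active with one host down (3 shards reachable) and goes inactive, but still recoverable, with two hosts down:

  ceph osd pool get ecpool min_size
  ceph osd pool set ecpool min_size 3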
[ceph-users] Re: EC Profiles & DR
Hi Robert,

Le 05/12/2023 à 10:05, Robert Sander a écrit :
On 12/5/23 10:01, duluxoz wrote:
Thanks David, I knew I had something wrong :-)
Just for my own edification: why is k=2, m=1 not recommended for production? Considered too "fragile", or something else?

It is the same as a replicated pool with size=2. Only one host can go down. After that you risk losing data.
Erasure coding is possible with a cluster size of 10 nodes or more. With smaller clusters you have to go with replicated pools.

Could you explain why 10 nodes are required for EC?

On my side, I'm working on building my first (small) Ceph cluster using E.C. and I was thinking about 5 nodes and k=4 m=2. With a failure domain of host and several OSDs per node, in my mind this setup may run degraded with 3 nodes using 2 distinct OSDs per node, with the ultimate possibility of losing an additional node without losing data. Of course with sufficient free storage available.

Am I totally wrong in my first ceph approach?

Patrick
[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD
Hi all,

First of all, I apologize if I've not done things correctly, but these are some test results.

1) I've compiled the main branch in a fresh podman container (Alma Linux 8) and installed it. Successful!

2) I have made a copy of the /etc/ceph directory of the host (member of the ceph cluster in Pacific 16.2.14) into this container (good or bad idea?)

3) "ceph-volume inventory" works, but with some error messages:

[root@74285dcfa91f etc]# ceph-volume inventory
 stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
 stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
 stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
 stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.
 stderr: Unknown device, --name=, --path=, or absolute path in /dev/ or /sys expected.

Device Path    Size       Device nodes  rotates  available  Model name
/dev/sdc       232.83 GB  sdc           True     True       SAMSUNG HE253GJ
/dev/sda       232.83 GB  sda           True     False      SAMSUNG HE253GJ
/dev/sdb       465.76 GB  sdb           True     False      WDC WD5003ABYX-1

4) ceph version shows:

[root@74285dcfa91f etc]# ceph -v
ceph version 18.0.0-6846-g2706ecac4a9 (2706ecac4a90447420904e42d6e0445134dff2be) reef (dev)

5) lsblk works (container launched with the "--privileged" flag):

[root@74285dcfa91f etc]# lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    1 232.9G  0 disk
|-sda1   8:1        3.9G  0 part
|-sda2   8:2    1   3.9G  0 part [SWAP]
`-sda3   8:3    1   225G  0 part
sdb      8:16   1 465.8G  0 disk
sdc      8:32   1 232.9G  0 disk

But some commands do not work (my setup or Ceph?):

[root@74285dcfa91f etc]# ceph orch device zap mostha1.legi.grenoble-inp.fr /dev/sdc --force
Error EINVAL: Device path '/dev/sdc' not found on host 'mostha1.legi.grenoble-inp.fr'
[root@74285dcfa91f etc]#
[root@74285dcfa91f etc]# ceph orch device ls
[root@74285dcfa91f etc]#

Patrick

Le 24/10/2023 à 22:43, Zack Cerza a écrit :
That's correct - it's the removable flag that's causing the disks to be excluded. I actually just merged this PR last week: https://github.com/ceph/ceph/pull/49954
One of the changes it made was to enable removable (but not USB) devices, as there are vendors that report hot-swappable drives as removable. Patrick, it looks like this may resolve your issue as well.

On Tue, Oct 24, 2023 at 5:57 AM Eugen Block wrote:
Hi,
> Maybe because they are hot-swappable hard drives.
yes, that's my assumption as well.

Zitat von Patrick Begou :
Hi Eugen,
Yes Eugen, all the devices /dev/sd[abc] have the removable flag set to 1. Maybe because they are hot-swappable hard drives. I have contacted the commit author Zack Cerza and he asked me for some additional tests too this morning. I add him in copy to this mail.
Patrick

Le 24/10/2023 à 12:57, Eugen Block a écrit :
Hi,
just to confirm, could you check that the disk which is *not* discovered by 16.2.11 has a "removable" flag?
cat /sys/block/sdX/removable
I could reproduce it as well on a test machine with a USB thumb drive (live distro) which is excluded in 16.2.11 but is shown in 16.2.10. Although I'm not a developer, I tried to understand what changes were made in https://github.com/ceph/ceph/pull/46375/files#diff-330f9319b0fe352dff0486f66d3c4d6a6a3d48efd900b2ceb86551cfd88dc4c4R771 and there's this line:
if get_file_contents(os.path.join(_sys_block_path, dev, 'removable')) == "1":
    continue
The thumb drive is removable, of course; apparently that is filtered here.
Regards,
Eugen

Zitat von Patrick Begou :
Le 23/10/2023 à 03:04, 544463...@qq.com a écrit :
I think you can try to roll back this part of the python code and wait for your good news :)
Not so easy:

[root@e9865d9a7f41 ceph]# git revert 4fc6bc394dffaf3ad375ff29cbb0a3eb9e4dbefc
Auto-merging src/ceph-volume/ceph_volume/tests/util/test_device.py
CONFLICT (content): Merge conflict in src/ceph-volume/ceph_volume/tests/util/test_device.py
Auto-merging src/ceph-volume/ceph_volume/util/device.py
CONFLICT (content): Merge conflict in src/ceph-volume/ceph_volume/util/device.py
Auto-merging src/ceph-volume/ceph_volume/util/disk.py
CONFLICT (content): Merge conflict in src/ceph-volume/ceph_volume/util/disk.py
error: could not revert 4fc6bc394df... ceph-volume: Optionally consume loop devices

Patrick
[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD
Hi,

Running git pull this morning I saw the patch on the main branch and tried to compile it, but it fails with cython on rbd.pyx. I have many similar errors:

rbd.pyx:760:44: Cannot assign type 'int (*)(uint64_t, uint64_t, void *) except? -1' to 'librbd_progress_fn_t'. Exception values are incompatible. Suggest adding 'noexcept' to type 'int (uint64_t, uint64_t, void *) except? -1'.
rbd.pyx:763:23: Cannot assign type 'int (*)(uint64_t, uint64_t, void *) except? -1 nogil' to 'librbd_progress_fn_t'. Exception values are incompatible. Suggest adding 'noexcept' to type 'int (uint64_t, uint64_t, void *) except? -1 nogil'.
rbd.pyx:868:44: Cannot assign type 'int (*)(uint64_t, uint64_t, void *) except? -1' to 'librbd_progress_fn_t'. Exception values are incompatible. Suggest adding 'noexcept' to type 'int (uint64_t, uint64_t, void *) except? -1'.

I don't know cython at all. I've just run:

./install-deps.sh
./do_cmake.sh
cd build
ninja

# gcc --version
gcc (GCC) 11.2.1 20220127 (Red Hat 11.2.1-9)

Any suggestions?

Thanks

Patrick

Le 24/10/2023 à 22:43, Zack Cerza a écrit :
That's correct - it's the removable flag that's causing the disks to be excluded. I actually just merged this PR last week: https://github.com/ceph/ceph/pull/49954
One of the changes it made was to enable removable (but not USB) devices, as there are vendors that report hot-swappable drives as removable. Patrick, it looks like this may resolve your issue as well.

On Tue, Oct 24, 2023 at 5:57 AM Eugen Block wrote:
Hi,
> Maybe because they are hot-swappable hard drives.
yes, that's my assumption as well.

Zitat von Patrick Begou :
Hi Eugen,
Yes Eugen, all the devices /dev/sd[abc] have the removable flag set to 1. Maybe because they are hot-swappable hard drives. I have contacted the commit author Zack Cerza and he asked me for some additional tests too this morning. I add him in copy to this mail.
Patrick

Le 24/10/2023 à 12:57, Eugen Block a écrit :
Hi,
just to confirm, could you check that the disk which is *not* discovered by 16.2.11 has a "removable" flag?
cat /sys/block/sdX/removable
I could reproduce it as well on a test machine with a USB thumb drive (live distro) which is excluded in 16.2.11 but is shown in 16.2.10. Although I'm not a developer, I tried to understand what changes were made in https://github.com/ceph/ceph/pull/46375/files#diff-330f9319b0fe352dff0486f66d3c4d6a6a3d48efd900b2ceb86551cfd88dc4c4R771 and there's this line:
if get_file_contents(os.path.join(_sys_block_path, dev, 'removable')) == "1":
    continue
The thumb drive is removable, of course; apparently that is filtered here.
Regards,
Eugen

Zitat von Patrick Begou :
Le 23/10/2023 à 03:04, 544463...@qq.com a écrit :
I think you can try to roll back this part of the python code and wait for your good news :)
Not so easy:

[root@e9865d9a7f41 ceph]# git revert 4fc6bc394dffaf3ad375ff29cbb0a3eb9e4dbefc
Auto-merging src/ceph-volume/ceph_volume/tests/util/test_device.py
CONFLICT (content): Merge conflict in src/ceph-volume/ceph_volume/tests/util/test_device.py
Auto-merging src/ceph-volume/ceph_volume/util/device.py
CONFLICT (content): Merge conflict in src/ceph-volume/ceph_volume/util/device.py
Auto-merging src/ceph-volume/ceph_volume/util/disk.py
CONFLICT (content): Merge conflict in src/ceph-volume/ceph_volume/util/disk.py
error: could not revert 4fc6bc394df...
ceph-volume: Optionally consume loop devices

Patrick
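In case someone hits the same wall: these "Suggest adding 'noexcept'" errors are what Cython 3.x emits when compiling bindings written for the 0.29 series, so one thing to try (an assumption on my side, not verified against this exact tree) is pinning an older Cython in the build environment before re-running do_cmake.sh:

  python3 -m pip install 'cython<3'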
[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD
Some tests: if, in Pacific 16.2.14, I disable lines 804 and 805 in /usr/lib/python3.6/site-packages/ceph_volume/util/disk.py:

804     if get_file_contents(os.path.join(_sys_block_path, dev, 'removable')) == "1":
805         continue

the command "ceph-volume inventory" works as in Octopus or as in Pacific < 16.2.11:

[ceph: root@mostha1 /]# ceph-volume inventory

Device Path    Size       Device nodes  rotates  available  Model name
/dev/sdc       232.83 GB  sdc           True     True       SAMSUNG HE253GJ
/dev/sda       232.83 GB  sda           True     False      SAMSUNG HE253GJ
/dev/sdb       465.76 GB  sdb           True     False      WDC WD5003ABYX-1

But:

1) "ceph orch device ls" still returns nothing.

2) I cannot zap the /dev/sdc device:

[ceph: root@mostha1 /]# ceph orch device zap mostha1.legi.grenoble-inp.fr /dev/sdc --force
Error EINVAL: Device path '/dev/sdc' not found on host 'mostha1.legi.grenoble-inp.fr'

3) I cannot manually add the sdc device as an OSD:

[ceph: root@mostha1 /]# ceph orch daemon add osd mostha1.legi.grenoble-inp.fr:/dev/sdc
Created no osd(s) on host mostha1.legi.grenoble-inp.fr; already created?

Even if the device is present and unused:

[ceph: root@mostha1 /]# lsblk
NAME              MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda                 8:0    1 232.9G  0 disk
|-sda1              8:1    1   3.9G  0 part /rootfs/boot
|-sda2              8:2    1   3.9G  0 part [SWAP]
`-sda3              8:3    1   225G  0 part
  |-al8vg-rootvol 253:0    0  48.8G  0 lvm  /rootfs
  |-al8vg-homevol 253:2    0   9.8G  0 lvm  /rootfs/home
  |-al8vg-tmpvol  253:3    0   9.8G  0 lvm  /rootfs/tmp
  `-al8vg-varvol  253:4    0  79.8G  0 lvm  /rootfs/var
sdb                 8:16   1 465.8G  0 disk
`-ceph--08827fdc--136e--4070--97e9--e5e8b3970766-osd--block--7dec1808--d6f4--4f90--ac74--75a4346e1df5 253:1 0 465.8G 0 lvm
sdc                 8:32   1 232.9G  0 disk

Patrick

Le 24/10/2023 à 13:38, Patrick Begou a écrit :
Hi Eugen,
Yes Eugen, all the devices /dev/sd[abc] have the removable flag set to 1. Maybe because they are hot-swappable hard drives. I have contacted the commit author Zack Cerza and he asked me for some additional tests too this morning. I add him in copy to this mail.
Patrick

Le 24/10/2023 à 12:57, Eugen Block a écrit :
Hi,
just to confirm, could you check that the disk which is *not* discovered by 16.2.11 has a "removable" flag?
cat /sys/block/sdX/removable
I could reproduce it as well on a test machine with a USB thumb drive (live distro) which is excluded in 16.2.11 but is shown in 16.2.10. Although I'm not a developer, I tried to understand what changes were made in https://github.com/ceph/ceph/pull/46375/files#diff-330f9319b0fe352dff0486f66d3c4d6a6a3d48efd900b2ceb86551cfd88dc4c4R771 and there's this line:
if get_file_contents(os.path.join(_sys_block_path, dev, 'removable')) == "1":
    continue
The thumb drive is removable, of course; apparently that is filtered here.
Regards,
Eugen

Zitat von Patrick Begou :
Le 23/10/2023 à 03:04, 544463...@qq.com a écrit :
I think you can try to roll back this part of the python code and wait for your good news :)
Not so easy:

[root@e9865d9a7f41 ceph]# git revert 4fc6bc394dffaf3ad375ff29cbb0a3eb9e4dbefc
Auto-merging src/ceph-volume/ceph_volume/tests/util/test_device.py
CONFLICT (content): Merge conflict in src/ceph-volume/ceph_volume/tests/util/test_device.py
Auto-merging src/ceph-volume/ceph_volume/util/device.py
CONFLICT (content): Merge conflict in src/ceph-volume/ceph_volume/util/device.py
Auto-merging src/ceph-volume/ceph_volume/util/disk.py
CONFLICT (content): Merge conflict in src/ceph-volume/ceph_volume/util/disk.py
error: could not revert 4fc6bc394df...
ceph-volume: Optionally consume loop devices

Patrick
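For completeness, a workaround worth trying while the orchestrator refuses to see the device (untested on this cluster, so treat it as a sketch): bypass the orchestrator and let ceph-volume zap the device directly on the host, then retry the orch commands:

  cephadm ceph-volume -- lvm zap /dev/sdc --destroy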
[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD
Hi Eugen,

Yes Eugen, all the devices /dev/sd[abc] have the removable flag set to 1. Maybe because they are hot-swappable hard drives. I have contacted the commit author Zack Cerza and he asked me for some additional tests too this morning. I add him in copy to this mail.

Patrick

Le 24/10/2023 à 12:57, Eugen Block a écrit :
Hi,
just to confirm, could you check that the disk which is *not* discovered by 16.2.11 has a "removable" flag?
cat /sys/block/sdX/removable
I could reproduce it as well on a test machine with a USB thumb drive (live distro) which is excluded in 16.2.11 but is shown in 16.2.10. Although I'm not a developer, I tried to understand what changes were made in https://github.com/ceph/ceph/pull/46375/files#diff-330f9319b0fe352dff0486f66d3c4d6a6a3d48efd900b2ceb86551cfd88dc4c4R771 and there's this line:
if get_file_contents(os.path.join(_sys_block_path, dev, 'removable')) == "1":
    continue
The thumb drive is removable, of course; apparently that is filtered here.
Regards,
Eugen

Zitat von Patrick Begou :
Le 23/10/2023 à 03:04, 544463...@qq.com a écrit :
I think you can try to roll back this part of the python code and wait for your good news :)
Not so easy:

[root@e9865d9a7f41 ceph]# git revert 4fc6bc394dffaf3ad375ff29cbb0a3eb9e4dbefc
Auto-merging src/ceph-volume/ceph_volume/tests/util/test_device.py
CONFLICT (content): Merge conflict in src/ceph-volume/ceph_volume/tests/util/test_device.py
Auto-merging src/ceph-volume/ceph_volume/util/device.py
CONFLICT (content): Merge conflict in src/ceph-volume/ceph_volume/util/device.py
Auto-merging src/ceph-volume/ceph_volume/util/disk.py
CONFLICT (content): Merge conflict in src/ceph-volume/ceph_volume/util/disk.py
error: could not revert 4fc6bc394df... ceph-volume: Optionally consume loop devices

Patrick
[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD
Le 23/10/2023 à 03:04, 544463...@qq.com a écrit :
I think you can try to roll back this part of the python code and wait for your good news :)

Not so easy:

[root@e9865d9a7f41 ceph]# git revert 4fc6bc394dffaf3ad375ff29cbb0a3eb9e4dbefc
Auto-merging src/ceph-volume/ceph_volume/tests/util/test_device.py
CONFLICT (content): Merge conflict in src/ceph-volume/ceph_volume/tests/util/test_device.py
Auto-merging src/ceph-volume/ceph_volume/util/device.py
CONFLICT (content): Merge conflict in src/ceph-volume/ceph_volume/util/device.py
Auto-merging src/ceph-volume/ceph_volume/util/disk.py
CONFLICT (content): Merge conflict in src/ceph-volume/ceph_volume/util/disk.py
error: could not revert 4fc6bc394df... ceph-volume: Optionally consume loop devices

Patrick
[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD
Hi all,

ending with git bisect just now shows:

4fc6bc394dffaf3ad375ff29cbb0a3eb9e4dbefc is the first bad commit
commit 4fc6bc394dffaf3ad375ff29cbb0a3eb9e4dbefc
Author: Zack Cerza
Date: Tue May 17 11:29:02 2022 -0600

    ceph-volume: Optionally consume loop devices

    A similar proposal was rejected in #24765; I understand the logic behind the rejection, but this will allow us to run Ceph clusters on machines that lack disk resources for testing purposes. We just need to make it impossible to accidentally enable, and make it clear it is unsupported.

    Signed-off-by: Zack Cerza
    (cherry picked from commit c7f017b21ade3762ba5b7b9688bed72c6b60dc0e)

 .../ceph_volume/tests/util/test_device.py  | 17 +++
 src/ceph-volume/ceph_volume/util/device.py | 14 +++--
 src/ceph-volume/ceph_volume/util/disk.py   | 59 ++
 3 files changed, 78 insertions(+), 12 deletions(-)

I will try to investigate next week, but if some Ceph expert developers can have a look at this commit ;-)

Have a nice week-end

Patrick

Le 18/10/2023 à 13:48, Patrick Begou a écrit :
Hi all,
I'm trying to catch the faulty commit. I'm able to build Ceph from the git repo in a fresh podman container, but at this time the lsblk command returns nothing in my container. In ceph containers lsblk works. So something is wrong with launching my podman container (or different from launching ceph containers) and I cannot find what.
Any help about this step?
Thanks
Patrick

Le 13/10/2023 à 09:18, Eugen Block a écrit :
Trying to resend with the attachment. I can't really find anything suspicious, ceph-volume (16.2.11) does recognize /dev/sdc though:

[2023-10-12 08:58:14,135][ceph_volume.process][INFO ] stdout NAME="sdc" KNAME="sdc" PKNAME="" MAJ:MIN="8:32" FSTYPE="" MOUNTPOINT="" LABEL="" UUID="" RO="0" RM="1" MODEL="SAMSUNG HE253GJ " SIZE="232.9G" STATE="running" OWNER="root" GROUP="disk" MODE="brw-rw" ALIGNMENT="0" PHY-SEC="512" LOG-SEC="512" ROTA="1" SCHED="mq-deadline" TYPE="disk" DISC-ALN="0" DISC-GRAN="0B" DISC-MAX="0B" DISC-ZERO="0" PKNAME="" PARTLABEL=""
[2023-10-12 08:58:14,139][ceph_volume.util.system][INFO ] Executable pvs found on the host, will use /sbin/pvs
[2023-10-12 08:58:14,140][ceph_volume.process][INFO ] Running command: nsenter --mount=/rootfs/proc/1/ns/mnt --ipc=/rootfs/proc/1/ns/ipc --net=/rootfs/proc/1/ns/net --uts=/rootfs/proc/1/ns/uts /sbin/pvs --noheadings --readonly --units=b --nosuffix --separator=";" -o pv_name,vg_name,pv_count,lv_count,vg_attr,vg_extent_count,vg_free_count,vg_extent_size

But apparently it just stops after that. I already tried to find a debug log-level for ceph-volume but it's not applicable to all subcommands. The cephadm.log also just stops without even finishing the "copying blob", which makes me wonder if it actually pulls the entire image? I assume you have enough free disk space (otherwise I would expect a message "failed to pull target image"); do you see any other warnings in syslog or something? Or are the logs incomplete? Maybe someone else finds any clues in the logs...
Regards,
Eugen

Zitat von Patrick Begou :
Hi Eugen,
You will find in attachment cephadm.log and ceph-volume.log. Each contains the outputs for the 2 versions. Either v16.2.10-20220920 is really more verbose, or v16.2.11-20230125 does not execute the whole detection process.
Patrick

Le 12/10/2023 à 09:34, Eugen Block a écrit :
Good catch, and I found the thread I had in my mind, it was this exact one. :-D Anyway, can you share the ceph-volume.log from the working and the not working attempt?
I tried to look for something significant in the pacific release notes for 16.2.11, and there were some changes to ceph-volume, but I'm not sure what it could be.

Zitat von Patrick Begou :
I've run additional tests with Pacific releases, and with "ceph-volume inventory" things went wrong with the first v16.2.11 release (v16.2.11-20230125).

=== Ceph v16.2.10-20220920 ===
Device Path    Size       rotates  available  Model name
/dev/sdc       232.83 GB  True     True       SAMSUNG HE253GJ
/dev/sda       232.83 GB  True     False      SAMSUNG HE253GJ
/dev/sdb       465.76 GB  True     False      WDC WD5003ABYX-1

=== Ceph v16.2.11-20230125 ===
Device Path    Size       Device nodes  rotates  available  Model name

May be this could help to see what has changed?
Patrick

Le 11/10/2023 à 17:38, Eugen Block a écrit :
That's really strange. Just out of curiosity, have you tried Quincy (
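For readers wanting to reproduce this kind of result, the bisect run looks roughly like the following (a sketch; the good/bad tags match the versions tested earlier in this thread):

  git bisect start
  git bisect bad v16.2.11
  git bisect good v16.2.10
  # at each step: build, run "ceph-volume inventory", then mark
  git bisect good    # or: git bisect bad
  git bisect reset   # when finished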
[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD
Hi all,

I'm trying to catch the faulty commit. I'm able to build Ceph from the git repo in a fresh podman container, but at this time the lsblk command returns nothing in my container. In ceph containers lsblk works. So something is wrong with launching my podman container (or different from launching ceph containers) and I cannot find what.

Any help about this step?

Thanks

Patrick

Le 13/10/2023 à 09:18, Eugen Block a écrit :
Trying to resend with the attachment. I can't really find anything suspicious, ceph-volume (16.2.11) does recognize /dev/sdc though:

[2023-10-12 08:58:14,135][ceph_volume.process][INFO ] stdout NAME="sdc" KNAME="sdc" PKNAME="" MAJ:MIN="8:32" FSTYPE="" MOUNTPOINT="" LABEL="" UUID="" RO="0" RM="1" MODEL="SAMSUNG HE253GJ " SIZE="232.9G" STATE="running" OWNER="root" GROUP="disk" MODE="brw-rw" ALIGNMENT="0" PHY-SEC="512" LOG-SEC="512" ROTA="1" SCHED="mq-deadline" TYPE="disk" DISC-ALN="0" DISC-GRAN="0B" DISC-MAX="0B" DISC-ZERO="0" PKNAME="" PARTLABEL=""
[2023-10-12 08:58:14,139][ceph_volume.util.system][INFO ] Executable pvs found on the host, will use /sbin/pvs
[2023-10-12 08:58:14,140][ceph_volume.process][INFO ] Running command: nsenter --mount=/rootfs/proc/1/ns/mnt --ipc=/rootfs/proc/1/ns/ipc --net=/rootfs/proc/1/ns/net --uts=/rootfs/proc/1/ns/uts /sbin/pvs --noheadings --readonly --units=b --nosuffix --separator=";" -o pv_name,vg_name,pv_count,lv_count,vg_attr,vg_extent_count,vg_free_count,vg_extent_size

But apparently it just stops after that. I already tried to find a debug log-level for ceph-volume but it's not applicable to all subcommands. The cephadm.log also just stops without even finishing the "copying blob", which makes me wonder if it actually pulls the entire image? I assume you have enough free disk space (otherwise I would expect a message "failed to pull target image"); do you see any other warnings in syslog or something? Or are the logs incomplete? Maybe someone else finds any clues in the logs...
Regards,
Eugen

Zitat von Patrick Begou :
Hi Eugen,
You will find in attachment cephadm.log and ceph-volume.log. Each contains the outputs for the 2 versions. Either v16.2.10-20220920 is really more verbose, or v16.2.11-20230125 does not execute the whole detection process.
Patrick

Le 12/10/2023 à 09:34, Eugen Block a écrit :
Good catch, and I found the thread I had in my mind, it was this exact one. :-D Anyway, can you share the ceph-volume.log from the working and the not working attempt?
Zitat von Patrick Begou : Hi Eugen, [root@mostha1 ~]# rpm -q cephadm cephadm-16.2.14-0.el8.noarch Log associated to the 2023-10-11 16:16:02,167 7f820515fb80 DEBUG cephadm ['gather-facts'] 2023-10-11 16:16:02,208 7f820515fb80 DEBUG /bin/podman: 4.4.1 2023-10-11 16:16:02,313 7f820515fb80 DEBUG sestatus: SELinux status: disabled 2023-10-11 16:16:02,317 7f820515fb80 DEBUG sestatus: SELinux status: disabled 2023-10-11 16:16:02,322 7f820515fb80 DEBUG sestatus: SELinux status: disabled 2023-10-11 16:16:02,326 7f820515fb80 DEBUG sestatus: SELinux status: disabled 2023-10-11 16:16:02,329 7f820515fb80 DEBUG sestatus: SELinux status: disabled 2023-10-11 16:16:02,333 7f820515fb80 DEBUG sestatus: SELinux status: disabled 2023-10-11 16:16:0
[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD
Hi Johan,

So it is not O.S. related, as you are running Debian and I am running Alma Linux. But I'm surprised that so few people hit this bug.

Patrick

Le 13/10/2023 à 17:38, Johan a écrit :
At home I'm running a small cluster, Ceph v17.2.6, Debian 11 Bullseye. I have recently added a new server to the cluster but face the same problem as Patrick: I can't add any HDD, Ceph doesn't recognise them. I have run the same tests as Patrick, using Ceph v14-v18, and as Patrick showed, the problem appears in Ceph v16.2.11-20230125.

=== Ceph v16.2.10-20220920 ===
$ sudo cephadm --image quay.io/ceph/ceph:v16.2.10-20220920 ceph-volume inventory
Inferring fsid 5592891c-30e4-11ed-b720-f02f741f58ac

Device Path    Size       rotates  available  Model name
/dev/nvme0n1   931.51 GB  False    False      KINGSTON SNV2S1000G
/dev/nvme1n1   931.51 GB  False    False      KINGSTON SNV2S1000G
/dev/sda       3.64 TB    True     False      WDC WD4003FFBX-6
/dev/sdb       5.46 TB    True     False      WDC WD6003FFBX-6
/dev/sdc       7.28 TB    True     False      ST8000NE001-2M71
/dev/sdd       7.28 TB    True     False      WDC WD8003FFBX-6

=== Ceph v16.2.11-20230125 ===
$ sudo cephadm --image quay.io/ceph/ceph:v16.2.11-20230125 ceph-volume inventory
Inferring fsid 5592891c-30e4-11ed-b720-f02f741f58ac

Device Path    Size       Device nodes         rotates  available  Model name
/dev/md0       9.30 GB    nvme1n1p2,nvme0n1p2  False    False
/dev/md1       59.57 GB   nvme0n1p3,nvme1n1p3  False    False
/dev/md2       279.27 GB  nvme1n1p4,nvme0n1p4  False    False
/dev/nvme0n1   931.51 GB  nvme0n1              False    False      KINGSTON SNV2S1000G
/dev/nvme1n1   931.51 GB  nvme1n1              False    False      KINGSTON SNV2S1000G

/Johan
[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD
The server has enough available storage:

[root@mostha1 log]# df -h
Sys. de fichiers           Taille Utilisé Dispo Uti% Monté sur
devtmpfs                      24G       0   24G   0% /dev
tmpfs                         24G     84K   24G   1% /dev/shm
tmpfs                         24G    195M   24G   1% /run
tmpfs                         24G       0   24G   0% /sys/fs/cgroup
/dev/mapper/al8vg-rootvol     49G    6,5G   43G  14% /
/dev/sda1                    3,8G    412M  3,2G  12% /boot
/dev/mapper/al8vg-varvol      20G    9,7G   11G  49% /var
/dev/mapper/al8vg-tmpvol     9,8G    103M  9,7G   2% /tmp
/dev/mapper/al8vg-homevol    9,8G    103M  9,7G   2% /home
tmpfs                        4,7G       0  4,7G   0% /run/user/0
overlay                       20G    9,7G   11G  49% /var/lib/containers/storage/overlay/b8769720357497ebdbf68768753da154b3d63cfbef254036441af60a91649127/merged
overlay                       20G    9,7G   11G  49% /var/lib/containers/storage/overlay/2eed15daec130da50530621740025655ecd961e1b1855f35922f03561960d999/merged
overlay                       20G    9,7G   11G  49% /var/lib/containers/storage/overlay/4d0b4f0b4063cce3f983beda80bac78dd3b5f30379d2eb96daefef8ddfaf/merged
overlay                       20G    9,7G   11G  49% /var/lib/containers/storage/overlay/129c5d3e070f80f17a79c1f172b60c2fc0f30a84b51b07ea207dc5868cd1d7f0/merged
overlay                       20G    9,7G   11G  49% /var/lib/containers/storage/overlay/c41d6bdaf941d16fd80326ef5dae6a02524d3f41bcb64cb29bda2bd5816fee9a/merged
overlay                       20G    9,7G   11G  49% /var/lib/containers/storage/overlay/1b6c1c893e7ed2c128378bdf2af408f3a834f3453a0505ac042099d6f484dc9b/merged
overlay                       20G    9,7G   11G  49% /var/lib/containers/storage/overlay/962e5c1380a60e9a54ac29eccb71667f13a5f9047b2ee98e6303a5fea613162f/merged
overlay                       20G    9,7G   11G  49% /var/lib/containers/storage/overlay/3578d0f5a70afce839017dec888908dead82fb50f90834e5b040e9fd2ada9fba/merged
overlay                       20G    9,7G   11G  49% /var/lib/containers/storage/overlay/7d9c35751388325c3da54f03981770aa49599a657c2dfe3ba9527884864f177d/merged

When I was testing different versions, I removed the tested images each time with "podman rmi":

for i in v16.2.10-20220920 v16.2.11-20230125 v16.2.11-20230209 v16.2.11-20230316; do
  echo "=== Ceph $i ==="
  cephadm --image quay.io/ceph/ceph:$i ceph-volume inventory
  id=$(podman images | grep " $i " | cut -c 46-59)
  podman rmi $id
done | tee trace.ceph16.2.txt

I do not know how to investigate; maybe with a "git bisect" between the 2 releases to catch the faulty commit in a podman container context. I'm not so familiar with containers and ceph.

Patrick

Le 13/10/2023 à 09:18, Eugen Block a écrit :
Trying to resend with the attachment. I can't really find anything suspicious, ceph-volume (16.2.11) does recognize /dev/sdc though:

[2023-10-12 08:58:14,135][ceph_volume.process][INFO ] stdout NAME="sdc" KNAME="sdc" PKNAME="" MAJ:MIN="8:32" FSTYPE="" MOUNTPOINT="" LABEL="" UUID="" RO="0" RM="1" MODEL="SAMSUNG HE253GJ " SIZE="232.9G" STATE="running" OWNER="root" GROUP="disk" MODE="brw-rw" ALIGNMENT="0" PHY-SEC="512" LOG-SEC="512" ROTA="1" SCHED="mq-deadline" TYPE="disk" DISC-ALN="0" DISC-GRAN="0B" DISC-MAX="0B" DISC-ZERO="0" PKNAME="" PARTLABEL=""
[2023-10-12 08:58:14,139][ceph_volume.util.system][INFO ] Executable pvs found on the host, will use /sbin/pvs
[2023-10-12 08:58:14,140][ceph_volume.process][INFO ] Running command: nsenter --mount=/rootfs/proc/1/ns/mnt --ipc=/rootfs/proc/1/ns/ipc --net=/rootfs/proc/1/ns/net --uts=/rootfs/proc/1/ns/uts /sbin/pvs --noheadings --readonly --units=b --nosuffix --separator=";" -o pv_name,vg_name,pv_count,lv_count,vg_attr,vg_extent_count,vg_free_count,vg_extent_size

But apparently it just stops after that. I already tried to find a debug log-level for ceph-volume but it's not applicable to all subcommands.
The cephadm.log also just stops without even finishing the "copying blob", which makes me wonder if it actually pulls the entire image? I assume you have enough free disk space (otherwise I would expect a message "failed to pull target image"); do you see any other warnings in syslog or something? Or are the logs incomplete? Maybe someone else finds any clues in the logs...
Regards,
Eugen

Zitat von Patrick Begou :
Hi Eugen,
You will find in attachment cephadm.log and ceph-volume.log. Each contains the outputs for the 2 versions.
[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD
Hi Eugen,

You will find in attachment cephadm.log and ceph-volume.log. Each contains the outputs for the 2 versions. Either v16.2.10-20220920 is really more verbose, or v16.2.11-20230125 does not execute the whole detection process.

Patrick

Le 12/10/2023 à 09:34, Eugen Block a écrit :
Good catch, and I found the thread I had in my mind, it was this exact one. :-D Anyway, can you share the ceph-volume.log from the working and the not working attempt? I tried to look for something significant in the pacific release notes for 16.2.11, and there were some changes to ceph-volume, but I'm not sure what it could be.

Zitat von Patrick Begou :
I've run additional tests with Pacific releases, and with "ceph-volume inventory" things went wrong with the first v16.2.11 release (v16.2.11-20230125).

=== Ceph v16.2.10-20220920 ===
Device Path    Size       rotates  available  Model name
/dev/sdc       232.83 GB  True     True       SAMSUNG HE253GJ
/dev/sda       232.83 GB  True     False      SAMSUNG HE253GJ
/dev/sdb       465.76 GB  True     False      WDC WD5003ABYX-1

=== Ceph v16.2.11-20230125 ===
Device Path    Size       Device nodes  rotates  available  Model name

May be this could help to see what has changed?
Patrick

Le 11/10/2023 à 17:38, Eugen Block a écrit :
That's really strange. Just out of curiosity, have you tried Quincy (and/or Reef) as well? I don't recall what inventory does in the background exactly, I believe Adam King mentioned that in some thread, maybe that can help here. I'll search for that thread tomorrow.

Zitat von Patrick Begou :
Hi Eugen,

[root@mostha1 ~]# rpm -q cephadm
cephadm-16.2.14-0.el8.noarch

Log associated to the

2023-10-11 16:16:02,167 7f820515fb80 DEBUG cephadm ['gather-facts']
2023-10-11 16:16:02,208 7f820515fb80 DEBUG /bin/podman: 4.4.1
2023-10-11 16:16:02,313 7f820515fb80 DEBUG sestatus: SELinux status: disabled
2023-10-11 16:16:02,317 7f820515fb80 DEBUG sestatus: SELinux status: disabled
2023-10-11 16:16:02,322 7f820515fb80 DEBUG sestatus: SELinux status: disabled
2023-10-11 16:16:02,326 7f820515fb80 DEBUG sestatus: SELinux status: disabled
2023-10-11 16:16:02,329 7f820515fb80 DEBUG sestatus: SELinux status: disabled
2023-10-11 16:16:02,333 7f820515fb80 DEBUG sestatus: SELinux status: disabled
2023-10-11 16:16:04,474 7ff2a5c08b80 DEBUG cephadm ['ceph-volume', 'inventory']
2023-10-11 16:16:04,516 7ff2a5c08b80 DEBUG /usr/bin/podman: 4.4.1
2023-10-11 16:16:04,520 7ff2a5c08b80 DEBUG Using default config: /etc/ceph/ceph.conf
2023-10-11 16:16:04,573 7ff2a5c08b80 DEBUG /usr/bin/podman: 0d28d71358d7,445.8MB / 50.32GB
2023-10-11 16:16:04,574 7ff2a5c08b80 DEBUG /usr/bin/podman: 2084faaf4d54,13.27MB / 50.32GB
2023-10-11 16:16:04,574 7ff2a5c08b80 DEBUG /usr/bin/podman: 61073c53805d,512.7MB / 50.32GB
2023-10-11 16:16:04,574 7ff2a5c08b80 DEBUG /usr/bin/podman: 6b9f0b72d668,361.1MB / 50.32GB
2023-10-11 16:16:04,574 7ff2a5c08b80 DEBUG /usr/bin/podman: 7493a28808ad,163.7MB / 50.32GB
2023-10-11 16:16:04,574 7ff2a5c08b80 DEBUG /usr/bin/podman: a89672a3accf,59.22MB / 50.32GB
2023-10-11 16:16:04,574 7ff2a5c08b80 DEBUG /usr/bin/podman: b45271cc9726,54.24MB / 50.32GB
2023-10-11 16:16:04,574 7ff2a5c08b80 DEBUG /usr/bin/podman: e00ec13ab138,707.3MB / 50.32GB
2023-10-11 16:16:04,574 7ff2a5c08b80 DEBUG /usr/bin/podman: fcb1e1a6b08d,35.55MB / 50.32GB
2023-10-11 16:16:04,630 7ff2a5c08b80 DEBUG /usr/bin/podman: 0d28d71358d7,1.28%
2023-10-11 16:16:04,631 7ff2a5c08b80 DEBUG /usr/bin/podman: 2084faaf4d54,0.00%
2023-10-11 16:16:04,631 7ff2a5c08b80 DEBUG /usr/bin/podman: 61073c53805d,1.19%
2023-10-11 16:16:04,631 7ff2a5c08b80 DEBUG /usr/bin/podman: 6b9f0b72d668,1.03%
2023-10-11
16:16:04,631 7ff2a5c08b80 DEBUG /usr/bin/podman: 7493a28808ad,0.78%
2023-10-11 16:16:04,631 7ff2a5c08b80 DEBUG /usr/bin/podman: a89672a3accf,0.11%
2023-10-11 16:16:04,631 7ff2a5c08b80 DEBUG /usr/bin/podman: b45271cc9726,1.35%
2023-10-11 16:16:04,631 7ff2a5c08b80 DEBUG /usr/bin/podman: e00ec13ab138,0.43%
2023-10-11 16:16:04,631 7ff2a5c08b80 DEBUG /usr/bin/podman: fcb1e1a6b08d,0.02%
2023-10-11 16:16:04,634 7ff2a5c08b80 INFO Inferring fsid 250f9864-0142-11ee-8e5f-00266cf8869c
2023-10-11 16:16:04,691 7ff2a5c08b80 DEBUG /usr/bin/podman: quay.io/ceph/ceph@sha256:f30bf50755d7087f47c6223e6a921caf5b12e86401b3d49220230c84a8302a1e
[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD
I've run additional tests with Pacific releases, and with "ceph-volume inventory" things went wrong with the first v16.2.11 release (v16.2.11-20230125).

=== Ceph v16.2.10-20220920 ===
Device Path    Size       rotates  available  Model name
/dev/sdc       232.83 GB  True     True       SAMSUNG HE253GJ
/dev/sda       232.83 GB  True     False      SAMSUNG HE253GJ
/dev/sdb       465.76 GB  True     False      WDC WD5003ABYX-1

=== Ceph v16.2.11-20230125 ===
Device Path    Size       Device nodes  rotates  available  Model name

May be this could help to see what has changed?

Patrick

Le 11/10/2023 à 17:38, Eugen Block a écrit :
That's really strange. Just out of curiosity, have you tried Quincy (and/or Reef) as well? I don't recall what inventory does in the background exactly, I believe Adam King mentioned that in some thread, maybe that can help here. I'll search for that thread tomorrow.

Zitat von Patrick Begou :
Hi Eugen,

[root@mostha1 ~]# rpm -q cephadm
cephadm-16.2.14-0.el8.noarch

Log associated to the

2023-10-11 16:16:02,167 7f820515fb80 DEBUG cephadm ['gather-facts']
2023-10-11 16:16:02,208 7f820515fb80 DEBUG /bin/podman: 4.4.1
2023-10-11 16:16:02,313 7f820515fb80 DEBUG sestatus: SELinux status: disabled
2023-10-11 16:16:02,317 7f820515fb80 DEBUG sestatus: SELinux status: disabled
2023-10-11 16:16:02,322 7f820515fb80 DEBUG sestatus: SELinux status: disabled
2023-10-11 16:16:02,326 7f820515fb80 DEBUG sestatus: SELinux status: disabled
2023-10-11 16:16:02,329 7f820515fb80 DEBUG sestatus: SELinux status: disabled
2023-10-11 16:16:02,333 7f820515fb80 DEBUG sestatus: SELinux status: disabled
2023-10-11 16:16:04,474 7ff2a5c08b80 DEBUG cephadm ['ceph-volume', 'inventory']
2023-10-11 16:16:04,516 7ff2a5c08b80 DEBUG /usr/bin/podman: 4.4.1
2023-10-11 16:16:04,520 7ff2a5c08b80 DEBUG Using default config: /etc/ceph/ceph.conf
2023-10-11 16:16:04,573 7ff2a5c08b80 DEBUG /usr/bin/podman: 0d28d71358d7,445.8MB / 50.32GB
2023-10-11 16:16:04,574 7ff2a5c08b80 DEBUG /usr/bin/podman: 2084faaf4d54,13.27MB / 50.32GB
2023-10-11 16:16:04,574 7ff2a5c08b80 DEBUG /usr/bin/podman: 61073c53805d,512.7MB / 50.32GB
2023-10-11 16:16:04,574 7ff2a5c08b80 DEBUG /usr/bin/podman: 6b9f0b72d668,361.1MB / 50.32GB
2023-10-11 16:16:04,574 7ff2a5c08b80 DEBUG /usr/bin/podman: 7493a28808ad,163.7MB / 50.32GB
2023-10-11 16:16:04,574 7ff2a5c08b80 DEBUG /usr/bin/podman: a89672a3accf,59.22MB / 50.32GB
2023-10-11 16:16:04,574 7ff2a5c08b80 DEBUG /usr/bin/podman: b45271cc9726,54.24MB / 50.32GB
2023-10-11 16:16:04,574 7ff2a5c08b80 DEBUG /usr/bin/podman: e00ec13ab138,707.3MB / 50.32GB
2023-10-11 16:16:04,574 7ff2a5c08b80 DEBUG /usr/bin/podman: fcb1e1a6b08d,35.55MB / 50.32GB
2023-10-11 16:16:04,630 7ff2a5c08b80 DEBUG /usr/bin/podman: 0d28d71358d7,1.28%
2023-10-11 16:16:04,631 7ff2a5c08b80 DEBUG /usr/bin/podman: 2084faaf4d54,0.00%
2023-10-11 16:16:04,631 7ff2a5c08b80 DEBUG /usr/bin/podman: 61073c53805d,1.19%
2023-10-11 16:16:04,631 7ff2a5c08b80 DEBUG /usr/bin/podman: 6b9f0b72d668,1.03%
2023-10-11 16:16:04,631 7ff2a5c08b80 DEBUG /usr/bin/podman: 7493a28808ad,0.78%
2023-10-11 16:16:04,631 7ff2a5c08b80 DEBUG /usr/bin/podman: a89672a3accf,0.11%
2023-10-11 16:16:04,631 7ff2a5c08b80 DEBUG /usr/bin/podman: b45271cc9726,1.35%
2023-10-11 16:16:04,631 7ff2a5c08b80 DEBUG /usr/bin/podman: e00ec13ab138,0.43%
2023-10-11 16:16:04,631 7ff2a5c08b80 DEBUG /usr/bin/podman: fcb1e1a6b08d,0.02%
2023-10-11 16:16:04,634 7ff2a5c08b80 INFO Inferring fsid 250f9864-0142-11ee-8e5f-00266cf8869c
2023-10-11 16:16:04,691 7ff2a5c08b80 DEBUG /usr/bin/podman: quay.io/ceph/ceph@sha256:f30bf50755d7087f47c6223e6a921caf5b12e86401b3d49220230c84a8302a1e
2023-10-11 16:16:04,692 7ff2a5c08b80 DEBUG /usr/bin/podman: quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca
2023-10-11 16:16:04,692 7ff2a5c08b80 DEBUG /usr/bin/podman: docker.io/ceph/ceph@sha256:056637972a107df4096f10951e4216b21fcd8ae0b9fb4552e628d35df3f61139
2023-10-11 16:16:04,694 7ff2a5c08b80 INFO Using recent ceph image quay.io/ceph/ceph@sha256:f30bf50755d7087f47c6223e6a921caf5b12e86401b3d49220230c84a8302a1e
2023-10-11 16:16:05,094 7ff2a5c08b80 DEBUG stat: 167 167
2023-10-11 16:16:05,903 7ff2a5c08b80 DEBUG Acquiring lock 140679815723776 on /run/cephadm/250f9864-0142-11ee-8e5f-00266cf8869c.lock
2023-10-11 16:16:05,903 7ff2a5c08b80 DEBUG Lock 140679815723776 acquired on /run/cephadm/250f9864-0142-11ee-8e5f-00266cf8869c.lock
2023-10-11 16:16:05,929 7ff2a5c08b80 DEBUG sestatus: SELinux status: disabled
2023-10-11 16:16:05,933 7ff2a5c08b80 DEBU
[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD
This afternoon I had a look at the python file, but I do not manage to see how it works with containers, as I am only a Fortran HPC programmer... but I found that "cephadm gather-facts" shows all the HDDs in Pacific.

Some quick tests show:

== Nautilus ==
[root@mostha1 ~]# cephadm --image quay.io/ceph/ceph:v14 ceph-volume inventory
Inferring fsid 250f9864-0142-11ee-8e5f-00266cf8869c

Device Path    Size       rotates  available  Model name
/dev/sdc       232.83 GB  True     True       SAMSUNG HE253GJ
/dev/sda       232.83 GB  True     False      SAMSUNG HE253GJ
/dev/sdb       465.76 GB  True     False      WDC WD5003ABYX-1

== Octopus ==
[root@mostha1 ~]# cephadm --image quay.io/ceph/ceph:v15 ceph-volume inventory
Inferring fsid 250f9864-0142-11ee-8e5f-00266cf8869c

Device Path    Size       rotates  available  Model name
/dev/sdc       232.83 GB  True     True       SAMSUNG HE253GJ
/dev/sda       232.83 GB  True     False      SAMSUNG HE253GJ
/dev/sdb       465.76 GB  True     False      WDC WD5003ABYX-1

== Pacific ==
[root@mostha1 ~]# cephadm --image quay.io/ceph/ceph:v16 ceph-volume inventory
Inferring fsid 250f9864-0142-11ee-8e5f-00266cf8869c

Device Path    Size       Device nodes  rotates  available  Model name

== Quincy ==
[root@mostha1 ~]# cephadm --image quay.io/ceph/ceph:v17 ceph-volume inventory
Inferring fsid 250f9864-0142-11ee-8e5f-00266cf8869c

Device Path    Size       Device nodes  rotates  available  Model name

== Reef ==
[root@mostha1 ~]# cephadm --image quay.io/ceph/ceph:v18 ceph-volume inventory
Inferring fsid 250f9864-0142-11ee-8e5f-00266cf8869c

Device Path    Size       Device nodes  rotates  available  Model name

Could it be related to deprecated hardware support in Ceph with SATA drives?

Patrick

Le 11/10/2023 à 17:38, Eugen Block a écrit :
That's really strange. Just out of curiosity, have you tried Quincy (and/or Reef) as well? I don't recall what inventory does in the background exactly, I believe Adam King mentioned that in some thread, maybe that can help here. I'll search for that thread tomorrow.
Zitat von Patrick Begou :
Hi Eugen,

[root@mostha1 ~]# rpm -q cephadm
cephadm-16.2.14-0.el8.noarch

Log associated to the

2023-10-11 16:16:02,167 7f820515fb80 DEBUG cephadm ['gather-facts']
2023-10-11 16:16:02,208 7f820515fb80 DEBUG /bin/podman: 4.4.1
2023-10-11 16:16:02,313 7f820515fb80 DEBUG sestatus: SELinux status: disabled
2023-10-11 16:16:02,317 7f820515fb80 DEBUG sestatus: SELinux status: disabled
2023-10-11 16:16:02,322 7f820515fb80 DEBUG sestatus: SELinux status: disabled
2023-10-11 16:16:02,326 7f820515fb80 DEBUG sestatus: SELinux status: disabled
2023-10-11 16:16:02,329 7f820515fb80 DEBUG sestatus: SELinux status: disabled
2023-10-11 16:16:02,333 7f820515fb80 DEBUG sestatus: SELinux status: disabled
2023-10-11 16:16:04,474 7ff2a5c08b80 DEBUG cephadm ['ceph-volume', 'inventory']
2023-10-11 16:16:04,516 7ff2a5c08b80 DEBUG /usr/bin/podman: 4.4.1
2023-10-11 16:16:04,520 7ff2a5c08b80 DEBUG Using default config: /etc/ceph/ceph.conf
2023-10-11 16:16:04,573 7ff2a5c08b80 DEBUG /usr/bin/podman: 0d28d71358d7,445.8MB / 50.32GB
2023-10-11 16:16:04,574 7ff2a5c08b80 DEBUG /usr/bin/podman: 2084faaf4d54,13.27MB / 50.32GB
2023-10-11 16:16:04,574 7ff2a5c08b80 DEBUG /usr/bin/podman: 61073c53805d,512.7MB / 50.32GB
2023-10-11 16:16:04,574 7ff2a5c08b80 DEBUG /usr/bin/podman: 6b9f0b72d668,361.1MB / 50.32GB
2023-10-11 16:16:04,574 7ff2a5c08b80 DEBUG /usr/bin/podman: 7493a28808ad,163.7MB / 50.32GB
2023-10-11 16:16:04,574 7ff2a5c08b80 DEBUG /usr/bin/podman: a89672a3accf,59.22MB / 50.32GB
2023-10-11 16:16:04,574 7ff2a5c08b80 DEBUG /usr/bin/podman: b45271cc9726,54.24MB / 50.32GB
2023-10-11 16:16:04,574 7ff2a5c08b80 DEBUG /usr/bin/podman: e00ec13ab138,707.3MB / 50.32GB
2023-10-11 16:16:04,574 7ff2a5c08b80 DEBUG /usr/bin/podman: fcb1e1a6b08d,35.55MB / 50.32GB
2023-10-11 16:16:04,630 7ff2a5c08b80 DEBUG /usr/bin/podman: 0d28d71358d7,1.28%
2023-10-11 16:16:04,631 7ff2a5c08b80 DEBUG /usr/bin/podman: 2084faaf4d54,0.00%
2023-10-11 16:16:04,631 7ff2a5c08b80 DEBUG /usr/bin/podman: 61073c53805d,1.19%
2023-10-11 16:16:04,631 7ff2a5c08b80 DEBUG /usr/bin/podman: 6b9f0b72d668,1.03%
2023-10-11 16:16:04,631 7ff2a5c08b80 DEBUG /usr/bin/podman: 7493a28808ad,0.78%
2023-10-11 16:16:04,631 7ff2a5c08b80 DEBUG /usr/bin/podman: a89672a3accf,0.11%
2023-10-11 16:16:04,631 7ff2a5c08b80 DEBUG /usr/bin/podman: b45271cc9726,1.35%
2023-10-11 16:16:04,631 7f
[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD
467cf31b80 DEBUG Using default config: /etc/ceph/ceph.conf 2023-10-11 16:21:36,067 7f467cf31b80 DEBUG /usr/bin/podman: 0d28d71358d7,452.1MB / 50.32GB 2023-10-11 16:21:36,067 7f467cf31b80 DEBUG /usr/bin/podman: 2084faaf4d54,13.27MB / 50.32GB 2023-10-11 16:21:36,067 7f467cf31b80 DEBUG /usr/bin/podman: 61073c53805d,513.6MB / 50.32GB 2023-10-11 16:21:36,067 7f467cf31b80 DEBUG /usr/bin/podman: 6b9f0b72d668,322.4MB / 50.32GB 2023-10-11 16:21:36,067 7f467cf31b80 DEBUG /usr/bin/podman: 7493a28808ad,164MB / 50.32GB 2023-10-11 16:21:36,067 7f467cf31b80 DEBUG /usr/bin/podman: a89672a3accf,58.5MB / 50.32GB 2023-10-11 16:21:36,067 7f467cf31b80 DEBUG /usr/bin/podman: b45271cc9726,54.69MB / 50.32GB 2023-10-11 16:21:36,067 7f467cf31b80 DEBUG /usr/bin/podman: e00ec13ab138,707.1MB / 50.32GB 2023-10-11 16:21:36,068 7f467cf31b80 DEBUG /usr/bin/podman: fcb1e1a6b08d,36.28MB / 50.32GB 2023-10-11 16:21:36,125 7f467cf31b80 DEBUG /usr/bin/podman: 0d28d71358d7,1.27% 2023-10-11 16:21:36,125 7f467cf31b80 DEBUG /usr/bin/podman: 2084faaf4d54,0.00% 2023-10-11 16:21:36,125 7f467cf31b80 DEBUG /usr/bin/podman: 61073c53805d,1.16% 2023-10-11 16:21:36,125 7f467cf31b80 DEBUG /usr/bin/podman: 6b9f0b72d668,1.02% 2023-10-11 16:21:36,125 7f467cf31b80 DEBUG /usr/bin/podman: 7493a28808ad,0.78% 2023-10-11 16:21:36,125 7f467cf31b80 DEBUG /usr/bin/podman: a89672a3accf,0.11% 2023-10-11 16:21:36,125 7f467cf31b80 DEBUG /usr/bin/podman: b45271cc9726,1.35% 2023-10-11 16:21:36,125 7f467cf31b80 DEBUG /usr/bin/podman: e00ec13ab138,0.41% 2023-10-11 16:21:36,125 7f467cf31b80 DEBUG /usr/bin/podman: fcb1e1a6b08d,0.02% 2023-10-11 16:21:36,128 7f467cf31b80 INFO Inferring fsid 250f9864-0142-11ee-8e5f-00266cf8869c 2023-10-11 16:21:36,186 7f467cf31b80 DEBUG /usr/bin/podman: quay.io/ceph/ceph@sha256:f30bf50755d7087f47c6223e6a921caf5b12e86401b3d49220230c84a8302a1e 2023-10-11 16:21:36,187 7f467cf31b80 DEBUG /usr/bin/podman: quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca 2023-10-11 16:21:36,187 7f467cf31b80 DEBUG /usr/bin/podman: docker.io/ceph/ceph@sha256:056637972a107df4096f10951e4216b21fcd8ae0b9fb4552e628d35df3f61139 2023-10-11 16:21:36,189 7f467cf31b80 INFO Using recent ceph image quay.io/ceph/ceph@sha256:f30bf50755d7087f47c6223e6a921caf5b12e86401b3d49220230c84a8302a1e 2023-10-11 16:21:36,549 7f467cf31b80 DEBUG stat: 167 167 2023-10-11 16:21:36,942 7f467cf31b80 DEBUG Acquiring lock 139940396923424 on /run/cephadm/250f9864-0142-11ee-8e5f-00266cf8869c.lock 2023-10-11 16:21:36,942 7f467cf31b80 DEBUG Lock 139940396923424 acquired on /run/cephadm/250f9864-0142-11ee-8e5f-00266cf8869c.lock 2023-10-11 16:21:36,969 7f467cf31b80 DEBUG sestatus: SELinux status: disabled 2023-10-11 16:21:36,972 7f467cf31b80 DEBUG sestatus: SELinux status: disabled 2023-10-11 16:21:37,749 7f467cf31b80 DEBUG /usr/bin/podman: 2023-10-11 16:21:37,750 7f467cf31b80 DEBUG /usr/bin/podman: Device Path Size Device nodes rotates available Model name Patrick Le 11/10/2023 à 15:59, Eugen Block a écrit : Can you check which cephadm version is installed on the host? And then please add (only the relevant) output from the cephadm.log when you run the inventory (without the --image ). Sometimes the version mismatch on the host and the one the orchestrator uses can cause some disruptions. You could try the same with the latest cephadm you have in /var/lib/ceph/${fsid}/ (ls -lrt /var/lib/ceph/${fsid}/cephadm.*). I mentioned that in this thread [1]. 
So you could try the following:

$ chmod +x /var/lib/ceph/{fsid}/cephadm.{latest}
$ python3 /var/lib/ceph/{fsid}/cephadm.{latest} ceph-volume inventory

Does the output differ? Paste the relevant cephadm.log from that attempt as well.

[1] https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/LASBJCSPFGDYAWPVE2YLV2ZLF3HC5SLS/

Zitat von Patrick Begou :
Hi Eugen, first many thanks for the time spent on this problem. "ceph osd purge 2 --force --yes-i-really-mean-it" works and cleans up all the bad status.

[root@mostha1 ~]# cephadm shell
Inferring fsid 250f9864-0142-11ee-8e5f-00266cf8869c
Using recent ceph image quay.io/ceph/ceph@sha256:f30bf50755d7087f47c6223e6a921caf5b12e86401b3d49220230c84a8302a1e

[ceph: root@mostha1 /]# ceph osd purge 2 --force --yes-i-really-mean-it
purged osd.2

[ceph: root@mostha1 /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME        STATUS  REWEIGHT  PRI-AFF
-1         1.72823  root default
-5         0.45477      host dean
 0    hdd  0.22739          osd.0        up  1.0       1.0
 4    hdd  0.22739          osd.4        up  1.0       1.0
-9         0.22739      host ekman
 6    hdd  0.22739          osd.6        up  1.0       1.0
-7         0.45479      host mostha1
 5    hdd  0.45479          osd.5        up  1.0       1.0
-3         0.59128      host mostha2
 1    hdd  0.22739          osd.1        up  1.0       1.0
 3    hdd  0.
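A concrete form of Eugen's suggestion above, with the placeholders filled in for this cluster (untested sketch; the fsid is the one from this thread, and the newest cephadm.* file is picked automatically):

# Find the most recent cephadm binary deployed for this cluster,
# make it executable, and run the inventory with it:
fsid=250f9864-0142-11ee-8e5f-00266cf8869c
latest=$(ls -1rt /var/lib/ceph/${fsid}/cephadm.* | tail -n 1)
chmod +x "${latest}"
python3 "${latest}" ceph-volume inventory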
[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD
Hi Eugen, first many thanks for the time spent on this problem. "ceph osd purge 2 --force --yes-i-really-mean-it" works and cleans up all the bad status.

[root@mostha1 ~]# cephadm shell
Inferring fsid 250f9864-0142-11ee-8e5f-00266cf8869c
Using recent ceph image quay.io/ceph/ceph@sha256:f30bf50755d7087f47c6223e6a921caf5b12e86401b3d49220230c84a8302a1e

[ceph: root@mostha1 /]# ceph osd purge 2 --force --yes-i-really-mean-it
purged osd.2

[ceph: root@mostha1 /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME        STATUS  REWEIGHT  PRI-AFF
-1         1.72823  root default
-5         0.45477      host dean
 0    hdd  0.22739          osd.0        up  1.0       1.0
 4    hdd  0.22739          osd.4        up  1.0       1.0
-9         0.22739      host ekman
 6    hdd  0.22739          osd.6        up  1.0       1.0
-7         0.45479      host mostha1
 5    hdd  0.45479          osd.5        up  1.0       1.0
-3         0.59128      host mostha2
 1    hdd  0.22739          osd.1        up  1.0       1.0
 3    hdd  0.36389          osd.3        up  1.0       1.0

[ceph: root@mostha1 /]# lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 1 232.9G 0 disk
|-sda1 8:1 1 3.9G 0 part /rootfs/boot
|-sda2 8:2 1 3.9G 0 part [SWAP]
`-sda3 8:3 1 225G 0 part
  |-al8vg-rootvol 253:0 0 48.8G 0 lvm /rootfs
  |-al8vg-homevol 253:2 0 9.8G 0 lvm /rootfs/home
  |-al8vg-tmpvol 253:3 0 9.8G 0 lvm /rootfs/tmp
  `-al8vg-varvol 253:4 0 19.8G 0 lvm /rootfs/var
sdb 8:16 1 465.8G 0 disk
`-ceph--08827fdc--136e--4070--97e9--e5e8b3970766-osd--block--7dec1808--d6f4--4f90--ac74--75a4346e1df5 253:1 0 465.8G 0 lvm
sdc 8:32 1 232.9G 0 disk

"cephadm ceph-volume inventory" returns nothing:

[root@mostha1 ~]# cephadm ceph-volume inventory
Inferring fsid 250f9864-0142-11ee-8e5f-00266cf8869c
Using recent ceph image quay.io/ceph/ceph@sha256:f30bf50755d7087f47c6223e6a921caf5b12e86401b3d49220230c84a8302a1e
Device Path  Size  Device nodes  rotates  available  Model name
[root@mostha1 ~]#

But running the same command within cephadm 15.2.17 works:

[root@mostha1 ~]# cephadm --image 93146564743f ceph-volume inventory
Inferring fsid 250f9864-0142-11ee-8e5f-00266cf8869c
Device Path  Size       rotates  available  Model name
/dev/sdc     232.83 GB  True     True       SAMSUNG HE253GJ
/dev/sda     232.83 GB  True     False      SAMSUNG HE253GJ
/dev/sdb     465.76 GB  True     False      WDC WD5003ABYX-1
[root@mostha1 ~]#

[root@mostha1 ~]# podman images -a
REPOSITORY         TAG       IMAGE ID      CREATED        SIZE
quay.io/ceph/ceph  v16.2.14  f13d80acdbb5  2 weeks ago    1.21 GB
quay.io/ceph/ceph  v15.2.17  93146564743f  14 months ago  1.24 GB

Patrick

Le 11/10/2023 à 15:14, Eugen Block a écrit :
Your response is a bit confusing since it seems to be mixed up with the previous answer. So you still need to remove the OSD properly, i.e. purge it from the crush tree: ceph osd purge 2 --force --yes-i-really-mean-it (only in a test cluster!) If everything is clean (OSD has been removed, disk has been zapped, lsblk shows no LVs for that disk) you can check the inventory: cephadm ceph-volume inventory. Please also add the output of 'ceph orch ls osd --export'.

Zitat von Patrick Begou :
Hi Eugen, - the OS is Alma Linux 8 with latest updates. - this morning I've worked with ceph-volume but it ended in a strange final state. I was connected on host mostha1 where /dev/sdc was not recognized.
These are the steps I followed, based on the ceph-volume documentation I've read:

[root@mostha1 ~]# cephadm shell
[ceph: root@mostha1 /]# ceph auth get client.bootstrap-osd > /var/lib/ceph/bootstrap-osd/ceph.keyring
[ceph: root@mostha1 /]# ceph-volume lvm prepare --bluestore --data /dev/sdc

Now the lsblk command shows sdc as an osd:

sdb 8:16 1 465.8G 0 disk
`-ceph--08827fdc--136e--4070--97e9--e5e8b3970766-osd--block--7dec1808--d6f4--4f90--ac74--75a4346e1df5 253:1 0 465.8G 0 lvm
sdc 8:32 1 232.9G 0 disk
`-ceph--b27d7a07--278d--4ee2--b84e--53256ef8de4c-osd--block--45c8e92c--caf9--4fe7--9a42--7b45a0794632 253:5 0 232.8G 0 lvm

Then I've tried to activate this osd, but it fails since inside podman I have no access to systemctl:

[ceph: root@mostha1 /]# ceph-volume lvm activate 2 45c8e92c-caf9-4fe7-9a42-7b45a0794632
...
Running command: /usr/bin/systemctl start ceph-osd@2
stderr: Failed to connect to bus: No such file or directory
--> RuntimeError: command returned non-zero exit status: 1

And now I have a strange status for this osd.2:

[ceph: root@mostha1 /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME        STATUS  REWEIGHT  PRI-AFF
-1         1.72823  root default
-5         0.45477      host dean
 0    hdd  0.22739
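Side note: inside the cephadm shell there is no systemd, which is why the activate step fails. On Pacific and later the orchestrator offers an activation command that runs outside the container; a hedged sketch, to be run from an admin node:

# Ask cephadm to scan the host and activate any prepared-but-inactive
# OSDs it finds there:
ceph cephadm osd activate mostha1.legi.grenoble-inp.fr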
[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD
Hi Eugen, sorry for posting twice, my zimbra server returned an error on the first attempt. My initial problem is that ceph cannot detect these HDDs since Pacific. So I have deployed Octopus, where "ceph orch apply osd --all-available-devices" works fine, and then upgraded to Pacific. But during the upgrade, 2 OSDs went "out" and "down", and I'm looking for a solution to manually re-integrate these 2 HDDs in the cluster, as Pacific is not able to do this automatically with "ceph orch..." like Octopus was. But it is a test cluster to understand and get basic knowledge of Ceph (and I'm allowed to break everything).

Patrick

Le 11/10/2023 à 14:35, Eugen Block a écrit :
Don't use ceph-volume manually to deploy OSDs if your cluster is managed by cephadm. I just wanted to point out that you hadn't wiped the disk properly to be able to re-use it. Let the orchestrator handle the OSD creation and activation. I recommend to remove the OSD again, wipe it properly (cephadm ceph-volume lvm zap --destroy /dev/sdc) and then let the orchestrator add it as an OSD. Depending on your drivegroup configuration it will happen automatically (if "all-available-devices" is enabled or your osd specs are already applied). If it doesn't happen automatically, deploy it with 'ceph orch daemon add osd <host>:<device>' [1].

[1] https://docs.ceph.com/en/quincy/cephadm/services/osd/#deploy-osds

Zitat von Patrick Begou :
Hi Eugen, - the OS is Alma Linux 8 with latest updates. - this morning I've worked with ceph-volume but it ended in a strange final state. I was connected on host mostha1 where /dev/sdc was not recognized. These are the steps I followed, based on the ceph-volume documentation I've read:

[root@mostha1 ~]# cephadm shell
[ceph: root@mostha1 /]# ceph auth get client.bootstrap-osd > /var/lib/ceph/bootstrap-osd/ceph.keyring
[ceph: root@mostha1 /]# ceph-volume lvm prepare --bluestore --data /dev/sdc

[ceph: root@mostha1 /]# ceph-volume lvm list
== osd.2 ===
  [block] /dev/ceph-b27d7a07-278d-4ee2-b84e-53256ef8de4c/osd-block-45c8e92c-caf9-4fe7-9a42-7b45a0794632
  block device  /dev/ceph-b27d7a07-278d-4ee2-b84e-53256ef8de4c/osd-block-45c8e92c-caf9-4fe7-9a42-7b45a0794632
  block uuid    Pq0XeH-LJct-t4yH-f56F-d5jk-JzGQ-zITfhE
  cephx lockbox secret
  cluster fsid  250f9864-0142-11ee-8e5f-00266cf8869c
  cluster name  ceph
  crush device class
  encrypted     0
  osd fsid      45c8e92c-caf9-4fe7-9a42-7b45a0794632
  osd id        2
  osdspec affinity
  type          block
  vdo           0
  devices       /dev/sdc

Now the lsblk command shows sdc as an osd:

sdb 8:16 1 465.8G 0 disk
`-ceph--08827fdc--136e--4070--97e9--e5e8b3970766-osd--block--7dec1808--d6f4--4f90--ac74--75a4346e1df5 253:1 0 465.8G 0 lvm
sdc 8:32 1 232.9G 0 disk
`-ceph--b27d7a07--278d--4ee2--b84e--53256ef8de4c-osd--block--45c8e92c--caf9--4fe7--9a42--7b45a0794632 253:5 0 232.8G 0 lvm

But this osd.2 is "down" and "out" with a strange status (no related cluster host, no weight) and I cannot activate it, as within the podman container systemctl is not working.
[ceph: root@mostha1 /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME        STATUS  REWEIGHT  PRI-AFF
-1         1.72823  root default
-5         0.45477      host dean
 0    hdd  0.22739          osd.0        up  1.0       1.0
 4    hdd  0.22739          osd.4        up  1.0       1.0
-9         0.22739      host ekman
 6    hdd  0.22739          osd.6        up  1.0       1.0
-7         0.45479      host mostha1
 5    hdd  0.45479          osd.5        up  1.0       1.0
-3         0.59128      host mostha2
 1    hdd  0.22739          osd.1        up  1.0       1.0
 3    hdd  0.36389          osd.3        up  1.0       1.0
 2         0                osd.2      down  0         1.0

My attempt to activate the osd:

[ceph: root@mostha1 /]# ceph-volume lvm activate 2 45c8e92c-caf9-4fe7-9a42-7b45a0794632
Running command: /usr/bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-2
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-2
Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-b27d7a07-278d-4ee2-b84e-53256ef8de4c/osd-block-45c8e92c-caf9-4fe7-9a42-7b45a0794632 --path /var/lib/ceph/osd/ceph-2 --no-mon-config
Running command: /usr/bin/ln -snf /dev/ceph-b27d7a07-278d-4ee2-b84e-53256ef8de4c/osd-block-45c8e92c-caf9-4fe7-9a42-7b45a0794632 /var/lib/ceph/osd/ceph-2/block
Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-2/block
Running command: /usr/bin/chown -R ceph:ceph /dev/dm-1
Running command: /usr/bin/chown -R ceph:ceph /var/li
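Put together, the procedure Eugen describes in this thread boils down to three commands; a hedged sketch for this particular host and device (test cluster only):

# 1. remove the half-created OSD from the cluster,
# 2. wipe the LVs and the partition table on the disk,
# 3. let the orchestrator re-create the OSD from the clean device:
ceph osd purge 2 --force --yes-i-really-mean-it
cephadm ceph-volume lvm zap --destroy /dev/sdc
ceph orch daemon add osd mostha1.legi.grenoble-inp.fr:/dev/sdc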
[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD
0.59128 host mostha2 1 hdd 0.22739 osd.1 up 1.0 1.0 3 hdd 0.36389 osd.3 up 1.0 1.0 2 0 osd.2 down 0 1.0 * * *[ceph: root@mostha1 /]# lsblk** *NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT sda 8:0 1 232.9G 0 disk |-sda1 8:1 1 3.9G 0 part /rootfs/boot |-sda2 8:2 1 3.9G 0 part [SWAP] `-sda3 8:3 1 225G 0 part |-al8vg-rootvol 253:0 0 48.8G 0 lvm /rootfs |-al8vg-homevol 253:3 0 9.8G 0 lvm /rootfs/home |-al8vg-tmpvol 253:4 0 9.8G 0 lvm /rootfs/tmp `-al8vg-varvol 253:5 0 19.8G 0 lvm /rootfs/var sdb 8:16 1 465.8G 0 disk `-ceph--08827fdc--136e--4070--97e9--e5e8b3970766-osd--block--7dec1808--d6f4--4f90--ac74--75a4346e1df5 253:2 0 465.8G 0 lvm *sdc * Patrick Le 11/10/2023 à 11:00, Eugen Block a écrit : Hi, just wondering if 'ceph-volume lvm zap --destroy /dev/sdc' would help here. From your previous output you didn't specify the --destroy flag. Which cephadm version is installed on the host? Did you also upgrade the OS when moving to Pacific? (Sorry if I missed that. Zitat von Patrick Begou : Le 02/10/2023 à 18:22, Patrick Bégou a écrit : Hi all, still stuck with this problem. I've deployed octopus and all my HDD have been setup as osd. Fine. I've upgraded to pacific and 2 osd have failed. They have been automatically removed and upgrade finishes. Cluster Health is finaly OK, no data loss. But now I cannot re-add these osd with pacific (I had previous troubles on these old HDDs, lost one osd in octopus and was able to reset and re-add it). I've tried manually to add the first osd on the node where it is located, following https://docs.ceph.com/en/pacific/rados/operations/bluestore-migration/ (not sure it's the best idea...) but it fails too. This node was the one used for deploying the cluster. [ceph: root@mostha1 /]# *ceph-volume lvm zap /dev/sdc* --> Zapping: /dev/sdc --> --destroy was not specified, but zapping a whole device will remove the partition table Running command: /usr/bin/dd if=/dev/zero of=/dev/sdc bs=1M count=10 conv=fsync stderr: 10+0 records in 10+0 records out 10485760 bytes (10 MB, 10 MiB) copied, 0.663425 s, 15.8 MB/s --> Zapping successful for: [ceph: root@mostha1 /]# *ceph-volume lvm create --bluestore --data /dev/sdc* Running command: /usr/bin/ceph-authtool --gen-print-key Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 9f1eb8ee-41e6-4350-ad73-1be21234ec7c stderr: 2023-10-02T16:09:29.855+ 7fb4eb8c0700 -1 auth: unable to find a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or directory stderr: 2023-10-02T16:09:29.855+ 7fb4eb8c0700 -1 AuthRegistry(0x7fb4e405c4d8) no keyring found at /var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx stderr: 2023-10-02T16:09:29.856+ 7fb4eb8c0700 -1 auth: unable to find a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or directory stderr: 2023-10-02T16:09:29.856+ 7fb4eb8c0700 -1 AuthRegistry(0x7fb4e40601d0) no keyring found at /var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx stderr: 2023-10-02T16:09:29.857+ 7fb4eb8c0700 -1 auth: unable to find a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or directory stderr: 2023-10-02T16:09:29.857+ 7fb4eb8c0700 -1 AuthRegistry(0x7fb4eb8bee90) no keyring found at /var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx stderr: 2023-10-02T16:09:29.858+ 7fb4e965c700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1] stderr: 2023-10-02T16:09:29.858+ 7fb4e9e5d700 -1 monclient(hunting): 
handle_auth_bad_method server allowed_methods [2] but i only support [1] stderr: 2023-10-02T16:09:29.858+ 7fb4e8e5b700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1] stderr: 2023-10-02T16:09:29.858+ 7fb4eb8c0700 -1 monclient: authenticate NOTE: no keyring found; disabled cephx authentication stderr: [errno 13] RADOS permission denied (error connecting to the cluster) --> RuntimeError: Unable to create a new OSD id Any idea of what is wrong ? Thanks Patrick ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io I'm still trying to understand what can be wrong or how to debug this situation where Ceph cannot see the devices. The device :dev/sdc exists: [root@mostha1 ~]# cephadm shell lsmcli ldl Inferring fsid 250f9864-0142-11ee-8e5f-00266cf8869c Using recent ceph image quay.io/ceph/ceph@sha256:f30bf50755d7087f47c6223e6a921caf5b12e86401b3d49220230c84a8302a1e Path | SCSI VPD 0x83 | Link Type | Serial Number | Health Status -
[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD
Hi Eugen, - the OS is Alma Linux 8 with latest updates. - this morning I've worked with ceph-volume but it ended in a strange final state. I was connected on host mostha1 where /dev/sdc was not recognized. These are the steps I followed, based on the ceph-volume documentation I've read:

[root@mostha1 ~]# cephadm shell
[ceph: root@mostha1 /]# ceph auth get client.bootstrap-osd > /var/lib/ceph/bootstrap-osd/ceph.keyring
[ceph: root@mostha1 /]# ceph-volume lvm prepare --bluestore --data /dev/sdc

[ceph: root@mostha1 /]# ceph-volume lvm list
== osd.2 ===
  [block] /dev/ceph-b27d7a07-278d-4ee2-b84e-53256ef8de4c/osd-block-45c8e92c-caf9-4fe7-9a42-7b45a0794632
  block device  /dev/ceph-b27d7a07-278d-4ee2-b84e-53256ef8de4c/osd-block-45c8e92c-caf9-4fe7-9a42-7b45a0794632
  block uuid    Pq0XeH-LJct-t4yH-f56F-d5jk-JzGQ-zITfhE
  cephx lockbox secret
  cluster fsid  250f9864-0142-11ee-8e5f-00266cf8869c
  cluster name  ceph
  crush device class
  encrypted     0
  osd fsid      45c8e92c-caf9-4fe7-9a42-7b45a0794632
  osd id        2
  osdspec affinity
  type          block
  vdo           0
  devices       /dev/sdc

Now the lsblk command shows sdc as an osd:

sdb 8:16 1 465.8G 0 disk
`-ceph--08827fdc--136e--4070--97e9--e5e8b3970766-osd--block--7dec1808--d6f4--4f90--ac74--75a4346e1df5 253:1 0 465.8G 0 lvm
sdc 8:32 1 232.9G 0 disk
`-ceph--b27d7a07--278d--4ee2--b84e--53256ef8de4c-osd--block--45c8e92c--caf9--4fe7--9a42--7b45a0794632 253:5 0 232.8G 0 lvm

But this osd.2 is "down" and "out" with a strange status (no related cluster host, no weight) and I cannot activate it, as within the podman container systemctl is not working.

[ceph: root@mostha1 /]# ceph osd tree
ID  CLASS  WEIGHT   TYPE NAME        STATUS  REWEIGHT  PRI-AFF
-1         1.72823  root default
-5         0.45477      host dean
 0    hdd  0.22739          osd.0        up  1.0       1.0
 4    hdd  0.22739          osd.4        up  1.0       1.0
-9         0.22739      host ekman
 6    hdd  0.22739          osd.6        up  1.0       1.0
-7         0.45479      host mostha1
 5    hdd  0.45479          osd.5        up  1.0       1.0
-3         0.59128      host mostha2
 1    hdd  0.22739          osd.1        up  1.0       1.0
 3    hdd  0.36389          osd.3        up  1.0       1.0
 2         0                osd.2      down  0         1.0

My attempt to activate the osd:

[ceph: root@mostha1 /]# ceph-volume lvm activate 2 45c8e92c-caf9-4fe7-9a42-7b45a0794632
Running command: /usr/bin/mount -t tmpfs tmpfs /var/lib/ceph/osd/ceph-2
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-2
Running command: /usr/bin/ceph-bluestore-tool --cluster=ceph prime-osd-dir --dev /dev/ceph-b27d7a07-278d-4ee2-b84e-53256ef8de4c/osd-block-45c8e92c-caf9-4fe7-9a42-7b45a0794632 --path /var/lib/ceph/osd/ceph-2 --no-mon-config
Running command: /usr/bin/ln -snf /dev/ceph-b27d7a07-278d-4ee2-b84e-53256ef8de4c/osd-block-45c8e92c-caf9-4fe7-9a42-7b45a0794632 /var/lib/ceph/osd/ceph-2/block
Running command: /usr/bin/chown -h ceph:ceph /var/lib/ceph/osd/ceph-2/block
Running command: /usr/bin/chown -R ceph:ceph /dev/dm-1
Running command: /usr/bin/chown -R ceph:ceph /var/lib/ceph/osd/ceph-2
Running command: /usr/bin/systemctl enable ceph-volume@lvm-2-45c8e92c-caf9-4fe7-9a42-7b45a0794632
stderr: Created symlink /etc/systemd/system/multi-user.target.wants/ceph-volume@lvm-2-45c8e92c-caf9-4fe7-9a42-7b45a0794632.service -> /usr/lib/systemd/system/ceph-volume@.service.
Running command: /usr/bin/systemctl enable --runtime ceph-osd@2
stderr: Created symlink /run/systemd/system/ceph-osd.target.wants/ceph-osd@2.service -> /usr/lib/systemd/system/ceph-osd@.service.
Running command: /usr/bin/systemctl start ceph-osd@2
stderr: Failed to connect to bus: No such file or directory
--> RuntimeError: command returned non-zero exit status: 1

Patrick

Le 11/10/2023 à 11:00, Eugen Block a écrit :
Hi, just wondering if 'ceph-volume lvm zap --destroy /dev/sdc' would help here. From your previous output you didn't specify the --destroy flag. Which cephadm version is installed on the host? Did you also upgrade the OS when moving to Pacific? (Sorry if I missed that.)

Zitat von Patrick Begou :
Le 02/10/2023 à 18:22, Patrick Bégou a écrit :
Hi all, still stuck with this problem. I've deployed Octopus and all my HDDs have been set up as OSDs. Fine. I've upgraded to Pacific and 2 OSDs have failed. They have been automatically removed and the upgrade finished. Cluster health is finally OK, no data loss. But now I cannot re-add these OSDs with Pacific (I had previous troubles on these old HDDs, lost
[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD
Le 02/10/2023 à 18:22, Patrick Bégou a écrit : Hi all, still stuck with this problem. I've deployed octopus and all my HDD have been setup as osd. Fine. I've upgraded to pacific and 2 osd have failed. They have been automatically removed and upgrade finishes. Cluster Health is finaly OK, no data loss. But now I cannot re-add these osd with pacific (I had previous troubles on these old HDDs, lost one osd in octopus and was able to reset and re-add it). I've tried manually to add the first osd on the node where it is located, following https://docs.ceph.com/en/pacific/rados/operations/bluestore-migration/ (not sure it's the best idea...) but it fails too. This node was the one used for deploying the cluster. [ceph: root@mostha1 /]# *ceph-volume lvm zap /dev/sdc* --> Zapping: /dev/sdc --> --destroy was not specified, but zapping a whole device will remove the partition table Running command: /usr/bin/dd if=/dev/zero of=/dev/sdc bs=1M count=10 conv=fsync stderr: 10+0 records in 10+0 records out 10485760 bytes (10 MB, 10 MiB) copied, 0.663425 s, 15.8 MB/s --> Zapping successful for: [ceph: root@mostha1 /]# *ceph-volume lvm create --bluestore --data /dev/sdc* Running command: /usr/bin/ceph-authtool --gen-print-key Running command: /usr/bin/ceph --cluster ceph --name client.bootstrap-osd --keyring /var/lib/ceph/bootstrap-osd/ceph.keyring -i - osd new 9f1eb8ee-41e6-4350-ad73-1be21234ec7c stderr: 2023-10-02T16:09:29.855+ 7fb4eb8c0700 -1 auth: unable to find a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or directory stderr: 2023-10-02T16:09:29.855+ 7fb4eb8c0700 -1 AuthRegistry(0x7fb4e405c4d8) no keyring found at /var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx stderr: 2023-10-02T16:09:29.856+ 7fb4eb8c0700 -1 auth: unable to find a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or directory stderr: 2023-10-02T16:09:29.856+ 7fb4eb8c0700 -1 AuthRegistry(0x7fb4e40601d0) no keyring found at /var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx stderr: 2023-10-02T16:09:29.857+ 7fb4eb8c0700 -1 auth: unable to find a keyring on /var/lib/ceph/bootstrap-osd/ceph.keyring: (2) No such file or directory stderr: 2023-10-02T16:09:29.857+ 7fb4eb8c0700 -1 AuthRegistry(0x7fb4eb8bee90) no keyring found at /var/lib/ceph/bootstrap-osd/ceph.keyring, disabling cephx stderr: 2023-10-02T16:09:29.858+ 7fb4e965c700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1] stderr: 2023-10-02T16:09:29.858+ 7fb4e9e5d700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1] stderr: 2023-10-02T16:09:29.858+ 7fb4e8e5b700 -1 monclient(hunting): handle_auth_bad_method server allowed_methods [2] but i only support [1] stderr: 2023-10-02T16:09:29.858+ 7fb4eb8c0700 -1 monclient: authenticate NOTE: no keyring found; disabled cephx authentication stderr: [errno 13] RADOS permission denied (error connecting to the cluster) --> RuntimeError: Unable to create a new OSD id Any idea of what is wrong ? Thanks Patrick ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io I'm still trying to understand what can be wrong or how to debug this situation where Ceph cannot see the devices. 
The device /dev/sdc exists:

[root@mostha1 ~]# cephadm shell lsmcli ldl
Inferring fsid 250f9864-0142-11ee-8e5f-00266cf8869c
Using recent ceph image quay.io/ceph/ceph@sha256:f30bf50755d7087f47c6223e6a921caf5b12e86401b3d49220230c84a8302a1e
Path     | SCSI VPD 0x83    | Link Type | Serial Number   | Health Status
-
/dev/sda | 50024e92039e4f1c | PATA/SATA | S2B5J90ZA10142  | Good
/dev/sdb | 50014ee0ad5953c9 | PATA/SATA | WD-WMAYP0982329 | Good
/dev/sdc | 50024e920387fa2c | PATA/SATA | S2B5J90ZA02494  | Good

But I cannot do anything with it since I moved from Octopus to Pacific:

[root@mostha1 ~]# cephadm shell ceph orch device zap mostha1.legi.grenoble-inp.fr /dev/sdc --force
Inferring fsid 250f9864-0142-11ee-8e5f-00266cf8869c
Using recent ceph image quay.io/ceph/ceph@sha256:f30bf50755d7087f47c6223e6a921caf5b12e86401b3d49220230c84a8302a1e
Error EINVAL: Device path '/dev/sdc' not found on host 'mostha1.legi.grenoble-inp.fr'

Patrick
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
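For what it's worth, the "no keyring found ... disabling cephx" errors quoted above come from running ceph-volume without the bootstrap-osd key; exporting it first inside the shell, as done elsewhere in this thread, avoids that particular failure (a sketch only, it does not address the device-detection problem):

# Inside the cephadm shell: give ceph-volume the bootstrap-osd
# credentials before asking it to create an OSD:
ceph auth get client.bootstrap-osd > /var/lib/ceph/bootstrap-osd/ceph.keyring
ceph-volume lvm create --bluestore --data /dev/sdc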
[ceph-users] Re: After power outage, osd do not restart
Hi Eneko, I have not worked on the ceph cluster since my last email (doing some user support) and now osd.2 is back in the cluster:

-7 0.68217 host mostha1
 2 hdd 0.22739 osd.2 up 1.0 1.0
 5 hdd 0.45479 osd.5 up 1.0 1.0

Maybe the reboot suggested by Igor? I will try to solve my last problem now. While upgrading from 15.2.13 to 15.2.17 I hit a memory problem on one node (these are old computers used to learn Ceph). Upgrading one of the OSDs failed and this locked the upgrade, as Ceph did not accept to stop and upgrade the next OSD in the cluster. But Ceph started rebalancing the data and magically finished the upgrade. A last OSD is still down and out, and it is a daemon problem, as smartctl reports good health for the HDD. I've changed the faulty memory DIMMs and the node is back in the cluster. So this is my new challenge. Using old hardware (2011) for learning seems a fine way to investigate Ceph reliability: many problems show up, but at no risk!

Patrick

Le 21/09/2023 à 16:31, Eneko Lacunza a écrit :
Hi Patrick, It seems your disk or controller are damaged. Are other disks connected to the same controller working OK? If so, I'd say the disk is dead. Cheers

El 21/9/23 a las 16:17, Patrick Begou escribió:
Hi Igor, a "systemctl reset-failed" doesn't restart the osd. I rebooted the node and now it shows some errors on the HDD:

[ 107.716769] ata3.00: exception Emask 0x0 SAct 0x80 SErr 0x0 action 0x0
[ 107.716782] ata3.00: irq_stat 0x4008
[ 107.716787] ata3.00: failed command: READ FPDMA QUEUED
[ 107.716791] ata3.00: cmd 60/00:b8:00:a8:08/08:00:0e:00:00/40 tag 23 ncq dma 1048576 in res 41/40:00:c2:ad:08/00:00:0e:00:00/40 Emask 0x409 (media error)
[ 107.716802] ata3.00: status: { DRDY ERR }
[ 107.716806] ata3.00: error: { UNC }
[ 107.728547] ata3.00: configured for UDMA/133
[ 107.728575] sd 2:0:0:0: [sda] tag#23 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1s
[ 107.728581] sd 2:0:0:0: [sda] tag#23 Sense Key : Medium Error [current]
[ 107.728585] sd 2:0:0:0: [sda] tag#23 Add. Sense: Unrecovered read error - auto reallocate failed
[ 107.728590] sd 2:0:0:0: [sda] tag#23 CDB: Read(10) 28 00 0e 08 a8 00 00 08 00 00
[ 107.728592] I/O error, dev sda, sector 235449794 op 0x0:(READ) flags 0x80700 phys_seg 6 prio class 2
[ 107.728623] ata3: EH complete
[ 109.203256] ata3.00: exception Emask 0x0 SAct 0x2000 SErr 0x0 action 0x0
[ 109.203268] ata3.00: irq_stat 0x4008
[ 109.203274] ata3.00: failed command: READ FPDMA QUEUED
[ 109.203277] ata3.00: cmd 60/08:e8:48:ad:08/00:00:0e:00:00/40 tag 29 ncq dma 4096 in res 41/40:00:48:ad:08/00:00:0e:00:00/40 Emask 0x409 (media error)
[ 109.203289] ata3.00: status: { DRDY ERR }
[ 109.203292] ata3.00: error: { UNC }

I think the storage is corrupted and I have to reset it all.

Patrick

Le 21/09/2023 à 13:32, Igor Fedotov a écrit :
Maybe execute systemctl reset-failed <...> or even restart the node?

On 21/09/2023 14:26, Patrick Begou wrote:
Hi Igor, the ceph-osd.2.log remains empty on the node where this osd is located. This is what I get when manually restarting the osd.

[root@mostha1 250f9864-0142-11ee-8e5f-00266cf8869c]# systemctl restart ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service
Job for ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service failed because a timeout was exceeded.
See "systemctl status ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service" and "journalctl -xe" for details.
[root@mostha1 250f9864-0142-11ee-8e5f-00266cf8869c]# journalctl -xe
sept.
21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 5728 (podman) in control group while starting unit. Ignoring. sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies. sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 5882 (bash) in control group while starting unit. Ignoring. sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies. sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 5884 (podman) in control group while starting unit. Ignoring. sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies. sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 6031 (bash) in control
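On the stuck-upgrade episode mentioned above: a cephadm-managed upgrade can be inspected and paused or resumed from the orchestrator; a hedged sketch (these subcommands exist from Octopus onwards):

# Check which daemons have been upgraded and which are still pending:
ceph orch upgrade status
# Hold the upgrade while repairing a node, then continue:
ceph orch upgrade pause
ceph orch upgrade resume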
[ceph-users] Re: After power outage, osd do not restart
Hi Igor, a "systemctl reset-failed" doesn't restart the osd. I reboot the node and now it show some error on the HDD: [ 107.716769] ata3.00: exception Emask 0x0 SAct 0x80 SErr 0x0 action 0x0 [ 107.716782] ata3.00: irq_stat 0x4008 [ 107.716787] ata3.00: failed command: READ FPDMA QUEUED [ 107.716791] ata3.00: cmd 60/00:b8:00:a8:08/08:00:0e:00:00/40 tag 23 ncq dma 1048576 in res 41/40:00:c2:ad:08/00:00:0e:00:00/40 Emask 0x409 (media error) [ 107.716802] ata3.00: status: { DRDY ERR } [ 107.716806] ata3.00: error: { UNC } [ 107.728547] ata3.00: configured for UDMA/133 [ 107.728575] sd 2:0:0:0: [sda] tag#23 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK cmd_age=1s [ 107.728581] sd 2:0:0:0: [sda] tag#23 Sense Key : Medium Error [current] [ 107.728585] sd 2:0:0:0: [sda] tag#23 Add. Sense: Unrecovered read error - auto reallocate failed [ 107.728590] sd 2:0:0:0: [sda] tag#23 CDB: Read(10) 28 00 0e 08 a8 00 00 08 00 00 [ 107.728592] I/O error, dev sda, sector 235449794 op 0x0:(READ) flags 0x80700 phys_seg 6 prio class 2 [ 107.728623] ata3: EH complete [ 109.203256] ata3.00: exception Emask 0x0 SAct 0x2000 SErr 0x0 action 0x0 [ 109.203268] ata3.00: irq_stat 0x4008 [ 109.203274] ata3.00: failed command: READ FPDMA QUEUED [ 109.203277] ata3.00: cmd 60/08:e8:48:ad:08/00:00:0e:00:00/40 tag 29 ncq dma 4096 in res 41/40:00:48:ad:08/00:00:0e:00:00/40 Emask 0x409 (media error) [ 109.203289] ata3.00: status: { DRDY ERR } [ 109.203292] ata3.00: error: { UNC } I think the storage is corrupted and I have te reset it all. Patrick Le 21/09/2023 à 13:32, Igor Fedotov a écrit : May be execute systemctl reset-failed <...> or even restart the node? On 21/09/2023 14:26, Patrick Begou wrote: Hi Igor, the ceph-osd.2.log remains empty on the node where this osd is located. This is what I get when manualy restarting the osd. [root@mostha1 250f9864-0142-11ee-8e5f-00266cf8869c]# systemctl restart ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service Job for ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service failed because a timeout was exceeded. See "systemctl status ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service" and "journalctl -xe" for details. [root@mostha1 250f9864-0142-11ee-8e5f-00266cf8869c]# journalctl -xe sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 5728 (podman) in control group while starting unit. Ignoring. sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies. sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 5882 (bash) in control group while starting unit. Ignoring. sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies. sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 5884 (podman) in control group while starting unit. Ignoring. sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies. sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 6031 (bash) in control group while starting unit. Ignoring. sept. 
21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies. sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 6033 (podman) in control group while starting unit. Ignoring. sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies. sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 6185 (bash) in control group while starting unit. Ignoring. sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies. sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 6187 (podman) in control group while starting unit. Ignoring. sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies. sept. 21 13:22:39 mo
[ceph-users] Re: After power outage, osd do not restart
legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15171 (podman) in control group while starting unit. Ignoring. sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies. sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15646 (bash) in control group while starting unit. Ignoring. sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies. sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15648 (podman) in control group while starting unit. Ignoring. sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies. sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15792 (bash) in control group while starting unit. Ignoring. sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies. sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 15794 (podman) in control group while starting unit. Ignoring. sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies. sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 25561 (bash) in control group while starting unit. Ignoring. sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies. sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 25563 (podman) in control group while starting unit. Ignoring. sept. 21 13:22:39 mostha1.legi.grenoble-inp.fr systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies. Patrick Le 21/09/2023 à 12:44, Igor Fedotov a écrit : Hi Patrick, please share osd restart log to investigate that. Thanks, Igor On 21/09/2023 13:41, Patrick Begou wrote: Hi, After a power outage on my test ceph cluster, 2 osd fail to restart. The log file show: 8e5f-00266cf8869c@osd.2.service: Failed with result 'timeout'. Sep 21 11:55:02 mostha1 systemd[1]: Failed to start Ceph osd.2 for 250f9864-0142-11ee-8e5f-00266cf8869c. Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Service RestartSec=10s expired, scheduling restart. Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Scheduled restart job, restart counter is at 2. Sep 21 11:55:12 mostha1 systemd[1]: Stopped Ceph osd.2 for 250f9864-0142-11ee-8e5f-00266cf8869c. Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 1858 (bash) in control group while starting unit. Ignoring. 
Sep 21 11:55:12 mostha1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies. Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 2815 (podman) in control group while starting unit. Ignoring. This is not critical as it is a test cluster and it is actually rebalancing on other osd but I would like to know how to return to HEALTH_OK status. Smartctl show the HDD are OK. So is there a way to recover the osd from this state ? Version is 15.2.17 (juste moved from 15.2.13 to 15.2.17 yesterday, will try to move to latest versions as soon as this problem is solved) Thanks Patrick ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] After power outage, osd do not restart
Hi, After a power outage on my test ceph cluster, 2 OSDs fail to restart. The log file shows:

8e5f-00266cf8869c@osd.2.service: Failed with result 'timeout'.
Sep 21 11:55:02 mostha1 systemd[1]: Failed to start Ceph osd.2 for 250f9864-0142-11ee-8e5f-00266cf8869c.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Service RestartSec=10s expired, scheduling restart.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Scheduled restart job, restart counter is at 2.
Sep 21 11:55:12 mostha1 systemd[1]: Stopped Ceph osd.2 for 250f9864-0142-11ee-8e5f-00266cf8869c.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 1858 (bash) in control group while starting unit. Ignoring.
Sep 21 11:55:12 mostha1 systemd[1]: This usually indicates unclean termination of a previous run, or service implementation deficiencies.
Sep 21 11:55:12 mostha1 systemd[1]: ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service: Found left-over process 2815 (podman) in control group while starting unit. Ignoring.

This is not critical, as it is a test cluster and it is actually rebalancing on other OSDs, but I would like to know how to return to HEALTH_OK status. Smartctl shows the HDDs are OK. So is there a way to recover the OSDs from this state? Version is 15.2.17 (just moved from 15.2.13 to 15.2.17 yesterday, will try to move to later versions as soon as this problem is solved). Thanks

Patrick
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
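One way to attack the "left-over process in control group" symptom, combining the reset-failed hint from this thread with a manual cleanup; a hedged sketch for the unit named in the logs (check what the leftover PIDs actually are before killing anything):

# Stop the unit, clear its failed state, then try a clean start:
systemctl stop ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service
systemctl reset-failed ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service
# If podman/bash leftovers are still listed in the unit's cgroup,
# kill them first (verify the PIDs!), then:
systemctl start ceph-250f9864-0142-11ee-8e5f-00266cf8869c@osd.2.service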
[ceph-users] Re: No snap_schedule module in Octopus
Hi Patrick, I agree that learning Ceph today with Octopus is not a good idea, but, as a newbie with this tool, I was not able to solve the HDD detection problem, and my post about it on this forum did not bring any help (https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/OPMWHJ4ZFCOOPUY6ST4WAJ4G4ASJFALM/). I've also looked for a list of newly unsupported hardware between Octopus and Pacific, without success. I've also received a private mail from a Swedish user reading the forum last week and having the same HDD detection problem with 17.2.6. He asked if I had solved it and told me he would try to debug it. In my mind, an old version of Ceph on old hardware had a better chance of being stable and bug-free too.

Yes, I have created a file system (datacfs) and I can create a snapshot by hand using cephadm. I've just tested:

# ceph fs set datacfs allow_new_snaps true
# ceph-fuse /mnt
# mkdir /mnt/.snap/$(TZ=CET date +%Y-%m-%d:%H-%M-%S)

and I have a snapshot. I can remove it too.

Maybe today my goal should be:
1- try to undo "ceph mgr module enable snap_schedule --force" (always a bad idea in my mind to use options like "--force"; a sketch follows after this message)
2- launch the update to Pacific now that all HDDs are configured. In my Ceph learning process there is also the step to test update procedures.
3- try again to use snap_schedule

Thanks for the time spent on my problem

Patrick

Le 19/09/2023 à 19:46, Patrick Donnelly a écrit :
I'm not sure off-hand. The module did have several changes as recently as Pacific, so it's possible something is broken. Perhaps you don't have a file system created yet? I would still expect to see the commands, however... I suggest you figure out why Ceph Pacific+ can't detect your hard disk drives (???). That seems more productive than debugging a long-EOLed release.

On Tue, Sep 19, 2023 at 8:49 AM Patrick Begou wrote:
Hi Patrick, sorry for the bad copy/paste. As it was not working I have also tried with the module name

[ceph: root@mostha1 /]# ceph fs snap-schedule
no valid command found; 10 closest matches:
fs status []
fs volume ls
fs volume create []
fs volume rm []
fs subvolumegroup ls
fs subvolumegroup create [] [] [] []
fs subvolumegroup rm [--force]
fs subvolume ls []
fs subvolume create [] [] [] [] [] [] [--namespace-isolated]
fs subvolume rm [] [--force] [--retain-snapshots]
Error EINVAL: invalid command

I'm reading the same documentation, but for Octopus: https://docs.ceph.com/en/octopus/cephfs/snap-schedule/# I think that if "ceph mgr module enable snap_schedule" was not working without the "--force" option, it was because something was wrong in my Ceph install.

Patrick

Le 19/09/2023 à 14:29, Patrick Donnelly a écrit :
https://docs.ceph.com/en/quincy/cephfs/snap-schedule/#usage ceph fs snap-schedule (note the hyphen!)

On Tue, Sep 19, 2023 at 8:23 AM Patrick Begou wrote:
Hi, still some problems with snap_schedule, as the ceph fs snap-schedule namespace is not available on my nodes.
[ceph: root@mostha1 /]# ceph mgr module ls | jq -r '.enabled_modules []' cephadm dashboard iostat prometheus restful snap_schedule [ceph: root@mostha1 /]# ceph fs snap_schedule no valid command found; 10 closest matches: fs status [] fs volume ls fs volume create [] fs volume rm [] fs subvolumegroup ls fs subvolumegroup create [] [] [] [] fs subvolumegroup rm [--force] fs subvolume ls [] fs subvolume create [] [] [] [] [] [] [--namespace-isolated] fs subvolume rm [] [--force] [--retain-snapshots] Error EINVAL: invalid command I think I need your help to go further Patrick Le 19/09/2023 à 10:23, Patrick Begou a écrit : Hi, bad question, sorry. I've just run ceph mgr module enable snap_schedule --force to solve this problem. I was just afraid to use "--force" but as I can break this test configuration.... Patrick Le 19/09/2023 à 09:47, Patrick Begou a écrit : Hi, I'm working on a small POC for a ceph setup on 4 old C6100 power-edge. I had to install Octopus since latest versions were unable to detect the HDD (too old hardware ??). No matter, this is only for training and understanding Ceph environment. My installation is based on https://download.ceph.com/rpm-15.2.12/el8/noarch/cephadm-15.2.12-0.el8.noarch.rpm bootstrapped. I'm reaching the point to automate the snapshots (I can create snapshot by hand without any problem). The documentation https://download.ceph.com/rpm-15.2.12/el8/noarch/cephadm-15.2.12-0.el8.noarch.rpm says to use the snap_schedule module but this module does not exist. # ceph mgr module ls | jq -r '.enabled_modules []' cephadm dashboard iostat prometheus restful Have I missed something ? Is there some additional install steps to do for this module ? Thanks for your help. Patrick ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io _
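Regarding step 1 of the plan above (undoing the forced enable): disabling a manager module is symmetric to enabling it; a minimal sketch:

# Disable the force-enabled module; it can be re-enabled later
# (without --force) on a release that actually ships it:
ceph mgr module disable snap_schedule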
[ceph-users] Re: No snap_schedule module in Octopus
Hi Patrick, sorry for the bad copy/paste. As it was not working I have also tried with the module name [ceph: root@mostha1 /]# ceph fs snap-schedule no valid command found; 10 closest matches: fs status [] fs volume ls fs volume create [] fs volume rm [] fs subvolumegroup ls fs subvolumegroup create [] [] [] [] fs subvolumegroup rm [--force] fs subvolume ls [] fs subvolume create [] [] [] [] [] [] [--namespace-isolated] fs subvolume rm [] [--force] [--retain-snapshots] Error EINVAL: invalid command I'm reading the same documentation, but for Octopus: https://docs.ceph.com/en/octopus/cephfs/snap-schedule/# I think that if "ceph mgr module enable snap_schedule" was not working without the "--force" option, it was because something was wrong in my Ceph install. Patrick Le 19/09/2023 à 14:29, Patrick Donnelly a écrit : https://docs.ceph.com/en/quincy/cephfs/snap-schedule/#usage ceph fs snap-schedule (note the hyphen!) On Tue, Sep 19, 2023 at 8:23 AM Patrick Begou wrote: Hi, still some problems with snap_schedule as as the ceph fs snap-schedule namespace is not available on my nodes. [ceph: root@mostha1 /]# ceph mgr module ls | jq -r '.enabled_modules []' cephadm dashboard iostat prometheus restful snap_schedule [ceph: root@mostha1 /]# ceph fs snap_schedule no valid command found; 10 closest matches: fs status [] fs volume ls fs volume create [] fs volume rm [] fs subvolumegroup ls fs subvolumegroup create [] [] [] [] fs subvolumegroup rm [--force] fs subvolume ls [] fs subvolume create [] [] [] [] [] [] [--namespace-isolated] fs subvolume rm [] [--force] [--retain-snapshots] Error EINVAL: invalid command I think I need your help to go further Patrick Le 19/09/2023 à 10:23, Patrick Begou a écrit : Hi, bad question, sorry. I've just run ceph mgr module enable snap_schedule --force to solve this problem. I was just afraid to use "--force" but as I can break this test configuration Patrick Le 19/09/2023 à 09:47, Patrick Begou a écrit : Hi, I'm working on a small POC for a ceph setup on 4 old C6100 power-edge. I had to install Octopus since latest versions were unable to detect the HDD (too old hardware ??). No matter, this is only for training and understanding Ceph environment. My installation is based on https://download.ceph.com/rpm-15.2.12/el8/noarch/cephadm-15.2.12-0.el8.noarch.rpm bootstrapped. I'm reaching the point to automate the snapshots (I can create snapshot by hand without any problem). The documentation https://download.ceph.com/rpm-15.2.12/el8/noarch/cephadm-15.2.12-0.el8.noarch.rpm says to use the snap_schedule module but this module does not exist. # ceph mgr module ls | jq -r '.enabled_modules []' cephadm dashboard iostat prometheus restful Have I missed something ? Is there some additional install steps to do for this module ? Thanks for your help. Patrick ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
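For completeness, on a release that ships the module, the hyphenated command family Patrick Donnelly points at is used like this; a hedged sketch (the path and retention values are made-up examples):

# Snapshot the file system root every hour and keep 24 hourly snapshots:
ceph fs snap-schedule add / 1h
ceph fs snap-schedule retention add / h 24
ceph fs snap-schedule status /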
[ceph-users] Re: No snap_schedule module in Octopus
Hi, still some problems with snap_schedule, as the ceph fs snap-schedule namespace is not available on my nodes.

[ceph: root@mostha1 /]# ceph mgr module ls | jq -r '.enabled_modules []'
cephadm
dashboard
iostat
prometheus
restful
snap_schedule

[ceph: root@mostha1 /]# ceph fs snap_schedule
no valid command found; 10 closest matches:
fs status []
fs volume ls
fs volume create []
fs volume rm []
fs subvolumegroup ls
fs subvolumegroup create [] [] [] []
fs subvolumegroup rm [--force]
fs subvolume ls []
fs subvolume create [] [] [] [] [] [] [--namespace-isolated]
fs subvolume rm [] [--force] [--retain-snapshots]
Error EINVAL: invalid command

I think I need your help to go further

Patrick

Le 19/09/2023 à 10:23, Patrick Begou a écrit :
Hi, bad question, sorry. I've just run ceph mgr module enable snap_schedule --force to solve this problem. I was just afraid to use "--force", but I can break this test configuration

Patrick

Le 19/09/2023 à 09:47, Patrick Begou a écrit :
Hi, I'm working on a small POC for a ceph setup on 4 old C6100 PowerEdge servers. I had to install Octopus since later versions were unable to detect the HDDs (too old hardware??). No matter, this is only for training and understanding the Ceph environment. My installation is based on https://download.ceph.com/rpm-15.2.12/el8/noarch/cephadm-15.2.12-0.el8.noarch.rpm bootstrapped. I'm reaching the point to automate the snapshots (I can create snapshots by hand without any problem). The documentation https://download.ceph.com/rpm-15.2.12/el8/noarch/cephadm-15.2.12-0.el8.noarch.rpm says to use the snap_schedule module but this module does not exist.

# ceph mgr module ls | jq -r '.enabled_modules []'
cephadm
dashboard
iostat
prometheus
restful

Have I missed something? Are there additional install steps for this module? Thanks for your help.

Patrick
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: No snap_schedule module in Octopus
Hi, bad question, sorry. I've just run ceph mgr module enable snap_schedule --force to solve this problem. I was just afraid to use "--force" but as I can break this test configuration Patrick Le 19/09/2023 à 09:47, Patrick Begou a écrit : Hi, I'm working on a small POC for a ceph setup on 4 old C6100 power-edge. I had to install Octopus since latest versions were unable to detect the HDD (too old hardware ??). No matter, this is only for training and understanding Ceph environment. My installation is based on https://download.ceph.com/rpm-15.2.12/el8/noarch/cephadm-15.2.12-0.el8.noarch.rpm bootstrapped. I'm reaching the point to automate the snapshots (I can create snapshot by hand without any problem). The documentation https://download.ceph.com/rpm-15.2.12/el8/noarch/cephadm-15.2.12-0.el8.noarch.rpm says to use the snap_schedule module but this module does not exist. # ceph mgr module ls | jq -r '.enabled_modules []' cephadm dashboard iostat prometheus restful Have I missed something ? Is there some additional install steps to do for this module ? Thanks for your help. Patrick ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] No snap_schedule module in Octopus
Hi, I'm working on a small POC for a ceph setup on 4 old C6100 PowerEdge servers. I had to install Octopus since later versions were unable to detect the HDDs (too old hardware??). No matter, this is only for training and understanding the Ceph environment. My installation is based on https://download.ceph.com/rpm-15.2.12/el8/noarch/cephadm-15.2.12-0.el8.noarch.rpm bootstrapped. I'm reaching the point to automate the snapshots (I can create snapshots by hand without any problem). The documentation https://download.ceph.com/rpm-15.2.12/el8/noarch/cephadm-15.2.12-0.el8.noarch.rpm says to use the snap_schedule module but this module does not exist.

# ceph mgr module ls | jq -r '.enabled_modules []'
cephadm
dashboard
iostat
prometheus
restful

Have I missed something? Are there additional install steps for this module? Thanks for your help.

Patrick
___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: Seeking feedback on Improving cephadm bootstrap process
I'm a new ceph user and I have some trouble with bootstrapping with cephadm: using Pacific or Quincy, no hard drives are detected by Ceph; using Octopus, all the hard drives are detected. As I do not know how to really clean up even a successful (but not functional) install, each test requires a full reinstall of the node (it is a test node, no problem except the time needed). A detailed (and working) cleaning/uninstalling method (or command) for a Ceph deployment would be very helpful for a Ceph newbie (a sketch follows after the quoted discussion below). About how to do this: I'm using Proxmox for virtualization, and removing a VM via the web interface requires typing the ID of the VM again. Maybe Ceph could require the user to provide the cluster ID when running such a command? Either always create a different ID when building a new cluster, or ask for it again while the command is running, as a double check.

Best regards,

Patrick

Le 30/05/2023 à 11:23, Frank Schilder a écrit :
What I'm having in mind is if the command is already in history. A wrong history reference can execute a command with "--yes-i-really-mean-it" even though you really don't mean it. Been there. For an OSD this is maybe tolerable, but for an entire cluster ... not really. Some things need to be hard to limit the blast radius of a typo (or attacker). For example, when issuing such a command the first time, the cluster could print a nonce that needs to be included in such a command to make it happen and which is only valid once for this exact command, so one actually needs to type something new every time to destroy stuff. An exception could be if a "safe-to-destroy" query for any daemon (pool etc.) returns true. I would still not allow an entire cluster to be wiped with a single command. In a single step, only allow to destroy what could be recovered in some way (there has to be some form of undo). And there should be notifications to all admins about what is going on to be able to catch malicious execution of destructive commands.

Best regards,
= Frank Schilder AIT Risø Campus Bygning 109, rum S14

From: Nico Schottelius Sent: Tuesday, May 30, 2023 10:51 AM To: Frank Schilder Cc: Nico Schottelius; Redouane Kachach; ceph-users@ceph.io Subject: Re: [ceph-users] Re: Seeking feedback on Improving cephadm bootstrap process

Hey Frank, in regards to destroying a cluster, I'd suggest to reuse the old --yes-i-really-mean-it parameter, as it is already in use by ceph osd destroy [0]. Then it doesn't matter whether it's prod or not, if you really mean it ... ;-)

Best regards, Nico

[0] https://docs.ceph.com/en/latest/rados/operations/add-or-rm-osds/

Frank Schilder writes:
Hi, I would like to second Nico's comment. What happened to the idea that a deployment tool should be idempotent? The most natural option would be: 1) start install -> something fails 2) fix problem 3) repeat exact same deploy command -> deployment picks up at current state (including cleaning up failed state markers) and tries to continue until next issue (go to 2). I'm not sure (meaning: it's a terrible idea) if it's a good idea to provide a single command to wipe a cluster. Just for the fat finger syndrome. This seems safe only if it would be possible to mark a cluster as production somehow (must be sticky, that is, cannot be unset), which prevents a cluster destroy command (or any too dangerous command) from executing. I understand the test case in the tracker, but having such test-case utils that can run on a production cluster and destroy everything seems a bit dangerous.
I think destroying a cluster should be a manual and tedious process, and figuring out how to do it should be part of the learning experience. So my answer to "how do I start over" would be "go figure it out, it's an important lesson".

Best regards, = Frank Schilder AIT Risø Campus Bygning 109, rum S14

From: Nico Schottelius Sent: Friday, May 26, 2023 10:40 PM To: Redouane Kachach Cc: ceph-users@ceph.io Subject: [ceph-users] Re: Seeking feedback on Improving cephadm bootstrap process

Hello Redouane, much appreciated kick-off for improving cephadm. I was wondering why cephadm does not use an approach similar to Rook, in the sense of "repeat until it is fixed"? For background: Rook uses a controller that checks the state of the cluster, the state of the monitors, whether there are disks to be added, etc. It periodically restarts the checks and, when needed, shifts monitors, creates OSDs, and so on. My question is: why not have a daemon or checker subcommand of cephadm that a) checks what the current cluster status is (i.e. cephadm verify-cluster) and b) fixes the situation (i.e. cephadm verify-and-fix-cluster)? I think that option would be much more beneficial than the other two suggested ones.

Best regards, Nico

-- Sustainable and modern Infrastructures by ungleich.ch
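For what it's worth, recent cephadm already ships a teardown command that behaves roughly the way Patrick suggests: it refuses to run unless the cluster fsid is spelled out. A minimal sketch, assuming a cephadm new enough to have rm-cluster and the --zap-osds flag (older releases may lack the latter):

    # List local daemons to find the fsid (also shown by "ceph -s").
    cephadm ls | grep fsid

    # Tear down the cluster on this host; the fsid acts as the double check.
    # --zap-osds additionally wipes the OSD devices (assumption: recent cephadm).
    cephadm rm-cluster --fsid <fsid> --force --zap-osds

As far as I know, rm-cluster only cleans up the host it runs on, so on a multi-host cluster it would need to be repeated on each node.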
[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD
Hi Michel, I do not notice anything strange in the log files (looking for errors or warnings). The hardware is a DELL C6100 sled (from 2011) running AlmaLinux 8, up to date. It uses 3 SATA disks. Is there a way to force OSD installation by hand, providing the device (/dev/sdc for example)? A "do what I say" approach... Would it be a good idea to deploy Octopus on the nodes, configure the OSDs (even if podman 4.2.0 is not validated for Octopus) and then upgrade to Pacific? Could this be a workaround for this sort of regression from Octopus to Pacific? Maybe updating the BIOS from 1.7.1 to 1.8.1? All this is a little bit confusing for me as I'm just discovering Ceph. Thanks, Patrick

Le 26/05/2023 à 17:19, Michel Jouvin a écrit :

Hi Patrick, it is weird: we have a couple of clusters deployed with cephadm and running Pacific or Quincy, and "ceph orch device ls" works well. Have you looked at the cephadm logs (ceph log last cephadm)? Unless you are using very specific hardware, I suspect Ceph is suffering from a problem outside of it... Cheers, Michel Sent from my mobile

Le 26 mai 2023 17:02:50 Patrick Begou a écrit :

Hi, I'm back working on this problem. First of all, I saw that I had a hardware memory error, so I had to solve that first. It's done. I've tested several different Ceph deployments, each time starting from a full OS re-install (it requires some time for each test). Using Octopus, the devices are found:

    dnf -y install \
      https://download.ceph.com/rpm-15.2.12/el8/noarch/cephadm-15.2.12-0.el8.noarch.rpm
    monip=$(getent ahostsv4 mostha1 | head -n 1 | awk '{ print $1 }')
    cephadm bootstrap --mon-ip $monip --initial-dashboard-password x \
      --allow-fqdn-hostname

    [ceph: root@mostha1 /]# ceph orch device ls
    Hostname                      Path      Type  Serial           Size  Health   Ident  Fault  Available
    mostha1.legi.grenoble-inp.fr  /dev/sda  hdd   S2B5J90ZA02494   250G  Unknown  N/A    N/A    Yes
    mostha1.legi.grenoble-inp.fr  /dev/sdc  hdd   WD-WMAYP0982329  500G  Unknown  N/A    N/A    Yes

But with Pacific or Quincy the command returns nothing. With Pacific:

    dnf -y install \
      https://download.ceph.com/rpm-16.2.13/el8/noarch/cephadm-16.2.13-0.el8.noarch.rpm
    monip=$(getent ahostsv4 mostha1 | head -n 1 | awk '{ print $1 }')
    cephadm bootstrap --mon-ip $monip --initial-dashboard-password x \
      --allow-fqdn-hostname

"ceph orch device ls" doesn't return anything, but "cephadm shell lsmcli ldl" lists all the devices.
    [ceph: root@mostha1 /]# ceph orch device ls --wide
    [ceph: root@mostha1 /]# lsblk
    NAME                 MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
    sda                    8:0    1 232.9G  0 disk
    |-sda1                 8:1    1   3.9G  0 part /rootfs/boot
    |-sda2                 8:2    1  78.1G  0 part
    | `-osvg-rootvol     253:0    0  48.8G  0 lvm  /rootfs
    |-sda3                 8:3    1   3.9G  0 part [SWAP]
    `-sda4                 8:4    1 146.9G  0 part
      |-secretvg-homevol 253:1    0   9.8G  0 lvm  /rootfs/home
      |-secretvg-tmpvol  253:2    0   9.8G  0 lvm  /rootfs/tmp
      `-secretvg-varvol  253:3    0   9.8G  0 lvm  /rootfs/var
    sdb                    8:16   1 232.9G  0 disk
    sdc                    8:32   1 465.8G  0 disk
    [ceph: root@mostha1 /]# exit
    [root@mostha1 ~]# cephadm ceph-volume inventory
    Inferring fsid 2e3e85a8-fbcf-11ed-84e5-00266cf8869c
    Using ceph image with id '0dc91bca92c2' and tag 'v17' created on 2023-05-25 16:26:31 + UTC
    quay.io/ceph/ceph@sha256:b8df01a568f4dec7bac6d5040f9391dcca14e00ec7f4de8a3dcf3f2a6502d3a9
    Device Path Size Device nodes rotates available Model name
    [root@mostha1 ~]# cephadm shell lsmcli ldl
    Inferring fsid 4d54823c-fb05-11ed-aecf-00266cf8869c
    Inferring config /var/lib/ceph/4d54823c-fb05-11ed-aecf-00266cf8869c/mon.mostha1/config
    Using ceph image with id 'c9a1062f7289' and tag 'v17' created on 2023-04-25 16:04:33 + UTC
    quay.io/ceph/ceph@sha256:af79fedafc42237b7612fe2d18a9c64ca62a0b38ab362e614ad671efa4a0547e
    Path     | SCSI VPD 0x83    | Link Type | Serial Number   | Health Status
    /dev/sda | 50024e92039e4f1c | PATA/SATA | S2B5J90ZA10142  | Good
    /dev/sdc | 50014ee0ad5953c9 | PATA/SATA | WD-WMAYP0982329 | Good
    /dev/sdb | 50024e920387fa2c | PATA/SATA | S2B5J90ZA02494  | Good

Could it be a bug in ceph-volume? Adam suggests looking at the underlying commands (lsblk, blkid, udevadm, lvs, or pvs), but I'm not very comfortable with blkid and udevadm. Is there a "debug flag" to make Ceph more verbose? Thanks, Patrick

Le 15/05/2023 à 21:20, Adam King a écrit :

As you already seem to have figured out, "ceph orch device ls" is populated with the results from "ceph-volume inventory".
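On Patrick's question about forcing OSD creation by hand on a given device: orchestrator can be pointed at an explicit host:device pair, bypassing the automatic inventory-driven placement. A sketch, assuming the host name matches the one shown by "ceph orch host ls" and that /dev/sdc carries no partitions or LVM state:

    # "Do what I say": create an OSD on an explicit host:device pair.
    ceph orch daemon add osd mostha1.legi.grenoble-inp.fr:/dev/sdc

    # Alternative: drive ceph-volume directly through cephadm on the host.
    cephadm ceph-volume -- lvm create --data /dev/sdc

If ceph-volume itself cannot see the device, both commands should fail with a similar error, which would at least confirm where the problem sits.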
[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD
That inventory in turn bases its report on underlying commands such as lsblk, blkid, udevadm, lvs, or pvs. Also, if you want to see if it's an issue with a certain version of ceph-volume, you can use different versions by passing the image flag to cephadm. E.g. "cephadm --image quay.io/ceph/ceph:v17.2.6 ceph-volume -- inventory" would use the 17.2.6 version of ceph-volume for the inventory. It works by running ceph-volume through the container, so you don't have to worry about installing different packages to try them, and it should pull the container image on its own if it isn't on the machine already (but note that means the command will take longer as it pulls the image the first time).

On Sat, May 13, 2023 at 4:34 AM Patrick Begou wrote:

Hi Joshua, I've tried these commands but it looks like Ceph is unable to see and configure these HDDs.

    [root@mostha1 ~]# cephadm ceph-volume inventory
    Inferring fsid 4b7a6504-f0be-11ed-be1a-00266cf8869c
    Using recent ceph image quay.io/ceph/ceph@sha256:e6919776f0ff8331a8e9c4b18d36c5e9eed31e1a80da62ae8454e42d10e95544
    Device Path Size Device nodes rotates available Model name
    [root@mostha1 ~]# cephadm shell
    [ceph: root@mostha1 /]# ceph orch apply osd --all-available-devices
    Scheduled osd.all-available-devices update...
    [ceph: root@mostha1 /]# ceph orch device ls
    [ceph: root@mostha1 /]# ceph-volume lvm zap /dev/sdb
    --> Zapping: /dev/sdb
    --> --destroy was not specified, but zapping a whole device will remove the partition table
    Running command: /usr/bin/dd if=/dev/zero of=/dev/sdb bs=1M count=10 conv=fsync
     stderr: 10+0 records in
    10+0 records out
    10485760 bytes (10 MB, 10 MiB) copied, 0.10039 s, 104 MB/s
    --> Zapping successful for: <Raw Device: /dev/sdb>

I can check that /dev/sdb1 has been erased, so the previous command was successful.

    [ceph: root@mostha1 ceph]# lsblk
    NAME                 MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
    sda                    8:0    1 232.9G  0 disk
    |-sda1                 8:1    1   3.9G  0 part /rootfs/boot
    |-sda2                 8:2    1  78.1G  0 part
    | `-osvg-rootvol     253:0    0  48.8G  0 lvm  /rootfs
    |-sda3                 8:3    1   3.9G  0 part [SWAP]
    `-sda4                 8:4    1 146.9G  0 part
      |-secretvg-homevol 253:1    0   9.8G  0 lvm  /rootfs/home
      |-secretvg-tmpvol  253:2    0   9.8G  0 lvm  /rootfs/tmp
      `-secretvg-varvol  253:3    0   9.8G  0 lvm  /rootfs/var
    sdb                    8:16   1 465.8G  0 disk
    sdc                    8:32   1 232.9G  0 disk

But still no visible HDD:

    [ceph: root@mostha1 ceph]# ceph orch apply osd --all-available-devices
    Scheduled osd.all-available-devices update...
    [ceph: root@mostha1 ceph]# ceph orch device ls
    [ceph: root@mostha1 ceph]#

Maybe I have done something bad at install time, as in the container I unintentionally ran:

    dnf -y install https://download.ceph.com/rpm-16.2.13/el8/noarch/cephadm-16.2.13-0.el8.noarch.rpm

(an awful copy/paste launching the command). Can this break the container? I do not know what should be available as Ceph packages in the container to remove this install properly (no dnf.log file in the container). Patrick

Le 12/05/2023 à 21:38, Beaman, Joshua a écrit :

> The most significant point I see there, is you have no OSD service
> spec to tell orchestrator how to deploy OSDs. The easiest fix for
> that would be "ceph orch apply osd --all-available-devices".
>
> This will create a simple spec that should work for a test
> environment. Most likely it will collocate the block, block.db, and
> WAL all on the same device. Not ideal for prod environments, but fine
> for practice and testing.
>
> The other command I should have had you try is "cephadm ceph-volume
> inventory".
> That should show you the devices available for OSD
> deployment, and hopefully matches up to what your "lsblk" shows. If
> you need to zap HDDs and orchestrator is still not seeing them, you
> can try "cephadm ceph-volume lvm zap /dev/sdb".
>
> Thank you,
>
> Josh Beaman
>
> From: Patrick Begou
> Date: Friday, May 12, 2023 at 2:22 PM
> To: Beaman, Joshua , ceph-users
> Subject: Re: [EXTERNAL] [ceph-users] [Pacific] ceph orch device ls
> do not returns any HDD
>
> Hi Joshua and thanks for this quick reply.
>
> At this step I have only one node. I was checking what ceph was
> returning with different commands on this host before adding new
> hosts. Just to compare with my first Octopus install.
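Adam's --image trick also makes it easy to bisect this kind of regression between releases without installing anything on the host. A sketch, assuming the quay.io tags below exist and are pullable from the node:

    # Inventory as seen by an Octopus ceph-volume (the version that worked here).
    cephadm --image quay.io/ceph/ceph:v15.2.17 ceph-volume -- inventory

    # The same inventory with Pacific and Quincy, to see where detection breaks.
    cephadm --image quay.io/ceph/ceph:v16.2.13 ceph-volume -- inventory
    cephadm --image quay.io/ceph/ceph:v17.2.6 ceph-volume -- inventory

Each run pulls the image on first use, so the first invocation per version is slow.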
[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD
Hi Joshua, I've tried these commands but it looks like Ceph is unable to see and configure these HDDs.

    [root@mostha1 ~]# cephadm ceph-volume inventory
    Inferring fsid 4b7a6504-f0be-11ed-be1a-00266cf8869c
    Using recent ceph image quay.io/ceph/ceph@sha256:e6919776f0ff8331a8e9c4b18d36c5e9eed31e1a80da62ae8454e42d10e95544
    Device Path Size Device nodes rotates available Model name
    [root@mostha1 ~]# cephadm shell
    [ceph: root@mostha1 /]# ceph orch apply osd --all-available-devices
    Scheduled osd.all-available-devices update...
    [ceph: root@mostha1 /]# ceph orch device ls
    [ceph: root@mostha1 /]# ceph-volume lvm zap /dev/sdb
    --> Zapping: /dev/sdb
    --> --destroy was not specified, but zapping a whole device will remove the partition table
    Running command: /usr/bin/dd if=/dev/zero of=/dev/sdb bs=1M count=10 conv=fsync
     stderr: 10+0 records in
    10+0 records out
    10485760 bytes (10 MB, 10 MiB) copied, 0.10039 s, 104 MB/s
    --> Zapping successful for: <Raw Device: /dev/sdb>

I can check that /dev/sdb1 has been erased, so the previous command was successful.

    [ceph: root@mostha1 ceph]# lsblk
    NAME                 MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
    sda                    8:0    1 232.9G  0 disk
    |-sda1                 8:1    1   3.9G  0 part /rootfs/boot
    |-sda2                 8:2    1  78.1G  0 part
    | `-osvg-rootvol     253:0    0  48.8G  0 lvm  /rootfs
    |-sda3                 8:3    1   3.9G  0 part [SWAP]
    `-sda4                 8:4    1 146.9G  0 part
      |-secretvg-homevol 253:1    0   9.8G  0 lvm  /rootfs/home
      |-secretvg-tmpvol  253:2    0   9.8G  0 lvm  /rootfs/tmp
      `-secretvg-varvol  253:3    0   9.8G  0 lvm  /rootfs/var
    sdb                    8:16   1 465.8G  0 disk
    sdc                    8:32   1 232.9G  0 disk

But still no visible HDD:

    [ceph: root@mostha1 ceph]# ceph orch apply osd --all-available-devices
    Scheduled osd.all-available-devices update...
    [ceph: root@mostha1 ceph]# ceph orch device ls
    [ceph: root@mostha1 ceph]#

Maybe I have done something bad at install time, as in the container I unintentionally ran:

    dnf -y install https://download.ceph.com/rpm-16.2.13/el8/noarch/cephadm-16.2.13-0.el8.noarch.rpm

(an awful copy/paste launching the command). Can this break the container? I do not know what should be available as Ceph packages in the container to remove this install properly (no dnf.log file in the container). Patrick

Le 12/05/2023 à 21:38, Beaman, Joshua a écrit :

The most significant point I see there, is you have no OSD service spec to tell orchestrator how to deploy OSDs. The easiest fix for that would be "ceph orch apply osd --all-available-devices". This will create a simple spec that should work for a test environment. Most likely it will collocate the block, block.db, and WAL all on the same device. Not ideal for prod environments, but fine for practice and testing. The other command I should have had you try is "cephadm ceph-volume inventory".
    [root@mostha1 ~]# cephadm check-host
    podman (/usr/bin/podman) version 4.2.0 is present
    systemctl is present
    lvcreate is present
    Unit chronyd.service is enabled and running
    Host looks OK

    [ceph: root@mostha1 /]# ceph -s
      cluster:
        id:     4b7a6504-f0be-11ed-be1a-00266cf8869c
        health: HEALTH_WARN
                OSD count 0 < osd_pool_default_size 3
      services:
        mon: 1 daemons, quorum mostha1.legi.grenoble-inp.fr (age 5h)
        mgr: mostha1.legi.grenoble-inp.fr.hogwuz(active, since 5h)
        osd: 0 osds: 0 up, 0 in
      data:
        pools:   0 pools, 0 pgs
        objects: 0 objects, 0 B
        usage:   0 B used, 0 B / 0 B avail
        pgs:

    [ceph: root@mostha1 /]# ceph orch ls
    NAME           PORTS        RUNNING  REFRESHED  AGE  PLACEMENT
    alertmanager   ?:9093,9094  1/1      6m ago     6h   count:1
    crash                       1/1      6m ago     6h   *
    grafana        ?:3000       1/1      6m ago     6h   count:1
    mgr                         1/2      6m ago     6h   count:2
    mon                         1/5      6m ago     6h   count:5
    node-exporter  ?:9100       1/1      6m ago     6h   *
    prometheus     ?:9095       1/1      6m ago     6h   count:1
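Regarding Patrick's earlier question about a debug flag: ceph-volume writes its own log on the host, and its inventory can emit JSON, which includes the per-device availability checks and reject reasons. A sketch, assuming default cephadm log paths and a ceph-volume build that supports the json-pretty format:

    # Machine-readable inventory, including "rejected_reasons" per device.
    cephadm ceph-volume -- inventory --format json-pretty

    # cephadm's own log, plus ceph-volume's log for the cluster fsid.
    less /var/log/ceph/cephadm.log
    less /var/log/ceph/<fsid>/ceph-volume.log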
[ceph-users] Re: [EXTERNAL] [Pacific] ceph orch device ls do not returns any HDD
Hi Joshua and thanks for this quick reply. At this step I have only one node. I was checking what Ceph was returning with different commands on this host before adding new hosts, just to compare with my first Octopus install. As this hardware is for testing only, it remains easy for me to break everything and reinstall again.

    [root@mostha1 ~]# cephadm check-host
    podman (/usr/bin/podman) version 4.2.0 is present
    systemctl is present
    lvcreate is present
    Unit chronyd.service is enabled and running
    Host looks OK

    [ceph: root@mostha1 /]# ceph -s
      cluster:
        id:     4b7a6504-f0be-11ed-be1a-00266cf8869c
        health: HEALTH_WARN
                OSD count 0 < osd_pool_default_size 3
      services:
        mon: 1 daemons, quorum mostha1.legi.grenoble-inp.fr (age 5h)
        mgr: mostha1.legi.grenoble-inp.fr.hogwuz(active, since 5h)
        osd: 0 osds: 0 up, 0 in
      data:
        pools:   0 pools, 0 pgs
        objects: 0 objects, 0 B
        usage:   0 B used, 0 B / 0 B avail
        pgs:

    [ceph: root@mostha1 /]# ceph orch ls
    NAME           PORTS        RUNNING  REFRESHED  AGE  PLACEMENT
    alertmanager   ?:9093,9094  1/1      6m ago     6h   count:1
    crash                       1/1      6m ago     6h   *
    grafana        ?:3000       1/1      6m ago     6h   count:1
    mgr                         1/2      6m ago     6h   count:2
    mon                         1/5      6m ago     6h   count:5
    node-exporter  ?:9100       1/1      6m ago     6h   *
    prometheus     ?:9095       1/1      6m ago     6h   count:1

    [ceph: root@mostha1 /]# ceph orch ls osd --export
    No services reported
    [ceph: root@mostha1 /]# ceph orch host ls
    HOST                          ADDR           LABELS  STATUS
    mostha1.legi.grenoble-inp.fr  194.254.66.34  _admin
    1 hosts in cluster
    [ceph: root@mostha1 /]# ceph log last cephadm
    ...
    2023-05-12T15:19:58.754655+ mgr.mostha1.legi.grenoble-inp.fr.hogwuz (mgr.44098) 1876 : cephadm [INF] Zap device mostha1.legi.grenoble-inp.fr:/dev/sdb
    2023-05-12T15:19:58.756639+ mgr.mostha1.legi.grenoble-inp.fr.hogwuz (mgr.44098) 1877 : cephadm [ERR] Device path '/dev/sdb' not found on host 'mostha1.legi.grenoble-inp.fr'
    Traceback (most recent call last):
      File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 125, in wrapper
        return OrchResult(f(*args, **kwargs))
      File "/usr/share/ceph/mgr/cephadm/module.py", line 2275, in zap_device
        f"Device path '{path}' not found on host '{host}'")
    orchestrator._interface.OrchestratorError: Device path '/dev/sdb' not found on host 'mostha1.legi.grenoble-inp.fr'

    [ceph: root@mostha1 /]# ls -l /dev/sdb
    brw-rw 1 root disk 8, 16 May 12 15:16 /dev/sdb
    [ceph: root@mostha1 /]# lsblk /dev/sdb
    NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
    sdb      8:16   1 465.8G  0 disk
    `-sdb1   8:17   1 465.8G  0 part

I have created a full partition on /dev/sdb (for testing) and /dev/sdc has no partition table (removed). But all seems fine with these commands.

Patrick

Le 12/05/2023 à 20:19, Beaman, Joshua a écrit :

I don't quite understand why that zap would not work. But, here's where I'd start.

1. cephadm check-host: run this on each of your hosts to make sure cephadm, podman and all other prerequisites are installed and recognized.
2. ceph orch ls: this should show at least a mon, mgr, and osd spec deployed.
3. ceph orch ls osd --export: this will show the OSD placement service specifications that orchestrator uses to identify devices to deploy as OSDs.
4. ceph orch host ls: this will list the hosts that have been added to orchestrator's inventory, and what labels are applied, which correlate to the service placement labels.
5. ceph log last cephadm: this will show you what orchestrator has been trying to do, and how it may be failing.

Also, it's never un-helpful to have a look at "ceph -s" and "ceph health detail", particularly for any people trying to help you without access to your systems.
Best of luck,

Josh Beaman

From: Patrick Begou
Date: Friday, May 12, 2023 at 10:45 AM
To: ceph-users
Subject: [EXTERNAL] [ceph-users] [Pacific] ceph orch device ls do not returns any HDD

Hi everyone, I'm new to Ceph: just a French four-day training session with Octopus on VMs, which convinced me to build my first cluster. At this time I have 4 old identical nodes for testing, with 3 HDDs each and 2 network interfaces, running AlmaLinux 8 (el8). I tried to replay the training session but it failed, breaking the web interface because of some problems with podman 4.2 not being compatible with Octopus. So I tried to deploy Pacific with the cephadm tool on my first node (mostha1), to also enable testing an upgrade later.

    dnf -y install https://download.ceph.com/rpm-16.2.13/el8/noarch/cephadm-16.2.13-0.el8.noarch.rpm
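Since ceph-volume builds its inventory from standard host tools, running those directly shows what it has to work with; a device that looks wrong here will usually explain an empty "ceph orch device ls". A sketch of the underlying commands Adam mentioned, using Patrick's device names:

    # Kernel view of the disks: name, size, rotational flag, device type.
    lsblk -o NAME,SIZE,ROTA,TYPE,MOUNTPOINT

    # Filesystem/partition signatures that can mark a disk as "in use".
    blkid /dev/sdb /dev/sdc

    # udev properties for one device; ceph-volume consults several of these.
    udevadm info --query=property --name=/dev/sdb

    # Existing LVM state that could be claiming the disks.
    pvs && lvs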
[ceph-users] [Pacific] ceph orch device ls do not returns any HDD
Hi everyone, I'm new to Ceph: just a French four-day training session with Octopus on VMs, which convinced me to build my first cluster. At this time I have 4 old identical nodes for testing, with 3 HDDs each and 2 network interfaces, running AlmaLinux 8 (el8). I tried to replay the training session but it failed, breaking the web interface because of some problems with podman 4.2 not being compatible with Octopus. So I tried to deploy Pacific with the cephadm tool on my first node (mostha1), to also enable testing an upgrade later.

    dnf -y install https://download.ceph.com/rpm-16.2.13/el8/noarch/cephadm-16.2.13-0.el8.noarch.rpm
    monip=$(getent ahostsv4 mostha1 | head -n 1 | awk '{ print $1 }')
    cephadm bootstrap --mon-ip $monip --initial-dashboard-password x \
      --initial-dashboard-user admceph \
      --allow-fqdn-hostname --cluster-network 10.1.0.0/16

This was successful. But running "ceph orch device ls" does not show any HDD, even though I have /dev/sda (used by the OS), /dev/sdb and /dev/sdc. The web interface shows a raw capacity which is an aggregate of the sizes of the 3 HDDs of the node. I've also tried to reset /dev/sdb, but cephadm does not see it:

    [ceph: root@mostha1 /]# ceph orch device zap mostha1.legi.grenoble-inp.fr /dev/sdb --force
    Error EINVAL: Device path '/dev/sdb' not found on host 'mostha1.legi.grenoble-inp.fr'

On my first attempt with Octopus, I was able to list the available HDDs with this command line. Before moving to Pacific, the OS on this node was reinstalled from scratch. Any advice for a Ceph beginner? Thanks, Patrick

___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
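When "ceph orch device zap" refuses with "Device path not found", the disk can still be cleared with plain host tools before asking orchestrator to rescan. A sketch under the assumption that /dev/sdb really is the disposable test disk (the first two commands are destructive):

    # Remove filesystem, RAID and LVM signatures from the device (destructive).
    wipefs --all /dev/sdb

    # Wipe the partition table, including the GPT backup copy (destructive).
    sgdisk --zap-all /dev/sdb

    # Ask orchestrator to refresh its device inventory and check again.
    ceph orch device ls --refresh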