[ceph-users] Re: cephadm to setup wal/db on nvme
Thank you for the reply. I have created two device classes, SSD and NvME, and
assigned them to the crush map:

$ ceph osd crush rule ls
replicated_rule
ssd_pool
nvme_pool

Running benchmarks, nvme is the worst performing; SSD is showing much better
results compared to NvME. The NvME model is Samsung_SSD_980_PRO_1TB.

NvME pool benchmark with 3x replication:

# rados -p test-nvme -t 64 -b 4096 bench 10 write
hints = 1
Maintaining 64 concurrent writes of 4096 bytes to objects of size 4096 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_os-ctrl1_1931595
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      64      5541      5477   21.3917   21.3945   0.0134898   0.0116529
    2      64     11209     11145   21.7641   22.1406  0.00939951   0.0114506
    3      64     17036     16972   22.0956   22.7617  0.00938263   0.0112938
    4      64     23187     23123   22.5776   24.0273  0.00863939   0.0110473
    5      64     29753     29689   23.1911   25.6484  0.00925603   0.0107662
    6      64     36222     36158   23.5369   25.2695   0.0100759    0.010606
    7      63     42997     42934   23.9551   26.4688  0.00902186   0.0104246
    8      64     49859     49795   24.3102   26.8008  0.00884379   0.0102765
    9      64     56429     56365   24.4601   25.6641  0.00989885   0.0102124
   10      31     62727     62696   24.4869   24.7305   0.0115833   0.0102027
Total time run:         10.0064
Total writes made:      62727
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     24.4871
Stddev Bandwidth:       1.85423
Max bandwidth (MB/sec): 26.8008   <-- Only 26MB/s for nvme disk
Min bandwidth (MB/sec): 21.3945
Average IOPS:           6268
Stddev IOPS:            474.683
Max IOPS:               6861
Min IOPS:               5477
Average Latency(s):     0.0102022
Stddev Latency(s):      0.00170505
Max latency(s):         0.0365743
Min latency(s):         0.00641319
Cleaning up (deleting benchmark objects)
Removed 62727 objects
Clean up completed and total clean up time: 8.23223

### SSD pool benchmark

(venv-openstack) root@os-ctrl1:~# rados -p test-ssd -t 64 -b 4096 bench 10 write
hints = 1
Maintaining 64 concurrent writes of 4096 bytes to objects of size 4096 for up to 10 seconds or 0 objects
Object prefix: benchmark_data_os-ctrl1_1933383
  sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
    0       0         0         0         0         0           -           0
    1      63     43839     43776   170.972       171 0.000991462  0.00145833
    2      64     92198     92134   179.921   188.898  0.00211419    0.001387
    3      64    141917    141853   184.675   194.215  0.00106326  0.00135174
    4      63    193151    193088   188.534   200.137  0.00179379  0.00132423
    5      63    243104    243041   189.847   195.129 0.000831263  0.00131512
    6      63    291045    290982   189.413    187.27  0.00120208  0.00131807
    7      64    341295    341231   190.391   196.285  0.00102127  0.00131137
    8      63    393336    393273   191.999   203.289 0.000958149  0.00130041
    9      63    442459    442396   191.983   191.887  0.00123453  0.00130053
Total time run:         10.0008
Total writes made:      488729
Write size:             4096
Object size:            4096
Bandwidth (MB/sec):     190.894
Stddev Bandwidth:       9.35224
Max bandwidth (MB/sec): 203.289
Min bandwidth (MB/sec): 171
Average IOPS:           48868
Stddev IOPS:            2394.17
Max IOPS:               52042
Min IOPS:               43776
Average Latency(s):     0.00130796
Stddev Latency(s):      0.000604629
Max latency(s):         0.0268462
Min latency(s):         0.000628738
Cleaning up (deleting benchmark objects)
Removed 488729 objects
Clean up completed and total clean up time: 8.84114

On Wed, Aug 23, 2023 at 1:25 PM Adam King wrote:

> this should be possible by specifying "data_devices" and "db_devices"
> fields in the OSD spec file, each with different filters. There are some
> examples in the docs
> https://docs.ceph.com/en/latest/cephadm/services/osd/#the-simple-case that
> show roughly how that's done, and some other sections
> (https://docs.ceph.com/en/latest/cephadm/services/osd/#filters) that go
> more in depth on the different filtering options available, so you can try
> and find one that works for your disks. You can check the output of "ceph
> orch device ls --format json | jq" to see things like what cephadm
> considers the model, size etc. for the devices to be, for use in the
> filtering.
>
> On Wed, Aug 23, 2023 at 1:13 PM Satish Patel wrote:
>
>> Folks,
>>
>> I have 3 nodes, each having 1x NvME (1TB) and 3x 2.9TB SSD. Trying to
>> build ceph storage using cephadm on Ubuntu 22.04 distro.
>>
>> If I want to use NvME for Journaling (WAL/DB) for my SSD based OSDs, then
>> how does cephadm handle it?
>>
>> Trying to find a document where I can tell cephadm to deploy wal/db on
>> nvme so it can speed up write optimization. Do I need to create each
>> partition for the number of OSDs, or will cephadm create them?
>>
>> Help me to understand how it works and whether it is worth doing.
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
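A note for anyone finding this thread in the archive: a minimal OSD spec along
the lines Adam describes could look like the sketch below. The service_id, the
host_pattern and the size-based filters are assumptions made up for the
hardware in this thread (3x ~2.9TB data SSDs and one 1TB NvME per host); check
"ceph orch device ls" for the sizes and models cephadm actually reports before
using anything like this.

# hypothetical spec: SSDs (>=2TB) become data devices, the 1TB NvME holds the WAL/DB
cat > osd_spec.yml << 'EOF'
service_type: osd
service_id: ssd_data_nvme_db
placement:
  host_pattern: '*'
spec:
  data_devices:
    size: '2TB:'      # 2TB and larger -> the 2.9TB SSDs
  db_devices:
    size: ':2TB'      # 2TB and smaller -> the 1TB NvME
EOF

# preview what cephadm would create, then apply the spec
ceph orch apply -i osd_spec.yml --dry-run
ceph orch apply -i osd_spec.yml

With a spec like this, cephadm/ceph-volume should carve one DB/WAL logical
volume per OSD out of the NvME on its own, so there is no need to pre-create
partitions by hand.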
[ceph-users] cephadm to setup wal/db on nvme
Folks,

I have 3 nodes, each having 1x NvME (1TB) and 3x 2.9TB SSD. Trying to build
ceph storage using cephadm on Ubuntu 22.04 distro.

If I want to use NvME for Journaling (WAL/DB) for my SSD based OSDs, then how
does cephadm handle it?

Trying to find a document where I can tell cephadm to deploy wal/db on nvme so
it can speed up write optimization. Do I need to create each partition for the
number of OSDs, or will cephadm create them?

Help me to understand how it works and whether it is worth doing.
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] cephadm docker registry
Folks,

I am trying to install ceph on a 10 node cluster and planning to use cephadm.
My question: if I add new nodes to this cluster next year, what docker image
version will cephadm use when adding the new nodes?

Is there a local registry I can create to copy the images locally? How does
cephadm control which images are used?
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
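As far as I understand cephadm's behaviour, it records the container image the
cluster was deployed (or last upgraded) with, so hosts added later are deployed
with that same image rather than whatever is newest upstream; the image only
changes when you run "ceph orch upgrade start --image ...". For the
local-registry part of the question, a rough sketch is below; the registry host
registry.example.local:5000 and the version tag are made-up placeholders, not
anything cephadm requires.

# mirror the ceph image into a local registry
docker pull quay.io/ceph/ceph:v17.2.5
docker tag quay.io/ceph/ceph:v17.2.5 registry.example.local:5000/ceph/ceph:v17.2.5
docker push registry.example.local:5000/ceph/ceph:v17.2.5

# point a new cluster at the local registry at bootstrap time
cephadm bootstrap --mon-ip 10.0.0.11 \
  --image registry.example.local:5000/ceph/ceph:v17.2.5 \
  --registry-url registry.example.local:5000 \
  --registry-username myuser --registry-password mypass

# or move an existing cluster onto the mirrored image later
ceph orch upgrade start --image registry.example.local:5000/ceph/ceph:v17.2.5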
[ceph-users] [Quincy] Module 'devicehealth' has failed: disk I/O error
Folks,

Any idea what is going on? I am running a 3 node quincy cluster for openstack,
and today I suddenly noticed the following error. I found a reference link but
am not sure whether that is my issue or not: https://tracker.ceph.com/issues/51974

root@ceph1:~# ceph -s
  cluster:
    id:     cd748128-a3ea-11ed-9e46-c309158fad32
    health: HEALTH_ERR
            1 mgr modules have recently crashed

  services:
    mon: 3 daemons, quorum ceph1,ceph2,ceph3 (age 2d)
    mgr: ceph1.ckfkeb(active, since 6h), standbys: ceph2.aaptny
    osd: 9 osds: 9 up (since 2d), 9 in (since 2d)

  data:
    pools:   4 pools, 128 pgs
    objects: 1.18k objects, 4.7 GiB
    usage:   17 GiB used, 16 TiB / 16 TiB avail
    pgs:     128 active+clean

root@ceph1:~# ceph health
HEALTH_ERR Module 'devicehealth' has failed: disk I/O error; 1 mgr modules have recently crashed

root@ceph1:~# ceph crash ls
ID                                                                ENTITY            NEW
2023-02-07T00:07:12.739187Z_fcb9cbc9-bb55-4e7c-bf00-945b96469035  mgr.ceph1.ckfkeb   *

root@ceph1:~# ceph crash info 2023-02-07T00:07:12.739187Z_fcb9cbc9-bb55-4e7c-bf00-945b96469035
{
    "backtrace": [
        "  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 373, in serve\n    self.scrape_all()",
        "  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 425, in scrape_all\n    self.put_device_metrics(device, data)",
        "  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 500, in put_device_metrics\n    self._create_device(devid)",
        "  File \"/usr/share/ceph/mgr/devicehealth/module.py\", line 487, in _create_device\n    cursor = self.db.execute(SQL, (devid,))",
        "sqlite3.OperationalError: disk I/O error"
    ],
    "ceph_version": "17.2.5",
    "crash_id": "2023-02-07T00:07:12.739187Z_fcb9cbc9-bb55-4e7c-bf00-945b96469035",
    "entity_name": "mgr.ceph1.ckfkeb",
    "mgr_module": "devicehealth",
    "mgr_module_caller": "PyModuleRunner::serve",
    "mgr_python_exception": "OperationalError",
    "os_id": "centos",
    "os_name": "CentOS Stream",
    "os_version": "8",
    "os_version_id": "8",
    "process_name": "ceph-mgr",
    "stack_sig": "7e506cc2729d5a18403f0373447bb825b42aafa2405fb0e5cfffc2896b093ed8",
    "timestamp": "2023-02-07T00:07:12.739187Z",
    "utsname_hostname": "ceph1",
    "utsname_machine": "x86_64",
    "utsname_release": "5.15.0-58-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023"
}
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
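No fix is shown in this thread, but for completeness: the "1 mgr modules have
recently crashed" part of HEALTH_ERR only clears once the crash reports are
acknowledged. Archiving them does not repair the devicehealth sqlite error
itself, it just acknowledges the crash; the commands below are a sketch using
the crash ID from the output above.

# review and acknowledge the recorded crashes
ceph crash ls
ceph crash info 2023-02-07T00:07:12.739187Z_fcb9cbc9-bb55-4e7c-bf00-945b96469035
ceph crash archive 2023-02-07T00:07:12.739187Z_fcb9cbc9-bb55-4e7c-bf00-945b96469035
# or acknowledge everything at once
ceph crash archive-all

# optional while investigating: stop the devicehealth scraping that keeps
# hitting the failing sqlite database (this only disables monitoring)
ceph device monitoring off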
[ceph-users] Re: [cephadm] Found duplicate OSDs
Hi Eugen,

My error cleared up by itself. Looks like it took some time, but now I am not
seeing any errors and the output is very clean. Thank you so much.

On Fri, Oct 21, 2022 at 1:46 PM Eugen Block wrote:

> Do you still see it with ‚cephadm ls' on that node? If yes you could
> try ‚cephadm rm-daemon —name osd.3'. Or you try it with the
> orchestrator: ceph orch daemon rm…
> I don't have the exact command at the moment, you should check the docs.
>
> Quoting Satish Patel:
>
> > Hi Eugen,
> >
> > I have deleted the osd.3 directory from the datastorn4 node as you
> > mentioned, but I am still seeing that duplicate osd in the ps output.
> >
> > root@datastorn1:~# ceph orch ps | grep osd.3
> > osd.3  datastorn4  stopped        5m ago  3w  -      42.6G
> > osd.3  datastorn5  running (3w)   5m ago  3w  2587M  42.6G  17.2.3  0912465dcea5  d139f8a1234b
> >
> > How do I clean up permanently?
> >
> > On Fri, Oct 21, 2022 at 6:24 AM Eugen Block wrote:
> >
> >> Hi,
> >>
> >> it looks like the OSDs haven't been cleaned up after removing them. Do
> >> you see the osd directory in /var/lib/ceph//osd.3 on datastorn4?
> >> Just remove the osd.3 directory, then cephadm won't try to activate it.
> >>
> >> Quoting Satish Patel:
> >>
> >> > Folks,
> >> >
> >> > I have deployed a 15 OSD node cluster using cephadm and encountered a
> >> > duplicate OSD on one of the nodes, and I am not sure how to clean that up.
> >> >
> >> > root@datastorn1:~# ceph health
> >> > HEALTH_WARN 1 failed cephadm daemon(s); 1 pool(s) have no replicas configured
> >> >
> >> > osd.3 is duplicated on two nodes; I would like to remove it from
> >> > datastorn4 but I'm not sure how to remove it. In the ceph osd tree I
> >> > am not seeing any duplicate.
> >> >
> >> > root@datastorn1:~# ceph orch ps | grep osd.3
> >> > osd.3  datastorn4  stopped        7m ago  3w  -      42.6G
> >> > osd.3  datastorn5  running (3w)   7m ago  3w  2584M  42.6G  17.2.3  0912465dcea5  d139f8a1234b
> >> >
> >> > Getting the following errors in the logs:
> >> >
> >> > 2022-10-21T09:10:45.226872+0000 mgr.datastorn1.nciiiu (mgr.14188) 1098186 :
> >> > cephadm [INF] Found duplicate OSDs: osd.3 in status stopped on datastorn4,
> >> > osd.3 in status running on datastorn5
> >> > 2022-10-21T09:11:46.254979+0000 mgr.datastorn1.nciiiu (mgr.14188) 1098221 :
> >> > cephadm [INF] Found duplicate OSDs: osd.3 in status stopped on datastorn4,
> >> > osd.3 in status running on datastorn5
> >> > 2022-10-21T09:12:53.009252+0000 mgr.datastorn1.nciiiu (mgr.14188) 1098256 :
> >> > cephadm [INF] Found duplicate OSDs: osd.3 in status stopped on datastorn4,
> >> > osd.3 in status running on datastorn5
> >> > 2022-10-21T09:13:59.283251+0000 mgr.datastorn1.nciiiu (mgr.14188) 1098293 :
> >> > cephadm [INF] Found duplicate OSDs: osd.3 in status stopped on datastorn4,
> >> > osd.3 in status running on datastorn5
> >> > ___
> >> > ceph-users mailing list -- ceph-users@ceph.io
> >> > To unsubscribe send an email to ceph-users-le...@ceph.io
> >>
> >> ___
> >> ceph-users mailing list -- ceph-users@ceph.io
> >> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
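In this case the stale entry aged out on its own, but if it had not, the manual
path Eugen describes looks roughly like the sketch below; <cluster-fsid> is a
placeholder for the fsid shown by "cephadm ls" or under /var/lib/ceph on that
host.

# on the host that still reports the stale daemon (datastorn4 here)
cephadm ls | grep osd.3
cephadm rm-daemon --name osd.3 --fsid <cluster-fsid>

# then force cephadm to refresh its inventory
ceph orch ps --refresh

The orchestrator-side variant ("ceph orch daemon rm osd.3 --force") also
exists, but with a duplicated daemon name it is hard to target the right host,
so running cephadm rm-daemon directly on the stale node seems the safer route.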
[ceph-users] Re: [cephadm] Found duplicate OSDs
Hi Eugen,

I have deleted the osd.3 directory from the datastorn4 node as you mentioned,
but I am still seeing that duplicate osd in the ps output.

root@datastorn1:~# ceph orch ps | grep osd.3
osd.3  datastorn4  stopped        5m ago  3w  -      42.6G
osd.3  datastorn5  running (3w)   5m ago  3w  2587M  42.6G  17.2.3  0912465dcea5  d139f8a1234b

How do I clean up permanently?

On Fri, Oct 21, 2022 at 6:24 AM Eugen Block wrote:

> Hi,
>
> it looks like the OSDs haven't been cleaned up after removing them. Do
> you see the osd directory in /var/lib/ceph//osd.3 on datastorn4?
> Just remove the osd.3 directory, then cephadm won't try to activate it.
>
> Quoting Satish Patel:
>
> > Folks,
> >
> > I have deployed a 15 OSD node cluster using cephadm and encountered a
> > duplicate OSD on one of the nodes, and I am not sure how to clean that up.
> >
> > root@datastorn1:~# ceph health
> > HEALTH_WARN 1 failed cephadm daemon(s); 1 pool(s) have no replicas configured
> >
> > osd.3 is duplicated on two nodes; I would like to remove it from
> > datastorn4 but I'm not sure how to remove it. In the ceph osd tree I am
> > not seeing any duplicate.
> >
> > root@datastorn1:~# ceph orch ps | grep osd.3
> > osd.3  datastorn4  stopped        7m ago  3w  -      42.6G
> > osd.3  datastorn5  running (3w)   7m ago  3w  2584M  42.6G  17.2.3  0912465dcea5  d139f8a1234b
> >
> > Getting the following errors in the logs:
> >
> > 2022-10-21T09:10:45.226872+0000 mgr.datastorn1.nciiiu (mgr.14188) 1098186 :
> > cephadm [INF] Found duplicate OSDs: osd.3 in status stopped on datastorn4,
> > osd.3 in status running on datastorn5
> > 2022-10-21T09:11:46.254979+0000 mgr.datastorn1.nciiiu (mgr.14188) 1098221 :
> > cephadm [INF] Found duplicate OSDs: osd.3 in status stopped on datastorn4,
> > osd.3 in status running on datastorn5
> > 2022-10-21T09:12:53.009252+0000 mgr.datastorn1.nciiiu (mgr.14188) 1098256 :
> > cephadm [INF] Found duplicate OSDs: osd.3 in status stopped on datastorn4,
> > osd.3 in status running on datastorn5
> > 2022-10-21T09:13:59.283251+0000 mgr.datastorn1.nciiiu (mgr.14188) 1098293 :
> > cephadm [INF] Found duplicate OSDs: osd.3 in status stopped on datastorn4,
> > osd.3 in status running on datastorn5
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] [cephadm] Found duplicate OSDs
Folks,

I have deployed a 15 OSD node cluster using cephadm and encountered a duplicate
OSD on one of the nodes, and I am not sure how to clean that up.

root@datastorn1:~# ceph health
HEALTH_WARN 1 failed cephadm daemon(s); 1 pool(s) have no replicas configured

osd.3 is duplicated on two nodes; I would like to remove it from datastorn4 but
I'm not sure how to remove it. In the ceph osd tree I am not seeing any
duplicate.

root@datastorn1:~# ceph orch ps | grep osd.3
osd.3  datastorn4  stopped        7m ago  3w  -      42.6G
osd.3  datastorn5  running (3w)   7m ago  3w  2584M  42.6G  17.2.3  0912465dcea5  d139f8a1234b

Getting the following errors in the logs:

2022-10-21T09:10:45.226872+0000 mgr.datastorn1.nciiiu (mgr.14188) 1098186 : cephadm [INF] Found duplicate OSDs: osd.3 in status stopped on datastorn4, osd.3 in status running on datastorn5
2022-10-21T09:11:46.254979+0000 mgr.datastorn1.nciiiu (mgr.14188) 1098221 : cephadm [INF] Found duplicate OSDs: osd.3 in status stopped on datastorn4, osd.3 in status running on datastorn5
2022-10-21T09:12:53.009252+0000 mgr.datastorn1.nciiiu (mgr.14188) 1098256 : cephadm [INF] Found duplicate OSDs: osd.3 in status stopped on datastorn4, osd.3 in status running on datastorn5
2022-10-21T09:13:59.283251+0000 mgr.datastorn1.nciiiu (mgr.14188) 1098293 : cephadm [INF] Found duplicate OSDs: osd.3 in status stopped on datastorn4, osd.3 in status running on datastorn5
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: strange osd error during add disk
Hi Dominique,

How do I check using cephadm shell? I am new to cephadm :)

https://paste.opendev.org/show/b4egkEdAkCWSkT3VRyO9/

On Fri, Sep 30, 2022 at 6:20 AM Dominique Ramaekers <
dominique.ramaek...@cometal.be> wrote:

> Ceph.conf isn't available on that node/container.
>
> What happens if you try to start a cephadm shell on that node?
>
> > -----Original message-----
> > From: Satish Patel
> > Sent: Thursday, 29 September 2022 21:45
> > To: ceph-users
> > Subject: [ceph-users] Re: strange osd error during add disk
> >
> > Bump! Any suggestions?
> >
> > On Wed, Sep 28, 2022 at 4:26 PM Satish Patel wrote:
> >
> > > Folks,
> > >
> > > I have 15 nodes for ceph and each node has a 160TB disk attached. I am
> > > using the cephadm quincy release and all 14 nodes have been added except
> > > one node which is giving a very strange error while adding it. I have
> > > put all the logs here: https://paste.opendev.org/show/bbSKwlSLyANMbrlhwzXL/
> > >
> > > In short, I am getting the following error logs. I have tried zapping the
> > > disk and re-adding, but I get the following error every single time.
> > >
> > > [2022-09-28 20:13:28,644][ceph_volume.main][INFO ] Running command: ceph-volume lvm list --format json
> > > [2022-09-28 20:13:28,644][ceph_volume.main][ERROR ] ignoring inability to load ceph.conf
> > > Traceback (most recent call last):
> > >   File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 145, in main
> > >     conf.ceph = configuration.load(conf.path)
> > >   File "/usr/lib/python3.6/site-packages/ceph_volume/configuration.py", line 51, in load
> > >     raise exceptions.ConfigurationError(abspath=abspath)
> > >
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email
> > to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
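For reference, "checking with cephadm shell" just means starting the
containerized ceph CLI on that node; a quick sketch:

# on the affected node, as root
cephadm shell        # opens a shell inside a ceph container with the cluster config mounted
ceph -s              # any ceph command can be run from inside it
ls /etc/ceph         # shows which ceph.conf/keyring the container actually sees
exit

# or run a single command without an interactive shell
cephadm shell -- ceph orch device ls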
[ceph-users] Re: strange osd error during add disk
Hi Alvaro,

I have seen this error on every node, even functional and working nodes, so I
am assuming it's not important:

"ceph_volume.exceptions.ConfigurationError: Unable to load expected Ceph config at: /etc/ceph/ceph.conf"

Maybe cephadm runs inside docker and that is why it's just giving this warning.

On Thu, Sep 29, 2022 at 4:29 PM Alvaro Soto wrote:

> Where is your ceph.conf file?
>
> ceph_volume.exceptions.ConfigurationError: Unable to load expected Ceph
> config at: /etc/ceph/ceph.conf
>
> ---
> Alvaro Soto.
>
> Note: My work hours may not be your work hours. Please do not feel the
> need to respond during a time that is not convenient for you.
> --
> Great people talk about ideas,
> ordinary people talk about things,
> small people talk... about other people.
>
> On Thu, Sep 29, 2022, 2:45 PM Satish Patel wrote:
>
>> Bump! Any suggestions?
>>
>> On Wed, Sep 28, 2022 at 4:26 PM Satish Patel wrote:
>>
>> > Folks,
>> >
>> > I have 15 nodes for ceph and each node has a 160TB disk attached. I am
>> > using the cephadm quincy release and all 14 nodes have been added except
>> > one node which is giving a very strange error while adding it. I have put
>> > all the logs here: https://paste.opendev.org/show/bbSKwlSLyANMbrlhwzXL/
>> >
>> > In short, I am getting the following error logs. I have tried zapping the
>> > disk and re-adding, but I get the following error every single time.
>> >
>> > [2022-09-28 20:13:28,644][ceph_volume.main][INFO ] Running command: ceph-volume lvm list --format json
>> > [2022-09-28 20:13:28,644][ceph_volume.main][ERROR ] ignoring inability to load ceph.conf
>> > Traceback (most recent call last):
>> >   File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 145, in main
>> >     conf.ceph = configuration.load(conf.path)
>> >   File "/usr/lib/python3.6/site-packages/ceph_volume/configuration.py", line 51, in load
>> >     raise exceptions.ConfigurationError(abspath=abspath)
>> >
>> ___
>> ceph-users mailing list -- ceph-users@ceph.io
>> To unsubscribe send an email to ceph-users-le...@ceph.io
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: strange osd error during add disk
Bump! Any suggestions?

On Wed, Sep 28, 2022 at 4:26 PM Satish Patel wrote:

> Folks,
>
> I have 15 nodes for ceph and each node has a 160TB disk attached. I am
> using cephadm quincy release and all 14 nodes have been added except one
> node which is giving a very strange error during adding it. I have put all
> logs here https://paste.opendev.org/show/bbSKwlSLyANMbrlhwzXL/
>
> In short, the following error logs I am getting. I have tried zap to disk
> and re-add but getting the following error every single time.
>
> [2022-09-28 20:13:28,644][ceph_volume.main][INFO ] Running command: ceph-volume lvm list --format json
> [2022-09-28 20:13:28,644][ceph_volume.main][ERROR ] ignoring inability to load ceph.conf
> Traceback (most recent call last):
>   File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 145, in main
>     conf.ceph = configuration.load(conf.path)
>   File "/usr/lib/python3.6/site-packages/ceph_volume/configuration.py", line 51, in load
>     raise exceptions.ConfigurationError(abspath=abspath)
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] strange osd error during add disk
Folks,

I have 15 nodes for ceph and each node has a 160TB disk attached. I am using
the cephadm quincy release and all 14 nodes have been added except one node,
which is giving a very strange error while adding it. I have put all the logs
here: https://paste.opendev.org/show/bbSKwlSLyANMbrlhwzXL/

In short, I am getting the following error logs. I have tried zapping the disk
and re-adding, but I get the following error every single time.

[2022-09-28 20:13:28,644][ceph_volume.main][INFO ] Running command: ceph-volume lvm list --format json
[2022-09-28 20:13:28,644][ceph_volume.main][ERROR ] ignoring inability to load ceph.conf
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ceph_volume/main.py", line 145, in main
    conf.ceph = configuration.load(conf.path)
  File "/usr/lib/python3.6/site-packages/ceph_volume/configuration.py", line 51, in load
    raise exceptions.ConfigurationError(abspath=abspath)
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: [cephadm] not detecting new disk
I did use sgdisk zap to disk and wipe out everything. But still not detecting. Is there any other good way to wipeout ? Sent from my iPhone > On Sep 3, 2022, at 2:53 AM, Eugen Block wrote: > > It is detecting the disk, but it contains a partition table so it can’t use > it. Wipe the disk properly first. > > Zitat von Satish Patel : > >> Folks, >> >> I have created a new lab using cephadm and installed a new 1TB spinning >> disk which is trying to add in a cluster but somehow ceph is not detecting >> it. >> >> $ parted /dev/sda print >> Model: ATA WDC WD10EZEX-00B (scsi) >> Disk /dev/sda: 1000GB >> Sector size (logical/physical): 512B/4096B >> Partition Table: gpt >> Disk Flags: >> >> Number Start End Size File system Name Flags >> >> Trying following but no luck >> >> $ cephadm shell -- ceph orch daemon add osd os-ctrl-1:/dev/sda >> Inferring fsid 351f8a26-2b31-11ed-b555-494149d85a01 >> Using recent ceph image >> quay.io/ceph/ceph@sha256:c5fd9d806c54e5cc9db8efd50363e1edf7af62f101b264dccacb9d6091dcf7aa >> Error EINVAL: Traceback (most recent call last): >> File "/usr/share/ceph/mgr/mgr_module.py", line 1446, in _handle_command >>return self.handle_command(inbuf, cmd) >> File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 171, in >> handle_command >>return dispatch[cmd['prefix']].call(self, cmd, inbuf) >> File "/usr/share/ceph/mgr/mgr_module.py", line 414, in call >>return self.func(mgr, **kwargs) >> File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 107, in >> >>wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs) >> # noqa: E731 >> File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 96, in wrapper >>return func(*args, **kwargs) >> File "/usr/share/ceph/mgr/orchestrator/module.py", line 803, in >> _daemon_add_osd >>raise_if_exception(completion) >> File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 228, in >> raise_if_exception >>raise e >> RuntimeError: cephadm exited with an error code: 1, stderr:Inferring config >> /var/lib/ceph/351f8a26-2b31-11ed-b555-494149d85a01/mon.os-ctrl-1/config >> Non-zero exit code 2 from /usr/bin/docker run --rm --ipc=host >> --stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume >> --privileged --group-add=disk --init -e CONTAINER_IMAGE= >> quay.io/ceph/ceph@sha256:c5fd9d806c54e5cc9db8efd50363e1edf7af62f101b264dccacb9d6091dcf7aa >> -e NODE_NAME=os-ctrl-1 -e CEPH_USE_RANDOM_NONCE=1 -e >> CEPH_VOLUME_OSDSPEC_AFFINITY=None -e CEPH_VOLUME_SKIP_RESTORECON=yes -e >> CEPH_VOLUME_DEBUG=1 -v >> /var/run/ceph/351f8a26-2b31-11ed-b555-494149d85a01:/var/run/ceph:z -v >> /var/log/ceph/351f8a26-2b31-11ed-b555-494149d85a01:/var/log/ceph:z -v >> /var/lib/ceph/351f8a26-2b31-11ed-b555-494149d85a01/crash:/var/lib/ceph/crash:z >> -v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v >> /run/lock/lvm:/run/lock/lvm -v /:/rootfs -v >> /tmp/ceph-tmpznn3t_7i:/etc/ceph/ceph.conf:z -v >> /tmp/ceph-tmpun8t5_ej:/var/lib/ceph/bootstrap-osd/ceph.keyring:z >> quay.io/ceph/ceph@sha256:c5fd9d806c54e5cc9db8efd50363e1edf7af62f101b264dccacb9d6091dcf7aa >> lvm batch --no-auto /dev/sda --yes --no-systemd >> /usr/bin/docker: stderr usage: ceph-volume lvm batch [-h] [--db-devices >> [DB_DEVICES [DB_DEVICES ...]]] >> /usr/bin/docker: stderr [--wal-devices >> [WAL_DEVICES [WAL_DEVICES ...]]] >> /usr/bin/docker: stderr [--journal-devices >> [JOURNAL_DEVICES [JOURNAL_DEVICES ...]]] >> /usr/bin/docker: stderr [--auto] [--no-auto] >> [--bluestore] [--filestore] >> /usr/bin/docker: stderr [--report] [--yes] >> 
/usr/bin/docker: stderr [--format >> {json,json-pretty,pretty}] [--dmcrypt] >> /usr/bin/docker: stderr [--crush-device-class >> CRUSH_DEVICE_CLASS] >> /usr/bin/docker: stderr [--no-systemd] >> /usr/bin/docker: stderr [--osds-per-device >> OSDS_PER_DEVICE] >> /usr/bin/docker: stderr [--data-slots >> DATA_SLOTS] >> /usr/bin/docker: stderr [--block-db-size >> BLOCK_DB_SIZE] >> /usr/bin/docker: stderr [--block-db-slots >> BLOCK_DB_SLOTS] >> /usr/bin/docker: stderr
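To the "is there any other good way to wipe out" question above: sgdisk alone
sometimes leaves enough metadata behind (backup GPT, LVM or filesystem
signatures) for ceph-volume to keep reporting "GPT headers found". A more
thorough wipe, using the host and device names from this thread, might look
like the sketch below; all of these commands are destructive to /dev/sda, so
double-check the device name first.

# let ceph do it: zap the device through the orchestrator
ceph orch device zap os-ctrl-1 /dev/sda --force

# or wipe it manually on the node itself
sgdisk --zap-all /dev/sda        # clears primary and backup GPT structures
wipefs --all /dev/sda            # clears filesystem/LVM signatures
dd if=/dev/zero of=/dev/sda bs=1M count=100 oflag=direct   # clobber the first 100MB for good measure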
[ceph-users] [cephadm] not detecting new disk
Folks,

I have created a new lab using cephadm and installed a new 1TB spinning disk,
which I am trying to add to the cluster, but somehow ceph is not detecting it.

$ parted /dev/sda print
Model: ATA WDC WD10EZEX-00B (scsi)
Disk /dev/sda: 1000GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Disk Flags:

Number  Start  End  Size  File system  Name  Flags

Trying the following, but no luck:

$ cephadm shell -- ceph orch daemon add osd os-ctrl-1:/dev/sda
Inferring fsid 351f8a26-2b31-11ed-b555-494149d85a01
Using recent ceph image quay.io/ceph/ceph@sha256:c5fd9d806c54e5cc9db8efd50363e1edf7af62f101b264dccacb9d6091dcf7aa
Error EINVAL: Traceback (most recent call last):
  File "/usr/share/ceph/mgr/mgr_module.py", line 1446, in _handle_command
    return self.handle_command(inbuf, cmd)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 171, in handle_command
    return dispatch[cmd['prefix']].call(self, cmd, inbuf)
  File "/usr/share/ceph/mgr/mgr_module.py", line 414, in call
    return self.func(mgr, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 107, in <lambda>
    wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs)  # noqa: E731
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 96, in wrapper
    return func(*args, **kwargs)
  File "/usr/share/ceph/mgr/orchestrator/module.py", line 803, in _daemon_add_osd
    raise_if_exception(completion)
  File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 228, in raise_if_exception
    raise e
RuntimeError: cephadm exited with an error code: 1, stderr:Inferring config /var/lib/ceph/351f8a26-2b31-11ed-b555-494149d85a01/mon.os-ctrl-1/config
Non-zero exit code 2 from /usr/bin/docker run --rm --ipc=host --stop-signal=SIGTERM --net=host --entrypoint /usr/sbin/ceph-volume --privileged --group-add=disk --init -e CONTAINER_IMAGE=quay.io/ceph/ceph@sha256:c5fd9d806c54e5cc9db8efd50363e1edf7af62f101b264dccacb9d6091dcf7aa -e NODE_NAME=os-ctrl-1 -e CEPH_USE_RANDOM_NONCE=1 -e CEPH_VOLUME_OSDSPEC_AFFINITY=None -e CEPH_VOLUME_SKIP_RESTORECON=yes -e CEPH_VOLUME_DEBUG=1 -v /var/run/ceph/351f8a26-2b31-11ed-b555-494149d85a01:/var/run/ceph:z -v /var/log/ceph/351f8a26-2b31-11ed-b555-494149d85a01:/var/log/ceph:z -v /var/lib/ceph/351f8a26-2b31-11ed-b555-494149d85a01/crash:/var/lib/ceph/crash:z -v /dev:/dev -v /run/udev:/run/udev -v /sys:/sys -v /run/lvm:/run/lvm -v /run/lock/lvm:/run/lock/lvm -v /:/rootfs -v /tmp/ceph-tmpznn3t_7i:/etc/ceph/ceph.conf:z -v /tmp/ceph-tmpun8t5_ej:/var/lib/ceph/bootstrap-osd/ceph.keyring:z quay.io/ceph/ceph@sha256:c5fd9d806c54e5cc9db8efd50363e1edf7af62f101b264dccacb9d6091dcf7aa lvm batch --no-auto /dev/sda --yes --no-systemd
/usr/bin/docker: stderr usage: ceph-volume lvm batch [-h] [--db-devices [DB_DEVICES [DB_DEVICES ...]]]
/usr/bin/docker: stderr                              [--wal-devices [WAL_DEVICES [WAL_DEVICES ...]]]
/usr/bin/docker: stderr                              [--journal-devices [JOURNAL_DEVICES [JOURNAL_DEVICES ...]]]
/usr/bin/docker: stderr                              [--auto] [--no-auto] [--bluestore] [--filestore]
/usr/bin/docker: stderr                              [--report] [--yes]
/usr/bin/docker: stderr                              [--format {json,json-pretty,pretty}] [--dmcrypt]
/usr/bin/docker: stderr                              [--crush-device-class CRUSH_DEVICE_CLASS]
/usr/bin/docker: stderr                              [--no-systemd]
/usr/bin/docker: stderr                              [--osds-per-device OSDS_PER_DEVICE]
/usr/bin/docker: stderr                              [--data-slots DATA_SLOTS]
/usr/bin/docker: stderr                              [--block-db-size BLOCK_DB_SIZE]
/usr/bin/docker: stderr                              [--block-db-slots BLOCK_DB_SLOTS]
/usr/bin/docker: stderr                              [--block-wal-size BLOCK_WAL_SIZE]
/usr/bin/docker: stderr                              [--block-wal-slots BLOCK_WAL_SLOTS]
/usr/bin/docker: stderr                              [--journal-size JOURNAL_SIZE]
/usr/bin/docker: stderr                              [--journal-slots JOURNAL_SLOTS] [--prepare]
/usr/bin/docker: stderr                              [--osd-ids [OSD_IDS [OSD_IDS ...]]]
/usr/bin/docker: stderr                              [DEVICES [DEVICES ...]]
/usr/bin/docker: stderr ceph-volume lvm batch: error: GPT headers found, they must be removed on: /dev/sda
Traceback (most recent call last):
  File "/var/lib/ceph/351f8a26-2b31-11ed-b555-494149d85a01/cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d", line 8971, in <module>
    main()
  File "/var/lib/ceph/351f8a26-2b31-11ed-b555-494149d85a01/cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d", line 8959, in main
    r = ctx.func(ctx)
  File "/var/lib/ceph/351f8a26-2b31-11ed-b555-494149d85a01/cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d", line 1902, in _infer_config
    return func(ctx)
[ceph-users] Re: [cephadm] mgr: no daemons active
Adam, In google someone suggested a manual upgrade using the following method and it seems to work but I am stuck in MON redeploy.. haha Go to mgr container and edit /var/lib/ceph/$fsid/mgr.$whatever/unit.run file and change ceph/ceph:v16.2.10 on both mgr and restart mgr service using systemctl restart After a few minutes I noticed the docker downloaded image and I can see both mgr running with the 16.2.10 version. Now i have tried to do an upgrade and nothing happened so I used the same manual method with MON node and did use command ceph orch daemon redeploy mon.ceph1 which destroyed mon service and now i can't do anything because i don't have mon. ceph -s and all other command hangs Try to find out how to get back mon :) On Fri, Sep 2, 2022 at 3:34 PM Satish Patel wrote: > Yes, i have stopped upgrade and those log before upgrade > > On Fri, Sep 2, 2022 at 3:27 PM Adam King wrote: > >> I don't think the number of mons should have any effect on this. Looking >> at your logs, the interesting thing is that all the messages are so close >> together. Was this before having stopped the upgrade? >> >> On Fri, Sep 2, 2022 at 2:53 PM Satish Patel wrote: >> >>> Do you think this is because I have only a single MON daemon running? I >>> have only two nodes. >>> >>> On Fri, Sep 2, 2022 at 2:39 PM Satish Patel >>> wrote: >>> >>>> Adam, >>>> >>>> I have enabled debug and my logs flood with the following. I am going >>>> to try some stuff from your provided mailing list and see.. >>>> >>>> root@ceph1:~# tail -f >>>> /var/log/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/ceph.cephadm.log >>>> 2022-09-02T18:38:21.754391+ mgr.ceph2.huidoh (mgr.344392) 211198 : >>>> cephadm [DBG] 0 OSDs are scheduled for removal: [] >>>> 2022-09-02T18:38:21.754519+ mgr.ceph2.huidoh (mgr.344392) 211199 : >>>> cephadm [DBG] Saving [] to store >>>> 2022-09-02T18:38:21.757155+ mgr.ceph2.huidoh (mgr.344392) 211200 : >>>> cephadm [DBG] refreshing hosts and daemons >>>> 2022-09-02T18:38:21.758065+ mgr.ceph2.huidoh (mgr.344392) 211201 : >>>> cephadm [DBG] _check_for_strays >>>> 2022-09-02T18:38:21.758334+ mgr.ceph2.huidoh (mgr.344392) 211202 : >>>> cephadm [DBG] 0 OSDs are scheduled for removal: [] >>>> 2022-09-02T18:38:21.758455+ mgr.ceph2.huidoh (mgr.344392) 211203 : >>>> cephadm [DBG] Saving [] to store >>>> 2022-09-02T18:38:21.761001+ mgr.ceph2.huidoh (mgr.344392) 211204 : >>>> cephadm [DBG] refreshing hosts and daemons >>>> 2022-09-02T18:38:21.762092+ mgr.ceph2.huidoh (mgr.344392) 211205 : >>>> cephadm [DBG] _check_for_strays >>>> 2022-09-02T18:38:21.762357+ mgr.ceph2.huidoh (mgr.344392) 211206 : >>>> cephadm [DBG] 0 OSDs are scheduled for removal: [] >>>> 2022-09-02T18:38:21.762480+ mgr.ceph2.huidoh (mgr.344392) 211207 : >>>> cephadm [DBG] Saving [] to store >>>> >>>> On Fri, Sep 2, 2022 at 12:17 PM Adam King wrote: >>>> >>>>> hmm, okay. It seems like cephadm is stuck in general rather than an >>>>> issue specific to the upgrade. I'd first make sure the orchestrator isn't >>>>> paused (just running "ceph orch resume" should be enough, it's >>>>> idempotent). >>>>> >>>>> Beyond that, there was someone else who had an issue with things >>>>> getting stuck that was resolved in this thread >>>>> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M/#NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M >>>>> <https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M/#NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M> >>>>> that >>>>> might be worth a look. 
>>>>> >>>>> If you haven't already, it's possible stopping the upgrade is a good >>>>> idea, as maybe that's interfering with it getting to the point where it >>>>> does the redeploy. >>>>> >>>>> If none of those help, it might be worth setting the log level to >>>>> debug and seeing where things are ending up ("ceph config set mgr >>>>> mgr/cephadm/log_to_cluster_level debug; ceph orch ps --refresh" then >>>>> waiting a few minutes before running "ceph log last 100 debug cephadm" >>>>> (not >>>>> 100% on format of that command, if it fails try just "ceph log last >&g
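For anyone tempted by the unit.run workaround described at the top of this
message, it boils down to the sketch below. This is only a restatement of what
the poster did, not a recommended procedure; the fsid and daemon name are the
ones from this thread, and the exact image string inside unit.run may be a
sha256 digest rather than a :v15 tag, so check the file before editing it.

FSID=f270ad9e-1f6f-11ed-b6f8-a539d87379ea
DAEMON=mgr.ceph1.smfvfd

# inspect, then swap the image reference in the daemon's unit.run
grep 'ceph/ceph' /var/lib/ceph/$FSID/$DAEMON/unit.run
sed -i 's|quay.io/ceph/ceph:v15|quay.io/ceph/ceph:v16.2.10|' /var/lib/ceph/$FSID/$DAEMON/unit.run

# restart the daemon through its cephadm-managed systemd unit
systemctl restart ceph-$FSID@$DAEMON.service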
[ceph-users] Re: [cephadm] mgr: no daemons active
Yes, i have stopped upgrade and those log before upgrade On Fri, Sep 2, 2022 at 3:27 PM Adam King wrote: > I don't think the number of mons should have any effect on this. Looking > at your logs, the interesting thing is that all the messages are so close > together. Was this before having stopped the upgrade? > > On Fri, Sep 2, 2022 at 2:53 PM Satish Patel wrote: > >> Do you think this is because I have only a single MON daemon running? I >> have only two nodes. >> >> On Fri, Sep 2, 2022 at 2:39 PM Satish Patel wrote: >> >>> Adam, >>> >>> I have enabled debug and my logs flood with the following. I am going to >>> try some stuff from your provided mailing list and see.. >>> >>> root@ceph1:~# tail -f >>> /var/log/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/ceph.cephadm.log >>> 2022-09-02T18:38:21.754391+ mgr.ceph2.huidoh (mgr.344392) 211198 : >>> cephadm [DBG] 0 OSDs are scheduled for removal: [] >>> 2022-09-02T18:38:21.754519+ mgr.ceph2.huidoh (mgr.344392) 211199 : >>> cephadm [DBG] Saving [] to store >>> 2022-09-02T18:38:21.757155+ mgr.ceph2.huidoh (mgr.344392) 211200 : >>> cephadm [DBG] refreshing hosts and daemons >>> 2022-09-02T18:38:21.758065+ mgr.ceph2.huidoh (mgr.344392) 211201 : >>> cephadm [DBG] _check_for_strays >>> 2022-09-02T18:38:21.758334+ mgr.ceph2.huidoh (mgr.344392) 211202 : >>> cephadm [DBG] 0 OSDs are scheduled for removal: [] >>> 2022-09-02T18:38:21.758455+ mgr.ceph2.huidoh (mgr.344392) 211203 : >>> cephadm [DBG] Saving [] to store >>> 2022-09-02T18:38:21.761001+ mgr.ceph2.huidoh (mgr.344392) 211204 : >>> cephadm [DBG] refreshing hosts and daemons >>> 2022-09-02T18:38:21.762092+ mgr.ceph2.huidoh (mgr.344392) 211205 : >>> cephadm [DBG] _check_for_strays >>> 2022-09-02T18:38:21.762357+ mgr.ceph2.huidoh (mgr.344392) 211206 : >>> cephadm [DBG] 0 OSDs are scheduled for removal: [] >>> 2022-09-02T18:38:21.762480+ mgr.ceph2.huidoh (mgr.344392) 211207 : >>> cephadm [DBG] Saving [] to store >>> >>> On Fri, Sep 2, 2022 at 12:17 PM Adam King wrote: >>> >>>> hmm, okay. It seems like cephadm is stuck in general rather than an >>>> issue specific to the upgrade. I'd first make sure the orchestrator isn't >>>> paused (just running "ceph orch resume" should be enough, it's idempotent). >>>> >>>> Beyond that, there was someone else who had an issue with things >>>> getting stuck that was resolved in this thread >>>> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M/#NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M >>>> <https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M/#NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M> >>>> that >>>> might be worth a look. >>>> >>>> If you haven't already, it's possible stopping the upgrade is a good >>>> idea, as maybe that's interfering with it getting to the point where it >>>> does the redeploy. >>>> >>>> If none of those help, it might be worth setting the log level to debug >>>> and seeing where things are ending up ("ceph config set mgr >>>> mgr/cephadm/log_to_cluster_level debug; ceph orch ps --refresh" then >>>> waiting a few minutes before running "ceph log last 100 debug cephadm" (not >>>> 100% on format of that command, if it fails try just "ceph log last >>>> cephadm"). We could maybe get more info on why it's not performing the >>>> redeploy from those debug logs. Just remember to set the log level back >>>> after 'ceph config set mgr mgr/cephadm/log_to_cluster_level info' as debug >>>> logs are quite verbose. 
>>>> >>>> On Fri, Sep 2, 2022 at 11:39 AM Satish Patel >>>> wrote: >>>> >>>>> Hi Adam, >>>>> >>>>> As you said, i did following >>>>> >>>>> $ ceph orch daemon redeploy mgr.ceph1.smfvfd >>>>> quay.io/ceph/ceph:v16.2.10 >>>>> >>>>> Noticed following line in logs but then no activity nothing, still >>>>> standby mgr running in older version >>>>> >>>>> 2022-09-02T15:35:45.753093+ mgr.ceph2.huidoh (mgr.344392) 2226 : >>>>> cephadm [INF] Schedule redeploy daemon mgr.ceph1.smfvfd >>>>> 2022-09-02T15:36:17.279190+ mgr.ceph2.huidoh (mgr.344392) 2245 : >>>
[ceph-users] Re: [cephadm] mgr: no daemons active
Do you think this is because I have only a single MON daemon running? I have only two nodes. On Fri, Sep 2, 2022 at 2:39 PM Satish Patel wrote: > Adam, > > I have enabled debug and my logs flood with the following. I am going to > try some stuff from your provided mailing list and see.. > > root@ceph1:~# tail -f > /var/log/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/ceph.cephadm.log > 2022-09-02T18:38:21.754391+ mgr.ceph2.huidoh (mgr.344392) 211198 : > cephadm [DBG] 0 OSDs are scheduled for removal: [] > 2022-09-02T18:38:21.754519+ mgr.ceph2.huidoh (mgr.344392) 211199 : > cephadm [DBG] Saving [] to store > 2022-09-02T18:38:21.757155+ mgr.ceph2.huidoh (mgr.344392) 211200 : > cephadm [DBG] refreshing hosts and daemons > 2022-09-02T18:38:21.758065+ mgr.ceph2.huidoh (mgr.344392) 211201 : > cephadm [DBG] _check_for_strays > 2022-09-02T18:38:21.758334+ mgr.ceph2.huidoh (mgr.344392) 211202 : > cephadm [DBG] 0 OSDs are scheduled for removal: [] > 2022-09-02T18:38:21.758455+ mgr.ceph2.huidoh (mgr.344392) 211203 : > cephadm [DBG] Saving [] to store > 2022-09-02T18:38:21.761001+ mgr.ceph2.huidoh (mgr.344392) 211204 : > cephadm [DBG] refreshing hosts and daemons > 2022-09-02T18:38:21.762092+ mgr.ceph2.huidoh (mgr.344392) 211205 : > cephadm [DBG] _check_for_strays > 2022-09-02T18:38:21.762357+ mgr.ceph2.huidoh (mgr.344392) 211206 : > cephadm [DBG] 0 OSDs are scheduled for removal: [] > 2022-09-02T18:38:21.762480+ mgr.ceph2.huidoh (mgr.344392) 211207 : > cephadm [DBG] Saving [] to store > > On Fri, Sep 2, 2022 at 12:17 PM Adam King wrote: > >> hmm, okay. It seems like cephadm is stuck in general rather than an issue >> specific to the upgrade. I'd first make sure the orchestrator isn't paused >> (just running "ceph orch resume" should be enough, it's idempotent). >> >> Beyond that, there was someone else who had an issue with things getting >> stuck that was resolved in this thread >> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M/#NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M >> <https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M/#NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M> >> that >> might be worth a look. >> >> If you haven't already, it's possible stopping the upgrade is a good >> idea, as maybe that's interfering with it getting to the point where it >> does the redeploy. >> >> If none of those help, it might be worth setting the log level to debug >> and seeing where things are ending up ("ceph config set mgr >> mgr/cephadm/log_to_cluster_level debug; ceph orch ps --refresh" then >> waiting a few minutes before running "ceph log last 100 debug cephadm" (not >> 100% on format of that command, if it fails try just "ceph log last >> cephadm"). We could maybe get more info on why it's not performing the >> redeploy from those debug logs. Just remember to set the log level back >> after 'ceph config set mgr mgr/cephadm/log_to_cluster_level info' as debug >> logs are quite verbose. 
>> >> On Fri, Sep 2, 2022 at 11:39 AM Satish Patel >> wrote: >> >>> Hi Adam, >>> >>> As you said, i did following >>> >>> $ ceph orch daemon redeploy mgr.ceph1.smfvfd quay.io/ceph/ceph:v16.2.10 >>> >>> Noticed following line in logs but then no activity nothing, still >>> standby mgr running in older version >>> >>> 2022-09-02T15:35:45.753093+ mgr.ceph2.huidoh (mgr.344392) 2226 : >>> cephadm [INF] Schedule redeploy daemon mgr.ceph1.smfvfd >>> 2022-09-02T15:36:17.279190+ mgr.ceph2.huidoh (mgr.344392) 2245 : >>> cephadm [INF] refreshing ceph2 facts >>> 2022-09-02T15:36:17.984478+ mgr.ceph2.huidoh (mgr.344392) 2246 : >>> cephadm [INF] refreshing ceph1 facts >>> 2022-09-02T15:37:17.663730+ mgr.ceph2.huidoh (mgr.344392) 2284 : >>> cephadm [INF] refreshing ceph2 facts >>> 2022-09-02T15:37:18.386586+ mgr.ceph2.huidoh (mgr.344392) 2285 : >>> cephadm [INF] refreshing ceph1 facts >>> >>> I am not seeing any image get downloaded also >>> >>> root@ceph1:~# docker image ls >>> REPOSITORY TAG IMAGE ID CREATED >>> SIZE >>> quay.io/ceph/ceph v15 93146564743f 3 weeks ago >>> 1.2GB >>> quay.io/ceph/ceph-grafana 8.3.5 dad864ee21e9 4 months >>> ago558MB >>> quay.io/prometheus/prometheus v2.33.4 514e6a882f6e 6 months >>> ago204MB >>> quay.io/prom
[ceph-users] Re: [cephadm] mgr: no daemons active
Adam, I have enabled debug and my logs flood with the following. I am going to try some stuff from your provided mailing list and see.. root@ceph1:~# tail -f /var/log/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/ceph.cephadm.log 2022-09-02T18:38:21.754391+ mgr.ceph2.huidoh (mgr.344392) 211198 : cephadm [DBG] 0 OSDs are scheduled for removal: [] 2022-09-02T18:38:21.754519+ mgr.ceph2.huidoh (mgr.344392) 211199 : cephadm [DBG] Saving [] to store 2022-09-02T18:38:21.757155+ mgr.ceph2.huidoh (mgr.344392) 211200 : cephadm [DBG] refreshing hosts and daemons 2022-09-02T18:38:21.758065+ mgr.ceph2.huidoh (mgr.344392) 211201 : cephadm [DBG] _check_for_strays 2022-09-02T18:38:21.758334+ mgr.ceph2.huidoh (mgr.344392) 211202 : cephadm [DBG] 0 OSDs are scheduled for removal: [] 2022-09-02T18:38:21.758455+ mgr.ceph2.huidoh (mgr.344392) 211203 : cephadm [DBG] Saving [] to store 2022-09-02T18:38:21.761001+ mgr.ceph2.huidoh (mgr.344392) 211204 : cephadm [DBG] refreshing hosts and daemons 2022-09-02T18:38:21.762092+ mgr.ceph2.huidoh (mgr.344392) 211205 : cephadm [DBG] _check_for_strays 2022-09-02T18:38:21.762357+ mgr.ceph2.huidoh (mgr.344392) 211206 : cephadm [DBG] 0 OSDs are scheduled for removal: [] 2022-09-02T18:38:21.762480+ mgr.ceph2.huidoh (mgr.344392) 211207 : cephadm [DBG] Saving [] to store On Fri, Sep 2, 2022 at 12:17 PM Adam King wrote: > hmm, okay. It seems like cephadm is stuck in general rather than an issue > specific to the upgrade. I'd first make sure the orchestrator isn't paused > (just running "ceph orch resume" should be enough, it's idempotent). > > Beyond that, there was someone else who had an issue with things getting > stuck that was resolved in this thread > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M/#NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M > <https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M/#NKKLV5TMHFA3ERGCMJ3M7BVLA5PGIR4M> > that > might be worth a look. > > If you haven't already, it's possible stopping the upgrade is a good idea, > as maybe that's interfering with it getting to the point where it does the > redeploy. > > If none of those help, it might be worth setting the log level to debug > and seeing where things are ending up ("ceph config set mgr > mgr/cephadm/log_to_cluster_level debug; ceph orch ps --refresh" then > waiting a few minutes before running "ceph log last 100 debug cephadm" (not > 100% on format of that command, if it fails try just "ceph log last > cephadm"). We could maybe get more info on why it's not performing the > redeploy from those debug logs. Just remember to set the log level back > after 'ceph config set mgr mgr/cephadm/log_to_cluster_level info' as debug > logs are quite verbose. 
> > On Fri, Sep 2, 2022 at 11:39 AM Satish Patel wrote: > >> Hi Adam, >> >> As you said, i did following >> >> $ ceph orch daemon redeploy mgr.ceph1.smfvfd quay.io/ceph/ceph:v16.2.10 >> >> Noticed following line in logs but then no activity nothing, still >> standby mgr running in older version >> >> 2022-09-02T15:35:45.753093+ mgr.ceph2.huidoh (mgr.344392) 2226 : >> cephadm [INF] Schedule redeploy daemon mgr.ceph1.smfvfd >> 2022-09-02T15:36:17.279190+ mgr.ceph2.huidoh (mgr.344392) 2245 : >> cephadm [INF] refreshing ceph2 facts >> 2022-09-02T15:36:17.984478+ mgr.ceph2.huidoh (mgr.344392) 2246 : >> cephadm [INF] refreshing ceph1 facts >> 2022-09-02T15:37:17.663730+ mgr.ceph2.huidoh (mgr.344392) 2284 : >> cephadm [INF] refreshing ceph2 facts >> 2022-09-02T15:37:18.386586+ mgr.ceph2.huidoh (mgr.344392) 2285 : >> cephadm [INF] refreshing ceph1 facts >> >> I am not seeing any image get downloaded also >> >> root@ceph1:~# docker image ls >> REPOSITORY TAG IMAGE ID CREATED >> SIZE >> quay.io/ceph/ceph v15 93146564743f 3 weeks ago >> 1.2GB >> quay.io/ceph/ceph-grafana 8.3.5 dad864ee21e9 4 months ago >>558MB >> quay.io/prometheus/prometheus v2.33.4 514e6a882f6e 6 months ago >>204MB >> quay.io/prometheus/alertmanagerv0.23.0 ba2b418f427c 12 months >> ago 57.5MB >> quay.io/ceph/ceph-grafana 6.7.4 557c83e11646 13 months >> ago 486MB >> quay.io/prometheus/prometheus v2.18.1 de242295e225 2 years ago >> 140MB >> quay.io/prometheus/alertmanagerv0.20.0 0881eb8f169f 2 years ago >> 52.1MB >> quay.io/prometheus/node-exporter v0.18.1 e5a616e4b9cf 3 years ago >> 22.9MB >> >> >> On Fri, Sep 2, 2022
[ceph-users] Re: [cephadm] mgr: no daemons active
Hi Adam,

As you said, I did the following:

$ ceph orch daemon redeploy mgr.ceph1.smfvfd quay.io/ceph/ceph:v16.2.10

Noticed the following line in the logs, but then no activity, nothing; the
standby mgr is still running the older version.

2022-09-02T15:35:45.753093+ mgr.ceph2.huidoh (mgr.344392) 2226 : cephadm [INF] Schedule redeploy daemon mgr.ceph1.smfvfd
2022-09-02T15:36:17.279190+ mgr.ceph2.huidoh (mgr.344392) 2245 : cephadm [INF] refreshing ceph2 facts
2022-09-02T15:36:17.984478+ mgr.ceph2.huidoh (mgr.344392) 2246 : cephadm [INF] refreshing ceph1 facts
2022-09-02T15:37:17.663730+ mgr.ceph2.huidoh (mgr.344392) 2284 : cephadm [INF] refreshing ceph2 facts
2022-09-02T15:37:18.386586+ mgr.ceph2.huidoh (mgr.344392) 2285 : cephadm [INF] refreshing ceph1 facts

I am not seeing any image get downloaded either:

root@ceph1:~# docker image ls
REPOSITORY                         TAG       IMAGE ID       CREATED         SIZE
quay.io/ceph/ceph                  v15       93146564743f   3 weeks ago     1.2GB
quay.io/ceph/ceph-grafana          8.3.5     dad864ee21e9   4 months ago    558MB
quay.io/prometheus/prometheus      v2.33.4   514e6a882f6e   6 months ago    204MB
quay.io/prometheus/alertmanager    v0.23.0   ba2b418f427c   12 months ago   57.5MB
quay.io/ceph/ceph-grafana          6.7.4     557c83e11646   13 months ago   486MB
quay.io/prometheus/prometheus      v2.18.1   de242295e225   2 years ago     140MB
quay.io/prometheus/alertmanager    v0.20.0   0881eb8f169f   2 years ago     52.1MB
quay.io/prometheus/node-exporter   v0.18.1   e5a616e4b9cf   3 years ago     22.9MB

On Fri, Sep 2, 2022 at 11:06 AM Adam King wrote:

> hmm, at this point, maybe we should just try manually upgrading the mgr
> daemons and then move from there. First, just stop the upgrade "ceph orch
> upgrade stop". If you figure out which of the two mgr daemons is the
> standby (it should say which one is active in "ceph -s" output) and then do
> a "ceph orch daemon redeploy quay.io/ceph/ceph:v16.2.10"
> it should redeploy that specific mgr with the new version. You could then
> do a "ceph mgr fail" to swap which of the mgr daemons is active, then do
> another "ceph orch daemon redeploy quay.io/ceph/ceph:v16.2.10" where the
> standby is now the other mgr still on 15.2.17. Once the mgr daemons are
> both upgraded to the new version, run a "ceph orch redeploy mgr" and then
> "ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.10" and see if it
> goes better.
>
> On Fri, Sep 2, 2022 at 10:36 AM Satish Patel wrote:
>
>> Hi Adam,
>>
>> I run the following command to upgrade but it looks like nothing is
>> happening
>>
>> $ ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.10
>>
>> Status message is empty..
>>
>> root@ceph1:~# ceph orch upgrade status
>> {
>>     "target_image": "quay.io/ceph/ceph:v16.2.10",
>>     "in_progress": true,
>>     "services_complete": [],
>>     "message": ""
>> }
>>
>> Nothing in Logs
>>
>> root@ceph1:~# tail -f /var/log/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/ceph.cephadm.log
>> 2022-09-02T14:31:52.597661+ mgr.ceph2.huidoh (mgr.344392) 174 : cephadm [INF] refreshing ceph2 facts
>> 2022-09-02T14:31:52.991450+ mgr.ceph2.huidoh (mgr.344392) 176 : cephadm [INF] refreshing ceph1 facts
>> 2022-09-02T14:32:52.965092+ mgr.ceph2.huidoh (mgr.344392) 207 : cephadm [INF] refreshing ceph2 facts
>> 2022-09-02T14:32:53.369789+ mgr.ceph2.huidoh (mgr.344392) 208 : cephadm [INF] refreshing ceph1 facts
>> 2022-09-02T14:33:53.367986+ mgr.ceph2.huidoh (mgr.344392) 239 : cephadm [INF] refreshing ceph2 facts
>> 2022-09-02T14:33:53.760427+ mgr.ceph2.huidoh (mgr.344392) 240 : cephadm [INF] refreshing ceph1 facts
>> 2022-09-02T14:34:53.754277+ mgr.ceph2.huidoh (mgr.344392) 272 : cephadm [INF] refreshing ceph2 facts
>> 2022-09-02T14:34:54.162503+ mgr.ceph2.huidoh (mgr.344392) 273 : cephadm [INF] refreshing ceph1 facts
>> 2022-09-02T14:35:54.133467+ mgr.ceph2.huidoh (mgr.344392) 305 : cephadm [INF] refreshing ceph2 facts
>> 2022-09-02T14:35:54.522171+ mgr.ceph2.huidoh (mgr.344392) 306 : cephadm [INF] refreshing ceph1 facts
>>
>> In progress that mesg stuck there for long time
>>
>> root@ceph1:~# ceph -s
>>   cluster:
>>     id:     f270ad9e-1f6f-11ed-b6f8-a539d87379ea
>>     health: HEALTH_OK
>>
>>   services:
>>     mon: 1 daemons, quorum ceph1 (age 9h)
>>     mgr: ceph2.huidoh(active, since 9m), standbys: ceph1.smfvfd
>>     osd: 4 osds: 4 up (since 9h), 4 in (since 11h)
>>
>>   data:
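Pulling Adam's quoted suggestion together into one place, the manual mgr
upgrade sequence for this particular cluster (active mgr ceph2.huidoh, standby
ceph1.smfvfd at the time) would be roughly:

ceph orch upgrade stop
# redeploy the standby mgr on the new image first
ceph orch daemon redeploy mgr.ceph1.smfvfd quay.io/ceph/ceph:v16.2.10
# make the upgraded mgr active, then redeploy the remaining old one
ceph mgr fail
ceph orch daemon redeploy mgr.ceph2.huidoh quay.io/ceph/ceph:v16.2.10
# once both mgrs run the new version, redeploy the mgr service and resume the upgrade
ceph orch redeploy mgr
ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.10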
[ceph-users] Re: [cephadm] mgr: no daemons active
Hi Adam, I run the following command to upgrade but it looks like nothing is happening $ ceph orch upgrade start --image quay.io/ceph/ceph:v16.2.10 Status message is empty.. root@ceph1:~# ceph orch upgrade status { "target_image": "quay.io/ceph/ceph:v16.2.10", "in_progress": true, "services_complete": [], "message": "" } Nothing in Logs root@ceph1:~# tail -f /var/log/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/ceph.cephadm.log 2022-09-02T14:31:52.597661+ mgr.ceph2.huidoh (mgr.344392) 174 : cephadm [INF] refreshing ceph2 facts 2022-09-02T14:31:52.991450+ mgr.ceph2.huidoh (mgr.344392) 176 : cephadm [INF] refreshing ceph1 facts 2022-09-02T14:32:52.965092+ mgr.ceph2.huidoh (mgr.344392) 207 : cephadm [INF] refreshing ceph2 facts 2022-09-02T14:32:53.369789+ mgr.ceph2.huidoh (mgr.344392) 208 : cephadm [INF] refreshing ceph1 facts 2022-09-02T14:33:53.367986+ mgr.ceph2.huidoh (mgr.344392) 239 : cephadm [INF] refreshing ceph2 facts 2022-09-02T14:33:53.760427+ mgr.ceph2.huidoh (mgr.344392) 240 : cephadm [INF] refreshing ceph1 facts 2022-09-02T14:34:53.754277+ mgr.ceph2.huidoh (mgr.344392) 272 : cephadm [INF] refreshing ceph2 facts 2022-09-02T14:34:54.162503+ mgr.ceph2.huidoh (mgr.344392) 273 : cephadm [INF] refreshing ceph1 facts 2022-09-02T14:35:54.133467+ mgr.ceph2.huidoh (mgr.344392) 305 : cephadm [INF] refreshing ceph2 facts 2022-09-02T14:35:54.522171+ mgr.ceph2.huidoh (mgr.344392) 306 : cephadm [INF] refreshing ceph1 facts In progress that mesg stuck there for long time root@ceph1:~# ceph -s cluster: id: f270ad9e-1f6f-11ed-b6f8-a539d87379ea health: HEALTH_OK services: mon: 1 daemons, quorum ceph1 (age 9h) mgr: ceph2.huidoh(active, since 9m), standbys: ceph1.smfvfd osd: 4 osds: 4 up (since 9h), 4 in (since 11h) data: pools: 5 pools, 129 pgs objects: 20.06k objects, 83 GiB usage: 168 GiB used, 632 GiB / 800 GiB avail pgs: 129 active+clean io: client: 12 KiB/s wr, 0 op/s rd, 1 op/s wr progress: Upgrade to quay.io/ceph/ceph:v16.2.10 (0s) [........] On Fri, Sep 2, 2022 at 10:25 AM Satish Patel wrote: > It Looks like I did it with the following command. > > $ ceph orch daemon add mgr ceph2:10.73.0.192 > > Now i can see two with same version 15.x > > root@ceph1:~# ceph orch ps --daemon-type mgr > NAME HOST STATUS REFRESHED AGE VERSION IMAGE > NAME > IMAGE ID CONTAINER ID > mgr.ceph1.smfvfd ceph1 running (8h) 41s ago8h 15.2.17 > quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca > 93146564743f 1aab837306d2 > mgr.ceph2.huidoh ceph2 running (60s) 110s ago 60s 15.2.17 > quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca > 93146564743f 294fd6ab6c97 > > On Fri, Sep 2, 2022 at 10:19 AM Satish Patel wrote: > >> Let's come back to the original question: how to bring back the second >> mgr? >> >> root@ceph1:~# ceph orch apply mgr 2 >> Scheduled mgr update... 
>> >> Nothing happened with above command, logs saying nothing >> >> 2022-09-02T14:16:20.407927+ mgr.ceph1.smfvfd (mgr.334626) 16939 : >> cephadm [INF] refreshing ceph2 facts >> 2022-09-02T14:16:40.247195+ mgr.ceph1.smfvfd (mgr.334626) 16952 : >> cephadm [INF] Saving service mgr spec with placement count:2 >> 2022-09-02T14:16:53.106919+ mgr.ceph1.smfvfd (mgr.334626) 16961 : >> cephadm [INF] Saving service mgr spec with placement count:2 >> 2022-09-02T14:17:19.135203+ mgr.ceph1.smfvfd (mgr.334626) 16975 : >> cephadm [INF] refreshing ceph1 facts >> 2022-09-02T14:17:20.780496+ mgr.ceph1.smfvfd (mgr.334626) 16977 : >> cephadm [INF] refreshing ceph2 facts >> 2022-09-02T14:18:19.502034+ mgr.ceph1.smfvfd (mgr.334626) 17008 : >> cephadm [INF] refreshing ceph1 facts >> 2022-09-02T14:18:21.127973+ mgr.ceph1.smfvfd (mgr.334626) 17010 : >> cephadm [INF] refreshing ceph2 facts >> >> >> >> >> >> >> >> On Fri, Sep 2, 2022 at 10:15 AM Satish Patel >> wrote: >> >>> Hi Adam, >>> >>> Wait..wait.. now it's working suddenly without doing anything.. very odd >>> >>> root@ceph1:~# ceph orch ls >>> NAME RUNNING REFRESHED AGE PLACEMENTIMAGE NAME >>> >>> IMAGE ID >>> alertmanager 1/1 5s ago 2w count:1 >>> quay.io/prometheus/alertmanager:v0.20.0 >>>0881eb8f169f >>> crash 2/2 5s ago 2w * >>> quay.io/ceph/ceph:v15 >>>93146564743f >>&
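For anyone who lands on this thread with the same symptom (upgrade reported as in_progress with an empty status message and nothing useful in the logs), a generic way to nudge a stuck cephadm upgrade is to pause/resume it and, if that is not enough, fail over the active mgr so the upgrade loop restarts on the standby. This is only a sketch, not the resolution of this thread; substitute whatever active mgr name "ceph mgr stat" reports:

ceph orch upgrade pause
ceph orch upgrade resume
ceph orch upgrade status
# if it still sits at "in_progress" with no message, restart the orchestration
# by failing over to the standby mgr, then check again
ceph mgr fail ceph2.huidoh
ceph orch upgrade status
# "ceph orch upgrade stop" cancels the upgrade entirely if you want to start over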
[ceph-users] Re: [cephadm] mgr: no daemons active
It Looks like I did it with the following command. $ ceph orch daemon add mgr ceph2:10.73.0.192 Now i can see two with same version 15.x root@ceph1:~# ceph orch ps --daemon-type mgr NAME HOST STATUS REFRESHED AGE VERSION IMAGE NAME IMAGE ID CONTAINER ID mgr.ceph1.smfvfd ceph1 running (8h) 41s ago8h 15.2.17 quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca 93146564743f 1aab837306d2 mgr.ceph2.huidoh ceph2 running (60s) 110s ago 60s 15.2.17 quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca 93146564743f 294fd6ab6c97 On Fri, Sep 2, 2022 at 10:19 AM Satish Patel wrote: > Let's come back to the original question: how to bring back the second mgr? > > root@ceph1:~# ceph orch apply mgr 2 > Scheduled mgr update... > > Nothing happened with above command, logs saying nothing > > 2022-09-02T14:16:20.407927+ mgr.ceph1.smfvfd (mgr.334626) 16939 : > cephadm [INF] refreshing ceph2 facts > 2022-09-02T14:16:40.247195+ mgr.ceph1.smfvfd (mgr.334626) 16952 : > cephadm [INF] Saving service mgr spec with placement count:2 > 2022-09-02T14:16:53.106919+ mgr.ceph1.smfvfd (mgr.334626) 16961 : > cephadm [INF] Saving service mgr spec with placement count:2 > 2022-09-02T14:17:19.135203+ mgr.ceph1.smfvfd (mgr.334626) 16975 : > cephadm [INF] refreshing ceph1 facts > 2022-09-02T14:17:20.780496+ mgr.ceph1.smfvfd (mgr.334626) 16977 : > cephadm [INF] refreshing ceph2 facts > 2022-09-02T14:18:19.502034+ mgr.ceph1.smfvfd (mgr.334626) 17008 : > cephadm [INF] refreshing ceph1 facts > 2022-09-02T14:18:21.127973+ mgr.ceph1.smfvfd (mgr.334626) 17010 : > cephadm [INF] refreshing ceph2 facts > > > > > > > > On Fri, Sep 2, 2022 at 10:15 AM Satish Patel wrote: > >> Hi Adam, >> >> Wait..wait.. now it's working suddenly without doing anything.. very odd >> >> root@ceph1:~# ceph orch ls >> NAME RUNNING REFRESHED AGE PLACEMENTIMAGE NAME >> >> IMAGE ID >> alertmanager 1/1 5s ago 2w count:1 >> quay.io/prometheus/alertmanager:v0.20.0 >>0881eb8f169f >> crash 2/2 5s ago 2w * >> quay.io/ceph/ceph:v15 >>93146564743f >> grafana 1/1 5s ago 2w count:1 >> quay.io/ceph/ceph-grafana:6.7.4 >>557c83e11646 >> mgr 1/2 5s ago 8h count:2 >> quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca >> 93146564743f >> mon 1/2 5s ago 8h ceph1;ceph2 >> quay.io/ceph/ceph:v15 >>93146564743f >> node-exporter 2/2 5s ago 2w * >> quay.io/prometheus/node-exporter:v0.18.1 >> e5a616e4b9cf >> osd.osd_spec_default 4/0 5s ago - >> quay.io/ceph/ceph:v15 >>93146564743f >> prometheus1/1 5s ago 2w count:1 >> quay.io/prometheus/prometheus:v2.18.1 >> >> On Fri, Sep 2, 2022 at 10:13 AM Satish Patel >> wrote: >> >>> I can see that in the output but I'm not sure how to get rid of it. 
>>> >>> root@ceph1:~# ceph orch ps --refresh >>> NAME >>> HOST STATUSREFRESHED AGE VERSIONIMAGE NAME >>> IMAGE ID >>>CONTAINER ID >>> alertmanager.ceph1 >>> ceph1 running (9h) 64s ago2w 0.20.0 >>> quay.io/prometheus/alertmanager:v0.20.0 >>>0881eb8f169f ba804b555378 >>> cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d >>> ceph2 stopped 65s ago- >>> >>> >>> crash.ceph1 >>> ceph1 running (9h) 64s ago2w 15.2.17quay.io/ceph/ceph:v15 >>> >>> 93146564743f a3a431d834fc >>> crash.ceph2 >>> ceph2 running (9h) 65s ago13d 15.2.17quay.io/ceph/ceph:v15 >>> >>> 93146564743f 3c963693ff2b >>> grafana.ceph1 >>> ceph1 running (9h) 64s ago2w 6.7.4 >>> quay.io/ceph/ceph-grafana:6.7.4 >>>557c83e11646 7583a8dc4c61 >>> mgr.ceph1.smfvfd >>> ceph1 running (8h) 64s ago8h 15.2.17 >>> quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca >>> 93146564743f 1aab837306d2 >>> mon.ceph1 >>> ceph1 running
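As an aside, the manual "ceph orch daemon add mgr ceph2:10.73.0.192" above works, but the declarative route usually keeps cephadm and the running daemons in sync better. A minimal sketch (hostnames taken from this thread, everything else assumed):

ceph orch apply mgr --placement="ceph1 ceph2"

or the same thing as a spec file applied with "ceph orch apply -i mgr.yaml":

service_type: mgr
placement:
  hosts:
    - ceph1
    - ceph2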
[ceph-users] Re: [cephadm] mgr: no daemons active
Let's come back to the original question: how to bring back the second mgr? root@ceph1:~# ceph orch apply mgr 2 Scheduled mgr update... Nothing happened with above command, logs saying nothing 2022-09-02T14:16:20.407927+ mgr.ceph1.smfvfd (mgr.334626) 16939 : cephadm [INF] refreshing ceph2 facts 2022-09-02T14:16:40.247195+ mgr.ceph1.smfvfd (mgr.334626) 16952 : cephadm [INF] Saving service mgr spec with placement count:2 2022-09-02T14:16:53.106919+ mgr.ceph1.smfvfd (mgr.334626) 16961 : cephadm [INF] Saving service mgr spec with placement count:2 2022-09-02T14:17:19.135203+ mgr.ceph1.smfvfd (mgr.334626) 16975 : cephadm [INF] refreshing ceph1 facts 2022-09-02T14:17:20.780496+ mgr.ceph1.smfvfd (mgr.334626) 16977 : cephadm [INF] refreshing ceph2 facts 2022-09-02T14:18:19.502034+ mgr.ceph1.smfvfd (mgr.334626) 17008 : cephadm [INF] refreshing ceph1 facts 2022-09-02T14:18:21.127973+ mgr.ceph1.smfvfd (mgr.334626) 17010 : cephadm [INF] refreshing ceph2 facts On Fri, Sep 2, 2022 at 10:15 AM Satish Patel wrote: > Hi Adam, > > Wait..wait.. now it's working suddenly without doing anything.. very odd > > root@ceph1:~# ceph orch ls > NAME RUNNING REFRESHED AGE PLACEMENTIMAGE NAME > > IMAGE ID > alertmanager 1/1 5s ago 2w count:1 > quay.io/prometheus/alertmanager:v0.20.0 > 0881eb8f169f > crash 2/2 5s ago 2w * > quay.io/ceph/ceph:v15 > 93146564743f > grafana 1/1 5s ago 2w count:1 > quay.io/ceph/ceph-grafana:6.7.4 > 557c83e11646 > mgr 1/2 5s ago 8h count:2 > quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca > 93146564743f > mon 1/2 5s ago 8h ceph1;ceph2 > quay.io/ceph/ceph:v15 > 93146564743f > node-exporter 2/2 5s ago 2w * > quay.io/prometheus/node-exporter:v0.18.1 > e5a616e4b9cf > osd.osd_spec_default 4/0 5s ago - > quay.io/ceph/ceph:v15 > 93146564743f > prometheus1/1 5s ago 2w count:1 > quay.io/prometheus/prometheus:v2.18.1 > > On Fri, Sep 2, 2022 at 10:13 AM Satish Patel wrote: > >> I can see that in the output but I'm not sure how to get rid of it. 
>> >> root@ceph1:~# ceph orch ps --refresh >> NAME >> HOST STATUSREFRESHED AGE VERSIONIMAGE NAME >> IMAGE ID >>CONTAINER ID >> alertmanager.ceph1 >> ceph1 running (9h) 64s ago2w 0.20.0 >> quay.io/prometheus/alertmanager:v0.20.0 >>0881eb8f169f ba804b555378 >> cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d >> ceph2 stopped 65s ago- >> >> >> crash.ceph1 >> ceph1 running (9h) 64s ago2w 15.2.17quay.io/ceph/ceph:v15 >> >> 93146564743f a3a431d834fc >> crash.ceph2 >> ceph2 running (9h) 65s ago13d 15.2.17quay.io/ceph/ceph:v15 >> >> 93146564743f 3c963693ff2b >> grafana.ceph1 >> ceph1 running (9h) 64s ago2w 6.7.4 >> quay.io/ceph/ceph-grafana:6.7.4 >>557c83e11646 7583a8dc4c61 >> mgr.ceph1.smfvfd >> ceph1 running (8h) 64s ago8h 15.2.17 >> quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca >> 93146564743f 1aab837306d2 >> mon.ceph1 >> ceph1 running (9h) 64s ago2w 15.2.17quay.io/ceph/ceph:v15 >> >> 93146564743f c1d155d8c7ad >> node-exporter.ceph1 >> ceph1 running (9h) 64s ago2w 0.18.1 >> quay.io/prometheus/node-exporter:v0.18.1 >> e5a616e4b9cf 2ff235fe0e42 >> node-exporter.ceph2 >> ceph2 running (9h) 65s ago13d 0.18.1 >> quay.io/prometheus/node-exporter:v0.18.1 >> e5a616e4b9cf 17678b9ba602 >> osd.0 >> ceph1 running (9h) 64s ago13d 15.2.17quay.io/ceph/ceph:v15 >> >> 93146564743f d0fd73b777a3 >> osd.1 >> ceph1 running (9h) 64s ago13d 15.2.17quay.io/ceph/ceph:v15 >> >> 93146564743f 049120e83102 >> osd.2 >> ceph2 running (9h) 65s ago13d 15.2.17quay.io/ceph/ceph:v15 >> >> 93146564743f 8700e8cefd1f >> osd.3 >> ceph2 running (9h) 65s ago13d 15.2.17quay.io/ceph/ceph:v15 >> >> 93146564743f 9c71bc87ed16 >> prometheus.ceph1 >> ceph1 running (9h) 64s ago2w 2.18.1 >>
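When "ceph orch apply" schedules something and then nothing seems to happen, it helps to watch the cephadm module's cluster-log channel directly instead of tailing the log file on disk. A sketch of the usual commands from the cephadm troubleshooting guide:

ceph -W cephadm                  # stream cephadm events as they happen
ceph config set mgr mgr/cephadm/log_to_cluster_level debug
ceph log last 100 debug cephadm  # pull the most recent entries at debug level
ceph config set mgr mgr/cephadm/log_to_cluster_level info   # set it back when done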
[ceph-users] Re: [cephadm] mgr: no daemons active
Hi Adam, Wait..wait.. now it's working suddenly without doing anything.. very odd root@ceph1:~# ceph orch ls NAME RUNNING REFRESHED AGE PLACEMENTIMAGE NAME IMAGE ID alertmanager 1/1 5s ago 2w count:1 quay.io/prometheus/alertmanager:v0.20.0 0881eb8f169f crash 2/2 5s ago 2w * quay.io/ceph/ceph:v15 93146564743f grafana 1/1 5s ago 2w count:1 quay.io/ceph/ceph-grafana:6.7.4 557c83e11646 mgr 1/2 5s ago 8h count:2 quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca 93146564743f mon 1/2 5s ago 8h ceph1;ceph2 quay.io/ceph/ceph:v15 93146564743f node-exporter 2/2 5s ago 2w * quay.io/prometheus/node-exporter:v0.18.1 e5a616e4b9cf osd.osd_spec_default 4/0 5s ago - quay.io/ceph/ceph:v15 93146564743f prometheus1/1 5s ago 2w count:1 quay.io/prometheus/prometheus:v2.18.1 On Fri, Sep 2, 2022 at 10:13 AM Satish Patel wrote: > I can see that in the output but I'm not sure how to get rid of it. > > root@ceph1:~# ceph orch ps --refresh > NAME > HOST STATUSREFRESHED AGE VERSIONIMAGE NAME > IMAGE ID >CONTAINER ID > alertmanager.ceph1 > ceph1 running (9h) 64s ago2w 0.20.0 > quay.io/prometheus/alertmanager:v0.20.0 > 0881eb8f169f ba804b555378 > cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d > ceph2 stopped 65s ago- > > > crash.ceph1 > ceph1 running (9h) 64s ago2w 15.2.17quay.io/ceph/ceph:v15 > > 93146564743f a3a431d834fc > crash.ceph2 > ceph2 running (9h) 65s ago13d 15.2.17quay.io/ceph/ceph:v15 > > 93146564743f 3c963693ff2b > grafana.ceph1 > ceph1 running (9h) 64s ago2w 6.7.4 > quay.io/ceph/ceph-grafana:6.7.4 > 557c83e11646 7583a8dc4c61 > mgr.ceph1.smfvfd > ceph1 running (8h) 64s ago8h 15.2.17 > quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca > 93146564743f 1aab837306d2 > mon.ceph1 > ceph1 running (9h) 64s ago2w 15.2.17quay.io/ceph/ceph:v15 > > 93146564743f c1d155d8c7ad > node-exporter.ceph1 > ceph1 running (9h) 64s ago2w 0.18.1 > quay.io/prometheus/node-exporter:v0.18.1 > e5a616e4b9cf 2ff235fe0e42 > node-exporter.ceph2 > ceph2 running (9h) 65s ago13d 0.18.1 > quay.io/prometheus/node-exporter:v0.18.1 > e5a616e4b9cf 17678b9ba602 > osd.0 > ceph1 running (9h) 64s ago13d 15.2.17quay.io/ceph/ceph:v15 > > 93146564743f d0fd73b777a3 > osd.1 > ceph1 running (9h) 64s ago13d 15.2.17quay.io/ceph/ceph:v15 > > 93146564743f 049120e83102 > osd.2 > ceph2 running (9h) 65s ago13d 15.2.17quay.io/ceph/ceph:v15 > > 93146564743f 8700e8cefd1f > osd.3 > ceph2 running (9h) 65s ago13d 15.2.17quay.io/ceph/ceph:v15 > > 93146564743f 9c71bc87ed16 > prometheus.ceph1 > ceph1 running (9h) 64s ago2w 2.18.1 > quay.io/prometheus/prometheus:v2.18.1 > de242295e225 74a538efd61e > > On Fri, Sep 2, 2022 at 10:10 AM Adam King wrote: > >> maybe also a "ceph orch ps --refresh"? It might still have the old cached >> daemon inventory from before you remove the files. >> >> On Fri, Sep 2, 2022 at 9:57 AM Satish Patel wrote: >> >>> Hi Adam, >>> >>> I have deleted file located here - rm >>> /var/lib/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d >>> >>> But still getting the same error, do i need to do anything else? >>> >>> On Fri, Sep 2, 2022 at 9:51 AM Adam King wrote: >>> >>>> Okay, I'm wondering if this is an issue with version mismatch. Having >>>> previously had a 16.2.10 mgr and then now having a 15.2.17 one that doesn't >>>> expect this sort of thing to be present. 
Either way, I'd think just >>>> deleting this cephadm.7ce656a8721deb5054c37b0cfb9038 >>>> 1522d521dde51fb0c5a2142314d663f63d (and any others like it) file would >>>> be the way forward to get orch ls working again. >>>> >>>> On Fri, Sep 2, 2022 at 9:44 AM Satish Patel >>>> wrote: >>>> >>>>> Hi Adam, >>>>> >>>>> In cephadm ls i found the following service but
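For anyone else who ends up with a stray "cephadm.<hash>" entry like the one discussed above: those names come from leftover directories under /var/lib/ceph/<fsid>/ on the host, so the cleanup is roughly the following (paths taken from this thread; treat it as a sketch and double-check what you are deleting first):

cephadm ls | grep 'cephadm\.'   # run on each host to confirm which one reports the stray entry
rm -rf /var/lib/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d
ceph orch ps --refresh          # make the mgr drop its cached daemon inventory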
[ceph-users] Re: [cephadm] mgr: no daemons active
I can see that in the output but I'm not sure how to get rid of it. root@ceph1:~# ceph orch ps --refresh NAME HOST STATUSREFRESHED AGE VERSIONIMAGE NAME IMAGE ID CONTAINER ID alertmanager.ceph1 ceph1 running (9h) 64s ago2w 0.20.0 quay.io/prometheus/alertmanager:v0.20.0 0881eb8f169f ba804b555378 cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d ceph2 stopped 65s ago- crash.ceph1 ceph1 running (9h) 64s ago2w 15.2.17quay.io/ceph/ceph:v15 93146564743f a3a431d834fc crash.ceph2 ceph2 running (9h) 65s ago13d 15.2.17quay.io/ceph/ceph:v15 93146564743f 3c963693ff2b grafana.ceph1 ceph1 running (9h) 64s ago2w 6.7.4 quay.io/ceph/ceph-grafana:6.7.4 557c83e11646 7583a8dc4c61 mgr.ceph1.smfvfd ceph1 running (8h) 64s ago8h 15.2.17 quay.io/ceph/ceph@sha256:c08064dde4bba4e72a1f55d90ca32df9ef5aafab82efe2e0a0722444a5aaacca 93146564743f 1aab837306d2 mon.ceph1 ceph1 running (9h) 64s ago2w 15.2.17quay.io/ceph/ceph:v15 93146564743f c1d155d8c7ad node-exporter.ceph1 ceph1 running (9h) 64s ago2w 0.18.1 quay.io/prometheus/node-exporter:v0.18.1 e5a616e4b9cf 2ff235fe0e42 node-exporter.ceph2 ceph2 running (9h) 65s ago13d 0.18.1 quay.io/prometheus/node-exporter:v0.18.1 e5a616e4b9cf 17678b9ba602 osd.0 ceph1 running (9h) 64s ago13d 15.2.17quay.io/ceph/ceph:v15 93146564743f d0fd73b777a3 osd.1 ceph1 running (9h) 64s ago13d 15.2.17quay.io/ceph/ceph:v15 93146564743f 049120e83102 osd.2 ceph2 running (9h) 65s ago13d 15.2.17quay.io/ceph/ceph:v15 93146564743f 8700e8cefd1f osd.3 ceph2 running (9h) 65s ago13d 15.2.17quay.io/ceph/ceph:v15 93146564743f 9c71bc87ed16 prometheus.ceph1 ceph1 running (9h) 64s ago2w 2.18.1 quay.io/prometheus/prometheus:v2.18.1 de242295e225 74a538efd61e On Fri, Sep 2, 2022 at 10:10 AM Adam King wrote: > maybe also a "ceph orch ps --refresh"? It might still have the old cached > daemon inventory from before you remove the files. > > On Fri, Sep 2, 2022 at 9:57 AM Satish Patel wrote: > >> Hi Adam, >> >> I have deleted file located here - rm >> /var/lib/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d >> >> But still getting the same error, do i need to do anything else? >> >> On Fri, Sep 2, 2022 at 9:51 AM Adam King wrote: >> >>> Okay, I'm wondering if this is an issue with version mismatch. Having >>> previously had a 16.2.10 mgr and then now having a 15.2.17 one that doesn't >>> expect this sort of thing to be present. Either way, I'd think just >>> deleting this cephadm.7ce656a8721deb5054c37b0cfb9038 >>> 1522d521dde51fb0c5a2142314d663f63d (and any others like it) file would >>> be the way forward to get orch ls working again. >>> >>> On Fri, Sep 2, 2022 at 9:44 AM Satish Patel >>> wrote: >>> >>>> Hi Adam, >>>> >>>> In cephadm ls i found the following service but i believe it was there >>>> before also. >>>> >>>> { >>>> "style": "cephadm:v1", >>>> "name": >>>> "cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d", >>>> "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea", >>>> "systemd_unit": >>>> "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d >>>> ", >>>> "enabled": false, >>>> "state": "stopped", >>>> "container_id": null, >>>> "container_image_name": null, >>>> "container_image_id": null, >>>> "version": null, >>>> "started": null, >>>> "created": null, >>>> "deployed": null, >>>> "configured&q
[ceph-users] Re: [cephadm] mgr: no daemons active
Hi Adam, I have deleted file located here - rm /var/lib/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d But still getting the same error, do i need to do anything else? On Fri, Sep 2, 2022 at 9:51 AM Adam King wrote: > Okay, I'm wondering if this is an issue with version mismatch. Having > previously had a 16.2.10 mgr and then now having a 15.2.17 one that doesn't > expect this sort of thing to be present. Either way, I'd think just > deleting this cephadm.7ce656a8721deb5054c37b0cfb9038 > 1522d521dde51fb0c5a2142314d663f63d (and any others like it) file would be > the way forward to get orch ls working again. > > On Fri, Sep 2, 2022 at 9:44 AM Satish Patel wrote: > >> Hi Adam, >> >> In cephadm ls i found the following service but i believe it was there >> before also. >> >> { >> "style": "cephadm:v1", >> "name": >> "cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d", >> "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea", >> "systemd_unit": >> "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d >> ", >> "enabled": false, >> "state": "stopped", >> "container_id": null, >> "container_image_name": null, >> "container_image_id": null, >> "version": null, >> "started": null, >> "created": null, >> "deployed": null, >> "configured": null >> }, >> >> Look like remove didn't work >> >> root@ceph1:~# ceph orch rm cephadm >> Failed to remove service. was not found. >> >> root@ceph1:~# ceph orch rm >> cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d >> Failed to remove service. >> >> was not found. >> >> On Fri, Sep 2, 2022 at 8:27 AM Adam King wrote: >> >>> this looks like an old traceback you would get if you ended up with a >>> service type that shouldn't be there somehow. The things I'd probably check >>> are that "cephadm ls" on either host definitely doesn't report and strange >>> things that aren't actually daemons in your cluster such as >>> "cephadm.". Another thing you could maybe try, as I believe the >>> assertion it's giving is for an unknown service type here ("AssertionError: >>> cephadm"), is just "ceph orch rm cephadm" which would maybe cause it to >>> remove whatever it thinks is this "cephadm" service that it has deployed. >>> Lastly, you could try having the mgr you manually deploy be a 16.2.10 one >>> instead of 15.2.17 (I'm assuming here, but the line numbers in that >>> traceback suggest octopus). The 16.2.10 one is just much less likely to >>> have a bug that causes something like this. >>> >>> On Fri, Sep 2, 2022 at 1:41 AM Satish Patel >>> wrote: >>> >>>> Now when I run "ceph orch ps" it works but the following command throws >>>> an >>>> error. 
Trying to bring up second mgr using ceph orch apply mgr command >>>> but >>>> didn't help >>>> >>>> root@ceph1:/ceph-disk# ceph version >>>> ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus >>>> (stable) >>>> >>>> root@ceph1:/ceph-disk# ceph orch ls >>>> Error EINVAL: Traceback (most recent call last): >>>> File "/usr/share/ceph/mgr/mgr_module.py", line 1212, in >>>> _handle_command >>>> return self.handle_command(inbuf, cmd) >>>> File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 140, in >>>> handle_command >>>> return dispatch[cmd['prefix']].call(self, cmd, inbuf) >>>> File "/usr/share/ceph/mgr/mgr_module.py", line 320, in call >>>> return self.func(mgr, **kwargs) >>>> File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 102, in >>>> >>>> wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, >>>> **l_kwargs) >>>> File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 91, in >>>> wrapper >>>> return func(*args, **kwargs) >>>> File "/usr/share
[ceph-users] Re: [cephadm] mgr: no daemons active
Hi Adam, In cephadm ls i found the following service but i believe it was there before also. { "style": "cephadm:v1", "name": "cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d", "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea", "systemd_unit": "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d ", "enabled": false, "state": "stopped", "container_id": null, "container_image_name": null, "container_image_id": null, "version": null, "started": null, "created": null, "deployed": null, "configured": null }, Look like remove didn't work root@ceph1:~# ceph orch rm cephadm Failed to remove service. was not found. root@ceph1:~# ceph orch rm cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d Failed to remove service. was not found. On Fri, Sep 2, 2022 at 8:27 AM Adam King wrote: > this looks like an old traceback you would get if you ended up with a > service type that shouldn't be there somehow. The things I'd probably check > are that "cephadm ls" on either host definitely doesn't report and strange > things that aren't actually daemons in your cluster such as > "cephadm.". Another thing you could maybe try, as I believe the > assertion it's giving is for an unknown service type here ("AssertionError: > cephadm"), is just "ceph orch rm cephadm" which would maybe cause it to > remove whatever it thinks is this "cephadm" service that it has deployed. > Lastly, you could try having the mgr you manually deploy be a 16.2.10 one > instead of 15.2.17 (I'm assuming here, but the line numbers in that > traceback suggest octopus). The 16.2.10 one is just much less likely to > have a bug that causes something like this. > > On Fri, Sep 2, 2022 at 1:41 AM Satish Patel wrote: > >> Now when I run "ceph orch ps" it works but the following command throws an >> error. Trying to bring up second mgr using ceph orch apply mgr command >> but >> didn't help >> >> root@ceph1:/ceph-disk# ceph version >> ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus >> (stable) >> >> root@ceph1:/ceph-disk# ceph orch ls >> Error EINVAL: Traceback (most recent call last): >> File "/usr/share/ceph/mgr/mgr_module.py", line 1212, in _handle_command >> return self.handle_command(inbuf, cmd) >> File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 140, in >> handle_command >> return dispatch[cmd['prefix']].call(self, cmd, inbuf) >> File "/usr/share/ceph/mgr/mgr_module.py", line 320, in call >> return self.func(mgr, **kwargs) >> File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 102, in >> >> wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, >> **l_kwargs) >> File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 91, in >> wrapper >> return func(*args, **kwargs) >> File "/usr/share/ceph/mgr/orchestrator/module.py", line 503, in >> _list_services >> raise_if_exception(completion) >> File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 642, in >> raise_if_exception >> raise e >> AssertionError: cephadm >> >> On Fri, Sep 2, 2022 at 1:32 AM Satish Patel wrote: >> >> > nevermind, i found doc related that and i am able to get 1 mgr up - >> > >> https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-mgr-daemon >> > >> > >> > On Fri, Sep 2, 2022 at 1:21 AM Satish Patel >> wrote: >> > >> >> Folks, >> >> >> >> I am having little fun time with cephadm and it's very annoying to deal >> >> with it >> >> >> >> I have deployed a ceph cluster using cephadm on two nodes. 
Now when i >> was >> >> trying to upgrade and noticed hiccups where it just upgraded a single >> mgr >> >> with 16.2.10 but not other so i started messing around and somehow I >> >> deleted both mgr in the thought that cephadm will recreate them. >> >> >> >> Now i don't have any single mgr so my ceph orch command hangs forever >> and >> >> looks like a chicken egg issue. >> >> >> >> How do I recover from this? If I can't run the ceph orch command, I >> won't >> >> be able to redeploy my mgr daemons. >> >> >> >> I am not able to find any mgr in the following command on both nodes. >> >> >> >> $ cephadm ls | grep mgr >> >> >> > >> ___ >> ceph-users mailing list -- ceph-users@ceph.io >> To unsubscribe send an email to ceph-users-le...@ceph.io >> >> ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
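A note that may save others some confusion here: "ceph orch rm" expects a *service* name (the names shown by "ceph orch ls", e.g. mgr or osd.osd_spec_default), while individual daemons are removed with "ceph orch daemon rm". A sketch:

ceph orch ls                                  # lists service names
ceph orch daemon rm <daemon-name> --force     # removes one daemon, name as shown by "ceph orch ps"

Neither helps for an entry that only exists as a leftover directory on disk, which is why deleting the cephadm.<hash> directory on the host (discussed earlier in the thread) is the relevant cleanup in this case.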
[ceph-users] Re: [cephadm] mgr: no daemons active
Now when I run "ceph orch ps" it works but the following command throws an error. Trying to bring up second mgr using ceph orch apply mgr command but didn't help root@ceph1:/ceph-disk# ceph version ceph version 15.2.17 (8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable) root@ceph1:/ceph-disk# ceph orch ls Error EINVAL: Traceback (most recent call last): File "/usr/share/ceph/mgr/mgr_module.py", line 1212, in _handle_command return self.handle_command(inbuf, cmd) File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 140, in handle_command return dispatch[cmd['prefix']].call(self, cmd, inbuf) File "/usr/share/ceph/mgr/mgr_module.py", line 320, in call return self.func(mgr, **kwargs) File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 102, in wrapper_copy = lambda *l_args, **l_kwargs: wrapper(*l_args, **l_kwargs) File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 91, in wrapper return func(*args, **kwargs) File "/usr/share/ceph/mgr/orchestrator/module.py", line 503, in _list_services raise_if_exception(completion) File "/usr/share/ceph/mgr/orchestrator/_interface.py", line 642, in raise_if_exception raise e AssertionError: cephadm On Fri, Sep 2, 2022 at 1:32 AM Satish Patel wrote: > nevermind, i found doc related that and i am able to get 1 mgr up - > https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-mgr-daemon > > > On Fri, Sep 2, 2022 at 1:21 AM Satish Patel wrote: > >> Folks, >> >> I am having little fun time with cephadm and it's very annoying to deal >> with it >> >> I have deployed a ceph cluster using cephadm on two nodes. Now when i was >> trying to upgrade and noticed hiccups where it just upgraded a single mgr >> with 16.2.10 but not other so i started messing around and somehow I >> deleted both mgr in the thought that cephadm will recreate them. >> >> Now i don't have any single mgr so my ceph orch command hangs forever and >> looks like a chicken egg issue. >> >> How do I recover from this? If I can't run the ceph orch command, I won't >> be able to redeploy my mgr daemons. >> >> I am not able to find any mgr in the following command on both nodes. >> >> $ cephadm ls | grep mgr >> > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: [cephadm] mgr: no daemons active
nevermind, i found doc related that and i am able to get 1 mgr up - https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-mgr-daemon On Fri, Sep 2, 2022 at 1:21 AM Satish Patel wrote: > Folks, > > I am having little fun time with cephadm and it's very annoying to deal > with it > > I have deployed a ceph cluster using cephadm on two nodes. Now when i was > trying to upgrade and noticed hiccups where it just upgraded a single mgr > with 16.2.10 but not other so i started messing around and somehow I > deleted both mgr in the thought that cephadm will recreate them. > > Now i don't have any single mgr so my ceph orch command hangs forever and > looks like a chicken egg issue. > > How do I recover from this? If I can't run the ceph orch command, I won't > be able to redeploy my mgr daemons. > > I am not able to find any mgr in the following command on both nodes. > > $ cephadm ls | grep mgr > ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
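For completeness, the manual mgr recovery that the linked troubleshooting page describes boils down to roughly the following. This is a from-memory sketch, not a copy of the doc: the daemon name, image tag and exact flags here are assumptions, so follow the page itself rather than this outline when doing it for real.

ceph config-key set mgr/cephadm/pause true    # keep cephadm from interfering while deploying by hand
ceph auth get-or-create mgr.ceph1.recovery mon 'profile mgr' osd 'allow *' mds 'allow *' > mgr.keyring
ceph config generate-minimal-conf > minimal.conf
cephadm --image quay.io/ceph/ceph:v15 deploy --fsid f270ad9e-1f6f-11ed-b6f8-a539d87379ea \
    --name mgr.ceph1.recovery --config minimal.conf --keyring mgr.keyring
ceph config-key set mgr/cephadm/pause false   # re-enable the scheduler afterwards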
[ceph-users] [cephadm] mgr: no daemons active
Folks, I am having little fun time with cephadm and it's very annoying to deal with it I have deployed a ceph cluster using cephadm on two nodes. Now when i was trying to upgrade and noticed hiccups where it just upgraded a single mgr with 16.2.10 but not other so i started messing around and somehow I deleted both mgr in the thought that cephadm will recreate them. Now i don't have any single mgr so my ceph orch command hangs forever and looks like a chicken egg issue. How do I recover from this? If I can't run the ceph orch command, I won't be able to redeploy my mgr daemons. I am not able to find any mgr in the following command on both nodes. $ cephadm ls | grep mgr ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: [cephadm] Found duplicate OSDs
Great, thanks! Don't ask me how many commands I have typed to fix my issue. Finally I did it. Basically i fix /etc/hosts and then i remove mgr service using following command ceph orch daemon rm mgr.ceph1.xmbvsb And cephadm auto deployed a new working mgr. I found ceph orch ps was hanging and the solution I found was to restart all ceph daemon using ( systemctl restart ceph.target ) command. root@ceph1:/ceph-disk# ceph orch ps NAME HOST PORTSSTATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID alertmanager.ceph1 ceph1 running (12m) 9m ago 2w 16.0M- 0.20.0 0881eb8f169f d064a0177439 crash.ceph1 ceph1 running (49m) 9m ago 2w 7963k- 15.2.17 93146564743f 550b088467e4 crash.ceph2 ceph2 running (35m) 9m ago 13d 7287k- 15.2.17 93146564743f c4b5b3327fa5 grafana.ceph1ceph1 running (14m) 9m ago 2w 34.9M- 6.7.4557c83e11646 46048ebff031 mgr.ceph1.hxsfrs ceph1 *:8443,9283 running (13m) 9m ago 13m 327M- 15.2.17 93146564743f 4c5169890e9d mgr.ceph2.hmbdla ceph2 running (35m) 9m ago 13d 435M- 16.2.10 0d668911f040 361d58a423cd mon.ceph1ceph1 running (49m) 9m ago 2w 85.5M2048M 15.2.17 93146564743f a5f055953256 node-exporter.ceph1 ceph1 running (14m) 9m ago 2w 32.9M- 0.18.1 e5a616e4b9cf 833cc2e6c9ed node-exporter.ceph2 ceph2 running (13m) 9m ago 13d 33.9M- 0.18.1 e5a616e4b9cf 30d15dde3860 osd.0ceph1 running (49m) 9m ago 13d 355M4096M 15.2.17 93146564743f 6e9bee5c211e osd.1ceph1 running (49m) 9m ago 13d 372M4096M 15.2.17 93146564743f 09b8616bc096 osd.2ceph2 running (35m) 9m ago 13d 287M4096M 15.2.17 93146564743f 20f75a1b5221 osd.3ceph2 running (35m) 9m ago 13d 300M4096M 15.2.17 93146564743f c57154355b03 prometheus.ceph1 ceph1 running (12m) 9m ago 2w 89.5M- 2.18.1 de242295e225 b5ff35307ac0 Now I am going to start an upgrade process next. I will keep you posted to see how it goes. On Thu, Sep 1, 2022 at 10:06 PM Adam King wrote: > I'm not sure exactly what needs to be done to fix that, but I'd imagine > just editing the /etc/hosts file on all your hosts to be correct would be > the start (the cephadm shell would have taken its /etc/hosts off of > whatever host you ran the shell from). Unfortunately I'm not much of a > networking expert and if you have some sort of DNS stuff going on for your > local network I'm not too sure what to do there, but if it's possible just > fixing the /etc/hosts entries will resolve things. Either way, once you've > got the networking fixed so ssh-ing to the hosts works as expected with the > IPs you might need to re-add one or both of the hosts to the cluster with > the correct IP as well ( "ceph orch host add "). I believe > if you just run the orch host add command again with a different IP but the > same hostname it will just change the IP cephadm has stored for the host. > If that isn't working, running "ceph orch host rm --force" > beforehand should make it work (if you just remove the host with --force it > shouldn't touch the host's daemons and should therefore be a relatively > sage operation). In the end, the IP cephadm lists for each host in "ceph > orch host ls" must be an IP that allows correctly ssh-ing to the host. > > On Thu, Sep 1, 2022 at 9:17 PM Satish Patel wrote: > >> Hi Adam, >> >> You are correct, look like it was a naming issue in my /etc/hosts file. >> Is there a way to correct it? >> >> If you see i have ceph1 two time. 
:( >> >> 10.73.0.191 ceph1.example.com ceph1 >> 10.73.0.192 ceph2.example.com ceph1 >> >> On Thu, Sep 1, 2022 at 8:06 PM Adam King wrote: >> >>> the naming for daemons is a bit different for each daemon type, but for >>> mgr daemons it's always "mgr..". The daemons >>> cephadm will be able to find for something like a daemon redeploy are >>> pretty much always whatever is reported in "ceph orch ps". Given that >>> "mgr.ceph1.xmbvsb" isn't listed there, it's not surprising it said it >>> couldn't find it. >>> >>> There is definitely something very odd going on here. It looks like the >>> crash daemons as well are reporting a duplicate "crash.ceph2" on both ceph1 >>> and ceph2. Going back to your original orch ps output from the first email, >>> it seems that every daemon see
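Since the quoted reply above suggests re-adding a host to correct the address cephadm has stored, here is what that looks like in practice, as a sketch with placeholders (per that reply, removing a host with --force leaves its daemons in place, but be careful on a production cluster):

ceph orch host ls                     # check which address cephadm has stored for each host
ceph orch host rm <hostname> --force
ceph orch host add <hostname> <correct-ip>
ceph orch ps --refresh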
[ceph-users] Re: [cephadm] Found duplicate OSDs
Hi Adam, You are correct, look like it was a naming issue in my /etc/hosts file. Is there a way to correct it? If you see i have ceph1 two time. :( 10.73.0.191 ceph1.example.com ceph1 10.73.0.192 ceph2.example.com ceph1 On Thu, Sep 1, 2022 at 8:06 PM Adam King wrote: > the naming for daemons is a bit different for each daemon type, but for > mgr daemons it's always "mgr..". The daemons > cephadm will be able to find for something like a daemon redeploy are > pretty much always whatever is reported in "ceph orch ps". Given that > "mgr.ceph1.xmbvsb" isn't listed there, it's not surprising it said it > couldn't find it. > > There is definitely something very odd going on here. It looks like the > crash daemons as well are reporting a duplicate "crash.ceph2" on both ceph1 > and ceph2. Going back to your original orch ps output from the first email, > it seems that every daemon seems to have a duplicate and none of the actual > daemons listed in the "cephadm ls" on ceph1 are actually being reported in > the orch ps output. I think something may have gone wrong with the host and > networking setup here and it seems to be reporting ceph2 daemons as the > daemons for both ceph1 and ceph2 as if trying to connect to ceph1 ends up > connecting to ceph2. The only time I've seen anything like this was when I > made a mistake and setup a virtual IP on one host that was the same as the > actual IP for another host on the cluster and cephadm basically ended up > ssh-ing to the same host via both IPs (the one that was supposed to be for > host A and host B where the virtual IP matching host B was setup on host > A). I doubt you're in that exact situation, but I think we need to look > very closely at the networking setup here. I would try opening up a cephadm > shell and ssh-ing to each of the two hosts by the IP listed in "ceph orch > host ls" and make sure you actually get to the correct host and it has the > correct hostname. Given the output, I wouldn't be surprised if trying to > connect to ceph1's IP landed you on ceph2 or vice versa. I will say I found > it a bit odd originally when I saw the two IPs were 10.73.0.192 and > 10.73.3.192. There's nothing necessarily wrong with that, but typically IPs > on the host are more likely to differ at the end than in the middle (e.g. > 192.168.122.1 and 192.168.122.2 rather than 192.168.1.122 and > 192.168.2.122) and it did make me wonder if a mistake had occurred in the > networking. Either way, there's clearly something making it think ceph2's > daemons are on both ceph1 and ceph2 and some sort of networking issue is > the only thing I'm aware of currently that causes something like that. > > On Thu, Sep 1, 2022 at 6:30 PM Satish Patel wrote: > >> Hi Adam, >> >> I have also noticed a very strange thing which is Duplicate name in the >> following output. Is this normal? I don't know how it got here. Is there >> a way I can rename them? 
>> >> root@ceph1:~# ceph orch ps >> NAME HOST PORTSSTATUS REFRESHED AGE >> MEM USE MEM LIM VERSIONIMAGE ID CONTAINER ID >> alertmanager.ceph1 ceph1 *:9093,9094 starting-- >> -- >> crash.ceph2 ceph1 running (13d) 10s ago 13d >> 10.0M- 15.2.1793146564743f 0a009254afb0 >> crash.ceph2 ceph2 running (13d) 10s ago 13d >> 10.0M- 15.2.1793146564743f 0a009254afb0 >> grafana.ceph1ceph1 *:3000 starting-- >> -- >> mgr.ceph2.hmbdla ceph1 running (103m)10s ago 13d >> 518M- 16.2.100d668911f040 745245c18d5e >> mgr.ceph2.hmbdla ceph2 running (103m)10s ago 13d >> 518M- 16.2.100d668911f040 745245c18d5e >> node-exporter.ceph2 ceph1 running (7h) 10s ago 13d >> 70.2M- 0.18.1 e5a616e4b9cf d0ba04bb977c >> node-exporter.ceph2 ceph2 running (7h) 10s ago 13d >> 70.2M- 0.18.1 e5a616e4b9cf d0ba04bb977c >> osd.2ceph1 running (19h) 10s ago 13d >> 901M4096M 15.2.1793146564743f e286fb1c6302 >> osd.2ceph2 running (19h) 10s ago 13d >> 901M4096M 15.2.1793146564743f e286fb1c6302 >> osd.3ceph1 running (19h) 10s ago 13d >> 1006M4096M 15.2.1793146564743f d3ae5d9f694f >> osd.3ceph2 running (19h) 10s ago 13d >> 1006M4096M 15.2.1793146564
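About the duplicate alias shown above (ceph1 appearing at the end of both /etc/hosts lines): assuming the 10.73.0.191/10.73.0.192 addresses are the intended ones (note they do not match the addresses cephadm has stored in "ceph orch host ls", which also needs reconciling), the immediate fix is to make each line name its own host, e.g.:

10.73.0.191 ceph1.example.com ceph1
10.73.0.192 ceph2.example.com ceph2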
[ceph-users] Re: [cephadm] Found duplicate OSDs
Hi Adam, I have also noticed a very strange thing which is Duplicate name in the following output. Is this normal? I don't know how it got here. Is there a way I can rename them? root@ceph1:~# ceph orch ps NAME HOST PORTSSTATUS REFRESHED AGE MEM USE MEM LIM VERSIONIMAGE ID CONTAINER ID alertmanager.ceph1 ceph1 *:9093,9094 starting-- -- crash.ceph2 ceph1 running (13d) 10s ago 13d 10.0M- 15.2.1793146564743f 0a009254afb0 crash.ceph2 ceph2 running (13d) 10s ago 13d 10.0M- 15.2.1793146564743f 0a009254afb0 grafana.ceph1ceph1 *:3000 starting-- -- mgr.ceph2.hmbdla ceph1 running (103m)10s ago 13d 518M- 16.2.100d668911f040 745245c18d5e mgr.ceph2.hmbdla ceph2 running (103m)10s ago 13d 518M- 16.2.100d668911f040 745245c18d5e node-exporter.ceph2 ceph1 running (7h) 10s ago 13d 70.2M- 0.18.1 e5a616e4b9cf d0ba04bb977c node-exporter.ceph2 ceph2 running (7h) 10s ago 13d 70.2M- 0.18.1 e5a616e4b9cf d0ba04bb977c osd.2ceph1 running (19h) 10s ago 13d 901M4096M 15.2.1793146564743f e286fb1c6302 osd.2ceph2 running (19h) 10s ago 13d 901M4096M 15.2.1793146564743f e286fb1c6302 osd.3ceph1 running (19h) 10s ago 13d 1006M4096M 15.2.1793146564743f d3ae5d9f694f osd.3ceph2 running (19h) 10s ago 13d 1006M4096M 15.2.1793146564743f d3ae5d9f694f osd.5ceph1 running (19h) 10s ago 9d 222M4096M 15.2.1793146564743f 405068fb474e osd.5ceph2 running (19h) 10s ago 9d 222M4096M 15.2.1793146564743f 405068fb474e prometheus.ceph1 ceph1 *:9095 running (15s) 10s ago 15s 30.6M- 514e6a882f6e 65a0acfed605 prometheus.ceph1 ceph2 *:9095 running (15s) 10s ago 15s 30.6M- 514e6a882f6e 65a0acfed605 I found the following example link which has all different names, how does cephadm decide naming? https://achchusnulchikam.medium.com/deploy-ceph-cluster-with-cephadm-on-centos-8-257b300e7b42 On Thu, Sep 1, 2022 at 6:20 PM Satish Patel wrote: > Hi Adam, > > Getting the following error, not sure why it's not able to find it. > > root@ceph1:~# ceph orch daemon redeploy mgr.ceph1.xmbvsb > Error EINVAL: Unable to find mgr.ceph1.xmbvsb daemon(s) > > On Thu, Sep 1, 2022 at 5:57 PM Adam King wrote: > >> what happens if you run `ceph orch daemon redeploy mgr.ceph1.xmbvsb`? >> >> On Thu, Sep 1, 2022 at 5:12 PM Satish Patel wrote: >> >>> Hi Adam, >>> >>> Here is requested output >>> >>> root@ceph1:~# ceph health detail >>> HEALTH_WARN 4 stray daemon(s) not managed by cephadm >>> [WRN] CEPHADM_STRAY_DAEMON: 4 stray daemon(s) not managed by cephadm >>> stray daemon mon.ceph1 on host ceph1 not managed by cephadm >>> stray daemon osd.0 on host ceph1 not managed by cephadm >>> stray daemon osd.1 on host ceph1 not managed by cephadm >>> stray daemon osd.4 on host ceph1 not managed by cephadm >>> >>> >>> root@ceph1:~# ceph orch host ls >>> HOST ADDR LABELS STATUS >>> ceph1 10.73.0.192 >>> ceph2 10.73.3.192 _admin >>> 2 hosts in cluster >>> >>> >>> My cephadm ls saying mgr is in error state >>> >>> { >>> "style": "cephadm:v1", >>> "name": "mgr.ceph1.xmbvsb", >>> "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea", >>> "systemd_unit": >>> "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mgr.ceph1.xmbvsb", >>> "enabled": true, >>> "state": "error", >>> "container_id": null, >>> "container_image_name": "quay.io/ceph/ceph:v15", >>> "container_image_id": null, >>> "version": null, >>> "started": null, >>> "created": "2022-09-01T20:59:49.314347Z", >>> "deployed": "2022-09-01T20:59:48.718347Z", >>> "configured": "2022-09-01T20:59:49.314347Z" >>> }, >>> >>> >>> Getting error >>> >&
[ceph-users] Re: [cephadm] Found duplicate OSDs
Hi Adam, Getting the following error, not sure why it's not able to find it. root@ceph1:~# ceph orch daemon redeploy mgr.ceph1.xmbvsb Error EINVAL: Unable to find mgr.ceph1.xmbvsb daemon(s) On Thu, Sep 1, 2022 at 5:57 PM Adam King wrote: > what happens if you run `ceph orch daemon redeploy mgr.ceph1.xmbvsb`? > > On Thu, Sep 1, 2022 at 5:12 PM Satish Patel wrote: > >> Hi Adam, >> >> Here is requested output >> >> root@ceph1:~# ceph health detail >> HEALTH_WARN 4 stray daemon(s) not managed by cephadm >> [WRN] CEPHADM_STRAY_DAEMON: 4 stray daemon(s) not managed by cephadm >> stray daemon mon.ceph1 on host ceph1 not managed by cephadm >> stray daemon osd.0 on host ceph1 not managed by cephadm >> stray daemon osd.1 on host ceph1 not managed by cephadm >> stray daemon osd.4 on host ceph1 not managed by cephadm >> >> >> root@ceph1:~# ceph orch host ls >> HOST ADDR LABELS STATUS >> ceph1 10.73.0.192 >> ceph2 10.73.3.192 _admin >> 2 hosts in cluster >> >> >> My cephadm ls saying mgr is in error state >> >> { >> "style": "cephadm:v1", >> "name": "mgr.ceph1.xmbvsb", >> "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea", >> "systemd_unit": >> "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mgr.ceph1.xmbvsb", >> "enabled": true, >> "state": "error", >> "container_id": null, >> "container_image_name": "quay.io/ceph/ceph:v15", >> "container_image_id": null, >> "version": null, >> "started": null, >> "created": "2022-09-01T20:59:49.314347Z", >> "deployed": "2022-09-01T20:59:48.718347Z", >> "configured": "2022-09-01T20:59:49.314347Z" >> }, >> >> >> Getting error >> >> root@ceph1:~# cephadm unit --fsid f270ad9e-1f6f-11ed-b6f8-a539d87379ea >> --name mgr.ceph1.xmbvsb start >> stderr Job for >> ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mgr.ceph1.xmbvsb.service >> failed because the control process exited with error code. >> stderr See "systemctl status >> ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mgr.ceph1.xmbvsb.service" and >> "journalctl -xe" for details. >> Traceback (most recent call last): >> File "/usr/sbin/cephadm", line 6250, in >> r = args.func() >> File "/usr/sbin/cephadm", line 1357, in _infer_fsid >> return func() >> File "/usr/sbin/cephadm", line 3727, in command_unit >> call_throws([ >> File "/usr/sbin/cephadm", line 1119, in call_throws >> raise RuntimeError('Failed command: %s' % ' '.join(command)) >> RuntimeError: Failed command: systemctl start >> ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mgr.ceph1.xmbvsb >> >> >> How do I remove and re-deploy mgr? >> >> On Thu, Sep 1, 2022 at 4:54 PM Adam King wrote: >> >>> cephadm deploys the containers with --rm so they will get removed if you >>> stop them. As for getting the 2nd mgr back, if it still lists the 2nd one >>> in `ceph orch ps` you should be able to do a `ceph orch daemon redeploy >>> ` where should match the name given in >>> the orch ps output for the one that isn't actually up. If it isn't listed >>> there, given you have a count of 2, cephadm should deploy another one. I do >>> see in the orch ls output you posted that it says the mgr service has "2/2" >>> running which implies it believes a 2nd mgr is present (and you would >>> therefore be able to try the daemon redeploy if that daemon isn't actually >>> there). >>> >>> Is it still reporting the duplicate osds in orch ps? I see in the >>> cephadm ls output on ceph1 that osd.2 isn't being reported, which was >>> reported as being on ceph1 in the orch ps output in your original message >>> in this thread. 
I'm interested in what `ceph health detail` is reporting >>> now as well, as it says there are 4 stray daemons. Also, the `ceph orch >>> host ls` output just to get a better grasp of the topology of this cluster. >>> >>> On Thu, Sep 1, 2022 at 3:50 PM Satish Patel >>> wrote: >>> >>>> Adam, >>>> >>>> I have posted a question related to upgrading earlier and this thread >>>> is related to that,
[ceph-users] Re: [cephadm] Found duplicate OSDs
Hi Adam, Here is requested output root@ceph1:~# ceph health detail HEALTH_WARN 4 stray daemon(s) not managed by cephadm [WRN] CEPHADM_STRAY_DAEMON: 4 stray daemon(s) not managed by cephadm stray daemon mon.ceph1 on host ceph1 not managed by cephadm stray daemon osd.0 on host ceph1 not managed by cephadm stray daemon osd.1 on host ceph1 not managed by cephadm stray daemon osd.4 on host ceph1 not managed by cephadm root@ceph1:~# ceph orch host ls HOST ADDR LABELS STATUS ceph1 10.73.0.192 ceph2 10.73.3.192 _admin 2 hosts in cluster My cephadm ls saying mgr is in error state { "style": "cephadm:v1", "name": "mgr.ceph1.xmbvsb", "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea", "systemd_unit": "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mgr.ceph1.xmbvsb", "enabled": true, "state": "error", "container_id": null, "container_image_name": "quay.io/ceph/ceph:v15", "container_image_id": null, "version": null, "started": null, "created": "2022-09-01T20:59:49.314347Z", "deployed": "2022-09-01T20:59:48.718347Z", "configured": "2022-09-01T20:59:49.314347Z" }, Getting error root@ceph1:~# cephadm unit --fsid f270ad9e-1f6f-11ed-b6f8-a539d87379ea --name mgr.ceph1.xmbvsb start stderr Job for ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mgr.ceph1.xmbvsb.service failed because the control process exited with error code. stderr See "systemctl status ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mgr.ceph1.xmbvsb.service" and "journalctl -xe" for details. Traceback (most recent call last): File "/usr/sbin/cephadm", line 6250, in r = args.func() File "/usr/sbin/cephadm", line 1357, in _infer_fsid return func() File "/usr/sbin/cephadm", line 3727, in command_unit call_throws([ File "/usr/sbin/cephadm", line 1119, in call_throws raise RuntimeError('Failed command: %s' % ' '.join(command)) RuntimeError: Failed command: systemctl start ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mgr.ceph1.xmbvsb How do I remove and re-deploy mgr? On Thu, Sep 1, 2022 at 4:54 PM Adam King wrote: > cephadm deploys the containers with --rm so they will get removed if you > stop them. As for getting the 2nd mgr back, if it still lists the 2nd one > in `ceph orch ps` you should be able to do a `ceph orch daemon redeploy > ` where should match the name given in > the orch ps output for the one that isn't actually up. If it isn't listed > there, given you have a count of 2, cephadm should deploy another one. I do > see in the orch ls output you posted that it says the mgr service has "2/2" > running which implies it believes a 2nd mgr is present (and you would > therefore be able to try the daemon redeploy if that daemon isn't actually > there). > > Is it still reporting the duplicate osds in orch ps? I see in the cephadm > ls output on ceph1 that osd.2 isn't being reported, which was reported as > being on ceph1 in the orch ps output in your original message in this > thread. I'm interested in what `ceph health detail` is reporting now as > well, as it says there are 4 stray daemons. Also, the `ceph orch host ls` > output just to get a better grasp of the topology of this cluster. > > On Thu, Sep 1, 2022 at 3:50 PM Satish Patel wrote: > >> Adam, >> >> I have posted a question related to upgrading earlier and this thread is >> related to that, I have opened a new one because I found that error in logs >> and thought the upgrade may be stuck because of duplicate OSDs. 
>> >> root@ceph1:~# ls -l /var/lib/ceph/f270ad9e-1f6f-11ed-b6f8-a539d87379ea/ >> total 44 >> drwx-- 3 nobody nogroup 4096 Aug 19 05:37 alertmanager.ceph1 >> drwx-- 3167 167 4096 Aug 19 05:36 crash >> drwx-- 2167 167 4096 Aug 19 05:37 crash.ceph1 >> drwx-- 4998 996 4096 Aug 19 05:37 grafana.ceph1 >> drwx-- 2167 167 4096 Aug 19 05:36 mgr.ceph1.xmbvsb >> drwx-- 3167 167 4096 Aug 19 05:36 mon.ceph1 >> drwx-- 2 nobody nogroup 4096 Aug 19 05:37 node-exporter.ceph1 >> drwx-- 2167 167 4096 Aug 19 17:55 osd.0 >> drwx-- 2167 167 4096 Aug 19 18:03 osd.1 >> drwx-- 2167 167 4096 Aug 31 05:20 osd.4 >> drwx-- 4 nobody nogroup 4096 Aug 19 05:38 prometheus.ceph1 >> >> Here is the output of cephadm ls >> >> root@ceph1:~# cephadm ls >> [ >> { >> "style": "cephadm:v1", >> "name&qu
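When "cephadm unit ... start" fails like this, the reason is almost always in the daemon's own logs rather than in the cephadm output. A sketch of where to look, using the fsid and daemon name from this thread:

cephadm logs --fsid f270ad9e-1f6f-11ed-b6f8-a539d87379ea --name mgr.ceph1.xmbvsb
# which is essentially a wrapper around:
journalctl -u ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@mgr.ceph1.xmbvsb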
[ceph-users] Re: [cephadm] Found duplicate OSDs
ed": "2022-08-19T03:38:06.487603Z" }, { "style": "cephadm:v1", "name": "osd.4", "fsid": "f270ad9e-1f6f-11ed-b6f8-a539d87379ea", "systemd_unit": "ceph-f270ad9e-1f6f-11ed-b6f8-a539d87379ea@osd.4", "enabled": true, "state": "running", "container_id": "938840fe7fd0cb45cc26d077837c9847d7c7a7a68c7e1588d4bb4343c695a071", "container_image_name": "quay.io/ceph/ceph:v15", "container_image_id": "93146564743febec815d6a764dad93fc07ce971e88315403ac508cb5da6d35f4", "version": "15.2.17", "started": "2022-08-31T03:20:55.416219Z", "created": "2022-08-23T21:46:49.458533Z", "deployed": "2022-08-23T21:46:48.818533Z", "configured": "2022-08-31T02:53:41.196643Z" } ] I have noticed one more thing, I did docker stop on ceph1 node and now my mgr container disappeared, I can't see it anywhere and not sure how do i bring back mgr because upgrade won't let me do anything if i don't have two mgr instance. root@ceph1:~# ceph -s cluster: id: f270ad9e-1f6f-11ed-b6f8-a539d87379ea health: HEALTH_WARN 4 stray daemon(s) not managed by cephadm services: mon: 1 daemons, quorum ceph1 (age 17h) mgr: ceph2.hmbdla(active, since 5h) osd: 6 osds: 6 up (since 40h), 6 in (since 8d) data: pools: 6 pools, 161 pgs objects: 20.59k objects, 85 GiB usage: 174 GiB used, 826 GiB / 1000 GiB avail pgs: 161 active+clean io: client: 0 B/s rd, 12 KiB/s wr, 0 op/s rd, 2 op/s wr progress: Upgrade to quay.io/ceph/ceph:16.2.10 (0s) [] I can see mgr count:2 but not sure how do i bring it back root@ceph1:~# ceph orch ls NAME PORTSRUNNING REFRESHED AGE PLACEMENT alertmanager ?:9093,9094 1/1 20s ago13d count:1 crash 2/2 20s ago13d * grafana?:3000 1/1 20s ago13d count:1 mgr 2/2 20s ago13d count:2 mon 0/5 - 13d node-exporter ?:9100 2/2 20s ago13d * osd 6 20s ago- osd.all-available-devices 0 - 13d * osd.osd_spec_default 0 - 8d * prometheus ?:9095 1/1 20s ago13d count:1 On Thu, Sep 1, 2022 at 12:28 PM Adam King wrote: > Are there any extra directories in /var/lib/ceph or /var/lib/ceph/ > that appear to be for those OSDs on that host? When cephadm builds the info > it uses for "ceph orch ps" it's actually scraping those directories. The > output of "cephadm ls" on the host with the duplicates could also > potentially have some insights. > > On Thu, Sep 1, 2022 at 12:15 PM Satish Patel wrote: > >> Folks, >> >> I am playing with cephadm and life was good until I started upgrading from >> octopus to pacific. My upgrade process stuck after upgrading mgr and in >> logs now i can see following error >> >> root@ceph1:~# ceph log last cephadm >> 2022-09-01T14:40:45.739804+ mgr.ceph2.hmbdla (mgr.265806) 8 : >> cephadm [INF] Deploying daemon grafana.ceph1 on ceph1 >> 2022-09-01T14:40:56.115693+ mgr.ceph2.hmbdla (mgr.265806) 14 : >> cephadm [INF] Deploying daemon prometheus.ceph1 on ceph1 >> 2022-09-01T14:41:11.856725+ mgr.ceph2.hmbdla (mgr.265806) 25 : >> cephadm [INF] Reconfiguring alertmanager.ceph1 (dependencies >> changed)... >> 2022-09-01T14:41:11.861535+ mgr.ceph2.hmbdla (mgr.265806) 26 : >> cephadm [INF] Reconfiguring daemon alertmanager.ceph1 on ceph1 >> 2022-09-01T14:41:12.927852+ mgr.ceph2.hmbdla (mgr.265806) 27 : >> cephadm [INF] Reconfiguring grafana.ceph1 (dependencies changed)... 
>> 2022-09-01T14:41:12.940615+ mgr.ceph2.hmbdla (mgr.265806) 28 : >> cephadm [INF] Reconfiguring daemon grafana.ceph1 on ceph1 >> 2022-09-01T14:41:14.056113+ mgr.ceph2.hmbdla (mgr.265806) 33 : >> cephadm [INF] Found duplicate OSDs: osd.2 in status running on ceph1, >> osd.2 in status running on ceph2 >> 2022-09-01T14:41:14.056437+ mgr.ceph2.hmbdla (mgr.265806) 34 : >> cephadm [INF] Found duplicate OSDs: osd.5 in status running on ceph1, >> osd.5 in status running on ceph2 >> 2022-09-01T14:41:14.056630+ mgr.ceph2.hmbdla (mgr.265806) 35 : >> cephadm [INF] Found duplicate OSDs: osd.3 in status running on ceph1, >> osd.3 in status running o
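Regarding the quoted advice to redeploy the missing mgr: the redeploy command only works for daemon names that "ceph orch ps" still lists, so the sequence is roughly (a sketch, with a placeholder name):

ceph orch ps --daemon-type mgr              # find the exact daemon name cephadm still knows about
ceph orch daemon redeploy <mgr-daemon-name>
# if no second mgr is listed at all, the mgr service spec (count:2) should make
# cephadm schedule a replacement on its next pass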
[ceph-users] [cephadm] Found duplicate OSDs
Folks, I am playing with cephadm and life was good until I started upgrading from octopus to pacific. My upgrade process stuck after upgrading mgr and in logs now i can see following error root@ceph1:~# ceph log last cephadm 2022-09-01T14:40:45.739804+ mgr.ceph2.hmbdla (mgr.265806) 8 : cephadm [INF] Deploying daemon grafana.ceph1 on ceph1 2022-09-01T14:40:56.115693+ mgr.ceph2.hmbdla (mgr.265806) 14 : cephadm [INF] Deploying daemon prometheus.ceph1 on ceph1 2022-09-01T14:41:11.856725+ mgr.ceph2.hmbdla (mgr.265806) 25 : cephadm [INF] Reconfiguring alertmanager.ceph1 (dependencies changed)... 2022-09-01T14:41:11.861535+ mgr.ceph2.hmbdla (mgr.265806) 26 : cephadm [INF] Reconfiguring daemon alertmanager.ceph1 on ceph1 2022-09-01T14:41:12.927852+ mgr.ceph2.hmbdla (mgr.265806) 27 : cephadm [INF] Reconfiguring grafana.ceph1 (dependencies changed)... 2022-09-01T14:41:12.940615+ mgr.ceph2.hmbdla (mgr.265806) 28 : cephadm [INF] Reconfiguring daemon grafana.ceph1 on ceph1 2022-09-01T14:41:14.056113+ mgr.ceph2.hmbdla (mgr.265806) 33 : cephadm [INF] Found duplicate OSDs: osd.2 in status running on ceph1, osd.2 in status running on ceph2 2022-09-01T14:41:14.056437+ mgr.ceph2.hmbdla (mgr.265806) 34 : cephadm [INF] Found duplicate OSDs: osd.5 in status running on ceph1, osd.5 in status running on ceph2 2022-09-01T14:41:14.056630+ mgr.ceph2.hmbdla (mgr.265806) 35 : cephadm [INF] Found duplicate OSDs: osd.3 in status running on ceph1, osd.3 in status running on ceph2 Not sure from where duplicate names came and how that happened. In following output i can't see any duplication root@ceph1:~# ceph osd tree ID CLASS WEIGHT TYPE NAME STATUS REWEIGHT PRI-AFF -1 0.97656 root default -3 0.48828 host ceph1 4hdd 0.09769 osd.4 up 1.0 1.0 0ssd 0.19530 osd.0 up 1.0 1.0 1ssd 0.19530 osd.1 up 1.0 1.0 -5 0.48828 host ceph2 5hdd 0.09769 osd.5 up 1.0 1.0 2ssd 0.19530 osd.2 up 1.0 1.0 3ssd 0.19530 osd.3 up 1.0 1.0 But same time i can see duplicate OSD number in ceph1 and ceph2 root@ceph1:~# ceph orch ps NAME HOST PORTSSTATUS REFRESHED AGE MEM USE MEM LIM VERSION IMAGE ID CONTAINER ID alertmanager.ceph1 ceph1 *:9093,9094 running (20s) 2s ago 20s 17.1M- ba2b418f427c 856a4fe641f1 alertmanager.ceph1 ceph2 *:9093,9094 running (20s) 3s ago 20s 17.1M- ba2b418f427c 856a4fe641f1 crash.ceph2 ceph1 running (12d) 2s ago 12d 10.0M- 15.2.17 93146564743f 0a009254afb0 crash.ceph2 ceph2 running (12d) 3s ago 12d 10.0M- 15.2.17 93146564743f 0a009254afb0 grafana.ceph1ceph1 *:3000 running (18s) 2s ago 19s 47.9M- 8.3.5dad864ee21e9 7d7a70b8ab7f grafana.ceph1ceph2 *:3000 running (18s) 3s ago 19s 47.9M- 8.3.5dad864ee21e9 7d7a70b8ab7f mgr.ceph2.hmbdla ceph1 running (13h) 2s ago 12d 506M- 16.2.10 0d668911f040 6274723c35f7 mgr.ceph2.hmbdla ceph2 running (13h) 3s ago 12d 506M- 16.2.10 0d668911f040 6274723c35f7 node-exporter.ceph2 ceph1 running (91m) 2s ago 12d 60.7M- 0.18.1 e5a616e4b9cf d0ba04bb977c node-exporter.ceph2 ceph2 running (91m) 3s ago 12d 60.7M- 0.18.1 e5a616e4b9cf d0ba04bb977c osd.2ceph1 running (12h) 2s ago 12d 867M4096M 15.2.17 93146564743f e286fb1c6302 osd.2ceph2 running (12h) 3s ago 12d 867M4096M 15.2.17 93146564743f e286fb1c6302 osd.3ceph1 running (12h) 2s ago 12d 978M4096M 15.2.17 93146564743f d3ae5d9f694f osd.3ceph2 running (12h) 3s ago 12d 978M4096M 15.2.17 93146564743f d3ae5d9f694f osd.5ceph1 running (12h) 2s ago 8d 225M4096M 15.2.17 93146564743f 405068fb474e osd.5ceph2 running (12h) 3s ago 8d 225M4096M 15.2.17 93146564743f 405068fb474e prometheus.ceph1 ceph1 *:9095 running (8s) 2s ago 8s 30.4M- 514e6a882f6e 9031dbe30cae 
prometheus.ceph1 ceph2 *:9095 running (8s) 3s ago 8s 30.4M- 514e6a882f6e 9031dbe30cae Is this a bug or did I do something wrong? any workaround to get out from this condition? ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
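One sanity check worth running when "ceph orch ps" shows every daemon on both hosts like this (the rest of the thread traces it back to a host/IP mix-up): have cephadm verify each host and confirm that the stored address really reaches the host it names. A sketch:

ceph orch host ls                 # the ADDR column must actually ssh to the host in the HOST column
ceph cephadm check-host ceph1
ceph cephadm check-host ceph2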
[ceph-users] cephadm upgrade from octopus to pacific stuck
Hi,

I have a small cluster in the lab which has only two nodes: a single monitor and two OSD nodes. I started an upgrade, but it appears to be stuck after upgrading the mgr daemons.

ceph orch upgrade start --ceph-version 16.2.10

root@ceph1:~# ceph -s
  cluster:
    id:     f270ad9e-1f6f-11ed-b6f8-a539d87379ea
    health: HEALTH_WARN
            5 stray daemon(s) not managed by cephadm

  services:
    mon: 1 daemons, quorum ceph1 (age 22m)
    mgr: ceph1.xmbvsb(active, since 21m), standbys: ceph2.hmbdla
    osd: 6 osds: 6 up (since 23h), 6 in (since 8d)

  data:
    pools:   6 pools, 161 pgs
    objects: 20.53k objects, 85 GiB
    usage:   173 GiB used, 826 GiB / 1000 GiB avail
    pgs:     161 active+clean

  io:
    client: 0 B/s rd, 2.7 KiB/s wr, 0 op/s rd, 0 op/s wr

  progress:
    Upgrade to quay.io/ceph/ceph:v16.2.10 (0s)
      []

root@ceph1:~# ceph health detail
HEALTH_WARN 5 stray daemon(s) not managed by cephadm
[WRN] CEPHADM_STRAY_DAEMON: 5 stray daemon(s) not managed by cephadm
    stray daemon mgr.ceph1.xmbvsb on host ceph1 not managed by cephadm
    stray daemon mon.ceph1 on host ceph1 not managed by cephadm
    stray daemon osd.0 on host ceph1 not managed by cephadm
    stray daemon osd.1 on host ceph1 not managed by cephadm
    stray daemon osd.4 on host ceph1 not managed by cephadm

root@ceph1:~# ceph log last cephadm
2022-09-01T02:46:12.020993+ mgr.ceph1.xmbvsb (mgr.254112) 437 : cephadm [INF] refreshing ceph2 facts
2022-09-01T02:47:12.016303+ mgr.ceph1.xmbvsb (mgr.254112) 469 : cephadm [INF] refreshing ceph1 facts
2022-09-01T02:47:12.431002+ mgr.ceph1.xmbvsb (mgr.254112) 470 : cephadm [INF] refreshing ceph2 facts
2022-09-01T02:48:12.424640+ mgr.ceph1.xmbvsb (mgr.254112) 501 : cephadm [INF] refreshing ceph1 facts
2022-09-01T02:48:12.839790+ mgr.ceph1.xmbvsb (mgr.254112) 502 : cephadm [INF] refreshing ceph2 facts
2022-09-01T02:49:12.836875+ mgr.ceph1.xmbvsb (mgr.254112) 534 : cephadm [INF] refreshing ceph1 facts
2022-09-01T02:49:13.210871+ mgr.ceph1.xmbvsb (mgr.254112) 535 : cephadm [INF] refreshing ceph2 facts
2022-09-01T02:50:13.207635+ mgr.ceph1.xmbvsb (mgr.254112) 566 : cephadm [INF] refreshing ceph1 facts
2022-09-01T02:50:13.615722+ mgr.ceph1.xmbvsb (mgr.254112) 568 : cephadm [INF] refreshing ceph2 facts

root@ceph1:~# ceph orch ps
NAME                 HOST   STATUS         REFRESHED  AGE  VERSION  IMAGE NAME                                IMAGE ID      CONTAINER ID
cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d  ceph1  stopped  3m ago  -
cephadm.7ce656a8721deb5054c37b0cfb90381522d521dde51fb0c5a2142314d663f63d  ceph2  stopped  3m ago  -
crash.ceph2          ceph1  running (12d)  3m ago     12d  15.2.17  quay.io/ceph/ceph:v15                     93146564743f  0a009254afb0
crash.ceph2          ceph2  running (12d)  3m ago     12d  15.2.17  quay.io/ceph/ceph:v15                     93146564743f  0a009254afb0
mgr.ceph2.hmbdla     ceph1  running (43m)  3m ago     12d  16.2.10  quay.io/ceph/ceph:v16.2.10                0d668911f040  6274723c35f7
mgr.ceph2.hmbdla     ceph2  running (43m)  3m ago     12d  16.2.10  quay.io/ceph/ceph:v16.2.10                0d668911f040  6274723c35f7
node-exporter.ceph2  ceph1  running (23m)  3m ago     12d  0.18.1   quay.io/prometheus/node-exporter:v0.18.1  e5a616e4b9cf  7a6217cb1a9e
node-exporter.ceph2  ceph2  running (23m)  3m ago     12d  0.18.1   quay.io/prometheus/node-exporter:v0.18.1  e5a616e4b9cf  7a6217cb1a9e
osd.2                ceph1  running (23h)  3m ago     12d  15.2.17  quay.io/ceph/ceph:v15                     93146564743f  e286fb1c6302
osd.2                ceph2  running (23h)  3m ago     12d  15.2.17  quay.io/ceph/ceph:v15                     93146564743f  e286fb1c6302
osd.3                ceph1  running (23h)  3m ago     12d  15.2.17  quay.io/ceph/ceph:v15                     93146564743f  d3ae5d9f694f
osd.3                ceph2  running (23h)  3m ago     12d  15.2.17  quay.io/ceph/ceph:v15                     93146564743f  d3ae5d9f694f
osd.5                ceph1  running (23h)  3m ago     8d   15.2.17  quay.io/ceph/ceph:v15                     93146564743f  405068fb474e
osd.5                ceph2  running (23h)  3m ago     8d   15.2.17  quay.io/ceph/ceph:v15                     93146564743f  405068fb474e

What could be wrong here, and how can I debug it? cephadm is new to me, so I am not sure where to look for logs.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
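For a stuck upgrade like the one above, a few commands are generally useful as a first pass (a sketch only; option names are taken from the cephadm troubleshooting docs and are worth double-checking against your release):

# ask the orchestrator what it thinks the upgrade is doing
ceph orch upgrade status

# raise cephadm logging to the cluster log, then watch it live
ceph config set mgr mgr/cephadm/log_to_cluster_level debug
ceph -W cephadm --watch-debug

# if the active mgr itself looks wedged, failing over to the standby
# often lets the upgrade state machine make progress again
ceph mgr fail

# pause/resume the upgrade without abandoning it
ceph orch upgrade pause
ceph orch upgrade resume

Remember to set mgr/cephadm/log_to_cluster_level back to its default once done.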
[ceph-users] Re: Benefits of dockerized ceph?
Hi,

I believe the main advantage of running Ceph dockerized is that it isolates the Ceph binaries from the OS and, as you said, makes upgrades easier. In my case I run the OSD and MON roles on the same servers, so containers give me better isolation when I want to upgrade a single component. Note that cephadm uses containers to deploy Ceph clusters in production.

On Wed, Aug 24, 2022 at 4:07 PM Boris wrote:

> Hi,
> I was just asked if we can switch to dockerized ceph, because it is easier
> to update.
>
> Last time I tried to use ceph orch I failed really hard to get the rgw
> daemon running as I would like to (IP/port/zonegroup and so on).
> Also I never really felt comfortable running production workload in
> docker.
>
> Now I wanted to ask the ML: are there good reasons to run ceph in docker,
> other than "update is easier and is decoupled from OS packages"?
>
> Cheers
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
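For reference, moving an existing package-based cluster into containers is done with "cephadm adopt"; a rough sketch only (the daemon names below are placeholders, the full procedure is in the "Converting an existing cluster to cephadm" documentation):

# on each host, see which legacy (package-based) daemons cephadm detects
cephadm ls

# convert daemons one at a time into containers managed via systemd
cephadm adopt --style legacy --name mon.ceph1
cephadm adopt --style legacy --name mgr.ceph1
cephadm adopt --style legacy --name osd.0

After adoption, upgrades are driven by "ceph orch upgrade" rather than OS packages, which is where the "easier to update" argument comes from.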
[ceph-users] Re: Suggestion to build ceph storage
Thanks Christophe,

On Mon, Jun 20, 2022 at 11:45 AM Christophe BAILLON wrote:

> Hi
>
> We have 20 ceph node, each with 12 x 18Tb, 2 x nvme 1Tb
>
> I try this method to create osd
>
> ceph orch apply -i osd_spec.yaml
>
> with this conf
>
> osd_spec.yaml
> service_type: osd
> service_id: osd_spec_default
> placement:
>   host_pattern: '*'
> data_devices:
>   rotational: 1
> db_devices:
>   paths:
>     - /dev/nvme0n1
>     - /dev/nvme1n1
>
> this created 6 osd with wal/db on /dev/nvme0n1 and 6 on /dev/nvme1n1 per
> node
>

Does cephadm automatically create the partitions for WAL/DB, or is that something I have to define in the config? (Sorry, I am new to cephadm; we are using ceph-ansible, and I heard cephadm will replace ceph-ansible soon. Is that correct?)

> but when I do a lvs, I see only 6 x 80Go partitions on each nvme...
>
> I think this is dynamic sizing, but I'm not sure, I don't know how to
> check it...
>
> Our cluster will only host couple of files, a small one and a big one ~2GB
> for cephfs only use, and with only 8 users accessing datas
>

How many MDS nodes do you have for your cluster size? Are they dedicated, or shared with the OSD nodes?

> I don't know if this is optimum, we are in testing process...
>
> ----- Original Mail -----
> > From: "Stefan Kooman"
> > To: "Jake Grimmett" , "Christian Wuerdig" <christian.wuer...@gmail.com>, "Satish Patel"
> > Cc: "ceph-users"
> > Sent: Monday, 20 June 2022 16:59:58
> > Subject: [ceph-users] Re: Suggestion to build ceph storage
>
> > On 6/20/22 16:47, Jake Grimmett wrote:
> >> Hi Stefan
> >>
> >> We use cephfs for our 7200CPU/224GPU HPC cluster, for our use-case
> >> (large-ish image files) it works well.
> >>
> >> We have 36 ceph nodes, each with 12 x 12TB HDD, 2 x 1.92TB NVMe, plus a
> >> 240GB System disk. Four dedicated nodes have NVMe for metadata pool, and
> >> provide mon, mgr and MDS service.
> >>
> >> I'm not sure you need 4% of OSD for wal/db, search this mailing list
> >> archive for a definitive answer, but my personal notes are as follows:
> >>
> >> "If you expect lots of small files: go for a DB that's > ~300 GB
> >> For mostly large files you are probably fine with a 60 GB DB.
> >> 266 GB is the same as 60 GB, due to the way the cache multiplies at each
> >> level, spills over during compaction."
> >
> > There is (experimental ...) support for dynamic sizing in Pacific [1].
> > Not sure if it's stable yet in Quincy.
> >
> > Gr. Stefan
> >
> > [1]:
> > https://docs.ceph.com/en/quincy/rados/configuration/bluestore-config-ref/#sizing
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
> --
> Christophe BAILLON
> Mobile :: +336 16 400 522
> Work :: https://eyona.com
> Twitter :: https://twitter.com/ctof
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
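For what it is worth, cephadm (via ceph-volume) carves the DB devices into one LV per OSD by itself; if you want an explicit DB size instead of the default even split of the NVMe, the OSD spec also accepts a block_db_size field. A hedged sketch only (the size below is an arbitrary example, expressed in bytes; check the drive group documentation for the exact semantics on your release):

service_type: osd
service_id: osd_spec_fixed_db
placement:
  host_pattern: '*'
data_devices:
  rotational: 1
db_devices:
  paths:
    - /dev/nvme0n1
    - /dev/nvme1n1
block_db_size: 322122547200   # ~300 GiB per OSD DB, in bytes

Running "ceph orch apply -i osd_spec.yaml --dry-run" first shows how cephadm intends to lay out the OSDs and DB volumes before anything is created.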
[ceph-users] Re: Suggestion to build ceph storage
Thanks Jake,

On Mon, Jun 20, 2022 at 10:47 AM Jake Grimmett wrote:

> Hi Stefan
>
> We use cephfs for our 7200CPU/224GPU HPC cluster, for our use-case
> (large-ish image files) it works well.
>
> We have 36 ceph nodes, each with 12 x 12TB HDD, 2 x 1.92TB NVMe, plus a
> 240GB System disk. Four dedicated nodes have NVMe for metadata pool, and
> provide mon, mgr and MDS service.
>

This is great info. I am assuming we don't need redundancy for the NVMe, because if it fails it will only impact 6 OSDs, and that is acceptable. At present, because of limited hardware supply, I am planning to host the MDS on the same nodes as the OSDs (no dedicated MDS hardware). Agreed that this is not best practice, but right now I am dealing with many unknowns and I don't want to spend money on something we don't understand yet. Once we start using the cluster and have more data, I can adjust the requirements accordingly.

> I'm not sure you need 4% of OSD for wal/db, search this mailing list
> archive for a definitive answer, but my personal notes are as follows:
>
> "If you expect lots of small files: go for a DB that's > ~300 GB
> For mostly large files you are probably fine with a 60 GB DB.
> 266 GB is the same as 60 GB, due to the way the cache multiplies at each
> level, spills over during compaction."
>

We don't know what kind of workload we are going to run, because currently all that is being asked for is large storage with many, many drives. In the future, if more IOPS are needed, we may replace some boxes with NVMe or SSD and adjust the requirements.

> We use a single enterprise quality 1.9TB NVMe for each 6 OSDs to good
> effect, you probably need 1DWPD to be safe. I suspect you might be able
> to increase the ratio of HDD per NVMe with PCIe gen4 NVMe drives.
>

Can you share which vendor's NVMe drives you are using?

> best regards,
>
> Jake
>
> On 20/06/2022 08:22, Stefan Kooman wrote:
> > On 6/19/22 23:23, Christian Wuerdig wrote:
> >> On Sun, 19 Jun 2022 at 02:29, Satish Patel wrote:
> >>
> >>> Greeting folks,
> >>>
> >>> We are planning to build Ceph storage for mostly cephFS for HPC workload
> >>> and in future we are planning to expand to S3 style but that is yet to be
> >>> decided. Because we need mass storage, we bought the following HW.
> >>>
> >>> 15 Total servers and each server has a 12x18TB HDD (spinning disk). We
> >>> understand SSD/NvME would be best fit but it's way out of budget.
> >>>
> >>> I hope you have extra HW on hand for Monitor and MDS servers
> >
> > ^^ this. It also depends on the uptime guarantees you have to provide
> > (if any). Are the HPC users going to write large files? Or loads of
> > small files? The more metadata operations the busier the MDSes will be,
> > but if it's mainly large files the load on them will be much lower.
> >
> >>> Ceph recommends using a faster disk for wal/db if the data disk is slow and
> >>> in my case I do have a slower disk for data.
> >>>
> >>> Question:
> >>> 1. Let's say if i want to put a NvME disk for wal/db then what size i
> >>> should buy.
> >>>
> >> The official recommendation is to budget 4% of OSD size for WAL/DB - so in
> >> your case that would be 720GB per OSD. Especially if you want to go to S3
> >> later you should stick closer to that limit since RGW is a heavy meta data
> >> user.
> >
> > CephFS can be metadata heavy also, depending on work load. You can
> > co-locate the S3 service on this cluster later on, but from an
> > operational perspective this might not be preferred: you can tune the
> > hardware / configuration for each use case. Easier to troubleshoot,
> > independent upgrade cycle, etc.
> >
> > Gr. Stefan
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
> For help, read https://www.mrc-lmb.cam.ac.uk/scicomp/
> then contact unixad...@mrc-lmb.cam.ac.uk
> --
> Dr Jake Grimmett
> Head Of Scientific Computing
> MRC Laboratory of Molecular Biology
> Francis Crick Avenue,
> Cambridge CB2 0QH, UK.
> Phone 01223 267019
> Mobile 0776 9886539
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
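To make the sizing trade-off concrete, a rough back-of-envelope calculation for 12 x 18TB OSDs per node (numbers are purely illustrative, using the rules of thumb quoted above):

4% rule:         0.04 x 18 TB = 720 GB DB per OSD -> 12 x 720 GB = ~8.6 TB of NVMe per node
60 GB per OSD:   12 x 60 GB   = 720 GB per node   -> fits on a single ~1 TB NVMe, but then
                                                     one NVMe failure takes down all 12 OSDs
300 GB per OSD:  12 x 300 GB  = 3.6 TB per node   -> e.g. 2 x 2 TB NVMe, with 6 OSD DBs each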
[ceph-users] Suggestion to build ceph storage
Greetings folks,

We are planning to build Ceph storage mostly for CephFS for an HPC workload; in the future we may expand to S3-style storage, but that is yet to be decided. Because we need mass storage, we bought the following HW.

15 total servers, and each server has 12 x 18TB HDD (spinning disk). We understand SSD/NVMe would be the best fit, but it's way out of budget.

Ceph recommends using a faster disk for wal/db if the data disk is slow, and in my case I do have slower disks for data.

Questions:
1. Let's say I want to put in an NVMe disk for wal/db; what size should I buy?
2. Do I need a wal/db partition for each OSD, or can a single partition be shared by all OSDs?
3. Can I put the OS on the same disk where the wal/db is going to sit? (This way I don't need to spend extra money on an extra disk.)

Any suggestions you have for this kind of storage would be much appreciated.

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
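On questions (1) and (2): each OSD gets its own DB volume on the shared fast device; a single partition cannot be shared by all OSDs. One way to preview the layout on a node before committing is ceph-volume's batch report, which shows the LVs it would create without touching the disks. A sketch only (device paths are placeholders, and /dev/sd[b-m] is a shell glob for the 12 HDDs):

# preview: 12 HDDs as data devices, one NVMe providing the DB/WAL volume for each
ceph-volume lvm batch --report /dev/sd[b-m] --db-devices /dev/nvme0n1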