[ceph-users] Ceph orch commands non-responsive after mgr/mon reboots 16.2.9
Howdy,

I seem to be facing a problem on my 16.2.9 Ceph cluster. After a staggered reboot of my 3 infra nodes, all ceph orch commands hang, much like in this previously reported issue [1]. I have paused orch and rebuilt a manager by hand as outlined here [2], but the issue continues to persist. I am unable to scale services up or down, restart daemons, etc.

ceph orch ls --verbose
[{'flags': 8, 'help': 'List services known to orchestrator', 'module': 'mgr', 'perm': 'r', 'sig': [argdesc(, req=True, name=prefix, n=1, numseen=0, prefix=orch), argdesc(, req=True, name=prefix, n=1, numseen=0, prefix=ls), argdesc(, req=False, name=service_type, n=1, numseen=0), argdesc(, req=False, name=service_name, n=1, numseen=0), argdesc(, req=False, name=export, n=1, numseen=0), argdesc(, req=False, name=format, n=1, numseen=0, strings=plain|json|json-pretty|yaml|xml-pretty|xml), argdesc(, req=False, name=refresh, n=1, numseen=0)]}]
Submitting command: {'prefix': 'orch ls', 'target': ('mon-mgr', '')}
submit {"prefix": "orch ls", "target": ["mon-mgr", ""]} to mon-mgr

Debug output on the manager:

debug 2022-07-22T23:27:12.509+ 7fc180230700 0 log_channel(audit) log [DBG] : from='client.1084220 -' entity='client.admin' cmd=[{"prefix": "orch ls", "target": ["mon-mgr", ""]}]: dispatch

I have collected a startup log of the manager and uploaded it for review [3].

Many Thanks,
Tim

[1] https://www.spinics.net/lists/ceph-users/msg68398.html
[2] https://docs.ceph.com/en/quincy/cephadm/troubleshooting/#manually-deploying-a-mgr-daemon
[3] https://pastebin.com/Dvb8sEbz

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: octopus v15.2.17 QE Validation status
On Thu, Jul 21, 2022 at 8:47 AM Ilya Dryomov wrote:
>
> On Thu, Jul 21, 2022 at 4:24 PM Yuri Weinstein wrote:
> >
> > Details of this release are summarized here:
> >
> > https://tracker.ceph.com/issues/56484
> > Release Notes - https://github.com/ceph/ceph/pull/47198
> >
> > Seeking approvals for:
> >
> > rados - Neha, Travis, Ernesto, Adam

rados approved! Known issue: https://tracker.ceph.com/issues/55854

Thanks,
Neha

> > rgw - Casey
> > fs, kcephfs, multimds - Venky, Patrick
> > rbd - Ilya, Deepika
> > krbd - Ilya, Deepika
>
> rbd and krbd approved.
>
> Thanks,
>
> Ilya
[ceph-users] Re: dashboard on Ubuntu 22.04: python3-cheroot incompatibility
On Fri, Jul 22, 2022 at 04:54:23PM +0100, James Page wrote:
> > If I remove the version check (see below), dashboard appears to be working.
>
> https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1967139
>
> I just uploaded a fix for cheroot to resolve this issue - the stable
> release update team should pick that up next week.

Thank you!

Matthias
[ceph-users] Re: dashboard on Ubuntu 22.04: python3-cheroot incompatibility
Hi Matthias

On Fri, Jul 22, 2022 at 4:50 PM Matthias Ferdinand wrote:
> Hi,
>
> trying to activate the ceph dashboard on a 17.2.0 cluster (Ubuntu 22.04
> using the standard Ubuntu repos), the dashboard module crashes because it
> cannot understand the python3-cheroot version number '8.5.2+ds1':
>
> root@mceph00:~# ceph crash info 2022-07-22T14:44:03.226395Z_a6b006a7-10c3-443d-9ead-161e06a27bf3
> {
>     "backtrace": [
>         " File \"/usr/share/ceph/mgr/dashboard/__init__.py\", line 52, in <module>\nfrom .module import Module, StandbyModule # noqa: F401",
>         " File \"/usr/share/ceph/mgr/dashboard/module.py\", line 49, in <module>\npatch_cherrypy(cherrypy.__version__)",
>         " File \"/usr/share/ceph/mgr/dashboard/cherrypy_backports.py\", line 197, in patch_cherrypy\naccept_socket_error_0(v)",
>         " File \"/usr/share/ceph/mgr/dashboard/cherrypy_backports.py\", line 124, in accept_socket_error_0\nif v < StrictVersion(\"9.0.0\") or cheroot_version < StrictVersion(\"6.5.5\"):",
>         " File \"/lib/python3.10/distutils/version.py\", line 64, in __gt__\nc = self._cmp(other)",
>         " File \"/lib/python3.10/distutils/version.py\", line 168, in _cmp\nother = StrictVersion(other)",
>         " File \"/lib/python3.10/distutils/version.py\", line 40, in __init__\nself.parse(vstring)",
>         " File \"/lib/python3.10/distutils/version.py\", line 137, in parse\nraise ValueError(\"invalid version number '%s'\" % vstring)",
>     =>  "ValueError: invalid version number '8.5.2+ds1'"
>     ],
>     "ceph_version": "17.2.0",
>     "crash_id": "2022-07-22T14:44:03.226395Z_a6b006a7-10c3-443d-9ead-161e06a27bf3",
>     "entity_name": "mgr.mceph05",
>     "mgr_module": "dashboard",
>     "mgr_module_caller": "PyModule::load_subclass_of",
>     "mgr_python_exception": "ValueError",
>     "os_id": "22.04",
>     "os_name": "Ubuntu 22.04 LTS",
>     "os_version": "22.04 LTS (Jammy Jellyfish)",
>     "os_version_id": "22.04",
>     "process_name": "ceph-mgr",
>     "stack_sig": "3f893983e716f2a7e368895904cf3485ac7064d3294a45ea14066a1576c818e3",
>     "timestamp": "2022-07-22T14:44:03.226395Z",
>     "utsname_hostname": "mceph05",
>     "utsname_machine": "x86_64",
>     "utsname_release": "5.15.0-41-generic",
>     "utsname_sysname": "Linux",
>     "utsname_version": "#44-Ubuntu SMP Wed Jun 22 14:20:53 UTC 2022"
> }
>
> If I remove the version check (see below), dashboard appears to be working.

https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1967139

I just uploaded a fix for cheroot to resolve this issue - the stable release update team should pick that up next week.

Cheers

James
[ceph-users] dashboard on Ubuntu 22.04: python3-cheroot incompatibility
Hi,

trying to activate the ceph dashboard on a 17.2.0 cluster (Ubuntu 22.04 using the standard Ubuntu repos), the dashboard module crashes because it cannot understand the python3-cheroot version number '8.5.2+ds1':

root@mceph00:~# ceph crash info 2022-07-22T14:44:03.226395Z_a6b006a7-10c3-443d-9ead-161e06a27bf3
{
    "backtrace": [
        " File \"/usr/share/ceph/mgr/dashboard/__init__.py\", line 52, in <module>\nfrom .module import Module, StandbyModule # noqa: F401",
        " File \"/usr/share/ceph/mgr/dashboard/module.py\", line 49, in <module>\npatch_cherrypy(cherrypy.__version__)",
        " File \"/usr/share/ceph/mgr/dashboard/cherrypy_backports.py\", line 197, in patch_cherrypy\naccept_socket_error_0(v)",
        " File \"/usr/share/ceph/mgr/dashboard/cherrypy_backports.py\", line 124, in accept_socket_error_0\nif v < StrictVersion(\"9.0.0\") or cheroot_version < StrictVersion(\"6.5.5\"):",
        " File \"/lib/python3.10/distutils/version.py\", line 64, in __gt__\nc = self._cmp(other)",
        " File \"/lib/python3.10/distutils/version.py\", line 168, in _cmp\nother = StrictVersion(other)",
        " File \"/lib/python3.10/distutils/version.py\", line 40, in __init__\nself.parse(vstring)",
        " File \"/lib/python3.10/distutils/version.py\", line 137, in parse\nraise ValueError(\"invalid version number '%s'\" % vstring)",
    =>  "ValueError: invalid version number '8.5.2+ds1'"
    ],
    "ceph_version": "17.2.0",
    "crash_id": "2022-07-22T14:44:03.226395Z_a6b006a7-10c3-443d-9ead-161e06a27bf3",
    "entity_name": "mgr.mceph05",
    "mgr_module": "dashboard",
    "mgr_module_caller": "PyModule::load_subclass_of",
    "mgr_python_exception": "ValueError",
    "os_id": "22.04",
    "os_name": "Ubuntu 22.04 LTS",
    "os_version": "22.04 LTS (Jammy Jellyfish)",
    "os_version_id": "22.04",
    "process_name": "ceph-mgr",
    "stack_sig": "3f893983e716f2a7e368895904cf3485ac7064d3294a45ea14066a1576c818e3",
    "timestamp": "2022-07-22T14:44:03.226395Z",
    "utsname_hostname": "mceph05",
    "utsname_machine": "x86_64",
    "utsname_release": "5.15.0-41-generic",
    "utsname_sysname": "Linux",
    "utsname_version": "#44-Ubuntu SMP Wed Jun 22 14:20:53 UTC 2022"
}

If I remove the version check (see below), dashboard appears to be working.

Regards

Matthias

---

root@mceph00:~# diff -rbup /usr/share/ceph/mgr/dashboard/cherrypy_backports.py{.orig,}
--- /usr/share/ceph/mgr/dashboard/cherrypy_backports.py.orig    2022-04-19 00:08:27.0 +0200
+++ /usr/share/ceph/mgr/dashboard/cherrypy_backports.py 2022-07-22 16:46:12.850768963 +0200
@@ -121,7 +121,8 @@ def accept_socket_error_0(v):
     except ImportError:
         pass
 
-    if v < StrictVersion("9.0.0") or cheroot_version < StrictVersion("6.5.5"):
+    #if v < StrictVersion("9.0.0") or cheroot_version < StrictVersion("6.5.5"):
+    if v < StrictVersion("9.0.0"):
         generic_socket_error = OSError
 
     def accept_socket_error_0(func):
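For anyone who wants a workaround without deleting the check outright: the comparison only needs the leading numeric part of the version string, so a small helper can strip Debian-style suffixes like '+ds1' before comparing. This is a hedged sketch, not the fix James uploaded for the cheroot package; the helper names (`base_version`, `needs_backport`) are invented for illustration.

```python
import re

def base_version(vstring):
    """Strip anything after the leading dotted-numeric part, e.g.
    '8.5.2+ds1' -> (8, 5, 2), so Debian-mangled versions compare cleanly."""
    m = re.match(r"(\d+(?:\.\d+)*)", vstring)
    if not m:
        raise ValueError("invalid version number '%s'" % vstring)
    return tuple(int(p) for p in m.group(1).split("."))

def needs_backport(cherrypy_v, cheroot_v):
    # Same condition as the dashboard's check, but tolerant of '+ds1'.
    return base_version(cherrypy_v) < (9, 0, 0) or base_version(cheroot_v) < (6, 5, 5)
```

With this, `needs_backport("9.0.0", "8.5.2+ds1")` evaluates instead of raising the ValueError seen in the crash above.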
[ceph-users] Re: Can't setup Basic Ceph Client
Hello Iban,

We finally did it! With your example, we set up a client which does what we need. We only regret that the documentation of ceph auth is not a little more explicit; that would have led us to the solution more quickly.

Many thanks Iban, and Kai Stian Olstad too.

Best regards

JM

On 19/07/2022 at 14:12, Jean-Marc FONTANA wrote:

Hello Iban,

Thanks for your answer! We finally managed to connect with the admin keyring, but we think that is not the best practice. We shall try your conf and let you know the result.

Best regards

JM

On 19/07/2022 at 11:08, Iban Cabrillo wrote:

Hi Jean,

If you do not want to use the admin user, which is the most logical thing to do, you must create a client with rbd access to the pool on which you are going to perform the I/O actions. For example, in our case it is the user cinder:

client.cinder
        key:
        caps: [mgr] allow r
        caps: [mon] profile rbd
        caps: [osd] profile rbd pool=vol1, profile rbd pool=vol2 ... profile rbd pool=volx

And then install the client keyring on the client node:

cephclient:~ # ls -la /etc/ceph/
total 28
drwxr-xr-x   2 root root  4096 Jul 18 11:37 .
drwxr-xr-x 132 root root 12288 Jul 18 11:37 ..
-rw-r--r--   1 root root    64 Oct 19  2017 ceph.client.cinder.keyring
-rw-r--r--   1 root root  2018 Jul 18 11:37 ceph.conf

In our case we have added:

cat /etc/profile.d/ceph-cinder.sh
export CEPH_ARGS="--keyring /etc/ceph/ceph.client.cinder.keyring --id cinder"

so that it is picked up automatically.

cephclient:~ # rbd ls -p volumes
image01_to_remove
volume-01bbf2ee-198c-446d-80bf-f68292130f5c
volume-036865ad-6f9b-4966-b2ea-ce10bf09b6a9
volume-04445a86-a032-4731-8bff-203dfc5d02e1
..

I hope this helps you.

Cheers, I
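For reference, caps like Iban's client.cinder can be created in a single `ceph auth get-or-create` call. Below is a small Python sketch that just renders such a command for a list of pools; the helper name and the example pool names are invented for illustration, and the caps mirror the ones shown in the thread above.

```python
def rbd_client_auth_cmd(client_id, pools):
    """Render a `ceph auth get-or-create` command granting the rbd
    profile on the given pools, mirroring the client.cinder caps above."""
    osd_caps = ", ".join("profile rbd pool=%s" % p for p in pools)
    return ("ceph auth get-or-create client.%s "
            "mon 'profile rbd' mgr 'allow r' osd '%s'" % (client_id, osd_caps))

# e.g. rbd_client_auth_cmd("cinder", ["vol1", "vol2"]) builds the command
# you would run once on a node with admin credentials.
```

Running the rendered command prints the new keyring, which you then copy to /etc/ceph/ on the client node as Iban shows.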
[ceph-users] Re: librbd leaks memory on crushmap updates
On 21.07.22 17:50, Ilya Dryomov wrote:
> On Thu, Jul 21, 2022 at 11:42 AM Peter Lieven wrote:
>> On 19.07.22 17:57, Ilya Dryomov wrote:
>>> On Tue, Jul 19, 2022 at 5:10 PM Peter Lieven wrote:

On 24.06.22 16:13, Peter Lieven wrote:
> On 23.06.22 12:59, Ilya Dryomov wrote:
>> On Thu, Jun 23, 2022 at 11:32 AM Peter Lieven wrote:
>>> On 22.06.22 15:46, Josh Baergen wrote:

Hey Peter,

> I found relatively large allocations in the qemu smaps and checked
> the contents. It contained several hundred repetitions of osd and
> pool names. We use the default builds on Ubuntu 20.04. Is there a
> special memory allocator in place that might not clean up properly?

I'm sure you would have noticed this and mentioned it if it was so - any chance the contents of these regions look like log messages of some kind? I recently tracked down a high client memory usage that looked like a leak that turned out to be a broken config option resulting in higher in-memory log retention: https://tracker.ceph.com/issues/56093. AFAICT it affects Nautilus+.

>>> Hi Josh, hi Ilya,
>>>
>>> it seems we were in fact facing 2 leaks with 14.x. Our long-running VMs
>>> with librbd 14.x have several million items in the osdmap mempool.
>>>
>>> In our testing environment with 15.x I see no unlimited increase in the
>>> osdmap mempool (compared to a second dev host with a 14.x client
>>> where I do see the increase with my tests), but I still see leaking
>>> memory when I generate a lot of osdmap changes - this in fact seems
>>> to be log messages - thanks Josh.
>>>
>>> So I would appreciate it if #56093 would be backported to Octopus before
>>> its final release.

>> I picked up Josh's PR that was sitting there unnoticed but I'm not sure
>> it is the issue you are hitting. I think Josh's change just resurrects
>> the behavior where clients stored only up to 500 log entries instead of
>> up to 1 (the default for daemons). There is no memory leak there,
>> just a difference in how much memory is legitimately consumed. The
>> usage is bounded either way.
>>
>> However, in your case the usage is slowly but constantly growing.
>> In the original post you said that it was observed both on 14.2.22 and
>> 15.2.16. Are you saying that you are no longer seeing it in 15.x?

> After I understood what's the background of Josh's issue, I can confirm that
> I still see increasing memory which is not caused by osdmap items and also
> not by log entries. There must be something else going on. I still see
> increased memory (heap) usage.

Might it be that it is just heap fragmentation?

>>> Hi Peter,
>>>
>>> It could be, but you never quantified the issue. What is the actual
>>> heap usage you are seeing, how fast is it growing? Is it specific to
>>> some particular VMs or does it affect the entire fleet?

>> Hi Ilya,
>>
>> I see the issue across the fleet. The memory increases by about 200KB/day
>> per attached drive.
>>
>> Same hypervisor with attached iSCSI storage - no issue.
>>
>> However, the memory that is increasing is not listed as heap under
>> /proc/{pid}/smaps.
>>
>> Does librbd use its own memory allocator?

> Hi Peter,
>
> By default, librbd uses tcmalloc.

That's a good pointer. From what I read, tcmalloc does not aggressively return memory back to the OS after free.

>> I am still testing with 15.x as I mainly have long-running VMs in our
>> production environment.
>>
>> With 14.x we had an additional issue with the osdmaps not being freed.
>> That is gone with 15.x.
>>
>> I will try with a patched qemu that allocates the write buffers inside
>> qemu and sets disable_zero_copy_write = true, to see if this makes any
>> difference.

> We are unlikely to be able to do anything about 15.x at this point so
> I'd encourage you to try 17.x. That said, any new information would be
> helpful.

I will certainly do that, but at the moment it looks like tcmalloc and heap fragmentation.

I am currently testing with a modified qemu that sets rbd_disable_zero_copy_writes to false and implements the bounce buffer internally. It additionally has the benefit that we can avoid a buffer allocation for very small writes (e.g. up to 4k) and take the memory from the coroutine stack. (And it would allow for the implementation of FUA support with the existing librbd API, but that's future work.)

Best

Peter
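To quantify the osdmap mempool growth discussed in this thread, the client's admin socket can be polled periodically with `ceph daemon <socket> dump_mempools`. Below is a minimal parsing sketch; the `{"mempool": {"by_pool": ...}}` JSON shape is assumed from a Pacific-era client, the sample data is purely illustrative (not captured from a real cluster), and the helper name is made up.

```python
import json

def osdmap_mempool(dump_json):
    """Extract (items, bytes) for the osdmap mempool from the output of
    `ceph daemon <socket> dump_mempools`; the JSON shape is an assumption."""
    pools = json.loads(dump_json)["mempool"]["by_pool"]
    osdmap = pools.get("osdmap", {"items": 0, "bytes": 0})
    return osdmap["items"], osdmap["bytes"]

# Illustrative sample only -- stands in for real dump_mempools output:
sample = json.dumps({"mempool": {"by_pool": {
    "osdmap": {"items": 3_500_000, "bytes": 180_000_000},
    "buffer_anon": {"items": 1200, "bytes": 524288},
}}})
items, nbytes = osdmap_mempool(sample)
```

Logging these two numbers once a day per VM would show whether the growth Peter sees is in the osdmap pool or elsewhere (e.g. allocator fragmentation outside any mempool).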
[ceph-users] Re: replacing OSD nodes
It seems like low-hanging fruit to fix? There must be a reason why the developers have not made a prioritized order for backfilling PGs. Or maybe the prioritization is based on something other than available space?

The question remains unanswered, as does whether my suggested approach/script would work or not. Summer vacation?

Best,
Jesper

--
Jesper Lykkegaard Karlsen
Scientific Computing
Centre for Structural Biology
Department of Molecular Biology and Genetics
Aarhus University
Universitetsbyen 81
8000 Aarhus C

E-mail: je...@mbg.au.dk
Tel.: +45 50906203

From: Janne Johansson
Sent: 20 July 2022 19:39
To: Jesper Lykkegaard Karlsen
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] replacing OSD nodes

On Wed, 20 July 2022 at 11:22, Jesper Lykkegaard Karlsen wrote:
> Thanks for your answer, Janne.
> Yes, I am also running "ceph osd reweight" on the "nearfull" osds,
> once they get too close for comfort.
>
> But I just thought a continuous prioritization of rebalancing PGs could
> make this process smoother, with less/no need for manual operations.

You are absolutely right there. I just wanted to chip in with my experience of "it nags at me but it will still work out", so other people finding these mails later on can feel a bit relieved knowing that a few toofull warnings aren't a major disaster and that it sometimes happens, because ceph looks for all possible moves, even those which will run late in the rebalancing.

--
May the most significant bit of your life be positive.
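For what it's worth, the suggested approach of ordering backfills by how full their target OSDs are can be sketched in a few lines. The input shapes below (pgid -> target OSD ids, osd id -> utilization fraction, e.g. scraped from `ceph pg dump` and `ceph osd df -f json`) and the helper name are assumptions; the resulting order could then be fed to `ceph pg force-backfill` one pgid at a time.

```python
def backfill_order(pg_targets, osd_util):
    """Order PGs so those whose target OSDs are least full backfill first.
    pg_targets maps pgid -> list of target OSD ids; osd_util maps
    osd id -> utilization fraction (0.0-1.0)."""
    def fullest_target(pgid):
        # A PG is only as safe as its fullest target OSD.
        return max(osd_util[o] for o in pg_targets[pgid])
    return sorted(pg_targets, key=fullest_target)

order = backfill_order(
    {"1.a": [0, 1], "1.b": [2, 3], "1.c": [0, 3]},
    {0: 0.55, 1: 0.80, 2: 0.40, 3: 0.60},
)
# order -> ["1.b", "1.c", "1.a"]: the PG headed for the 80%-full OSD waits.
```

This only reorders which backfills are kicked first; it does not change CRUSH placement, so a genuinely overfull target still needs a reweight as Janne describes.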