[ceph-users] Cephadm stacktrace on copying ceph.conf

2024-03-26 Thread Jesper Agerbo Krogh [JSKR]
Hi. 

We're currently getting these errors, and I'm missing a clear 
overview of the cause and how to debug it. 

3/26/24 9:38:09 PM [ERR] executing _write_files((['dkcphhpcadmin01', 'dkcphhpcmgt028', 'dkcphhpcmgt029', 'dkcphhpcmgt031', 'dkcphhpcosd033', 'dkcphhpcosd034', 'dkcphhpcosd035', 'dkcphhpcosd036', 'dkcphhpcosd037', 'dkcphhpcosd038', 'dkcphhpcosd039', 'dkcphhpcosd040', 'dkcphhpcosd041', 'dkcphhpcosd042', 'dkcphhpcosd043', 'dkcphhpcosd044'],)) failed.
Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 240, in _write_remote_file
    await asyncssh.scp(f.name, (conn, tmp_path))
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 922, in scp
    await source.run(srcpath)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 458, in run
    self.handle_error(exc)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 307, in handle_error
    raise exc from None
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 456, in run
    await self._send_files(path, b'')
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 438, in _send_files
    self.handle_error(exc)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 307, in handle_error
    raise exc from None
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 434, in _send_files
    await self._send_file(srcpath, dstpath, attrs)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 365, in _send_file
    await self._make_cd_request(b'C', attrs, size, srcpath)
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 343, in _make_cd_request
    self._fs.basename(path))
  File "/lib/python3.6/site-packages/asyncssh/scp.py", line 224, in make_request
    raise exc
asyncssh.sftp.SFTPFailure: scp: /tmp/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config/ceph.conf.new: Permission denied

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/share/ceph/mgr/cephadm/utils.py", line 79, in do_work
    return f(*arg)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1088, in _write_files
    self._write_client_files(client_files, host)
  File "/usr/share/ceph/mgr/cephadm/serve.py", line 1107, in _write_client_files
    self.mgr.ssh.write_remote_file(host, path, content, mode, uid, gid)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 261, in write_remote_file
    host, path, content, mode, uid, gid, addr))
  File "/usr/share/ceph/mgr/cephadm/module.py", line 615, in wait_async
    return self.event_loop.get_result(coro)
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 56, in get_result
    return asyncio.run_coroutine_threadsafe(coro, self._loop).result()
  File "/lib64/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/usr/share/ceph/mgr/cephadm/ssh.py", line 249, in _write_remote_file
    raise OrchestratorError(msg)
orchestrator._interface.OrchestratorError: Unable to write dkcphhpcmgt028:/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config/ceph.conf: scp: /tmp/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config/ceph.conf.new: Permission denied
3/26/24 9:38:09 PM [ERR] Unable to write dkcphhpcmgt028:/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config/ceph.conf: scp: /tmp/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config/ceph.conf.new: Permission denied
[same asyncssh traceback repeated]
3/26/24 9:38:09 PM [INF] Updating dkcphhpcmgt028:/var/lib/ceph/5c384430-da91-11ed-af9c-c780a5227aff/config/ceph.conf

It seems to be related to the permissions the manager writes the files with 
and the permissions of the process doing the copy. 
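One way to confirm that theory is to check, on the failing host (dkcphhpcmgt028 here), whether the staging directory under /tmp/var/lib/ceph/<fsid>/config is writable by the user cephadm connects over SSH as. A minimal, root-independent sketch of the check, run here against a simulated directory (the mktemp path is a stand-in; on the real host, point "staging" at the actual staging path from the log):

```shell
# Simulate cephadm's staging directory and check whether its permission
# bits let the owner write into it. On the real host, set "staging" to
# /tmp/var/lib/ceph/<fsid>/config instead of the mktemp dir.
staging=$(mktemp -d)
chmod 500 "$staging"              # read+execute only: reproduces the failure mode
mode=$(stat -c '%a' "$staging")
owner=$(stat -c '%U' "$staging")
echo "staging dir: mode=$mode owner=$owner"
# scp needs the owner-write bit on the directory to create ceph.conf.new
# inside it; without it, the remote scp reports "Permission denied".
case $mode in
  [2367]*) echo "writable by owner: scp can create files here" ;;
  *)       echo "NOT writable by owner: matches the SFTPFailure above" ;;
esac
chmod 700 "$staging" && rmdir "$staging"
```

If the mode or owner on the real path does not match the SSH user cephadm uses, that would explain the stacktrace.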

[ceph-users] Adding new OSD's - slow_ops and other issues.

2024-03-25 Thread jskr
Hi. 

We have a cluster that has been working very nicely since it was put up more 
than a year ago. Now we need to add more NVMe drives to expand it. 

After setting all the "no" flags, we added them using:

$ ceph orch osd add  

The twist is that we have managed to get the default weights set to 1 for all 
disks, not 7.68 (the default for the ceph orch command). 

Thus we did a subsequent reweight to change the weight, and then removed the 
"no" flags. 

As a consequence we had a bunch of OSDs delivering slow_ops, and after 
manually restarting the OSDs to get rid of them, the system returned to normal. 

... second try... 

Same drill - but somehow the ceph orch command failed to bring the new OSD 
online before we ran the reweight command ... and it worked flawlessly. 

... third try ... 

Same drill - but now ceph orch brought the new OSD into the system, and we saw 
exactly the same problem again. Being a bit wiser, we forcefully restarted 
the new OSD, and everything went back to normal again. 

Thus it seems like the "reweight" command on online OSDs has a bad effect on 
our setup, causing major service disruption. 

1) Is it possible to "bulk" change the default weights on all OSDs without a 
huge data movement going on? 
2) Or is it possible to instruct "ceph orch osd add" to set the default weight 
before it puts the new OSD into the system? 
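On (2): Ceph has a config option, osd_crush_initial_weight, that fixes the CRUSH weight newly created OSDs get instead of deriving it from device size, so setting it (e.g. `ceph config set osd osd_crush_initial_weight 1`) before `ceph orch` creates the OSDs should remove the need for the post-hoc reweight; check `ceph config help osd_crush_initial_weight` on your build. On (1): CRUSH placement depends on relative weights, so scaling every OSD by the same factor should move little data, and wrapping the change in norebalance keeps the cluster from reacting to each intermediate map. A sketch that only generates the command list, so it can be reviewed before being run against the cluster (OSD ids 0-42 are an assumption based on the 43-disk setup):

```shell
# Generate, but do not execute, a bulk CRUSH reweight to 1.0 for osd.0..osd.42.
# With norebalance set first, rebalancing is deferred until the flag is
# cleared, instead of being re-planned after every single reweight.
out=/tmp/bulk_reweight.sh
{
  echo "ceph osd set norebalance"
  for id in $(seq 0 42); do
    echo "ceph osd crush reweight osd.$id 1.0"
  done
  echo "ceph osd unset norebalance"
} > "$out"
sed -n '1,3p;45p' "$out"    # spot-check the first commands and the final unset
```

Reviewing the generated file and then feeding it to a shell avoids reweighting the wrong ids if the OSD numbering has gaps.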

I would not expect the above to be intended behaviour. If someone has ideas 
about what is going on beyond the above, please share. 

Setup: 
# ceph version
ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)

43x 7.68 TB NVMe drives across 12 OSD hosts, all connected using 2x 100GbitE 


Thanks Jesper
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io