[ceph-users] Re: MDS stuck ops

2022-11-29 Thread Venky Shankar
Hi Frank, CC Patrick.

On Tue, Nov 29, 2022 at 8:58 PM Frank Schilder wrote:
>
> Hi Venky,
>
> thanks for taking the time. I'm afraid I still don't get the difference.
> Maybe the ceph dev terminology means something else than what I use. Let's
> look at this statement, I think it summarises

[ceph-users] Re: Implications of pglog_hardlimit

2022-11-29 Thread Joshua Timmer
Great, thanks, that seems to be what I needed. The osds are running again and the cluster is beginning its long road to recovery. It looks like I'm left with a few unfound objects and 3 osds that won't start due to crashes while reading the osdmap, but I'll see if I can work through that. On
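For reference, a minimal sketch of inspecting unfound objects and, as a last resort, giving up on them (the pg id is a placeholder; revert vs. delete depends on whether an older copy of the object is acceptable):

# list the PGs reporting unfound objects and inspect one of them
ceph health detail | grep unfound
ceph pg 2.5 list_unfound
# last resort, once recovery has no remaining source to pull from
ceph pg 2.5 mark_unfound_lost revert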

[ceph-users] Upgrade OSDs without ok-to-stop

2022-11-29 Thread Hollow D.M.
Hi All, I have a Ceph cluster for testing, and it has some pools without replication or EC, so when I run "ceph orch upgrade start --ceph_version 17.2.5" the log shows "Upgrade: unsafe to stop osd(s) at this time (74 PGs are or would become offline)" and it just waits. Is there any way to skip this and upgrade
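For reference, a sketch of one possible way past the check, assuming the blocking PGs belong to the unreplicated test pools and adding a second copy is acceptable (the pool name and OSD id are placeholders):

# see which pools run with a single copy and whether an OSD can be stopped safely
ceph osd pool ls detail | grep "size 1"
ceph osd ok-to-stop osd.0
# give the affected pools a second replica so their PGs stay available during restarts
ceph osd pool set testpool size 2
ceph osd pool set testpool min_size 1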

[ceph-users] Re: Implications of pglog_hardlimit

2022-11-29 Thread Josh Baergen
It's also possible you're running into large pglog entries - any chance you're running RGW and there's an s3:CopyObject workload hitting an object that was uploaded with MPU? https://tracker.ceph.com/issues/56707 If that's the case, you can inject a much smaller value for osd_min_pg_log_entries
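For reference, a minimal sketch of injecting a smaller limit at runtime and persisting it (the value 100 is only an example; see the tracker issue above for guidance):

# runtime injection on all OSDs
ceph tell 'osd.*' injectargs '--osd_min_pg_log_entries=100'
# persist the setting across restarts
ceph config set osd osd_min_pg_log_entries 100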

[ceph-users] Re: Implications of pglog_hardlimit

2022-11-29 Thread Frank Schilder
Hi, it sounds like you might be affected by the pg_log dup bug:

# Check if any OSDs are affected by the pg dup problem
sudo -i ceph tell "osd.*" perf dump | grep -e pglog -e "osd\\."

If any osd_pglog_items >> 1M, check https://www.clyso.com/blog/osds-with-unlimited-ram-growth/ Best regards,

[ceph-users] Re: Implications of pglog_hardlimit

2022-11-29 Thread Gregory Farnum
On Tue, Nov 29, 2022 at 1:18 PM Joshua Timmer wrote:
> I've got a cluster in a precarious state because several nodes have run
> out of memory due to extremely large pg logs on the osds. I came across
> the pglog_hardlimit flag which sounds like the solution to the issue,
> but I'm concerned

[ceph-users] Implications of pglog_hardlimit

2022-11-29 Thread Joshua Timmer
I've got a cluster in a precarious state because several nodes have run out of memory due to extremely large pg logs on the osds. I came across the pglog_hardlimit flag which sounds like the solution to the issue, but I'm concerned that enabling it will immediately truncate the pg logs and
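For reference, a sketch of how the flag is checked and set (assuming all OSDs run a release that supports it):

# the flag shows up in the osdmap flags once set
ceph osd dump | grep flags
# enable the hard limit on pg log length
ceph osd set pglog_hardlimit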

[ceph-users] OSD container won't boot up

2022-11-29 Thread J-P Methot
Hi, I've been testing the cephadm upgrade process in my staging environment and I'm running into an issue where the Docker container just doesn't boot up anymore. This is an Octopus to Pacific 16.2.10 upgrade and I expect to upgrade to Quincy afterwards. This is also running on Ubuntu
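For reference, a minimal sketch of pulling the logs for a container that won't start under cephadm (the daemon id and fsid are placeholders):

# on the affected host
cephadm ls
cephadm logs --name osd.3
# equivalently via systemd
journalctl -u ceph-<fsid>@osd.3.service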

[ceph-users] Re: PGs stuck down

2022-11-29 Thread Wolfpaw - Dale Corse
Thanks! Appreciate everyone who responded :) After reading up on stretch mode, it appears some of the exact things it was created to prevent happened, so this would be the solution! Cheers, D.

[ceph-users] Re: MDS internal op exportdir despite ephemeral pinning

2022-11-29 Thread Frank Schilder
Hi Patrick.

> "Both random and distributed ephemeral pin policies are off by default
> in Octopus. The features may be enabled via the
> mds_export_ephemeral_random and mds_export_ephemeral_distributed
> configuration options."

Thanks for that hint! This is a baddie. I never read that far,
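For reference, a minimal sketch of turning the features on and then marking a directory (config options as quoted above; the mount path is a placeholder):

# enable the policies on the MDSs
ceph config set mds mds_export_ephemeral_distributed true
ceph config set mds mds_export_ephemeral_random true
# then mark a directory for distributed ephemeral pinning
setfattr -n ceph.dir.pin.distributed -v 1 /mnt/cephfs/home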

[ceph-users] Re: MDS internal op exportdir despite ephemeral pinning

2022-11-29 Thread Patrick Donnelly
Hi Frank,

Sorry for the delay and thanks for sharing the data privately.

On Wed, Nov 23, 2022 at 4:00 AM Frank Schilder wrote:
>
> Hi Patrick and everybody,
>
> I wrote a small script that pins the immediate children of 3 sub-dirs on our
> file system in a round-robin way to our 8 active
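For reference, a minimal sketch of such a round-robin pinning loop (not the actual script from the thread; the paths and rank count are placeholders):

# pin each immediate child directory to one of the active ranks in turn
ranks=8; i=0
for parent in /mnt/cephfs/dirA /mnt/cephfs/dirB /mnt/cephfs/dirC; do
    for child in "$parent"/*/; do
        setfattr -n ceph.dir.pin -v $((i % ranks)) "$child"
        i=$((i + 1))
    done
done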

[ceph-users] Re: Issues upgrading cephadm cluster from Octopus.

2022-11-29 Thread Seth T Graham
Thanks for the suggestions. It took me a little bit to get to try it out, but I was able to get the cluster upgraded from Octopus to the latest Pacific. Setting the migration_current value didn't seem to un-wedge anything, but manually setting the registry_credentials key did. It appears my
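For reference, the documented command for populating the registry credentials, which may be an alternative to editing the key directly (URL and credentials are placeholders):

ceph cephadm registry-login --registry-url registry.example.com --registry-username myuser --registry-password mypass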

[ceph-users] Re: MDS stuck ops

2022-11-29 Thread Frank Schilder
Hi Venky,

thanks for taking the time. I'm afraid I still don't get the difference. Maybe the ceph dev terminology means something else than what I use. Let's look at this statement, I think it summarises my misery quite well:

> It's an implementation difference. In octopus, each child dir

[ceph-users] Re: MDS stuck ops

2022-11-29 Thread Venky Shankar
Hi Frank,

On Tue, Nov 29, 2022 at 5:38 PM Frank Schilder wrote:
>
> Hi Venky,
>
> maybe you can help me clarify the situation a bit. I don't understand the
> difference between the two pinning implementations you describe in your reply
> and I also don't see any difference in meaning in the

[ceph-users] Re: Ceph networking

2022-11-29 Thread Jan Marek
Hello, thank you very much for the advice. Now I have two public networks. I've tried to set the cluster to use both public addresses, but I haven't been successful.

# ceph config global public_network 192.168.1.0/24,192.168.2.0/24
# ceph config mon public_network 192.168.1.0/24,192.168.2.0/24
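For reference, a sketch of the `config set` form of these commands (assuming the commands were typed as shown above, and that the goal is to allow daemons to bind in either subnet):

# ceph config set global public_network "192.168.1.0/24,192.168.2.0/24"
# ceph config set mon public_network "192.168.1.0/24,192.168.2.0/24"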

[ceph-users] Re: MDS stuck ops

2022-11-29 Thread Frank Schilder
Hi Venky, maybe you can help me clarify the situation a bit. I don't understand the difference between the two pinning implementations you describe in your reply, and I also don't see any difference in meaning in the documentation between octopus and quincy; the difference is just in wording.

[ceph-users] Re: MDS stuck ops

2022-11-29 Thread Venky Shankar
On Tue, Nov 29, 2022 at 1:42 PM Frank Schilder wrote:
>
> Hi Venky.
>
> > You most likely ran into performance issues with distributed ephemeral
> > pins with octopus. It'd be nice to try out one of the latest releases
> > for this.
>
> I ran into the problem that distributed ephemeral pinning

[ceph-users] Re: PGs stuck down

2022-11-29 Thread Frank Schilder
Hi Dale,

> we thought we had set it up to prevent.. and with size = 4 and min_size set =
> 1

I'm afraid this is exactly what you didn't. Firstly, min_size=1 is always a bad idea. Secondly, if you have 2 data centres, the only way to get this to work is to use stretch mode. Even if you had
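For reference, a minimal sketch of the stretch-mode setup as described in the documentation (mon names, datacenter names, and the CRUSH rule are placeholders; the stretch_rule must already exist in your CRUSH map and the tiebreaker mon sits in a third location):

ceph mon set election_strategy connectivity
ceph mon set_location a datacenter=site1
ceph mon set_location b datacenter=site1
ceph mon set_location c datacenter=site2
ceph mon set_location d datacenter=site2
ceph mon set_location e datacenter=site3
ceph mon enable_stretch_mode e stretch_rule datacenter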

[ceph-users] Re: MDS stuck ops

2022-11-29 Thread Frank Schilder
Hi Venky.

> You most likely ran into performance issues with distributed ephemeral
> pins with octopus. It'd be nice to try out one of the latest releases
> for this.

I ran into the problem that distributed ephemeral pinning seems to not actually be implemented in octopus. This mode didn't pin