[ceph-users] Re: OSD reboot loop after running out of memory

2021-01-02 Thread Stefan Wild
… Thanks, Stefan

On 1/1/21, 11:23 PM, "Anthony D'Atri" wrote:
I have to ask if this might be the balancer or the PG autoscaler at work

> On Jan 1, 2021, at 7:15 PM, Stefan Wild wrote:
>
> Our setup is not using SSDs as the Bluestore DB devices. We only have 2 SSDs …
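For anyone reading along, a minimal way to check whether the balancer or the PG autoscaler is actually doing something (a sketch; assumes the pg_autoscaler manager module is enabled, as it is by default on Octopus):

    # Is the balancer on, and is it currently executing a plan?
    ceph balancer status

    # Per-pool PG counts and what the autoscaler would like them to be
    ceph osd pool autoscale-status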

[ceph-users] Re: OSD reboot loop after running out of memory

2021-01-01 Thread Stefan Wild
… > and stability wise.
> Or you may be able to work around the excessive swap usage (when
> bluefs_buffered_io is set to true) by lowering vm.swappiness or
> disabling the swap.
>
> Regards,
>
> Frédéric.
>
> On 14/12/2020 at 2…
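For reference, a hedged sketch of the two workarounds Frédéric describes (lower vm.swappiness on the OSD host, or turn off buffered BlueFS I/O cluster-wide); check the option default for your release before applying:

    # Reduce the kernel's tendency to swap on the OSD host (runtime change)
    sysctl -w vm.swappiness=1
    # Persist across reboots
    echo "vm.swappiness = 1" > /etc/sysctl.d/90-ceph-swappiness.conf

    # Alternatively, disable buffered BlueFS I/O for all OSDs,
    # then restart the OSDs for the change to take effect
    ceph config set osd bluefs_buffered_io false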

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-14 Thread Stefan Wild
Hi Frédéric,

Thanks for the additional input. We are currently only running RGW on the cluster, so no snapshot removal, but there have been plenty of remappings with the OSDs failing (all of them at first during and after the OOM incident, then one-by-one). I haven't had a chance to look into…

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-14 Thread Stefan Wild
… will look up the referenced thread(s) and try the offline DB compaction. It would be amazing if that does the trick. Will keep you posted here.

Thanks,
Stefan

From: Igor Fedotov
Sent: Monday, December 14, 2020 6:39:28 AM
To: Stefan Wild; ceph-users@ceph.io
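For anyone following the thread, offline compaction of an OSD's RocksDB is typically done with the OSD stopped; a hedged sketch using osd.1 as the example (the unit name and data path may differ on a cephadm deployment, where the OSD runs in a container):

    # Stop the affected OSD first
    systemctl stop ceph-osd@1      # or the ceph-<fsid>@osd.1 unit under cephadm

    # Compact the BlueStore RocksDB offline (path is the OSD's data directory)
    ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-1 compact

    # Start the OSD again
    systemctl start ceph-osd@1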

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-14 Thread Stefan Wild
… history from 2 weeks ago, but ballooning and running out of memory is not the issue anymore.

Thanks,
Stefan

From: Kalle Happonen
Sent: Monday, December 14, 2020 5:00:17 AM
To: huxia...@horebdata.cn
Cc: Stefan Wild; ceph-users
Subject: Re: [ceph-users] Re: OSD…
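A hedged way to sanity-check per-OSD memory behaviour when ballooning is suspected (assumes the default admin socket is reachable on the OSD host):

    # What memory budget the OSDs are configured for
    ceph config get osd osd_memory_target

    # Break down an individual OSD's memory pools (run on the OSD's host)
    ceph daemon osd.1 dump_mempools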

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-13 Thread Stefan Wild
… Igor

On 12/13/2020 5:44 AM, Stefan Wild wrote:
> Just had another look at the logs and this is what I did notice after the affected OSD starts up.
>
> Loads of entries of this sort:
>
> Dec 12 21:38:40 ceph-tpa-server1 bash[780507]: debug 2020-12-13T02:38:40…

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-12 Thread Stefan Wild
Got a trace of the osd process, shortly after ceph status -w announced boot for the osd:

strace: Process 784735 attached
futex(0x5587c3e22fc8, FUTEX_WAIT_PRIVATE, 0, NULL) = ?
+++ exited with 1 +++

It was stuck at that one call for several minutes before exiting.

From: Stefan Wild
Date…
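For reference, a trace like the one above can be captured along these lines (a sketch; the PID is whatever the running ceph-osd process reports):

    # Attach to the running OSD process, follow forks, timestamp each call,
    # and write the trace to a file for later inspection
    strace -f -tt -p 784735 -o osd1.strace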

[ceph-users] Re: OSD reboot loop after running out of memory

2020-12-12 Thread Stefan Wild
…er1 systemd[1]: Started Ceph osd.1 for 08fa929a-8e23-11ea-a1a2-ac1f6bf83142.

Hope that helps…

Thanks,
Stefan

From: Stefan Wild
Date: Saturday, December 12, 2020 at 9:35 PM
To: "ceph-users@ceph.io"
Subject: OSD reboot loop after running out of memory

Hi, We recently upgraded…
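Since the OSDs here are cephadm-managed (units named after the cluster fsid, as in the log line above), the restart loop can be followed live with journalctl; a sketch, assuming the standard cephadm unit naming:

    # Follow the systemd journal for osd.1 of this cluster
    journalctl -fu ceph-08fa929a-8e23-11ea-a1a2-ac1f6bf83142@osd.1.service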

[ceph-users] OSD reboot loop after running out of memory

2020-12-12 Thread Stefan Wild
Hi,

We recently upgraded a cluster from 15.2.1 to 15.2.5. About two days later, one of the servers ran out of memory for unknown reasons (normally the machine uses about 60 out of 128 GB). Since then, some OSDs on that machine get caught in an endless restart loop. Logs will just mention system…
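One knob worth checking after an OOM event is the per-OSD memory budget; a hedged example of capping it (4 GiB shown here for illustration, pick a value that fits the host's 128 GB and its OSD count):

    # Cap each OSD's target memory usage at 4 GiB; OSDs size their caches
    # to stay near this value
    ceph config set osd osd_memory_target 4294967296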

[ceph-users] Re: RGW listing slower on nominally faster setup

2020-06-12 Thread Stefan Wild
On 6/12/20, 5:40 AM, "James, GleSYS" wrote:
> When I set the debug_rgw logs to "20/1", the issue disappears immediately,
> and the throughput for the index pool goes back down to normal levels.

I can – somewhat happily – confirm that setting debug_rgw to "20/1" makes the issue disappear…
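For reference, a hedged sketch of how debug_rgw is typically raised; the config section has to match how your gateway daemons register (client.rgw.<name> here is a placeholder, not taken from the thread):

    # Set RGW debug logging to 20 (log level) / 1 (in-memory level) via the config db
    ceph config set client.rgw debug_rgw 20/1

    # Or inject it into a running gateway through its admin socket on the RGW host
    ceph daemon /var/run/ceph/ceph-client.rgw.<name>.asok config set debug_rgw 20/1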

[ceph-users] RGW listing slower on nominally faster setup

2020-06-10 Thread Stefan Wild
Hi everyone,

We are currently transitioning from a temporary machine to our production hardware. Since we're starting with under 200 TB raw storage, we are on only 1–2 physical machines per cluster, eventually in 3 zones. The temporary machine is undersized for even that with an…