[ceph-users] Re: Backfill Performance for

2023-08-08 Thread Josh Baergen
Hi Jonathan,

> - All PGs seem to be backfilling at the same time which seems to be in
> violation of osd_max_backfills. I understand that there should be 6 readers
> and 6 writers at a time, but I'm seeing a given OSD participate in more
> than 6 PG backfills. Is an OSD only considered as backfilling if it is not
> present in both the UP and ACTING groups (e.g. it will have its data
> altered)?

Say you have a PG that looks like this:
1.7ffe   active+remapped+backfill_wait  [983,1112,486]  983  [983,1423,1329]  983

If this is a replicated cluster, 983 (the primary OSD) will be the
data read source, and 1423/1329 will of course be targets. If this is
EC, then 1112 will be the read source for the 1423 backfill, and 486
will be the read source for the 1329 backfill. (Unless the PG is
degraded, in which case backfill reads may become normal PG reads.)
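
(If you want to double-check this on your own cluster, the up/acting
sets for remapped PGs can be pulled with something like the following;
treat it as a sketch, since the column layout varies a bit by release:)

    # list remapped PGs with their up and acting sets
    ceph pg ls remapped
    # or a raw dump that includes the same columns
    ceph pg dump pgs_brief | grep backfill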

Backfill locks are taken on the primary OSD (983 in the example above)
and then all the backfill targets (1329, 1423). Locks are _not_ taken
on read sources for EC backfills, so it's possible to have any number
of backfills reading from a single OSD during EC backfill with no
direct control over this.

> - Some PGs are recovering at a much slower rate than others (some as little
> as kilobytes per second) despite the disks being all of a similar speed. Is
> there some way to dig into why that may be?

I would start by looking at whether the read sources or write targets
are overloaded at the disk level.
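
(As a sketch - assuming the OSDs sit on their own devices and you have
shell access to the hosts:)

    # per-OSD latency as reported by the cluster
    ceph osd perf
    # on a suspect host, watch device utilization/await for the OSD's disk
    iostat -x 5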

> - In general, the recovery is happening very slowly (between 1 and 5
> objects per second per PG). Is it possible the settings above are too
> aggressive and causing performance degradation due to disk thrashing?

Maybe - which settings are appropriate depends on your configuration
(replicated vs. EC). If you have a replicated pool, then those
settings are probably way too aggressive and max backfills should be
reduced; if it's EC, the max backfills might be OK. In either case,
the sleep should be increased, but it's unlikely that the sleep
setting is affecting per-PG backfill speed that much (though it could
make it uneven).
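
(For reference, these are the knobs I mean; the values below are purely
illustrative and need to be tuned for your hardware and pool type:)

    ceph config set osd osd_max_backfills 1
    ceph config set osd osd_recovery_sleep_hdd 0.1
    ceph config set osd osd_recovery_sleep_ssd 0.01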

> - Currently, all misplaced PGs are backfilling, if I were to change some of
> the settings above (specifically `osd_max_backfills`) would that
> essentially pause backfilling PGs or will those backfills have to end and
> then start over when it is done waiting?

It effectively pauses backfill.

> - Given that all PGs are backfilling simultaneously there is no way to
> prioritize one PG over another (we have some disks with very high usage
> that we're trying to reduce). Would reducing those max backfills allow for
> proper prioritization of PGs with force-backfill?

There's no great way to affect backfill prioritization. The backfill
lock acquisition I noted above is blocking without backoff, so
high-priority backfills could be waiting in line for a while until
they get a chance to run.
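
(For completeness, the force-backfill knob being asked about is the pg
command below, shown against the example PG from earlier; it bumps the
priority but is still subject to the lock acquisition described above:)

    ceph pg force-backfill 1.7ffe
    # and to undo it later:
    ceph pg cancel-force-backfill 1.7ffe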

> - We have had some OSDs restart during the process and their misplaced
> object count is now zero but they are incrementing their recovering objects
> bytes. Is that expected and is there a way to estimate when that will
> complete?

Not sure - this gets messy.

FWIW, this situation is one of the reasons why we built
https://github.com/digitalocean/pgremapper (inspired by a procedure
and some tooling that CERN built for the same reason). You might be
interested in 
https://github.com/digitalocean/pgremapper#example---cancel-all-backfill-in-the-system-as-a-part-of-an-augment,
or using cancel-backfill plus an undo-upmaps loop.
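
(Roughly, that workflow looks like this - a sketch only, please check
the README for the exact flags:)

    # turn all pending backfill into upmap entries, freezing data where it is
    pgremapper cancel-backfill --yes
    # then remove those upmaps in batches at a pace you control;
    # see 'pgremapper undo-upmaps --help' for the arguments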

Josh
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Puzzle re 'ceph: mds0 session blocklisted'

2023-08-08 Thread Harry G Coin

Can anyone help me understand seemingly contradictory cephfs error messages?

I have a RHEL ceph client that mounts a cephfs file system via autofs.  
Very typical.  After boot, when a user first uses the mount, for example 
'ls /mountpoint' , all appears normal to the user.  But on the system 
console I get, every time after first boot:


...

[  412.762310] Key type dns_resolver registered
[  412.912107] Key type ceph registered
[  412.925268] libceph: loaded (mon/osd proto 15/24)
[  413.110488] ceph: loaded (mds proto 32)
[  413.124870] libceph: mon3 (2)[fc00:1002:c7::44]:3300 session established
[  413.128298] libceph: client56471655 fsid 4067126d-01cb-40af-824a-881c130140f8

[  413.355716] ceph: mds0 session blocklisted

...

The autofs line is

/mountpoint -fstype=ceph,fs=cephlibraryfs,fsid=really-big-number,name=cephuser,secretfile=/etc/ceph/secret1.key,ms_mode=crc,relatime,recover_session=clean,mount_timeout=15,fscontext="system_u:object_r:cephfs_t:s0" [fc00:1002:c7::41]:3300,[fc00:1002:c7::42]:3300,[fc00:1002:c7::43]:3300,[fc00:1002:c7::44]:3300:/


'blocklisting' is, well, 'bad'... but there's no obvious user effect.  
Is there an 'unobvious' problem?   What am I missing?  Ceph Pacific 
everywhere, latest release.
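
Is checking the blocklist directly the right way to dig into this? I.e.
something like the following from an admin node (assuming those are even
the right commands to use):

    # does the client's address actually appear on the blocklist?
    ceph osd blocklist ls
    # and what does the MDS think of the session? (substitute your MDS name)
    ceph tell mds.<name> session ls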


Thanks

Harry Coin



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: MDS stuck in rejoin

2023-08-08 Thread Frank Schilder
Dear Xiubo,

the nearfull pool is an RBD pool and has nothing to do with the file system. 
All pools for the file system have plenty of capacity.

I think we have an idea what kind of workload caused the issue. We had a user 
run a computation that reads the same file over and over again. He started 100 
such jobs in parallel and our storage servers were at 400% load. I saw 167K 
read IOP/s on an HDD pool that has an aggregated raw IOP/s budget of ca. 11K. 
Clearly, most of this was served from RAM.

It is possible that this extreme load situation triggered a race that remained 
undetected/unreported. There is literally no related message in any logs near 
the time the warning started popping up. It shows up out of nowhere.

We asked the user to change his workflow to use local RAM disk for the input 
files. I don't think we can reproduce the problem anytime soon.

About the bug fixes, I'm eagerly waiting for this and another one. Any idea 
when they might show up in distro kernels?

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


From: Xiubo Li 
Sent: Tuesday, August 8, 2023 2:57 AM
To: Frank Schilder; ceph-users@ceph.io
Subject: Re: [ceph-users] Re: MDS stuck in rejoin


On 8/7/23 21:54, Frank Schilder wrote:
> Dear Xiubo,
>
> I managed to collect some information. It looks like there is nothing in the 
> dmesg log around the time the client failed to advance its TID. I collected 
> short snippets around the critical time below. I have full logs in case you 
> are interested. Its large files, I will need to do an upload for that.
>
> I also have a dump of "mds session ls" output for clients that showed the 
> same issue later. Unfortunately, no consistent log information for a single 
> incident.
>
> Here the summary, please let me know if uploading the full package makes 
> sense:
>
> - Status:
>
> On July 29, 2023
>
> ceph status/df/pool stats/health detail at 01:05:14:
>cluster:
>  health: HEALTH_WARN
>  1 pools nearfull
>
> ceph status/df/pool stats/health detail at 01:05:28:
>cluster:
>  health: HEALTH_WARN
>  1 clients failing to advance oldest client/flush tid
>  1 pools nearfull

Okay, then this could be the root cause.

If the pool is nearfull, it could block flushing the journal logs to the
pool, and then the MDS couldn't send the safe replies to the requests and
would block them like this.

Could you fix the pool nearfull issue first and then check whether you
see it again?
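
Just as a sketch of what I mean - the values below are illustrative only,
and raising the ratio is a stopgap; the real fix is freeing or adding
capacity:

    # confirm which pool is nearfull and by how much
    ceph df
    ceph health detail
    # temporarily raise the nearfull threshold if it is only a ratio issue
    ceph osd set-nearfull-ratio 0.90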


> [...]
>
> On July 31, 2023
>
> ceph status/df/pool stats/health detail at 10:36:16:
>cluster:
>  health: HEALTH_WARN
>  1 clients failing to advance oldest client/flush tid
>  1 pools nearfull
>
>cluster:
>  health: HEALTH_WARN
>  1 pools nearfull
>
> - client evict command (date, time, command):
>
> 2023-07-31 10:36  ceph tell mds.ceph-11 client evict id=145678457
>
> We have a 1h time difference between the date stamp of the command and the 
> dmesg date stamps. However, there seems to be a weird 10min delay from 
> issuing the evict command until it shows up in dmesg on the client.
>
> - dmesg:
>
> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
> [Fri Jul 28 12:59:14 2023] beegfs: enabling unsafe global rkey
> [Fri Jul 28 16:07:47 2023] slurm.epilog.cl (24175): drop_caches: 3
> [Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed (con state OPEN)
> [Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed (con state OPEN)
> [Sat Jul 29 18:21:30 2023] libceph: mds2 192.168.32.75:6801 socket closed (con state OPEN)
> [Sat Jul 29 18:21:42 2023] ceph: mds2 reconnect start
> [Sat Jul 29 18:21:42 2023] ceph: mds2 reconnect start
> [Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect start
> [Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success
> [Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success
> [Sat Jul 29 18:21:43 2023] ceph: mds2 reconnect success
> [Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start
> [Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start
> [Sat Jul 29 18:26:39 2023] ceph: mds2 reconnect start
> [Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success
> [Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success
> [Sat Jul 29 18:26:40 2023] ceph: mds2 reconnect success
> [Sat Jul 29 18:26:49 2023] ceph: update_snap_trace error -22

This is a known bug and we have fixed it on both the kclient and ceph sides:

https://tracker.ceph.com/issues/61200

https://tracker.ceph.com/issues/61217

Thanks

- Xiubo


> [Sat Jul 29 18:26:49 2023] ceph: mds2 recovery completed
> [Sat Jul 29 18:26:49 2023] ceph: mds2 recovery completed
> 

[ceph-users] Ceph bucket notification events stop working

2023-08-08 Thread daniel . yordanov1
Hello, 

We started to use the Ceph bucket notification events with subscription to an 
HTTP endpoint. 
We encountered an issue when the receiver endpoint was changed, which meant 
that the events from Ceph were no longer being consumed. 
topic, and created a new topic with the new endpoint and new bucket 
notifications. 
(We are using the REST API to create bucket notifications and topics. We also 
used the CLI commands, but there we found out that deleting a topic doesn't 
delete the notifications that are subscribed to it.  Ceph version is Pacific.)
From that moment we didn't receive any more notification events to our new 
endpoint. 
We tried many times to create new topics and new bucket notifications, but we 
don't receive any more events to our endpoint. 
We suspect that the notification queues don't get fully cleaned and they stay 
in some broken state. 
We have been able to reproduce this locally and the only solution was to wipe 
all the containers and recreate them. The problem is that this issue is in a 
staging environment where we cannot destroy everything. 
We are looking for a solution or a command to clean the notification queues, to 
be able to start anew. 
We are also looking for a way to know programmatically if the notifications 
have broken, and a way to recover automatically, as such a flaw is critical 
for our application. 
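
For context, this is roughly what we tried on the topic side (plus a guess 
at where to look for leftover queue state - the pool name assumes a default 
zone, so it may well be wrong for other setups):

    # list and remove topics - this is what we did; it does not seem to
    # clean up the underlying queues
    radosgw-admin topic list
    radosgw-admin topic rm --topic our-topic-name   # placeholder topic name
    # is this the right way to look for leftover persistent queue objects?
    rados -p default.rgw.log ls --all | grep -i our-topic-name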

Thanks for your time!
Daniel Yordanov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io