[ceph-users] Re: MDS_DAMAGE in 17.2.7 / Cannot delete affected files
Hi Patrick,

On 30.11.23 03:58, Patrick Donnelly wrote:
> I've not yet fully reviewed the logs but it seems there is a bug in the
> detection logic which causes a spurious abort. This does not appear to be
> actually new damage.

We are accessing the metadata (read-only) daily. The issue only popped up after updating to 17.2.7. Of course, this does not mean that there was no damage there before, only that it was not detected.

> Are you using postgres?

Not on top of CephFS, no. We do use postgres on some RBD volumes.

> If you can share details about your snapshot workflow and general
> workloads that would be helpful (privately if desired).

Our CephFS root looks like this:

/archive
/homes
/no-snapshot
/other-snapshot
/scratch

We are running snapshots on /homes and /other-snapshot with the same schedule. We mount the filesystem with a kernel client on one of the Ceph hosts (not running the MDS) and mkdir / rmdir as needed:

- daily between 06:00 and 19:45 UTC (inclusive): create a snapshot every 15 minutes; one hour later, delete it unless it is an hourly (xx:00) one
- daily on the full hour: create a snapshot, delete the 24 hours old snapshot unless it is the midnight one
- daily at midnight: delete the snapshot from 14 days ago unless it is from a Sunday
- every Sunday at midnight: delete the snapshot from 8 weeks ago

Workload is two main Samba servers (one only sharing a subdirectory which is generally not accessed on the other). Client access to those servers is limited to 1 GBit/s each. Until Tuesday, we also had a mail server with Dovecot running on top of CephFS. This was migrated on Tuesday to an RBD volume, as we had some issues with hanging access to some files / directories (interestingly only in the main tree; in snapshots access was without issue).
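The quarter-hour part of that rotation can be sketched roughly as follows. This is a hypothetical sketch, not the actual script from this thread: SNAPDIR, the timestamp format, and the GNU date usage are assumptions, and the temp-dir default stands in for the real .snap directory so the sketch runs without a CephFS mount.

```shell
#!/bin/sh
# Sketch of the 15-minute snapshot rotation: create a snapshot now,
# delete the one from an hour ago unless it is an hourly (xx:00) one.

# Hourly (xx:00) snapshots survive the one-hour-later cleanup.
keep_hourly() {
    case "$1" in
        *-??00) echo keep ;;
        *)      echo delete ;;
    esac
}

SNAPDIR="${SNAPDIR:-$(mktemp -d)}"      # stand-in for /cephfs/.../.snap
NOW=$(date -u +%Y-%m-%d-%H%M)

# In CephFS, mkdir inside .snap creates a snapshot; rmdir deletes it.
mkdir -p "$SNAPDIR/$NOW"

HOUR_AGO=$(date -u -d '1 hour ago' +%Y-%m-%d-%H%M)
if [ "$(keep_hourly "$HOUR_AGO")" = delete ] && [ -d "$SNAPDIR/$HOUR_AGO" ]; then
    rmdir "$SNAPDIR/$HOUR_AGO"
fi
```

The daily and weekly expiries would follow the same pattern with e.g. `date -u -d '14 days ago'` and a weekday check.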
Additionally, we have a Nextcloud instance with ~200 active users storing data in CephFS, as well as some other kernel clients with little / sporadic traffic, some running Samba, some NFS, some interactive SSH / x2go servers with direct user access, some specialised web applications (notably OMERO).

We run daily incremental backups of most of the CephFS content with Bareos running on a dedicated server which has the whole CephFS tree mounted read-only. For most data a full backup is performed every two months, for some data only every six months. The affected area is contained in this "every six months" full backup portion of the file system tree.

Two weeks ago we deleted a folder structure with 6 TB, average file size in the range of 1 GB. The structure was under /other-snapshot as well. This led to severe load on the MDS, especially starting midnight. In conjunction with the Ubuntu kernel mount, we also had issues with non-released capabilities preventing read access to the /other-snapshot part. To combat these lingering problems, we deleted all snapshots in /other-snapshot, which led to half a dozen PGs stuck in snaptrim state (and a few hundred in snaptrim_wait). Updating from 17.2.6 to 17.2.7 solved that issue quickly; the affected PGs became unstuck and the whole cluster was in active+clean a few hours later.

>> For now, I'll hold off on running first-damage.py to try to remove the
>> affected files / inodes. Ultimately however, this seems to be the most
>> sensible solution to me, at least with regards to cluster downtime.
>
> Please give me another day to review then feel free to use
> first-damage.py to cleanup. If you see new damage please upload the logs.

We are in no hurry and will probably run first-damage.py sometime next week. I will report new damage if it comes in.

Cheers
Sebastian

--
Dr. Sebastian Knust | Bielefeld University
IT Administrator    | Faculty of Physics
Office: D2-110      | Universitätsstr. 25
Phone: +49 521 106 5234 | 33615 Bielefeld

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] Re: MDS_DAMAGE in 17.2.7 / Cannot delete affected files
Hello Patrick,

On 27.11.23 19:05, Patrick Donnelly wrote:
> I would **really** love to see the debug logs from the MDS. Please
> upload them using ceph-post-file [1]. If you can reliably reproduce,
> turn on more debugging:
>
>   ceph config set mds debug_mds 20
>   ceph config set mds debug_ms 1
>
> [1] https://docs.ceph.com/en/reef/man/8/ceph-post-file/

Uploaded debug log and core dump, see ceph-post-file: 02f78445-7136-44c9-a362-410de37a0b7d

Unfortunately, we cannot easily shut down normal access to the cluster for these tests, therefore there is quite some clutter in the logs. The logs show three crashes, the last one with core dumping enabled (ulimits set to unlimited).

A note on reproducibility: To recreate the crash, reading the contents of the file prior to removal seems necessary. Simply calling stat on the file and then performing the removal also yields an Input/output error but does not crash the MDS. Interestingly, the MDS_DAMAGE flag is reset on restart of the MDS and only comes back once the files in question are accessed (a stat call is sufficient).

For now, I'll hold off on running first-damage.py to try to remove the affected files / inodes. Ultimately however, this seems to be the most sensible solution to me, at least with regards to cluster downtime.

Cheers
Sebastian
[ceph-users] MDS_DAMAGE in 17.2.7 / Cannot delete affected files
Hi,

After updating from 17.2.6 to 17.2.7 with cephadm, our cluster went into MDS_DAMAGE state. We had some prior issues with faulty kernel clients not releasing capabilities, therefore the update might just be a coincidence. `ceph tell mds.cephfs:0 damage ls` lists 56 affected files, all with these general details:

{
    "damage_type": "dentry",
    "id": 123456,
    "ino": 1234567890,
    "frag": "*",
    "dname": "some-filename.ext",
    "snap_id": "head",
    "path": "/full/path/to/file"
}

The behaviour upon trying to access file information in the (kernel mounted) filesystem is a bit inconsistent. Generally, the first `stat` call seems to result in "Input/output error"; the next call provides all `stat` data as expected from an undamaged file. The file can be read with `cat` with full and correct content (verified with backup) once the stat call succeeds. Scrubbing the affected subdirectories with `ceph tell mds.cephfs:0 scrub start /path/to/dir/ recursive,repair,force` does not fix the issue. Trying to delete the file results in an "Input/output error".
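For reference, the affected paths can be pulled out of the `damage ls` JSON for a later cleanup pass. A sketch using canned JSON matching the structure above, so it runs without a cluster; on a live system you would substitute the `ceph tell` call:

```shell
# Canned sample matching the damage ls entry structure; on a real
# cluster use:  DAMAGE_JSON=$(ceph tell mds.cephfs:0 damage ls)
DAMAGE_JSON='[{"damage_type": "dentry", "id": 123456, "ino": 1234567890,
               "frag": "*", "dname": "some-filename.ext",
               "snap_id": "head", "path": "/full/path/to/file"}]'

# Extract the path of every dentry-damage entry.
PATHS=$(echo "$DAMAGE_JSON" | python3 -c '
import json, sys
for entry in json.load(sys.stdin):
    if entry["damage_type"] == "dentry":
        print(entry["path"])
')
echo "$PATHS"
```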
If the stat calls beforehand succeeded, this also crashes the active MDS with these messages in the system journal:

Nov 24 14:21:15 iceph-18.servernet ceph-mds[1946861]: mds.0.cache.den(0x10012271195 DisplaySettings.json) newly corrupt dentry to be committed: [dentry #0x1/homes/huser/d3data/transfer/hortkrass/FLIMSIM/2023-04-12-irf-characterization/2-qwp-no-extra-filter-pc-off-tirf-94-tirf-cursor/DisplaySettings.json [1000275c4a0,head] auth (dversion lock) pv=0 v=225 ino=0x10012271197 state=1073741824 | inodepin=1 0x56413e1e2780]
Nov 24 14:21:15 iceph-18.servernet ceph-mds[1946861]: log_channel(cluster) log [ERR] : MDS abort because newly corrupt dentry to be committed: [dentry #0x1/homes/huser/d3data/transfer/hortkrass/FLIMSIM/2023-04-12-irf-characterization/2-qwp-no-extra-filter-pc-off-tirf-94-tirf-cursor/DisplaySettings.json [1000275c4a0,head] auth (dversion lock) pv=0 v=225 ino=0x10012271197 state=1073741824 | inodepin=1 0x56413e1e2780]
Nov 24 14:21:15 iceph-18.servernet ceph-eafd0514-3644-11eb-bc6a-3cecef2330fa-mds-cephfs-iceph-18-ujfqnd[1946838]: 2023-11-24T13:21:15.654+ 7f3fdcde0700 -1 mds.0.cache.den(0x10012271195 DisplaySettings.json) newly corrupt dentry to be committed: [dentry #0x1/homes/huser/d3data/transfer/hortkrass/FLIMSIM/2023-04-12-irf-characterization/2-qwp-no-extra-filter-pc-off-tirf-94-tirf-cursor/DisplaySettings.json [1000275c4a0,head] auth (dversion lock) pv=0 v=225 ino=0x1001>
Nov 24 14:21:15 iceph-18.servernet ceph-eafd0514-3644-11eb-bc6a-3cecef2330fa-mds-cephfs-iceph-18-ujfqnd[1946838]: 2023-11-24T13:21:15.654+ 7f3fdcde0700 -1 log_channel(cluster) log [ERR] : MDS abort because newly corrupt dentry to be committed: [dentry #0x1/homes/huser/d3data/transfer/hortkrass/FLIMSIM/2023-04-12-irf-characterization/2-qwp-no-extra-filter-pc-off-tirf-94-tirf-cursor/DisplaySettings.json [1000275c4a0,head] auth (dversion lock) pv=0 v=225 ino=0x10012>
Nov 24 14:21:15 iceph-18.servernet ceph-eafd0514-3644-11eb-bc6a-3cecef2330fa-mds-cephfs-iceph-18-ujfqnd[1946838]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/mds/MDSRank.cc: In function 'void MDSRank::abort(std::string_view)' thread 7f3fdcde0700 time 2023-11-24T13:21:15.655088+
Nov 24 14:21:15 iceph-18.servernet ceph-mds[1946861]: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/mds/MDSRank.cc: In function 'void MDSRank::abort(std::string_view)' thread 7f3fdcde0700 time 2023-11-24T13:21:15.655088+

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.7/rpm/el8/BUILD/ceph-17.2.7/src/mds/MDSRank.cc: 937: ceph_abort_msg("abort() called")
ceph version 17.2.7 (b12291d110049b2f35e32e0de30d70e9a4c060d2) quincy (stable)
 1: (ceph::__ceph_abort(char const*, int, char const*, std::__cxx11::basic_string, std::allocator > const&)+0xd7) [0x7f3fe5a1cb03]
 2: (MDSRank::abort(std::basic_string_view >)+0x7d) [0x5640f2e6fa2d]
 3: (CDentry::check_corruption(bool)+0x740) [0x5640f30e4820]
 4: (EMetaBlob::add_primary_dentry(EMetaBlob::dirlump&, CDentry*, CInode*, unsigned char)+0x47) [0x5640f2f41877]
 5: (EOpen::add_clean_in
[ceph-users] Re: Centos 7 Kernel clients on ceph Quincy -- experiences??
Hi Christoph,

I am able to reproducibly kernel panic CentOS 7 clients with the native kernel (3.10.0-1160.76.1.el7) when accessing CephFS snapshots via SMB with vfs_shadow_copy2. This occurs on a Pacific cluster. IIRC accessing the snapshots on the server also led to a kernel panic, but I'm not sure. Running a mainline kernel from elrepo prevents this issue.

I imagine that you might run into these issues with a Quincy cluster as well, if you are using CephFS snapshots at all.

Cheers
Sebastian

On 20.09.22 13:34, Ackermann, Christoph wrote:
> Hello all, i would like to upgrade our well running Rocky 8.6 based bare
> metal cluster from Octopus to Quincy next few days. But there are some
> CentOS 7 kernel based clients mapping RBDs or mounting CephFS in our
> environment. Is there someone here who can confirm CentOS 7 clients
> (3.10.0-1160.76.1.el7.x86_64) working with Quincy?
> Best regards, Christoph
[ceph-users] Re: CephFS snapshots with samba shadowcopy
Hi,

I am providing CephFS snapshots via Samba with the shadow_copy2 VFS object. I am running CentOS 7 with smbd 4.10.16, for which ceph_snapshots is not available AFAIK. Snapshots are created by a cronjob above the root of my shares with

export TZ=GMT
mkdir /cephfs/path/.snap/`date +@GMT-%Y.%m.%d-%H.%M.%S`

i.e. the exported shares are subfolders of the folder in which I create snapshots. Samba configuration is:

[global]
...
shadow:snapdir = .snap
shadow:snapdirseverywhere = yes
shadow:format = _@GMT-%Y.%m.%d-%H.%M.%S_some-inode-number
...

[sharename]
...
path = /cephfs/path_to_main_root/share
vfs object = shadow_copy2
...

[other_share_with_different_root]
...
path = /cephfs/path_to_different_root/other_share
vfs object = shadow_copy2
shadow:format = _@GMT-%Y.%m.%d-%H.%M.%S_other-inode-number

The inode numbers in the configuration are of course the inode numbers of the directory containing the snapshots.

Cheers
Sebastian

On 13.07.22 02:08, Bailey Allison wrote:
> Hi All, Curious if anyone is making use of samba shadowcopy with CephFS
> snapshots using the vfs object ceph_snapshots? I've had wildly different
> results on an Ubuntu 20.04 LTS samba server where the snaps just do not
> appear at all within shadowcopy, and a Rocky Linux samba server where the
> snaps do appear within shadowcopy but when opening them they contain
> absolutely no files at all. Both the Ubuntu and Rocky samba server are
> sharing out a kernel cephfs mount via samba, ceph version is 17.2.1 and
> samba version is 4.13.7 for Ubuntu 20.04 and 4.15.5 for Rocky Linux. I
> have also tried using a samba fuse mount with vfs_ceph with the same
> results. More so just curious to see if anyone on the list has had
> success with making use of the ceph_snapshots vfs object and if they can
> share how it has worked for them.
> Included below is the share config for both Ubuntu and Rocky if anyone
> is curious:
>
> Ubuntu 20.04 LTS
> [public]
> force group = nogroup
> force user = nobody
> guest ok = Yes
> path = /mnt/cephfs/public
> read only = No
> vfs objects = ceph_snapshots
>
> Rocky Linux
> [public]
> force group = nogroup
> force user = nobody
> guest ok = Yes
> path = /mnt/cephfs/public
> read only = No
> vfs objects = ceph_snapshots
>
> Regards, Bailey
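The @GMT naming used by the cronjob in Sebastian's reply can be checked in isolation. A minimal sketch; the CephFS path is a placeholder and the mkdir is commented out so it runs without a CephFS mount:

```shell
# Generate a snapshot name in the @GMT format that shadow_copy2 parses.
export TZ=GMT
SNAPNAME=$(date "+@GMT-%Y.%m.%d-%H.%M.%S")
echo "$SNAPNAME"
# On the real filesystem (placeholder path):
# mkdir "/cephfs/path/.snap/$SNAPNAME"
```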
[ceph-users] Re: cephfs quota used
Hi Jesper,

On 16.12.21 12:45, Jesper Lykkegaard Karlsen wrote:
> Now, I want to access the usage information of folders with quotas from
> root level of the cephfs. I have failed to find this information through
> getfattr commands, only quota limits are shown here, and the du-command
> on individual folders is a suboptimal solution.

`getfattr -n ceph.quota.max_bytes /path` gives the specified quota for a given path. `getfattr -n ceph.dir.rbytes /path` gives the size of the path, as you would usually get with du for conventional file systems.

As an example, I am using this script for weekly utilisation reports:

for i in /ceph-path-to-home-dirs/*; do
    if [ -d "$i" ]; then
        SIZE=$(getfattr -n ceph.dir.rbytes --only-values "$i")
        QUOTA=$(getfattr -n ceph.quota.max_bytes --only-values "$i" 2>/dev/null || echo 0)
        PERC=$(echo $SIZE*100/$QUOTA | bc 2> /dev/null)
        if [ -z "$PERC" ]; then PERC="--"; fi
        printf "%-30s %8s %8s %8s%%\n" "$i" `numfmt --to=iec $SIZE` `numfmt --to=iec $QUOTA` $PERC
    fi
done

Note that you can also mount CephFS with the "rbytes" mount option. IIRC the fuse client defaults to it; for the kernel client you have to specify it in the mount command or fstab entry. The rbytes option returns the recursive path size (so the ceph.dir.rbytes fattr) in stat calls to directories, so you will see it with ls immediately. I really like it! Just beware that some software might have issues with this behaviour - alpine is the only example that I know of (bug report and patch proposal have been submitted).

Cheers
Sebastian
[ceph-users] Re: cephfs kernel client + snapshots slowness
[3199649.882973] ceph: queue_realm_cap_snaps d313f1e4 1001af8c8e7 inodes
[3199649.882974] ceph: queue_cap_snap 171f11a7 nothing dirty|writing
[3199649.882975] ceph: queue_cap_snap 938b9cd2 nothing dirty|writing
[3199649.882976] ceph: queue_cap_snap 615cf4dd nothing dirty|writing
[3199649.882977] ceph: queue_cap_snap 0027e295 nothing dirty|writing
[3199649.882979] ceph: queue_cap_snap ba18b2f8 nothing dirty|writing
[3199649.882980] ceph: queue_cap_snap 7c9c80de nothing dirty|writing
[3199649.882981] ceph: queue_cap_snap 629b4b0e nothing dirty|writing
[3199649.882982] ceph: queue_cap_snap ab330b37 nothing dirty|writing
[3199649.882983] ceph: queue_cap_snap c7dbc320 nothing dirty|writing
[3199649.882985] ceph: queue_cap_snap 70a0598f nothing dirty|writing
[3199649.882986] ceph: queue_cap_snap 915b9e2e nothing dirty|writing
... (and a lot lot more of these) ...

At this point the client has about a million caps (running up against the default cap limit) - so potentially this loop is over all the caps (?), which could mean tens/hundreds of milliseconds. Indeed, reducing mds_max_caps_per_client by an order of magnitude does improve the lstat times by about an order of magnitude (which is still pretty slow - but supports this hypothesis).

The Ceph cluster is Nautilus 14.2.20. There are a total of 7 snapshots in cephfs, all taken at the root of the cephfs tree (a rolling set of 7 previous daily snapshots). I've tested this with a few kernels, two LTS ones and one more recent stable one: 5.4.114, 5.10.73 and 5.14.16, with the same result.

Any ideas/suggestions?

Andras
[ceph-users] Re: OSD repeatedly marked down
Hi Jan,

On 01.12.21 17:31, Jan Kasprzak wrote:
> In "ceph -s", the "2 osds down" message disappears, and the number of
> degraded objects steadily decreases. However, after some time the number
> of degraded objects starts going up and down again, and osds appear to
> be down (and then up again). After 5 minutes the OSDs are kicked out
> from the cluster, and the ceph-osd daemons stop
>
> Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 7f8c38e02700 -1 received signal: Interrupt from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
> Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 7f8c38e02700 -1 osd.32 1119559 *** Got signal Interrupt ***
> Dec 01 17:18:07 my.osd.host ceph-osd[3818]: 2021-12-01T17:18:07.626+0100 7f8c38e02700 -1 osd.32 1119559 *** Immediate shutdown (osd_fast_shutdown=true) ***

Do you have enough memory on your host? You might want to look for OOM messages in dmesg / journal and monitor your memory usage throughout the recovery.

If the osd processes are indeed killed by the OOM killer, you have a few options. Adding more memory would probably be best to future-proof the system. Maybe you could also work with some Ceph config settings, e.g. lowering osd_max_backfills (although I'm definitely not an expert on which parameters would give you the best result). Adding swap will most likely only produce other issues, but might be a method of last resort.

Cheers
Sebastian
[ceph-users] Re: Kworker 100% with ceph-msgr (after upgrade to 14.2.6?)
Hi,

I too am still suffering from the same issue (snapshots lead to 100% ceph-msgr usage on the client during metadata-intensive operations like backup and rsync) and had previously reported it to this list. This issue is also tracked at https://tracker.ceph.com/issues/44100

My current observations:
- approx. 20 total snapshots in the filesystem are sufficient to reliably cause the issue
- in my observation there is no linear relationship between slowdown and number of snapshots. Once you reach a critical snapshot number (which might actually be 1, I have not tested this extensively) and perform the necessary operations to induce the error (for me, Bareos backups are a reliable reproducer), metadata operations on that client grind to a near-halt
- memory on the MDS is not a limiting / causing factor: I now have a dedicated MDS server with 160 GB memory and adjusted mds_cache_memory_limit accordingly, and saw the issue occurring at 30 GB MDS memory usage
- fuse mounts don't show the issue but are much slower on metadata operations overall and therefore not a solution for daily backups, as they slow down the backup too much

I'm running Ceph Octopus 15.2.13 on CentOS 8. The client is CentOS 8 with elrepo kernel 5.12. My workaround is to not use cephfs snapshots at all, although I really would like to use them.

Cheers
Sebastian

On 07.09.21 14:12, Frank Schilder wrote:
> Hi Marc, did you ever get a proper solution for this problem? We are
> having exactly the same issue: having snapshots on a file system leads
> to incredible performance degradation. I'm reporting some observations
> here (latest reply):
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/HKEBXXRMX5WA5Y6JFM34WFPMWTCMPFCG/#6S5GTKGGBI2Y3QE4E5XJJY2KSSLLX64H
> The problem is almost certainly that the ceph kernel client executes
> ceph_update_snap_trace over and over again over the exact same data.
> I see that the execution time of ceph fs IO increases roughly with the
> number of snapshots present: N snapshots means ~N times slower. I'm
> testing this on kernel version 5.9.9-1.el7.elrepo.x86_64. It is even
> worse on older kernels.
> Best regards,
> Frank Schilder
[ceph-users] Re: CephFS Octopus mv: Invalid cross-device link [Errno 18] / slow move
Hi Luís,

On 18.08.2021 at 19:02, Luis Henriques wrote:
> Sebastian Knust writes:
>> Hi,
>>
>> I am running a Ceph Octopus (15.2.13) cluster mainly for CephFS. Moving
>> (with mv) a large directory (mail server backup, so a few million small
>> files) within the cluster takes multiple days, even though both source
>> and destination share the same (default) file layout and - at least on
>> the client I am performing the move on - are located within the same
>> mount point.
>>
>> I also see that the move is done by recursive copying and later
>> deletion, as I would only expect between different file systems / mount
>> points.
>
> A reason for that to happen could be the usage of quotas in the
> filesystem. If you have quotas set in any of the source or destination
> hierarchies the rename(2) syscall will fail with -EXDEV (the "Invalid
> cross-device link" error). And I guess that 'mv' will then revert to
> the less efficient recursive copy.
>
> A possible solution would be to temporarily remove the quotas
> (i.e. setting them to '0'), and setting them back after the rename.
>
> Cheers,

That's it! Setting the quota temporarily to 0 allows for an immediate move by rename. Thanks a lot.

Cheers
Sebastian
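That workaround can be sketched as a small shell helper. This is hypothetical, not from this thread: the ceph.quota.max_bytes vxattr is as documented for CephFS quotas, everything else (paths, helper name) is a placeholder, and it assumes getfattr/setfattr from the attr package on a CephFS mount.

```shell
# Temporarily clear the quota on the source hierarchy, rename, restore.
move_without_quota() {
    src_root=$1 src=$2 dst=$3
    # remember the current quota, then lift it (0 = no quota)
    quota=$(getfattr -n ceph.quota.max_bytes --only-values "$src_root")
    setfattr -n ceph.quota.max_bytes -v 0 "$src_root"
    mv "$src" "$dst"                 # now a plain rename(2), not copy+delete
    setfattr -n ceph.quota.max_bytes -v "$quota" "$src_root"
}
# e.g.: move_without_quota /mnt/cephfs/src /mnt/cephfs/src/foo /mnt/cephfs/dest/foo
```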
[ceph-users] CephFS Octopus mv: Invalid cross-device link [Errno 18] / slow move
Hi,

I am running a Ceph Octopus (15.2.13) cluster mainly for CephFS. Moving (with mv) a large directory (mail server backup, so a few million small files) within the cluster takes multiple days, even though both source and destination share the same (default) file layout and - at least on the client I am performing the move on - are located within the same mount point.

I also see that the move is done by recursive copying and later deletion, as I would only expect between different file systems / mount points.

Checking with cephfs-shell (16.2.5), the move fails with the "Invalid cross-device link [Errno 18]" error. However, stat shows the same device ID for source and destination:

CephFS:~/>>> mv /source/foo /dest/foo
cephfs.OSError: error in rename /source/foo to /dest/foo: Invalid cross-device link [Errno 18]
CephFS:~/>>> stat /source/foo
Device: 18446744073709551614  Inode: 1099620656366
CephFS:~/>>> stat /dest/
Device: 18446744073709551614  Inode: 1099570814227

Full output at https://pastebin.com/9V6FZ6hP

Any ideas why this happens? The /source was originally created by ceph fs subvolume create ..., however I was not using the volume/subvolume features and reorganised the data - the directory inode is still the same.

Cheers
Sebastian
[ceph-users] Re: Docker container snapshots accumulate until disk full failure?
Dear Harry,

`docker image prune -a` removes all dangling images as well as all images not referenced by any running container. I successfully used it in my setups to remove old versions. In RHEL/CentOS, podman is used and thus you should use `podman image prune -a` instead.

HTH,
Cheers
Sebastian

On 11.08.21 15:35, Harry G. Coin wrote:
> Does ceph remove container subvolumes holding previous revisions of
> daemon images after upgrades? I have a couple servers using btrfs to
> hold the containers. The number of docker related sub-volumes just
> keeps growing, way beyond the number of daemons running. If I ignore
> this, I'll get disk-full related system failures. Is there a command
> to 'erase all non-live docker image subvolumes'? Or a way to at least
> get a list of what I need to delete manually ( !! )
> Thanks
> Harry Coin
[ceph-users] Wrong hostnames in "ceph mgr services" (Octopus)
Hi,

After upgrading from 15.2.8 to 15.2.13 with cephadm on CentOS 8 (containerised installation done by cephadm), Grafana no longer shows new data. Additionally, when accessing the Dashboard-URL on a host currently not hosting the dashboard, I am redirected to a wrong hostname (as shown in ceph mgr services). I assume that this is caused by the same reason which leads to this output of `ceph mgr services`:

{
    "dashboard": "https://ceph--mgr.iceph-11.tsmsqs:8443/",
    "prometheus": "http://ceph--mgr.iceph-11.tsmsqs:9283/"
}

The correct hostname is iceph-11 (without the tsmsqs part), FQDN is iceph-11.servernet. The hosts use DNS; the names (iceph-11 and iceph-11.servernet) are resolvable both from the hosts as well as from within the Podman containers.

I have determined that podman by default sets the container name as a hostname alias (visible with `hostname -a` within the container), which somehow leads to the Ceph mgr picking it up as the primary name.

My workaround is to modify /var/lib/ceph//mgr../unit.run, adding --no-hosts as an additional argument to the "podman run" command. I could probably use a system-wide containers.conf as well. With this workaround and after restarting the Ceph mgr container (via systemctl) and then restarting Prometheus and Grafana (with ceph orch redeploy), I once again get data in Grafana and the correct redirect for the dashboard. `ceph mgr services` also shows the expected and correct values.

I am wondering if this kind of issue is known or whether there is something wrong with my setup. I expected the Ceph mgr to use the primary hostname and not some seemingly random hostname alias. Maybe this issue can also be discussed in a troubleshooting section of the monitoring stack documentation.

Cheers
Sebastian
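The system-wide containers.conf alternative mentioned above would look roughly like this. A sketch based on containers.conf(5), not verified against this exact setup:

```ini
# /etc/containers/containers.conf (system-wide podman configuration)
[containers]
# Do not generate /etc/hosts entries for the container
# (same effect as passing --no-hosts to podman run)
no_hosts = true
```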
[ceph-users] Re: OT: How to Build a poor man's storage with ceph
Hi Michael,

On 08.06.21 11:38, Ml Ml wrote:
> Now i was asked if i could also build a cheap 200-500TB Cluster Storage,
> which should also scale. Just for Data Storage such as
> NextCloud/OwnCloud.

With similar requirements (server primarily for Samba and NextCloud, some RBD use, very limited budget) I am using HDD for data and SSD for system and CephFS metadata. Note that I am running NextCloud on CephFS storage. If you want to go with RGW/S3 as a storage backend instead, the following might not apply to your use case.

My nodes (bought end of 2020) are:
- 2U chassis with 12 3.5" SATA slots
- Intel Xeon Silver 4208
- 128 GB RAM
- 2 x 480 GB Samsung PM883 SSD
  -> 50 GB in MD-RAID1 for system
  -> 430 GB OSD (one per SSD)
- initially 6 x 14 TB Enterprise HDD
- 4 x 10 GBase-T (active/passive bonded, dedicated backend network)

Each node with this configuration cost about 4k EUR net at the end of 2020. Due to increasing prices for storage, it will be a bit more expensive now. I am running five nodes now and have added a few more disks (ranging 8-14 TB), nearly filling up the nodes.

My experience so far:
- I had to throttle scrubbing (see below for details)
- For purely NextCloud and Samba use, performance is sufficient for a few hundred concurrent users with a handful of power users
- Migration of the mail server to this cluster was a disaster due to limited IOPS; I had to add some more SSDs and place the mail server in an SSD-only pool.
- The MDS needs a lot of memory for larger CephFS installs; I will move it to a dedicated server probably next year. 128 GB per node works, but I would not recommend any less.
- Rebalancing takes an eternity (2-3 weeks), so make sure that your PG nums are okay from the start
- I have all but given up on snapshots with CephFS due to severe performance degradation with the kernel client during backup

My scrubbing config looks like this:

osd_backfill_scan_max           16
osd_backfill_scan_min           4
osd_deep_scrub_interval         2592000.00
osd_deep_scrub_randomize_ratio  0.03
osd_recovery_max_active_hdd     1
osd_recovery_max_active_ssd     5
osd_recovery_sleep_hdd          0.05
osd_scrub_begin_hour            18
osd_scrub_end_hour              7
osd_scrub_chunk_max             1
osd_scrub_chunk_min             1
osd_scrub_max_interval          2419200.00
osd_scrub_min_interval          172800.00
osd_scrub_sleep                 0.10

My data is in a replicated pool with n=3 without compression. You might also consider EC and then want to aim for more nodes.

Cheers
Sebastian
[ceph-users] Re: Cephfs metadata pool suddenly full (100%) !
Hi Hervé,

On 01.06.21 14:00, Hervé Ballans wrote:
> I'm aware of your points, and maybe I was not really clear in my
> previous email (written in a hurry!) The problematic pool is the
> metadata one. All its OSDs (x3) are full. The associated data pool is
> OK and no OSD is full on the data pool.

Are you saying that you only have 3 OSDs for your metadata pool, which are the full ones? Alright, then you can - at least for this specific issue - disregard my previous comment.

> The problem is that the metadata pool suddenly increases a lot and
> continuously from 3% to 100% in 5 hours (from 5 am to 10 am, then crash)

724 GiB stored in the metadata pool with only 11 TiB cephfs data size does seem huge at first glance. For reference, I have about 160 TiB cephfs data with only 31 GiB stored in the metadata pool. I don't have an explanation for this behaviour, as I am relatively new to Ceph. Maybe the list can chime in?

> And we don't understand the reason, since there was no specific
> activity on the data pool. This cluster has run perfectly with the
> current configuration for many years.

Probably unrelated to your issues: I noticed that the STORED and USED columns in your `ceph df` output are identical. Is that because of Nautilus (I myself am running Octopus, where USED is the expected multiple of STORED depending on replication factor / EC configuration in the pool) or are you running a specific configuration that might cause that?

Cheers
Sebastian
[ceph-users] Re: Cephfs metadata pool suddenly full (100%) !
Hi Hervé,

On 01.06.21 13:15, Hervé Ballans wrote:
> # ceph status
>   cluster:
>     id: 838506b7-e0c6-4022-9e17-2d1cf9458be6
>     health: HEALTH_ERR
>             1 filesystem is degraded
>             3 full osd(s)
>             1 pool(s) full
>             1 daemons have recently crashed

You have full OSDs and therefore a full pool. The "fullness" of a pool is limited by the fullest OSD, i.e. a single full OSD can block your pool. Take a look at `ceph osd df` and you will notice a very non-uniform OSD usage (both in numbers of PGs / size as well as usage %).

>     osd: 126 osds: 126 up (since 5m), 126 in (since 5M)
>     pgs: 1662 active+clean

The PG/OSD ratio seems very low to me. The general recommendation is 100 PGs / OSD post-replication (and a power of 2 for each pool). In my cluster I actually run with ~200 PGs / OSD for my SSDs which contain the cephfs metadata.

> Thanks a lot if you have some ways for trying to solve this...

You have to get your OSDs to rebalance, which probably includes increasing the number of PGs in some pools. Details depend on which Ceph version you are running and your CRUSH rules (maybe your cephfs metadata pool is residing only on NVMe?). Take a look at the balancer module [1] and the autoscaler [2] (`ceph osd pool autoscale-status` is most interesting).

Theoretically, you could (temporarily!) increase the full_ratio. However, this is a very dangerous operation which you should not do unless you know *exactly* what you are doing.

Cheers & best of luck
Sebastian

[1] https://docs.ceph.com/en/latest/rados/operations/balancer/
[2] https://docs.ceph.com/en/latest/rados/operations/placement-groups/
Replace latest in the URIs with your Ceph version string (i.e. octopus, nautilus) for version specific documentation
[ceph-users] Re: XFS on RBD on EC painfully slow
Hi Reed,

To add to this comment by Weiwen: On 28.05.21 13:03, 胡 玮文 wrote:
> Have you tried just starting multiple rsync processes simultaneously to transfer different directories? Distributed systems like Ceph often benefit from more parallelism.

When I migrated from XFS on iSCSI (legacy system, no Ceph) to CephFS a few months ago, I used msrsync [1] and was quite happy with the speed. For your use case, I would start with -p 12 but might experiment with up to -p 24 (as you only have 6C/12T in your CPU). With many small files, you also might want to increase -s from the default 1000.

Note that msrsync does not work with the --delete rsync flag. As I was syncing a live system, I ended up with this workflow:
- Initial sync with msrsync (something like ./msrsync -p 12 --progress --stats --rsync "-aS --numeric-ids" ...)
- Second sync with msrsync (to sync changes made during the first sync)
- Take the old storage off-line for users / read-only
- Final rsync with --delete (i.e. rsync -aS --numeric-ids --delete ...)
- Mount CephFS at the location of the old storage, adjust /etc/exports with fsid entries where necessary, turn the system back on-line / read-write

Cheers
Sebastian

[1] https://github.com/jbd/msrsync
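The workflow above can be sketched as follows; since the real runs need msrsync installed, the runnable part below only illustrates its per-directory parallelism with xargs, using cp as a stand-in (all paths and job counts are placeholders):

```shell
# Real migration commands from the workflow above (paths are placeholders):
#   ./msrsync -p 12 --progress --stats --rsync "-aS --numeric-ids" /old/storage/ /mnt/cephfs/
#   rsync -aS --numeric-ids --delete /old/storage/ /mnt/cephfs/   # final pass
# Runnable illustration of the parallelism idea using only coreutils:
src=$(mktemp -d); dst=$(mktemp -d)
mkdir -p "$src/a" "$src/b"
echo one > "$src/a/f1"; echo two > "$src/b/f2"
# one copy job per top-level directory, up to 4 running in parallel
find "$src" -mindepth 1 -maxdepth 1 -type d -print0 |
  xargs -0 -P 4 -I{} cp -a {} "$dst/"
cat "$dst/a/f1" "$dst/b/f2"   # prints "one" then "two"
```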
[ceph-users] CephFS: side effects of not using ceph-mgr volumes / subvolumes
Hi,

Assuming a cluster (currently Octopus, might upgrade to Pacific once released) serving only CephFS, and that only to a handful of kernel and FUSE clients (no OpenStack, CSI or similar): are there any side effects of not using the ceph-mgr volumes module abstractions [1], namely subvolumes and subvolume groups, that I have to consider?

I would still only mount subtrees of the whole (single) CephFS file system and have some clients which mount multiple disjoint subtrees. Quotas would only be set at the subtree level which I am mounting, likewise file layouts. Snapshots (via mkdir in .snap) would be used at the mounting level or one level above.

Background: I don't require the abstraction features per se. Some restrictions (e.g. subvolume group snapshots not being supported) seem to me to be caused only by the abstraction layer and not the underlying CephFS. For my specific use case I require snapshots at the subvolume group layer. It therefore seems better to forego the abstraction as a whole and work on bare CephFS.

Cheers
Sebastian

[1] https://docs.ceph.com/en/octopus/cephfs/fs-volumes/
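For reference, the bare-CephFS equivalents of the features mentioned above look like this (a sketch; /mnt/cephfs/subtree and the quota value are placeholders, and the commands require a mounted CephFS with a suitably capable client key):

```shell
# Quota on a plain directory instead of a subvolume (value is illustrative):
setfattr -n ceph.quota.max_bytes -v 500000000000 /mnt/cephfs/subtree
# File layout (target pool) on a directory, set before files are created:
setfattr -n ceph.dir.layout.pool -v cephfs.cephfs.data /mnt/cephfs/subtree
# Snapshot at any directory level via the .snap pseudo-directory:
mkdir /mnt/cephfs/subtree/.snap/before-upgrade
rmdir /mnt/cephfs/subtree/.snap/before-upgrade   # deletes the snapshot
```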
[ceph-users] CephFS Octopus snapshots / kworker at 100% / kernel vs. fuse client
Hi,

I am running a Ceph Octopus (15.2.8) cluster primarily for CephFS. Metadata is stored on SSD, data is stored in three different pools on HDD. Currently, I use 22 subvolumes. I am rotating snapshots on 16 subvolumes, all in the same pool, which is the primary data pool for CephFS. Currently I have 41 snapshots per subvolume; the goal is 50 snapshots (see bottom of mail for details). Snapshots are only placed in the root subvolume directory, i.e. /volumes/_nogroup/subvolname/hex-id/.snap

I place the snapshots on one of the nodes: the complete CephFS is mounted, mkdir and rmdir are performed for each relevant subvolume, then CephFS is unmounted again. All PGs are active+clean most of the time, with only a few in snaptrim for 1-2 minutes after snapshot deletion. I therefore assume that snaptrim is not a limiting factor. Obviously, the total number of snapshots is more than the 400 and 100 I see mentioned in some documentation. I am unsure whether that is an issue here, as the snapshots are all in disjoint subvolumes.

When mounting the subvolumes with the kernel client (ranging from the CentOS 7 supplied 3.10 up to 5.4.93), after some time and for some subvolumes the kworker process begins to hog 100% CPU and stat operations become very slow (even slower than with the FUSE client). I can mostly replicate this by starting specific rsync operations (with many small files, e.g. CTAN, CentOS, Debian mirrors) and by running a Bareos backup. The kworker process seems to remain stuck even after terminating the offending operation, i.e. rsync or bareos-fd. Interestingly, I can even trigger these issues on a host that has only a single CephFS subvolume without any snapshots mounted, as long as that subvolume is in the same pool as other subvolumes with snapshots. I don't see any abnormal behaviour on the cluster nodes or on other clients during these kworker hanging phases.

With the FUSE client, in normal operation stat calls are about 10-20x slower than with the kernel client.
However, I don't encounter the extreme slowdown behaviour. I am therefore currently mounting some known-problematic subvolumes with FUSE and non-problematic subvolumes with the kernel client.

My questions are:
- Is this known or expected behaviour?
- I could move the subvolumes with snapshots into a subvolume group and snapshot the whole group instead of each subvolume. Is this likely to solve the issues?
- What is the current recommendation regarding CephFS and the maximum number of snapshots?

Cluster setup:
- 5 nodes with a total of 56 OSDs
- Each node has a Xeon Silver 4208 and 128 GB RAM
- Each node has two 480 GB Samsung PM883 SSDs used for the CephFS metadata pool
- HDDs range from 8 TB to 14 TB, the majority is 14 TB
- 10 GbE internal network and 10 GbE client network, no jumbo frames

$ ceph df
--- RAW STORAGE ---
CLASS  SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd    520 TiB  141 TiB  378 TiB  379 TiB   72.88
ssd    3.9 TiB  3.8 TiB  1.7 GiB  97 GiB    2.46
TOTAL  524 TiB  145 TiB  378 TiB  379 TiB   72.36

--- POOLS ---
POOL                   ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
device_health_metrics   1     1   66 MiB       57  198 MiB      0     23 TiB
cephfs.cephfs.meta      2  1024   26 GiB    2.29M   77 GiB   2.06    1.2 TiB
cephfs.cephfs.data      3  1024   70 TiB   54.95M  213 TiB  75.19     23 TiB
lofar                   4   512   77 TiB   21.41M  154 TiB  68.68     35 TiB
proxmox                 6    64  526 GiB  158.60k  1.6 TiB   2.16     23 TiB
archive                 7    32  7.3 TiB    5.42M   10 TiB  12.57     56 TiB

Snapshots are only on the cephfs.cephfs.data pool.

Intended snapshot rotation:
- 4 quarter-hourly snapshots
- 24 hourly snapshots
- 14 daily snapshots
- 8 weekly snapshots

Cheers
Sebastian
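The intended rotation described above can be expressed as a small classifier that a pruning cron job might use (a hypothetical sketch, not the actual script; the argument convention with Mon=1 for the day of week is an assumption):

```shell
# Decide which rotation tier a snapshot timestamp belongs to.
# Args: hour (00-23), minute (00-59), ISO day of week (Mon=1 .. Sun=7).
tier() {
  hh=$1; mm=$2; dow=$3
  if   [ "$mm" != "00" ]; then echo quarter-hourly   # keep 4  (1 hour)
  elif [ "$hh" != "00" ]; then echo hourly           # keep 24 (1 day)
  elif [ "$dow" != "7" ]; then echo daily            # keep 14 (2 weeks)
  else                         echo weekly           # keep 8  (8 weeks)
  fi
}
tier 13 15 3   # prints "quarter-hourly"
tier 00 00 7   # prints "weekly"
```

A cron job would classify each snapshot name under .snap this way and rmdir those older than their tier's retention.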