At 01:27 this morning I received the first email about the MDS cache being too large (the alert mail repeats every 15 minutes while the problem persists). Looking into it, it was again a standby-replay daemon that had stopped working.

At 01:00 a few rsync processes started in parallel on a client machine, copying the latest changes from an NFS share to the CephFS share (we want to switch to CephFS in the near future).
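The sync job is essentially a handful of rsyncs started in parallel, roughly like this (directory names and options here are simplified examples, not the exact job):

# one rsync per top-level directory, all running in parallel
for dir in app1 app2 app3; do
    rsync -a --delete /mnt/nfs/$dir/ /mnt/cephfs/$dir/ &
done
wait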
This crash of the standby-replay MDS has happened a couple of times now, so I think it would be good to get some help. Where should I look next?

Some CephFS information
----------------------------------
# ceph fs status
atlassian-opl - 8 clients
=============
RANK  STATE           MDS                        ACTIVITY      DNS   INOS  DIRS  CAPS
 0    active          atlassian-opl.mds5.zsxfep  Reqs: 0 /s   7830   7803   635  3706
 0-s  standby-replay  atlassian-opl.mds6.svvuii  Evts: 0 /s   3139   1924   461     0

POOL                       TYPE      USED   AVAIL
cephfs.atlassian-opl.meta  metadata  2186M  1161G
cephfs.atlassian-opl.data  data      23.0G  1161G

atlassian-prod - 12 clients
==============
RANK  STATE           MDS                         ACTIVITY      DNS    INOS   DIRS  CAPS
 0    active          atlassian-prod.mds1.msydxf  Reqs: 0 /s   2703k  2703k   905k  1585
 1    active          atlassian-prod.mds2.oappgu  Reqs: 0 /s    961k   961k   317k   622
 2    active          atlassian-prod.mds3.yvkjsi  Reqs: 0 /s   2083k  2083k   670k   443
 0-s  standby-replay  atlassian-prod.mds4.qlvypn  Evts: 0 /s    352k   352k   102k     0
 1-s  standby-replay  atlassian-prod.mds5.egsdfl  Evts: 0 /s    873k   873k   277k     0
 2-s  standby-replay  atlassian-prod.mds6.ghonso  Evts: 0 /s   2317k  2316k   679k     0

POOL                        TYPE      USED   AVAIL
cephfs.atlassian-prod.meta  metadata  58.8G  1161G
cephfs.atlassian-prod.data  data      5492G  1161G

MDS version: ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)

When looking at the log on the MDS server, I see the following:

2023-07-21T01:21:01.942+0000 7f668a5e0700 -1 received signal: Hangup from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
2023-07-21T01:23:13.856+0000 7f6688ddd700  1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5671 from mon.1
2023-07-21T01:23:18.369+0000 7f6688ddd700  1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5672 from mon.1
2023-07-21T01:23:31.719+0000 7f6688ddd700  1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5673 from mon.1
2023-07-21T01:23:35.769+0000 7f6688ddd700  1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5674 from mon.1
2023-07-21T01:28:23.764+0000 7f6688ddd700  1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5675 from mon.1
2023-07-21T01:29:13.657+0000 7f6688ddd700  1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5676 from mon.1
2023-07-21T01:33:43.886+0000 7f6688ddd700  1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5677 from mon.1
(and another 20 lines about updating the MDS map)

Alert mailings:

Mail at 01:27
----------------------------------
HEALTH_WARN

--- New ---
[WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
    mds.atlassian-prod.mds4.qlvypn(mds.0): MDS cache is too large (13GB/9GB); 0 inodes in use by clients, 0 stray files

=== Full health status ===
[WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
    mds.atlassian-prod.mds4.qlvypn(mds.0): MDS cache is too large (13GB/9GB); 0 inodes in use by clients, 0 stray files

Mail at 03:27
----------------------------------
HEALTH_OK

--- Cleared ---
[WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
    mds.atlassian-prod.mds4.qlvypn(mds.0): MDS cache is too large (14GB/9GB); 0 inodes in use by clients, 0 stray files

=== Full health status ===

Mail at 04:12
----------------------------------
HEALTH_WARN

--- New ---
[WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
    mds.atlassian-prod.mds4.qlvypn(mds.0): MDS cache is too large (15GB/9GB); 0 inodes in use by clients, 0 stray files

=== Full health status ===
[WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
    mds.atlassian-prod.mds4.qlvypn(mds.0): MDS cache is too large (15GB/9GB); 0 inodes in use by clients, 0 stray files
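So far I'm checking the cache configuration and the affected daemon along these lines (MDS_CACHE_OVERSIZED fires once cache usage exceeds mds_cache_memory_limit by the mds_health_cache_threshold factor, 1.5 by default; the daemon name below is the one from the alerts and may need adjusting to the local daemon name):

# configured cache limit and warning threshold
ceph config get mds mds_cache_memory_limit
ceph config get mds mds_health_cache_threshold

# cache usage of the affected standby-replay daemon (run on its host)
ceph daemon mds.atlassian-prod.mds4.qlvypn cache status

# possible workaround until this is understood: disable standby-replay
ceph fs set atlassian-prod allow_standby_replay false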
Best regards,
Sake