At 01:27 this morning I received the first email about the MDS cache being too 
large (a mail is sent every 15 minutes while the warning is active). Looking 
into it, it was again a standby-replay host that had stopped working.

At 01:00 a few rsync processes start in parallel on a client machine. These 
copy data from an NFS share to a CephFS share to sync the latest changes (we 
want to switch to CephFS in the near future).

This crashing of the standby-replay MDS has happened a couple of times now, so 
I think it would be good to get some help. Where should I look next?
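For completeness, these are the commands I use to check the cache limit and the 
live cache usage on the affected daemon (the daemon name below is the 
standby-replay daemon from the status output further down):

```shell
# Configured MDS cache limit (cluster-wide; the default is 4 GiB)
ceph config get mds mds_cache_memory_limit

# Live cache usage of the standby-replay daemon that triggers the warning
ceph tell mds.atlassian-prod.mds4.qlvypn cache status

# Full health detail while the warning is active
ceph health detail
```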

Some cephfs information
----------------------------------
# ceph fs status
atlassian-opl - 8 clients
=============
RANK      STATE                     MDS                    ACTIVITY     DNS    INOS   DIRS   CAPS
 0        active      atlassian-opl.mds5.zsxfep  Reqs:    0 /s  7830   7803    635   3706
0-s   standby-replay  atlassian-opl.mds6.svvuii  Evts:    0 /s  3139   1924    461      0
           POOL              TYPE     USED  AVAIL
cephfs.atlassian-opl.meta  metadata  2186M  1161G
cephfs.atlassian-opl.data    data    23.0G  1161G
atlassian-prod - 12 clients
==============
RANK      STATE                      MDS                     ACTIVITY     DNS    INOS   DIRS   CAPS
 0        active      atlassian-prod.mds1.msydxf  Reqs:    0 /s  2703k  2703k  905k  1585
 1        active      atlassian-prod.mds2.oappgu  Reqs:    0 /s   961k   961k  317k   622
 2        active      atlassian-prod.mds3.yvkjsi  Reqs:    0 /s  2083k  2083k  670k   443
0-s   standby-replay  atlassian-prod.mds4.qlvypn  Evts:    0 /s   352k   352k  102k     0
1-s   standby-replay  atlassian-prod.mds5.egsdfl  Evts:    0 /s   873k   873k  277k     0
2-s   standby-replay  atlassian-prod.mds6.ghonso  Evts:    0 /s  2317k  2316k  679k     0
           POOL               TYPE     USED  AVAIL
cephfs.atlassian-prod.meta  metadata  58.8G  1161G
cephfs.atlassian-prod.data    data    5492G  1161G
MDS version: ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)


When looking at the log on the MDS server, I see the following:
2023-07-21T01:21:01.942+0000 7f668a5e0700 -1 received signal: Hangup from Kernel ( Could be generated by pthread_kill(), raise(), abort(), alarm() ) UID: 0
2023-07-21T01:23:13.856+0000 7f6688ddd700  1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5671 from mon.1
2023-07-21T01:23:18.369+0000 7f6688ddd700  1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5672 from mon.1
2023-07-21T01:23:31.719+0000 7f6688ddd700  1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5673 from mon.1
2023-07-21T01:23:35.769+0000 7f6688ddd700  1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5674 from mon.1
2023-07-21T01:28:23.764+0000 7f6688ddd700  1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5675 from mon.1
2023-07-21T01:29:13.657+0000 7f6688ddd700  1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5676 from mon.1
2023-07-21T01:33:43.886+0000 7f6688ddd700  1 mds.atlassian-prod.pwsoel13143.qlvypn Updating MDS map to version 5677 from mon.1
(and another 20 similar lines about updating the MDS map)

Alert mailings:
Mail at 01:27
----------------------------------
HEALTH_WARN

--- New ---
[WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
        mds.atlassian-prod.mds4.qlvypn(mds.0): MDS cache is too large (13GB/9GB); 0 inodes in use by clients, 0 stray files


=== Full health status ===
[WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
        mds.atlassian-prod.mds4.qlvypn(mds.0): MDS cache is too large (13GB/9GB); 0 inodes in use by clients, 0 stray files


Mail at 03:27
----------------------------------
HEALTH_OK

--- Cleared ---
[WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
        mds.atlassian-prod.mds4.qlvypn(mds.0): MDS cache is too large (14GB/9GB); 0 inodes in use by clients, 0 stray files


=== Full health status ===


Mail at 04:12
----------------------------------
HEALTH_WARN

--- New ---
[WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
        mds.atlassian-prod.mds4.qlvypn(mds.0): MDS cache is too large (15GB/9GB); 0 inodes in use by clients, 0 stray files


=== Full health status ===
[WARN] MDS_CACHE_OVERSIZED: 1 MDSs report oversized cache
        mds.atlassian-prod.mds4.qlvypn(mds.0): MDS cache is too large (15GB/9GB); 0 inodes in use by clients, 0 stray files
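In case it helps the discussion: as a stopgap I could raise the cache limit or 
temporarily disable standby-replay for the affected filesystem, but I'd rather 
understand the root cause first. The value below is just an example:

```shell
# Raise the MDS cache limit cluster-wide (example: 16 GiB, in bytes)
ceph config set mds mds_cache_memory_limit 17179869184

# Or, as a test, disable standby-replay for the affected filesystem
ceph fs set atlassian-prod allow_standby_replay false
```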


Best regards,
Sake
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
