[ceph-users] MDs stuck in rejoin with '[ERR] : loaded dup inode'

2024-05-30 Thread Noe P.


Hi,

I'm still unable to get our filesystem back.
I now have this:

fs_cluster - 0 clients
======================
RANK  STATE    MDS       ACTIVITY   DNS    INOS   DIRS   CAPS
 0    rejoin   cephmd4b              90.0k  89.4k  14.7k     0
 1    rejoin   cephmd6b               105k   105k  21.3k     0
 2    failed
       POOL          TYPE      USED  AVAIL
fs_cluster_meta    metadata    288G  55.2T
fs_cluster_data      data      421T  55.2T


I still cannot get rid of the failed 3rd rank, and the other two stay in state
'rejoin' indefinitely. After all clients were stopped, the log complains about
a 'dup inode':

  2024-05-30T07:59:46.252+0200 7f2fe9146700 -1 log_channel(cluster) log [ERR] :
  loaded dup inode 0x1001710ea1d [12bc6a,head] v1432525092 at
  /homes/YYY/ZZZ/.bash_history-21032.tmp, but inode 0x1001710ea1d.head
  v1432525109 already exists at /homes/YYY/ZZZ/.bash_history
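
For what it's worth, a duplicate primary dentry like this is the kind of linkage
damage the offline disaster-recovery tools are aimed at. A minimal sketch,
assuming the filesystem is named fs_cluster as in the status above and that the
MDS daemons are stopped while these run:

  # read-only sanity check of the journal first
  cephfs-journal-tool --rank=fs_cluster:0 journal inspect
  # linkage scan that reconciles duplicate/dangling dentries
  cephfs-data-scan scan_links --filesystem fs_cluster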

Questions:
 - Is there a way to scan/repair the metadata without any MDS in 'active' state?

 - Is there a way to remove (or otherwise fix) the inode in question, given the
   above inode number?

 - Is the 'rejoin' state due to the inode error, or due to that failed 3rd rank?
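
A rough sketch of the usual starting points here, not a verified recipe (the
daemon name cephmd4b and rank 2 are taken from the status output above; the
exact scrub syntax varies a bit between releases):

  # see what damage, if any, the ranks have recorded
  ceph tell mds.cephmd4b damage ls
  # if rank 2 is marked damaged rather than merely failed, clear it so a
  # standby can take it over
  ceph mds repaired fs_cluster:2
  # an online scrub/repair only becomes possible once a rank reaches active
  ceph tell mds.fs_cluster:0 scrub start / recursive,repair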


Regards,
  N.


[ceph-users] MDS stuck in rejoin

2023-07-20 Thread Frank Schilder
Hi all,

we had a client with the warning "[WRN] MDS_CLIENT_OLDEST_TID: 1 clients 
failing to advance oldest client/flush tid". I looked at the client and there 
was nothing going on, so I rebooted it. After the client was back, the message 
was still there. To clean this up I failed the MDS. Unfortunately, the MDS that 
took over remained stuck in rejoin without doing anything. All that happened 
in the log was:

[root@ceph-10 ceph]# tail -f ceph-mds.ceph-10.log
2023-07-20T15:54:29.147+0200 7fedb9c9f700  1 mds.2.896604 rejoin_start
2023-07-20T15:54:29.161+0200 7fedb9c9f700  1 mds.2.896604 rejoin_joint_start
2023-07-20T15:55:28.005+0200 7fedb9c9f700  1 mds.ceph-10 Updating MDS map to version 896614 from mon.4
2023-07-20T15:56:00.278+0200 7fedb9c9f700  1 mds.ceph-10 Updating MDS map to version 896615 from mon.4
[...]
2023-07-20T16:02:54.935+0200 7fedb9c9f700  1 mds.ceph-10 Updating MDS map to version 896653 from mon.4
2023-07-20T16:03:07.276+0200 7fedb9c9f700  1 mds.ceph-10 Updating MDS map to version 896654 from mon.4

After some time I decided to give another fail a try and, this time, the 
replacement daemon went to active state really fast.

If I get a message like the above again, what is the clean way of getting the 
client back into a healthy state (version: 15.2.17 
(8a82819d84cf884bd39c17e3236e0632ac146dc4) octopus (stable))?
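
For reference, the pieces usually involved in tracking such a warning down,
sketched from memory rather than tested on this cluster (rank 2 comes from the
log above; the client id is a placeholder):

  # health detail names the client behind MDS_CLIENT_OLDEST_TID
  ceph health detail
  # inspect that client's session on the affected rank
  ceph tell mds.2 session ls
  # last resort: evict the stuck client rather than failing the MDS
  ceph tell mds.2 client evict id=<client-id>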

Thanks and best regards,
=
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14


[ceph-users] MDS stuck in rejoin

2022-05-31 Thread Dave Schulz

Hi Everyone,

I have a down system that has the MDS stuck in the rejoin state. When I 
run ceph-mds with -d and --debug_mds 10 I get this repeating:
2022-05-31 00:33:03.554 7fac80ee3700 10 mds.trex-ceph4  my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
2022-05-31 00:33:03.554 7fac80ee3700 10 mds.trex-ceph4  mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
2022-05-31 00:33:03.554 7fac80ee3700 10 mds.trex-ceph4 my gid is 161986332
2022-05-31 00:33:03.554 7fac80ee3700 10 mds.trex-ceph4 map says I am mds.0.2365745 state up:rejoin
2022-05-31 00:33:03.554 7fac80ee3700 10 mds.trex-ceph4 msgr says i am [v2:172.23.0.44:6800/4094836140,v1:172.23.0.44:6801/4094836140]
2022-05-31 00:33:03.554 7fac80ee3700 10 mds.trex-ceph4 handle_mds_map: handling map as rank 0
2022-05-31 00:33:03.557 7fac83972700  5 mds.beacon.trex-ceph4 received beacon reply up:rejoin seq 31 rtt 0.21701
2022-05-31 00:33:04.185 7fac7c6da700 10 mds.0.cache cache not ready for trimming
2022-05-31 00:33:05.182 7fac7c6da700 10 mds.0.cache cache not ready for trimming
2022-05-31 00:33:05.182 7fac7c6da700 10 mds.0.cache releasing free memory
2022-05-31 00:33:06.182 7fac7c6da700 10 mds.0.cache cache not ready for trimming
2022-05-31 00:33:07.183 7fac7c6da700 10 mds.0.cache cache not ready for trimming
2022-05-31 00:33:07.341 7fac7dedd700  5 mds.beacon.trex-ceph4 Sending beacon up:rejoin seq 32
2022-05-31 00:33:07.341 7fac83972700  5 mds.beacon.trex-ceph4 received beacon reply up:rejoin seq 32 rtt 0
2022-05-31 00:33:08.183 7fac7c6da700 10 mds.0.cache cache not ready for trimming
2022-05-31 00:33:09.184 7fac7c6da700 10 mds.0.cache cache not ready for trimming
2022-05-31 00:33:10.184 7fac7c6da700 10 mds.0.cache cache not ready for trimming
2022-05-31 00:33:11.185 7fac7c6da700 10 mds.0.cache cache not ready for trimming
2022-05-31 00:33:11.341 7fac7dedd700  5 mds.beacon.trex-ceph4 Sending beacon up:rejoin seq 33
2022-05-31 00:33:11.397 7fac80ee3700  1 mds.trex-ceph4 Updating MDS map to version 2365758 from mon.0


and it just stays in that state seemingly forever. It also seems to be doing 
nothing CPU-wise. I don't even know where to look at this point.


I see this in the mon log:

2022-05-31 00:36:27.359 7f39d0c6c700  1 mon.trex-ceph1@0(leader).osd e51026 _set_new_cache_sizes cache_size:1020054731 inc_alloc: 301989888 full_alloc: 322961408 kv_alloc: 390070272


I'm falling asleep at the keyboard trying to get this to work. Any thoughts?
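
A few read-only things worth checking while a rank sits in up:rejoin, offered
as a sketch rather than a recipe (trex-ceph4 is the daemon name from the log
above):

  # cluster-wide view of rank states and health warnings
  ceph fs status
  ceph health detail
  # on the MDS host: the daemon's own view of its state, sessions and cache counters
  ceph daemon mds.trex-ceph4 status
  ceph daemon mds.trex-ceph4 session ls
  ceph daemon mds.trex-ceph4 perf dump mds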

Thanks

-Dave
