OK, now we are talking. It is quite possible that trimming will not start until this operation is completed.
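If you want to keep an eye on whether trimming eventually starts, something like the two commands below should be enough (the mds name is just the one from your earlier mails, purely as an example); num_segments should start to drop once the blocked operation clears:

    ceph health detail | grep -A1 MDS_TRIM
    ceph daemon mds.lpnceph-mds02.in2p3.fr perf dump mds_log    # raw segment/event counters, if the admin socket responds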
If there are enough shards/copies to recover the lost objects, you should try a pg repair first. If you did lose too many replicas, there are ways to flush this PG out of the system. You will lose data this way. I don't know how to repair or flush only the broken objects out of a PG, but I would hope that this is possible. Before you do anything destructive, open a new thread on this list specifically for how to repair/remove this PG with the least possible damage.

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Francois Legrand <f...@lpnhe.in2p3.fr>
Sent: 08 June 2020 16:00:28
To: Frank Schilder; ceph-users
Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted

There is no recovery going on, but indeed we have a damaged pg (with some lost objects due to a major crash a few weeks ago)... and there are some shards of this pg on osd 27! That's also why we are migrating all the data out of this FS!
It's certainly related, and I guess that it's trying to remove some data that is already lost and gets stuck! I don't know if there is a way to tell ceph to forget about these ops! I guess not.
I thus think that there is not that much to do apart from reading as much data as we can to save as much as possible.
F.

On 08/06/2020 at 15:48, Frank Schilder wrote:
> That's strange. Maybe there is another problem. Do you have any other health warnings that might be related? Is there some recovery/rebalancing going on?
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Francois Legrand <f...@lpnhe.in2p3.fr>
> Sent: 08 June 2020 15:27:59
> To: Frank Schilder; ceph-users
> Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
>
> Thanks again for the hint!
> Indeed, I did a
> ceph daemon mds.lpnceph-mds02.in2p3.fr objecter_requests
> and it seems that osd 27 is more or less stuck with an op of age 34987.5 (while the other osds have ages < 1).
> I tried a ceph osd down 27, which reset the age, but I can see that the age of osd.27's ops is rising again.
> I think I will restart it (btw our osd servers and mds are different machines).
> F.
>
> On 08/06/2020 at 15:01, Frank Schilder wrote:
>> Hi Francois,
>>
>> this sounds great. At least it's operational. I guess it is still using a lot of swap while trying to replay operations.
>>
>> I would cleanly disconnect all clients if you haven't done so already, even any read-only clients. Any extra load will just slow down recovery. My best guess is that the MDS is replaying some operations, which is very slow due to swap. While doing so, the segments to trim will probably keep increasing for a while until it can start trimming.
>>
>> The slow meta-data IO is an operation hanging in some OSD. You should check which OSD it is (ceph health detail) and check if you can see the operation in the OSD's ops queue. I would expect this OSD to have a really long ops queue. I have seen meta-data operations hang for a long time. In case this OSD runs on the same server as your MDS, you will probably have to sit it out.
>>
>> If the meta-data operation is the only operation in the queue, the OSD might need a restart. But be careful; if in doubt, ask the list first.
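>>
>> To see what that OSD is actually chewing on, something along these lines should do (the osd id is a placeholder, and the ceph daemon commands have to run on the host that carries that OSD):
>>
>>    ceph daemon osd.<id> dump_ops_in_flight    # what is sitting in the queue right now
>>    ceph daemon osd.<id> dump_historic_ops     # recently completed slow ops, with durations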
>>
>> Best regards,
>> =================
>> Frank Schilder
>> AIT Risø Campus
>> Bygning 109, rum S14
>>
>> ________________________________________
>> From: Francois Legrand <f...@lpnhe.in2p3.fr>
>> Sent: 08 June 2020 14:45:13
>> To: Frank Schilder; ceph-users
>> Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
>>
>> Hi Frank,
>> Finally I did:
>> ceph config set global mds_beacon_grace 600000
>> and created /etc/sysctl.d/sysctl-ceph.conf with
>> vm.min_free_kbytes=4194303
>> and then
>> sysctl --system
>>
>> After that, the mds went to rejoin for a very long time (almost 24 hours) with errors like:
>> 2020-06-07 04:10:36.802 7ff866e2e700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
>> 2020-06-07 04:10:36.802 7ff866e2e700 0 mds.beacon.lpnceph-mds02.in2p3.fr Skipping beacon heartbeat to monitors (last acked 14653.8s ago); MDS internal heartbeat is not healthy!
>> 2020-06-07 04:10:37.021 7ff868e32700 -1 monclient: _check_auth_rotating possible clock skew, rotating keys expired way too early (before 2020-06-07 03:10:37.022271)
>> and also
>> 2020-06-07 04:10:44.942 7ff86d63b700 0 auth: could not find secret_id=10363
>> 2020-06-07 04:10:44.942 7ff86d63b700 0 cephx: verify_authorizer could not get service secret for service mds secret_id=10363
>>
>> but in the end the mds went active! :-)
>> I left it at rest from Sunday afternoon until this morning.
>> Indeed I was able to connect clients (in read-only for now) and read the data.
>> I checked the connected clients with ceph tell mds.lpnceph-mds02.in2p3.fr client ls, disconnected the few clients still there (with umount) and checked with the same command that they were not connected anymore.
>> But I still have the following warnings:
>> MDS_SLOW_METADATA_IO 1 MDSs report slow metadata IOs
>>     mdslpnceph-mds02.in2p3.fr(mds.0): 1 slow metadata IOs are blocked > 30 secs, oldest blocked for 75372 secs
>> MDS_TRIM 1 MDSs behind on trimming
>>     mdslpnceph-mds02.in2p3.fr(mds.0): Behind on trimming (122836/128) max_segments: 128, num_segments: 122836
>> and the number of segments is still rising (slowly).
>> F.
>>
>> On 08/06/2020 at 12:00, Frank Schilder wrote:
>>> Hi Francois,
>>>
>>> did you manage to get any further with this?
>>>
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>>
>>> ________________________________________
>>> From: Frank Schilder <fr...@dtu.dk>
>>> Sent: 06 June 2020 15:21:59
>>> To: ceph-users; f...@lpnhe.in2p3.fr
>>> Subject: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
>>>
>>> I think you have a problem similar to one I have. The priority of beacons seems very low. As soon as something gets busy, beacons are ignored or not sent. This was part of your log messages from the MDS. It stopped reporting to the MONs due to a laggy connection. This lagginess is a result of swapping:
>>>
>>>> 2020-06-05 21:39:06.015 7f251bfe6700 1 mds.0.322900 skipping upkeep work because connection to Monitors appears laggy
>>>
>>> Hence, during the (entire) time you are trying to get the MDS back using swap, it will almost certainly stop sending beacons. Therefore, you need to disable the time-out temporarily, otherwise the MON will always kill it for no real reason. The time-out should be long enough to cover the entire recovery period.
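>>>
>>> Something along these lines should do it; the value is only an example, pick one that comfortably covers the whole replay, and remember to undo it afterwards:
>>>
>>>    ceph config set global mds_beacon_grace 3600
>>>    # once the MDS is active and stable again, go back to the default (15):
>>>    ceph config rm global mds_beacon_grace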
>>>
>>> Best regards,
>>> =================
>>> Frank Schilder
>>> AIT Risø Campus
>>> Bygning 109, rum S14
>>>
>>> ________________________________________
>>> From: Francois Legrand <f...@lpnhe.in2p3.fr>
>>> Sent: 06 June 2020 11:11
>>> To: Frank Schilder; ceph-users
>>> Subject: Re: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
>>>
>>> Thanks for the tip,
>>> I will try that. For now vm.min_free_kbytes = 90112
>>> Indeed, yesterday after your last mail I set mds_beacon_grace to 240.0, but this didn't change anything...
>>> -27> 2020-06-06 06:15:07.373 7f83e3626700 1 mds.beacon.lpnceph-mds04.in2p3.fr MDS connection to Monitors appears to be laggy; 332.044s since last acked beacon
>>> That is the same time since the last acked beacon that I had before changing the parameter.
>>> As the mds beacon interval is 4 s, setting mds_beacon_grace to 240 should lead to 960 s (16 min). Thus I think that the bottleneck is elsewhere.
>>> F.
>>>
>>> On 06/06/2020 at 09:47, Frank Schilder wrote:
>>>> Hi Francois,
>>>>
>>>> there is actually one more parameter you might consider changing in case the MDS gets kicked out again. For a system under such high memory pressure, the value of the kernel parameter vm.min_free_kbytes might need adjusting. You can check the current value with
>>>>
>>>> sysctl vm.min_free_kbytes
>>>>
>>>> In your case, with heavy swap usage, this value should probably be somewhere between 2 and 4 GB.
>>>>
>>>> Careful, do not change this value while memory is in high demand. If not enough memory is available, setting this will immediately OOM kill your machine. Make sure that plenty of pages are unused. Drop the page cache if necessary or reboot the machine before setting this value.
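>>>>
>>>> Purely as a sketch of what I mean (the value and the file name are only examples, and again: only apply this while plenty of memory is free):
>>>>
>>>>    sysctl vm.min_free_kbytes                     # check the current value first
>>>>    sysctl -w vm.min_free_kbytes=4194304          # apply ~4 GB immediately
>>>>    echo 'vm.min_free_kbytes=4194304' > /etc/sysctl.d/90-mds-recovery.conf   # make it survive a reboot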
>>>>
>>>> Best regards,
>>>> =================
>>>> Frank Schilder
>>>> AIT Risø Campus
>>>> Bygning 109, rum S14
>>>>
>>>> ________________________________________
>>>> From: Frank Schilder <fr...@dtu.dk>
>>>> Sent: 06 June 2020 00:36:13
>>>> To: ceph-users; f...@lpnhe.in2p3.fr
>>>> Subject: [ceph-users] Re: mds behind on trimming - replay until memory exhausted
>>>>
>>>> Hi Francois,
>>>>
>>>> yes, the beacon grace needs to be higher due to the latency of swap. Not sure if 60 s will do. For this particular recovery operation, you might want to go much higher (1 h) and watch the cluster health closely.
>>>>
>>>> Good luck and best regards,
>>>> =================
>>>> Frank Schilder
>>>> AIT Risø Campus
>>>> Bygning 109, rum S14
>>>>
>>>> ________________________________________
>>>> From: Francois Legrand <f...@lpnhe.in2p3.fr>
>>>> Sent: 05 June 2020 23:51:04
>>>> To: Frank Schilder; ceph-users
>>>> Subject: Re: [ceph-users] mds behind on trimming - replay until memory exhausted
>>>>
>>>> Hi,
>>>> Unfortunately adding swap did not solve the problem!
>>>> I added 400 GB of swap. It used about 18 GB of swap after consuming all the RAM and stopped with the following logs:
>>>>
>>>> 2020-06-05 21:33:31.967 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324691 from mon.1
>>>> 2020-06-05 21:33:40.355 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324692 from mon.1
>>>> 2020-06-05 21:33:59.787 7f251b7e5700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
>>>> 2020-06-05 21:33:59.787 7f251b7e5700 0 mds.beacon.lpnceph-mds04.in2p3.fr Skipping beacon heartbeat to monitors (last acked 3.99979s ago); MDS internal heartbeat is not healthy!
>>>> 2020-06-05 21:34:00.287 7f251b7e5700 1 heartbeat_map is_healthy 'MDSRank' had timed out after 15
>>>> 2020-06-05 21:34:00.287 7f251b7e5700 0 mds.beacon.lpnceph-mds04.in2p3.fr Skipping beacon heartbeat to monitors (last acked 4.49976s ago); MDS internal heartbeat is not healthy!
>>>> ....
>>>> 2020-06-05 21:39:05.991 7f251bfe6700 1 heartbeat_map reset_timeout 'MDSRank' had timed out after 15
>>>> 2020-06-05 21:39:06.015 7f251bfe6700 1 mds.beacon.lpnceph-mds04.in2p3.fr MDS connection to Monitors appears to be laggy; 310.228s since last acked beacon
>>>> 2020-06-05 21:39:06.015 7f251bfe6700 1 mds.0.322900 skipping upkeep work because connection to Monitors appears laggy
>>>> 2020-06-05 21:39:19.838 7f251bfe6700 1 mds.0.322900 skipping upkeep work because connection to Monitors appears laggy
>>>> 2020-06-05 21:39:19.869 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324694 from mon.1
>>>> 2020-06-05 21:39:19.869 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Map removed me (mds.-1 gid:210070681) from cluster due to lost contact; respawning
>>>> 2020-06-05 21:39:19.870 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr respawn!
>>>> --- begin dump of recent events ---
>>>> -9999> 2020-06-05 19:28:07.982 7f25217f1700 5 mds.beacon.lpnceph-mds04.in2p3.fr received beacon reply up:replay seq 2131 rtt 0.930951
>>>> -9998> 2020-06-05 19:28:11.053 7f251b7e5700 5 mds.beacon.lpnceph-mds04.in2p3.fr Sending beacon up:replay seq 2132
>>>> -9997> 2020-06-05 19:28:11.053 7f251b7e5700 10 monclient: _send_mon_message to mon.lpnceph-mon02 at v2:134.158.152.210:3300/0
>>>> -9996> 2020-06-05 19:28:12.176 7f25217f1700 5 mds.beacon.lpnceph-mds04.in2p3.fr received beacon reply up:replay seq 2132 rtt 1.12294
>>>> -9995> 2020-06-05 19:28:12.176 7f251e7eb700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 323302 from mon.1
>>>> -9994> 2020-06-05 19:28:14.290 7f251d7e9700 10 monclient: tick
>>>> -9993> 2020-06-05 19:28:14.290 7f251d7e9700 10 monclient: _check_auth_rotating have uptodate secrets (they expire after 2020-06-05 19:27:44.290995)
>>>> ...
>>>> 2020-06-05 21:39:31.092 7f3c4d5e3700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324749 from mon.1
>>>> 2020-06-05 21:39:35.257 7f3c4d5e3700 1 mds.lpnceph-mds04.in2p3.fr Updating MDS map to version 324750 from mon.1
>>>> 2020-06-05 21:39:35.257 7f3c4d5e3700 1 mds.lpnceph-mds04.in2p3.fr Map has assigned me to become a standby
>>>>
>>>> However, the mons don't seem particularly loaded!
>>>> So I am trying to set mds_beacon_grace to 60.0 to see if it helps (I did it for both the mds and mon daemons because it seems to be present in both configs).
>>>> I will tell you if it works.
>>>>
>>>> Any other clue?
>>>> F.
>>>>
>>>> On 05/06/2020 at 14:44, Frank Schilder wrote:
>>>>> Hi Francois,
>>>>>
>>>>> thanks for the link. The option "mds dump cache after rejoin" is for debugging purposes only. It will write the cache after rejoin to a file, but not drop the cache. This will not help you. I think this was implemented recently to make it possible to send a cache dump file to developers after an MDS crash, before the restarting MDS changes the cache.
>>>>>
>>>>> In your case, I would set osd_op_queue_cut_off during the next regular cluster service or upgrade.
>>>>>
>>>>> My best bet right now is to try to add swap.
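>>>>>
>>>>> If you go down that route, a plain swap file on a local SSD is the simplest thing; just as a sketch (size and path are only examples, and use dd instead of fallocate if your filesystem does not support preallocated swap files):
>>>>>
>>>>>    fallocate -l 100G /var/swapfile && chmod 600 /var/swapfile
>>>>>    mkswap /var/swapfile && swapon /var/swapfile
>>>>>    swapon --show     # verify it is active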
>>>>> Maybe someone else reading this has a better idea, or you find a hint in one of the other threads.
>>>>>
>>>>> Good luck!
>>>>> =================
>>>>> Frank Schilder
>>>>> AIT Risø Campus
>>>>> Bygning 109, rum S14
>>>>>
>>>>> ________________________________________
>>>>> From: Francois Legrand <f...@lpnhe.in2p3.fr>
>>>>> Sent: 05 June 2020 14:34:06
>>>>> To: Frank Schilder; ceph-users
>>>>> Subject: Re: [ceph-users] mds behind on trimming - replay until memory exhausted
>>>>>
>>>>> On 05/06/2020 at 14:18, Frank Schilder wrote:
>>>>>> Hi Francois,
>>>>>>
>>>>>>> I was also wondering if setting mds dump cache after rejoin could help?
>>>>>> Haven't heard of that option. Is there some documentation?
>>>>> I found it on:
>>>>> https://docs.ceph.com/docs/nautilus/cephfs/mds-config-ref/
>>>>> mds dump cache after rejoin
>>>>> Description: Ceph will dump MDS cache contents to a file after rejoining the cache (during recovery).
>>>>> Type: Boolean
>>>>> Default: false
>>>>>
>>>>> but I don't think it can help in my case, because rejoin occurs after replay, and in my case replay never ends!
>>>>>
>>>>>>> I have:
>>>>>>> osd_op_queue=wpq
>>>>>>> osd_op_queue_cut_off=low
>>>>>>> I can try to set osd_op_queue_cut_off to high, but it will be useful only if the mds gets active, true?
>>>>>> I think so. If you have no clients connected, there should not be queue priority issues. Maybe it is best to wait until your cluster is healthy again, as you will need to restart all daemons. Make sure you set this in [global]. When I applied that change and after restarting all OSDs, my MDSes had reconnect issues until I set it on them too. I think all daemons use that option (the prefix osd_ is misleading).
>>>>> For sure I would prefer not to restart all daemons, because the second filesystem is up and running (with production clients).
>>>>>
>>>>>>> For now, the mds_cache_memory_limit is set to 8 589 934 592 (so 8GB, which seems reasonable for an mds server with 32/48GB).
>>>>>> This sounds bad. 8GB should not cause any issues. Maybe you are hitting a bug; I believe there is a regression in Nautilus. There were recent threads on absurdly high memory use by MDSes. Maybe it's worth searching for these in the list.
>>>>> I will have a look.
>>>>>
>>>>>>> I already forced the clients to unmount (and even rebooted the ones from which the rsync and the rmdir .snaps were launched).
>>>>>> I don't know when the MDS acknowledges this. If it was a clean unmount (i.e. without -f or forced by reboot), the MDS should have dropped the clients already. If it was an unclean unmount, it might not be that easy to get the stale client session out. However, I don't know about that.
>>>>> Moreover, when I did that the mds was already not active but in replay, so for sure the unmount was not acknowledged by any mds!
>>>>>
>>>>>>> I think that providing more swap may be the solution! I will try that if I cannot find another way to fix it.
>>>>>> If the memory overrun is somewhat limited, this should allow the MDS to trim the logs. It will take a while, but it will do it eventually.
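>>>>>>
>>>>>> Once it does make it back to active, something along these lines should tell you whether the stale sessions are really gone and how far the cache is over its limit (the mds name is just an example, use whichever daemon holds rank 0; the ceph daemon commands have to run on that mds host):
>>>>>>
>>>>>>    ceph tell mds.lpnceph-mds04.in2p3.fr client ls
>>>>>>    ceph daemon mds.lpnceph-mds04.in2p3.fr cache status
>>>>>>    ceph daemon mds.lpnceph-mds04.in2p3.fr config get mds_cache_memory_limit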
>>>>>>
>>>>>> Best regards,
>>>>>> =================
>>>>>> Frank Schilder
>>>>>> AIT Risø Campus
>>>>>> Bygning 109, rum S14
>>>>>>
>>>>>> ________________________________________
>>>>>> From: Francois Legrand <f...@lpnhe.in2p3.fr>
>>>>>> Sent: 05 June 2020 13:46:03
>>>>>> To: Frank Schilder; ceph-users
>>>>>> Subject: Re: [ceph-users] mds behind on trimming - replay until memory exhausted
>>>>>>
>>>>>> I was also wondering if setting mds dump cache after rejoin could help?
>>>>>>
>>>>>> On 05/06/2020 at 12:49, Frank Schilder wrote:
>>>>>>> Out of interest, I did the same on a mimic cluster a few months ago, running up to 5 parallel rsync sessions without any problems. I moved about 120TB. Each rsync was running on a separate client with its own cache. I made sure that the sync dirs were all disjoint (no overlap of files/directories).
>>>>>>>
>>>>>>> How many rsync processes are you running in parallel?
>>>>>>> Do you have these settings enabled:
>>>>>>>
>>>>>>> osd_op_queue=wpq
>>>>>>> osd_op_queue_cut_off=high
>>>>>>>
>>>>>>> WPQ should be the default, but osd_op_queue_cut_off=high might not be. Setting the latter removed any behind-on-trimming problems we have seen before.
>>>>>>>
>>>>>>> You are in a somewhat peculiar situation. I think you need to trim client caches, which requires an active MDS. If your MDS becomes active for at least some time, you could try the following (I'm not an expert here, so take it with a grain of scepticism):
>>>>>>>
>>>>>>> - reduce the MDS cache memory limit to force recall of caps much earlier than now
>>>>>>> - reduce the client cache size
>>>>>>> - set "osd_op_queue_cut_off=high" if you have not already done so; I think this requires a restart of the OSDs, so be careful
>>>>>>>
>>>>>>> At this point, you could watch your restart cycle to see if things improve already. Maybe nothing more is required.
>>>>>>>
>>>>>>> If you have good SSDs, you could try to temporarily provide some swap space. It saved me once. This will be very slow, but at least it might allow you to move forward.
>>>>>>>
>>>>>>> Harder measures:
>>>>>>>
>>>>>>> - stop all I/O from the FS clients, throw users out if necessary
>>>>>>> - ideally, try to cleanly (!) shut down clients or force trimming of the cache by either
>>>>>>>   * umount, or
>>>>>>>   * sync; echo 3 > /proc/sys/vm/drop_caches
>>>>>>>   Either of these might hang for a long time. Do not interrupt, and do not do this on more than one client at a time.
>>>>>>>
>>>>>>> At some point, your active MDS should be able to hold a full session. You should then tune the cache and other parameters such that the MDSes can handle your rsync sessions.
>>>>>>>
>>>>>>> My experience is that MDSes overrun their cache limits quite a lot. Since I reduced mds_cache_memory_limit to not more than half of what is physically available, I have not had any problems again.
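>>>>>>>
>>>>>>> For what it's worth, this is roughly how I would double-check what the daemons are actually running with before and after such a change (the daemon ids are placeholders, and the ceph daemon commands go on the respective hosts):
>>>>>>>
>>>>>>>    ceph daemon osd.<id> config get osd_op_queue
>>>>>>>    ceph daemon osd.<id> config get osd_op_queue_cut_off
>>>>>>>    ceph daemon mds.<name> config get mds_cache_memory_limit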
>>>>>>> Hope that helps.
>>>>>>>
>>>>>>> Best regards,
>>>>>>> =================
>>>>>>> Frank Schilder
>>>>>>> AIT Risø Campus
>>>>>>> Bygning 109, rum S14
>>>>>>>
>>>>>>> ________________________________________
>>>>>>> From: Francois Legrand <f...@lpnhe.in2p3.fr>
>>>>>>> Sent: 05 June 2020 11:42:42
>>>>>>> To: ceph-users
>>>>>>> Subject: [ceph-users] mds behind on trimming - replay until memory exhausted
>>>>>>>
>>>>>>> Hi all,
>>>>>>> We have a ceph nautilus cluster (14.2.8) with two cephfs filesystems and 3 mds (1 active for each fs + one standby).
>>>>>>> We are transferring all the data (~600M files) from one FS (which was in EC 3+2) to the other FS (in R3).
>>>>>>> On the old FS we first removed the snapshots (to avoid stray problems when removing files) and then ran some rsyncs, deleting the files after the transfer.
>>>>>>> The operation should take a few more weeks to complete.
>>>>>>> But a few days ago, we started to get "mds behind on trimming" warnings from the mds managing the old FS.
>>>>>>> Yesterday, I restarted the active mds service to force a takeover by the standby mds (basically because the standby is more powerful and has more memory, i.e. 48GB instead of 32).
>>>>>>> The standby mds took rank 0 and started to replay... the "mds behind on trimming" warning came back and the number of segments rose, as did the memory usage of the server. Finally, it exhausted the memory of the mds, the service stopped, and the previous mds took rank 0 and started to replay... until memory exhaustion and a new mds switch, etc.
>>>>>>> It thus seems that we are in a never-ending loop! And of course, as the mds is always in replay, the data is not accessible and the transfers are blocked.
>>>>>>> I stopped all the rsyncs and unmounted the clients.
>>>>>>>
>>>>>>> My questions are:
>>>>>>> - Does the mds trim during replay, so that we could hope that after a while it will purge everything and the mds will eventually become active?
>>>>>>> - Is there a way to accelerate the operation or to fix this situation?
>>>>>>>
>>>>>>> Thanks for your help.
>>>>>>> F.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io