Short story:
------------

We have a new HPC installation with file systems provided by cephfs (home, 
apps, ...). We have one cephfs and all client file systems are sub-directory 
mounts. On this ceph file system, we have a bit more than 500 nodes with 
currently 2 ceph fs mounts each, resulting in a bit over 1000 client 
connections. Today, we ran a benchmark on the HPC nodes, which somehow crashed 
our MDS setup as a by-product. We are currently on "ceph version 13.2.2 
(02899bfda814146b021136e9d8e80eba494e1126) mimic (stable)".
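
For reference, the clients mount sub-directories of the one cephfs with the 
kernel client, roughly like this (monitor addresses, paths and the secret file 
below are placeholders, not our real ones):

    mount -t ceph mon1,mon2,mon3:/hpc/home /home \
        -o name=con-fs-hpc,secretfile=/etc/ceph/client.con-fs-hpc.secret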

We have the fs metadata on SSD and the data on an EC data pool on HDD. The 
ceph servers are running kernel v5.0 while the clients are on v3.10. It looks 
like all the important ceph fs kernel client fixes have been back-ported (not 
all of them, though). Everything is on mimic. If necessary, we will provide 
more details, which we omitted here to keep the focus on the quite long event 
log below.
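
For completeness, the client versions and feature bits as the cluster sees 
them can be inspected from the server side, for example:

    ceph features
    ceph daemon mds.ceph-12 session ls    # client_metadata includes kernel_version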

What we observed after starting the benchmark was between 10000-20000 read OPs, 
probably caused by the application (running on all 500+ nodes) repeatedly 
accessing the same file or a small set of files. For a while, this more or less 
worked, until operations started piling up and never got flushed. After quite a 
bit of work, it looks like we got our cluster back, and in the process we 
discovered a (serious?) problem with an MDS daemon pair set up as 
active/standby-replay.

The main questions in this e-mail are basically the following (please see the 
long story below for the clue we found):

1) Is there a bug with having MDS daemons acting as standby-replay?
2) Is there a manual way to cancel an "internal op fragmentdir"?

Our MDS configuration section looked like this before the incident:

[mds.ceph-08]
mds standby replay = true
mds standby for rank = 0
mds standby for fscid = 1

[mds.ceph-12]
mds standby replay = true
mds standby for rank = 0
mds standby for fscid = 1

We changed it to the following until we have more information on the issue:

[mds.ceph-08]
mds standby replay = false
#mds standby for rank = 0
#mds standby for fscid = 1

[mds.ceph-12]
mds standby replay = false
#mds standby for rank = 0
#mds standby for fscid = 1
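
To verify which daemon comes up as active and whether the standby is actually 
in replay mode, the state can be checked with (a quick sketch, output omitted):

    ceph fs status
    ceph daemon mds.ceph-08 config get mds_standby_replay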


Long Story:
-----------

Disclaimer: I was a bit under stress getting our cluster back. Unfortunately, I 
did not take copies of all potentially relevant information and have to recall 
some from memory. We are still in a phase where we can execute tests to collect 
more data - if necessary. We will certainly stress test the active/standby 
configuration with the same benchmark.

It looks like there might be a bug where the standby-replay daemon is regarded 
as active and a fragmentdir operation gets scheduled but never executed. Once 
this internal request is in the pipeline, operations keep piling up without 
ever being executed, and IO to the files accessed by the benchmark stopped.
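
As far as I can tell from the docs, directory fragmentation is governed by the 
mds_bal_* options; if it helps with reproducing, these are the knobs one would 
inspect first:

    ceph daemon mds.ceph-12 config get mds_bal_fragment_dirs
    ceph daemon mds.ceph-12 config get mds_bal_split_size
    ceph daemon mds.ceph-12 config get mds_bal_fragment_size_max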

Here is a protocol of what we observed and did:

- We started the benchmark and observed high IOPS (10-20K read ops) at low 
bandwidth (300-500 MB/s read). Very little write was going on.
- At some point we started seeing the health warning

            1 MDSs report slow requests

and a large number of OPS blocked (NNNN below stands for some 4-digit number):

MDS_SLOW_REQUEST 1 MDSs report slow requests
    mdsceph-12(mds.0): NNNN slow requests are blocked > 30 secs

with the number NNNN increasing fast.
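
These messages are what "ceph health detail" shows; the individual blocked 
requests can be dumped on the active MDS with:

    ceph health detail
    ceph daemon mds.ceph-12 ops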

- Some time later, we see trimming messages in addition:

MDS_TRIM 2 MDSs behind on trimming
    mdsceph-12(mds.0): Behind on trimming (292/128) max_segments: 128, 
num_segments: 292
    mdsceph-08(mds.0): Behind on trimming (292/128) max_segments: 128, 
num_segments: 292

with num_segments increasing slowly.
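
The 128 in these messages is mds_log_max_segments. It can be checked and, if 
one just needs head-room while the MDS catches up, raised at run time via the 
admin socket (a sketch; we left it at the default):

    ceph daemon mds.ceph-12 config get mds_log_max_segments
    ceph daemon mds.ceph-12 config set mds_log_max_segments 256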

- That's when we started taking action since we noted that the old ops were 
never executed and ops would probably just keep piling up.

- We cancelled the benchmark job that was causing the IO load and waited to see 
if ceph would start recovering. I should note that this problem was already 
triggered by earlier runs of the same benchmark and the health issues did 
disappear after some time back then. However, in those cases the user 
terminated the job as soon as he noticed strange behaviour - much earlier than 
in the supervised test described here, where we wanted to see how ceph behaves 
under a sustained load of this type (users do have job accidents:).

- We observed that this time, ceph was not getting better. The number of stuck 
operations remained at this level:

MDS_SLOW_REQUEST 1 MDSs report slow requests
    mdsceph-12(mds.0): 12051 slow requests are blocked > 30 secs
MDS_TRIM 2 MDSs behind on trimming
    mdsceph-12(mds.0): Behind on trimming (292/128) max_segments: 128, 
num_segments: 292
    mdsceph-08(mds.0): Behind on trimming (292/128) max_segments: 128, 
num_segments: 292

- So, we had to do something. After trying a number of less invasive things, 
like remounting the ceph fs on a client node, to no avail, we decided to take 
more drastic measures.

- We evicted all clients and observed that more and more OPS got marked as 
"cancelled" and "cleaned up". During the eviction process, these warnings came 
up temporarily:

    mdsceph-12(mds.0): Client sn371.hpc.ait.dtu.dk:con-fs-hpc failing to 
respond to capability release client_id: 226263
    mdsceph-12(mds.0): Client sn322.hpc.ait.dtu.dk:con-fs-hpc failing to 
respond to capability release client_id: 242672
    mdsceph-12(mds.0): Client sn339.hpc.ait.dtu.dk:con-fs-hpc failing to 
respond to capability release client_id: 242708
    mdsceph-12(mds.0): Client sn335.hpc.ait.dtu.dk:con-fs-hpc failing to 
respond to capability release client_id: 272728
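
For reference, client sessions can be listed and evicted one by one on the 
active MDS, roughly like this (client IDs as in the warnings above):

    ceph tell mds.ceph-12 client ls
    ceph tell mds.ceph-12 client evict id=226263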

I'm aware that there is/was a kernel (client) bug that might still be present 
in 3.10. However, I have not seen this warning in any other context so far. I 
did test pushing clients to release caps by changing the MDS cache memory limit 
at run time, and the clients responded by dropping caps as needed. I have never 
seen the MDS exceeding the cache memory limit.
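
The cap release test mentioned above was done by changing the cache limit at 
run time, something along these lines (the value is just an example, in bytes):

    ceph daemon mds.ceph-12 config get mds_cache_memory_limit
    ceph daemon mds.ceph-12 config set mds_cache_memory_limit 4294967296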

- After kicking out all clients, we ended up with this result:

HEALTH_WARN 1 MDSs report slow requests; 2 MDSs behind on trimming
MDS_SLOW_REQUEST 1 MDSs report slow requests
    mdsceph-12(mds.0): 12051 slow requests are blocked > 30 secs
MDS_TRIM 2 MDSs behind on trimming
    mdsceph-12(mds.0): Behind on trimming (294/128) max_segments: 128, 
num_segments: 294
    mdsceph-08(mds.0): Behind on trimming (294/128) max_segments: 128, 
num_segments: 294

[root@ceph-mds:ceph-12 /]# ceph daemon mds.ceph-12 ops | grep description | wc -l
12051

[root@ceph-mds:ceph-12 /]# ceph daemon mds.ceph-12 ops | grep killing | wc -l
12050

So, there was 1 stuck operation that was not client related. I started 
filtering the descriptions with "grep -v" and got a hit:

[root@ceph-mds:ceph-12 /]# ceph daemon mds.ceph-12 ops | grep "internal op 
fragmentdir" -A 100
            "description": "internal op fragmentdir:mds.0:169119",
            "initiated_at": "2019-05-13 11:17:52.002507",
            "age": 12796.457152,
            "duration": 12796.666273,
            "type_data": {
                "flag_point": "dispatched",
                "reqid": "mds.0:169119",
                "op_type": "internal_op",
                "internal_op": 5376,
                "op_name": "fragmentdir",
                "events": [
                    {
                        "time": "2019-05-13 11:17:52.002507",
                        "event": "initiated"
                    },
                    {
                        "time": "2019-05-13 11:17:52.002507",
                        "event": "header_read"
                    },
                    {
                        "time": "0.000000",
                        "event": "throttled"
                    },
                    {
                        "time": "0.000000",
                        "event": "all_read"
                    },
                    {
                        "time": "0.000000",
                        "event": "dispatched"
                    }
                ]
            }
        },

The information I found searching the web indicates that this is something that 
happens in an active/active set-up, where one active MDS requests another 
active MDS to take over part of the metadata operations. In our case, there 
was no second active MDS to accept that request. Our hypothesis is that the 
standby-replay daemon was erroneously seen as active.

I should add here that (all?) the blocked operations had as their last entry 
something like "waiting for directory fragmentation" or "pending ...". I forgot 
to copy the message and have to recall it from memory. In any case, already a 
bit earlier I suspected that something like the above operation was stuck in 
the queue, holding everything up.

- Since all other OPS were cancelled, we assumed it would be safe at this point 
to restart MDS daemons without losing relevant data. The idea to get rid of 
the pending dirfrag was to restart the standby-replay daemon as standby (only) 
and see what happens. If things looked good, we would fail over the active 
daemon and restart this one also with standby (only) configuration as indicated 
above with the modified MDS section in ceph.conf.
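
In concrete terms, the sequence was roughly the following (recalled from 
memory; this assumes the usual systemd units, names may differ elsewhere):

    # restart the standby-replay daemon with the modified (standby only) config
    systemctl restart ceph-mds@ceph-08

    # once ceph-08 is back as a standby, fail rank 0 so it takes over, then
    # restart ceph-12 so that it also comes back with the standby-only config
    ceph mds fail 0
    systemctl restart ceph-mds@ceph-12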

Indeed, this procedure flushed the dirfrag out of the queue and all other 
operations with it. The cluster came back healthy and I do not think we lost 
any relevant IO.

- I should mention that ceph fs actually kept working fine for everyone except 
the user running the benchmark. Only IO to particular files / a particular 
directory stopped, so the problem seems to stay isolated. The load on the 
servers was not high during the test, and the fs remained responsive to other 
users. Also, the MDS daemons never crashed; there was no fail-over except for 
the ones we triggered manually.

As mentioned above, we can do some more testing within reason. We already have 
pilot users on the system and we need to keep it sort of working.


Best regards and thanks in advance for any help on getting a stable 
active/standby-replay config working.

And for reading all that.

=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14