On Wed, May 15, 2019 at 9:34 PM Frank Schilder <fr...@dtu.dk> wrote:
>
> Dear Stefan,
>
> thanks for the fast reply. We encountered the problem again, this time in a 
> much simpler situation; please see below. However, let me start with your 
> questions first:
>
> What bug? -- In a single-active MDS set-up, should there ever occur an 
> operation with "op_name": "fragmentdir"?
>
> Trimming settings: In the version I'm running, mds_log_max_expiring does not 
> exist and mds_log_max_segments is 128 by default. I guess this is fine.
>
> Upgrading: The problem described here is the only issue we observe. Unless 
> the problem is fixed upstream, upgrading won't help us and would be a bit of 
> a waste of time. If someone can confirm that this problem is fixed in a newer 
> version, we will do it. Otherwise, we might prefer to wait until it is.
>
> News on the problem. We encountered it again when one of our users executed a 
> command in parallel with pdsh on all our ~500 client nodes. This command 
> accesses the same file from all these nodes pretty much simultaneously. We 
> did this quite often in the past, but this time, the command got stuck and we 
> started observing the MDS health problem again. Symptoms:
>
> - The pdsh process enters an un-interruptible state.
> - It is no longer possible to access the directory where the simultaneously 
> accessed file resides (from any client).
> - 'ceph status' reports 'MDS slow requests'
> - The 'ceph daemon mds.nnn ops' list contains operations that are waiting for 
> directory fragmentation (see full log below).
> - The ops list contains an operation "internal op fragmentdir:mds.0:35" that 
> is dispatched, but apparently never completes.
> - Any attempt to access the locked directory adds operations to the ops list 
> that will then also hang indefinitely.
> - I/O to other directories continues to work fine.
>
> We waited some time to confirm that ceph does not heal itself. It is a 
> dead-lock situation that seems to be triggered by a large number of clients 
> simultaneously accessing the same file/directory. This problem seems not to 
> occur with 100 or fewer clients. The probability of occurrence seems load 
> dependent.
>
> Temporary fix: Failing the active MDS flushed the stuck operations. The 
> cluster became healthy and all clients rejoined.
>
> This time I captured the MDS ops list (log output does not really contain 
> more info than this list). It contains 12 ops and I will include it here in 
> full length (hope this is acceptable):
>

Your issues were caused by stuck internal op fragmentdir.  Can you
dump mds cache and send the output to us?

> {
>     "ops": [
>         {
>             "description": "client_request(client.386087:12791 lookup 
> #0x10000000027/file.pdf 2019-05-15 11:30:47.173526 caller_uid=0, 
> caller_gid=0{})",
>             "initiated_at": "2019-05-15 11:30:47.174134",
>             "age": 492.800243,
>             "duration": 492.800277,
>             "type_data": {
>                 "flag_point": "failed to authpin, dir is being fragmented",
>                 "reqid": "client.386087:12791",
>                 "op_type": "client_request",
>                 "client_info": {
>                     "client": "client.386087",
>                     "tid": 12791
>                 },
>                 "events": [
>                     {
>                         "time": "2019-05-15 11:30:47.174134",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2019-05-15 11:30:47.174134",
>                         "event": "header_read"
>                     },
>                     {
>                         "time": "2019-05-15 11:30:47.174136",
>                         "event": "throttled"
>                     },
>                     {
>                         "time": "2019-05-15 11:30:47.174144",
>                         "event": "all_read"
>                     },
>                     {
>                         "time": "2019-05-15 11:30:47.174245",
>                         "event": "dispatched"
>                     },
>                     {
>                         "time": "2019-05-15 11:30:47.174271",
>                         "event": "failed to authpin, dir is being fragmented"
>                     }
>                 ]
>             }
>         },
>         {
>             "description": "client_request(client.62472:6092355 create 
> #0x10000000038/lastnotification.uXMjaLSt 2019-05-15 11:15:02.883027 
> caller_uid=105731, caller_gid=105731{})",
>             "initiated_at": "2019-05-15 11:15:02.884547",
>             "age": 1437.089830,
>             "duration": 1437.089937,
>             "type_data": {
>                 "flag_point": "failed to authpin, dir is being fragmented",
>                 "reqid": "client.62472:6092355",
>                 "op_type": "client_request",
>                 "client_info": {
>                     "client": "client.62472",
>                     "tid": 6092355
>                 },
>                 "events": [
>                     {
>                         "time": "2019-05-15 11:15:02.884547",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2019-05-15 11:15:02.884547",
>                         "event": "header_read"
>                     },
>                     {
>                         "time": "2019-05-15 11:15:02.884552",
>                         "event": "throttled"
>                     },
>                     {
>                         "time": "2019-05-15 11:15:02.884571",
>                         "event": "all_read"
>                     },
>                     {
>                         "time": "2019-05-15 11:15:02.884721",
>                         "event": "dispatched"
>                     },
>                     {
>                         "time": "2019-05-15 11:15:02.884857",
>                         "event": "failed to authpin, dir is being fragmented"
>                     }
>                 ]
>             }
>         },
>         {
>             "description": "client_request(client.386087:12793 lookup 
> #0x10000000027/file.pdf 2019-05-15 11:33:52.499279 caller_uid=0, 
> caller_gid=0{})",
>             "initiated_at": "2019-05-15 11:33:52.500354",
>             "age": 307.474024,
>             "duration": 307.474189,
>             "type_data": {
>                 "flag_point": "failed to authpin, dir is being fragmented",
>                 "reqid": "client.386087:12793",
>                 "op_type": "client_request",
>                 "client_info": {
>                     "client": "client.386087",
>                     "tid": 12793
>                 },
>                 "events": [
>                     {
>                         "time": "2019-05-15 11:33:52.500354",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2019-05-15 11:33:52.500354",
>                         "event": "header_read"
>                     },
>                     {
>                         "time": "2019-05-15 11:33:52.500356",
>                         "event": "throttled"
>                     },
>                     {
>                         "time": "2019-05-15 11:33:52.500393",
>                         "event": "all_read"
>                     },
>                     {
>                         "time": "2019-05-15 11:33:52.500458",
>                         "event": "dispatched"
>                     },
>                     {
>                         "time": "2019-05-15 11:33:52.500482",
>                         "event": "failed to authpin, dir is being fragmented"
>                     }
>                 ]
>             }
>         },
>         {
>             "description": "client_request(client.377552:1989 lookup 
> #0x10000000027/file.pdf 2019-05-15 11:40:44.566328 caller_uid=0, 
> caller_gid=0{})",
>             "initiated_at": "2019-05-15 11:36:13.509050",
>             "age": 166.465328,
>             "duration": 166.465529,
>             "type_data": {
>                 "flag_point": "failed to authpin, dir is being fragmented",
>                 "reqid": "client.377552:1989",
>                 "op_type": "client_request",
>                 "client_info": {
>                     "client": "client.377552",
>                     "tid": 1989
>                 },
>                 "events": [
>                     {
>                         "time": "2019-05-15 11:36:13.509050",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2019-05-15 11:36:13.509050",
>                         "event": "header_read"
>                     },
>                     {
>                         "time": "2019-05-15 11:36:13.509052",
>                         "event": "throttled"
>                     },
>                     {
>                         "time": "2019-05-15 11:36:13.509063",
>                         "event": "all_read"
>                     },
>                     {
>                         "time": "2019-05-15 11:36:13.509163",
>                         "event": "dispatched"
>                     },
>                     {
>                         "time": "2019-05-15 11:36:13.509219",
>                         "event": "failed to authpin, dir is being fragmented"
>                     }
>                 ]
>             }
>         },
>         {
>             "description": "client_request(client.377552:1990 lookup 
> #0x10000000027/file.pdf 2019-05-15 11:40:46.130328 caller_uid=0, 
> caller_gid=0{})",
>             "initiated_at": "2019-05-15 11:36:15.072708",
>             "age": 164.901670,
>             "duration": 164.901906,
>             "type_data": {
>                 "flag_point": "failed to authpin, dir is being fragmented",
>                 "reqid": "client.377552:1990",
>                 "op_type": "client_request",
>                 "client_info": {
>                     "client": "client.377552",
>                     "tid": 1990
>                 },
>                 "events": [
>                     {
>                         "time": "2019-05-15 11:36:15.072708",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2019-05-15 11:36:15.072708",
>                         "event": "header_read"
>                     },
>                     {
>                         "time": "2019-05-15 11:36:15.072710",
>                         "event": "throttled"
>                     },
>                     {
>                         "time": "2019-05-15 11:36:15.072719",
>                         "event": "all_read"
>                     },
>                     {
>                         "time": "2019-05-15 11:36:15.072847",
>                         "event": "dispatched"
>                     },
>                     {
>                         "time": "2019-05-15 11:36:15.072881",
>                         "event": "failed to authpin, dir is being fragmented"
>                     }
>                 ]
>             }
>         },
>         {
>             "description": "client_request(client.62472:6092426 readdir 
> #0x1000000002f 2019-05-15 11:21:53.080524 caller_uid=0, caller_gid=0{})",
>             "initiated_at": "2019-05-15 11:21:53.081656",
>             "age": 1026.892722,
>             "duration": 1026.892997,
>             "type_data": {
>                 "flag_point": "failed to authpin, dir is being fragmented",
>                 "reqid": "client.62472:6092426",
>                 "op_type": "client_request",
>                 "client_info": {
>                     "client": "client.62472",
>                     "tid": 6092426
>                 },
>                 "events": [
>                     {
>                         "time": "2019-05-15 11:21:53.081656",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2019-05-15 11:21:53.081656",
>                         "event": "header_read"
>                     },
>                     {
>                         "time": "2019-05-15 11:21:53.081658",
>                         "event": "throttled"
>                     },
>                     {
>                         "time": "2019-05-15 11:21:53.081669",
>                         "event": "all_read"
>                     },
>                     {
>                         "time": "2019-05-15 11:21:53.081772",
>                         "event": "dispatched"
>                     },
>                     {
>                         "time": "2019-05-15 11:21:53.081826",
>                         "event": "failed to authpin, dir is being fragmented"
>                     }
>                 ]
>             }
>         },
>         {
>             "description": "internal op fragmentdir:mds.0:35",
>             "initiated_at": "2019-05-15 11:06:33.594105",
>             "age": 1946.380273,
>             "duration": 1946.380603,
>             "type_data": {
>                 "flag_point": "dispatched",
>                 "reqid": "mds.0:35",
>                 "op_type": "internal_op",
>                 "internal_op": 5376,
>                 "op_name": "fragmentdir",
>                 "events": [
>                     {
>                         "time": "2019-05-15 11:06:33.594105",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2019-05-15 11:06:33.594105",
>                         "event": "header_read"
>                     },
>                     {
>                         "time": "0.000000",
>                         "event": "throttled"
>                     },
>                     {
>                         "time": "0.000000",
>                         "event": "all_read"
>                     },
>                     {
>                         "time": "0.000000",
>                         "event": "dispatched"
>                     }
>                 ]
>             }
>         },
>         {
>             "description": "client_request(client.386087:12786 lookup 
> #0x10000000027/file.pdf 2019-05-15 11:22:06.783606 caller_uid=0, 
> caller_gid=0{})",
>             "initiated_at": "2019-05-15 11:22:06.784393",
>             "age": 1013.189985,
>             "duration": 1013.190344,
>             "type_data": {
>                 "flag_point": "failed to authpin, dir is being fragmented",
>                 "reqid": "client.386087:12786",
>                 "op_type": "client_request",
>                 "client_info": {
>                     "client": "client.386087",
>                     "tid": 12786
>                 },
>                 "events": [
>                     {
>                         "time": "2019-05-15 11:22:06.784393",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2019-05-15 11:22:06.784393",
>                         "event": "header_read"
>                     },
>                     {
>                         "time": "2019-05-15 11:22:06.784394",
>                         "event": "throttled"
>                     },
>                     {
>                         "time": "2019-05-15 11:22:06.784405",
>                         "event": "all_read"
>                     },
>                     {
>                         "time": "2019-05-15 11:22:06.784506",
>                         "event": "dispatched"
>                     },
>                     {
>                         "time": "2019-05-15 11:22:06.784562",
>                         "event": "failed to authpin, dir is being fragmented"
>                     }
>                 ]
>             }
>         },
>         {
>             "description": "client_request(client.386087:12795 lookup 
> #0x10000000027/file.pdf 2019-05-15 11:37:31.764353 caller_uid=0, 
> caller_gid=0{})",
>             "initiated_at": "2019-05-15 11:37:31.765631",
>             "age": 88.208747,
>             "duration": 88.209160,
>             "type_data": {
>                 "flag_point": "failed to authpin, dir is being fragmented",
>                 "reqid": "client.386087:12795",
>                 "op_type": "client_request",
>                 "client_info": {
>                     "client": "client.386087",
>                     "tid": 12795
>                 },
>                 "events": [
>                     {
>                         "time": "2019-05-15 11:37:31.765631",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2019-05-15 11:37:31.765631",
>                         "event": "header_read"
>                     },
>                     {
>                         "time": "2019-05-15 11:37:31.765633",
>                         "event": "throttled"
>                     },
>                     {
>                         "time": "2019-05-15 11:37:31.765640",
>                         "event": "all_read"
>                     },
>                     {
>                         "time": "2019-05-15 11:37:31.765731",
>                         "event": "dispatched"
>                     },
>                     {
>                         "time": "2019-05-15 11:37:31.765759",
>                         "event": "failed to authpin, dir is being fragmented"
>                     }
>                 ]
>             }
>         },
>         {
>             "description": "client_request(client.377552:5446 readdir 
> #0x1000000003a 2019-05-15 11:43:07.569329 caller_uid=0, caller_gid=0{})",
>             "initiated_at": "2019-05-15 11:38:36.511381",
>             "age": 23.462997,
>             "duration": 23.463467,
>             "type_data": {
>                 "flag_point": "failed to authpin, dir is being fragmented",
>                 "reqid": "client.377552:5446",
>                 "op_type": "client_request",
>                 "client_info": {
>                     "client": "client.377552",
>                     "tid": 5446
>                 },
>                 "events": [
>                     {
>                         "time": "2019-05-15 11:38:36.511381",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2019-05-15 11:38:36.511381",
>                         "event": "header_read"
>                     },
>                     {
>                         "time": "2019-05-15 11:38:36.511383",
>                         "event": "throttled"
>                     },
>                     {
>                         "time": "2019-05-15 11:38:36.511392",
>                         "event": "all_read"
>                     },
>                     {
>                         "time": "2019-05-15 11:38:36.511561",
>                         "event": "dispatched"
>                     },
>                     {
>                         "time": "2019-05-15 11:38:36.511604",
>                         "event": "failed to authpin, dir is being fragmented"
>                     }
>                 ]
>             }
>         },
>         {
>             "description": "client_request(client.62472:6092368 getattr 
> pAsLsXsFs #0x10000000038 2019-05-15 11:17:21.633854 caller_uid=105731, 
> caller_gid=105731{})",
>             "initiated_at": "2019-05-15 11:17:21.635927",
>             "age": 1298.338451,
>             "duration": 1298.338955,
>             "type_data": {
>                 "flag_point": "failed to authpin, dir is being fragmented",
>                 "reqid": "client.62472:6092368",
>                 "op_type": "client_request",
>                 "client_info": {
>                     "client": "client.62472",
>                     "tid": 6092368
>                 },
>                 "events": [
>                     {
>                         "time": "2019-05-15 11:17:21.635927",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2019-05-15 11:17:21.635927",
>                         "event": "header_read"
>                     },
>                     {
>                         "time": "2019-05-15 11:17:21.635931",
>                         "event": "throttled"
>                     },
>                     {
>                         "time": "2019-05-15 11:17:21.635944",
>                         "event": "all_read"
>                     },
>                     {
>                         "time": "2019-05-15 11:17:21.636081",
>                         "event": "dispatched"
>                     },
>                     {
>                         "time": "2019-05-15 11:17:21.636118",
>                         "event": "failed to authpin, dir is being fragmented"
>                     }
>                 ]
>             }
>         },
>         {
>             "description": "client_request(client.62472:6092400 getattr 
> pAsLsXsFs #0x10000000038 2019-05-15 11:21:25.909555 caller_uid=105731, 
> caller_gid=105731{})",
>             "initiated_at": "2019-05-15 11:21:25.910514",
>             "age": 1054.063864,
>             "duration": 1054.064406,
>             "type_data": {
>                 "flag_point": "failed to authpin, dir is being fragmented",
>                 "reqid": "client.62472:6092400",
>                 "op_type": "client_request",
>                 "client_info": {
>                     "client": "client.62472",
>                     "tid": 6092400
>                 },
>                 "events": [
>                     {
>                         "time": "2019-05-15 11:21:25.910514",
>                         "event": "initiated"
>                     },
>                     {
>                         "time": "2019-05-15 11:21:25.910514",
>                         "event": "header_read"
>                     },
>                     {
>                         "time": "2019-05-15 11:21:25.910527",
>                         "event": "throttled"
>                     },
>                     {
>                         "time": "2019-05-15 11:21:25.910537",
>                         "event": "all_read"
>                     },
>                     {
>                         "time": "2019-05-15 11:21:25.910597",
>                         "event": "dispatched"
>                     },
>                     {
>                         "time": "2019-05-15 11:21:25.910635",
>                         "event": "failed to authpin, dir is being fragmented"
>                     }
>                 ]
>             }
>         }
>     ],
>     "num_ops": 12
> }
>
> =================
> Frank Schilder
> AIT Ris? Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Stefan Kooman <ste...@bit.nl>
> Sent: 14 May 2019 09:54:05
> To: Frank Schilder
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] mimic: MDS standby-replay causing blocked ops (MDS 
> bug?)
>
> Quoting Frank Schilder (fr...@dtu.dk):
>
> If at all possible I would:
>
> Upgrade to 13.2.5 (there have been quite a few MDS fixes since 13.2.2).
> Use more recent kernels on the clients.
>
> Below settings for [mds] might help with trimming (you might already
> have changed mds_log_max_segments to 128 according to logs):
>
> [mds]
> mds_log_max_expiring = 80      # default 20
> # trim max $value segments in parallel
> # Defaults are too conservative.
> mds_log_max_segments = 120      # default 30
>
>
> > 1) Is there a bug with having MDS daemons acting as standby-replay?
> I can't tell what bug you are referring to based on info below. It does
> seem to work as designed.
>
> Gr. Stefan
>
> --
> | BIT BV  http://www.bit.nl/        Kamer van Koophandel 09090351
> | GPG: 0xD14839C6                   +31 318 648 688 / i...@bit.nl
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Reply via email to