Re: [ceph-users] CephFS fuse client users stuck

2017-04-06 Thread Andras Pataki

Hi John,

Have you managed to reproduce the test case on your side?  Any hints on 
how to proceed, or is there anything I could help with?  I've been trying 
to understand the protocol between the MDS and the fuse client, but if you 
can point me to any docs on the rationale of what the implementation is 
trying to achieve, that'd be super helpful.


Andras


On 03/31/2017 02:07 PM, Andras Pataki wrote:
Running several clients on one node also works well for me (I guess the 
fuse client arbitrates then and the MDS perhaps doesn't need to do so 
much).  So the clients need to be on different nodes for this test to 
fail.


Andras


On 03/31/2017 01:25 PM, John Spray wrote:

On Fri, Mar 31, 2017 at 1:27 PM, Andras Pataki
 wrote:

Hi John,

It took a while but I believe now I have a reproducible test case for 
the capabilities being lost issue in CephFS I wrote about a couple of 
weeks ago.  The quick summary of the problem is that processes often 
hang using CephFS, either for a while or sometimes indefinitely.  The 
fuse client believes it is waiting for some caps which it does not get 
from the MDS.

A quick summary of the setup: the whole system is on Jewel 10.2.6.  We 
are using the fuse client on all nodes for the file system.

The full test program is uploaded together with the verbose logs of all 
clients and the MDS (at log level 20) to 
http://voms.simonsfoundation.org:50013/GTrbrMWDHb9F7CampXyYt5Ensdjg47w/ceph-20170331/
It essentially runs in a loop opening a file for read/write, reading 
from it and closing it.  The read/write open is key; if the file is 
opened read-only, the problem doesn't happen:

    while (time(NULL) - t0 < INTERVAL) {
        int fd = open(FILENAME, O_RDWR);
        if (fd < 0) {
            printf("Could not open file '%s' for read/write", FILENAME);
            exit(1);
        }

        read(fd, buffer, 100);

        close(fd);
        count++;
    }
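
[For reference, a self-contained version of this loop might look like the 
sketch below.  The FILENAME and INTERVAL definitions, the buffer, t0 and 
count variables are assumptions filled in from context; the original 
timed_openrw_read source is only available via the link above.]

/* Sketch of a standalone reproducer based on the loop quoted above. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

/* Path taken from the thread; INTERVAL as used in the test. */
#define FILENAME "/mnt/ceph/users/apataki/test/TEST-READ.txt"
#define INTERVAL 10

int main(void)
{
    char buffer[100];
    long count = 0;
    time_t t0 = time(NULL);

    while (time(NULL) - t0 < INTERVAL) {
        /* The read/write open is what triggers the cap contention. */
        int fd = open(FILENAME, O_RDWR);
        if (fd < 0) {
            printf("Could not open file '%s' for read/write", FILENAME);
            exit(1);
        }

        read(fd, buffer, 100);

        close(fd);
        count++;
    }

    printf("Number of repeats: %ld\n", count);
    return 0;
}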

With INTERVAL = 10 seconds, on one machine with a single process I get 
something like 30k opens per 10 seconds - an excellent result.  When I 
run 3 processes on 3 different nodes, I get a few (perhaps 4 or 5) opens 
per 10 seconds per process.  The particular case I collected logs for 
looks like this:

apataki@worker1108:~/prg/ceph/test$ date; ssh worker1109
~/prg/ceph/test/timed_openrw_read < /dev/null & ssh worker1110
~/prg/ceph/test/timed_openrw_read < /dev/null & ssh worker1108
~/prg/ceph/test/timed_openrw_read < /dev/null &
Fri Mar 31 07:33:04 EDT 2017
[1] 53833
[2] 53834
[3] 53835
apataki@worker1108:~/prg/ceph/test$ Number of repeats: 4
Number of repeats: 3
Number of repeats: 5

[1]   Done    ssh worker1109
~/prg/ceph/test/timed_openrw_read < /dev/null
[2]-  Done    ssh worker1110
~/prg/ceph/test/timed_openrw_read < /dev/null
[3]+  Done    ssh worker1108
~/prg/ceph/test/timed_openrw_read < /dev/null
apataki@worker1108:~/prg/ceph/test$

Looking at the client, it looks like there are 5 second pauses waiting 
for capabilities.  It seems that it doesn't get a response, and perhaps 
at some frequency it polls back to find out what happened to its 
capabilities.  Here is an example of such 5 second gaps from the client 
logs:

2017-03-31 07:33:05.010111 2aaab1a57700 10 client.3849178 waiting for caps
need Fr want Fc
2017-03-31 07:33:05.464058 2fd47700 20 client.3849178 trim_cache size 4
max 16384
2017-03-31 07:33:06.464180 2fd47700 20 client.3849178 trim_cache size 4
max 16384
2017-03-31 07:33:07.464307 2fd47700 20 client.3849178 trim_cache size 4
max 16384
2017-03-31 07:33:08.464421 2fd47700 20 client.3849178 trim_cache size 4
max 16384
2017-03-31 07:33:09.464554 2fd47700 20 client.3849178 trim_cache size 4
max 16384
2017-03-31 07:33:10.464680 2fd47700 10 client.3849178 check_caps on
100024a7b1d.head(faked_ino=0 ref=3 ll_ref=79894 cap_refs={1024=0,2048=0}
open={3=1} mode=100664 size=23/4194304 mtime=2017-03-30 11:56:38.588478
caps=pAsLsXsFr(0=pAsLsXsFr) objectset[100024a7b1d ts 0/0 objects 0
dirty_or_tx 0] parents=0x5f5d5340 0x5f73ea00) wanted pAsxXsxFsxcrwb
used - issued pAsLsXsFr revoking Fsc is_delayed=1

The inode in question is 100024a7b1d which corresponds to the test file
/mnt/ceph/users/apataki/test/TEST-READ.txt
When I grep for this inode in the MDS logs, it also has a gap in the 
7:33:05 to 7:33:10 time.  I've learned some about the caps by trying to 
reproduce the problem, but I'm afraid I don't understand enough of the 
MDS logic to see what the problem is there.

The full logs are too large for the mailing list, so I've put them here:
http://voms.simonsfoundation.org:50013/GTrbrMWDHb9F7CampXyYt5Ensdjg47w/ceph-20170331/

Some help/advice with this would very much be appreciated.  Thanks in 
advance!

Thanks for the reproducer -- I've tried this against master ceph with
several clients on one node and I'm seeing hundreds of repeats per 10
second interval, although my clients are all on 

Re: [ceph-users] CephFS fuse client users stuck

2017-03-31 Thread Andras Pataki
Running several clients on one node also works well for me (I guess the fuse 
client arbitrates then and the MDS perhaps doesn't need to do so much).  
So the clients need to be on different nodes for this test to fail.


Andras


On 03/31/2017 01:25 PM, John Spray wrote:

On Fri, Mar 31, 2017 at 1:27 PM, Andras Pataki
 wrote:

Hi John,

It took a while but I believe now I have a reproducible test case for the
capabilities being lost issue in CephFS I wrote about a couple of weeks ago.
The quick summary of the problem is that processes often hang using CephFS,
either for a while or sometimes indefinitely.  The fuse client believes it
is waiting for some caps which it does not get from the MDS.

A quick summary of the setup: the whole system is on Jewel 10.2.6.  We are
using the fuse client on all nodes for the file system.

The full test program is uploaded together with the verbose logs of all
clients and the MDS (at log level 20) to
http://voms.simonsfoundation.org:50013/GTrbrMWDHb9F7CampXyYt5Ensdjg47w/ceph-20170331/
It essentially runs in a loop opening a file for read/write, reading from it
and closing it.  The read/write open is key; if the file is opened read-only,
the problem doesn't happen:

 while (time(NULL) - t0 < INTERVAL) {

 int fd = open(FILENAME, O_RDWR);
 if (fd < 0) {
 printf("Could not open file '%s' for read/write", FILENAME);
 exit(1);
 }

 read(fd, buffer, 100);

 close(fd);
 count++;
 }

With INTERVAL = 10 seconds, on one machine with a single process I get
something like 30k opens per 10 seconds - an excellent result.  When I run 3
processes on 3 different nodes, I get a few (perhaps 4 or 5) opens per 10
seconds per process.  The particular case I collected logs for looks like
this:

apataki@worker1108:~/prg/ceph/test$ date; ssh worker1109
~/prg/ceph/test/timed_openrw_read < /dev/null & ssh worker1110
~/prg/ceph/test/timed_openrw_read < /dev/null & ssh worker1108
~/prg/ceph/test/timed_openrw_read < /dev/null &
Fri Mar 31 07:33:04 EDT 2017
[1] 53833
[2] 53834
[3] 53835
apataki@worker1108:~/prg/ceph/test$ Number of repeats: 4
Number of repeats: 3
Number of repeats: 5

[1]   Done    ssh worker1109
~/prg/ceph/test/timed_openrw_read < /dev/null
[2]-  Done    ssh worker1110
~/prg/ceph/test/timed_openrw_read < /dev/null
[3]+  Done    ssh worker1108
~/prg/ceph/test/timed_openrw_read < /dev/null
apataki@worker1108:~/prg/ceph/test$

Looking at the client, it looks like there are 5 second pauses waiting for
capabilities.  It seems that it doesn't get a response, and perhaps at some
frequency it polls back to find out what happened to its capabilities.  Here
is an example of such 5 second gaps from the client logs:

2017-03-31 07:33:05.010111 2aaab1a57700 10 client.3849178 waiting for caps
need Fr want Fc
2017-03-31 07:33:05.464058 2fd47700 20 client.3849178 trim_cache size 4
max 16384
2017-03-31 07:33:06.464180 2fd47700 20 client.3849178 trim_cache size 4
max 16384
2017-03-31 07:33:07.464307 2fd47700 20 client.3849178 trim_cache size 4
max 16384
2017-03-31 07:33:08.464421 2fd47700 20 client.3849178 trim_cache size 4
max 16384
2017-03-31 07:33:09.464554 2fd47700 20 client.3849178 trim_cache size 4
max 16384
2017-03-31 07:33:10.464680 2fd47700 10 client.3849178 check_caps on
100024a7b1d.head(faked_ino=0 ref=3 ll_ref=79894 cap_refs={1024=0,2048=0}
open={3=1} mode=100664 size=23/4194304 mtime=2017-03-30 11:56:38.588478
caps=pAsLsXsFr(0=pAsLsXsFr) objectset[100024a7b1d ts 0/0 objects 0
dirty_or_tx 0] parents=0x5f5d5340 0x5f73ea00) wanted pAsxXsxFsxcrwb
used - issued pAsLsXsFr revoking Fsc is_delayed=1

The inode in question is 100024a7b1d which corresponds to the test file
/mnt/ceph/users/apataki/test/TEST-READ.txt
When I grep for this inode in the MDS logs, it also has a gap in the 7:33:05
to 7:33:10 time.  I've learned some about the caps by trying to reproduce
the problem, but I'm afraid I don't understand enough of the MDS logic to
see what the problem is there.

The full logs are too large for the mailing list, so I've put them here:
http://voms.simonsfoundation.org:50013/GTrbrMWDHb9F7CampXyYt5Ensdjg47w/ceph-20170331/

Some help/advice with this would very much be appreciated.  Thanks in
advance!

Thanks for the reproducer -- I've tried this against master ceph with
several clients on one node and I'm seeing hundreds of repeats per 10
second interval, although my clients are all on the same node.  I got
a similar result when setting the client to inject 10ms delays on
messages to simulate being remote.  I ran it in a loop with three
clients several tens of times without seeing a hang.

I'm compiling jewel now to see if it has a different behaviour to
master (with the optimistic notion that we might have already fixed
the underlying cause...)

John


Andras




On 03/14/2017 12:55 PM, John Spray wrote:

On Tue, Mar 14, 2017 

Re: [ceph-users] CephFS fuse client users stuck

2017-03-31 Thread John Spray
On Fri, Mar 31, 2017 at 1:27 PM, Andras Pataki
 wrote:
> Hi John,
>
> It took a while but I believe now I have a reproducible test case for the
> capabilities being lost issue in CephFS I wrote about a couple of weeks ago.
> The quick summary of the problem is that processes often hang using CephFS,
> either for a while or sometimes indefinitely.  The fuse client believes it
> is waiting for some caps which it does not get from the MDS.
>
> A quick summary of the setup: the whole system is on Jewel 10.2.6.  We are
> using the fuse client on all nodes for the file system.
>
> The full test program is uploaded together with the verbose logs of all
> clients and the MDS (at log level 20) to
> http://voms.simonsfoundation.org:50013/GTrbrMWDHb9F7CampXyYt5Ensdjg47w/ceph-20170331/
> It essentially runs in a loop opening a file for read/write, reading from it
> and closing it.  The read/write open is key; if the file is opened read-only,
> the problem doesn't happen:
>
> while (time(NULL) - t0 < INTERVAL) {
>
> int fd = open(FILENAME, O_RDWR);
> if (fd < 0) {
> printf("Could not open file '%s' for read/write", FILENAME);
> exit(1);
> }
>
> read(fd, buffer, 100);
>
> close(fd);
> count++;
> }
>
> With INTERVAL = 10 seconds, on one machine with a single process I get
> something like 30k opens per 10 seconds - an excellent result.  When I run 3
> processes on 3 different nodes, I get a few (perhaps 4 or 5) opens per 10
> seconds per process.  The particular case I collected logs for looks like
> this:
>
> apataki@worker1108:~/prg/ceph/test$ date; ssh worker1109
> ~/prg/ceph/test/timed_openrw_read < /dev/null & ssh worker1110
> ~/prg/ceph/test/timed_openrw_read < /dev/null & ssh worker1108
> ~/prg/ceph/test/timed_openrw_read < /dev/null &
> Fri Mar 31 07:33:04 EDT 2017
> [1] 53833
> [2] 53834
> [3] 53835
> apataki@worker1108:~/prg/ceph/test$ Number of repeats: 4
> Number of repeats: 3
> Number of repeats: 5
>
> [1]   Done    ssh worker1109
> ~/prg/ceph/test/timed_openrw_read < /dev/null
> [2]-  Done    ssh worker1110
> ~/prg/ceph/test/timed_openrw_read < /dev/null
> [3]+  Done    ssh worker1108
> ~/prg/ceph/test/timed_openrw_read < /dev/null
> apataki@worker1108:~/prg/ceph/test$
>
> Looking at the client, it looks like there are 5 second pauses waiting for
> capabilities.  It seems that it doesn't get a response, and perhaps at some
> frequency it polls back to find out what happened to its capabilities.  Here
> is an example of such 5 second gaps from the client logs:
>
> 2017-03-31 07:33:05.010111 2aaab1a57700 10 client.3849178 waiting for caps
> need Fr want Fc
> 2017-03-31 07:33:05.464058 2fd47700 20 client.3849178 trim_cache size 4
> max 16384
> 2017-03-31 07:33:06.464180 2fd47700 20 client.3849178 trim_cache size 4
> max 16384
> 2017-03-31 07:33:07.464307 2fd47700 20 client.3849178 trim_cache size 4
> max 16384
> 2017-03-31 07:33:08.464421 2fd47700 20 client.3849178 trim_cache size 4
> max 16384
> 2017-03-31 07:33:09.464554 2fd47700 20 client.3849178 trim_cache size 4
> max 16384
> 2017-03-31 07:33:10.464680 2fd47700 10 client.3849178 check_caps on
> 100024a7b1d.head(faked_ino=0 ref=3 ll_ref=79894 cap_refs={1024=0,2048=0}
> open={3=1} mode=100664 size=23/4194304 mtime=2017-03-30 11:56:38.588478
> caps=pAsLsXsFr(0=pAsLsXsFr) objectset[100024a7b1d ts 0/0 objects 0
> dirty_or_tx 0] parents=0x5f5d5340 0x5f73ea00) wanted pAsxXsxFsxcrwb
> used - issued pAsLsXsFr revoking Fsc is_delayed=1
>
> The inode in question is 100024a7b1d which corresponds to the test file
> /mnt/ceph/users/apataki/test/TEST-READ.txt
> When I grep for this inode in the MDS logs, it also has a gap in the 7:33:05
> to 7:33:10 time.  I've learned some about the caps by trying to reproduce
> the problem, but I'm afraid I don't understand enough of the MDS logic to
> see what the problem is there.
>
> The full logs are too large for the mailing list, so I've put them here:
> http://voms.simonsfoundation.org:50013/GTrbrMWDHb9F7CampXyYt5Ensdjg47w/ceph-20170331/
>
> Some help/advice with this would very much be appreciated.  Thanks in
> advance!

Thanks for the reproducer -- I've tried this against master ceph with
several clients on one node and I'm seeing hundreds of repeats per 10
second interval, although my clients are all on the same node.  I got
a similar result when setting the client to inject 10ms delays on
messages to simulate being remote.  I ran it in a loop with three
clients several tens of times without seeing a hang.

I'm compiling jewel now to see if it has a different behaviour to
master (with the optimistic notion that we might have already fixed
the underlying cause...)

John

> Andras
>
>
>
>
> On 03/14/2017 12:55 PM, John Spray wrote:
>
> On Tue, Mar 14, 2017 at 2:10 PM, Andras Pataki
>  wrote:
>
> Hi John,
>
> I've 

Re: [ceph-users] CephFS fuse client users stuck

2017-03-31 Thread Andras Pataki

Hi John,

It took a while but I believe now I have a reproducible test case for 
the capabilities being lost issue in CephFS I wrote about a couple of 
weeks ago.  The quick summary of the problem is that processes often hang 
using CephFS, either for a while or sometimes indefinitely.  The fuse 
client believes it is waiting for some caps which it does not get from 
the MDS.


A quick summary of the setup: the whole system is on Jewel 10.2.6. We 
are using the fuse client on all nodes for the file system.


The full test program is uploaded together with the verbose logs of all 
clients and the MDS (at log level 20) to 
http://voms.simonsfoundation.org:50013/GTrbrMWDHb9F7CampXyYt5Ensdjg47w/ceph-20170331/
It essentially runs in a loop opening a file for read/write, reading 
from it and closing it.  The read/write open is key; if the file is 
opened read-only, the problem doesn't happen:


while (time(NULL) - t0 < INTERVAL) {

int fd = open(FILENAME, O_RDWR);
if (fd < 0) {
printf("Could not open file '%s' for read/write",
   FILENAME);
exit(1);
}

read(fd, buffer, 100);

close(fd);
count++;
}

With INTERVAL = 10 seconds, on one machine with a single process I get 
something like 30k opens per 10 seconds - an excellent result. When I 
run 3 processes on 3 different nodes, I get a few (perhaps 4 or 5) opens 
per 10 seconds per process.  The particular case I collected logs for 
looks like this:


   apataki@worker1108:~/prg/ceph/test$ date; ssh worker1109
   ~/prg/ceph/test/timed_openrw_read < /dev/null & ssh worker1110
   ~/prg/ceph/test/timed_openrw_read < /dev/null & ssh worker1108
   ~/prg/ceph/test/timed_openrw_read < /dev/null &
   Fri Mar 31 07:33:04 EDT 2017
   [1] 53833
   [2] 53834
   [3] 53835
   apataki@worker1108:~/prg/ceph/test$ Number of repeats: 4
   Number of repeats: 3
   Number of repeats: 5

   [1]   Done    ssh worker1109
   ~/prg/ceph/test/timed_openrw_read < /dev/null
   [2]-  Done    ssh worker1110
   ~/prg/ceph/test/timed_openrw_read < /dev/null
   [3]+  Done    ssh worker1108
   ~/prg/ceph/test/timed_openrw_read < /dev/null
   apataki@worker1108:~/prg/ceph/test$

Looking at the client, it looks like there are 5 second pauses waiting 
for capabilities.  It seems that it doesn't get a response, and perhaps 
at some frequency it polls back to find out what happened to its 
capabilities.  Here is an example of such 5 second gaps from the client 
logs:


   2017-03-31 07:33:05.010111 2aaab1a57700 10 client.3849178 waiting
   for caps need Fr want Fc
   2017-03-31 07:33:05.464058 2fd47700 20 client.3849178 trim_cache
   size 4 max 16384
   2017-03-31 07:33:06.464180 2fd47700 20 client.3849178 trim_cache
   size 4 max 16384
   2017-03-31 07:33:07.464307 2fd47700 20 client.3849178 trim_cache
   size 4 max 16384
   2017-03-31 07:33:08.464421 2fd47700 20 client.3849178 trim_cache
   size 4 max 16384
   2017-03-31 07:33:09.464554 2fd47700 20 client.3849178 trim_cache
   size 4 max 16384
   2017-03-31 07:33:10.464680 2fd47700 10 client.3849178 check_caps
   on 100024a7b1d.head(faked_ino=0 ref=3 ll_ref=79894
   cap_refs={1024=0,2048=0} open={3=1} mode=100664 size=23/4194304
   mtime=2017-03-30 11:56:38.588478 caps=pAsLsXsFr(0=pAsLsXsFr)
   objectset[100024a7b1d ts 0/0 objects 0 dirty_or_tx 0]
   parents=0x5f5d5340 0x5f73ea00) wanted pAsxXsxFsxcrwb used -
   issued pAsLsXsFr revoking Fsc is_delayed=1

The inode in question is 100024a7b1d which corresponds to the test file 
/mnt/ceph/users/apataki/test/TEST-READ.txt
When I grep for this inode in the MDS logs, it also has a gap in the 
7:33:05 to 7:33:10 time.  I've learned some about the caps by trying to 
reproduce the problem, but I'm afraid I don't understand enough of the 
MDS logic to see what the problem is there.


The full logs are too large for the mailing list, so I've put them here: 
http://voms.simonsfoundation.org:50013/GTrbrMWDHb9F7CampXyYt5Ensdjg47w/ceph-20170331/


Some help/advice with this would very much be appreciated.  Thanks in 
advance!


Andras



On 03/14/2017 12:55 PM, John Spray wrote:

On Tue, Mar 14, 2017 at 2:10 PM, Andras Pataki
 wrote:

Hi John,

I've checked the MDS session list, and the fuse client does appear on that
with 'state' as 'open'.  So both the fuse client and the MDS agree on an
open connection.

Attached is the log of the ceph fuse client at debug level 20.  The MDS got
restarted at 9:44:20, and it went through its startup, and was in an
'active' state in ceph -s by 9:45:20.  As for the IP addresses in the logs,
10.128.128.110 is the MDS IP, the 10.128.128.1xy addresses are OSDs,
10.128.129.63 is the IP of the client the log is from.

So it looks like the client is getting stuck waiting for some
capabilities (the 7fff9c3f7700 thread in that log, which eventually
completes 

Re: [ceph-users] CephFS fuse client users stuck

2017-03-16 Thread Dan van der Ster
On Tue, Mar 14, 2017 at 5:55 PM, John Spray  wrote:
> On Tue, Mar 14, 2017 at 2:10 PM, Andras Pataki
>  wrote:
>> Hi John,
>>
>> I've checked the MDS session list, and the fuse client does appear on that
>> with 'state' as 'open'.  So both the fuse client and the MDS agree on an
>> open connection.
>>
>> Attached is the log of the ceph fuse client at debug level 20.  The MDS got
>> restarted at 9:44:20, and it went through its startup, and was in an
>> 'active' state in ceph -s by 9:45:20.  As for the IP addresses in the logs,
>> 10.128.128.110 is the MDS IP, the 10.128.128.1xy addresses are OSDs,
>> 10.128.129.63 is the IP of the client the log is from.
>
> So it looks like the client is getting stuck waiting for some
> capabilities (the 7fff9c3f7700 thread in that log, which eventually
> completes a ll_write on inode 100024ebea8 after the MDS restart).
> Hard to say whether the MDS failed to send it the proper messages, or
> if the client somehow missed it.
>
> It would be useful to have equally verbose logs from the MDS side from
> earlier on, at the point that the client started trying to do the
> write.  I wonder if you could see if your MDS+client can handle both
> being run at "debug mds = 20", "debug client = 20" respectively for a
> while, then when a client gets stuck, do the MDS restart, and follow
> back in the client log to work out which inode it was stuck on, then
> find log areas on the MDS side relating to that inode number.
>

Yesterday we had a network outage and afterwards had similarly stuck
ceph-fuse clients (v10.2.6), which were fixed by an MDS failover.

Here is the debug_client=20 output when we try to ls /cephfs (it hangs forever):

2017-03-16 09:15:36.164051 7f9e7cf61700 20 client.9341013 trim_cache
size 33 max 16384
2017-03-16 09:15:37.164169 7f9e7cf61700 20 client.9341013 trim_cache
size 33 max 16384
2017-03-16 09:15:38.164258 7f9e7cf61700 20 client.9341013 trim_cache
size 33 max 16384
2017-03-16 09:15:38.77 7f9e515d8700  3 client.9341013 ll_getattr 1.head
2017-03-16 09:15:38.744491 7f9e515d8700 10 client.9341013 _getattr
mask pAsLsXsFs issued=0
2017-03-16 09:15:38.744533 7f9e515d8700 15 inode.get on 0x7f9e8ea7c300
1.head now 56
2017-03-16 09:15:38.744558 7f9e515d8700 20 client.9341013
choose_target_mds starting with req->inode 1.head(faked_ino=0 ref=56
ll_ref=143935987 cap_refs={} open={} mode=40755 size=0/0
mtime=2017-03-15 05:50:02.430699 caps=-(0=pAsLsXsFs) has_dir_layout
0x7f9e8ea7c300)
2017-03-16 09:15:38.744584 7f9e515d8700 20 client.9341013
choose_target_mds 1.head(faked_ino=0 ref=56 ll_ref=143935987
cap_refs={} open={} mode=40755 size=0/0 mtime=2017-03-15 05:50:02.430699
caps=-(0=pAsLsXsFs) has_dir_layout 0x7f9e8ea7c300) is_hash=0 hash=0
2017-03-16 09:15:38.744592 7f9e515d8700 10 client.9341013
choose_target_mds from caps on inode 1.head(faked_ino=0 ref=56
ll_ref=143935987 cap_refs={} open={} mode=40755 size=0/0
mtime=2017-03-15 05:50:02.430699 caps=-(0=pAsLsXsFs) has_dir_layout
0x7f9e8ea7c300)
2017-03-16 09:15:38.744601 7f9e515d8700 20 client.9341013 mds is 0
2017-03-16 09:15:38.744608 7f9e515d8700 10 client.9341013 send_request
rebuilding request 18992614 for mds.0
2017-03-16 09:15:38.744624 7f9e515d8700 20 client.9341013
encode_cap_releases enter (req: 0x7f9e8ef0a280, mds: 0)
2017-03-16 09:15:38.744627 7f9e515d8700 20 client.9341013 send_request
set sent_stamp to 2017-03-16 09:15:38.744626
2017-03-16 09:15:38.744632 7f9e515d8700 10 client.9341013 send_request
client_request(unknown.0:18992614 getattr pAsLsXsFs
 #1 2017-03-16 09:15:38.744538) v3 to mds.0
2017-03-16 09:15:38.744691 7f9e515d8700 20 client.9341013 awaiting
reply|forward|kick on 0x7f9e515d6fa0
2017-03-16 09:15:39.164365 7f9e7cf61700 20 client.9341013 trim_cache
size 33 max 16384
2017-03-16 09:15:40.164470 7f9e7cf61700 20 client.9341013 trim_cache
size 33 max 16384


And the full log when we failover the mds at 2017-03-16
09:20:47.799250 is here:
https://cernbox.cern.ch/index.php/s/sCYdvb9furqS64y

I also have the ceph-fuse core dump if it would be useful.

-- Dan
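
[Side note: a hang like the "ls /cephfs" above can be detected without 
blocking the calling shell.  The rough sketch below is not from the 
thread; the mount path and the 30-second timeout are assumptions.]

/* Probe a possibly-hung CephFS mount from a child process with a timeout. */
#include <signal.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/cephfs";   /* mount point from the example above */
    const int timeout_s = 30;

    pid_t pid = fork();
    if (pid == 0) {
        /* Child: on a stuck client this stat() can block indefinitely. */
        struct stat st;
        _exit(stat(path, &st) == 0 ? 0 : 1);
    }

    /* Parent: poll for the child with a deadline instead of blocking. */
    for (int i = 0; i < timeout_s * 10; i++) {
        int status;
        if (waitpid(pid, &status, WNOHANG) == pid) {
            printf("stat(%s) returned, exit status %d\n", path,
                   WEXITSTATUS(status));
            return 0;
        }
        usleep(100000);   /* 100 ms */
    }

    printf("stat(%s) still hung after %d s\n", path, timeout_s);
    kill(pid, SIGKILL);   /* a child stuck in D state may linger anyway */
    return 1;
}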
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS fuse client users stuck

2017-03-14 Thread Andras Pataki
Thanks for decoding the logs; now I see what to look for.  Can you 
point me to any documentation that explains a bit more of the logic 
(about capabilities, Fb/Fw, how the communication between the client and 
the MDS works, etc.)?


I've tried running the client and the MDS at log level 20, but I didn't 
get any freeze-ups before the MDS (almost) filled up a 100GB root disk.  
Also, the processes were running extremely slowly, presumably since the 
MDS was too busy logging so many gigabytes.  I'm going to try running 
just the clients at 20, and see what happens to that thread and inode 
before the freeze-up.


The inode in question is a file that many, many processes write to 
periodically, opening it in append mode (it is a file with output 
warnings from the algorithm).  This is an MPI job with a few hundred 
ranks, and they all append to the same warning file.  I suspect that 
they somehow write concurrently and things go wrong there (just from 
what the output of the processes looks like).
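
[For illustration only: the actual Arepo code is not shown in this 
thread, but the shared-warning-file pattern described above boils down 
to something like the sketch below.  The WARNFILE path and the 
log_warning helper are hypothetical names.]

/* Each rank periodically appends one warning line to a shared CephFS file. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define WARNFILE "/mnt/ceph/users/apataki/run/warnings.txt"  /* hypothetical path */

void log_warning(int rank, const char *msg)
{
    char line[256];
    int n = snprintf(line, sizeof(line), "rank %d: %s\n", rank, msg);

    /* Open in append mode, write one line, close again. */
    int fd = open(WARNFILE, O_WRONLY | O_APPEND | O_CREAT, 0664);
    if (fd < 0)
        return;
    write(fd, line, n);
    close(fd);
}

int main(void)
{
    /* In the real job, a few hundred MPI ranks do this concurrently. */
    log_warning(0, "example warning");
    return 0;
}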


FsUsage is a change that I made on our side to the ceph client to 
collect information on client usage.  It sits quite on the outside of 
the client: it collects some statistics about calls at the fuse API 
level and doesn't mess with the logic of what the calls do afterwards.  
I've tried running the vanilla fuse client from the CentOS repositories 
just to rule those changes out as the cause - but that also fails the 
same way.  There are no modifications to any other parts of ceph other 
than the fuse client data collection; everything else comes from the 
official CentOS builds.


I'll try some more debugging and write back if/when I find something.  
But any info explaining how things are supposed to work would be 
appreciated.


Andras


On 03/14/2017 12:55 PM, John Spray wrote:

On Tue, Mar 14, 2017 at 2:10 PM, Andras Pataki
 wrote:

Hi John,

I've checked the MDS session list, and the fuse client does appear on that
with 'state' as 'open'.  So both the fuse client and the MDS agree on an
open connection.

Attached is the log of the ceph fuse client at debug level 20.  The MDS got
restarted at 9:44:20, and it went through its startup, and was in an
'active' state in ceph -s by 9:45:20.  As for the IP addresses in the logs,
10.128.128.110 is the MDS IP, the 10.128.128.1xy addresses are OSDs,
10.128.129.63 is the IP of the client the log is from.

So it looks like the client is getting stuck waiting for some
capabilities (the 7fff9c3f7700 thread in that log, which eventually
completes a ll_write on inode 100024ebea8 after the MDS restart).
Hard to say whether the MDS failed to send it the proper messages, or
if the client somehow missed it.

It would be useful to have equally verbose logs from the MDS side from
earlier on, at the point that the client started trying to do the
write.  I wonder if you could see if your MDS+client can handle both
being run at "debug mds = 20", "debug client = 20" respectively for a
while, then when a client gets stuck, do the MDS restart, and follow
back in the client log to work out which inode it was stuck on, then
find log areas on the MDS side relating to that inode number.

BTW I see this "FsUsage" stuff in your log, which I don't recognise
from mainline Ceph; are you running something modified?

John



As per another suggestion, I've also tried kick_stale_sessions on the fuse
client, which didn't help (I guess since it doesn't think the session is
stale).
Let me know if there is anything else I can do to help.

Andras



On 03/13/2017 06:08 PM, John Spray wrote:

On Mon, Mar 13, 2017 at 8:15 PM, Andras Pataki
 wrote:

Dear Cephers,

We're using the ceph file system with the fuse client, and lately some of
our processes are getting stuck seemingly waiting for fuse operations.
At
the same time, the cluster is healthy, no slow requests, all OSDs up and
running, and both the MDS and the fuse client think that there are no
pending operations.  The situation is semi-reproducible.  When I run
various cluster jobs, some get stuck after a few hours of correct
operation.
The cluster is on ceph 10.2.5 and 10.2.6, the fuse clients are 10.2.6,
but I
have tried 10.2.5 and 10.2.3, all of which have the same issue.  This is
on
CentOS (7.2 for the clients, 7.3 for the MDS/OSDs).

Here are some details:

The node with the stuck processes:

[root@worker1070 ~]# ps -auxwww | grep 30519
apataki   30519 39.8  0.9 8728064 5257588 ? Dl   12:11  60:50 ./Arepo
param.txt 2 6
[root@worker1070 ~]# cat /proc/30519/stack
[] fuse_file_aio_write+0xbb/0x340 [fuse]
[] do_sync_write+0x8d/0xd0
[] vfs_write+0xbd/0x1e0
[] SyS_write+0x7f/0xe0
[] system_call_fastpath+0x16/0x1b
[] 0x

[root@worker1070 ~]# ps -auxwww | grep 30533
apataki   30533 39.8  0.9 8795316 5261308 ? Sl   12:11  60:55 ./Arepo
param.txt 2 6
[root@worker1070 ~]# cat /proc/30533/stack
[] wait_answer_interruptible+0x91/0xe0 [fuse]
[] 

Re: [ceph-users] CephFS fuse client users stuck

2017-03-14 Thread John Spray
On Tue, Mar 14, 2017 at 2:10 PM, Andras Pataki
 wrote:
> Hi John,
>
> I've checked the MDS session list, and the fuse client does appear on that
> with 'state' as 'open'.  So both the fuse client and the MDS agree on an
> open connection.
>
> Attached is the log of the ceph fuse client at debug level 20.  The MDS got
> restarted at 9:44:20, and it went through its startup, and was in an
> 'active' state in ceph -s by 9:45:20.  As for the IP addresses in the logs,
> 10.128.128.110 is the MDS IP, the 10.128.128.1xy addresses are OSDs,
> 10.128.129.63 is the IP of the client the log is from.

So it looks like the client is getting stuck waiting for some
capabilities (the 7fff9c3f7700 thread in that log, which eventually
completes a ll_write on inode 100024ebea8 after the MDS restart).
Hard to say whether the MDS failed to send it the proper messages, or
if the client somehow missed it.

It would be useful to have equally verbose logs from the MDS side from
earlier on, at the point that the client started trying to do the
write.  I wonder if you could see if your MDS+client can handle both
being run at "debug mds = 20", "debug client = 20" respectively for a
while, then when a client gets stuck, do the MDS restart, and follow
back in the client log to work out which inode it was stuck on, then
find log areas on the MDS side relating to that inode number.

BTW I see this "FsUsage" stuff in your log, which I don't recognise
from mainline Ceph; are you running something modified?

John


> As per another suggestion, I've also tried kick_stale_sessions on the fuse
> client, which didn't help (I guess since it doesn't think the session is
> stale).
> Let me know if there is anything else I can do to help.
>
> Andras
>
>
>
> On 03/13/2017 06:08 PM, John Spray wrote:
>>
>> On Mon, Mar 13, 2017 at 8:15 PM, Andras Pataki
>>  wrote:
>>>
>>> Dear Cephers,
>>>
>>> We're using the ceph file system with the fuse client, and lately some of
>>> our processes are getting stuck seemingly waiting for fuse operations.
>>> At
>>> the same time, the cluster is healthy, no slow requests, all OSDs up and
>>> running, and both the MDS and the fuse client think that there are no
>>> pending operations.  The situation is semi-reproducible.  When I run a
>>> various cluster jobs, some get stuck after a few hours of correct
>>> operation.
>>> The cluster is on ceph 10.2.5 and 10.2.6, the fuse clients are 10.2.6,
>>> but I
>>> have tried 10.2.5 and 10.2.3, all of which have the same issue.  This is
>>> on
>>> CentOS (7.2 for the clients, 7.3 for the MDS/OSDs).
>>>
>>> Here are some details:
>>>
>>> The node with the stuck processes:
>>>
>>> [root@worker1070 ~]# ps -auxwww | grep 30519
>>> apataki   30519 39.8  0.9 8728064 5257588 ? Dl   12:11  60:50 ./Arepo
>>> param.txt 2 6
>>> [root@worker1070 ~]# cat /proc/30519/stack
>>> [] fuse_file_aio_write+0xbb/0x340 [fuse]
>>> [] do_sync_write+0x8d/0xd0
>>> [] vfs_write+0xbd/0x1e0
>>> [] SyS_write+0x7f/0xe0
>>> [] system_call_fastpath+0x16/0x1b
>>> [] 0x
>>>
>>> [root@worker1070 ~]# ps -auxwww | grep 30533
>>> apataki   30533 39.8  0.9 8795316 5261308 ? Sl   12:11  60:55 ./Arepo
>>> param.txt 2 6
>>> [root@worker1070 ~]# cat /proc/30533/stack
>>> [] wait_answer_interruptible+0x91/0xe0 [fuse]
>>> [] __fuse_request_send+0x253/0x2c0 [fuse]
>>> [] fuse_request_send+0x12/0x20 [fuse]
>>> [] fuse_send_write+0xd6/0x110 [fuse]
>>> [] fuse_perform_write+0x2ed/0x590 [fuse]
>>> [] fuse_file_aio_write+0x2a1/0x340 [fuse]
>>> [] do_sync_write+0x8d/0xd0
>>> [] vfs_write+0xbd/0x1e0
>>> [] SyS_write+0x7f/0xe0
>>> [] system_call_fastpath+0x16/0x1b
>>> [] 0x
>>>
>>> Presumably the second process is waiting on the first holding some lock
>>> ...
>>>
>>> The fuse client on the node:
>>>
>>> [root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok
>>> status
>>> {
>>>  "metadata": {
>>>  "ceph_sha1": "656b5b63ed7c43bd014bcafd81b001959d5f089f",
>>>  "ceph_version": "ceph version 10.2.6
>>> (656b5b63ed7c43bd014bcafd81b001959d5f089f)",
>>>  "entity_id": "admin",
>>>  "hostname": "worker1070",
>>>  "mount_point": "\/mnt\/ceph",
>>>  "root": "\/"
>>>  },
>>>  "dentry_count": 40,
>>>  "dentry_pinned_count": 23,
>>>  "inode_count": 123,
>>>  "mds_epoch": 19041,
>>>  "osd_epoch": 462327,
>>>  "osd_epoch_barrier": 462326
>>> }
>>>
>>> [root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok
>>> mds_sessions
>>> {
>>>  "id": 3616543,
>>>  "sessions": [
>>>  {
>>>  "mds": 0,
>>>  "addr": "10.128.128.110:6800\/909443124",
>>>  "seq": 338,
>>>  "cap_gen": 0,
>>>  "cap_ttl": "2017-03-13 14:47:37.575229",
>>>  "last_cap_renew_request": "2017-03-13 14:46:37.575229",
>>>  "cap_renew_seq": 12694,
>>>  "num_caps": 713,
>>>  

Re: [ceph-users] CephFS fuse client users stuck

2017-03-14 Thread Andras Pataki

Hi John,

I've checked the MDS session list, and the fuse client does appear on 
that with 'state' as 'open'.  So both the fuse client and the MDS agree 
on an open connection.


Attached is the log of the ceph fuse client at debug level 20.  The MDS 
got restarted at 9:44:20, and it went through its startup, and was in an 
'active' state in ceph -s by 9:45:20.  As for the IP addresses in the 
logs, 10.128.128.110 is the MDS IP, the 10.128.128.1xy addresses are 
OSDs, 10.128.129.63 is the IP of the client the log is from.


As per another suggestion, I've also tried kick_stale_sessions on the 
fuse client, which didn't help (I guess since it doesn't think the 
session is stale).

Let me know if there is anything else I can do to help.

Andras


On 03/13/2017 06:08 PM, John Spray wrote:

On Mon, Mar 13, 2017 at 8:15 PM, Andras Pataki
 wrote:

Dear Cephers,

We're using the ceph file system with the fuse client, and lately some of
our processes are getting stuck seemingly waiting for fuse operations.  At
the same time, the cluster is healthy, no slow requests, all OSDs up and
running, and both the MDS and the fuse client think that there are no
pending operations.  The situation is semi-reproducible.  When I run
various cluster jobs, some get stuck after a few hours of correct operation.
The cluster is on ceph 10.2.5 and 10.2.6, the fuse clients are 10.2.6, but I
have tried 10.2.5 and 10.2.3, all of which have the same issue.  This is on
CentOS (7.2 for the clients, 7.3 for the MDS/OSDs).

Here are some details:

The node with the stuck processes:

[root@worker1070 ~]# ps -auxwww | grep 30519
apataki   30519 39.8  0.9 8728064 5257588 ? Dl   12:11  60:50 ./Arepo
param.txt 2 6
[root@worker1070 ~]# cat /proc/30519/stack
[] fuse_file_aio_write+0xbb/0x340 [fuse]
[] do_sync_write+0x8d/0xd0
[] vfs_write+0xbd/0x1e0
[] SyS_write+0x7f/0xe0
[] system_call_fastpath+0x16/0x1b
[] 0x

[root@worker1070 ~]# ps -auxwww | grep 30533
apataki   30533 39.8  0.9 8795316 5261308 ? Sl   12:11  60:55 ./Arepo
param.txt 2 6
[root@worker1070 ~]# cat /proc/30533/stack
[] wait_answer_interruptible+0x91/0xe0 [fuse]
[] __fuse_request_send+0x253/0x2c0 [fuse]
[] fuse_request_send+0x12/0x20 [fuse]
[] fuse_send_write+0xd6/0x110 [fuse]
[] fuse_perform_write+0x2ed/0x590 [fuse]
[] fuse_file_aio_write+0x2a1/0x340 [fuse]
[] do_sync_write+0x8d/0xd0
[] vfs_write+0xbd/0x1e0
[] SyS_write+0x7f/0xe0
[] system_call_fastpath+0x16/0x1b
[] 0x

Presumably the second process is waiting on the first holding some lock ...

The fuse client on the node:

[root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok status
{
 "metadata": {
 "ceph_sha1": "656b5b63ed7c43bd014bcafd81b001959d5f089f",
 "ceph_version": "ceph version 10.2.6
(656b5b63ed7c43bd014bcafd81b001959d5f089f)",
 "entity_id": "admin",
 "hostname": "worker1070",
 "mount_point": "\/mnt\/ceph",
 "root": "\/"
 },
 "dentry_count": 40,
 "dentry_pinned_count": 23,
 "inode_count": 123,
 "mds_epoch": 19041,
 "osd_epoch": 462327,
 "osd_epoch_barrier": 462326
}

[root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok
mds_sessions
{
 "id": 3616543,
 "sessions": [
 {
 "mds": 0,
 "addr": "10.128.128.110:6800\/909443124",
 "seq": 338,
 "cap_gen": 0,
 "cap_ttl": "2017-03-13 14:47:37.575229",
 "last_cap_renew_request": "2017-03-13 14:46:37.575229",
 "cap_renew_seq": 12694,
 "num_caps": 713,
 "state": "open"
 }
 ],
 "mdsmap_epoch": 19041
}

[root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok
mds_requests
{}


The overall cluster health and the MDS:

[root@cephosd000 ~]# ceph -s
 cluster d7b33135-0940-4e48-8aa6-1d2026597c2f
  health HEALTH_WARN
 noscrub,nodeep-scrub,require_jewel_osds flag(s) set
  monmap e17: 3 mons at
{hyperv029=10.4.36.179:6789/0,hyperv030=10.4.36.180:6789/0,hyperv031=10.4.36.181:6789/0}
 election epoch 29148, quorum 0,1,2 hyperv029,hyperv030,hyperv031
   fsmap e19041: 1/1/1 up {0=cephosd000=up:active}
  osdmap e462328: 624 osds: 624 up, 624 in
 flags noscrub,nodeep-scrub,require_jewel_osds
   pgmap v44458747: 42496 pgs, 6 pools, 924 TB data, 272 Mobjects
 2154 TB used, 1791 TB / 3946 TB avail
42496 active+clean
   client io 86911 kB/s rd, 556 MB/s wr, 227 op/s rd, 303 op/s wr

[root@cephosd000 ~]# ceph daemon /var/run/ceph/ceph-mds.cephosd000.asok ops
{
 "ops": [],
 "num_ops": 0
}


The odd thing is that if in this state I restart the MDS, the client process
wakes up and proceeds with its work without any errors.  As if a request was
lost and somehow retransmitted/restarted when the MDS got restarted and the
fuse layer reconnected to it.

Interesting.  A couple of ideas for 

Re: [ceph-users] CephFS fuse client users stuck

2017-03-14 Thread Henrik Korkuc

On 17-03-14 00:08, John Spray wrote:

On Mon, Mar 13, 2017 at 8:15 PM, Andras Pataki
 wrote:

Dear Cephers,

We're using the ceph file system with the fuse client, and lately some of
our processes are getting stuck seemingly waiting for fuse operations.  At
the same time, the cluster is healthy, no slow requests, all OSDs up and
running, and both the MDS and the fuse client think that there are no
pending operations.  The situation is semi-reproducible.  When I run
various cluster jobs, some get stuck after a few hours of correct operation.
The cluster is on ceph 10.2.5 and 10.2.6, the fuse clients are 10.2.6, but I
have tried 10.2.5 and 10.2.3, all of which have the same issue.  This is on
CentOS (7.2 for the clients, 7.3 for the MDS/OSDs).

Here are some details:

The node with the stuck processes:

[root@worker1070 ~]# ps -auxwww | grep 30519
apataki   30519 39.8  0.9 8728064 5257588 ? Dl   12:11  60:50 ./Arepo
param.txt 2 6
[root@worker1070 ~]# cat /proc/30519/stack
[] fuse_file_aio_write+0xbb/0x340 [fuse]
[] do_sync_write+0x8d/0xd0
[] vfs_write+0xbd/0x1e0
[] SyS_write+0x7f/0xe0
[] system_call_fastpath+0x16/0x1b
[] 0x

[root@worker1070 ~]# ps -auxwww | grep 30533
apataki   30533 39.8  0.9 8795316 5261308 ? Sl   12:11  60:55 ./Arepo
param.txt 2 6
[root@worker1070 ~]# cat /proc/30533/stack
[] wait_answer_interruptible+0x91/0xe0 [fuse]
[] __fuse_request_send+0x253/0x2c0 [fuse]
[] fuse_request_send+0x12/0x20 [fuse]
[] fuse_send_write+0xd6/0x110 [fuse]
[] fuse_perform_write+0x2ed/0x590 [fuse]
[] fuse_file_aio_write+0x2a1/0x340 [fuse]
[] do_sync_write+0x8d/0xd0
[] vfs_write+0xbd/0x1e0
[] SyS_write+0x7f/0xe0
[] system_call_fastpath+0x16/0x1b
[] 0x

Presumably the second process is waiting on the first holding some lock ...

The fuse client on the node:

[root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok status
{
 "metadata": {
 "ceph_sha1": "656b5b63ed7c43bd014bcafd81b001959d5f089f",
 "ceph_version": "ceph version 10.2.6
(656b5b63ed7c43bd014bcafd81b001959d5f089f)",
 "entity_id": "admin",
 "hostname": "worker1070",
 "mount_point": "\/mnt\/ceph",
 "root": "\/"
 },
 "dentry_count": 40,
 "dentry_pinned_count": 23,
 "inode_count": 123,
 "mds_epoch": 19041,
 "osd_epoch": 462327,
 "osd_epoch_barrier": 462326
}

[root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok
mds_sessions
{
 "id": 3616543,
 "sessions": [
 {
 "mds": 0,
 "addr": "10.128.128.110:6800\/909443124",
 "seq": 338,
 "cap_gen": 0,
 "cap_ttl": "2017-03-13 14:47:37.575229",
 "last_cap_renew_request": "2017-03-13 14:46:37.575229",
 "cap_renew_seq": 12694,
 "num_caps": 713,
 "state": "open"
 }
 ],
 "mdsmap_epoch": 19041
}

[root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok
mds_requests
{}


The overall cluster health and the MDS:

[root@cephosd000 ~]# ceph -s
 cluster d7b33135-0940-4e48-8aa6-1d2026597c2f
  health HEALTH_WARN
 noscrub,nodeep-scrub,require_jewel_osds flag(s) set
  monmap e17: 3 mons at
{hyperv029=10.4.36.179:6789/0,hyperv030=10.4.36.180:6789/0,hyperv031=10.4.36.181:6789/0}
 election epoch 29148, quorum 0,1,2 hyperv029,hyperv030,hyperv031
   fsmap e19041: 1/1/1 up {0=cephosd000=up:active}
  osdmap e462328: 624 osds: 624 up, 624 in
 flags noscrub,nodeep-scrub,require_jewel_osds
   pgmap v44458747: 42496 pgs, 6 pools, 924 TB data, 272 Mobjects
 2154 TB used, 1791 TB / 3946 TB avail
42496 active+clean
   client io 86911 kB/s rd, 556 MB/s wr, 227 op/s rd, 303 op/s wr

[root@cephosd000 ~]# ceph daemon /var/run/ceph/ceph-mds.cephosd000.asok ops
{
 "ops": [],
 "num_ops": 0
}


The odd thing is that if in this state I restart the MDS, the client process
wakes up and proceeds with its work without any errors.  As if a request was
lost and somehow retransmitted/restarted when the MDS got restarted and the
fuse layer reconnected to it.

Interesting.  A couple of ideas for more debugging:

* Next time you go through this process of restarting the MDS while
there is a stuck client, first increase the client's logging ("ceph
daemon <path to client .asok> config set debug_client
20").  Then we should get a clear sense of exactly what's happening on
the MDS restart that's enabling the client to proceed.
* When inspecting the client's "mds_sessions" output, also check the
"session ls" output on the MDS side to make sure the MDS and client
both agree that it has an open session.

John


Please check if kicking ceph-fuse ("ceph --admin-daemon 
/var/run/ceph/ceph-client.something_something.asok kick_stale_sessions") 
works for you.


I experienced similar problems, but didn't dig into it fully.  You can 
check out:


Re: [ceph-users] CephFS fuse client users stuck

2017-03-13 Thread John Spray
On Mon, Mar 13, 2017 at 8:15 PM, Andras Pataki
 wrote:
> Dear Cephers,
>
> We're using the ceph file system with the fuse client, and lately some of
> our processes are getting stuck seemingly waiting for fuse operations.  At
> the same time, the cluster is healthy, no slow requests, all OSDs up and
> running, and both the MDS and the fuse client think that there are no
> pending operations.  The situation is semi-reproducible.  When I run
> various cluster jobs, some get stuck after a few hours of correct operation.
> The cluster is on ceph 10.2.5 and 10.2.6, the fuse clients are 10.2.6, but I
> have tried 10.2.5 and 10.2.3, all of which have the same issue.  This is on
> CentOS (7.2 for the clients, 7.3 for the MDS/OSDs).
>
> Here are some details:
>
> The node with the stuck processes:
>
> [root@worker1070 ~]# ps -auxwww | grep 30519
> apataki   30519 39.8  0.9 8728064 5257588 ? Dl   12:11  60:50 ./Arepo
> param.txt 2 6
> [root@worker1070 ~]# cat /proc/30519/stack
> [] fuse_file_aio_write+0xbb/0x340 [fuse]
> [] do_sync_write+0x8d/0xd0
> [] vfs_write+0xbd/0x1e0
> [] SyS_write+0x7f/0xe0
> [] system_call_fastpath+0x16/0x1b
> [] 0x
>
> [root@worker1070 ~]# ps -auxwww | grep 30533
> apataki   30533 39.8  0.9 8795316 5261308 ? Sl   12:11  60:55 ./Arepo
> param.txt 2 6
> [root@worker1070 ~]# cat /proc/30533/stack
> [] wait_answer_interruptible+0x91/0xe0 [fuse]
> [] __fuse_request_send+0x253/0x2c0 [fuse]
> [] fuse_request_send+0x12/0x20 [fuse]
> [] fuse_send_write+0xd6/0x110 [fuse]
> [] fuse_perform_write+0x2ed/0x590 [fuse]
> [] fuse_file_aio_write+0x2a1/0x340 [fuse]
> [] do_sync_write+0x8d/0xd0
> [] vfs_write+0xbd/0x1e0
> [] SyS_write+0x7f/0xe0
> [] system_call_fastpath+0x16/0x1b
> [] 0x
>
> Presumably the second process is waiting on the first holding some lock ...
>
> The fuse client on the node:
>
> [root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok status
> {
> "metadata": {
> "ceph_sha1": "656b5b63ed7c43bd014bcafd81b001959d5f089f",
> "ceph_version": "ceph version 10.2.6
> (656b5b63ed7c43bd014bcafd81b001959d5f089f)",
> "entity_id": "admin",
> "hostname": "worker1070",
> "mount_point": "\/mnt\/ceph",
> "root": "\/"
> },
> "dentry_count": 40,
> "dentry_pinned_count": 23,
> "inode_count": 123,
> "mds_epoch": 19041,
> "osd_epoch": 462327,
> "osd_epoch_barrier": 462326
> }
>
> [root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok
> mds_sessions
> {
> "id": 3616543,
> "sessions": [
> {
> "mds": 0,
> "addr": "10.128.128.110:6800\/909443124",
> "seq": 338,
> "cap_gen": 0,
> "cap_ttl": "2017-03-13 14:47:37.575229",
> "last_cap_renew_request": "2017-03-13 14:46:37.575229",
> "cap_renew_seq": 12694,
> "num_caps": 713,
> "state": "open"
> }
> ],
> "mdsmap_epoch": 19041
> }
>
> [root@worker1070 ~]# ceph daemon /var/run/ceph/ceph-client.admin.asok
> mds_requests
> {}
>
>
> The overall cluster health and the MDS:
>
> [root@cephosd000 ~]# ceph -s
> cluster d7b33135-0940-4e48-8aa6-1d2026597c2f
>  health HEALTH_WARN
> noscrub,nodeep-scrub,require_jewel_osds flag(s) set
>  monmap e17: 3 mons at
> {hyperv029=10.4.36.179:6789/0,hyperv030=10.4.36.180:6789/0,hyperv031=10.4.36.181:6789/0}
> election epoch 29148, quorum 0,1,2 hyperv029,hyperv030,hyperv031
>   fsmap e19041: 1/1/1 up {0=cephosd000=up:active}
>  osdmap e462328: 624 osds: 624 up, 624 in
> flags noscrub,nodeep-scrub,require_jewel_osds
>   pgmap v44458747: 42496 pgs, 6 pools, 924 TB data, 272 Mobjects
> 2154 TB used, 1791 TB / 3946 TB avail
>42496 active+clean
>   client io 86911 kB/s rd, 556 MB/s wr, 227 op/s rd, 303 op/s wr
>
> [root@cephosd000 ~]# ceph daemon /var/run/ceph/ceph-mds.cephosd000.asok ops
> {
> "ops": [],
> "num_ops": 0
> }
>
>
> The odd thing is that if in this state I restart the MDS, the client process
> wakes up and proceeds with its work without any errors.  As if a request was
> lost and somehow retransmitted/restarted when the MDS got restarted and the
> fuse layer reconnected to it.

Interesting.  A couple of ideas for more debugging:

* Next time you go through this process of restarting the MDS while
there is a stuck client, first increase the client's logging ("ceph
daemon <path to client .asok> config set debug_client
20").  Then we should get a clear sense of exactly what's happening on
the MDS restart that's enabling the client to proceed.
* When inspecting the client's "mds_sessions" output, also check the
"session ls" output on the MDS side to make sure the MDS and client
both agree that it has an open session.

John
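
[To make the procedure above easier to repeat across nodes, the admin 
socket commands quoted in this thread could be driven from a small 
wrapper; the sketch below is only an illustration.  The socket paths are 
taken from earlier messages and will differ per host and client.]

/* Run the client/MDS admin-socket checks suggested above and echo output. */
#include <stdio.h>

static void run(const char *cmd)
{
    char buf[4096];
    printf("$ %s\n", cmd);
    FILE *p = popen(cmd, "r");
    if (!p)
        return;
    while (fgets(buf, sizeof(buf), p))
        fputs(buf, stdout);
    pclose(p);
}

int main(void)
{
    /* Client side: raise logging and check its view of the MDS session. */
    run("ceph daemon /var/run/ceph/ceph-client.admin.asok config set debug_client 20");
    run("ceph daemon /var/run/ceph/ceph-client.admin.asok mds_sessions");
    run("ceph daemon /var/run/ceph/ceph-client.admin.asok mds_requests");

    /* MDS side (run where the MDS asok lives): confirm the session is open. */
    run("ceph daemon /var/run/ceph/ceph-mds.cephosd000.asok session ls");
    return 0;
}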

>
> When I try to attach a gdb session to either of the client processes, gdb
> just hangs.  However, right after the MDS