Re: [ceph-users] Luminous cephfs maybe not to stable as expected?

2019-07-11 Thread Marc Roos
 


I decided to restart osd.0; the CephFS load and the load on all OSD 
nodes then dropped. After this I still have the following on the first 
server:


[@~]# cat 
/sys/kernel/debug/ceph/0f1701f5-453a-4a3b-928d-f652a2bbbcb0.client357431
0/osdc
REQUESTS 0 homeless 0
LINGER REQUESTS
BACKOFFS


[@~]# cat 
/sys/kernel/debug/ceph/0f1701f5-453a-4a3b-928d-f652a2bbbcb0.client358422
4/osdc
REQUESTS 2 homeless 0
317841  osd020.d6ec44c1 20.1[0,28,5]/0  [0,28,5]/0  
e65040  10001b44a70.0x40001c102023  read
317853  osd020.5956d31b 20.1b   [0,5,10]/0  [0,5,10]/0  
e65040  10001ad8962.0x40001c40731   read
LINGER REQUESTS
BACKOFFS
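
For what it is worth, a quick way to check at a glance whether any 
kernel client on a host still has in-flight OSD requests (just a 
sketch, assuming debugfs is mounted as above):

[@~]# grep -H "^REQUESTS" /sys/kernel/debug/ceph/*/osdc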

And dmesg -T keeps giving me these (again with the wrong timestamps):

[Thu Jul 11 11:23:21 2019] libceph: mon1 192.168.10.112:6789 session 
established
[Thu Jul 11 11:23:21 2019] libceph: mon1 192.168.10.112:6789 io error
[Thu Jul 11 11:23:21 2019] libceph: mon1 192.168.10.112:6789 session 
lost, hunting for new mon
[Thu Jul 11 11:23:21 2019] libceph: mon0 192.168.10.111:6789 session 
established
[Thu Jul 11 11:23:21 2019] libceph: mon0 192.168.10.111:6789 io error
[Thu Jul 11 11:23:21 2019] libceph: mon0 192.168.10.111:6789 session 
lost, hunting for new mon
[Thu Jul 11 11:23:21 2019] libceph: mon2 192.168.10.113:6789 session 
established
[Thu Jul 11 11:23:21 2019] libceph: mon2 192.168.10.113:6789 io error
[Thu Jul 11 11:23:21 2019] libceph: mon2 192.168.10.113:6789 session 
lost, hunting for new mon
[Thu Jul 11 11:23:21 2019] libceph: mon1 192.168.10.112:6789 session 
established
[Thu Jul 11 11:23:21 2019] libceph: mon1 192.168.10.112:6789 io error
[Thu Jul 11 11:23:21 2019] libceph: mon1 192.168.10.112:6789 session 
lost, hunting for new mon
[Thu Jul 11 11:23:21 2019] libceph: mon0 192.168.10.111:6789 session 
established
[Thu Jul 11 11:23:21 2019] libceph: mon0 192.168.10.111:6789 io error
[Thu Jul 11 11:23:21 2019] libceph: mon0 192.168.10.111:6789 session 
lost, hunting for new mon

What to do now? Restarting the monitor did not help.


-Original Message-
Subject: Re: [ceph-users] Luminous cephfs maybe not to stable as 
expected?

 

Forgot to add these

[@ ~]# cat
/sys/kernel/debug/ceph/0f1701f5-453a-4a3b-928d-f652a2bbbcb0.client357431
0/osdc
REQUESTS 0 homeless 0
LINGER REQUESTS
BACKOFFS

[@~]# cat
/sys/kernel/debug/ceph/0f1701f5-453a-4a3b-928d-f652a2bbbcb0.client358422
4/osdc
REQUESTS 38 homeless 0
317841  osd020.d6ec44c1 20.1[0,28,5]/0  [0,28,5]/0  
e65040  10001b44a70.0x40001c101139  read
317853  osd020.5956d31b 20.1b   [0,5,10]/0  [0,5,10]/0  
e65040  10001ad8962.0x40001c39847   read
317835  osd320.ede889de 20.1e   [3,12,27]/3 [3,12,27]/3 
e65040  10001ad80f6.0x40001c87758   read
317838  osd320.7b730a4e 20.e[3,31,9]/3  [3,31,9]/3  
e65040  10001ad89d8.0x40001c83444   read
317844  osd320.feead84c 20.c[3,13,18]/3 [3,13,18]/3 
e65040  10001ad8733.0x40001c77267   read
317852  osd320.bd2658e  20.e[3,31,9]/3  [3,31,9]/3  
e65040  10001ad7e00.0x40001c39331   read
317830  osd420.922e6d04 20.4[4,16,27]/4 [4,16,27]/4 
e65040  10001ad80f2.0x40001c86326   read
317837  osd420.fe93d4ab 20.2b   [4,14,25]/4 [4,14,25]/4 
e65040  10001ad80fb.0x40001c78951   read
317839  osd420.d7af926b 20.2b   [4,14,25]/4 [4,14,25]/4 
e65040  10001ad80ee.0x40001c77556   read
317849  osd520.5fcb95c5 20.5[5,18,29]/5 [5,18,29]/5 
e65040  10001ad7f75.0x40001c61147   read
317857  osd520.28764e9a 20.1a   [5,7,28]/5  [5,7,28]/5  
e65040  10001ad8a10.0x40001c30369   read
317859  osd520.7bb79985 20.5[5,18,29]/5 [5,18,29]/5 
e65040  10001ad7fe8.0x40001c27942   read
317836  osd820.e7bf5cf4 20.34   [8,5,10]/8  [8,5,10]/8  
e65040  10001ad7d79.0x40001c133699  read
317842  osd820.abbb9df4 20.34   [8,5,10]/8  [8,5,10]/8  
e65040  10001d5903f.0x40001c125308  read
317850  osd820.ecd0034  20.34   [8,5,10]/8  [8,5,10]/8  
e65040  10001ad89b2.0x40001c68348   read
317854  osd820.cef50134 20.34   [8,5,10]/8  [8,5,10]/8  
e65040  10001ad8728.0x40001c57431   read
317861  osd820.3e859bb4 20.34   [8,5,10]/8  [8,5,10]/8  
e65040  10001ad8108.0x40001c50642   read
317847  osd920.fc9e9f43 20.3[9,29,17]/9 [9,29,17]/9 
e65040  10001ad8101.0x40001c88464   read
317848  osd920.d32b6ac3 20.3[9,29,17]/9 [9,29,17]/9 
e65040  10001ad8100.0x40001c85929   read
317862  osd11   20

 20.11   [25,4,0]/25 [25,4,0]/25 
e65040  10001ad8023.0x40001c147589  read
317856  osd26   20.febb382a 20.2a   [26,0,18]/26[26,0,18]/26
e65040  10001ad8145.0x40001c65169   read
317845  osd27   20.5b433067 20.27   [27,7,14]/27[27,7,14]/27
e65040  10001ad8965.0x40001c124461  read
LINGER REQUESTS
BACKOFFS
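
In case it is useful: the object names above start with the file's 
inode number in hex, so something like the following should map one 
back to a path (just a sketch, /home/mail-archive is one of the mount 
points here):

[@~]# find /home/mail-archive -inum $((16#10001ad8962))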





[ceph-users] Luminous cephfs maybe not to stable as expected?

2019-07-11 Thread Marc Roos

Maybe this requires some attention. I have a default CentOS 7 setup 
(maybe not the most recent kernel though) with Ceph Luminous, i.e. 
nothing unusual kernel-wise.

This is the 2nd or 3rd time that a VM has gone into a high load (151) 
and stopped its services. I have two VMs, both mounting the same 2 
CephFS 'shares'. After the last incident I unmounted the shares on the 
2nd server. (We are migrating to a new environment, so this 2nd server 
is not doing anything.) Last time I thought this might be related to 
my work on switching from the stupid allocator to the bitmap 
allocator.

Anyway, yesterday I thought: let's mount the 2 shares on the 2nd 
server again and see what happens. And this morning the high load was 
back. AFAIK the 2nd server is only running a cron job on the CephFS 
mounts, creating snapshots.
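
As a rough illustration (the schedule and path here are hypothetical, 
only to show what I mean): creating CephFS snapshots from cron is just 
making a directory under .snap, e.g. in /etc/crontab:

0 * * * * root mkdir /home/mail-archive/.snap/$(date +\%Y\%m\%d-\%H\%M)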

1) I still have increased load on the OSD nodes, coming from CephFS. 
How can I see which client is causing this? I don't seem to get this 
from 'ceph daemon mds.c session ls'; however, 'ceph osd pool stats | 
grep client -B 1' indicates it is CephFS.
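
For reference, these are roughly the checks I am doing (a sketch; 
'mds.c' is my active MDS and the debugfs path assumes kernel clients):

[@c03 ~]# ceph daemon mds.c session ls
[@c03 ~]# ceph osd pool stats | grep client -B 1
[@~]# cat /sys/kernel/debug/ceph/*/osdc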

2) ceph osd blacklist ls
No blacklist entries

3) The first server keeps generating messages like these, while there 
is no connectivity issue:

[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 session 
lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session 
established
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session 
lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 session 
established
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 session 
lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session 
established
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session 
lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 session 
established
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 session 
lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session 
established
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session 
lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: osd25 192.168.10.114:6804 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session 
established
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session 
lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session 
established
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session 
lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: osd18 192.168.10.112:6802 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session 
established
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session 
lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session 
established
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon2 192.168.10.113:6789 session 
lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: osd22 192.168.10.111:6811 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session 
established
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 io error
[Thu Jul 11 10:41:22 2019] libceph: mon1 192.168.10.112:6789 session 
lost, hunting for new mon
[Thu Jul 11 10:41:22 2019] libceph: mon0 192.168.10.111:6789 session 
established

PS: dmesg -T gives me strange times; as you can see these are in the 
future. The OS time is 2 minutes behind (which is the correct one, 
ntpd synced).
[@ ]# uptime
 10:39:17 up 50 days, 13:31,  2 users,  load average: 3.60, 3.02, 2.57
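
(As far as I understand, dmesg -T converts the raw seconds-since-boot 
prefix using the current estimate of the boot time, so any clock 
adjustment since boot shows up as an offset. A rough manual 
conversion, with a made-up timestamp as an example:)

[@~]# up=$(awk '{print int($1)}' /proc/uptime)
[@~]# ts=1234567   # hypothetical [seconds] prefix from a plain 'dmesg' line
[@~]# date -d "@$(( $(date +%s) - up + ts ))"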

4) Unmounting the filesystem on the first server fails.

5) Evicting the CephFS sessions of the first server does not change 
the CephFS load on the OSD nodes.

6) Unmounting all CephFS clients still leaves me with CephFS activity 
on the data pool and on the OSD nodes.

[@c03 ~]# ceph daemon mds.c session ls
[] 
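
(A sketch of where I could still look for the remaining in-flight 
operations; the daemon names are only examples:)

[@c03 ~]# ceph daemon mds.c objecter_requests
[@ ~]# ceph daemon osd.0 dump_ops_in_flight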

7) On the first server:
[@~]# ps -auxf| grep D
USER   PID %CPU %MEMVSZ   RSS TTY  STAT START   TIME COMMAND
root  6716  3.0  0.0  0 0 ?D10:18   0:59  \_ 
[kworker/0:2]
root 20039  0.0  0.0 123520  1212 pts/0D+   10:28   0:00  |  
 \_ umount /home/mail-archive/
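
(A slightly more precise way to list only D-state processes, as a 
sketch:)

[@~]# ps -eo pid,stat,wchan:32,cmd | awk '$2 ~ /D/'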

[@ ~]# cat /proc/6716/stack
[] __wait_on_freeing_inode+0xb0/0xf0
[] find_inode+0x99/0xc0