Re: [ceph-users] cephfs clients hanging multi mds to single mds

2018-10-02 Thread Paul Emmerich
The kernel cephfs client unfortunately has a tendency to get stuck in
states that are not recoverable without a reboot, especially on older
kernels.

Paul
On Tue, 2 Oct 2018 at 14:55, Jaime Ibar wrote:
>
> Hi Paul,
>
> I tried mounting ceph-fuse on a different mount point and it worked.
>
> The problem here is that we can't unmount the ceph kernel client as it is
> in use by some virsh processes. We forced the unmount and mounted
> ceph-fuse, but we got an I/O error; umount -l cleared all the processes,
> but after rebooting the VMs they didn't come back and a server reboot was
> needed.
>
> I'm not sure how I can restore the mds session or remount cephfs while
> keeping all processes running.
>
> Thanks a lot for your help.
>
> Jaime
>
>
> On 02/10/18 11:02, Paul Emmerich wrote:
> > Kernel 4.4 is not suitable for a multi MDS setup. In general, I
> > wouldn't feel comfortable running 4.4 with kernel cephfs in
> > production.
> > I think at least 4.15 (not sure, but definitely > 4.9) is recommended
> > for multi MDS setups.
> >
> > If you can't reboot: maybe try ceph-fuse instead, which works very well
> > and is usually fast enough.
> >
> > Paul
> >
> > On Tue, 2 Oct 2018 at 10:45, Jaime Ibar wrote:
> >> Hi Paul,
> >>
> >> we're using the 4.4 kernel. Not sure if more recent kernels are stable
> >> for production services. In any case, as there are some production
> >> services running on those servers, rebooting wouldn't be an option
> >> if we can bring the ceph clients back without it.
> >>
> >> Thanks
> >>
> >> Jaime
> >>
> >>
> >> On 01/10/18 21:10, Paul Emmerich wrote:
> >>> Which kernel version are you using for the kernel cephfs clients?
> >>> I've seen this problem with "older" kernels (where old is as recent as 
> >>> 4.9)
> >>>
> >>> Paul
> >>> On Mon, 1 Oct 2018 at 18:35, Jaime Ibar wrote:
>  Hi all,
> 
>  we're running a ceph 12.2.7 Luminous cluster. Two weeks ago we enabled
>  multi mds and after a few hours
> 
>  these errors started showing up
> 
>  2018-09-28 09:41:20.577350 mds.1 [WRN] slow request 64.421475 seconds
>  old, received at 2018-09-28 09:40:16.155841:
>  client_request(client.31059144:8544450 getattr Xs #0$
>  12e1e73 2018-09-28 09:40:16.147368 caller_uid=0, caller_gid=124{})
>  currently failed to authpin local pins
> 
>  2018-09-28 10:56:51.051100 mon.1 [WRN] Health check failed: 5 clients
>  failing to respond to cache pressure (MDS_CLIENT_RECALL)
>  2018-09-28 10:57:08.000361 mds.1 [WRN] 3 slow requests, 1 included
>  below; oldest blocked for > 4614.580689 secs
>  2018-09-28 10:57:08.000365 mds.1 [WRN] slow request 244.796854 seconds
>  old, received at 2018-09-28 10:53:03.203476:
>  client_request(client.31059144:9080057 lookup #0x100
>  000b7564/58 2018-09-28 10:53:03.197922 caller_uid=0, caller_gid=0{})
>  currently initiated
>  2018-09-28 11:00:00.000105 mon.1 [WRN] overall HEALTH_WARN 1 clients
>  failing to respond to capability release; 5 clients failing to respond
>  to cache pressure; 1 MDSs report slow requests,
> 
>  Due to this, we decided to go back to a single mds (as it worked before);
>  however, the clients pointing to mds.1 started hanging, while the
>  ones pointing to mds.0 worked fine.
>
>  Then we tried to enable multi mds again and the clients pointing to mds.1
>  went back online, however the ones pointing to mds.0 stopped working.
>
>  Today we tried to go back to a single mds, however this error was
>  preventing ceph from disabling the second active mds (mds.1):
> 
>  2018-10-01 14:33:48.358443 mds.1 [WRN] evicting unresponsive client
>  X: (30108925), after 68213.084174 seconds
> 
>  After waiting for 3 hours, we restarted the mds.1 daemon (as it was stuck
>  in the stopping state forever due to the above error), waited for it to
>  become active again,
>
>  unmounted the problematic clients, waited for the cluster to be healthy
>  and tried to go back to a single mds again.
>
>  Apparently this worked for some of the clients. We tried to enable
>  multi mds again to bring the faulty clients back, however no luck this
>  time
>
>  and some of them are hanging and can't access the ceph fs.
> 
>  This is what we have in kern.log
> 
>  Oct  1 15:29:32 05 kernel: [2342847.017426] ceph: mds1 reconnect start
>  Oct  1 15:29:32 05 kernel: [2342847.018677] ceph: mds1 reconnect success
>  Oct  1 15:29:49 05 kernel: [2342864.651398] ceph: mds1 recovery completed
> 
>  Not sure what else we can try to bring the hanging clients back without
>  rebooting, as they're in production and rebooting is not an option.
>
>  Does anyone know how we can deal with this, please?
> 
>  Thanks
> 
>  Jaime
> 
>  --
> 
>  Jaime Ibar
>  High Performance & Research Computing, IS Services

Re: [ceph-users] cephfs clients hanging multi mds to single mds

2018-10-02 Thread Jaime Ibar

Hi Paul,

I tried mounting ceph-fuse on a different mount point and it worked.

The problem here is that we can't unmount the ceph kernel client as it is
in use by some virsh processes. We forced the unmount and mounted
ceph-fuse, but we got an I/O error; umount -l cleared all the processes,
but after rebooting the VMs they didn't come back and a server reboot was
needed.

I'm not sure how I can restore the mds session or remount cephfs while
keeping all processes running.
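
For anyone hitting the same situation, a rough sketch of how to see what is
pinning the kernel mount before forcing anything; /mnt/cephfs and
/mnt/cephfs-fuse are placeholder paths for your own setup:

    # list processes holding the cephfs kernel mount
    fuser -vm /mnt/cephfs
    # lazily detach the kernel mount once nothing critical is left on it
    umount -l /mnt/cephfs
    # mount via FUSE on a separate mount point as a fallback
    ceph-fuse --id admin -m mon1:6789 /mnt/cephfs-fuse

Processes that still held files on the old mount keep their broken handles,
so guests started from it will generally still need to be restarted, as
described above.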

Thanks a lot for your help.

Jaime


On 02/10/18 11:02, Paul Emmerich wrote:

Kernel 4.4 is not suitable for a multi MDS setup. In general, I
wouldn't feel comfortable running 4.4 with kernel cephfs in
production.
I think at least 4.15 (not sure, but definitely > 4.9) is recommended
for multi MDS setups.

If you can't reboot: maybe try ceph-fuse instead, which works very well
and is usually fast enough.

Paul

On Tue, 2 Oct 2018 at 10:45, Jaime Ibar wrote:

Hi Paul,

we're using the 4.4 kernel. Not sure if more recent kernels are stable
for production services. In any case, as there are some production
services running on those servers, rebooting wouldn't be an option
if we can bring the ceph clients back without it.

Thanks

Jaime


On 01/10/18 21:10, Paul Emmerich wrote:

Which kernel version are you using for the kernel cephfs clients?
I've seen this problem with "older" kernels (where old is as recent as 4.9)

Paul
On Mon, 1 Oct 2018 at 18:35, Jaime Ibar wrote:

Hi all,

we're running a ceph 12.2.7 Luminous cluster. Two weeks ago we enabled
multi mds and after a few hours

these errors started showing up

2018-09-28 09:41:20.577350 mds.1 [WRN] slow request 64.421475 seconds
old, received at 2018-09-28 09:40:16.155841:
client_request(client.31059144:8544450 getattr Xs #0$
12e1e73 2018-09-28 09:40:16.147368 caller_uid=0, caller_gid=124{})
currently failed to authpin local pins

2018-09-28 10:56:51.051100 mon.1 [WRN] Health check failed: 5 clients
failing to respond to cache pressure (MDS_CLIENT_RECALL)
2018-09-28 10:57:08.000361 mds.1 [WRN] 3 slow requests, 1 included
below; oldest blocked for > 4614.580689 secs
2018-09-28 10:57:08.000365 mds.1 [WRN] slow request 244.796854 seconds
old, received at 2018-09-28 10:53:03.203476:
client_request(client.31059144:9080057 lookup #0x100
000b7564/58 2018-09-28 10:53:03.197922 caller_uid=0, caller_gid=0{})
currently initiated
2018-09-28 11:00:00.000105 mon.1 [WRN] overall HEALTH_WARN 1 clients
failing to respond to capability release; 5 clients failing to respond
to cache pressure; 1 MDSs report slow requests,

Due to this, we decided to go back to a single mds (as it worked before);
however, the clients pointing to mds.1 started hanging, while the
ones pointing to mds.0 worked fine.

Then we tried to enable multi mds again and the clients pointing to mds.1
went back online, however the ones pointing to mds.0 stopped working.

Today we tried to go back to a single mds, however this error was
preventing ceph from disabling the second active mds (mds.1):

2018-10-01 14:33:48.358443 mds.1 [WRN] evicting unresponsive client
X: (30108925), after 68213.084174 seconds

After waiting for 3 hours, we restarted the mds.1 daemon (as it was stuck
in the stopping state forever due to the above error), waited for it to
become active again,

unmounted the problematic clients, waited for the cluster to be healthy
and tried to go back to a single mds again.

Apparently this worked for some of the clients. We tried to enable
multi mds again to bring the faulty clients back, however no luck this
time

and some of them are hanging and can't access the ceph fs.

This is what we have in kern.log

Oct  1 15:29:32 05 kernel: [2342847.017426] ceph: mds1 reconnect start
Oct  1 15:29:32 05 kernel: [2342847.018677] ceph: mds1 reconnect success
Oct  1 15:29:49 05 kernel: [2342864.651398] ceph: mds1 recovery completed

Not sure what else we can try to bring the hanging clients back without
rebooting, as they're in production and rebooting is not an option.

Does anyone know how we can deal with this, please?

Thanks

Jaime

--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725





--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs clients hanging multi mds to single mds

2018-10-02 Thread Paul Emmerich
Kernel 4.4 is not suitable for a multi MDS setup. In general, I
wouldn't feel comfortable running 4.4 with kernel cephfs in
production.
I think at least 4.15 (not sure, but definitely > 4.9) is recommended
for multi MDS setups.

If you can't reboot: maybe try ceph-fuse instead, which works very well
and is usually fast enough.
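
For reference, a minimal ceph-fuse mount looks roughly like this; the client
id, monitor address and mount point are placeholders for your own setup:

    # mount CephFS through FUSE instead of the kernel client
    ceph-fuse --id admin -m mon1:6789 /mnt/cephfs-fuse
    # unmount it again with
    fusermount -u /mnt/cephfs-fuse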

Paul

On Tue, 2 Oct 2018 at 10:45, Jaime Ibar wrote:
>
> Hi Paul,
>
> we're using the 4.4 kernel. Not sure if more recent kernels are stable
> for production services. In any case, as there are some production
> services running on those servers, rebooting wouldn't be an option
> if we can bring the ceph clients back without it.
>
> Thanks
>
> Jaime
>
>
> On 01/10/18 21:10, Paul Emmerich wrote:
> > Which kernel version are you using for the kernel cephfs clients?
> > I've seen this problem with "older" kernels (where old is as recent as 4.9)
> >
> > Paul
> > On Mon, 1 Oct 2018 at 18:35, Jaime Ibar wrote:
> >> Hi all,
> >>
> >> we're running a ceph 12.2.7 Luminous cluster. Two weeks ago we enabled
> >> multi mds and after a few hours
> >>
> >> these errors started showing up
> >>
> >> 2018-09-28 09:41:20.577350 mds.1 [WRN] slow request 64.421475 seconds
> >> old, received at 2018-09-28 09:40:16.155841:
> >> client_request(client.31059144:8544450 getattr Xs #0$
> >> 12e1e73 2018-09-28 09:40:16.147368 caller_uid=0, caller_gid=124{})
> >> currently failed to authpin local pins
> >>
> >> 2018-09-28 10:56:51.051100 mon.1 [WRN] Health check failed: 5 clients
> >> failing to respond to cache pressure (MDS_CLIENT_RECALL)
> >> 2018-09-28 10:57:08.000361 mds.1 [WRN] 3 slow requests, 1 included
> >> below; oldest blocked for > 4614.580689 secs
> >> 2018-09-28 10:57:08.000365 mds.1 [WRN] slow request 244.796854 seconds
> >> old, received at 2018-09-28 10:53:03.203476:
> >> client_request(client.31059144:9080057 lookup #0x100
> >> 000b7564/58 2018-09-28 10:53:03.197922 caller_uid=0, caller_gid=0{})
> >> currently initiated
> >> 2018-09-28 11:00:00.000105 mon.1 [WRN] overall HEALTH_WARN 1 clients
> >> failing to respond to capability release; 5 clients failing to respond
> >> to cache pressure; 1 MDSs report slow requests,
> >>
> >> Due to this, we decided to go back to a single mds (as it worked before);
> >> however, the clients pointing to mds.1 started hanging, while the
> >> ones pointing to mds.0 worked fine.
> >>
> >> Then we tried to enable multi mds again and the clients pointing to mds.1
> >> went back online, however the ones pointing to mds.0 stopped working.
> >>
> >> Today we tried to go back to a single mds, however this error was
> >> preventing ceph from disabling the second active mds (mds.1):
> >>
> >> 2018-10-01 14:33:48.358443 mds.1 [WRN] evicting unresponsive client
> >> X: (30108925), after 68213.084174 seconds
> >>
> >> After waiting for 3 hours, we restarted the mds.1 daemon (as it was
> >> stuck in the stopping state forever due to the above error), waited for
> >> it to become active again,
> >>
> >> unmounted the problematic clients, waited for the cluster to be healthy
> >> and tried to go back to a single mds again.
> >>
> >> Apparently this worked for some of the clients. We tried to enable
> >> multi mds again to bring the faulty clients back, however no luck this
> >> time
> >>
> >> and some of them are hanging and can't access the ceph fs.
> >>
> >> This is what we have in kern.log
> >>
> >> Oct  1 15:29:32 05 kernel: [2342847.017426] ceph: mds1 reconnect start
> >> Oct  1 15:29:32 05 kernel: [2342847.018677] ceph: mds1 reconnect success
> >> Oct  1 15:29:49 05 kernel: [2342864.651398] ceph: mds1 recovery completed
> >>
> >> Not sure what else we can try to bring the hanging clients back without
> >> rebooting, as they're in production and rebooting is not an option.
> >>
> >> Does anyone know how we can deal with this, please?
> >>
> >> Thanks
> >>
> >> Jaime
> >>
> >> --
> >>
> >> Jaime Ibar
> >> High Performance & Research Computing, IS Services
> >> Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
> >> http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
> >> Tel: +353-1-896-3725
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
>
> --
>
> Jaime Ibar
> High Performance & Research Computing, IS Services
> Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
> http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
> Tel: +353-1-896-3725
>


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs clients hanging multi mds to single mds

2018-10-02 Thread Jaime Ibar

Hi Paul,

we're using the 4.4 kernel. Not sure if more recent kernels are stable
for production services. In any case, as there are some production
services running on those servers, rebooting wouldn't be an option
if we can bring the ceph clients back without it.

Thanks

Jaime


On 01/10/18 21:10, Paul Emmerich wrote:

Which kernel version are you using for the kernel cephfs clients?
I've seen this problem with "older" kernels (where old is as recent as 4.9)

Paul
On Mon, 1 Oct 2018 at 18:35, Jaime Ibar wrote:

Hi all,

we're running a ceph 12.2.7 Luminous cluster. Two weeks ago we enabled
multi mds and after a few hours

these errors started showing up

2018-09-28 09:41:20.577350 mds.1 [WRN] slow request 64.421475 seconds
old, received at 2018-09-28 09:40:16.155841:
client_request(client.31059144:8544450 getattr Xs #0$
12e1e73 2018-09-28 09:40:16.147368 caller_uid=0, caller_gid=124{})
currently failed to authpin local pins

2018-09-28 10:56:51.051100 mon.1 [WRN] Health check failed: 5 clients
failing to respond to cache pressure (MDS_CLIENT_RECALL)
2018-09-28 10:57:08.000361 mds.1 [WRN] 3 slow requests, 1 included
below; oldest blocked for > 4614.580689 secs
2018-09-28 10:57:08.000365 mds.1 [WRN] slow request 244.796854 seconds
old, received at 2018-09-28 10:53:03.203476:
client_request(client.31059144:9080057 lookup #0x100
000b7564/58 2018-09-28 10:53:03.197922 caller_uid=0, caller_gid=0{})
currently initiated
2018-09-28 11:00:00.000105 mon.1 [WRN] overall HEALTH_WARN 1 clients
failing to respond to capability release; 5 clients failing to respond
to cache pressure; 1 MDSs report slow requests,

Due to this, we decided to go back to a single mds (as it worked before);
however, the clients pointing to mds.1 started hanging, while the
ones pointing to mds.0 worked fine.

Then we tried to enable multi mds again and the clients pointing to mds.1
went back online, however the ones pointing to mds.0 stopped working.

Today we tried to go back to a single mds, however this error was
preventing ceph from disabling the second active mds (mds.1):

2018-10-01 14:33:48.358443 mds.1 [WRN] evicting unresponsive client
X: (30108925), after 68213.084174 seconds

After waiting for 3 hours, we restarted the mds.1 daemon (as it was stuck
in the stopping state forever due to the above error), waited for it to
become active again,

unmounted the problematic clients, waited for the cluster to be healthy
and tried to go back to a single mds again.

Apparently this worked for some of the clients. We tried to enable
multi mds again to bring the faulty clients back, however no luck this
time

and some of them are hanging and can't access the ceph fs.

This is what we have in kern.log

Oct  1 15:29:32 05 kernel: [2342847.017426] ceph: mds1 reconnect start
Oct  1 15:29:32 05 kernel: [2342847.018677] ceph: mds1 reconnect success
Oct  1 15:29:49 05 kernel: [2342864.651398] ceph: mds1 recovery completed

Not sure what else we can try to bring the hanging clients back without
rebooting, as they're in production and rebooting is not an option.

Does anyone know how we can deal with this, please?

Thanks

Jaime

--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs clients hanging multi mds to single mds

2018-10-02 Thread Jaime Ibar

Hi,

there's only one entry in the blacklist, however it is a mon, not a cephfs
client, and no cephfs is mounted on that host.

We're using the kernel client, and the kernel version is 4.4 for both the
ceph services and the cephfs clients.


This is what we have in /sys/kernel/debug/ceph

cat mdsmap

epoch 59259
root 0
session_timeout 60
session_autoclose 300
    mds0    xxx:6800    (up:active)


cat mdsc

13049   mds0    getattr  #1506e43
13051   (no request)    getattr  #150922b
13053   (no request)    getattr  #150922b
13055   (no request)    getattr  #150922b
13057   (no request)    getattr  #150922b
13058   (no request)    getattr  #150922b
13059   (no request)    getattr  #150922b
13063   mds0    lookup   #150922b/.cache (.cache)

[...]

cat mds_sessions

global_id 29669848
name "cephfs"
mds.0 opening
mds.1 restarting

And it is similar for the other clients.
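
Given the 'opening'/'restarting' session states above, it may also be worth
cross-checking what the MDS side reports for the same clients; a rough
sketch (the mds name is a placeholder):

    # sessions as seen by the MDS, including state and client addresses
    ceph daemon mds.<name> session ls
    # check whether any of those addresses show up in the blacklist
    ceph osd blacklist ls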

Thanks

Jaime


On 01/10/18 19:13, Burkhard Linke wrote:

Hi,


we also experience hanging clients after MDS restarts; in our case we
only use a single active MDS server, and the clients are actively
blacklisted by the MDS server after the restart. It usually happens if the
clients are not responsive during the MDS restart (e.g. being very busy).



You can check whether this is the case in your setup by inspecting the 
blacklist ('ceph osd blacklist ls'). It should print the connections 
which are currently blacklisted.



You can also remove entries ('ceph osd blacklist rm ...'), but be
warned that the mechanism is there for a reason. Removing a
blacklisted entry might result in file corruption if the client and the
MDS server disagree about the current state. Use at your own risk.



We also tried a multi-active setup after upgrading to Luminous, but we
ran into the same problem with the same error message. It was probably
due to old kernel clients, so in the case of kernel-based cephfs I would
recommend upgrading to the latest available kernel.



As another approach you can check the current state of the cephfs 
client, either by using the daemon socket in case of ceph-fuse, or the 
debug information in /sys/kernel/debug/ceph/... for the kernel client.


Regards,

Burkhard


On 01.10.2018 18:34, Jaime Ibar wrote:

Hi all,

we're running a ceph 12.2.7 Luminous cluster. Two weeks ago we
enabled multi mds and after a few hours


these errors started showing up

2018-09-28 09:41:20.577350 mds.1 [WRN] slow request 64.421475 seconds 
old, received at 2018-09-28 09:40:16.155841: 
client_request(client.31059144:8544450 getattr Xs #0$
12e1e73 2018-09-28 09:40:16.147368 caller_uid=0, 
caller_gid=124{}) currently failed to authpin local pins


2018-09-28 10:56:51.051100 mon.1 [WRN] Health check failed: 5 clients 
failing to respond to cache pressure (MDS_CLIENT_RECALL)
2018-09-28 10:57:08.000361 mds.1 [WRN] 3 slow requests, 1 included 
below; oldest blocked for > 4614.580689 secs
2018-09-28 10:57:08.000365 mds.1 [WRN] slow request 244.796854 
seconds old, received at 2018-09-28 10:53:03.203476: 
client_request(client.31059144:9080057 lookup #0x100
000b7564/58 2018-09-28 10:53:03.197922 caller_uid=0, caller_gid=0{}) 
currently initiated
2018-09-28 11:00:00.000105 mon.1 [WRN] overall HEALTH_WARN 1 clients 
failing to respond to capability release; 5 clients failing to 
respond to cache pressure; 1 MDSs report slow requests,


Due to this, we decided to go back to a single mds (as it worked before);
however, the clients pointing to mds.1 started hanging, while the
ones pointing to mds.0 worked fine.

Then we tried to enable multi mds again and the clients pointing to
mds.1 went back online, however the ones pointing to mds.0 stopped working.

Today we tried to go back to a single mds, however this error was
preventing ceph from disabling the second active mds (mds.1):


2018-10-01 14:33:48.358443 mds.1 [WRN] evicting unresponsive client 
X: (30108925), after 68213.084174 seconds


After waiting for 3 hours, we restarted the mds.1 daemon (as it was stuck
in the stopping state forever due to the above error), waited for it to
become active again,

unmounted the problematic clients, waited for the cluster to be healthy
and tried to go back to a single mds again.

Apparently this worked for some of the clients. We tried to enable
multi mds again to bring the faulty clients back, however no luck
this time

and some of them are hanging and can't access the ceph fs.

This is what we have in kern.log

Oct  1 15:29:32 05 kernel: [2342847.017426] ceph: mds1 reconnect start
Oct  1 15:29:32 05 kernel: [2342847.018677] ceph: mds1 reconnect success
Oct  1 15:29:49 05 kernel: [2342864.651398] ceph: mds1 recovery 
completed


Not sure what else we can try to bring the hanging clients back without
rebooting, as they're in production and rebooting is not an option.

Does anyone know how we can deal with this, please?

Thanks

Jaime



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] cephfs clients hanging multi mds to single mds

2018-10-01 Thread Paul Emmerich
Which kernel version are you using for the kernel cephfs clients?
I've seen this problem with "older" kernels (where old is as recent as 4.9)

Paul
On Mon, 1 Oct 2018 at 18:35, Jaime Ibar wrote:
>
> Hi all,
>
> we're running a ceph 12.2.7 Luminous cluster. Two weeks ago we enabled
> multi mds and after a few hours
>
> these errors started showing up
>
> 2018-09-28 09:41:20.577350 mds.1 [WRN] slow request 64.421475 seconds
> old, received at 2018-09-28 09:40:16.155841:
> client_request(client.31059144:8544450 getattr Xs #0$
> 12e1e73 2018-09-28 09:40:16.147368 caller_uid=0, caller_gid=124{})
> currently failed to authpin local pins
>
> 2018-09-28 10:56:51.051100 mon.1 [WRN] Health check failed: 5 clients
> failing to respond to cache pressure (MDS_CLIENT_RECALL)
> 2018-09-28 10:57:08.000361 mds.1 [WRN] 3 slow requests, 1 included
> below; oldest blocked for > 4614.580689 secs
> 2018-09-28 10:57:08.000365 mds.1 [WRN] slow request 244.796854 seconds
> old, received at 2018-09-28 10:53:03.203476:
> client_request(client.31059144:9080057 lookup #0x100
> 000b7564/58 2018-09-28 10:53:03.197922 caller_uid=0, caller_gid=0{})
> currently initiated
> 2018-09-28 11:00:00.000105 mon.1 [WRN] overall HEALTH_WARN 1 clients
> failing to respond to capability release; 5 clients failing to respond
> to cache pressure; 1 MDSs report slow requests,
>
> Due to this, we decided to go back to a single mds (as it worked before);
> however, the clients pointing to mds.1 started hanging, while the
> ones pointing to mds.0 worked fine.
>
> Then we tried to enable multi mds again and the clients pointing to mds.1
> went back online, however the ones pointing to mds.0 stopped working.
>
> Today we tried to go back to a single mds, however this error was
> preventing ceph from disabling the second active mds (mds.1):
>
> 2018-10-01 14:33:48.358443 mds.1 [WRN] evicting unresponsive client
> X: (30108925), after 68213.084174 seconds
>
> After waiting for 3 hours, we restarted the mds.1 daemon (as it was stuck
> in the stopping state forever due to the above error), waited for it to
> become active again,
>
> unmounted the problematic clients, waited for the cluster to be healthy
> and tried to go back to a single mds again.
>
> Apparently this worked for some of the clients. We tried to enable
> multi mds again to bring the faulty clients back, however no luck this
> time
>
> and some of them are hanging and can't access the ceph fs.
>
> This is what we have in kern.log
>
> Oct  1 15:29:32 05 kernel: [2342847.017426] ceph: mds1 reconnect start
> Oct  1 15:29:32 05 kernel: [2342847.018677] ceph: mds1 reconnect success
> Oct  1 15:29:49 05 kernel: [2342864.651398] ceph: mds1 recovery completed
>
> Not sure what else we can try to bring the hanging clients back without
> rebooting, as they're in production and rebooting is not an option.
>
> Does anyone know how we can deal with this, please?
>
> Thanks
>
> Jaime
>
> --
>
> Jaime Ibar
> High Performance & Research Computing, IS Services
> Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
> http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
> Tel: +353-1-896-3725
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs clients hanging multi mds to single mds

2018-10-01 Thread Burkhard Linke

Hi,


we also experience hanging clients after MDS restarts; in our case we
only use a single active MDS server, and the clients are actively
blacklisted by the MDS server after the restart. It usually happens if the
clients are not responsive during the MDS restart (e.g. being very busy).



You can check whether this is the case in your setup by inspecting the 
blacklist ('ceph osd blacklist ls'). It should print the connections 
which are currently blacklisted.



You can also remove entries ('ceph osd blacklist rm ...'), but be warned
that the mechanism is there for a reason. Removing a blacklisted entry
might result in file corruption if the client and the MDS server disagree
about the current state. Use at your own risk.
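
Concretely, that is (the address below is a made-up example; the real
entries come from the 'ls' output):

    # show which client addresses are currently blacklisted
    ceph osd blacklist ls
    # remove one entry (addr:port/nonce as printed above) -- at your own risk
    ceph osd blacklist rm 192.168.1.10:0/3418324321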



We also tried a multi-active setup after upgrading to Luminous, but we
ran into the same problem with the same error message. It was probably
due to old kernel clients, so in the case of kernel-based cephfs I would
recommend upgrading to the latest available kernel.



As another approach you can check the current state of the cephfs 
client, either by using the daemon socket in case of ceph-fuse, or the 
debug information in /sys/kernel/debug/ceph/... for the kernel client.
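
A few concrete places to look; the debugfs files exist on kernel clients,
while the admin socket path for ceph-fuse depends on your configuration,
so treat the path below as a placeholder:

    # kernel client: pending MDS requests and session state
    cat /sys/kernel/debug/ceph/*/mdsc
    cat /sys/kernel/debug/ceph/*/mds_sessions
    # ceph-fuse client: ask the admin socket what it supports, then e.g.
    ceph --admin-daemon /var/run/ceph/ceph-client.admin.asok help
    ceph --admin-daemon /var/run/ceph/ceph-client.admin.asok mds_sessions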


Regards,

Burkhard


On 01.10.2018 18:34, Jaime Ibar wrote:

Hi all,

we're running a ceph 12.2.7 Luminous cluster. Two weeks ago we enabled
multi mds and after a few hours


these errors started showing up

2018-09-28 09:41:20.577350 mds.1 [WRN] slow request 64.421475 seconds 
old, received at 2018-09-28 09:40:16.155841: 
client_request(client.31059144:8544450 getattr Xs #0$
12e1e73 2018-09-28 09:40:16.147368 caller_uid=0, caller_gid=124{}) 
currently failed to authpin local pins


2018-09-28 10:56:51.051100 mon.1 [WRN] Health check failed: 5 clients 
failing to respond to cache pressure (MDS_CLIENT_RECALL)
2018-09-28 10:57:08.000361 mds.1 [WRN] 3 slow requests, 1 included 
below; oldest blocked for > 4614.580689 secs
2018-09-28 10:57:08.000365 mds.1 [WRN] slow request 244.796854 seconds 
old, received at 2018-09-28 10:53:03.203476: 
client_request(client.31059144:9080057 lookup #0x100
000b7564/58 2018-09-28 10:53:03.197922 caller_uid=0, caller_gid=0{}) 
currently initiated
2018-09-28 11:00:00.000105 mon.1 [WRN] overall HEALTH_WARN 1 clients 
failing to respond to capability release; 5 clients failing to respond 
to cache pressure; 1 MDSs report slow requests,


Due to this, we decided to go back to a single mds (as it worked before);
however, the clients pointing to mds.1 started hanging, while the
ones pointing to mds.0 worked fine.

Then we tried to enable multi mds again and the clients pointing to
mds.1 went back online, however the ones pointing to mds.0 stopped working.

Today we tried to go back to a single mds, however this error was
preventing ceph from disabling the second active mds (mds.1):


2018-10-01 14:33:48.358443 mds.1 [WRN] evicting unresponsive client 
X: (30108925), after 68213.084174 seconds


After waiting for 3 hours, we restarted the mds.1 daemon (as it was stuck
in the stopping state forever due to the above error), waited for it to
become active again,

unmounted the problematic clients, waited for the cluster to be healthy
and tried to go back to a single mds again.

Apparently this worked for some of the clients. We tried to enable
multi mds again to bring the faulty clients back, however no luck
this time

and some of them are hanging and can't access the ceph fs.

This is what we have in kern.log

Oct  1 15:29:32 05 kernel: [2342847.017426] ceph: mds1 reconnect start
Oct  1 15:29:32 05 kernel: [2342847.018677] ceph: mds1 reconnect success
Oct  1 15:29:49 05 kernel: [2342864.651398] ceph: mds1 recovery completed

Not sure what else we can try to bring the hanging clients back without
rebooting, as they're in production and rebooting is not an option.

Does anyone know how we can deal with this, please?

Thanks

Jaime



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs clients hanging multi mds to single mds

2018-10-01 Thread Jaime Ibar

Hi all,

we're running a ceph 12.2.7 Luminous cluster. Two weeks ago we enabled
multi mds and after a few hours


these errors started showing up

2018-09-28 09:41:20.577350 mds.1 [WRN] slow request 64.421475 seconds 
old, received at 2018-09-28 09:40:16.155841: 
client_request(client.31059144:8544450 getattr Xs #0$
12e1e73 2018-09-28 09:40:16.147368 caller_uid=0, caller_gid=124{}) 
currently failed to authpin local pins


2018-09-28 10:56:51.051100 mon.1 [WRN] Health check failed: 5 clients 
failing to respond to cache pressure (MDS_CLIENT_RECALL)
2018-09-28 10:57:08.000361 mds.1 [WRN] 3 slow requests, 1 included 
below; oldest blocked for > 4614.580689 secs
2018-09-28 10:57:08.000365 mds.1 [WRN] slow request 244.796854 seconds 
old, received at 2018-09-28 10:53:03.203476: 
client_request(client.31059144:9080057 lookup #0x100
000b7564/58 2018-09-28 10:53:03.197922 caller_uid=0, caller_gid=0{}) 
currently initiated
2018-09-28 11:00:00.000105 mon.1 [WRN] overall HEALTH_WARN 1 clients 
failing to respond to capability release; 5 clients failing to respond 
to cache pressure; 1 MDSs report slow requests,
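
When warnings like these show up, the stuck operations can usually be
inspected on the MDS itself, for example (the mds name is a placeholder):

    # summary of the health warnings, including the slow requests
    ceph health detail
    # on the MDS host: operations currently in flight, and the client sessions
    ceph daemon mds.<name> dump_ops_in_flight
    ceph daemon mds.<name> session ls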


Due to this, we decided to go back to a single mds (as it worked before);
however, the clients pointing to mds.1 started hanging, while the
ones pointing to mds.0 worked fine.

Then we tried to enable multi mds again and the clients pointing to mds.1
went back online, however the ones pointing to mds.0 stopped working.

Today we tried to go back to a single mds, however this error was
preventing ceph from disabling the second active mds (mds.1):


2018-10-01 14:33:48.358443 mds.1 [WRN] evicting unresponsive client 
X: (30108925), after 68213.084174 seconds
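
For context, switching between one and two active MDS daemons and evicting
a stuck client look roughly like this on Luminous; the filesystem name
'cephfs' is a placeholder, and the client id is the one from the log line
above:

    # raise / lower the number of active MDS ranks
    ceph fs set cephfs max_mds 2
    ceph fs set cephfs max_mds 1
    # on Luminous, rank 1 also has to be deactivated explicitly; it stays
    # in the 'stopping' state until it can hand its work back to rank 0
    ceph mds deactivate cephfs:1
    # list client sessions on an MDS and evict one by id
    ceph tell mds.1 client ls
    ceph tell mds.1 client evict id=30108925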


After waiting for 3 hours, we restarted the mds.1 daemon (as it was stuck
in the stopping state forever due to the above error), waited for it to
become active again,

unmounted the problematic clients, waited for the cluster to be healthy
and tried to go back to a single mds again.

Apparently this worked for some of the clients. We tried to enable
multi mds again to bring the faulty clients back, however no luck this
time

and some of them are hanging and can't access the ceph fs.

This is what we have in kern.log

Oct  1 15:29:32 05 kernel: [2342847.017426] ceph: mds1 reconnect start
Oct  1 15:29:32 05 kernel: [2342847.018677] ceph: mds1 reconnect success
Oct  1 15:29:49 05 kernel: [2342864.651398] ceph: mds1 recovery completed

Not sure what else we can try to bring the hanging clients back without
rebooting, as they're in production and rebooting is not an option.

Does anyone know how we can deal with this, please?

Thanks

Jaime

--

Jaime Ibar
High Performance & Research Computing, IS Services
Lloyd Building, Trinity College Dublin, Dublin 2, Ireland.
http://www.tchpc.tcd.ie/ | ja...@tchpc.tcd.ie
Tel: +353-1-896-3725

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com