Re: [ceph-users] Help needed ! cluster unstable after upgrade from Hammer to Jewel
Hi,

On 16.11.2016 19:01, Vincent Godin wrote:
> Hello,
>
> We now have a full cluster (Mon, OSD & Clients) in jewel 10.2.2
> (initial was hammer 0.94.5) but we still have some big problems in our
> production environment:
>
> * some ceph filesystems are not mounted at startup and we have to
>   mount them with "/bin/sh -c 'flock /var/lock/ceph-disk
>   /usr/sbin/ceph-disk --verbose --log-stdout trigger --syn /dev/vdX1'"

vdX1?? This sounds like you are running Ceph inside a virtualized system?

Udo

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
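To answer Udo's question quickly, a minimal check (an assumption about the environment, not something stated in the thread) is to look for the "hypervisor" CPU flag, which most hypervisors expose to Linux guests:

```shell
#!/bin/sh
# Quick check for Udo's question: is this host a VM?
# Most hypervisors set the "hypervisor" CPU flag in /proc/cpuinfo.
if grep -q hypervisor /proc/cpuinfo; then
    echo "virtual machine"
else
    echo "bare metal"
fi
```

On CentOS 7 hosts, `systemd-detect-virt` is also available and additionally names the hypervisor (kvm, vmware, ...), printing "none" on bare metal.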
Re: [ceph-users] Help needed ! cluster unstable after upgrade from Hammer to Jewel
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Vincent Godin
Sent: 16 November 2016 18:02
To: ceph-users
Subject: [ceph-users] Help needed ! cluster unstable after upgrade from Hammer to Jewel

Hello,

We now have a full cluster (Mon, OSD & Clients) in jewel 10.2.2 (initial was hammer 0.94.5), but we still have some big problems in our production environment:

* some ceph filesystems are not mounted at startup, and we have to mount them with "/bin/sh -c 'flock /var/lock/ceph-disk /usr/sbin/ceph-disk --verbose --log-stdout trigger --syn /dev/vdX1'"
* some OSDs start, but stay in timeout for quite a long time (more than 5 min) as soon as they start:

  2016-11-15 01:46:26.625945 7f79db91e800  0 osd.32 191438 done with init, starting boot process
  2016-11-15 01:47:28.344996 7f79d61f7700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f79c5c91700' had timed out after 60
  2016-11-15 01:47:33.345098 7f79d61f7700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f79c5c91700' had timed out after 60
  ...

* these OSDs take a very long time to stop
* we just lost one OSD and the cluster is unable to stabilize: some OSDs go up and down, the cluster is in ERR state and cannot serve the production environment
* we are on jewel 10.2.2 on CentOS 7.2, kernel 3.10.0-327.36.3.el7.x86_64

Help will be appreciated!

Vincent

Can you see anything that might indicate why the OSDs are taking a long time to start up, i.e. any errors in the kernel log, or do the disks look like they are working very hard when the OSD tries to start? Also, a quick google of "heartbeat_map is_healthy 'FileStore::op_tp thread'" brings up several past threads; it might be worth seeing if any of them had a solution.
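For what it's worth, the "had timed out after 60" lines match the default `filestore_op_thread_timeout` of 60 seconds, and a common stopgap in similar reports (an assumption on my part, not something confirmed in this thread) is to temporarily raise the OSD thread timeouts and heartbeat grace in ceph.conf while the FileStore works through its post-upgrade backlog, for example:

```ini
[osd]
# Values below are illustrative, not recommendations; tune to how long
# your disks actually need, and revert once the cluster is stable.
filestore_op_thread_timeout = 180
filestore_op_thread_suicide_timeout = 600
osd_heartbeat_grace = 60
```

Restart the affected OSDs after changing these. This only buys time for slow threads; it does not fix whatever is making the disks slow in the first place.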
[ceph-users] Help needed ! cluster unstable after upgrade from Hammer to Jewel
Hello,

We now have a full cluster (Mon, OSD & Clients) in jewel 10.2.2 (initial was hammer 0.94.5), but we still have some big problems in our production environment:

- some ceph filesystems are not mounted at startup, and we have to mount them with "/bin/sh -c 'flock /var/lock/ceph-disk /usr/sbin/ceph-disk --verbose --log-stdout trigger --syn /dev/vdX1'"
- some OSDs start, but stay in timeout for quite a long time (more than 5 min) as soon as they start:

  2016-11-15 01:46:26.625945 7f79db91e800  0 osd.32 191438 done with init, starting boot process
  2016-11-15 01:47:28.344996 7f79d61f7700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f79c5c91700' had timed out after 60
  2016-11-15 01:47:33.345098 7f79d61f7700  1 heartbeat_map is_healthy 'FileStore::op_tp thread 0x7f79c5c91700' had timed out after 60
  ...

- these OSDs take a very long time to stop
- we just lost one OSD and the cluster is unable to stabilize: some OSDs go up and down, the cluster is in ERR state and cannot serve the production environment
- we are on jewel 10.2.2 on CentOS 7.2, kernel 3.10.0-327.36.3.el7.x86_64

Help will be appreciated!

Vincent
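The manual mount workaround above can be sketched as a small wrapper script. This is only an illustration of the command from the post: the device list is a placeholder for your real data partitions, and `--sync` is my assumption for the post's truncated `--syn` (ceph-disk's trigger subcommand has a `--sync` flag in Jewel):

```shell
#!/bin/sh
# Sketch of the manual workaround from the post: serialize ceph-disk
# activation with flock so concurrent udev events don't race on the
# same lock. Device names below are placeholders -- substitute your
# actual OSD data partitions.
trigger_cmd() {
    # Build the command line used in the thread for one partition.
    printf 'flock /var/lock/ceph-disk /usr/sbin/ceph-disk --verbose --log-stdout trigger --sync %s' "$1"
}

for dev in /dev/vdb1 /dev/vdc1; do
    # Print the command instead of running it; drop "echo" on a real host.
    echo "$(trigger_cmd "$dev")"
done
```

The flock serialization here is the same trick ceph-disk's own udev activation uses; having to run it by hand suggests the udev trigger is not firing (or racing) at boot.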