Re: [ceph-users] Power failure recovery woes (fwd)
Should I infer from the silence that there is no way to recover from the
"FAILED assert(last_e.version.version < e.version.version)" errors?

Thanks,
    Jeff

----- Forwarded message from Jeff <j...@usedmoviefinder.com> -----

Date: Tue, 17 Feb 2015 09:16:33 -0500
From: Jeff <j...@usedmoviefinder.com>
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Power failure recovery woes

Some additional information/questions:

Here is the output of "ceph osd tree". Some of the "down" OSDs are actually
running, but are marked down. For example, osd.1:

    root 30158  8.6 12.7 1542860 781288 ?  Ssl  07:47  4:40 /usr/bin/ceph-osd --cluster=ceph -i 0 -f

Is there any way to get the cluster to recognize them as being up? osd.1 has
the "FAILED assert(last_e.version.version < e.version.version)" errors.

Thanks,
    Jeff

    # id    weight  type name       up/down reweight
    -1      10.22   root default
    -2      2.72            host ceph1
    0       0.91                    osd.0   up      1
    1       0.91                    osd.1   down    0
    2       0.9                     osd.2   down    0
    -3      1.82            host ceph2
    3       0.91                    osd.3   down    0
    4       0.91                    osd.4   down    0
    -4      2.04            host ceph3
    5       0.68                    osd.5   up      1
    6       0.68                    osd.6   up      1
    7       0.68                    osd.7   up      1
    8       0.68                    osd.8   down    0
    -5      1.82            host ceph4
    9       0.91                    osd.9   up      1
    10      0.91                    osd.10  down    0
    -6      1.82            host ceph5
    11      0.91                    osd.11  up      1
    12      0.91                    osd.12  up      1

On 2/17/2015 8:28 AM, Jeff wrote:
> -------- Original Message --------
> Subject: Re: [ceph-users] Power failure recovery woes
> Date: 2015-02-17 04:23
> From: Udo Lembke <ulem...@polarzone.de>
> To: Jeff <j...@usedmoviefinder.com>, ceph-users@lists.ceph.com
>
> Hi Jeff,
> is the OSD /var/lib/ceph/osd/ceph-2 mounted?
> If not, does it help if you mount the OSD and start it with
> "service ceph start osd.2"?
>
> Udo
>
> Am 17.02.2015 09:54, schrieb Jeff:
>> Hi,
>>
>> We had a nasty power failure yesterday and even with UPSs our small
>> (5 node, 12 OSD) cluster is having problems recovering.
>>
>> We are running ceph 0.87.
>>
>> 3 of our OSDs are down consistently (others stop and are restartable,
>> but our cluster is so slow that almost everything we do times out).
>>
>> We are seeing errors like this on the OSDs that never run:
>>
>>     ERROR: error converting store /var/lib/ceph/osd/ceph-2: (1) Operation not permitted
>>
>> We are seeing errors like these on the OSDs that run some of the time:
>>
>>     osd/PGLog.cc: 844: FAILED assert(last_e.version.version < e.version.version)
>>     common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
>>
>> Does anyone have any suggestions on how to recover our cluster?
>>
>> Thanks!
>>     Jeff

----- End forwarded message -----

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
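[A side note on the "running but down" daemons: the ps line quoted above was started with "-i 0", i.e. it is osd.0's process, so it may be worth confirming exactly which OSD ids actually have a live daemon before assuming osd.1 is one of them. A hedged sketch, assuming the sysvinit-era deployment used in this thread:]

```shell
# List the OSD ids that currently have a running ceph-osd process, by
# extracting the "-i <id>" argument from the process table. Compare this
# list against the up/down column of "ceph osd tree".
ps -ef | grep '[c]eph-osd' | sed -n 's/.*-i \([0-9][0-9]*\).*/\1/p' | sort -n
```

[The bracketed grep pattern keeps grep itself out of the results; the sed expression prints only the numeric id following "-i".]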
Re: [ceph-users] Power failure recovery woes (fwd)
You can try searching the archives and tracker.ceph.com for hints about
repairing these issues, but your disk stores have definitely been corrupted
and it's likely to be an adventure. I'd recommend examining your local
storage stack underneath Ceph and figuring out which part was ignoring
barriers.
-Greg

On Fri, Feb 20, 2015 at 10:39 AM, Jeff <j...@usedmoviefinder.com> wrote:
> Should I infer from the silence that there is no way to recover from the
> "FAILED assert(last_e.version.version < e.version.version)" errors?
>
> Thanks,
>     Jeff
>
> [quoted forwarded message trimmed; identical to the message above]
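[Greg's "figure out which part was ignoring barriers" check can be started from the command line. A hedged sketch; the device and mount names are illustrative examples, not taken from this thread:]

```shell
# 1. Look for filesystems mounted with write barriers disabled
#    (ext3/ext4 "barrier=0", xfs "nobarrier") -- any hit is suspect
#    after power-loss corruption.
grep -E 'nobarrier|barrier=0' /proc/mounts

# 2. Check whether the disk's volatile write cache is enabled.
#    "write-caching = 1" without a battery/flash-backed controller
#    means acknowledged writes can be lost on power failure.
#    (Needs root; adjust the device name for your system.)
hdparm -W /dev/sda 2>/dev/null || true
```

[RAID controllers sitting between the kernel and the disks are the other usual suspect; their cache policy is queried with the vendor's own tool rather than hdparm.]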
Re: [ceph-users] Power failure recovery woes
Hi Jeff,

What type/model drives are you using as OSDs? Any journals? If so, what
model? What does your ceph.conf look like? What sort of load is on the
cluster (if it's still online)? What distro/version? Firewall rules set
properly?

Michal Kozanecki

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jeff
Sent: February-17-15 9:17 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Power failure recovery woes

[quoted message trimmed; identical to Jeff's "additional
information/questions" message above]
Re: [ceph-users] Power failure recovery woes
Udo,

Yes, the OSD is mounted:

    /dev/sda4  963605972 260295676 703310296  28% /var/lib/ceph/osd/ceph-2

Thanks,
    Jeff

-------- Original Message --------
Subject: Re: [ceph-users] Power failure recovery woes
Date: 2015-02-17 04:23
From: Udo Lembke <ulem...@polarzone.de>
To: Jeff <j...@usedmoviefinder.com>, ceph-users@lists.ceph.com

> Hi Jeff,
> is the OSD /var/lib/ceph/osd/ceph-2 mounted?
> If not, does it help if you mount the OSD and start it with
> "service ceph start osd.2"?
>
> Udo
>
> [original message quoted in full; trimmed]
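[Udo's check can be scripted so it distinguishes a real mount from an empty directory on the root filesystem, which would also explain the "error converting store" failures. A minimal sketch, using the sysvinit service syntax from elsewhere in this thread:]

```shell
# Only try to start the daemon if its data directory is an actual
# mount point; starting an OSD against an empty directory makes
# things worse, not better.
if mountpoint -q /var/lib/ceph/osd/ceph-2; then
    service ceph start osd.2
else
    echo "ceph-2 is not mounted -- mount the OSD partition first"
fi
```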
[ceph-users] Power failure recovery woes
Hi,

We had a nasty power failure yesterday and even with UPSs our small
(5 node, 12 OSD) cluster is having problems recovering.

We are running ceph 0.87.

3 of our OSDs are down consistently (others stop and are restartable, but
our cluster is so slow that almost everything we do times out).

We are seeing errors like this on the OSDs that never run:

    ERROR: error converting store /var/lib/ceph/osd/ceph-2: (1) Operation not permitted

We are seeing errors like these on the OSDs that run some of the time:

    osd/PGLog.cc: 844: FAILED assert(last_e.version.version < e.version.version)
    common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

Does anyone have any suggestions on how to recover our cluster?

Thanks!
    Jeff
Re: [ceph-users] Power failure recovery woes
Hi Jeff,

is the OSD /var/lib/ceph/osd/ceph-2 mounted?

If not, does it help if you mount the OSD and start it with
"service ceph start osd.2"?

Udo

Am 17.02.2015 09:54, schrieb Jeff:
> [original message quoted in full; trimmed -- identical to the post above]