Re: [ceph-users] Power failure recovery woes (fwd)

2015-02-20 Thread Jeff
Should I infer from the silence that there is no way to recover from the

FAILED assert(last_e.version.version < e.version.version) errors?

Thanks,
Jeff

- Forwarded message from Jeff j...@usedmoviefinder.com -

Date: Tue, 17 Feb 2015 09:16:33 -0500
From: Jeff j...@usedmoviefinder.com
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Power failure recovery woes

Some additional information/questions:

Here is the output of ceph osd tree:

Some of the down OSDs are actually running, but are marked down. For example
osd.1:

root 30158  8.6 12.7 1542860 781288 ?  Ssl 07:47   4:40
/usr/bin/ceph-osd --cluster=ceph -i 0 -f

 Is there any way to get the cluster to recognize them as being up?  osd.1 has
the FAILED assert(last_e.version.version < e.version.version) errors.

Thanks,
 Jeff


# id    weight  type name       up/down reweight
-1      10.22   root default
-2      2.72            host ceph1
0       0.91                    osd.0   up      1
1       0.91                    osd.1   down    0
2       0.9                     osd.2   down    0
-3      1.82            host ceph2
3       0.91                    osd.3   down    0
4       0.91                    osd.4   down    0
-4      2.04            host ceph3
5       0.68                    osd.5   up      1
6       0.68                    osd.6   up      1
7       0.68                    osd.7   up      1
8       0.68                    osd.8   down    0
-5      1.82            host ceph4
9       0.91                    osd.9   up      1
10      0.91                    osd.10  down    0
-6      1.82            host ceph5
11      0.91                    osd.11  up      1
12      0.91                    osd.12  up      1
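
For reference, a few commands that can help show why an OSD whose process
shows up in ps is still marked down. These are illustrative only; the paths
assume Ceph's default locations and osd.1 is just the example from above.

    # What the monitors currently think of osd.1
    ceph osd dump | grep '^osd.1 '

    # Ask the daemon directly over its admin socket (default path shown)
    ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok version

    # Look for the assert backtrace in the daemon's log
    tail -n 200 /var/log/ceph/ceph-osd.1.log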

On 2/17/2015 8:28 AM, Jeff wrote:
 
 
  Original Message 
 Subject: Re: [ceph-users] Power failure recovery woes
 Date: 2015-02-17 04:23
 From: Udo Lembke ulem...@polarzone.de
 To: Jeff j...@usedmoviefinder.com, ceph-users@lists.ceph.com
 
 Hi Jeff,
 is the osd /var/lib/ceph/osd/ceph-2 mounted?
 
 If not, does it help if you mount the osd and start it with
 service ceph start osd.2
 ??
 
 Udo
 
 On 17.02.2015 09:54, Jeff wrote:
 Hi,
 
 We had a nasty power failure yesterday and even with UPSes, our small (5
 node, 12 OSD) cluster is having problems recovering.
 
 We are running ceph 0.87
 
 3 of our OSDs are down consistently (others stop and are restartable,
 but our cluster is so slow that almost everything we do times out).
 
 We are seeing errors like this on the OSDs that never run:
 
 ERROR: error converting store /var/lib/ceph/osd/ceph-2: (1)
 Operation not permitted
 
 We are seeing errors like these on the OSDs that run some of the time:
 
 osd/PGLog.cc: 844: FAILED assert(last_e.version.version < e.version.version)
 common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
 
 Does anyone have any suggestions on how to recover our cluster?
 
 Thanks!
   Jeff
 
 

- End forwarded message -



Re: [ceph-users] Power failure recovery woes (fwd)

2015-02-20 Thread Gregory Farnum
You can try searching the archives and tracker.ceph.com for hints
about repairing these issues, but your disk stores have definitely
been corrupted and it's likely to be an adventure. I'd recommend
examining your local storage stack underneath Ceph and figuring out
which part was ignoring barriers.
-Greg
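
To make Greg's suggestion concrete, this is roughly what checking for ignored
barriers might look like on an OSD node. The device name and mount point are
examples only, and hardware RAID controllers need their own vendor tools
rather than hdparm.

    # Is the drive's volatile write cache enabled?
    hdparm -W /dev/sdb

    # Were the OSD filesystems mounted with barriers disabled?
    mount | grep '/var/lib/ceph/osd'
    grep nobarrier /proc/mounts

    # If nothing above the drive can guarantee flushes, consider turning the
    # volatile cache off (test the performance impact first):
    # hdparm -W 0 /dev/sdb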

On Fri, Feb 20, 2015 at 10:39 AM, Jeff j...@usedmoviefinder.com wrote:
 Should I infer from the silence that there is no way to recover from the

 FAILED assert(last_e.version.version < e.version.version) errors?

 Thanks,
 Jeff



Re: [ceph-users] Power failure recovery woes

2015-02-17 Thread Michal Kozanecki
Hi Jeff,

What type/model of drives are you using as OSDs? Any journals? If so, what model?
What does your ceph.conf look like? What sort of load is on the cluster (if
it's still online)? What distro/version? Firewall rules set properly?

Michal Kozanecki
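
For anyone collecting the same details, the questions above can mostly be
answered with something like the following. Commands are illustrative; adjust
for your distro and firewall tooling.

    # Drive models, sizes, and whether they are rotational (ROTA=1) or SSD (ROTA=0)
    lsblk -o NAME,MODEL,SIZE,ROTA

    # Cluster config, health, and current load
    cat /etc/ceph/ceph.conf
    ceph -s
    ceph osd pool stats

    # Distro/version and firewall rules
    cat /etc/os-release
    iptables -L -n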




Re: [ceph-users] Power failure recovery woes

2015-02-17 Thread Jeff

Udo,

Yes, the osd is mounted:  /dev/sda4  963605972 260295676 703310296  
28% /var/lib/ceph/osd/ceph-2


Thanks,
Jeff

 Original Message 
Subject: Re: [ceph-users] Power failure recovery woes
Date: 2015-02-17 04:23
From: Udo Lembke ulem...@polarzone.de
To: Jeff j...@usedmoviefinder.com, ceph-users@lists.ceph.com

Hi Jeff,
is the osd /var/lib/ceph/osd/ceph-2 mounted?

If not, does it help if you mount the osd and start it with
service ceph start osd.2
??

Udo





[ceph-users] Power failure recovery woes

2015-02-17 Thread Jeff

Hi,

We had a nasty power failure yesterday and even with UPSes, our small (5
node, 12 OSD) cluster is having problems recovering.


We are running ceph 0.87

3 of our OSDs are down consistently (others stop and are restartable,
but our cluster is so slow that almost everything we do times out).


We are seeing errors like this on the OSDs that never run:

ERROR: error converting store /var/lib/ceph/osd/ceph-2: (1) 
Operation not permitted


We are seeing errors like these on the OSDs that run some of the time:

osd/PGLog.cc: 844: FAILED assert(last_e.version.version < e.version.version)

common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")

Does anyone have any suggestions on how to recover our cluster?

Thanks!
  Jeff
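
One way to get more detail on failures like these before assuming the stores
are unrecoverable is to run a failing OSD in the foreground with verbose
logging. The osd id, paths, and debug levels below are examples only.

    # Run osd.2 in the foreground and capture the full assert backtrace
    ceph-osd -i 2 -f --debug-osd 20 --debug-filestore 20

    # For the EPERM on "error converting store": is the data dir present,
    # readable, and mounted read-write?
    ls -ld /var/lib/ceph/osd/ceph-2
    mount | grep ceph-2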




Re: [ceph-users] Power failure recovery woes

2015-02-17 Thread Udo Lembke
Hi Jeff,
is the osd /var/lib/ceph/osd/ceph-2 mounted?

If not, does it help if you mount the osd and start it with
service ceph start osd.2
??

Udo
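
Concretely, Udo's check might look like the following; the device name is
only an example (Jeff later confirmed /dev/sda4 on this cluster).

    # Is the OSD data directory actually a mounted filesystem?
    mountpoint /var/lib/ceph/osd/ceph-2

    # If not, mount it and then try starting the OSD
    mount /dev/sda4 /var/lib/ceph/osd/ceph-2
    service ceph start osd.2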

