Re: [ceph-users] Monitor failure after series of traumatic network failures

2015-03-24 Thread Greg Chavez
This was excellent advice. It should be on some official Ceph
troubleshooting page. It takes a while for the monitors to deal with new
info, but it works.

Thanks again!
--Greg

On Wed, Mar 18, 2015 at 5:24 PM, Sage Weil s...@newdream.net wrote:

 On Wed, 18 Mar 2015, Greg Chavez wrote:
  We have a cuttlefish (0.61.9) 192-OSD cluster that has lost network
  availability several times since this past Thursday and whose nodes were
  all rebooted twice (hastily and inadvisably each time). The final reboot,
  which was supposed to be the last thing before recovery according to our
  data center team, resulted in a failure of the cluster's 4 monitors. This
  happened yesterday afternoon.

  [ By the way, we use Ceph to back Cinder and Glance in our OpenStack
  Cloud, block storage only; also these network problems were the result of
  our data center team executing maintenance on our switches that was
  supposed to be quick and painless ]

  After working all day on various troubleshooting techniques found here
  and there, we have this situation on our monitor nodes (debug 20):

  node-10: dead. ceph-mon will not start

  node-14: Seemed to rebuild its monmap. The log has stopped reporting with
  this final tail -100: http://pastebin.com/tLiq2ewV

  node-16: Same as 14, similar outcome in the
  log: http://pastebin.com/W87eT7Mw

  node-15: ceph-mon starts but even at debug 20, it will only output this
  line, over and over again:

      2015-03-18 14:54:35.859511 7f8c82ad3700 -1 asok(0x2e560e0) AdminSocket: request 'mon_status' not defined

  node-02: I added this guy to replace node-10. I updated ceph.conf and
  pushed it to all the monitor nodes (the osd nodes without monitors did
  not get the config push). Since he's a new guy the log output is
  obviously different, but again, here are the last 50 lines:
  http://pastebin.com/pfixdD3d

  I run my ceph client from my OpenStack controller. All ceph -s shows me
  is faults, albeit only to node-15

  2015-03-18 16:47:27.145194 7ff762cff700  0 -- 192.168.241.100:0/15112 >>
  192.168.241.115:6789/0 pipe(0x7ff75000cf00 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault

  Finally, here is our ceph.conf: http://pastebin.com/Gmiq2V8S

  So that's where we stand. Did we kill our Ceph Cluster (and thus our
  OpenStack Cloud)?

 Unlikely!  You have 5 copies, and I doubt all of them are unrecoverable.

  Or is there hope? Any suggestions would be greatly
  appreciated.

 Stop all mons.

 Make a backup copy of each mon data dir.

 Copy the node-14 data dir over those of node-15 and/or node-10 and/or
 node-02.

 Start all mons, see if they form a quorum.

 Once things are working again, at the *very* least upgrade to dumpling,
 and preferably then upgrade to firefly!!  Cuttlefish was EOL more than a
 year ago, and dumpling is EOL in a couple months.

 sage


Re: [ceph-users] Monitor failure after series of traumatic network failures

2015-03-18 Thread Sage Weil
On Wed, 18 Mar 2015, Greg Chavez wrote:
 We have a cuttlefish (0.61.9) 192-OSD cluster that has lost network
 availability several times since this past Thursday and whose nodes were all
 rebooted twice (hastily and inadvisably each time). The final reboot, which
 was supposed to be the last thing before recovery according to our data
 center team, resulted in a failure of the cluster's 4 monitors. This
 happened yesterday afternoon.
 
 [ By the way, we use Ceph to back Cinder and Glance in our OpenStack Cloud,
 block storage only; also these network problems were the result of our data
 center team executing maintenance on our switches that was supposed to be
 quick and painless ]
 
 After working all day on various troubleshooting techniques found here and
 there, we have this situation on our monitor nodes (debug 20):
 
 
 node-10: dead. ceph-mon will not start
 
 node-14: Seemed to rebuild its monmap. The log has stopped reporting with
 this final tail -100: http://pastebin.com/tLiq2ewV
 
 node-16: Same as 14, similar outcome in the
 log: http://pastebin.com/W87eT7Mw
 
 node-15: ceph-mon starts but even at debug 20, it will only output this line,
 over and over again:
 
     2015-03-18 14:54:35.859511 7f8c82ad3700 -1 asok(0x2e560e0) AdminSocket: request 'mon_status' not defined
 node-02: I added this guy to replace node-10. I updated ceph.conf and pushed
 it to all the monitor nodes (the osd nodes without monitors did not get the
 config push). Since he's a new guy the log output is obviously different, but
 again, here are the last 50 lines: http://pastebin.com/pfixdD3d
 
 
 I run my ceph client from my OpenStack controller. All ceph -s shows me is
 faults, albeit only to node-15
 
 2015-03-18 16:47:27.145194 7ff762cff700  0 -- 192.168.241.100:0/15112 >>
 192.168.241.115:6789/0 pipe(0x7ff75000cf00 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
 
 
 Finally, here is our ceph.conf: http://pastebin.com/Gmiq2V8S
 
 So that's where we stand. Did we kill our Ceph Cluster (and thus our
 OpenStack Cloud)?

Unlikely!  You have 5 copies, and I doubt all of them are unrecoverable.

 Or is there hope? Any suggestions would be greatly
 appreciated.

Stop all mons.

Make a backup copy of each mon data dir.

Copy the node-14 data dir over those of node-15 and/or node-10 and/or
node-02.

Start all mons, see if they form a quorum.
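
Roughly, as a sketch -- this assumes the default mon data dir layout
(/var/lib/ceph/mon/ceph-<id>) and the sysvinit 'ceph' service script; the
host names and paths below are placeholders, so adjust them to your
deployment:

    # 1. stop every mon (run on each monitor node)
    service ceph stop mon

    # 2. back up every mon data dir before touching anything (repeat per node)
    cp -a /var/lib/ceph/mon/ceph-node-15 /root/mon-node-15.bak

    # 3. overwrite a broken mon's data dir with node-14's copy, e.g. on node-15
    rsync -a --delete node-14:/var/lib/ceph/mon/ceph-node-14/ \
        /var/lib/ceph/mon/ceph-node-15/

    # 4. start the mons again and see whether they form a quorum
    service ceph start mon
    ceph quorum_status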

Once things are working again, at the *very* least upgrade to dumpling, 
and preferably then upgrade to firefly!!  Cuttlefish was EOL more than a 
year ago, and dumpling is EOL in a couple months.

sage


[ceph-users] Monitor failure after series of traumatic network failures

2015-03-18 Thread Greg Chavez
We have a cuttlefish (0.61.9) 192-OSD cluster that has lost network
availability several times since this past Thursday and whose nodes were
all rebooted twice (hastily and inadvisably each time). The final reboot,
which was supposed to be the last thing before recovery according to our
data center team, resulted in a failure of the cluster's 4 monitors. This
happened yesterday afternoon.

[ By the way, we use Ceph to back Cinder and Glance in our OpenStack Cloud,
block storage only; also these network problems were the result of our data
center team executing maintenance on our switches that was supposed to be
quick and painless ]

After working all day on various troubleshooting techniques found here and
there, we have this situation on our monitor nodes (debug 20):


node-10: dead. ceph-mon will not start

node-14: Seemed to rebuild its monmap. The log has stopped reporting with
this final tail -100: http://pastebin.com/tLiq2ewV

node-16: Same as 14, similar outcome in the log:
http://pastebin.com/W87eT7Mw

node-15: ceph-mon starts but even at debug 20, it will only output this
line, over and over again:

    2015-03-18 14:54:35.859511 7f8c82ad3700 -1 asok(0x2e560e0) AdminSocket: request 'mon_status' not defined
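
(That message is the mon's admin socket rejecting a mon_status request it
has not registered yet. For reference, the query that normally works against
a healthy mon looks like the lines below; the .asok path is just the default,
so check the "admin socket" setting in ceph.conf if it has been overridden:)

    # ask the mon for its view of quorum via its local admin socket
    ceph --admin-daemon /var/run/ceph/ceph-mon.node-15.asok mon_status
    # list the commands the daemon has actually registered
    ceph --admin-daemon /var/run/ceph/ceph-mon.node-15.asok help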

node-02: I added this guy to replace node-10. I updated ceph.conf and
pushed it to all the monitor nodes (the osd nodes without monitors did not
get the config push). Since he's a new guy the log output is obviously
different, but again, here are the last 50 lines:
http://pastebin.com/pfixdD3d
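
(Side note for anyone comparing: the usual sequence for bootstrapping a
brand-new mon needs a reachable quorum to pull the keyring and monmap from,
and looks roughly like the lines below; the ids and paths are placeholders:)

    # fetch the shared mon keyring and the current monmap (needs quorum)
    ceph auth get mon. -o /tmp/mon.keyring
    ceph mon getmap -o /tmp/monmap
    # create the new mon's data dir and register it in the monmap
    mkdir -p /var/lib/ceph/mon/ceph-node-02
    ceph-mon --mkfs -i node-02 --monmap /tmp/monmap --keyring /tmp/mon.keyring
    ceph mon add node-02 <node-02-ip>:6789
    service ceph start mon.node-02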


I run my ceph client from my OpenStack controller. All ceph -s shows me is
faults, albeit only to node-15

2015-03-18 16:47:27.145194 7ff762cff700  0 -- 192.168.241.100:0/15112 >>
192.168.241.115:6789/0 pipe(0x7ff75000cf00 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault
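
(The fault just means the client cannot complete a session with the mon at
192.168.241.115:6789. Basic reachability checks from the controller would be
something like the following, with the other mon address left as a
placeholder:)

    # is the mon host reachable, and is anything listening on the mon port?
    ping -c 3 192.168.241.115
    nc -zv 192.168.241.115 6789
    # point the client at one specific mon to rule the others in or out
    ceph -s -m <other-mon-ip>:6789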


Finally, here is our ceph.conf: http://pastebin.com/Gmiq2V8S

So that's where we stand. Did we kill our Ceph Cluster (and thus our
OpenStack Cloud)? Or is there hope? Any suggestions would be greatly
appreciated.


-- 
\*..+.-
--Greg Chavez
+//..;};
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com