Re: [ceph-users] Monitor failure after series of traumatic network failures
This was excellent advice. It should be on some official Ceph troubleshooting page. It takes a while for the monitors to deal with the new info, but it works. Thanks again!

--Greg

On Wed, Mar 18, 2015 at 5:24 PM, Sage Weil s...@newdream.net wrote:
> [snip]
> Stop all mons. Make a backup copy of each mon data dir. Copy the node-14
> data dir over the node-15 and/or node-10 and/or node-02. Start all mons
> and see if they form a quorum.
>
> Once things are working again, at the *very* least upgrade to dumpling,
> and preferably then upgrade to firefly!! Cuttlefish was EOL more than a
> year ago, and dumpling is EOL in a couple of months.
>
> sage

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Monitor failure after series of traumatic network failures
On Wed, 18 Mar 2015, Greg Chavez wrote:
> We have a cuttlefish (0.61.9) 192-OSD cluster that has lost network
> availability several times since this past Thursday, and whose nodes were
> all rebooted twice (hastily and inadvisably each time). The final reboot
> resulted in a failure of the cluster's 4 monitors.
> [snip]
> So that's where we stand. Did we kill our Ceph Cluster (and thus our
> OpenStack Cloud)?

Unlikely! You have 5 copies, and I doubt all of them are unrecoverable.

> Or is there hope? Any suggestions would be greatly appreciated.

Stop all mons. Make a backup copy of each mon data dir. Copy the node-14 data dir over the node-15 and/or node-10 and/or node-02. Start all mons and see if they form a quorum.

Once things are working again, at the *very* least upgrade to dumpling, and preferably then upgrade to firefly!! Cuttlefish was EOL more than a year ago, and dumpling is EOL in a couple of months.

sage
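For reference, the steps above might be scripted roughly like this. It is only a sketch under assumptions not stated in the thread: the helper names `backup_mon_dirs`/`clone_mon_store` are made up for illustration, the `/var/lib/ceph/mon/ceph-<name>` layout is the usual default but may differ, the copies are shown as local `cp` while in this thread the mons live on separate hosts (there you would `rsync`/`scp` the store between machines), and every ceph-mon process must be stopped before any copying.

```shell
#!/bin/sh
# Sketch of the recovery procedure: back up every mon data dir, then
# overwrite the broken stores with a copy of the healthy one (node-14).
# Helper names and the /var/lib/ceph/mon/ceph-<name> layout are assumptions.
set -eu

# Copy each mon's data dir aside before touching anything.
backup_mon_dirs() {            # usage: backup_mon_dirs <data_dir> <mon>...
  data=$1; shift
  for m in "$@"; do
    cp -a "$data/ceph-$m" "$data/ceph-$m.bak"
  done
}

# Replace each broken mon's store with a copy of the healthy mon's store.
clone_mon_store() {            # usage: clone_mon_store <data_dir> <good> <bad>...
  data=$1; good=$2; shift 2
  for m in "$@"; do
    rm -rf "$data/ceph-$m"
    cp -a "$data/ceph-$good" "$data/ceph-$m"
  done
}

# Typical use, after stopping every ceph-mon:
#   backup_mon_dirs /var/lib/ceph/mon node-14 node-15 node-16 node-02
#   clone_mon_store /var/lib/ceph/mon node-14 node-15 node-02
# then restart the mons and check `ceph -s` for a quorum.
```

The point of the backup step is that the clone step is destructive: if the quorum still does not form, you can restore each `ceph-<name>.bak` and try a different donor store.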
[ceph-users] Monitor failure after series of traumatic network failures
We have a cuttlefish (0.61.9) 192-OSD cluster that has lost network availability several times since this past Thursday, and whose nodes were all rebooted twice (hastily and inadvisably each time). The final reboot, which was supposed to be the last step before recovery according to our data center team, instead resulted in a failure of the cluster's 4 monitors. This happened yesterday afternoon.

[By the way, we use Ceph to back Cinder and Glance in our OpenStack Cloud, block storage only. Also, these network problems were the result of our data center team executing maintenance on our switches that was supposed to be quick and painless.]

After working all day on various troubleshooting techniques found here and there, we have this situation on our monitor nodes (debug 20):

node-10: Dead. ceph-mon will not start.

node-14: Seemed to rebuild its monmap. The log has stopped reporting; here is its final tail -100: http://pastebin.com/tLiq2ewV

node-16: Same as node-14, with a similar outcome in the log: http://pastebin.com/W87eT7Mw

node-15: ceph-mon starts, but even at debug 20 it will only output this line, over and over again:

2015-03-18 14:54:35.859511 7f8c82ad3700 -1 asok(0x2e560e0) AdminSocket: request 'mon_status' not defined

node-02: I added this node to replace node-10. I updated ceph.conf and pushed it to all the monitor nodes (the OSD nodes without monitors did not get the config push). Since it's a new node the log output is obviously different, but again, here are the last 50 lines: http://pastebin.com/pfixdD3d

I run my ceph client from my OpenStack controller. All ceph -s shows me is faults, albeit only to node-15:

2015-03-18 16:47:27.145194 7ff762cff700 0 -- 192.168.241.100:0/15112 >> 192.168.241.115:6789/0 pipe(0x7ff75000cf00 sd=3 :0 s=1 pgs=0 cs=0 l=1).fault

Finally, here is our ceph.conf: http://pastebin.com/Gmiq2V8S

So that's where we stand. Did we kill our Ceph Cluster (and thus our OpenStack Cloud)? Or is there hope? Any suggestions would be greatly appreciated.
--Greg Chavez