After Action Report RE: How'd this for a bad day? AKA bad me
Now that the dust has settled, we know what happened. Our tech didn't completely disconnect the SAN connections (he unplugged them, but not far enough) when installing ESX v3.5 on a new physical host, and the install formatted a SAN drive instead of the local drive. If we had known this before powering off the VMs, we could have VMotioned them to the other SAN, but at the time we didn't know.

I still shouldn't have had all my eggs on one SAN (and now don't). Version 4 of ESX doesn't allow this without having to click through some very prominent "are you sure!?!?!" boxes, whereas v3.5 apparently just throws it wherever, making it easy to shoot yourself in the foot.

Dave

From: Brian Desmond [mailto:br...@briandesmond.com]
Sent: Friday, October 08, 2010 2:13 PM
To: NT System Admin Issues
Subject: RE: How'd this for a bad day? AKA bad me

Sounds like you should home the redundant sets of VMs on different SAN volumes/whatever?

Thanks,
Brian Desmond
br...@briandesmond.com
c - 312.731.3132

From: David Lum [mailto:david@nwea.org]
Sent: Friday, October 08, 2010 11:51 AM
To: NT System Admin Issues
Subject: How'd this for a bad day? AKA bad me

I have 7 production systems ...

~ Finally, powerful endpoint security that ISN'T a resource hog! ~
~ http://www.sunbeltsoftware.com/Business/VIPRE-Enterprise/ ~

---
To manage subscriptions click here: http://lyris.sunbelt-software.com/read/my_forums/
or send an email to listmana...@lyris.sunbeltsoftware.com
with the body: unsubscribe ntsysadmin
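Brian's suggestion to home redundant sets of VMs on different SAN volumes can be checked mechanically. What follows is only a rough sketch, not anything taken from the thread: the VM names, group names, and volume names are hypothetical, and the inventory is a hand-written dictionary rather than anything pulled from vCenter. It flags any redundancy group whose VMs all sit on a single volume.

```python
from collections import defaultdict


def single_datastore_groups(placements):
    """Return {group: datastore} for every redundancy group that has two or
    more VMs, all of which live on the same datastore.

    placements maps a VM name to a (group, datastore) tuple.
    """
    datastores_by_group = defaultdict(set)
    vm_count_by_group = defaultdict(int)
    for _vm, (group, datastore) in placements.items():
        datastores_by_group[group].add(datastore)
        vm_count_by_group[group] += 1

    at_risk = {}
    for group, datastores in datastores_by_group.items():
        # A lone VM has no redundant partner, so only groups of 2+ are flagged.
        if vm_count_by_group[group] > 1 and len(datastores) == 1:
            at_risk[group] = next(iter(datastores))
    return at_risk


# Hypothetical inventory: every name below is made up for illustration.
placements = {
    "ts1": ("terminal-servers", "san-vol-a"),
    "ts2": ("terminal-servers", "san-vol-a"),
    "ts3": ("terminal-servers", "san-vol-a"),
    "web1": ("internal-web", "san-vol-a"),
    "web2": ("internal-web", "san-vol-b"),
}

for group, datastore in single_datastore_groups(placements).items():
    print(f"{group}: all VMs are on {datastore}")
```

With the made-up inventory above, only the terminal-servers group is reported, because the two web servers are already split across volumes. A real check would need to build the dictionary from whatever inventory source is actually available.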
How'd this for a bad day? AKA bad me
I have 7 production systems running on 3 different ESX boxes in an ESX cluster, and 2 different logical SAN volumes (sorry, am not SAN savvy, I just know I have two different SAN volumes to choose from when making a VM). Today, a SAN blows up and takes out half - our SharePoint server (heavily used), a Terminal Server, and an internal occasionally-used web server (Namescape rDirectory). Then somehow, when I was told to power down the other 4 VMs so our VMware guy could reboot a vCenter server, 3 of the 4 remaining VMs decided to go AWOL (a combination of "missing" and "disconnected"). That took out my other two Terminal Servers and another lightly used internal web server.

Did I mention I don't have the normal backups for these things because ...well...I'm an idiot and didn't confirm our backup guy installed backup software on these servers as I stood them up (process error on my part since I should confirm it's on there). None of these store data - they all talk to a backend SQL, and the Terminal Servers are used to run apps that are slow if they run the same apps over VPN.

SharePoint we got back quick because we do have a staging equivalent of it, so it was repoint to a config and content DB, DNS change, and done. I do have copious notes on how I built the others and can rebuild from scratch easily enough (I just finished the three TS boxes), but dude...six servers at once?

The most frustrating part was discovering that the 4 systems that had been powered off could have been "migrated" before power off and there would have been no issue with them - the power down nuked 'em.

Oh, and the lone surviving server - the PGP Universal Server that manages the encrypted machines. (Yes, the PGP machines will still boot w/out the server up, but still, I've been on this server 50% of my time over the last two weeks!).

Dave
Re: How'd this for a bad day? AKA bad me
Yes, process failures can be deadly... Also, it is more important in this day and age of massive consolidation to make sure that your backups and DR are effective, because cascading failures can take out much more of your infrastructure than ever before.

ASB
(My XeeSM Profile) http://XeeSM.com/AndrewBaker
Exploiting Technology for Business Advantage...

On Fri, Oct 8, 2010 at 5:51 AM, David Lum david@nwea.org wrote:

I have 7 production systems ...
RE: How'd this for a bad day? AKA bad me
All I can say is OUCH! :-(

From: David Lum [mailto:david@nwea.org]
Sent: Friday, October 08, 2010 5:51 AM
To: NT System Admin Issues
Subject: How'd this for a bad day? AKA bad me

I have 7 production systems ...
RE: How'd this for a bad day? AKA bad me
Being slightly serious for a moment, it's a pretty good illustration of how something like a SAN in isolation is no use :-)

-Original Message-
From: John Aldrich [mailto:jaldr...@blueridgecarpet.com]
Sent: 08 October 2010 13:43
To: NT System Admin Issues
Subject: RE: How'd this for a bad day? AKA bad me

All I can say is OUCH! :-(

--
MIRA Ltd
Watling Street, Nuneaton, Warwickshire, CV10 0TU, England
Registered in England and Wales No. 402570
VAT Registration GB 114 5409 96

The contents of this e-mail are confidential and are solely for the use of the intended recipient. If you receive this e-mail in error, please delete it and notify us either by e-mail, telephone or fax. You should not copy, forward or otherwise disclose the content of the e-mail as this is prohibited.
RE: How'd this for a bad day? AKA bad me
Yep. Good point. :-) VERY good point!

-Original Message-
From: Paul Hutchings [mailto:paul.hutchi...@mira.co.uk]
Sent: Friday, October 08, 2010 8:55 AM
To: NT System Admin Issues
Subject: RE: How'd this for a bad day? AKA bad me

Being slightly serious for a moment, it's a pretty good illustration of how something like a SAN in isolation is no use :-)
Re: How'd this for a bad day? AKA bad me
Why do you need to power down VMs to reboot vCenter? vCenter might be the problem with the missing VMs. VMWare support might be able to help you with those.

Jeff

On Fri, Oct 8, 2010 at 5:51 AM, David Lum david@nwea.org wrote:

I have 7 production systems ...
Re: How'd this for a bad day? AKA bad me
+1

I'm just getting caught up on emails this morning. vCenter reboot shouldn't necessitate a reboot of a host server.

On Fri, Oct 8, 2010 at 9:34 AM, Jeff Bunting bunting.j...@gmail.com wrote:

Why do you need to power down VMs to reboot vCenter? vCenter might be the problem with the missing VMs. VMWare support might be able to help you with those.

Jeff
RE: How'd this for a bad day? AKA bad me
I don't know the exact details (and don't remember at the moment), my guess is they needed to do something SAN side - I just now heard one SAN store is what died. Today is gonna bite..

From: Jeff Bunting [mailto:bunting.j...@gmail.com]
Sent: Friday, October 08, 2010 6:35 AM
To: NT System Admin Issues
Subject: Re: How'd this for a bad day? AKA bad me

Why do you need to power down VMs to reboot vCenter? vCenter might be the problem with the missing VMs. VMWare support might be able to help you with those.

Jeff
RE: How'd this for a bad day? AKA bad me
+1 from here as well. A vCenter reboot should not require a host reboot. If it did, that would (IMHO) be a huge problem in the design and purpose behind VMware. Talk to VMware. If your maintenance is not current, get current.

On a related note, YESTERDAY, one of our storage groups on our SAN ran out of space (fortunately I'm not in or over the group responsible for that anymore!), and thus took down a number of systems, all part of our core electronic medical record system, eClinicalWorks, all virtual... We were without that app for more than 6 hours, and are still dealing with database replication issues today as a result.

TGIF!

Jonathan L. Raper, A+, MCSA, MCSE
Technology Coordinator
Eagle Physicians Associates, PA
jra...@eaglemds.com
www.eaglemds.com

From: Jonathan Link [mailto:jonathan.l...@gmail.com]
Sent: Friday, October 08, 2010 9:40 AM
To: NT System Admin Issues
Subject: Re: How'd this for a bad day? AKA bad me

+1 I'm just getting caught up on emails this morning. vCenter reboot shouldn't necessitate a reboot of a host server.

Any medical information contained in this electronic message is CONFIDENTIAL and privileged. It is unlawful for unauthorized persons to view, copy, disclose, or disseminate CONFIDENTIAL information. This electronic message may contain information that is confidential and/or legally privileged. It is intended only for the use of the individual(s) and/or entity named as recipients in the message. If you are not an intended recipient of this message
Re: How'd this for a bad day? AKA bad me
On Fri, Oct 8, 2010 at 5:51 AM, David Lum david@nwea.org wrote: I have 7 production systems ...
Oh, boy. Fun. I've had days like that. Not many, fortunately (and knock on wood). Hope you get it all sorted out in time for the weekend! Today I find myself having to arbitrate a pooch screw regarding important procedures, and thus get everyone's story and try and make sense of it all. I feel like I'm playing the cop in a police interrogation scene. I much prefer dealing with recalcitrant machines than people.
-- Ben
Re: How'd this for a bad day? AKA bad me
Machines are recalcitrant, they're just misunderstood.
On Fri, Oct 8, 2010 at 12:15 PM, Ben Scott mailvor...@gmail.com wrote: Oh, boy. Fun. I've had days like that. ...
Re: How'd this for a bad day? AKA bad me
If the systems are still actually on the LUNs, then you should be able to reconnect them and bring them up. Rebooting vCenter should not have had anything to do with shutting down guests, but rebooting the SAN might possibly have been required to address its fire. From vCenter, just reconnect to the ESX hosts, and then start connecting to the guests. Frankly, I'd get on hold with VMware now. They are pretty good at getting this sort of thing sorted out, so rebuilding shouldn't be necessary unless the data on the SAN went poof.
Steven Peck http://www.blkmtn.org
On Fri, Oct 8, 2010 at 9:20 AM, Jonathan Link jonathan.l...@gmail.com wrote: Machines are recalcitrant, they're just misunderstood.
Re: How'd this for a bad day? AKA bad me
Your "not" is AWOL
*ASB*
On Fri, Oct 8, 2010 at 12:20 PM, Jonathan Link jonathan.l...@gmail.com wrote: Machines are recalcitrant, they're just misunderstood.
Re: How'd this for a bad day? AKA bad me
That's not the only thing...
On Fri, Oct 8, 2010 at 12:32 PM, Andrew S. Baker asbz...@gmail.com wrote: Your not is AWOL
Re: How'd this for a bad day? AKA bad me
I've said it before, but I will say it again. In a highly virtualized, heavily consolidated world, we need more planning, more thinking and more time for effective execution. Cutting corners will become more and more painful, and will bite more and more organizations. Hopefully, enough near misses will teach enough entities to do the right thing. That's just my optimism speaking, however. It will be incumbent on each technology professional to advocate or fight for the right solutions, or have an excellent exit strategy planned out. :)
*ASB* (My XeeSM Profile) http://XeeSM.com/AndrewBaker *Exploiting Technology for Business Advantage...*
On Fri, Oct 8, 2010 at 11:27 AM, Raper, Jonathan - Eagle jra...@eaglemds.com wrote: +1 from here as well. A vCenter reboot should not require a host reboot. ...
RE: How'd this for a bad day? AKA bad me
Yeah, I seem to run into this kind of "I should change my career" event once every five years or so, although this event isn't nearly as stressful as being at a client (these down systems are at %dayjob%) and having a RAID5 card die and thinking "I don't even know how the RAID volumes were configured, this setup pre-dated me...", this on their primary SBS server. The worst in my 15 years was P2V-ing a different customer's SBS server with Hyper-V, then about two months later when I rebooted the host, SCVMM (MS's fancy VM manager) tells me "No virtual machines found..."
Current status of my disaster: I have 5 of 6 servers back up and 95%+ back to normal, not too bad for 12 hours of work...or is it? The last server is low on the critical list, so I believe I will not suffer a heart attack this day.
Dave
-Original Message- From: Ben Scott [mailto:mailvor...@gmail.com] Sent: Friday, October 08, 2010 9:16 AM To: NT System Admin Issues Subject: Re: How'd this for a bad day? AKA bad me
Oh, boy. Fun. I've had days like that. ...
RE: How'd this for a bad day? AKA bad me
Just be glad it didn't happen on a Monday! Terrible way to start off a week!
Jonathan L. Raper, A+, MCSA, MCSE Technology Coordinator Eagle Physicians Associates, PA jra...@eaglemds.com www.eaglemds.com
-Original Message- From: David Lum [mailto:david@nwea.org] Sent: Friday, October 08, 2010 12:54 PM To: NT System Admin Issues Subject: RE: How'd this for a bad day? AKA bad me
Yeah I seem to run into this kind of I should change my career event once every five years or so ...
RE: How'd this for a bad day? AKA bad me
Amen
-Original Message- From: Andrew S. Baker [mailto:asbz...@gmail.com] Sent: Friday, October 08, 2010 11:36 AM To: NT System Admin Issues Subject: Re: How'd this for a bad day? AKA bad me
I've said it before, but I will say it again. In a highly virtualized, heavily consolidated world, we need more planning, more thinking and more time for effective execution. ...
Root cause of: RE: How'd this for a bad day? AKA bad me
So, the root cause: the ESX 3.5 OS was installed onto the SAN volume that contained my VMs. The install of that OS (effectively) removes pointers that VMs need when they boot up. Best practice is to disconnect the SAN links when installing this version of the OS so this doesn't happen. In fact our SE did this, but apparently didn't disconnect one far enough. If we had left the VMs running we could have used a VM converter to move them to a different storage location. ESX 4.0 doesn't allow this activity.
Our SE feels really bad about the work he created for me - personally I'm just really happy he's a stand-up guy and explained what happened. You do this stuff long enough and something like this eventually happens - it's called "experience".
Dave
From: Andrew S. Baker [mailto:asbz...@gmail.com] Sent: Friday, October 08, 2010 9:36 AM To: NT System Admin Issues Subject: Re: How'd this for a bad day? AKA bad me
I've said it before, but I will say it again. In a highly virtualized, heavily consolidated world, we need more planning, more thinking and more time for effective execution. ...
Re: Root cause of: RE: How'd this for a bad day? AKA bad me
Experience may not be the best teacher, but it is the most expensive one...
On Fri, Oct 8, 2010 at 13:34, David Lum david@nwea.org wrote: So, the root cause: ESX 3.5 OS was installed onto SAN volume that contained my VM's. ...
RE: How'd this for a bad day? AKA bad me
Sounds like you should home the redundant sets of VMs on different SAN volumes/whatever?

Thanks,
Brian Desmond
br...@briandesmond.com
c - 312.731.3132

From: David Lum [mailto:david@nwea.org]
Sent: Friday, October 08, 2010 11:51 AM
To: NT System Admin Issues
Subject: How'd this for a bad day? AKA bad me

I have 7 production systems running on 3 different ESX boxes in an ESX cluster, and 2 different logical SAN volumes (sorry am not SAN savvy, I just know I have two different SAN volumes to choose from when making a VM). Today, a SAN blows up and takes out half - our SharePoint server (heavily used), a Terminal Server, and an internal occasionally-used web server (Namescape rDirectory). Then somehow, when I was told to power down the other 4 VM's so our VMWare guy could reboot a vCenter server, 3 of the 4 remaining VM's decided to go AWOL (a combination of missing and disconnected). That took out my other two Terminal Servers and another lightly used internal web server.

Did I mention I don't have the normal backups for these things because ...well...I'm an idiot and didn't confirm our backup guy installed backup software on these servers as I stood them up (process error on my part since I should confirm it's on there). None of these store data - they all talk to a backend SQL and the Terminal Servers are used to run apps that are slow if they run the same apps over VPN. SharePoint we got back quick because we do have a staging equivalent of it, so it was repoint to a config and content DB, DNS change, and done. I do have copious notes on how I built the others and can rebuild from scratch easily enough (I just finished the three TS boxes), but dude...six servers at once?

The most frustrating part was discovering that the 4 systems that had been powered off could have been migrated before power off and there would have been no issue with them - the power down nuked 'em. Oh, and the lone surviving server - the PGP Universal Server that manages the encrypted machines.
(Yes, the PGP machines will still boot w/out the server up, but still, I've been on this server 50% of my time over the last two weeks!).

Dave