Root cause of: RE: How'd this for a bad day? AKA bad me

2010-10-08 Thread David Lum
So, the root cause: the ESX 3.5 OS was installed onto the SAN volume that
contained my VMs. The install of that OS (effectively) removes pointers that
VMs need when they boot up. Best practice is to disconnect the SAN links when
installing this version of the OS so this doesn't happen. In fact our SE did
this, but apparently didn't disconnect one far enough. If we had left the VMs
running, we could have used a VM converter to move them to a different
storage location.

ESX 4.0 doesn't allow this activity.

Our SE feels really bad about the work he created for me - personally I'm just
really happy he's a stand-up guy and explained what happened. You do this stuff
long enough and something like this eventually happens - it's called
"experience".

Dave

From: Andrew S. Baker [mailto:asbz...@gmail.com]
Sent: Friday, October 08, 2010 9:36 AM
To: NT System Admin Issues
Subject: Re: How'd this for a bad day? AKA bad me

I've said it before, but I will say it again.

In a highly virtualized, heavily consolidated world, we need more planning, 
more thinking and more time for effective execution.

Cutting corners will become more and more painful, and will bite more and more 
organizations.

Hopefully, enough near misses will teach enough entities to do the right thing.
That's just my optimism speaking, however.

It will be incumbent on each technology professional to advocate or fight for 
the right solutions, or have an excellent exit strategy planned out. :)

ASB (My XeeSM Profile) http://XeeSM.com/AndrewBaker
Exploiting Technology for Business Advantage...

On Fri, Oct 8, 2010 at 11:27 AM, Raper, Jonathan - Eagle
jra...@eaglemds.com wrote:
+1 from here as well. A vCenter reboot should not require a host reboot. If it 
did, that would (IMHO) be a huge problem in the design and purpose behind 
VMware. Talk to VMware. If your maintenance is not current, get current.

On a related note, YESTERDAY, one of our storage groups on our SAN ran out of 
space (fortunately I'm not in or over the group responsible for that anymore!), 
and thus took down a number of systems, all part of our core electronic medical 
record system, eClinicalWorks, all virtual... We were without that app for more 
than 6 hours, and are still dealing with database replication issues today as a
result.

TGIF!

Jonathan L. Raper, A+, MCSA, MCSE
Technology Coordinator
Eagle Physicians & Associates, PA
jra...@eaglemds.com
www.eaglemds.com


From: Jonathan Link [mailto:jonathan.l...@gmail.com]
Sent: Friday, October 08, 2010 9:40 AM
To: NT System Admin Issues
Subject: Re: How'd this for a bad day? AKA bad me

+1  I'm just getting caught up on emails this morning.  vCenter reboot 
shouldn't necessitate a reboot of a host server.



On Fri, Oct 8, 2010 at 9:34 AM, Jeff Bunting bunting.j...@gmail.com wrote:
Why do you need to power down VMs to reboot vCenter?  vCenter might be the
problem with the missing VMs.  VMware support might be able to help you with
those.

Jeff

On Fri, Oct 8, 2010 at 5:51 AM, David Lum david@nwea.org wrote:
I have 7 production systems running on 3 different ESX boxes in an ESX cluster, 
and 2 different logical SAN volumes (sorry am not SAN savvy, I just know I have 
two different SAN volumes to choose from when making a VM).

Today, a SAN blows up and takes out half - our SharePoint server (heavily
used), a Terminal Server, and an internal occasionally-used web server
(Namescape rDirectory). Then somehow, when I was told to power down the other 4
VMs so our VMware guy could reboot a vCenter server, 3 of the 4 remaining VMs
decided to go AWOL (a combination of "missing" and "disconnected"). That took
out my other two Terminal Servers and another lightly used internal web server.

Did I mention I don't have the normal backups for these things because
...well... I'm an idiot and didn't confirm our backup guy installed backup
software on these servers as I stood them up (process error on my part, since I
should confirm it's on there). None of these store data - they all talk to a
backend SQL, and the Terminal Servers are used to run apps that are slow when
users run them over VPN. We got SharePoint back quickly because we do have a
staging equivalent of it, so it was a repoint to a config and content DB, a DNS
change, and done.

I do have copious notes on how I built the others and can rebuild from scratch 
easily enough (I just finished the three TS boxes), but dude...six servers at 
once?

The most frustrating part was discovering that the 4 systems that had been
powered off could have been "migrated" before power off and there would have
been no issue with them - the power down nuked 'em.

Oh, and the lone surviving server - the PGP Universal Server that manages the 
encrypted machines. (Yes, the PGP machines will still boot w/out the server up, 
but still, I've been on 

Re: Root cause of: RE: How'd this for a bad day? AKA bad me

2010-10-08 Thread Kurt Buff
Experience may not be the best teacher, but it is the most expensive one...

On Fri, Oct 8, 2010 at 13:34, David Lum david@nwea.org wrote:
 So, the root cause: the ESX 3.5 OS was installed onto the SAN volume that
 contained my VMs. The install of that OS (effectively) removes pointers that
 VMs need when they boot up. Best practice is to disconnect the SAN links when
 installing this version of the OS so this doesn’t happen. In fact our SE did
 this, but apparently didn’t disconnect one far enough. If we had left the VMs
 running, we could have used a VM converter to move them to a different
 storage location.

 ESX 4.0 doesn’t allow this activity.

 Our SE feels really bad about the work he created for me – personally I’m
 just really happy he’s a stand-up guy and explained what happened. You do
 this stuff long enough and something like this eventually happens – it’s
 called “experience”.

 Dave


