On Sun, Apr 12, 2015 at 05:45:53PM -0400, Michael Tiernan wrote:
> On 4/12/15 1:25 PM, [email protected] wrote:
> > Michael Tiernan wrote:
> >> Normally, what will happen is that the kickstart process will wipe and
> >> rebuild on the drive in Slot1 since it is the first drive. This is not
> >> the desired outcome.
> >>
> >> What I want to do is confirm that the drive I'm focusing on is the
> >> "correct" one in the physical hardware slot 0.
> >
> > Curious why the 'for whatever reason' has gone unremarked.
> > What's the use case? (Frankly, could this happen to me and
> > should I pay close attention?) One response seemed to assume you
> > have data on other disks you want to preserve. Maybe you're
> > trying to track a physical disk that contains the root
> > filesystem? Fail the install if any disks are inop? Something
> > else?
>
> First off, a preface/reminder: we don't always get to choose the
> entirety of the infrastructure we inherit. :( (i.e. suspend
> preferred logic and assume we pick one battle at a time.)
>
> So, my use case, as screwy as it may seem, is this:
>
> I've got a machine with > 1 drives in it. Usually the number is 6, 8, or
> greater now that we've got some new slot-rich Dells in the racks.
>
> The machine is running along with the system on the disk in slot 0 and
> the other 6, 8, etc. drives configured as individual RAID0 containers or
> just as raw disks. Don't ask, just go with it. Principal rule: data (on
> data drives) is sacred and should never be lost.
>
> Now, something happens and "other person with permission" reboots the
> system "because," and instead of it coming up, we find that the system
> drive has gone bad. Sadly this happens much more often than I'd like.
> This results in a situation where I cannot determine the UUIDs of the
> existing drives and divine where to build the new root.
>
> Sometimes the system drive truly disappears, and in other cases it just
> begins to exhibit signs of total failure.
> Either way, I replace the system drive with another drive (SATA) and
> then have to build a new system on this drive.
>
> What *sometimes* happens is that the new/replacement drive is bad or
> also going dumb. When this happens, at times, instead of its being
> counted as "Drive #1", it is ignored, and then when the kickstart
> proceeds, the *SECOND* drive in the system, the first data drive, aka
> "Drive #2", gets wiped out and the system built on it.[2] This is not
> the desired result.
I think you can solve this in a generic fashion, but it will be hard.
You'll need an external source of truth. I imagine something like this:

0. Everything is working properly.

1. An inventory builder records the serial numbers of all drives and
   deduces the system drive from the fact that it has the root
   filesystem in use. The inventory is uploaded to a control server.

2. A kickstart partitioning script enumerates the serial numbers of all
   drives and sends that info to the control server. The control server
   matches those against the list, and if it can fit that list to an
   existing machine, sends back the serial number of the new system
   disk (which it had not previously known).

3. If two or more serial numbers are wrong, the control server sends
   back a stop message, and then sends mail or otherwise alerts a human
   that something will need manual attention.

4. The inventory builder script is run on a successful boot with all
   filesystems mounted and no disks reporting errors.

While this system does not currently exist (to my knowledge), it would
not be ridiculous to build, especially if you have an existing inventory
database.

-dsr-

_______________________________________________
Discuss mailing list
[email protected]
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
http://lopsa.org/
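[Editor's note] The decision the control server makes in steps 2 and 3 of
the proposal above can be sketched in a few lines. This is a hypothetical
illustration, not an existing tool: the function name `pick_system_disk`,
the inventory format (a dict mapping drive serial to a 'system'/'data'
role, as recorded on the last good boot), and the convention of returning
None to mean "stop and alert a human" are all assumptions.

```python
def pick_system_disk(reported, inventory):
    """Decide which disk the installer may safely wipe.

    reported  -- serial numbers enumerated by the kickstart script
    inventory -- dict of serial -> role ('system' or 'data') recorded
                 by the inventory builder on the last successful boot
    Returns the serial of the replacement system disk, or None to
    signal STOP (manual attention required).
    """
    reported = set(reported)
    # Serials the control server has never seen before.
    unknown = reported - set(inventory)
    # Every data drive recorded in inventory must still be present;
    # data on data drives is sacred and must never be wiped.
    data = {s for s, role in inventory.items() if role == 'data'}
    if len(unknown) == 1 and data <= reported:
        # Exactly one never-seen disk, all data drives accounted for:
        # the unknown disk is the replacement system drive.
        return next(iter(unknown))
    # Two or more unknown serials, or a data drive missing: stop.
    return None
```

Note that the old system drive's serial may or may not still appear in
`reported` (the drive sometimes disappears entirely, sometimes merely
fails); the logic above works either way, since it keys off the one
serial that was never in inventory rather than off what vanished.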
