2009]

James Carlson Thu, 21 May 2009 08:38:15 -0400

Jan Setje-Eilers writes:
> James Carlson wrote:
> > The current man page seems to be pretty clear that you're not supposed
> > to call it unless you really know what you're doing (which would
> > preclude calling it when the system is unstable).
> 
>   I did quote that to the customer and then explained it to them 
> further.  They did understand, and weren't opposed to changing their 
> tools, but had no method to roll out updated tools company-wide. 
> Interestingly they do have rather tight controls on system 
> configuration, making configurable behavior a viable solution. The 
> ability to configure this behavior does not as far as I can tell violate 
> the definition of the uadmin interface, and may benefit more than just 
> this one customer.


As it doesn't cover the other boot-time issues, and it currently is
just for one customer, I still think it's a hack.  And I'm very much
concerned that providing a tunable here will drive customers in
exactly the wrong direction.

If you wanted a private interface for this (say, an undocumented /etc
file or /etc/default entry or /etc/system variable that is written up
in an infodoc article and explained to this one customer), then I'd be
more supportive of the change.  I'd still think that you're putting
_way_ too many moving parts into the uadmin(2) system call interface
(how exactly does a syscall invoke a user-space archive rebuild
anyway?  or did you mean uadmin(1M)?), but making that one customer
happy sounds like a good trade-off.

However, you're proposing it as a public interface, and as something
we're committing to for the long term.  Given that it doesn't actually
solve the underlying problem, and that it mistakenly tells customers
that Solaris isn't safe to use, and that the default is to be
"unsafe," I can't agree with that.

> > Is it at all plausible that someone might fix this problem by some
> > means that do not include just "bootadm update-archive"?  If so, then
> > what exactly is that scenario?  Or is it ever possible that someone
> > might want to continue running despite the obvious problem?  Again, if
> > so, why?
> 
>   If the administrator knows why the archive is out of date and is for 
> instance willing to move forward using the older driver (if it has 
> already been loaded) they can simply clear the service and drive on.

It may have already been loaded, but the fact that it's out of sync
with the one on disk means:

  - If it happens to unload, then the next load will cause a
    *different* copy of the driver to be loaded, with possibly
    unexpected results.  It's all timing dependent and hard to
    predict.

  - The fact that it's out of date with respect to the disk is a
    likely indication that this isn't the only problem.  There may
    well be applications that depend on that driver (drivers usually
    aren't too interesting without at least some applications that use
    them), and the fact that the driver has been updated on disk
    likely indicates that the non-archive-resident applications have
    *also* been updated by the same patching process.

For the normal administrator -- one who hasn't yet memorized the
source code for the drivers -- I suspect that the behavior is just
unpredictable.  I can't see how anyone would accept that as a
reasonable risk for running the system, when the alternative is to
spend a couple of minutes rebuilding the archive and rebooting to get
a stable and predictable system.

Perhaps more importantly: if someone actually did this, and then later
ran into a problem, what would our support people say when they got
the call?

> If 
> whatever files are out of sync have not been used yet (say a driver 
> that's not part of the boot path), then it's safe to drive on and the 
> fact that the test failed is really a bug.

Can we fix that bug?  At the point when the real root is mounted, is
it possible to remember the files that have been used, so that when we
later check the archive, we know whether the out-of-date files have
been used by accident?
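
To make the question concrete, here is a rough sketch of the bookkeeping
I have in mind.  It is only an illustration, not a description of how
bootadm or the kernel works today: the function name, the per-file
digests, and the idea of a recorded "used" set are all assumptions made
up for this example.

```python
def stale_files_in_use(used_from_archive, archive_digests, disk_digests):
    """Return archive files that were used during boot AND differ from disk.

    used_from_archive: set of paths read from the archive before the real
                       root was mounted (hypothetical record).
    archive_digests:   dict mapping path -> digest of the copy in the archive.
    disk_digests:      dict mapping path -> digest of the copy on disk.
    """
    # A file is stale if the on-disk copy is missing or has a different digest.
    stale = {
        path
        for path, digest in archive_digests.items()
        if disk_digests.get(path) != digest
    }
    # Only stale files that were actually used are a real problem.
    return sorted(stale & set(used_from_archive))
```

With something like this, an empty result would mean the archive is out
of date but nothing stale was actually loaded, which is the case being
described as "really a bug."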

In any event, this is just a hard-to-predict corner case.  As with the
other one, rebuilding is safer and easier.  It's possible that driver
developers and others hacking around in the kernel may know when
skipping an archive rebuild is ok, but I'm not seeing a good argument
for providing that sort of functionality to regular system
administrators.

> > If there are no realistic cases where the user can do anything but
> > update the archive based on the current disk contents, then this looks
> > to me like the same sort of "please hang up and dial 1" annoyance
> > features that we ought to be avoiding.
> > 
> > Especially so given the annoying regularity of the problem ...
> 
>   We have some plans to address these issues from multiple directions, 
> but that is a separate case.

I don't agree that it's a separate case as long as we're talking about
a committed interface and a higher-level statement of direction
regarding uadmin.

-- 
James Carlson, Solaris Networking              <james.d.carlson at sun.com>
Sun Microsystems / 35 Network Drive        71.232W   Vox +1 781 442 2084
MS UBUR02-212 / Burlington MA 01803-2757   42.496N   Fax +1 781 442 1677

Configurable Boot Archive Updates [2009/312 05/26/2009]
