[gentoo-amd64] Re: Hi and init problem

Duncan Mon, 08 May 2006 03:15:01 -0700

Dieter Ries posted <[EMAIL PROTECTED]>, excerpted below,  on
Mon, 08 May 2006 10:30:02 +0200:


> I still dont understand why
> Checking all filesystems
> is running in the boot-up process without checkfs and checkroot in one of
> the runlevels.

There are two reasons for that.

One, Gentoo has an initscript dependency system.  If you had read the
Working with Gentoo section of the handbook, you'd probably understand
this a bit better.  Unfortunately, many people apparently think the
handbook is only for installation, and miss what the rest of it covers
about how Gentoo works.  Without that understanding, they are much less
efficient at administering their Gentoo system than they'd otherwise be,
making mistakes they'd not make had they read the documentation.  Gentoo
has a reputation for some of the best documentation in the community, so
it's a shame when folks don't read it and end up doing things the hard
way as a result.

Anyway, what it amounts to is that other initscripts depend on checkfs and
checkroot, so the system ensures they are run before these other
initscripts run, even if checkroot and checkfs aren't directly listed to
be run, themselves.  Again, this is covered in the handbook, if you want
to better understand how and why it works that way.
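
As a rough illustration of that dependency system, here is what the
depend() function of a made-up initscript might look like.  The service
name "myservice" and the dependencies chosen are assumptions for the
example, not copied from any real script; need and use are the
dependency keywords the handbook describes.

```shell
# Dependency section of a hypothetical /etc/init.d/myservice.
# The service name and its dependencies are assumed for illustration.
depend() {
    # Hard dependency: localmount must be started first, even if it
    # is not listed in any runlevel itself.
    need localmount
    # Soft dependency: order after logger only if logger is in use.
    use logger
}
```

With something like this in place, adding myservice to a runlevel would
be enough to get localmount started as well, which is the effect
described above.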

Reason two is the mechanism actually in play here, however.  Without it,
the system would fall back to reason one above.  Unfortunately, this one
is /not/ covered in the handbook, or wasn't last I looked, anyway.
However, it's a logical extension of reason one, so understanding reason
one makes reason two easier to follow.

As actually implemented by the /sbin/rc initscript (which is run
repeatedly by init, as configured in /etc/inittab, as part of the boot
process), certain scripts are considered "critical" to the boot process,
and thus, barring a local configuration that bypasses them, default to
being run directly by /sbin/rc as part of the boot process, regardless of
whether they are in the boot runlevel or not.

Take a look at the "get_critical_services" routine in /sbin/rc. 
Basically, unless you have an /etc/runlevels/boot/.critical file, rc sets:

CRITICAL_SERVICES="checkroot modules checkfs localmount clock"

Those services are then started in exactly that order, directly by rc,
before the boot runlevel is run, regardless of whether they are set
to be started by the boot runlevel or not.
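
The selection logic just described can be sketched roughly as follows.
This is a simplified approximation, not the actual code from /sbin/rc;
the function name and the directory argument are made up so the logic
can be tried against a scratch directory.

```shell
# Simplified sketch of the behaviour described above; NOT the real
# get_critical_services routine from /sbin/rc.
# $1: boot runlevel directory (defaults to /etc/runlevels/boot)
critical_services_sketch() {
    bootdir="${1:-/etc/runlevels/boot}"
    if [ -f "${bootdir}/.critical" ]; then
        # Local override: use whatever the admin listed in the file.
        CRITICAL_SERVICES="$(cat "${bootdir}/.critical")"
    else
        # No override: fall back to the default list.
        CRITICAL_SERVICES="checkroot modules checkfs localmount clock"
    fi
    echo "${CRITICAL_SERVICES}"
}
```

Called with no argument it looks at /etc/runlevels/boot, so on a system
without a .critical file it should simply print the default list.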

If you have the modules you need to mount your automatically mounted
filesystems built into the kernel, you can eliminate modules from that
list.  You can also try eliminating checkroot and checkfs, and localmount
in some cases, but the results won't always be quite what you expected. 
Certain other services might not start in the expected order, or at all,
because things they depend on and assume are there are missing.

With my system, I can safely list only checkroot and clock in my
/etc/runlevels/boot/.critical file.  That works, although I have checkfs and
localmount in the boot runlevel so they get run anyway -- they just
parallelize a bit better (I have RC_PARALLEL_STARTUP="yes" set in
/etc/conf.d/rc).  However, if I remove checkroot or clock from the
.critical file, things don't work quite right -- they have to be there and
started by rc directly or the rest of the services in the boot runlevel
don't work as intended.
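
If you do trim the list, one quick sanity check is to see which of the
default critical services end up neither in .critical nor in the boot
runlevel.  A minimal sketch follows; the function name is made up, and
it assumes services added to the boot runlevel appear as entries named
after the service under /etc/runlevels/boot.

```shell
# Hypothetical helper, not part of baselayout.
# $1: boot runlevel directory (defaults to /etc/runlevels/boot)
unlisted_critical_services() {
    bootdir="${1:-/etc/runlevels/boot}"
    # Without a .critical file rc uses its default list; nothing is dropped.
    [ -f "${bootdir}/.critical" ] || return 0
    for svc in checkroot modules checkfs localmount clock; do
        # Listed in .critical: rc starts it directly.
        grep -qw "${svc}" "${bootdir}/.critical" && continue
        # Present in the boot runlevel: started from there instead.
        if [ -e "${bootdir}/${svc}" ] || [ -L "${bootdir}/${svc}" ]; then
            continue
        fi
        echo "${svc}"
    done
}
```

Anything it prints will only be started if some other initscript's
dependencies pull it in, so those are the ones to double-check.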


The question then occurs...  Why are these services considered so
critical?  In general, you will find your system remains much more stable
if you run checkroot and checkfs at boot every time, for your normally
mounted filesystems.  The problem is that a hardware fault that would
cause a small problem, if caught by an fsck at the next boot, may end up
being a HUGE problem if the system is allowed to continue writing to that
filesystem as if nothing were wrong.  A single cross-linked file can soon
become hundreds or thousands, as the metadata becomes increasingly
jumbled, until it's impossible to recover from without simply overwriting
it with a good backup.  The problem may take weeks or months, even years,
to develop into a system stability compromising issue that's finally
noticed when something critical gets damaged.  However, regularly running
those at-boot fscks ensures that doesn't happen.  With a journaled
filesystem, it's not as if it takes hours to run those checks anyway.  A
few extra seconds or a minute taken at boot, can save you a huge amount of
work later, because a small and initially insignificant error wasn't
caught until hundreds of files had been corrupted.

Of course, one is also expected to use fstab appropriately, turning off
fsck at boot for non-critical or not automounted filesystems.  Here, I
have identical backup snapshots of all the filesystems I consider valuable
enough to want to retain.  Those are not automounted, and are only written
to when I mkfs them and recopy over the data from the live filesystem
periodically as part of my backup routine.  As such, there's no need to
fsck them at every boot, because they've most likely not even been touched
since the last boot, not written to, not read from, or even mounted. 
Likewise, for any partitions (like /tmp) that contain essentially throwaway
data, it's probably safe to skip the fsck by putting a zero in the sixth
(pass number) column of fstab.
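
To see at a glance which filesystems are currently exempt from the
boot-time fsck, something like the following can help.  It is only a
sketch: the function name is made up, and it relies on the standard
fstab layout where the sixth field is the fsck pass number and a
missing sixth field counts as zero.

```shell
# Hypothetical helper: print the mount points whose fsck pass number
# (sixth fstab field) is zero or absent.
# $1: fstab file to read (defaults to /etc/fstab)
list_unchecked_filesystems() {
    awk '$1 !~ /^#/ && NF >= 2 && (NF < 6 || $6 == 0) { print $2 }' \
        "${1:-/etc/fstab}"
}
```

Run without arguments, it reads /etc/fstab.  Anything you depend on
that shows up in the output is worth a second look.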

For any partitions you depend on, however, while you can probably get away
with avoiding fsck at boot in the short term, to be safe, it's far better
just to do it.  As mentioned by someone else, you can set ext3 partitions
to not fsck at every boot, if desired.  That's a useful option.  Set it to
every third boot, or every fifth, but don't turn it off entirely, at the
risk of not catching minor/insignificant damage until it's major and
causes you serious issues.  Keep in mind that even a partition never
written to will develop "bit rot" over time, due to cosmic ray bitflipping
and the like.  The reality is that on the single bit level hard drives
aren't nearly as reliable as we like to think they are.  Awesome levels of
automated redundant information and error correction normally handle the
problems as they develop, correcting them behind the scenes.  That's
normal and good, and generally suffices for partitions not normally
written to.  However, once you start actively using a partition, writing
as well as reading, if one of those normally insignificant bitflips
happens in the wrong place, your write intended for one location on the
disk might end up at quite a different location.  That's what automated
fscks at boot, even after proper shutdown, are designed to detect and
correct.  Catch it early, and it's insignificant, background noise,
corrected by automated mechanisms such that you likely won't notice it at
all.  Fail to do those automated boot-time fscks, and you are playing the
odds, risking your data.  Setting the fscks to once every third boot is
still well within reasonable safety limits.  Setting one in five should be
safe under normal conditions but is playing the odds a bit more.  I'd not
recommend turning it off altogether, or setting it much less frequently
than one in five, as that's just undue risk, IMO.  You may well have no
problems doing it that way for years, if ever.  Another person may have
problems in a week or a month.  It's up to you how much risk you want to
put your data at.
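
For ext3, the interval is set with tune2fs, for example
"tune2fs -c 3 /dev/sdXN" for a check every third mount (the device name
is a placeholder, and note that the counter is per mount, not strictly
per boot).  To see how many mounts remain before the next forced check,
the "Mount count" and "Maximum mount count" lines of "tune2fs -l" can be
compared.  The helper below is a sketch with a made-up name; it assumes
those two labels appear in the output as written.

```shell
# Hypothetical helper: read `tune2fs -l` output on stdin and print how
# many mounts remain before the next count-based fsck, or "disabled"
# if no positive maximum mount count is set.
mounts_until_fsck() {
    awk -F': *' '
        $1 == "Mount count"         { count = $2 }
        $1 == "Maximum mount count" { max = $2 }
        END {
            if (max <= 0) print "disabled"
            else print max - count
        }'
}
```

Usage would be along the lines of
"tune2fs -l /dev/sdXN | mounts_until_fsck", run as root.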

Meanwhile, back in the Gentoo init scripts, mandating checkroot and
checkfs as "critical" parts of the boot sequence remains the most sane
default.  Gentoo provides the configurability to change those defaults for
those sysadmins that choose to do so, but setting anything else as the
default would simply not be the sane or responsible thing for Gentoo devs
to do.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman in
http://www.linuxdevcenter.com/pub/a/linux/2004/12/22/rms_interview.html


-- 
gentoo-amd64@gentoo.org mailing list
