Mayuresh wrote:
> Bob Proulx wrote:
> > Is your Load_Cycle_Count continuously increasing?
> 
> Doesn't look like it. It was 3634 when I started watching and over the last few
> minutes it changed only to 3635.

That still seems like a rather high load_cycle_count.  And if it is
increasing every minute then I would investigate the issue further.
What does this say?

  hdparm -B /dev/sda
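To check whether the count is still climbing, something along these lines could sample the attribute twice and print the difference. It is only a sketch. It assumes smartmontools is installed, that it runs as root, and that `smartctl -A` prints the usual attribute table with the attribute name in field 2 and RAW_VALUE in field 10. The function names `load_cycle_count` and `watch_load_cycles` are made up for illustration.

```shell
#!/bin/sh
# Sketch: sample SMART attribute 193 (Load_Cycle_Count) twice and print the change.
# Assumption: `smartctl -A` prints the classic attribute table, where the
# attribute name is field 2 and RAW_VALUE is field 10.  Run as root.

# load_cycle_count (hypothetical name): read `smartctl -A` output on stdin,
# print the raw Load_Cycle_Count value.
load_cycle_count() {
    awk '$2 == "Load_Cycle_Count" { print $10; exit }'
}

# watch_load_cycles DEVICE [SECONDS] (hypothetical name):
# sample, wait, sample again, and report the difference.
watch_load_cycles() {
    dev=$1
    interval=${2:-60}
    before=$(smartctl -A "$dev" | load_cycle_count)
    if [ -z "$before" ]; then
        echo "$dev: no Load_Cycle_Count attribute found" >&2
        return 1
    fi
    sleep "$interval"
    after=$(smartctl -A "$dev" | load_cycle_count)
    if [ -z "$after" ]; then
        echo "$dev: Load_Cycle_Count disappeared on second read" >&2
        return 1
    fi
    echo "$dev: Load_Cycle_Count $before -> $after (+$((after - before)) in ${interval}s)"
}
```

For example `watch_load_cycles /dev/sda 60` samples one minute apart. A steady +1 every minute or so would be the pattern worth investigating.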

> > Install smartmontools.  I also think you should set up regular
> > drive selftests.  Ask if you want me to suggest something about this.
> 
> Yes, please do suggest.

Is your laptop something that is mostly off and only sometimes on?  Or
something that is mostly on and sometimes mobile?  Or something
different?  Mobile devices are a little hard to schedule selftests on
because we want to run them when the device is otherwise idle but on
AC mains power.

I don't know of a perfect answer for mobile devices, so let me start by
explaining the default configuration, then my preferred configuration
for always-on systems, and then guess at something good for a mobile
device.

Install the smartmontools package.  The default configuration
dynamically searches for disk drives.  If smartd detects a failure it
will send a notification by email.  Here is the default config:

  # The word DEVICESCAN will cause any remaining lines in this
  # configuration file to be ignored: it tells smartd to scan for all
  # ATA and SCSI devices.  DEVICESCAN may be followed by any of the
  # Directives listed below, which will be applied to all devices that
  # are found.  Most users should comment out DEVICESCAN and explicitly
  # list the devices that they wish to monitor.
  DEVICESCAN -d removable -n standby -m root -M exec /usr/share/smartmontools/smartd-runner

The above is the default.  It is documented in the smartd.conf man
page.  The options listed are:

   -d TYPE Set the device type: ata, scsi, marvell, removable, 3ware,N, hpt,L/M/N
   -n MODE No check. MODE is one of: never, sleep, standby, idle
   -m ADD  Send warning email to ADD for -H, -l error, -l selftest, and -f
   -M TYPE Modify email warning behavior (see man page)

The reason it recommends removing DEVICESCAN and replacing it with an
explicit configuration is for systems with multiple disk drives.  A
server with two mirrored RAID1 disks might have one disk fail
completely.  With DEVICESCAN, smartd will only find the remaining disk
and won't know there should be a second one.  By explicitly telling it
that there should be two disks, it can report the missing one as a
failure.
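One way to get a starting point for an explicit list is to ask smartctl which devices it finds right now, while all disks are present and healthy, and turn each one into a config line. This is a sketch, assuming `smartctl --scan` prints lines like `/dev/sda -d sat # /dev/sda [SAT], ATA device`. `explicit_config` is a hypothetical name, and the directives roughly mirror the default DEVICESCAN line with `-a` added explicitly. It leaves out `-d removable`, which per smartd.conf(5) tells smartd to continue instead of exiting when the device is not present at startup; the scan also supplies its own `-d TYPE`.

```shell
#!/bin/sh
# Sketch: print one explicit smartd.conf line per device that smartctl finds now.
# Assumes `smartctl --scan` prints lines like
#   /dev/sda -d sat # /dev/sda [SAT], ATA device
# Directives roughly mirror the default DEVICESCAN line, with -a added and
# without "-d removable" (the scan supplies its own -d TYPE).
DIRECTIVES='-a -n standby -m root -M exec /usr/share/smartmontools/smartd-runner'

# explicit_config (hypothetical name)
explicit_config() {
    # Drop the trailing "# comment", keep "DEVICE -d TYPE", append the directives.
    smartctl --scan | sed 's/ *#.*//' | while read -r line; do
        if [ -n "$line" ]; then
            printf '%s %s\n' "$line" "$DIRECTIVES"
        fi
    done
}
```

Run `explicit_config` as root, review the output, and paste it into /etc/smartd.conf in place of the DEVICESCAN line.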

The default config is a good safe default in that it can be installed
on any system and provides something.  Unfortunately it never runs any
selftests.  Therefore on an always-on server I change the
configuration to this:

  # Monitor all attributes, enable automatic offline data collection,
  # automatic attribute autosave, and start a short self-test every
  # weekday between 3-4am, and a long self-test Saturdays between 3-4am.
  # Ignore attribute 194 temperature change.
  # Ignore attribute 190 airflow temperature change.
  # On failure run all installed scripts (to send notification email).
  /dev/sda -a -o on -S on -s (S/../../[1-5]/03|L/../../6/03) -I 194 -I 190 -m root -M exec /usr/share/smartmontools/smartd-runner
  /dev/sdb -a -o on -S on -s (S/../../[1-5]/03|L/../../6/03) -I 194 -I 190 -m root -M exec /usr/share/smartmontools/smartd-runner

Here is what those options mean:

   -a      Default: equivalent to -H -f -t -l error -l selftest -C 197 -U 198
   -H      Monitor SMART Health Status, report if failed
   -f      Monitor for failure of any 'Usage' Attributes
   -t      Equivalent to -p and -u Directives
   -p      Report changes in 'Prefailure' Normalized Attributes
   -u      Report changes in 'Usage' Normalized Attributes
   -o VAL  Enable/disable automatic offline tests (on/off)
   -S VAL  Enable/disable attribute autosave (on/off)
   -I ID   Ignore Attribute ID for -p, -u or -t Directive

Those should be relatively straightforward.  Basically, all of the
above says to monitor the important things and ignore the unimportant
ones.

   -s REGE Start self-test when type/date matches regular expression (see man page)

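That one is a mouthful, and that is where my comments come in to
help.  smartd matches the regular expression against a string of the
form T/MM/DD/d/HH: the test type (S short, L long), the month, the day
of the month, the day of the week (1 = Monday through 7 = Sunday), and
the hour.  Decoded that way, the
  -s (S/../../[1-5]/03|L/../../6/03)
option means:

  start a short self-test every weekday (Monday-Friday) between 3-4am,
  and a long self-test on Saturdays between 3-4am.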
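To try out a schedule without waiting for 3am, the same kind of match can be reproduced with grep. This is a sketch, assuming smartd matches the extended regular expression against the whole T/MM/DD/d/HH string as smartd.conf(5) describes; `would_run` and `would_run_now` are hypothetical helper names.

```shell
#!/bin/sh
# Sketch: check whether a smartd -s schedule would start a test of a given type
# at a given time.  Assumes smartd matches an extended regex against the whole
# string T/MM/DD/d/HH (d: 1 = Monday .. 7 = Sunday; HH: 00-23).
schedule='(S/../../[1-5]/03|L/../../6/03)'

# would_run TYPE MM DD d HH (hypothetical): exit status 0 if the schedule matches.
would_run() {
    # grep -x requires the whole line to match, like a full-string match.
    printf '%s/%s/%s/%s/%s\n' "$1" "$2" "$3" "$4" "$5" | grep -Eqx "$schedule"
}

# would_run_now TYPE (hypothetical): same check for the current local time.
# date's %u prints the day of week as 1-7 with Monday = 1, matching smartd's d.
would_run_now() {
    printf '%s/%s\n' "$1" "$(date +%m/%d/%u/%H)" | grep -Eqx "$schedule"
}
```

For example, `would_run L 03 15 6 03 && echo yes` prints yes, because the day-of-week field 6 is Saturday and the hour field is 03; the month and day fields are ignored by the `..` wildcards.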

That is the part that runs the selftests.  Without -s no selftests are
run.  The stock smartd.conf also contains examples that do almost
exactly this, but they are commented out.

For a server I like the above configuration, where the selftests are
run periodically.  For a mobile laptop this is more difficult because,
depending upon the user, it might not be powered up at all at that
time.  If it is powered up then, it is likely to be on AC mains power.
If it is on battery we probably do not want to run the selftests, or
at least not the long ones.

On my laptops I have upgraded to SSDs everywhere, so I don't need to
worry about this problem anymore.  Therefore I don't have a good
canned solution.  I think I would use anacron to run the selftests
from cron, but only when on AC power.  At this point I must leave this
as an exercise for the reader.  Combine it with what you know about
how you use your device, because everyone is different.  But I would
read up on "anacron" and "on_ac_power" and create a cron script that
runs short selftests daily and long selftests once a week or so.
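A minimal, untested starting point for such a cron script might look like this. It assumes Debian's `on_ac_power` (from the powermgmt-base package) and `smartctl` from smartmontools are installed, and that anacron runs /etc/cron.daily and /etc/cron.weekly. The file name `smart-selftest`, the helper `selftest_if_on_ac`, and the device list are hypothetical. Note the conservative choice: if `on_ac_power` cannot tell (exit status 255) the test is skipped.

```shell
#!/bin/sh
# Hypothetical /etc/cron.daily/smart-selftest: start a SMART selftest only on AC power.
# Assumes on_ac_power (Debian package powermgmt-base) and smartctl (smartmontools).
# For the weekly long test, install a copy in /etc/cron.weekly with TEST_TYPE=long.
TEST_TYPE=short
# Space-separated list of disks to test (assumption: a single disk, /dev/sda).
DEVICES="/dev/sda"

# selftest_if_on_ac TYPE DEVICE... (hypothetical): start a TYPE selftest on each DEVICE.
selftest_if_on_ac() {
    test_type=$1
    shift
    # on_ac_power exits 0 on AC, 1 on battery, 255 if it cannot tell.
    # Only 0 counts as AC here; otherwise quietly skip so cron sends no mail.
    if ! on_ac_power; then
        return 0
    fi
    for dev in "$@"; do
        # smartctl returns right away; the drive runs the test by itself.
        # Its informational text goes to /dev/null so cron does not mail it daily.
        smartctl -t "$test_type" "$dev" >/dev/null
    done
}

# $DEVICES is deliberately unquoted so it splits into separate device names.
selftest_if_on_ac "$TEST_TYPE" $DEVICES
```

Make the file executable and keep the name free of dots so run-parts picks it up. smartd's -l selftest monitoring will then report on the results of the tests this script starts.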

Hope that helps!
Bob
