Re: Short SMART check causes disk op timeouts

2008-10-28 Thread JoaoBR
On Monday 27 October 2008 20:03:21 Jeremy Chadwick wrote:
 I had no idea users were blindly uncommenting examples in

well seems you're new in support business then :)

the issue might be the reason why weapons are not delivered with rounds in the 
chambers ... so developers probably should take care of what kind of example 
they include 

 smartd.conf.sample without reading what the features do.  Then again, I
 guess many users/admins have no idea what sort of impact offline tests
 could have on a system.  Short/long tests should not have any effect on
 a running/used disk -- and most do not see any effect -- but under high
 I/O I would assume there is a chance the suspend/resume aspect of SMART
 tests could take longer than 5 seconds.  Though I am disappointed in
 the fact that people often schedule maintenance things all at the same
 time (between 0200 and 0500) but never think about the implications of
 them all running in parallel.

well good idea to start a faq with general orientations :)


-- 

João







___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Short SMART check causes disk op timeouts

2008-10-28 Thread Andriy Gapon
on 27/10/2008 21:59 Jeremy Chadwick said the following:
 On Mon, Oct 27, 2008 at 08:50:44PM +0100, martinko wrote:
 Jeremy Chadwick wrote:
 On Mon, Oct 27, 2008 at 07:52:01PM +0100, martinko wrote:
 Jeremy Chadwick wrote:
 Now, does the timeout cause loss of any data? Is there anything besides
 disabling the testing that I can do about it?
 Do you understand what short and long offline tests actually do and what
 they're used for?  :-)  If so, you'd know that running them periodically
 is more or less silly (IMHO).
 I do not, not completely :) I think I have just copied the settings from
 somewhere and only just tweaked it a bit whenever I have added a disk.
 Let me know if you figure out who or what online resource solicited
 adding daily short/long tests, as I'd like to talk to them about their
 decision.  I have a feeling whoever thought it up felt that the tests
 were performing entire sector scans of the entire disk, which is simply
 not the case.

 Hallo,

 Reading this thread I checked my config to find this: ;-)

 #/dev/ad0 -a -n standby,q -o on -S on -s (S/../.././02|L/../../7/03) 
 -m  root# ++ 2006-11-03 mato
 /dev/ad0 -a -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++  
 2006-11-03 mato

 I believe I came up with the settings after reading manual page /   
 documentation of the tool.
 Can you explain why you're doing this?  So far no one's provided a
 reason *why* they're doing short and long offline scans on a daily
 basis.  I'm under the impression the conclusion was reached like this:
 man smartd.conf ... oh, -s, a neat thing, let's enable it.

 There are negative repercussions to doing tests of this nature at such
 regular intervals.  Once-a-week is borderline acceptable; once a month
 would be quite reasonable.  I'd love to know what kind of effect daily
 tests have on MTBF; I can imagine it's reached much sooner with this.

 The main point of smartd is to monitor SMART attribute changes.  If
 you're concerned about the health of your hard disk, you should be
 looking at your logs and not relying on things like automatic short/long
 tests.  Most SMART attributes are updated immediately and not during an
 offline test, and all of those attribute changes will be logged.

 You asked Miroslav about source of his configuration.  And as it is very  
 similar to mine I think we both have it from smartd documentation. Where 
 else to look for information?  It's a usual source.  So if you think it's 
 wrong please contact the authors, we're obviously just users.
 Thanks.
 
 I'm not asking *where* you got the information from (we know where you
 and others got it from: the documentation).  I'm asking you *why* you
 enabled what you did, because this is not something smartd.conf enables
 by default (the example is commented out).
 
 If you *really* want me to talk to Bruce about this, I can/will, but I'm
 left with the impression that the example in smartd.conf is there to
 show people the syntactical usage of -o, and not to advocate its usage.
 
 PS: Btw, long offline scan is scheduled on weekly basis, not daily. If  
 it's good or not I do not know.
 
 The OP's long scan is also scheduled on a weekly basis (every Sunday),
 but his short scan trumps it.
 
 Folks, the point I'm trying to make here is that daily -- and even
 weekly -- SMART offline tests are unnecessary.  If you're that concerned
 about your disk health, you should be looking at your syslog logs for
 attribute changes that indicate drive issues.  Performing SMART offline
 tests at regular intervals like this does very little other than
 increase wear/tear on drive components (not necessarily the physical
 platters/heads; there are many pieces to a hard disk.  :-) )


BTW, I am not entirely sure which Bruce you mentioned above - probably I
missed something, but I found this post:
http://article.gmane.org/gmane.linux.utilities.smartmontools/1443

This was a long time ago, and it warns about the same balance that
you do, but it may hint at the source of authority for all the
configs around.

-- 
Andriy Gapon


Short SMART check causes disk op timeouts

2008-10-27 Thread Vaclav Haisman

Hi,
I have recently bought a new disk (Seagate 500G, ST3500320NS). I have
enabled SMART checking using the smartmontools as usual for the disk
(/dev/ad6 -a -S on -s (S/../.././03|L/../../7/03) -m root). The problem
is that each time the test runs I get messages like the following in
/var/log/messages:

Oct 26 04:54:15 35 kernel: ad6: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=836986454
Oct 26 04:54:25 35 kernel: ad6: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=836986454
Oct 26 04:54:25 35 kernel: ad6: FAILURE - WRITE_DMA48 timed out LBA=836986454
Oct 26 04:54:25 35 kernel: g_vfs_done():ad6s2d[WRITE(offset=13150142464, length=16384)]error = 5

And the SMART test results log on the disk contains a line like this:

# 1  Short offline   Interrupted (host reset)  00%   297  -

This is on 7.1-PRERELEASE #0: Wed Oct 15 18:56:54 UTC 2008, with GENERIC
kernel.

Now, does the timeout cause loss of any data? Is there anything besides
disabling the testing that I can do about it?

-- 
VH


Re: Short SMART check causes disk op timeouts

2008-10-27 Thread Jeremy Chadwick
On Mon, Oct 27, 2008 at 11:16:59AM +0100, Vaclav Haisman wrote:
 
 Hi,
 I have recently bought a new disk (Seagate 500G, ST3500320NS). I have
 enabled SMART checking using the smartmontools as usual for the disk
 (/dev/ad6 -a -S on -s (S/../.././03|L/../../7/03) -m root). The problem
 is that each time the test runs I get messages like the following in
 /var/log/messages:
 
 Oct 26 04:54:15 35 kernel: ad6: TIMEOUT - WRITE_DMA48 retrying (1 retry
 left) LBA=836986454
 Oct 26 04:54:25 35 kernel: ad6: TIMEOUT - WRITE_DMA48 retrying (0
 retries left) LBA=836986454
 Oct 26 04:54:25 35 kernel: ad6: FAILURE - WRITE_DMA48 timed out
 LBA=836986454
 Oct 26 04:54:25 35 kernel: g_vfs_done():ad6s2d[WRITE(offset=13150142464,
 length=16384)]error = 5
 
 And the SMART test results log on the disk contains line like this:
 
 # 1  Short offline   Interrupted (host reset)  00%   297
  -

First and foremost, your above smartd.conf -s flags are conflicting.
Your long offline test will never get run on Sunday; the short will run
first, and the long won't ever start (because the short is already
running).  I would recommend telling the short test to run only on
days 1-6 (Monday through Saturday; smartd numbers Sunday as day 7),
leaving Sunday solely for the long test.  (I noticed this
because the above Interrupted test indicates a short test was
interrupted and not a long).
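
The overlap can be checked with a minimal Python sketch. It assumes, per the smartd.conf documentation, that smartd builds a string of the form T/MM/DD/d/HH for each test type (d is 1 for Monday through 7 for Sunday) and matches it against the -s regular expression. Python's re module stands in here for the POSIX extended regex engine smartd uses, the whole-string match is an assumption, and the helper name matching_tests is hypothetical. The sketch only shows which test types match; it does not model which one smartd actually starts.

```python
import re
from datetime import datetime


def matching_tests(schedule, when, test_types="LSCO"):
    """Return the test type letters whose T/MM/DD/d/HH string matches schedule."""
    hits = []
    for test_type in test_types:
        # Assumed smartd format: type/month/day-of-month/weekday(1-7)/hour
        candidate = f"{test_type}/{when:%m}/{when:%d}/{when.isoweekday()}/{when:%H}"
        if re.fullmatch(schedule, candidate):
            hits.append(test_type)
    return hits


# Schedule from the original post, and a variant that keeps the short test off Sundays.
original = r"(S/../.././03|L/../../7/03)"
adjusted = r"(S/../../[1-6]/03|L/../../7/03)"

sunday_3am = datetime(2008, 10, 26, 3, 0)  # the day of the logged timeouts

print(matching_tests(original, sunday_3am))  # ['L', 'S'] -> both tests are due
print(matching_tests(adjusted, sunday_3am))  # ['L'] -> only the long test is due
```

With the adjusted expression the short test still matches Monday through Saturday at 03, and Sunday at 03 is left to the long test alone.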

Second, your short offline test runs at 0300, but the errors you're
seeing are at 0454 in the morning.  A short offline test does not
take 2 hours to run -- they take between 2-10 minutes -- unless the
system is also in the middle of doing a lot of I/O, in which case the
short test will be suspended.

There are cronjobs (specifically periodic jobs) that run starting at
0301 in the morning (periodic daily), and many of those are I/O bound.
This could possibly extend the length of the short test until 0454.

Weekly periodic jobs run at 0415 in the morning, on Sundays.  These also
perform a lot of disk I/O, so it's possible that on Sunday specifically
the short SMART test gets pushed back quite some time.

Third, the DMA timeouts you're seeing are possibly caused by the drive
taking too long when internally suspending the SMART test.

In most cases, it's safe for SMART tests (short and long) to be run
while the machine is operational, and disk I/O requests are being
performed.  When an I/O request comes and the disk is in the middle of
performing a SMART test, the drive has to stop the SMART test (e.g.
suspend it), complete the I/O request, then resume the SMART test.

The FreeBSD ATA layer has a 5 second timeout on I/O requests; if it
doesn't receive an acknowledgement back from the controller (disk)
within 5 seconds, it'll report a timeout on whatever operation it was
performing.  I'm thinking the disk gets stuck in a "do the offline
test, no wait stop there's an I/O request, okay it's done continue the
test, no wait stop there's another I/O" loop.

Another possibility is that your drive really *does* have a bad block at
LBA 836986454, and that one of those cron/periodic jobs is what's
noticing it, and that upon noticing a bad block, the drive more or less
aborts the SMART test to perform internal remapping of the block.

To confirm this, you would need to boot the SeaTools utilities from DOS
or from a CD (see Seagate's site) and run a full sector scan (NOT the
quick test).  This takes a few hours.  Assuming it comes back clean,
then my above claim of the offline test taking too long to suspend is
probably the case.

Possibly this is a firmware bug in the drive -- you might consider
mailing Seagate about this problem, although I'm doubting their Tier 1
support will understand what the issue is.

Is the block number always the same?  Do you only see this error on
Sundays?  These are two questions which might help narrow things down.
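
Both questions can be answered from /var/log/messages with a small script. The following is a sketch in Python, assuming each kernel message sits on one line in the log (the lines in the post were wrapped by the mail client) and that the year is supplied by the caller, since syslog timestamps do not carry one. The function name summarize and the regular expression are illustrative, not part of any tool.

```python
import re
from collections import Counter
from datetime import datetime

# Matches ata(4) TIMEOUT/FAILURE lines in the shape shown in the original post.
EVENT_RE = re.compile(
    r"^(?P<stamp>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ kernel: "
    r"(?P<dev>ad\d+): (?P<kind>TIMEOUT|FAILURE) - \S+ .*LBA=(?P<lba>\d+)"
)


def summarize(lines, year):
    """Count timeout/failure events per LBA and per ISO weekday (7 = Sunday)."""
    lbas, weekdays = Counter(), Counter()
    for line in lines:
        m = EVENT_RE.match(line)
        if m is None:
            continue  # not an ata timeout/failure line
        when = datetime.strptime(f"{year} {m['stamp']}", "%Y %b %d %H:%M:%S")
        lbas[int(m["lba"])] += 1
        weekdays[when.isoweekday()] += 1
    return lbas, weekdays


# The four log lines from the original post, rejoined onto single lines.
sample = [
    "Oct 26 04:54:15 35 kernel: ad6: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=836986454",
    "Oct 26 04:54:25 35 kernel: ad6: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=836986454",
    "Oct 26 04:54:25 35 kernel: ad6: FAILURE - WRITE_DMA48 timed out LBA=836986454",
    "Oct 26 04:54:25 35 kernel: g_vfs_done():ad6s2d[WRITE(offset=13150142464, length=16384)]error = 5",
]

lbas, weekdays = summarize(sample, 2008)
print(lbas)      # Counter({836986454: 3})
print(weekdays)  # Counter({7: 3})
```

Run over several days of logs, a single dominant LBA would point toward a bad block, while many distinct LBAs spread across every weekday would point away from one.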

 This is on 7.1-PRERELEASE #0: Wed Oct 15 18:56:54 UTC 2008, with GENERIC
 kernel.
 
 Now, does the timeout cause loss of any data? Is there anything besides
 disabling the testing that I can do about it?

Do you understand what short and long offline tests actually do and what
they're used for?  :-)  If so, you'd know that running them periodically
is more or less silly (IMHO).

If you're trying to accomplish a cheap version of disk scrubbing, e.g.
scanning the entire disk for bad blocks and reporting them or having them
automatically remapped by the drive, consider using sysutils/diskcheckd,
which was made for this purpose.  However, be aware of a problem I've
run into with it (still needs someone clueful to figure out why this
happens):
http://www.freebsd.org/cgi/query-pr.cgi?pr=ports/115853

I do not advocate the use of periodic offline tests on disks, especially
at such aggressive intervals (daily).  In fact, I don't even know why
Bruce added that option to smartd.  There are only a few attributes in
SMART which get updated on offline tests, so I fail to see the point.

You shouldn't be doing what you're doing, IMHO.  

Re: Short SMART check causes disk op timeouts

2008-10-27 Thread Vaclav Haisman

Jeremy Chadwick wrote:
 On Mon, Oct 27, 2008 at 11:16:59AM +0100, Vaclav Haisman wrote:

 Hi,
 I have recently bought a new disk (Seagate 500G, ST3500320NS). I have
 enabled SMART checking using the smartmontools as usual for the disk
 (/dev/ad6 -a -S on -s (S/../.././03|L/../../7/03) -m root). The problem
 is that each time the test runs I get messages like the following in
 /var/log/messages:

 Oct 26 04:54:15 35 kernel: ad6: TIMEOUT - WRITE_DMA48 retrying (1 retry
 left) LBA=836986454
 Oct 26 04:54:25 35 kernel: ad6: TIMEOUT - WRITE_DMA48 retrying (0
 retries left) LBA=836986454
 Oct 26 04:54:25 35 kernel: ad6: FAILURE - WRITE_DMA48 timed out
 LBA=836986454
 Oct 26 04:54:25 35 kernel: g_vfs_done():ad6s2d[WRITE(offset=13150142464,
 length=16384)]error = 5

 And the SMART test results log on the disk contains line like this:

 # 1  Short offline   Interrupted (host reset)  00%   297
  -
 
 First and foremost, your above smartd.conf -s flags are conflicting.
 Your long offline test will never get run on Sunday; the short will run
 first, and the long won't ever start (because the short is already
 running).  I would recommend telling the short test to run only on
 days 1-6 (Monday through Saturday; smartd numbers Sunday as day 7),
 leaving Sunday solely for the long test.  (I noticed this
 because the above Interrupted test indicates a short test was
 interrupted and not a long).
Thanks, I had not noticed the overlap at all.

 
 Second, your short offline test runs at 0300, but the errors you're
 seeing are at 0454 in the morning.  A short offline test does not
 take 2 hours to run -- they take between 2-10 minutes -- unless the
 system is also in the middle of doing a lot of I/O, in which case the
 short test will be suspended.
 
 There are cronjobs (specifically periodic jobs) that run starting at
 0301 in the morning (periodic daily), and many of those are I/O bound.
 This could possibly extend the length of the short test until 0454.
 
 Weekly periodic jobs run at 0415 in the morning, on Sundays.  These also
 perform a lot of disk I/O, so it's possible that on Sunday specifically
 the short SMART test gets pushed back quite some time.
 
 Third, the DMA timeouts you're seeing are possibly caused by the drive
 taking too long when internally suspending the SMART test.
 
 In most cases, it's safe for SMART tests (short and long) to be run
 while the machine is operational, and disk I/O requests are being
 performed.  When an I/O request comes and the disk is in the middle of
 performing a SMART test, the drive has to stop the SMART test (e.g.
 suspend it), complete the I/O request, then resume the SMART test.
 
 The FreeBSD ATA layer has a 5 second timeout on I/O requests; if it
 doesn't receive an acknowledgement back from the controller (disk)
 within 5 seconds, it'll report a timeout on whatever operation it was
 performing.  I'm thinking the disk gets stuck in a "do the offline
 test, no wait stop there's an I/O request, okay it's done continue the
 test, no wait stop there's another I/O" loop.
Can I make the timeout higher? For the sake of elimination.

 
 Another possibility is that your drive really *does* have a bad block at
 LBA 836986454, and that one of those cron/periodic jobs is what's
 noticing it, and that upon noticing a bad block, the drive more or less
 aborts the SMART test to perform internal remapping of the block.
 
 To confirm this, you would need to boot the SeaTools utilities from DOS
 or from a CD (see Seagate's site) and run a full sector scan (NOT the
 quick test).  This takes a few hours.  Assuming it comes back clean,
 then my above claim of the offline test taking too long to suspend is
 probably the case.
 
 Possibly this is a firmware bug in the drive -- you might consider
 mailing Seagate about this problem, although I'm doubting their Tier 1
 support will understand what the issue is.
 
 Is the block number always the same?  Do you only see this error on
 Sundays?  These are two questions which might help narrow things down.
Nope, the LBA is always different and I see it in the logs once every day.

 
 This is on 7.1-PRERELEASE #0: Wed Oct 15 18:56:54 UTC 2008, with GENERIC
 kernel.

 Now, does the timeout cause loss of any data? Is there anything besides
 disabling the testing that I can do about it?
 
 Do you understand what short and long offline tests actually do and what
 they're used for?  :-)  If so, you'd know that running them periodically
 is more or less silly (IMHO).
I do not, not completely :) I think I have just copied the settings from
somewhere and only just tweaked it a bit whenever I have added a disk.

 
 If you're trying to accomplish a cheap version of disk scrubbing, e.g.
 scanning the entire disk for bad blocks and reporting them or having them
 automatically remapped by the drive, consider using sysutils/diskcheckd,
 which was made for this purpose.  However, be aware of a problem I've
 run 

Re: Short SMART check causes disk op timeouts

2008-10-27 Thread Jeremy Chadwick
On Mon, Oct 27, 2008 at 06:22:03PM +0100, Vaclav Haisman wrote:
 Jeremy Chadwick wrote:
  On Mon, Oct 27, 2008 at 11:16:59AM +0100, Vaclav Haisman wrote:
  Second, your short offline test runs at 0300, but the errors you're
  seeing are at 0454 in the morning.  A short offline test does not
  take 2 hours to run -- they take between 2-10 minutes -- unless the
  system is also in the middle of doing a lot of I/O, in which case the
  short test will be suspended.
  
  There are cronjobs (specifically periodic jobs) that run starting at
  0301 in the morning (periodic daily), and many of those are I/O bound.
  This could possibly extend the length of the short test until 0454.
  
  Weekly periodic jobs run at 0415 in the morning, on Sundays.  These also
  perform a lot of disk I/O, so it's possible that on Sunday specifically
  the short SMART test gets pushed back quite some time.
  
  Third, the DMA timeouts you're seeing are possibly caused by the drive
  taking too long when internally suspending the SMART test.
  
  In most cases, it's safe for SMART tests (short and long) to be run
  while the machine is operational, and disk I/O requests are being
  performed.  When an I/O request comes and the disk is in the middle of
  performing a SMART test, the drive has to stop the SMART test (e.g.
  suspend it), complete the I/O request, then resume the SMART test.
  
  The FreeBSD ATA layer has a 5 second timeout on I/O requests; if it
  doesn't receive an acknowledgement back from the controller (disk)
  within 5 seconds, it'll report a timeout on whatever operation it was
  performing.  I'm thinking the disk gets stuck in a "do the offline
  test, no wait stop there's an I/O request, okay it's done continue the
  test, no wait stop there's another I/O" loop.
 Can I make the timeout higher? For the sake of elimination.

You will have to make modifications to the ata(4) driver code, and
rebuild+reinstall your kernel.

There is a patch from the FreeNAS folks which turns the command timeout
value into a sysctl for tuning, but that patch has not been brought into
FreeBSD (any version) at this time.  You can find it referenced below
(see one of the "Workarounds" sections).  You will probably have to
apply the patch by hand rather than blindly using "patch < patchfile",
because the ATA code has changed since the patch was created.

http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting

  Another possibility is that your drive really *does* have a bad block at
  LBA 836986454, and that one of those cron/periodic jobs is what's
  noticing it, and that upon noticing a bad block, the drive more or less
  aborts the SMART test to perform internal remapping of the block.
  
  To confirm this, you would need to boot the SeaTools utilities from DOS
  or from a CD (see Seagate's site) and run a full sector scan (NOT the
  quick test).  This takes a few hours.  Assuming it comes back clean,
  then my above claim of the offline test taking too long to suspend is
  probably the case.
  
  Possibly this is a firmware bug in the drive -- you might consider
  mailing Seagate about this problem, although I'm doubting their Tier 1
  support will understand what the issue is.
  
  Is the block number always the same?  Do you only see this error on
  Sundays?  These are two questions which might help narrow things down.
 Nope, the LBA is always different and I see it in the logs once every day.

Okay, so that greatly diminishes the possibility of it being a bad
block.  I'd still advocate running SeaTools on the disk to ensure
everything is 100% okay (re: sake of elimination); chances are it will
pass with flying colours.

  This is on 7.1-PRERELEASE #0: Wed Oct 15 18:56:54 UTC 2008, with GENERIC
  kernel.
 
  Now, does the timeout cause loss of any data? Is there anything besides
  disabling the testing that I can do about it?
  
  Do you understand what short and long offline tests actually do and what
  they're used for?  :-)  If so, you'd know that running them periodically
  is more or less silly (IMHO).
 I do not, not completely :) I think I have just copied the settings from
 somewhere and only just tweaked it a bit whenever I have added a disk.

Let me know if you figure out who or what online resource solicited
adding daily short/long tests, as I'd like to talk to them about their
decision.  I have a feeling whoever thought it up felt that the tests
were performing entire sector scans of the entire disk, which is simply
not the case.

  If you're trying to accomplish a cheap version of disk scrubbing, e.g.
  scanning the entire disk for bad blocks and reporting them or having them
  automatically remapped by the drive, consider using sysutils/diskcheckd,
  which was made for this purpose.  However, be aware of a problem I've
  run into with it (still needs someone clueful to figure out why this
  happens):
  http://www.freebsd.org/cgi/query-pr.cgi?pr=ports/115853
  
  I do not advocate the use of periodic offline tests on 

Re: Short SMART check causes disk op timeouts

2008-10-27 Thread martinko

Jeremy Chadwick wrote:


Now, does the timeout cause loss of any data? Is there anything besides
disabling the testing that I can do about it?

Do you understand what short and long offline tests actually do and what
they're used for?  :-)  If so, you'd know that running them periodically
is more or less silly (IMHO).

I do not, not completely :) I think I have just copied the settings from
somewhere and only just tweaked it a bit whenever I have added a disk.


Let me know if you figure out who or what online resource solicited
adding daily short/long tests, as I'd like to talk to them about their
decision.  I have a feeling whoever thought it up felt that the tests
were performing entire sector scans of the entire disk, which is simply
not the case.



Hallo,

Reading this thread I checked my config to find this: ;-)

#/dev/ad0 -a -n standby,q -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato
/dev/ad0 -a -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato


I believe I came up with the settings after reading manual page / 
documentation of the tool.


Regards,

Martin



Re: Short SMART check causes disk op timeouts

2008-10-27 Thread Jeremy Chadwick
On Mon, Oct 27, 2008 at 07:52:01PM +0100, martinko wrote:
 Jeremy Chadwick wrote:

 Now, does the timeout cause loss of any data? Is there anything besides
 disabling the testing that I can do about it?
 Do you understand what short and long offline tests actually do and what
 they're used for?  :-)  If so, you'd know that running them periodically
 is more or less silly (IMHO).
 I do not, not completely :) I think I have just copied the settings from
 somewhere and only just tweaked it a bit whenever I have added a disk.

 Let me know if you figure out who or what online resource solicited
 adding daily short/long tests, as I'd like to talk to them about their
 decision.  I have a feeling whoever thought it up felt that the tests
 were performing entire sector scans of the entire disk, which is simply
 not the case.


 Hallo,

 Reading this thread I checked my config to find this: ;-)

 #/dev/ad0 -a -n standby,q -o on -S on -s (S/../.././02|L/../../7/03) -m  
 root# ++ 2006-11-03 mato
 /dev/ad0 -a -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++  
 2006-11-03 mato

 I believe I came up with the settings after reading manual page /  
 documentation of the tool.

Can you explain why you're doing this?  So far no one's provided a
reason *why* they're doing short and long offline scans on a daily
basis.  I'm under the impression the conclusion was reached like this:
"man smartd.conf ... oh, -s, a neat thing, let's enable it."

There are negative repercussions to doing tests of this nature at such
regular intervals.  Once-a-week is borderline acceptable; once a month
would be quite reasonable.  I'd love to know what kind of effect daily
tests have on MTBF; I can imagine it's reached much sooner with this.

The main point of smartd is to monitor SMART attribute changes.  If
you're concerned about the health of your hard disk, you should be
looking at your logs and not relying on things like automatic short/long
tests.  Most SMART attributes are updated immediately and not during an
offline test, and all of those attribute changes will be logged.
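
As a rough illustration of "looking at your logs", here is a Python sketch that pulls attribute-change messages out of syslog lines. The message shape ("Device: ..., SMART Prefailure|Usage Attribute: ID NAME changed from X to Y") is an assumption about smartd's log output, and the sample lines, host name, and values below are made up for the example; check them against what your smartd actually logs before relying on the pattern.

```python
import re

# Assumed shape of smartd's attribute-change messages (verify against real logs).
CHANGE_RE = re.compile(
    r"Device: (?P<dev>[^,\s]+), SMART (?P<kind>Prefailure|Usage) Attribute: "
    r"(?P<id>\d+) (?P<name>\S+) changed from (?P<old>\d+) to (?P<new>\d+)"
)


def attribute_changes(lines):
    """Return (device, kind, id, name, old, new) for every attribute change."""
    changes = []
    for line in lines:
        m = CHANGE_RE.search(line)
        if m:
            changes.append(
                (m["dev"], m["kind"], int(m["id"]), m["name"],
                 int(m["old"]), int(m["new"]))
            )
    return changes


def prefailure_drops(changes):
    """Keep only pre-failure attributes whose normalized value went down."""
    return [c for c in changes if c[1] == "Prefailure" and c[5] < c[4]]


# Hypothetical log excerpt, for illustration only.
sample_log = [
    "Oct 27 03:32:11 host smartd[612]: Device: /dev/ad6, SMART Usage Attribute: "
    "194 Temperature_Celsius changed from 112 to 111",
    "Oct 27 04:02:11 host smartd[612]: Device: /dev/ad6, SMART Prefailure Attribute: "
    "5 Reallocated_Sector_Ct changed from 100 to 98",
    "Oct 27 04:02:12 host kernel: ad6: TIMEOUT - WRITE_DMA48 retrying (1 retry left) "
    "LBA=836986454",
]

for change in prefailure_drops(attribute_changes(sample_log)):
    print(change)
```

Temperature changes are routine noise; a falling pre-failure attribute is the kind of entry worth a closer look.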

-- 
| Jeremy Chadwick                                  jdc at parodius.com |
| Parodius Networking                         http://www.parodius.com/ |
| UNIX Systems Administrator                    Mountain View, CA, USA |
| Making life hard for others since 1977.                PGP: 4BD6C0CB |



Re: Short SMART check causes disk op timeouts

2008-10-27 Thread Miroslav Lachman

Jeremy Chadwick wrote:


On Mon, Oct 27, 2008 at 06:22:03PM +0100, Vaclav Haisman wrote:


Jeremy Chadwick wrote:


On Mon, Oct 27, 2008 at 11:16:59AM +0100, Vaclav Haisman wrote:
Second, your short offline test runs at 0300, but the errors you're
seeing are at 0454 in the morning.  A short offline test does not
take 2 hours to run -- they take between 2-10 minutes -- unless the
system is also in the middle of doing a lot of I/O, in which case the
short test will be suspended.

There are cronjobs (specifically periodic jobs) that run starting at
0301 in the morning (periodic daily), and many of those are I/O bound.
This could possibly extend the length of the short test until 0454.

Weekly periodic jobs run at 0415 in the morning, on Sundays.  These also
perform a lot of disk I/O, so it's possible that on Sunday specifically
the short SMART test gets pushed back quite some time.

Third, the DMA timeouts you're seeing are possibly caused by the drive
taking too long when internally suspending the SMART test.

In most cases, it's safe for SMART tests (short and long) to be run
while the machine is operational, and disk I/O requests are being
performed.  When an I/O request comes and the disk is in the middle of
performing a SMART test, the drive has to stop the SMART test (e.g.
suspend it), complete the I/O request, then resume the SMART test.

The FreeBSD ATA layer has a 5 second timeout on I/O requests; if it
doesn't receive an acknowledgement back from the controller (disk)
within 5 seconds, it'll report a timeout on whatever operation it was
performing.  I'm thinking the disk gets stuck in a "do the offline
test, no wait stop there's an I/O request, okay it's done continue the
test, no wait stop there's another I/O" loop.


Can I make the timeout higher? For the sake of elimination.



You will have to make modifications to the ata(4) driver code, and
rebuild+reinstall your kernel.

There is a patch from the FreeNAS folks which turns the command timeout
value into a sysctl for tuning, but that patch has not been brought into
FreeBSD (any version) at this time.  You can find it referenced below
(see one of the "Workarounds" sections).  You will probably have to
apply the patch by hand rather than blindly using "patch < patchfile",
because the ATA code has changed since the patch was created.

http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting



Another possibility is that your drive really *does* have a bad block at
LBA 836986454, and that one of those cron/periodic jobs is what's
noticing it, and that upon noticing a bad block, the drive more or less
aborts the SMART test to perform internal remapping of the block.

To confirm this, you would need to boot the SeaTools utilities from DOS
or from a CD (see Seagate's site) and run a full sector scan (NOT the
quick test).  This takes a few hours.  Assuming it comes back clean,
then my above claim of the offline test taking too long to suspend is
probably the case.

Possibly this is a firmware bug in the drive -- you might consider
mailing Seagate about this problem, although I'm doubting their Tier 1
support will understand what the issue is.

Is the block number always the same?  Do you only see this error on
Sundays?  These are two questions which might help narrow things down.


Nope, the LBA is always different and I see it in the logs once every day.



Okay, so that greatly diminishes the possibility of it being a bad
block.  I'd still advocate running SeaTools on the disk to ensure
everything is 100% okay (re: sake of elimination); chances are it will
pass with flying colours.



This is on 7.1-PRERELEASE #0: Wed Oct 15 18:56:54 UTC 2008, with GENERIC
kernel.

Now, does the timeout cause loss of any data? Is there anything besides
disabling the testing that I can do about it?


Do you understand what short and long offline tests actually do and what
they're used for?  :-)  If so, you'd know that running them periodically
is more or less silly (IMHO).


I do not, not completely :) I think I have just copied the settings from
somewhere and only just tweaked it a bit whenever I have added a disk.



Let me know if you figure out who or what online resource solicited
adding daily short/long tests, as I'd like to talk to them about their
decision.  I have a feeling whoever thought it up felt that the tests
were performing entire sector scans of the entire disk, which is simply
not the case.


It looks like a slightly modified example from smartd.conf.sample

# First (primary) ATA/IDE hard disk.  Monitor all attributes, enable
# automatic online data collection, automatic Attribute autosave, and
# start a short self-test every day between 2-3am, and a long self test
# Saturdays between 3-4am.
#/dev/hda -a -o on -S on -s (S/../.././02|L/../../6/03)
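
To check what that sample expression schedules over a whole week, here is a short Python sketch. It assumes the T/MM/DD/d/HH string format described in the smartd.conf documentation (d is 1 for Monday through 7 for Sunday) and uses Python's re module in place of smartd's own regex matching; week_schedule is a hypothetical helper name.

```python
import re
from datetime import datetime, timedelta


def week_schedule(schedule, monday, test_types="LS"):
    """List (type, 'YYYY-MM-DD HH:00', weekday) for every match in one week."""
    runs = []
    for hour_offset in range(7 * 24):
        when = monday + timedelta(hours=hour_offset)
        for test_type in test_types:
            # Assumed smartd format: type/month/day-of-month/weekday(1-7)/hour
            candidate = f"{test_type}/{when:%m}/{when:%d}/{when.isoweekday()}/{when:%H}"
            if re.fullmatch(schedule, candidate):
                runs.append((test_type, when.strftime("%Y-%m-%d %H:00"), when.isoweekday()))
    return runs


sample_schedule = r"(S/../.././02|L/../../6/03)"
runs = week_schedule(sample_schedule, datetime(2008, 10, 27))  # a Monday

for run in runs:
    print(run)
```

For this expression the short test matches at 02 on each of the seven days and the long test matches once, on Saturday at 03, so the two never fall in the same hour, unlike the Sunday 03 overlap discussed earlier in the thread.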

I am using a similar config without problems:

/dev/ad4 -a -o on -S on -m root -M test -M diminishing -s (S/../.././01|L/../../(3|6)/05) -t -I 194
/dev/ad6 -a -o on -S on -m root -M test -M diminishing 

Re: Short SMART check causes disk op timeouts

2008-10-27 Thread martinko

Jeremy Chadwick wrote:

On Mon, Oct 27, 2008 at 07:52:01PM +0100, martinko wrote:

Jeremy Chadwick wrote:

Now, does the timeout cause loss of any data? Is there anything besides
disabling the testing that I can do about it?

Do you understand what short and long offline tests actually do and what
they're used for?  :-)  If so, you'd know that running them periodically
is more or less silly (IMHO).

I do not, not completely :) I think I have just copied the settings from
somewhere and only just tweaked it a bit whenever I have added a disk.

Let me know if you figure out who or what online resource solicited
adding daily short/long tests, as I'd like to talk to them about their
decision.  I have a feeling whoever thought it up felt that the tests
were performing entire sector scans of the entire disk, which is simply
not the case.


Hallo,

Reading this thread I checked my config to find this: ;-)

#/dev/ad0 -a -n standby,q -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato
/dev/ad0 -a -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato


I believe I came up with the settings after reading manual page /  
documentation of the tool.


Can you explain why you're doing this?  So far no one's provided a
reason *why* they're doing short and long offline scans on a daily
basis.  I'm under the impression the conclusion was reached like this:
man smartd.conf ... oh, -s, a neat thing, let's enable it.

There are negative repercussions to doing tests of this nature at such
regular intervals.  Once-a-week is borderline acceptable; once a month
would be quite reasonable.  I'd love to know what kind of effect daily
tests have on MTBF; I can imagine it's reached much sooner with this.

The main point of smartd is to monitor SMART attribute changes.  If
you're concerned about the health of your hard disk, you should be
looking at your logs and not relying on things like automatic short/long
tests.  Most SMART attributes are updated immediately and not during an
offline test, and all of those attribute changes will be logged.



You asked Miroslav about source of his configuration.  And as it is very 
similar to mine I think we both have it from smartd documentation. 
Where else to look for information?  It's a usual source.  So if you 
think it's wrong please contact the authors, we're obviously just users.

Thanks.

M.

PS: Btw, long offline scan is scheduled on weekly basis, not daily. If 
it's good or not I do not know.


___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to [EMAIL PROTECTED]


Re: Short SMART check causes disk op timeouts

2008-10-27 Thread martinko

martinko wrote:

Jeremy Chadwick wrote:

On Mon, Oct 27, 2008 at 07:52:01PM +0100, martinko wrote:

Jeremy Chadwick wrote:
Now, does the timeout cause loss of any data? Is there anything besides
disabling the testing that I can do about it?

Do you understand what short and long offline tests actually do and what
they're used for?  :-)  If so, you'd know that running them periodically
is more or less silly (IMHO).

I do not, not completely :) I think I have just copied the settings from
somewhere and only just tweaked it a bit whenever I have added a disk.

Let me know if you figure out who or what online resource solicited
adding daily short/long tests, as I'd like to talk to them about their
decision.  I have a feeling whoever thought it up felt that the tests
were performing entire sector scans of the entire disk, which is simply
not the case.


Hallo,

Reading this thread I checked my config to find this: ;-)

#/dev/ad0 -a -n standby,q -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato
/dev/ad0 -a -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato


I believe I came up with the settings after reading manual page /  
documentation of the tool.


Can you explain why you're doing this?  So far no one's provided a
reason *why* they're doing short and long offline scans on a daily
basis.  I'm under the impression the conclusion was reached like this:
man smartd.conf ... oh, -s, a neat thing, let's enable it.

There are negative repercussions to doing tests of this nature at such
regular intervals.  Once-a-week is borderline acceptable; once a month
would be quite reasonable.  I'd love to know what kind of effect daily
tests have on MTBF; I can imagine it's reached much sooner with this.

The main point of smartd is to monitor SMART attribute changes.  If
you're concerned about the health of your hard disk, you should be
looking at your logs and not relying on things like automatic short/long
tests.  Most SMART attributes are updated immediately and not during an
offline test, and all of those attribute changes will be logged.



You asked Miroslav about source of his configuration.  And as it is very 


 I meant Vaclav, of course, Miroslav's email just arrived. :)

similar to mine I think we both have it from smartd documentation. Where 
else to look for information?  It's a usual source.  So if you think 
it's wrong please contact the authors, we're obviously just users.

Thanks.

M.

PS: Btw, long offline scan is scheduled on weekly basis, not daily. If 
it's good or not I do not know.






Re: Short SMART check causes disk op timeouts

2008-10-27 Thread Jeremy Chadwick
On Mon, Oct 27, 2008 at 08:50:44PM +0100, martinko wrote:
 Jeremy Chadwick wrote:
 On Mon, Oct 27, 2008 at 07:52:01PM +0100, martinko wrote:
 Jeremy Chadwick wrote:
 Now, does the timeout cause loss of any data? Is there anything besides
 disabling the testing that I can do about it?
 Do you understand what short and long offline tests actually do and what
 they're used for?  :-)  If so, you'd know that running them periodically
 is more or less silly (IMHO).
 I do not, not completely :) I think I have just copied the settings from
 somewhere and only just tweaked it a bit whenever I have added a disk.
 Let me know if you figure out who or what online resource solicited
 adding daily short/long tests, as I'd like to talk to them about their
 decision.  I have a feeling whoever thought it up felt that the tests
 were performing entire sector scans of the entire disk, which is simply
 not the case.

 Hallo,

 Reading this thread I checked my config to find this: ;-)

 #/dev/ad0 -a -n standby,q -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato
 /dev/ad0 -a -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato

 I believe I came up with the settings after reading manual page /   
 documentation of the tool.

 Can you explain why you're doing this?  So far no one's provided a
 reason *why* they're doing short and long offline scans on a daily
 basis.  I'm under the impression the conclusion was reached like this:
 man smartd.conf ... oh, -s, a neat thing, let's enable it.

 There are negative repercussions to doing tests of this nature at such
 regular intervals.  Once-a-week is borderline acceptable; once a month
 would be quite reasonable.  I'd love to know what kind of effect daily
 tests have on MTBF; I can imagine it's reached much sooner with this.

 The main point of smartd is to monitor SMART attribute changes.  If
 you're concerned about the health of your hard disk, you should be
 looking at your logs and not relying on things like automatic short/long
 tests.  Most SMART attributes are updated immediately and not during an
 offline test, and all of those attribute changes will be logged.


 You asked Miroslav about source of his configuration.  And as it is very  
 similar to mine I think we both have it from smartd documentation. Where 
 else to look for information?  It's a usual source.  So if you think it's 
 wrong please contact the authors, we're obviously just users.
 Thanks.

I'm not asking *where* you got the information from (we know where you
and others got it from: the documentation).  I'm asking you *why* you
enabled what you did, because this is not something smartd.conf enables
by default (the example is commented out).

If you *really* want me to talk to Bruce about this, I can/will, but I'm
left with the impression that the example in smartd.conf is there to
show people the syntactical usage of -o, and not to advocate its usage.

 PS: Btw, long offline scan is scheduled on weekly basis, not daily. If  
 it's good or not I do not know.

The OP's long scan is also scheduled on a weekly basis (every Sunday),
but his short scan trumps it.

Folks, the point I'm trying to make here is that daily -- and even
weekly -- SMART offline tests are unnecessary.  If you're that concerned
about your disk health, you should be looking at your syslog logs for
attribute changes that indicate drive issues.  Performing SMART offline
tests at regular intervals like this does very little other than
increase wear/tear on drive components (not necessarily the physical
platters/heads; there are many pieces to a hard disk.  :-) )

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |



Re: Short SMART check causes disk op timeouts

2008-10-27 Thread Miroslav Lachman

Jeremy Chadwick wrote:


On Mon, Oct 27, 2008 at 08:50:44PM +0100, martinko wrote:


Jeremy Chadwick wrote:


On Mon, Oct 27, 2008 at 07:52:01PM +0100, martinko wrote:


Jeremy Chadwick wrote:


Now, does the timeout cause loss of any data? Is there anything besides
disabling the testing that I can do about it?


Do you understand what short and long offline tests actually do and what
they're used for?  :-)  If so, you'd know that running them periodically
is more or less silly (IMHO).


I do not, not completely :) I think I have just copied the settings from
somewhere and only just tweaked it a bit whenever I have added a disk.


Let me know if you figure out who or what online resource solicited
adding daily short/long tests, as I'd like to talk to them about their
decision.  I have a feeling whoever thought it up felt that the tests
were performing entire sector scans of the entire disk, which is simply
not the case.



Hallo,

Reading this thread I checked my config to find this: ;-)

#/dev/ad0 -a -n standby,q -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato
/dev/ad0 -a -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato


I believe I came up with the settings after reading manual page /   
documentation of the tool.


Can you explain why you're doing this?  So far no one's provided a
reason *why* they're doing short and long offline scans on a daily
basis.  I'm under the impression the conclusion was reached like this:
man smartd.conf ... oh, -s, a neat thing, let's enable it.

There are negative repercussions to doing tests of this nature at such
regular intervals.  Once-a-week is borderline acceptable; once a month
would be quite reasonable.  I'd love to know what kind of effect daily
tests have on MTBF; I can imagine it's reached much sooner with this.

The main point of smartd is to monitor SMART attribute changes.  If
you're concerned about the health of your hard disk, you should be
looking at your logs and not relying on things like automatic short/long
tests.  Most SMART attributes are updated immediately and not during an
offline test, and all of those attribute changes will be logged.



You asked Miroslav about source of his configuration.  And as it is very  
similar to mine I think we both have it from smartd documentation. Where 
else to look for information?  It's a usual source.  So if you think it's 
wrong please contact the authors, we're obviously just users.

Thanks.



I'm not asking *where* you got the information from (we know where you
and others got it from: the documentation).  I'm asking you *why* you
enabled what you did, because this is not something smartd.conf enables
by default (the example is commented out).

If you *really* want me to talk to Bruce about this, I can/will, but I'm
left with the impression that the example in smartd.conf is there to
show people the syntactical usage of -o, and not to advocate its usage.


PS: Btw, long offline scan is scheduled on weekly basis, not daily. If  
it's good or not I do not know.



The OP's long scan is also scheduled on a weekly basis (every Sunday),
but his short scan trumps it.

Folks, the point I'm trying to make here is that daily -- and even
weekly -- SMART offline tests are unnecessary.  If you're that concerned
about your disk health, you should be looking at your syslog logs for
attribute changes that indicate drive issues.  Performing SMART offline
tests at regular intervals like this does very little other than
increase wear/tear on drive components (not necessarily the physical
platters/heads; there are many pieces to a hard disk.  :-) )


I started using smartd more than three years ago and have not changed my
configs since; I just copy them to every new server, so I can't say why I
had the feeling that a daily short test and a weekly long test were the
right way.
Do you have a link to a brief overview where we can read about best
practices with smartd? Or may I just change the config to run the short
test once a week and the long test once a month?


Miroslav Lachman

PS: all examples in smartd.conf.sample are commented out (DEVICESCAN is
the default), but almost all of the examples include a weekly long test,
which may be what led us to conclude that a weekly long test is good.
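
To make the difference between the two schedules concrete, here is a rough
sketch that counts the hours in which a -s pattern would match over a month.
It assumes the T/MM/DD/d/HH format from the smartd.conf man page (day of week
1 = Monday through 7 = Sunday) and uses Python's re module as a stand-in for
smartd's own regex matching. The RELAXED pattern (short test on Sundays, long
test on the 1st of each month) is only a hypothetical illustration of "short
once a week, long once a month", not a setting anyone in this thread posted.

```python
import re
from datetime import datetime, timedelta


def count_tests(pattern, start, days):
    """Count the hours in which each test type matches the -s pattern."""
    regex = re.compile(pattern)
    counts = {"S": 0, "L": 0}
    for hour in range(days * 24):
        when = start + timedelta(hours=hour)
        for test_type in counts:
            # Same T/MM/DD/d/HH layout the man page describes.
            key = (f"{test_type}/{when.month:02d}/{when.day:02d}/"
                   f"{when.isoweekday()}/{when.hour:02d}")
            if regex.fullmatch(key):
                counts[test_type] += 1
    return counts


DAILY = "(S/../.././02|L/../../7/03)"    # the /dev/ad0 line quoted in this thread
RELAXED = "(S/../../7/02|L/../01/./03)"  # hypothetical: weekly short, monthly long

november = datetime(2008, 11, 1)
print(count_tests(DAILY, november, 30))
print(count_tests(RELAXED, november, 30))
```

For November 2008 the first pattern matches on 30 short-test hours and 5
long-test hours, the second on 5 and 1.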



Re: Short SMART check causes disk op timeouts

2008-10-27 Thread Alexandre Sunny Kovalenko
On Mon, 2008-10-27 at 10:53 -0700, Jeremy Chadwick wrote:
 On Mon, Oct 27, 2008 at 06:22:03PM +0100, Vaclav Haisman wrote:
  Jeremy Chadwick wrote:
skipped
   Do you understand what short and long offline tests actually do and what
   they're used for?  :-)  If so, you'd know that running them periodically
   is more or less silly (IMHO).
  I do not, not completely :) I think I have just copied the settings from
  somewhere and only just tweaked it a bit whenever I have added a disk.
 
 Let me know if you figure out who or what online resource solicited
 adding daily short/long tests, as I'd like to talk to them about their
 decision.  I have a feeling whoever thought it up felt that the tests
 were performing entire sector scans of the entire disk, which is simply
 not the case.
While I am not the OP, one such place would be the example configuration
file in 'man smartd.conf'.

HTH,
-- 
Alexandre Sunny Kovalenko (Олександр Коваленко)



Re: Short SMART check causes disk op timeouts

2008-10-27 Thread Alexandre Sunny Kovalenko
On Mon, 2008-10-27 at 21:45 +0100, Miroslav Lachman wrote:
 Jeremy Chadwick wrote:
 
  On Mon, Oct 27, 2008 at 08:50:44PM +0100, martinko wrote:
  
 Jeremy Chadwick wrote:
 
 On Mon, Oct 27, 2008 at 07:52:01PM +0100, martinko wrote:
 
 Jeremy Chadwick wrote:
 
 Now, does the timeout cause loss of any data? Is there anything 
 besides
 disabling the testing that I can do about it?
 
 Do you understand what short and long offline tests actually do and 
 what
 they're used for?  :-)  If so, you'd know that running them 
 periodically
 is more or less silly (IMHO).
 
 I do not, not completely :) I think I have just copied the settings from
 somewhere and only just tweaked it a bit whenever I have added a disk.
 
 Let me know if you figure out who or what online resource solicited
 adding daily short/long tests, as I'd like to talk to them about their
 decision.  I have a feeling whoever thought it up felt that the tests
 were performing entire sector scans of the entire disk, which is simply
 not the case.
 
 
 Hallo,
 
 Reading this thread I checked my config to find this: ;-)
 
 #/dev/ad0 -a -n standby,q -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato
 /dev/ad0 -a -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato
 
 I believe I came up with the settings after reading manual page /   
 documentation of the tool.
 
 Can you explain why you're doing this?  So far no one's provided a
 reason *why* they're doing short and long offline scans on a daily
 basis.  I'm under the impression the conclusion was reached like this:
 man smartd.conf ... oh, -s, a neat thing, let's enable it.
 
 There are negative repercussions to doing tests of this nature at such
 regular intervals.  Once-a-week is borderline acceptable; once a month
 would be quite reasonable.  I'd love to know what kind of effect daily
 tests have on MTBF; I can imagine it's reached much sooner with this.
 
 The main point of smartd is to monitor SMART attribute changes.  If
 you're concerned about the health of your hard disk, you should be
 looking at your logs and not relying on things like automatic short/long
 tests.  Most SMART attributes are updated immediately and not during an
 offline test, and all of those attribute changes will be logged.
 
 
 You asked Miroslav about source of his configuration.  And as it is very  
 similar to mine I think we both have it from smartd documentation. Where 
 else to look for information?  It's a usual source.  So if you think it's 
 wrong please contact the authors, we're obviously just users.
 Thanks.
  
  
  I'm not asking *where* you got the information from (we know where you
  and others got it from: the documentation).  I'm asking you *why* you
  enabled what you did, because this is not something smartd.conf enables
  by default (the example is commented out).
  
  If you *really* want me to talk to Bruce about this, I can/will, but I'm
  left with the impression that the example in smartd.conf is there to
  show people the syntactical usage of -o, and not to advocate its usage.
  
  
 PS: Btw, long offline scan is scheduled on weekly basis, not daily. If  
 it's good or not I do not know.
  
  
  The OP's long scan is also scheduled on a weekly basis (every Sunday),
  but his short scan trumps it.
  
  Folks, the point I'm trying to make here is that daily -- and even
  weekly -- SMART offline tests are unnecessary.  If you're that concerned
  about your disk health, you should be looking at your syslog logs for
  attribute changes that indicate drive issues.  Performing SMART offline
  tests at regular intervals like this does very little other than
  increase wear/tear on drive components (not necessarily the physical
  platters/heads; there are many pieces to a hard disk.  :-) )
 
 It is more than three years ago when I started to use smartd and I did 
 not change my configs from that time, just copy it to all the new 
 servers, so I can't tell why I had feeling that daily short and weekly 
 long test is the right way.
 Do you have some link to brief overview, where we can read something 
 about the best practices with smartd? Or may I just change the config 
 to do short test once a week and long test once a month?
 
 Miroslav Lachman
 
 PS: all examples in smartd.conf.sample are commented out (DEVICESCAN is 
 the default), but almost all of the examples have weekly long test, this 
 may lead to our conclusion weekly long test is good
They are *not* commented out in the example configuration found while
reading the man page. All of the examples in the man page use daily
short and weekly long offline tests. Since it would have been just as
illustrative to depict monthly short and annual long tests, most
readers assumed that this is The Good Thing. If it indeed is not, someone
should kindly ask the man page author to use different frequencies or, at
least, vary them from example to example, so they are less suggestive.


-- 
Alexandre Sunny 

Re: Short SMART check causes disk op timeouts

2008-10-27 Thread Jeremy Chadwick
On Mon, Oct 27, 2008 at 05:30:26PM -0400, Alexandre Sunny Kovalenko wrote:
 On Mon, 2008-10-27 at 21:45 +0100, Miroslav Lachman wrote:
  Jeremy Chadwick wrote:
  
   On Mon, Oct 27, 2008 at 08:50:44PM +0100, martinko wrote:
   
  Jeremy Chadwick wrote:
  
  On Mon, Oct 27, 2008 at 07:52:01PM +0100, martinko wrote:
  
  Jeremy Chadwick wrote:
  
  Now, does the timeout cause loss of any data? Is there anything 
  besides
  disabling the testing that I can do about it?
  
  Do you understand what short and long offline tests actually do and 
  what
  they're used for?  :-)  If so, you'd know that running them 
  periodically
  is more or less silly (IMHO).
  
  I do not, not completely :) I think I have just copied the settings 
  from
  somewhere and only just tweaked it a bit whenever I have added a disk.
  
  Let me know if you figure out who or what online resource solicited
  adding daily short/long tests, as I'd like to talk to them about their
  decision.  I have a feeling whoever thought it up felt that the tests
  were performing entire sector scans of the entire disk, which is simply
  not the case.
  
  
  Hallo,
  
  Reading this thread I checked my config to find this: ;-)
  
  #/dev/ad0 -a -n standby,q -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato
  /dev/ad0 -a -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato
  
  I believe I came up with the settings after reading manual page /   
  documentation of the tool.
  
  Can you explain why you're doing this?  So far no one's provided a
  reason *why* they're doing short and long offline scans on a daily
  basis.  I'm under the impression the conclusion was reached like this:
  man smartd.conf ... oh, -s, a neat thing, let's enable it.
  
  There are negative repercussions to doing tests of this nature at such
  regular intervals.  Once-a-week is borderline acceptable; once a month
  would be quite reasonable.  I'd love to know what kind of effect daily
  tests have on MTBF; I can imagine it's reached much sooner with this.
  
  The main point of smartd is to monitor SMART attribute changes.  If
  you're concerned about the health of your hard disk, you should be
  looking at your logs and not relying on things like automatic short/long
  tests.  Most SMART attributes are updated immediately and not during an
  offline test, and all of those attribute changes will be logged.
  
  
  You asked Miroslav about source of his configuration.  And as it is very  
  similar to mine I think we both have it from smartd documentation. Where 
  else to look for information?  It's a usual source.  So if you think it's 
  wrong please contact the authors, we're obviously just users.
  Thanks.
   
   
   I'm not asking *where* you got the information from (we know where you
   and others got it from: the documentation).  I'm asking you *why* you
   enabled what you did, because this is not something smartd.conf enables
   by default (the example is commented out).
   
   If you *really* want me to talk to Bruce about this, I can/will, but I'm
   left with the impression that the example in smartd.conf is there to
   show people the syntactical usage of -o, and not to advocate its usage.
   
   
  PS: Btw, long offline scan is scheduled on weekly basis, not daily. If  
  it's good or not I do not know.
   
   
   The OP's long scan is also scheduled on a weekly basis (every Sunday),
   but his short scan trumps it.
   
   Folks, the point I'm trying to make here is that daily -- and even
   weekly -- SMART offline tests are unnecessary.  If you're that concerned
   about your disk health, you should be looking at your syslog logs for
   attribute changes that indicate drive issues.  Performing SMART offline
   tests at regular intervals like this does very little other than
   increase wear/tear on drive components (not necessarily the physical
   platters/heads; there are many pieces to a hard disk.  :-) )
  
  It is more than three years ago when I started to use smartd and I did 
  not change my configs from that time, just copy it to all the new 
  servers, so I can't tell why I had feeling that daily short and weekly 
  long test is the right way.
  Do you have some link to brief overview, where we can read something 
  about the best practices with smartd? Or may I just change the config 
  to do short test once a week and long test once a month?
  
  Miroslav Lachman
  
  PS: all examples in smartd.conf.sample are commented out (DEVICESCAN is 
  the default), but almost all of the examples have weekly long test, this 
  may lead to our conclusion weekly long test is good
 They are *not* commented out in the example configuration found while
 reading the man page. All of the examples in the man page use daily
 short and weekly long offline tests. Since it would have been as
 illustrative to depict monthly short and annual long tests, most of the
 readers assumed that it is The Good Thing. If it 

Re: Short SMART check causes disk op timeouts

2008-10-27 Thread Andrew Snow


IMO, a much better option to run on a weekly basis is a RAID controller
with a verify feature (e.g. 3ware) or a ZFS scrub.


