Re: Short SMART check causes disk op timeouts

2008-10-28 Thread Andriy Gapon
on 27/10/2008 21:59 Jeremy Chadwick said the following:
> On Mon, Oct 27, 2008 at 08:50:44PM +0100, martinko wrote:
>> Jeremy Chadwick wrote:
>>> On Mon, Oct 27, 2008 at 07:52:01PM +0100, martinko wrote:
 Jeremy Chadwick wrote:
 Now, does the timeout cause loss of any data? Is there anything besides
 disabling the testing that I can do about it?
>>> Do you understand what short and long offline tests actually do and what
>>> they're used for?  :-)  If so, you'd know that running them periodically
>>> is more or less silly (IMHO).
>> I do not, not completely :) I think I have just copied the settings from
>> somewhere and only just tweaked it a bit whenever I have added a disk.
> Let me know if you figure out who or what online resource solicited
> adding daily short/long tests, as I'd like to talk to them about their
> decision.  I have a feeling whoever thought it up felt that the tests
> were performing entire sector scans of the entire disk, which is simply
> not the case.
>
 Hallo,

 Reading this thread I checked my config to find this: ;-)

 #/dev/ad0 -a -n standby,q -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato
 /dev/ad0 -a -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato

 I believe I came up with the settings after reading manual page /   
 documentation of the tool.
>>> Can you explain why you're doing this?  So far no one's provided a
>>> reason *why* they're doing short and long offline scans on a daily
>>> basis.  I'm under the impression the conclusion was reached like this:
>>> "man smartd.conf ... oh, -s, a neat thing, let's enable it".
>>>
>>> There are negative repercussions to doing tests of this nature at such
>>> regular intervals.  Once-a-week is borderline acceptable; once a month
>>> would be quite reasonable.  I'd love to know what kind of effect daily
>>> tests have on MTBF; I can imagine it's reached much sooner with this.
>>>
>>> The main point of smartd is to monitor SMART attribute changes.  If
>>> you're concerned about the health of your hard disk, you should be
>>> looking at your logs and not relying on things like automatic short/long
>>> tests.  Most SMART attributes are updated immediately and not during an
>>> offline test, and all of those attribute changes will be logged.
>>>
>> You asked Miroslav about source of his configuration.  And as it is very  
>> similar to mine I think we both have it from smartd documentation. Where 
>> else to look for information?  It's a usual source.  So if you think it's 
>> wrong please contact the authors, we're obviously just users.
>> Thanks.
> 
> I'm not asking *where* you got the information from (we know where you
> and others got it from: the documentation).  I'm asking you *why* you
> enabled what you did, because this is not something smartd.conf enables
> by default (the example is commented out).
> 
> If you *really* want me to talk to Bruce about this, I can/will, but I'm
> left with the impression that the example in smartd.conf is there to
> show people the syntactical usage of -o, and not to advocate its usage.
> 
>> PS: Btw, long offline scan is scheduled on weekly basis, not daily. If  
>> it's good or not I do not know.
> 
> The OP's long scan is also scheduled on a weekly basis (every Sunday),
> but his short scan trumps it.
> 
> Folks, the point I'm trying to make here is that daily -- and even
> weekly -- SMART offline tests are unnecessary.  If you're that concerned
> about your disk health, you should be looking at your syslog logs for
> attribute changes that indicate drive issues.  Performing SMART offline
> tests at regular intervals like this does very little other than
> increase wear/tear on drive components (not necessarily the physical
> platters/heads; there are many pieces to a hard disk.  :-) )


BTW, I am not entirely sure which Bruce you mentioned above - I probably
missed something, but I found this post:
http://article.gmane.org/gmane.linux.utilities.smartmontools/1443

It was written a long time ago, and it warns about the same trade-off
that you do, but it may hint at the "source of authority" behind all the
configs floating around.

-- 
Andriy Gapon
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "[EMAIL PROTECTED]"


Re: Short SMART check causes disk op timeouts

2008-10-28 Thread JoaoBR
On Monday 27 October 2008 20:03:21 Jeremy Chadwick wrote:
> I had no idea users were blindly uncommenting examples in

well seems you're new in support business then :)

the issue might be the reason why weapons are not delivered with rounds in the 
chamber ... so developers probably should take care what kind of examples 
they include 

> smartd.conf.sample without reading what the features do.  Then again, I
> guess many users/admins have no idea what sort of impact offline tests
> could have on a system.  Short/long tests should not have any effect on
> a running/used disk -- and most do not see any effect -- but under high
> I/O I would assume there is a chance the suspend/resume aspect of SMART
> tests could take longer than 5 seconds.  Though I am disappointed in
> the fact that people often schedule "maintenance things" all at the same
> time (between 0200 and 0500) but never think about the implications of
> them all running in parallel.

well good idea to start a faq with general orientations :)


-- 

João









Re: Short SMART check causes disk op timeouts

2008-10-27 Thread Andrew Snow


IMO, A much better option to run on a weekly basis is to use a RAID 
controller with "verify" feature (eg. 3ware) or use ZFS "scrub" mode.





Re: Short SMART check causes disk op timeouts

2008-10-27 Thread Jeremy Chadwick
On Mon, Oct 27, 2008 at 05:30:26PM -0400, Alexandre Sunny Kovalenko wrote:
> On Mon, 2008-10-27 at 21:45 +0100, Miroslav Lachman wrote:
> > Jeremy Chadwick wrote:
> > 
> > > On Mon, Oct 27, 2008 at 08:50:44PM +0100, martinko wrote:
> > > 
> > >>Jeremy Chadwick wrote:
> > >>
> > >>>On Mon, Oct 27, 2008 at 07:52:01PM +0100, martinko wrote:
> > >>>
> > Jeremy Chadwick wrote:
> > 
> > Now, does the timeout cause loss of any data? Is there anything 
> > besides
> > disabling the testing that I can do about it?
> > >>>
> > >>>Do you understand what short and long offline tests actually do and 
> > >>>what
> > >>>they're used for?  :-)  If so, you'd know that running them 
> > >>>periodically
> > >>>is more or less silly (IMHO).
> > >>
> > >>I do not, not completely :) I think I have just copied the settings 
> > >>from
> > >>somewhere and only just tweaked it a bit whenever I have added a disk.
> > >
> > >Let me know if you figure out who or what online resource solicited
> > >adding daily short/long tests, as I'd like to talk to them about their
> > >decision.  I have a feeling whoever thought it up felt that the tests
> > >were performing entire sector scans of the entire disk, which is simply
> > >not the case.
> > >
> > 
> > Hallo,
> > 
> > Reading this thread I checked my config to find this: ;-)
> > 
> > #/dev/ad0 -a -n standby,q -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato
> > /dev/ad0 -a -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato
> > 
> > I believe I came up with the settings after reading manual page /   
> > documentation of the tool.
> > >>>
> > >>>Can you explain why you're doing this?  So far no one's provided a
> > >>>reason *why* they're doing short and long offline scans on a daily
> > >>>basis.  I'm under the impression the conclusion was reached like this:
> > >>>"man smartd.conf ... oh, -s, a neat thing, let's enable it".
> > >>>
> > >>>There are negative repercussions to doing tests of this nature at such
> > >>>regular intervals.  Once-a-week is borderline acceptable; once a month
> > >>>would be quite reasonable.  I'd love to know what kind of effect daily
> > >>>tests have on MTBF; I can imagine it's reached much sooner with this.
> > >>>
> > >>>The main point of smartd is to monitor SMART attribute changes.  If
> > >>>you're concerned about the health of your hard disk, you should be
> > >>>looking at your logs and not relying on things like automatic short/long
> > >>>tests.  Most SMART attributes are updated immediately and not during an
> > >>>offline test, and all of those attribute changes will be logged.
> > >>>
> > >>
> > >>You asked Miroslav about source of his configuration.  And as it is very  
> > >>similar to mine I think we both have it from smartd documentation. Where 
> > >>else to look for information?  It's a usual source.  So if you think it's 
> > >>wrong please contact the authors, we're obviously just users.
> > >>Thanks.
> > > 
> > > 
> > > I'm not asking *where* you got the information from (we know where you
> > > and others got it from: the documentation).  I'm asking you *why* you
> > > enabled what you did, because this is not something smartd.conf enables
> > > by default (the example is commented out).
> > > 
> > > If you *really* want me to talk to Bruce about this, I can/will, but I'm
> > > left with the impression that the example in smartd.conf is there to
> > > show people the syntactical usage of -o, and not to advocate its usage.
> > > 
> > > 
> > >>PS: Btw, long offline scan is scheduled on weekly basis, not daily. If  
> > >>it's good or not I do not know.
> > > 
> > > 
> > > The OP's long scan is also scheduled on a weekly basis (every Sunday),
> > > but his short scan trumps it.
> > > 
> > > Folks, the point I'm trying to make here is that daily -- and even
> > > weekly -- SMART offline tests are unnecessary.  If you're that concerned
> > > about your disk health, you should be looking at your syslog logs for
> > > attribute changes that indicate drive issues.  Performing SMART offline
> > > tests at regular intervals like this does very little other than
> > > increase wear/tear on drive components (not necessarily the physical
> > > platters/heads; there are many pieces to a hard disk.  :-) )
> > 
> > It is more than three years ago when I started to use smartd and I did 
> > not change my configs from that time, just copy it to all the new 
> > servers, so I can't tell why I had feeling that daily short and weekly 
> > long test is "the right way".
> > Do you have some link to brief overview, where we can read something 
> > about "the best practices" with smartd? Or may I just change the config 
> > to do short test once a week and long test once a month?
> > 
> > Miroslav Lachman
> > 
> > PS: all examples in smartd.conf.sample a

Re: Short SMART check causes disk op timeouts

2008-10-27 Thread Alexandre "Sunny" Kovalenko
On Mon, 2008-10-27 at 21:45 +0100, Miroslav Lachman wrote:
> Jeremy Chadwick wrote:
> 
> > On Mon, Oct 27, 2008 at 08:50:44PM +0100, martinko wrote:
> > 
> >>Jeremy Chadwick wrote:
> >>
> >>>On Mon, Oct 27, 2008 at 07:52:01PM +0100, martinko wrote:
> >>>
> Jeremy Chadwick wrote:
> 
> Now, does the timeout cause loss of any data? Is there anything 
> besides
> disabling the testing that I can do about it?
> >>>
> >>>Do you understand what short and long offline tests actually do and 
> >>>what
> >>>they're used for?  :-)  If so, you'd know that running them 
> >>>periodically
> >>>is more or less silly (IMHO).
> >>
> >>I do not, not completely :) I think I have just copied the settings from
> >>somewhere and only just tweaked it a bit whenever I have added a disk.
> >
> >Let me know if you figure out who or what online resource solicited
> >adding daily short/long tests, as I'd like to talk to them about their
> >decision.  I have a feeling whoever thought it up felt that the tests
> >were performing entire sector scans of the entire disk, which is simply
> >not the case.
> >
> 
> Hallo,
> 
> Reading this thread I checked my config to find this: ;-)
> 
> #/dev/ad0 -a -n standby,q -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato
> /dev/ad0 -a -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato
> 
> I believe I came up with the settings after reading manual page /   
> documentation of the tool.
> >>>
> >>>Can you explain why you're doing this?  So far no one's provided a
> >>>reason *why* they're doing short and long offline scans on a daily
> >>>basis.  I'm under the impression the conclusion was reached like this:
> >>>"man smartd.conf ... oh, -s, a neat thing, let's enable it".
> >>>
> >>>There are negative repercussions to doing tests of this nature at such
> >>>regular intervals.  Once-a-week is borderline acceptable; once a month
> >>>would be quite reasonable.  I'd love to know what kind of effect daily
> >>>tests have on MTBF; I can imagine it's reached much sooner with this.
> >>>
> >>>The main point of smartd is to monitor SMART attribute changes.  If
> >>>you're concerned about the health of your hard disk, you should be
> >>>looking at your logs and not relying on things like automatic short/long
> >>>tests.  Most SMART attributes are updated immediately and not during an
> >>>offline test, and all of those attribute changes will be logged.
> >>>
> >>
> >>You asked Miroslav about source of his configuration.  And as it is very  
> >>similar to mine I think we both have it from smartd documentation. Where 
> >>else to look for information?  It's a usual source.  So if you think it's 
> >>wrong please contact the authors, we're obviously just users.
> >>Thanks.
> > 
> > 
> > I'm not asking *where* you got the information from (we know where you
> > and others got it from: the documentation).  I'm asking you *why* you
> > enabled what you did, because this is not something smartd.conf enables
> > by default (the example is commented out).
> > 
> > If you *really* want me to talk to Bruce about this, I can/will, but I'm
> > left with the impression that the example in smartd.conf is there to
> > show people the syntactical usage of -o, and not to advocate its usage.
> > 
> > 
> >>PS: Btw, long offline scan is scheduled on weekly basis, not daily. If  
> >>it's good or not I do not know.
> > 
> > 
> > The OP's long scan is also scheduled on a weekly basis (every Sunday),
> > but his short scan trumps it.
> > 
> > Folks, the point I'm trying to make here is that daily -- and even
> > weekly -- SMART offline tests are unnecessary.  If you're that concerned
> > about your disk health, you should be looking at your syslog logs for
> > attribute changes that indicate drive issues.  Performing SMART offline
> > tests at regular intervals like this does very little other than
> > increase wear/tear on drive components (not necessarily the physical
> > platters/heads; there are many pieces to a hard disk.  :-) )
> 
> It is more than three years ago when I started to use smartd and I did 
> not change my configs from that time, just copy it to all the new 
> servers, so I can't tell why I had feeling that daily short and weekly 
> long test is "the right way".
> Do you have some link to brief overview, where we can read something 
> about "the best practices" with smartd? Or may I just change the config 
> to do short test once a week and long test once a month?
> 
> Miroslav Lachman
> 
> PS: all examples in smartd.conf.sample are commented out (DEVICESCAN is 
> the default), but almost all of the examples have weekly long test, this 
> may lead to our conclusion "weekly long test is good"
They are *not* commented out in the example configuration found while
reading the man page. All of the examples in the man page

Re: Short SMART check causes disk op timeouts

2008-10-27 Thread Alexandre "Sunny" Kovalenko
On Mon, 2008-10-27 at 10:53 -0700, Jeremy Chadwick wrote:
> On Mon, Oct 27, 2008 at 06:22:03PM +0100, Vaclav Haisman wrote:
> > Jeremy Chadwick wrote:

> > > Do you understand what short and long offline tests actually do and what
> > > they're used for?  :-)  If so, you'd know that running them periodically
> > > is more or less silly (IMHO).
> > I do not, not completely :) I think I have just copied the settings from
> > somewhere and only just tweaked it a bit whenever I have added a disk.
> 
> Let me know if you figure out who or what online resource solicited
> adding daily short/long tests, as I'd like to talk to them about their
> decision.  I have a feeling whoever thought it up felt that the tests
> were performing entire sector scans of the entire disk, which is simply
> not the case.
While I am not the OP, one such place would be example configuration
file in 'man smartd.conf'.

HTH,
-- 
Alexandre "Sunny" Kovalenko (Олександр Коваленко)



Re: Short SMART check causes disk op timeouts

2008-10-27 Thread Miroslav Lachman

Jeremy Chadwick wrote:


On Mon, Oct 27, 2008 at 08:50:44PM +0100, martinko wrote:


Jeremy Chadwick wrote:


On Mon, Oct 27, 2008 at 07:52:01PM +0100, martinko wrote:


Jeremy Chadwick wrote:


Now, does the timeout cause loss of any data? Is there anything besides
disabling the testing that I can do about it?


Do you understand what short and long offline tests actually do and what
they're used for?  :-)  If so, you'd know that running them periodically
is more or less silly (IMHO).


I do not, not completely :) I think I have just copied the settings from
somewhere and only just tweaked it a bit whenever I have added a disk.


Let me know if you figure out who or what online resource solicited
adding daily short/long tests, as I'd like to talk to them about their
decision.  I have a feeling whoever thought it up felt that the tests
were performing entire sector scans of the entire disk, which is simply
not the case.



Hallo,

Reading this thread I checked my config to find this: ;-)

#/dev/ad0 -a -n standby,q -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato
/dev/ad0 -a -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato


I believe I came up with the settings after reading manual page /   
documentation of the tool.


Can you explain why you're doing this?  So far no one's provided a
reason *why* they're doing short and long offline scans on a daily
basis.  I'm under the impression the conclusion was reached like this:
"man smartd.conf ... oh, -s, a neat thing, let's enable it".

There are negative repercussions to doing tests of this nature at such
regular intervals.  Once-a-week is borderline acceptable; once a month
would be quite reasonable.  I'd love to know what kind of effect daily
tests have on MTBF; I can imagine it's reached much sooner with this.

The main point of smartd is to monitor SMART attribute changes.  If
you're concerned about the health of your hard disk, you should be
looking at your logs and not relying on things like automatic short/long
tests.  Most SMART attributes are updated immediately and not during an
offline test, and all of those attribute changes will be logged.



You asked Miroslav about source of his configuration.  And as it is very  
similar to mine I think we both have it from smartd documentation. Where 
else to look for information?  It's a usual source.  So if you think it's 
wrong please contact the authors, we're obviously just users.

Thanks.



I'm not asking *where* you got the information from (we know where you
and others got it from: the documentation).  I'm asking you *why* you
enabled what you did, because this is not something smartd.conf enables
by default (the example is commented out).

If you *really* want me to talk to Bruce about this, I can/will, but I'm
left with the impression that the example in smartd.conf is there to
show people the syntactical usage of -o, and not to advocate its usage.


PS: Btw, long offline scan is scheduled on weekly basis, not daily. If  
it's good or not I do not know.



The OP's long scan is also scheduled on a weekly basis (every Sunday),
but his short scan trumps it.

Folks, the point I'm trying to make here is that daily -- and even
weekly -- SMART offline tests are unnecessary.  If you're that concerned
about your disk health, you should be looking at your syslog logs for
attribute changes that indicate drive issues.  Performing SMART offline
tests at regular intervals like this does very little other than
increase wear/tear on drive components (not necessarily the physical
platters/heads; there are many pieces to a hard disk.  :-) )


It is more than three years since I started to use smartd, and I have 
not changed my configs since then; I just copy them to all the new 
servers, so I can't tell why I had the feeling that a daily short and 
weekly long test is "the right way".
Do you have a link to a brief overview where we can read something 
about "the best practices" with smartd? Or may I just change the config 
to do a short test once a week and a long test once a month?


Miroslav Lachman

PS: all examples in smartd.conf.sample are commented out (DEVICESCAN is 
the default), but almost all of the examples have weekly long test, this 
may lead to our conclusion "weekly long test is good"



Re: Short SMART check causes disk op timeouts

2008-10-27 Thread Jeremy Chadwick
On Mon, Oct 27, 2008 at 08:50:44PM +0100, martinko wrote:
> Jeremy Chadwick wrote:
>> On Mon, Oct 27, 2008 at 07:52:01PM +0100, martinko wrote:
>>> Jeremy Chadwick wrote:
>>> Now, does the timeout cause loss of any data? Is there anything besides
>>> disabling the testing that I can do about it?
>> Do you understand what short and long offline tests actually do and what
>> they're used for?  :-)  If so, you'd know that running them periodically
>> is more or less silly (IMHO).
> I do not, not completely :) I think I have just copied the settings from
> somewhere and only just tweaked it a bit whenever I have added a disk.
 Let me know if you figure out who or what online resource solicited
 adding daily short/long tests, as I'd like to talk to them about their
 decision.  I have a feeling whoever thought it up felt that the tests
 were performing entire sector scans of the entire disk, which is simply
 not the case.

>>> Hallo,
>>>
>>> Reading this thread I checked my config to find this: ;-)
>>>
>>> #/dev/ad0 -a -n standby,q -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato
>>> /dev/ad0 -a -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato
>>>
>>> I believe I came up with the settings after reading manual page /   
>>> documentation of the tool.
>>
>> Can you explain why you're doing this?  So far no one's provided a
>> reason *why* they're doing short and long offline scans on a daily
>> basis.  I'm under the impression the conclusion was reached like this:
>> "man smartd.conf ... oh, -s, a neat thing, let's enable it".
>>
>> There are negative repercussions to doing tests of this nature at such
>> regular intervals.  Once-a-week is borderline acceptable; once a month
>> would be quite reasonable.  I'd love to know what kind of effect daily
>> tests have on MTBF; I can imagine it's reached much sooner with this.
>>
>> The main point of smartd is to monitor SMART attribute changes.  If
>> you're concerned about the health of your hard disk, you should be
>> looking at your logs and not relying on things like automatic short/long
>> tests.  Most SMART attributes are updated immediately and not during an
>> offline test, and all of those attribute changes will be logged.
>>
>
> You asked Miroslav about source of his configuration.  And as it is very  
> similar to mine I think we both have it from smartd documentation. Where 
> else to look for information?  It's a usual source.  So if you think it's 
> wrong please contact the authors, we're obviously just users.
> Thanks.

I'm not asking *where* you got the information from (we know where you
and others got it from: the documentation).  I'm asking you *why* you
enabled what you did, because this is not something smartd.conf enables
by default (the example is commented out).

If you *really* want me to talk to Bruce about this, I can/will, but I'm
left with the impression that the example in smartd.conf is there to
show people the syntactical usage of -o, and not to advocate its usage.

> PS: Btw, long offline scan is scheduled on weekly basis, not daily. If  
> it's good or not I do not know.

The OP's long scan is also scheduled on a weekly basis (every Sunday),
but his short scan trumps it.

Folks, the point I'm trying to make here is that daily -- and even
weekly -- SMART offline tests are unnecessary.  If you're that concerned
about your disk health, you should be looking at your syslog logs for
attribute changes that indicate drive issues.  Performing SMART offline
tests at regular intervals like this does very little other than
increase wear/tear on drive components (not necessarily the physical
platters/heads; there are many pieces to a hard disk.  :-) )

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |



Re: Short SMART check causes disk op timeouts

2008-10-27 Thread martinko

martinko wrote:

Jeremy Chadwick wrote:

On Mon, Oct 27, 2008 at 07:52:01PM +0100, martinko wrote:

Jeremy Chadwick wrote:
Now, does the timeout cause loss of any data? Is there anything 
besides

disabling the testing that I can do about it?
Do you understand what short and long offline tests actually do 
and what
they're used for?  :-)  If so, you'd know that running them 
periodically

is more or less silly (IMHO).
I do not, not completely :) I think I have just copied the settings 
from

somewhere and only just tweaked it a bit whenever I have added a disk.

Let me know if you figure out who or what online resource solicited
adding daily short/long tests, as I'd like to talk to them about their
decision.  I have a feeling whoever thought it up felt that the tests
were performing entire sector scans of the entire disk, which is simply
not the case.


Hallo,

Reading this thread I checked my config to find this: ;-)

#/dev/ad0 -a -n standby,q -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato
/dev/ad0 -a -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato


I believe I came up with the settings after reading manual page /  
documentation of the tool.


Can you explain why you're doing this?  So far no one's provided a
reason *why* they're doing short and long offline scans on a daily
basis.  I'm under the impression the conclusion was reached like this:
"man smartd.conf ... oh, -s, a neat thing, let's enable it".

There are negative repercussions to doing tests of this nature at such
regular intervals.  Once-a-week is borderline acceptable; once a month
would be quite reasonable.  I'd love to know what kind of effect daily
tests have on MTBF; I can imagine it's reached much sooner with this.

The main point of smartd is to monitor SMART attribute changes.  If
you're concerned about the health of your hard disk, you should be
looking at your logs and not relying on things like automatic short/long
tests.  Most SMART attributes are updated immediately and not during an
offline test, and all of those attribute changes will be logged.



You asked Miroslav about source of his configuration.  And as it is very 


 I meant Vaclav, of course, Miroslav's email just arrived. :)

similar to mine I think we both have it from smartd documentation. Where 
else to look for information?  It's a usual source.  So if you think 
it's wrong please contact the authors, we're obviously just users.

Thanks.

M.

PS: Btw, long offline scan is scheduled on weekly basis, not daily. If 
it's good or not I do not know.






Re: Short SMART check causes disk op timeouts

2008-10-27 Thread martinko

Jeremy Chadwick wrote:

On Mon, Oct 27, 2008 at 07:52:01PM +0100, martinko wrote:

Jeremy Chadwick wrote:

Now, does the timeout cause loss of any data? Is there anything besides
disabling the testing that I can do about it?

Do you understand what short and long offline tests actually do and what
they're used for?  :-)  If so, you'd know that running them periodically
is more or less silly (IMHO).

I do not, not completely :) I think I have just copied the settings from
somewhere and only just tweaked it a bit whenever I have added a disk.

Let me know if you figure out who or what online resource solicited
adding daily short/long tests, as I'd like to talk to them about their
decision.  I have a feeling whoever thought it up felt that the tests
were performing entire sector scans of the entire disk, which is simply
not the case.


Hallo,

Reading this thread I checked my config to find this: ;-)

#/dev/ad0 -a -n standby,q -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato
/dev/ad0 -a -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato


I believe I came up with the settings after reading manual page /  
documentation of the tool.


Can you explain why you're doing this?  So far no one's provided a
reason *why* they're doing short and long offline scans on a daily
basis.  I'm under the impression the conclusion was reached like this:
"man smartd.conf ... oh, -s, a neat thing, let's enable it".

There are negative repercussions to doing tests of this nature at such
regular intervals.  Once-a-week is borderline acceptable; once a month
would be quite reasonable.  I'd love to know what kind of effect daily
tests have on MTBF; I can imagine it's reached much sooner with this.

The main point of smartd is to monitor SMART attribute changes.  If
you're concerned about the health of your hard disk, you should be
looking at your logs and not relying on things like automatic short/long
tests.  Most SMART attributes are updated immediately and not during an
offline test, and all of those attribute changes will be logged.



You asked Miroslav about source of his configuration.  And as it is very 
similar to mine I think we both have it from smartd documentation. 
Where else to look for information?  It's a usual source.  So if you 
think it's wrong please contact the authors, we're obviously just users.

Thanks.

M.

PS: Btw, long offline scan is scheduled on weekly basis, not daily. If 
it's good or not I do not know.




Re: Short SMART check causes disk op timeouts

2008-10-27 Thread Miroslav Lachman

Jeremy Chadwick wrote:
> On Mon, Oct 27, 2008 at 06:22:03PM +0100, Vaclav Haisman wrote:
> > Jeremy Chadwick wrote:
> > > On Mon, Oct 27, 2008 at 11:16:59AM +0100, Vaclav Haisman wrote:
> > > Second, your short offline test runs at 0300, but the errors you're
> > > seeing are at 0454 in the morning.  A short offline test does not
> > > take 2 hours to run -- they take between 2-10 minutes -- unless the
> > > system is also in the middle of doing a lot of I/O, in which case the
> > > short test will be suspended.
> > >
> > > There are cronjobs (specifically periodic jobs) that run starting at
> > > 0301 in the morning ("periodic daily"), and many of those are I/O bound.
> > > This could possibly extend the length of the short test until 0454.
> > >
> > > Weekly periodic jobs run at 0415 in the morning, on Sundays.  These also
> > > perform a lot of disk I/O, so it's possible that on Sunday specifically
> > > the short SMART test gets pushed back quite some time.
> > >
> > > Third, the DMA timeouts you're seeing are possibly caused by the drive
> > > taking too long when internally suspending the SMART test.
> > >
> > > In most cases, it's safe for SMART tests (short and long) to be run
> > > while the machine is operational, and disk I/O requests are being
> > > performed.  When an I/O request comes and the disk is in the middle of
> > > performing a SMART test, the drive has to stop the SMART test (i.e.
> > > "suspend" it), complete the I/O request, then resume the SMART test.
> > >
> > > The FreeBSD ATA layer has a 5 second timeout on I/O requests; if it
> > > doesn't receive an acknowledgement back from the controller (disk)
> > > within 5 seconds, it'll report a timeout on whatever operation it was
> > > performing.  I'm thinking the disk gets stuck in a "do the offline
> > > test, no wait stop there's an I/O request, okay it's done continue the
> > > test, no way stop there's another I/O" loop.
> > Can I make the timeout higher? For the sake of elimination.
>
> You will have to make modifications to the ata(4) driver code, and
> rebuild+reinstall your kernel.
>
> There is a patch from the FreeNAS folks which turns the command timeout
> value into a sysctl for tuning, but that patch has not been brought into
> FreeBSD (any version) at this time.  You can find it referenced below
> (see one of the "Workarounds" sections).  You will probably have to
> apply the patch "by hand" rather than blindly using patch < patchfile,
> because the ATA code has changed since the patch was created.
>
> http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting
>
> > > Another possibility is that your drive really *does* have a bad block at
> > > LBA 836986454, and that one of those cron/periodic jobs is what's
> > > noticing it, and that upon noticing a bad block, the drive more or less
> > > aborts the SMART test to perform internal remapping of the block.
> > >
> > > To confirm this, you would need to boot the SeaTools utilities from DOS
> > > or from a CD (see Seagate's site) and run a full sector scan (NOT the
> > > "quick" test).  This takes a few hours.  Assuming it comes back clean,
> > > then my above claim of the offline test taking too long to suspend is
> > > probably the case.
> > >
> > > Possibly this is a firmware bug in the drive -- you might consider
> > > mailing Seagate about this problem, although I'm doubting their Tier 1
> > > support will understand what the issue is.
> > >
> > > Is the block number always the same?  Do you only see this error on
> > > Sundays?  These are two questions which might help narrow things down.
> > Nope, the LBA is always different and I see it in the logs once every day.
>
> Okay, so that greatly diminishes the possibility of it being a bad
> block.  I'd still advocate running SeaTools on the disk to ensure
> everything is 100% okay (re: "sake of elimination"); chances are it will
> pass with flying colours.
>
> > >> This is on 7.1-PRERELEASE #0: Wed Oct 15 18:56:54 UTC 2008, with GENERIC
> > >> kernel.
> > >>
> > >> Now, does the timeout cause loss of any data? Is there anything besides
> > >> disabling the testing that I can do about it?
> > >
> > > Do you understand what short and long offline tests actually do and what
> > > they're used for?  :-)  If so, you'd know that running them periodically
> > > is more or less silly (IMHO).
> > I do not, not completely :) I think I have just copied the settings from
> > somewhere and only just tweaked it a bit whenever I have added a disk.
>
> Let me know if you figure out who or what online resource solicited
> adding daily short/long tests, as I'd like to talk to them about their
> decision.  I have a feeling whoever thought it up felt that the tests
> were performing entire sector scans of the entire disk, which is simply
> not the case.


It seems like a little modified example from smartd.conf.sample

# First (primary) ATA/IDE hard disk.  Monitor all attributes, enable
# automatic online data collection, automatic Attribute autosave, and
# start a short self-test every day between 2-3am, and a long self test
# Saturdays between 3-4am.
#/dev/hda -a -o on -S on -s (S/../.././02|L/../../6/03)

I am using similar config without problem:

/dev/ad4 -a -o on -S on -m root -M test -M diminishing -s (S/../.././01|L/../../(3|6)/05) -t -I 194
/dev/ad6 -a -o on -S on -m root -M test -M

Re: Short SMART check causes disk op timeouts

2008-10-27 Thread Jeremy Chadwick
On Mon, Oct 27, 2008 at 07:52:01PM +0100, martinko wrote:
> Jeremy Chadwick wrote:
>
>>>>> Now, does the timeout cause loss of any data? Is there anything besides
>>>>> disabling the testing that I can do about it?
>>>> Do you understand what short and long offline tests actually do and what
>>>> they're used for?  :-)  If so, you'd know that running them periodically
>>>> is more or less silly (IMHO).
>>> I do not, not completely :) I think I have just copied the settings from
>>> somewhere and only just tweaked it a bit whenever I have added a disk.
>>
>> Let me know if you figure out who or what online resource solicited
>> adding daily short/long tests, as I'd like to talk to them about their
>> decision.  I have a feeling whoever thought it up felt that the tests
>> were performing entire sector scans of the entire disk, which is simply
>> not the case.
>>
>
> Hallo,
>
> Reading this thread I checked my config to find this: ;-)
>
> #/dev/ad0 -a -n standby,q -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato
> /dev/ad0 -a -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato
>
> I believe I came up with the settings after reading manual page /  
> documentation of the tool.

Can you explain why you're doing this?  So far no one's provided a
reason *why* they're doing short and long offline scans on a daily
basis.  I'm under the impression the conclusion was reached like this:
"man smartd.conf ... oh, -s, a neat thing, let's enable it".

There are negative repercussions to doing tests of this nature at such
regular intervals.  Once-a-week is borderline acceptable; once a month
would be quite reasonable.  I'd love to know what kind of effect daily
tests have on MTBF; I can imagine it's reached much sooner with this.

The main point of smartd is to monitor SMART attribute changes.  If
you're concerned about the health of your hard disk, you should be
looking at your logs and not relying on things like automatic short/long
tests.  Most SMART attributes are updated immediately and not during an
offline test, and all of those attribute changes will be logged.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |



Re: Short SMART check causes disk op timeouts

2008-10-27 Thread martinko

Jeremy Chadwick wrote:
>>>> Now, does the timeout cause loss of any data? Is there anything besides
>>>> disabling the testing that I can do about it?
>>> Do you understand what short and long offline tests actually do and what
>>> they're used for?  :-)  If so, you'd know that running them periodically
>>> is more or less silly (IMHO).
>> I do not, not completely :) I think I have just copied the settings from
>> somewhere and only just tweaked it a bit whenever I have added a disk.
>
> Let me know if you figure out who or what online resource solicited
> adding daily short/long tests, as I'd like to talk to them about their
> decision.  I have a feeling whoever thought it up felt that the tests
> were performing entire sector scans of the entire disk, which is simply
> not the case.



Hallo,

Reading this thread I checked my config to find this: ;-)

#/dev/ad0 -a -n standby,q -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato
/dev/ad0 -a -o on -S on -s (S/../.././02|L/../../7/03) -m root  # ++ 2006-11-03 mato


I believe I came up with the settings after reading manual page / 
documentation of the tool.


Regards,

Martin



Re: Short SMART check causes disk op timeouts

2008-10-27 Thread Jeremy Chadwick
On Mon, Oct 27, 2008 at 06:22:03PM +0100, Vaclav Haisman wrote:
> Jeremy Chadwick wrote:
> > On Mon, Oct 27, 2008 at 11:16:59AM +0100, Vaclav Haisman wrote:
> > Second, your short offline test runs at 0300, but the errors you're
> > seeing are at 0454 in the morning.  A short offline test does not
> > take 2 hours to run -- they take between 2-10 minutes -- unless the
> > system is also in the middle of doing a lot of I/O, in which case the
> > short test will be suspended.
> > 
> > There are cronjobs (specifically periodic jobs) that run starting at
> > 0301 in the morning ("periodic daily"), and many of those are I/O bound.
> > This could possibly extend the length of the short test until 0454.
> > 
> > Weekly periodic jobs run at 0415 in the morning, on Sundays.  These also
> > perform a lot of disk I/O, so it's possible that on Sunday specifically
> > the short SMART test gets pushed back quite some time.
> > 
> > Third, the DMA timeouts you're seeing are possibly caused by the drive
> > taking too long when internally suspending the SMART test.
> > 
> > In most cases, it's safe for SMART tests (short and long) to be run
> > while the machine is operational, and disk I/O requests are being
> > performed.  When an I/O request comes and the disk is in the middle of
> > performing a SMART test, the drive has to stop the SMART test (i.e.
> > "suspend" it), complete the I/O request, then resume the SMART test.
> >
> > The FreeBSD ATA layer has a 5 second timeout on I/O requests; if it
> > doesn't receive an acknowledgement back from the controller (disk)
> > within 5 seconds, it'll report a timeout on whatever operation it was
> > performing.  I'm thinking the disk gets stuck in a "do the offline
> > test, no wait stop there's an I/O request, okay it's done continue the
> > test, no way stop there's another I/O" loop.
> Can I make the timeout higher? For the sake of elimination.

You will have to make modifications to the ata(4) driver code, and
rebuild+reinstall your kernel.

There is a patch from the FreeNAS folks which turns the command timeout
value into a sysctl for tuning, but that patch has not been brought into
FreeBSD (any version) at this time.  You can find it referenced below
(see one of the "Workarounds" sections).  You will probably have to
apply the patch "by hand" rather than blindly using patch < patchfile,
because the ATA code has changed since the patch was created.

http://wiki.freebsd.org/JeremyChadwick/ATA_issues_and_troubleshooting

> > Another possibility is that your drive really *does* have a bad block at
> > LBA 836986454, and that one of those cron/periodic jobs is what's
> > noticing it, and that upon noticing a bad block, the drive more or less
> > aborts the SMART test to perform internal remapping of the block.
> > 
> > To confirm this, you would need to boot the SeaTools utilities from DOS
> > or from a CD (see Seagate's site) and run a full sector scan (NOT the
> > "quick" test).  This takes a few hours.  Assuming it comes back clean,
> > then my above claim of the offline test taking too long to suspend is
> > probably the case.
> > 
> > Possibly this is a firmware bug in the drive -- you might consider
> > mailing Seagate about this problem, although I'm doubting their Tier 1
> > support will understand what the issue is.
> > 
> > Is the block number always the same?  Do you only see this error on
> > Sundays?  These are two questions which might help narrow things down.
> Nope, the LBA is always different and I see it in the logs once every day.

Okay, so that greatly diminishes the possibility of it being a bad
block.  I'd still advocate running SeaTools on the disk to ensure
everything is 100% okay (re: "sake of elimination"); chances are it will
pass with flying colours.

> >> This is on 7.1-PRERELEASE #0: Wed Oct 15 18:56:54 UTC 2008, with GENERIC
> >> kernel.
> >>
> >> Now, does the timeout cause loss of any data? Is there anything besides
> >> disabling the testing that I can do about it?
> > 
> > Do you understand what short and long offline tests actually do and what
> > they're used for?  :-)  If so, you'd know that running them periodically
> > is more or less silly (IMHO).
> I do not, not completely :) I think I have just copied the settings from
> somewhere and only just tweaked it a bit whenever I have added a disk.

Let me know if you figure out who or what online resource solicited
adding daily short/long tests, as I'd like to talk to them about their
decision.  I have a feeling whoever thought it up felt that the tests
were performing entire sector scans of the entire disk, which is simply
not the case.

> > If you're trying to accomplish a cheap version of disk scrubbing, i.e.
> > scanning the entire disk for bad blocks and reporting them or having them
> > automatically remapped by the drive, consider using sysutils/diskcheckd,
> > which was made for this purpose.  However, be aware of a problem I've
> > run into with it (still needs someone clueful to figur

Re: Short SMART check causes disk op timeouts

2008-10-27 Thread Vaclav Haisman

Jeremy Chadwick wrote:
> On Mon, Oct 27, 2008 at 11:16:59AM +0100, Vaclav Haisman wrote:
>>
>> Hi,
>> I have recently bought a new disk (Seagate 500G, ST3500320NS). I have
>> enabled SMART checking using the smartmontools as usual for the disk
>> (/dev/ad6 -a -S on -s (S/../.././03|L/../../7/03) -m root). The problem
>> is that each time the test runs I get messages like the following in
>> /var/log/messages:
>>
>> Oct 26 04:54:15 35 kernel: ad6: TIMEOUT - WRITE_DMA48 retrying (1 retry
>> left) LBA=836986454
>> Oct 26 04:54:25 35 kernel: ad6: TIMEOUT - WRITE_DMA48 retrying (0
>> retries left) LBA=836986454
>> Oct 26 04:54:25 35 kernel: ad6: FAILURE - WRITE_DMA48 timed out
>> LBA=836986454
>> Oct 26 04:54:25 35 kernel: g_vfs_done():ad6s2d[WRITE(offset=13150142464,
>> length=16384)]error = 5
>>
>> And the SMART test results log on the disk contains a line like this:
>>
>> # 1  Short offline       Interrupted (host reset)      00%       297         -
> 
> First and foremost, your above smartd.conf -s flags are conflicting.
> Your long offline test will never get run on Sunday; the short will run
> first, and the long won't ever start (because the short is already
> running).  I would recommend telling the short test to run only on
> days 1-6, leaving Sunday solely for the long test.  (I noticed this
> because the above "Interrupted" test indicates a short test was
> interrupted and not a long).
Thanks, I have not noticed the overlap at all.

> 
> Second, your short offline test runs at 0300, but the errors you're
> seeing are at 0454 in the morning.  A short offline test does not
> take 2 hours to run -- they take between 2-10 minutes -- unless the
> system is also in the middle of doing a lot of I/O, in which case the
> short test will be suspended.
> 
> There are cronjobs (specifically periodic jobs) that run starting at
> 0301 in the morning ("periodic daily"), and many of those are I/O bound.
> This could possibly extend the length of the short test until 0454.
> 
> Weekly periodic jobs run at 0415 in the morning, on Sundays.  These also
> perform a lot of disk I/O, so it's possible that on Sunday specifically
> the short SMART test gets pushed back quite some time.
> 
> Third, the DMA timeouts you're seeing are possibly caused by the drive
> taking too long when internally suspending the SMART test.
> 
> In most cases, it's safe for SMART tests (short and long) to be run
> while the machine is operational, and disk I/O requests are being
> performed.  When an I/O request comes and the disk is in the middle of
> performing a SMART test, the drive has to stop the SMART test (i.e.
> "suspend" it), complete the I/O request, then resume the SMART test.
>
> The FreeBSD ATA layer has a 5 second timeout on I/O requests; if it
> doesn't receive an acknowledgement back from the controller (disk)
> within 5 seconds, it'll report a timeout on whatever operation it was
> performing.  I'm thinking the disk gets stuck in a "do the offline
> test, no wait stop there's an I/O request, okay it's done continue the
> test, no way stop there's another I/O" loop.
Can I make the timeout higher? For the sake of elimination.

> 
> Another possibility is that your drive really *does* have a bad block at
> LBA 836986454, and that one of those cron/periodic jobs is what's
> noticing it, and that upon noticing a bad block, the drive more or less
> aborts the SMART test to perform internal remapping of the block.
> 
> To confirm this, you would need to boot the SeaTools utilities from DOS
> or from a CD (see Seagate's site) and run a full sector scan (NOT the
> "quick" test).  This takes a few hours.  Assuming it comes back clean,
> then my above claim of the offline test taking too long to suspend is
> probably the case.
> 
> Possibly this is a firmware bug in the drive -- you might consider
> mailing Seagate about this problem, although I'm doubting their Tier 1
> support will understand what the issue is.
> 
> Is the block number always the same?  Do you only see this error on
> Sundays?  These are two questions which might help narrow things down.
Nope, the LBA is always different and I see it in the logs once every day.

> 
>> This is on 7.1-PRERELEASE #0: Wed Oct 15 18:56:54 UTC 2008, with GENERIC
>> kernel.
>>
>> Now, does the timeout cause loss of any data? Is there anything besides
>> disabling the testing that I can do about it?
> 
> Do you understand what short and long offline tests actually do and what
> they're used for?  :-)  If so, you'd know that running them periodically
> is more or less silly (IMHO).
I do not, not completely :) I think I have just copied the settings from
somewhere and only just tweaked it a bit whenever I have added a disk.

> 
> If you're trying to accomplish a cheap version of disk scrubbing, i.e.
> scanning the entire disk for bad blocks and reporting them or having them
> automatically remappe

Re: Short SMART check causes disk op timeouts

2008-10-27 Thread Jeremy Chadwick
On Mon, Oct 27, 2008 at 11:16:59AM +0100, Vaclav Haisman wrote:
> 
> Hi,
> I have recently bought a new disk (Seagate 500G, ST3500320NS). I have
> enabled SMART checking using the smartmontools as usual for the disk
> (/dev/ad6 -a -S on -s (S/../.././03|L/../../7/03) -m root). The problem
> is that each time the test runs I get messages like the following in
> /var/log/messages:
> 
> Oct 26 04:54:15 35 kernel: ad6: TIMEOUT - WRITE_DMA48 retrying (1 retry
> left) LBA=836986454
> Oct 26 04:54:25 35 kernel: ad6: TIMEOUT - WRITE_DMA48 retrying (0
> retries left) LBA=836986454
> Oct 26 04:54:25 35 kernel: ad6: FAILURE - WRITE_DMA48 timed out
> LBA=836986454
> Oct 26 04:54:25 35 kernel: g_vfs_done():ad6s2d[WRITE(offset=13150142464,
> length=16384)]error = 5
> 
> And the SMART test results log on the disk contains a line like this:
>
> # 1  Short offline       Interrupted (host reset)      00%       297         -

First and foremost, your above smartd.conf -s flags are conflicting.
Your long offline test will never get run on Sunday; the short will run
first, and the long won't ever start (because the short is already
running).  I would recommend telling the short test to run only on
days 1-6 (Monday through Saturday), leaving Sunday (day 7) solely for
the long test.  (I noticed this because the above "Interrupted" test
indicates a short test was interrupted and not a long one).
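
To make the overlap concrete, here is a minimal Python sketch (an
illustration, not smartd's code).  It assumes the -s argument is a regular
expression matched against a string of the form T/MM/DD/d/HH, with d running
from 1 (Monday) to 7 (Sunday) as smartd.conf(5) describes.  Python's re
module stands in for smartd's own regex matching, and matching_tests is a
made-up helper name.  It only shows which test types match in a given hour;
what smartd does when both match is not modelled.

```python
import re
from datetime import datetime


def matching_tests(schedule, when):
    """Return the test types whose T/MM/DD/d/HH stamp matches `schedule`."""
    hits = []
    for test_type in ("S", "L"):  # S = short test, L = long test
        # isoweekday() gives Monday=1 ... Sunday=7 (assumed smartd numbering).
        stamp = "{}/{:%m}/{:%d}/{}/{:%H}".format(
            test_type, when, when, when.isoweekday(), when
        )
        if re.fullmatch(schedule, stamp):
            hits.append(test_type)
    return hits


# Schedule from the original report: the short test matches every day at 03,
# including Sunday, when the long test also matches.
original = "(S/../.././03|L/../../7/03)"
# Hypothetical fix: restrict the short test to Monday through Saturday.
separated = "(S/../../[1-6]/03|L/../../7/03)"

sunday = datetime(2008, 10, 26, 3, 0)
monday = datetime(2008, 10, 27, 3, 0)

for name, schedule in (("original", original), ("separated", separated)):
    print(name, "Sunday:", matching_tests(schedule, sunday))
    print(name, "Monday:", matching_tests(schedule, monday))
```

With the original schedule both S and L match at 03 on Sunday; with the
day field of the short test narrowed to [1-6], only L matches that hour.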

Second, your short offline test runs at 0300, but the errors you're
seeing are at 0454 in the morning.  A short offline test does not
take 2 hours to run -- they take between 2-10 minutes -- unless the
system is also in the middle of doing a lot of I/O, in which case the
short test will be suspended.

There are cronjobs (specifically periodic jobs) that run starting at
0301 in the morning ("periodic daily"), and many of those are I/O bound.
This could possibly extend the length of the short test until 0454.

Weekly periodic jobs run at 0415 in the morning, on Sundays.  These also
perform a lot of disk I/O, so it's possible that on Sunday specifically
the short SMART test gets pushed back quite some time.

Third, the DMA timeouts you're seeing are possibly caused by the drive
taking too long when internally suspending the SMART test.

In most cases, it's safe for SMART tests (short and long) to be run
while the machine is operational, and disk I/O requests are being
performed.  When an I/O request comes and the disk is in the middle of
performing a SMART test, the drive has to stop the SMART test (i.e.
"suspend" it), complete the I/O request, then resume the SMART test.

The FreeBSD ATA layer has a 5 second timeout on I/O requests; if it
doesn't receive an acknowledgement back from the controller (disk)
within 5 seconds, it'll report a timeout on whatever operation it was
performing.  I'm thinking the disk gets stuck in a "do the offline
test, no wait stop there's an I/O request, okay it's done continue the
test, no way stop there's another I/O" loop.

Another possibility is that your drive really *does* have a bad block at
LBA 836986454, and that one of those cron/periodic jobs is what's
noticing it, and that upon noticing a bad block, the drive more or less
aborts the SMART test to perform internal remapping of the block.

To confirm this, you would need to boot the SeaTools utilities from DOS
or from a CD (see Seagate's site) and run a full sector scan (NOT the
"quick" test).  This takes a few hours.  Assuming it comes back clean,
then my above claim of the offline test taking too long to suspend is
probably the case.

Possibly this is a firmware bug in the drive -- you might consider
mailing Seagate about this problem, although I'm doubting their Tier 1
support will understand what the issue is.

Is the block number always the same?  Do you only see this error on
Sundays?  These are two questions which might help narrow things down.
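
One way to answer both questions from /var/log/messages is sketched below in
Python.  It is only a rough sketch: the regular expression is modelled on the
four kernel lines quoted in the report (with the mail client's line wrapping
undone), the year is assumed to be 2008 because syslog timestamps omit it,
and summarize is a made-up helper name.

```python
import re
from collections import Counter
from datetime import datetime

DAYS = ("Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday")

# Pattern modelled on the ata(4) messages quoted in the report.
TIMEOUT_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: "
    r"(?P<dev>\w+): (?P<kind>TIMEOUT|FAILURE) - (?P<cmd>\w+) .*?LBA=(?P<lba>\d+)"
)


def summarize(lines, year):
    """Count TIMEOUT/FAILURE messages per LBA and per day of the week."""
    by_lba, by_weekday = Counter(), Counter()
    for line in lines:
        m = TIMEOUT_RE.match(line)
        if m is None:
            continue  # not an ATA timeout line (e.g. the g_vfs_done one)
        when = datetime.strptime(
            "{} {}".format(year, m.group("stamp")), "%Y %b %d %H:%M:%S"
        )
        by_lba[int(m.group("lba"))] += 1
        by_weekday[DAYS[when.weekday()]] += 1
    return by_lba, by_weekday


# The lines from the report, rejoined onto single lines.
SAMPLE = [
    "Oct 26 04:54:15 35 kernel: ad6: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=836986454",
    "Oct 26 04:54:25 35 kernel: ad6: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=836986454",
    "Oct 26 04:54:25 35 kernel: ad6: FAILURE - WRITE_DMA48 timed out LBA=836986454",
    "Oct 26 04:54:25 35 kernel: g_vfs_done():ad6s2d[WRITE(offset=13150142464, length=16384)]error = 5",
]

by_lba, by_weekday = summarize(SAMPLE, 2008)
print(dict(by_lba))
print(dict(by_weekday))
```

Fed a few weeks of logs instead of SAMPLE, a single dominant LBA would point
towards a bad block, while many different LBAs spread over every weekday
would point away from it.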

> This is on 7.1-PRERELEASE #0: Wed Oct 15 18:56:54 UTC 2008, with GENERIC
> kernel.
> 
> Now, does the timeout cause loss of any data? Is there anything besides
> disabling the testing that I can do about it?

Do you understand what short and long offline tests actually do and what
they're used for?  :-)  If so, you'd know that running them periodically
is more or less silly (IMHO).

If you're trying to accomplish a cheap version of disk scrubbing, i.e.
scanning the entire disk for bad blocks and reporting them or having them
automatically remapped by the drive, consider using sysutils/diskcheckd,
which was made for this purpose.  However, be aware of a problem I've
run into with it (still needs someone clueful to figure out why this
happens):
http://www.freebsd.org/cgi/query-pr.cgi?pr=ports/115853

I do not advocate the use of periodic offline tests on disks, especially
at such aggressive intervals (daily).  In fact, I don't even know why
Bruce added that option to smartd.  There are only a few attributes in
SMART which get updated on offline tests, so I fail to see the point.

You shouldn't

Short SMART check causes disk op timeouts

2008-10-27 Thread Vaclav Haisman

Hi,
I have recently bought a new disk (Seagate 500G, ST3500320NS). I have
enabled SMART checking using the smartmontools as usual for the disk
(/dev/ad6 -a -S on -s (S/../.././03|L/../../7/03) -m root). The problem
is that each time the test runs I get messages like the following in
/var/log/messages:

Oct 26 04:54:15 35 kernel: ad6: TIMEOUT - WRITE_DMA48 retrying (1 retry left) LBA=836986454
Oct 26 04:54:25 35 kernel: ad6: TIMEOUT - WRITE_DMA48 retrying (0 retries left) LBA=836986454
Oct 26 04:54:25 35 kernel: ad6: FAILURE - WRITE_DMA48 timed out LBA=836986454
Oct 26 04:54:25 35 kernel: g_vfs_done():ad6s2d[WRITE(offset=13150142464, length=16384)]error = 5

And the SMART test results log on the disk contains a line like this:

# 1  Short offline       Interrupted (host reset)      00%       297         -

This is on 7.1-PRERELEASE #0: Wed Oct 15 18:56:54 UTC 2008, with GENERIC
kernel.

Now, does the timeout cause loss of any data? Is there anything besides
disabling the testing that I can do about it?

- --
VH