Re: SATA exceptions

2007-07-13 Thread S.Çağlar Onur

On Fri, 13 Jul 2007, Tejun Heo wrote:
> >> OS and driver can't really do much about the reallocation event.  Some
> >> number of reallocations is okay but if you see it going up constantly, you
> >> probably have a dying disk.
> >
> > Hmm... cutting the power while writing is doable from the OS and might
> > force reallocations?
>
> Hmmm... We don't have any pending write when power goes out and I don't
> think emergency unload can directly increase reallocation count.  It can
> shorten lifespan of the head tho.
>
> > You might want to check if number of reallocated sectors increases
> > with shutdowns/reboots.
>
> I'm curious too.

It seems reboot/shutdown has no effect on reallocated sectors. After 5 reboots
and 5 shutdowns the count didn't change at all.

zangetsu ~ # smartctl -a /dev/sda | grep Reall
  5 Reallocated_Sector_Ct   0x0033   067   067   010    Pre-fail  Always       -       314
196 Reallocated_Event_Count 0x0032   067   067   000    Old_age   Always       -       314

Cheers
-- 
S.Çağlar Onur <[EMAIL PROTECTED]>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!



Re: SATA exceptions

2007-07-12 Thread Tejun Heo

Pavel Machek wrote:
>>>> Your SMART log shows 309 reallocated sectors. That seems somewhat high..
>>> Ah sorry to misinterpret the content :), it's quite a new piece of
>>> hardware (at most ~1.5 months old) and "Reallocated_Event_Count"
>>> constantly increases (currently it has increased to 313) and although
>>> I'm not 100 percent sure, these errors only occurred with
>>> kernels > 2.6.18 (or 2.6.18 didn't report these, because according to
>>> kern.log they are only visible with 2.6.22+)
>> OS and driver can't really do much about the reallocation event.  Some
>> number of reallocations is okay but if you see it going up constantly, you
>> probably have a dying disk.
> 
> Hmm... cutting the power while writing is doable from the OS and might
> force reallocations?

Hmmm... We don't have any pending write when power goes out and I don't
think emergency unload can directly increase reallocation count.  It can
shorten lifespan of the head tho.

> You might want to check if number of reallocated sectors increases
> with shutdowns/reboots.

I'm curious too.

-- 
tejun
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: SATA exceptions

2007-07-12 Thread Pavel Machek

Hi!

> >> Your SMART log shows 309 reallocated sectors. That seems somewhat high..
> > 
> > Ah sorry to misinterpret the content :), it's quite a new piece of
> > hardware (at most ~1.5 months old) and "Reallocated_Event_Count"
> > constantly increases (currently it has increased to 313) and although
> > I'm not 100 percent sure, these errors only occurred with
> > kernels > 2.6.18 (or 2.6.18 didn't report these, because according to
> > kern.log they are only visible with 2.6.22+)
> 
> OS and driver can't really do much about the reallocation event.  Some
> number of reallocations is okay but if you see it going up constantly, you
> probably have a dying disk.

Hmm... cutting the power while writing is doable from the OS and might force
reallocations?

You might want to check if number of reallocated sectors increases
with shutdowns/reboots.
Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) 
http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

Re: SATA exceptions

2007-07-11 Thread Tejun Heo

Mark Lord wrote:
> I'm not even sure how to interpret those numbers.
> It seems rather odd that nearly all fields are either "100" or "253",
> so those are probably pre-programmed numbers rather than actual counts.
> The raw value at the end of the line (for the various "Reallocated*"
> fields) is probably the real value here.

I dunno exactly either.  Different vendors seem to use different metrics
anyway but increasing raw number on reallocate counter is pretty easy to
interpret.

-- 
tejun
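
On the normalized columns: as far as the smartctl documentation describes it, an attribute counts as failed when its normalized VALUE drops to or below THRESH, and the WHEN_FAILED column reports whether that holds now or held only for the WORST value. A small Python sketch of that rule follows. The function name `when_failed` is chosen for illustration only, and the raw values remain vendor-specific, as noted in this message.

```python
def when_failed(value, worst, thresh):
    """Reproduce the WHEN_FAILED column from the normalized values.

    Rule assumed from the smartctl man page: failing now if VALUE <= THRESH,
    failed in the past if only WORST <= THRESH, otherwise "-".
    """
    if value <= thresh:
        return "FAILING_NOW"
    if worst <= thresh:
        return "In_the_past"
    return "-"


# (VALUE, WORST, THRESH) triples taken from smartctl output posted in this thread.
print(when_failed(67, 67, 10))   # attribute 5, Reallocated_Sector_Ct
print(when_failed(47, 40, 40))   # attribute 190, Temperature_Celsius
```

For these two rows the sketch gives "-" and "In_the_past", which matches the WHEN_FAILED column in the posted output. A large raw reallocation count can therefore coexist with a normalized value that is still well above its threshold.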

Re: SATA exceptions

2007-07-11 Thread Mark Lord


S.Çağlar Onur wrote:

Hi;

On Sat, 07 Jul 2007, Robert Hancock wrote:

It's not the free space on the drive that matters, it's the number of
free sectors in the spare sector pool on the drive, which is invisible
to software.

Your SMART log shows 309 reallocated sectors. That seems somewhat high..


Ah sorry to misinterpret the content :), it's quite a new piece of hardware
(at most ~1.5 months old) and "Reallocated_Event_Count" constantly increases
(currently it has increased to 313) and although I'm not 100 percent sure,
these errors only occurred with kernels > 2.6.18 (or 2.6.18 didn't report
these, because according to kern.log they are only visible with 2.6.22+)

We bought 3 HP Pavilion dv2385ea and one of them runs only 2.6.18; its
smartctl output follows as a reference:


smartctl version 5.37 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG HM160JI
Serial Number:    S0W6J10P331479
Firmware Version: AD100-16
User Capacity:    160.041.885.696 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 0
Local Time is:    Sun Jul  8 00:22:21 2007 EEST

==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.


SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (5391) seconds.
Offline data collection
capabilities:                    (0x51) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  89) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   253   253   025    Pre-fail  Always       -       2880
  4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2648
  5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   253   253   015    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   253   253   000    Old_age   Always       -       236
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       1
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       2
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       57
187 Unknown_Attribute       0x0032   253   253   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   253   253   000    Old_age   Always       -       0
190 Temperature_Celsius     0x0022   047   040   040    Old_age   Always   In_the_past 1008009269
191 G-Sense_Error_Rate      0x0012   100   100   000    Old_age   Always       -       5396
192 Power-Off_Retract_Count 0x0012   100   100   000    Old_age   Always       -       40
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       2575
194 Temperature_Celsius     0x0022   047   040   000    Old_age   Always       -       53 (Lifetime Min/Max 0/15381)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age

Re: SATA exceptions

2007-07-11 Thread Bill Davidsen


Tejun Heo wrote:
> Hello,
>
> S.Çağlar Onur wrote:
>> On Sat, 07 Jul 2007, Robert Hancock wrote:
>>> It's not the free space on the drive that matters, it's the number of
>>> free sectors in the spare sector pool on the drive, which is invisible
>>> to software.
>>>
>>> Your SMART log shows 309 reallocated sectors. That seems somewhat high..
>>
>> Ah sorry to misinterpret the content :), it's quite a new piece of
>> hardware (at most ~1.5 months old) and "Reallocated_Event_Count"
>> constantly increases (currently it has increased to 313) and although
>> I'm not 100 percent sure, these errors only occurred with
>> kernels > 2.6.18 (or 2.6.18 didn't report these, because according to
>> kern.log they are only visible with 2.6.22+)
>
> OS and driver can't really do much about the reallocation event.  Some
> number of reallocations is okay but if you see it going up constantly, you
> probably have a dying disk.

Or, as I learned the hard way, if you have the problem on all drives
sharing a power supply, a power issue.

>> We bought 3 HP Pavilion dv2385ea and one of them runs only 2.6.18; its
>> smartctl output follows as a reference:
>>
>>   5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail
>> 196 Reallocated_Event_Count 0x0032   253   253   000    Old_age
>
> Hmm... This is pretty high too.  Do the counts increase on this machine too?




--
Bill Davidsen <[EMAIL PROTECTED]>
  "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot


Re: SATA exceptions

2007-07-09 Thread S.Çağlar Onur

Hi;

On Mon, 09 Jul 2007, Tejun Heo wrote:
> > On Sat, 07 Jul 2007, Robert Hancock wrote:
> >> It's not the free space on the drive that matters, it's the number of
> >> free sectors in the spare sector pool on the drive, which is invisible
> >> to software.
> >>
> >> Your SMART log shows 309 reallocated sectors. That seems somewhat high..
> >
> > Ah sorry to misinterpret the content :), it's quite a new piece of
> > hardware (at most ~1.5 months old) and "Reallocated_Event_Count"
> > constantly increases (currently it has increased to 313) and although
> > I'm not 100 percent sure, these errors only occurred with
> > kernels > 2.6.18 (or 2.6.18 didn't report these, because according to
> > kern.log they are only visible with 2.6.22+)
>
> OS and driver can't really do much about the reallocation event.  Some
> number of reallocations is okay but if you see it going up constantly, you
> probably have a dying disk.

Hmm, it's really interesting; then it means three ~1.5-month-old laptops are
dying of the same disease :) or they were already defective somehow (or we are
damaging them, but mine has been sitting happily on my table all that time :P)

> > We bought 3 HP Pavilion dv2385ea and one of them runs only 2.6.18; its
> > smartctl output follows as a reference:
> >
> >   5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail
> > 196 Reallocated_Event_Count 0x0032   253   253   000    Old_age
>
> Hmm... This is pretty high too.  Do the counts increase on this machine
> too?

Yes, seems so (I'm adding Onur and İsmail to CC as the owners of the other
machines), and here are the SMART logs for these 3 separate machines. It's
interesting that İsmail and I run 2.6.22 (over 300 reallocations occurred for
both of us) and Onur uses 2.6.18 (0 reallocations occurred for him).

[1] http://cekirdek.pardus.org.tr/~caglar/SATA/smart.caglar
[2] http://cekirdek.pardus.org.tr/~caglar/SATA/smart.ismail
[3] http://cekirdek.pardus.org.tr/~caglar/SATA/smart.onur

Cheers
-- 
S.Çağlar Onur <[EMAIL PROTECTED]>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!



Re: SATA exceptions

2007-07-09 Thread Tejun Heo

Hello,

S.Çağlar Onur wrote:
> On Sat, 07 Jul 2007, Robert Hancock wrote:
>> It's not the free space on the drive that matters, it's the number of
>> free sectors in the spare sector pool on the drive, which is invisible
>> to software.
>>
>> Your SMART log shows 309 reallocated sectors. That seems somewhat high..
> 
> Ah sorry to misinterpret the content :), it's quite a new piece of
> hardware (at most ~1.5 months old) and "Reallocated_Event_Count"
> constantly increases (currently it has increased to 313) and although
> I'm not 100 percent sure, these errors only occurred with
> kernels > 2.6.18 (or 2.6.18 didn't report these, because according to
> kern.log they are only visible with 2.6.22+)

OS and driver can't really do much about the reallocation event.  Some
number of reallocations is okay but if you see it going up constantly, you
probably have a dying disk.

> We bought 3 HP Pavilion dv2385ea and one of them runs only 2.6.18; its
> smartctl output follows as a reference:
>
>   5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail
> 196 Reallocated_Event_Count 0x0032   253   253   000    Old_age

Hmm... This is pretty high too.  Do the counts increase on this machine too?

-- 
tejun

Re: SATA exceptions

2007-07-07 Thread S.Çağlar Onur

Hi;

On Sat, 07 Jul 2007, Robert Hancock wrote:
> It's not the free space on the drive that matters, it's the number of
> free sectors in the spare sector pool on the drive, which is invisible
> to software.
>
> Your SMART log shows 309 reallocated sectors. That seems somewhat high..

Ah sorry to misinterpret the content :), it's quite a new piece of hardware
(at most ~1.5 months old) and "Reallocated_Event_Count" constantly increases
(currently it has increased to 313) and although I'm not 100 percent sure,
these errors only occurred with kernels > 2.6.18 (or 2.6.18 didn't report
these, because according to kern.log they are only visible with 2.6.22+)

We bought 3 HP Pavilion dv2385ea and one of them runs only 2.6.18; its
smartctl output follows as a reference:

smartctl version 5.37 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG HM160JI
Serial Number:    S0W6J10P331479
Firmware Version: AD100-16
User Capacity:    160.041.885.696 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 0
Local Time is:    Sun Jul  8 00:22:21 2007 EEST

==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (5391) seconds.
Offline data collection
capabilities:                    (0x51) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  89) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   253   253   025    Pre-fail  Always       -       2880
  4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2648
  5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   253   253   015    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   253   253   000    Old_age   Always       -       236
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       1
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       2
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       57
187 Unknown_Attribute       0x0032   253   253   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   253   253   000    Old_age   Always       -       0
190 Temperature_Celsius     0x0022   047   040   040    Old_age   Always   In_the_past 1008009269
191 G-Sense_Error_Rate      0x0012   100   100   000    Old_age   Always       -       5396
192 Power-Off_Retract_Count 0x0012   100   100   000    Old_age   Always       -       40
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       2575
194 Temperature_Celsius     0x0022   047   040   000    Old_age   Always       -       53 (Lifetime Min/Max 0/15381)
195
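
A note on reading the WHEN_FAILED column in the attribute list above: it is derived from the normalized VALUE, WORST and THRESH columns rather than from RAW_VALUE. The sketch below is a minimal Python rendering of that rule as smartmontools is understood to apply it; the function name is invented for illustration and is not part of any tool. It shows why attribute 190 is marked In_the_past (its WORST of 040 has reached its THRESH of 040) while attribute 194, with the same readings but a threshold of 000, is not flagged.

```python
def when_failed(value: int, worst: int, thresh: int) -> str:
    """Classify a SMART attribute from its normalized columns.

    Assumption: a threshold of 0 means the attribute can never be
    reported as failed; otherwise lower normalized values are worse.
    """
    if thresh == 0:
        return "-"
    if value <= thresh:
        return "FAILING_NOW"   # current normalized value is at or below threshold
    if worst <= thresh:
        return "In_the_past"   # it dipped to the threshold at some earlier point
    return "-"


# Rows taken from the attribute list above (VALUE, WORST, THRESH).
print(when_failed(47, 40, 40))    # attribute 190
print(when_failed(47, 40, 0))     # attribute 194
print(when_failed(253, 253, 10))  # attribute 5
```

The raw value of attribute 190 (1008009269) is clearly not a plain temperature, so only the normalized columns are used here.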

Re: SATA exceptions

2007-07-07 Thread Robert Hancock


S.Çağlar Onur wrote:
> On Fri, 06 Jul 2007, Tejun Heo wrote:
>> S.Çağlar Onur wrote:
>>> [ 4260.278427] ata1.00: cmd ca/00:08:d0:88:bc/00:00:00:00:00/ee tag 0 cdb
>>> 0x0 data 4096 out
>>> [ 4260.278430]  res 51/40:01:d7:88:bc/00:00:0e:00:00/ee Emask 0x9
>>> (media error)
>>
>> That's media error on sector 247236823 on WRITE.  Media errors on write
>> are bad signs - it usually means the drive even failed to remap the
>> sector because extra space ran out.
>
> Hmm, more than 50GB is empty on disk :)


It's not the free space on the drive that matters, it's the number of 
free sectors in the spare sector pool on the drive, which is invisible 
to software.


Your SMART log shows 309 reallocated sectors. That seems somewhat high..
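
The figure quoted here is the RAW_VALUE column of attribute 5 (and 196) in `smartctl -a` output. As a rough sketch of how successive readings could be compared, the Python below pulls the raw value out of saved smartctl text. The regular expression, the helper name and the `SAMPLE` constant are illustrative assumptions; the two sample lines are the ones posted later in this thread (13 Jul), re-spaced into columns.

```python
import re

# One attribute row: ID, name, flag, VALUE, WORST, THRESH, type, updated,
# WHEN_FAILED, RAW_VALUE. THRESH and TYPE may be run together in mangled
# archives, hence the optional whitespace between them.
ATTR_LINE = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+(?P<flag>0x[0-9a-fA-F]+)\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d{3})\s*"
    r"(?P<type>Pre-fail|Old_age)\s+(?P<updated>\S+)\s+"
    r"(?P<when_failed>\S+)\s+(?P<raw>\d+)"
)

SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   067   067   010    Pre-fail  Always       -       314
196 Reallocated_Event_Count 0x0032   067   067   000    Old_age   Always       -       314
"""


def raw_value(smartctl_text, attr_id):
    """Return the raw value of one attribute, or None if it is not listed."""
    for line in smartctl_text.splitlines():
        match = ATTR_LINE.match(line)
        if match and int(match.group("id")) == attr_id:
            return int(match.group("raw"))
    return None


print(raw_value(SAMPLE, 5))    # reallocated sectors
print(raw_value(SAMPLE, 196))  # reallocation events
```

Saving the output of `smartctl -a /dev/sda` at intervals and feeding each file to `raw_value` would show whether the count keeps climbing, which is the pattern described in this thread as a sign of a dying disk.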

--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: SATA exceptions

2007-07-06 Thread S.Çağlar Onur

Hi;

On Fri, 06 Jul 2007, Tejun Heo wrote:
> S.Çağlar Onur wrote:
> > [ 4260.278427] ata1.00: cmd ca/00:08:d0:88:bc/00:00:00:00:00/ee tag 0 cdb
> > 0x0 data 4096 out
> > [ 4260.278430]  res 51/40:01:d7:88:bc/00:00:0e:00:00/ee Emask 0x9
> > (media error)
>
> That's media error on sector 247236823 on WRITE.  Media errors on write
> are bad signs - it usually means the drive even failed to remap the
> sector because extra space ran out. 

Hmm, more than 50GB is empty on disk :)

> I'm not sure this is the case here 
> tho - the smart log is clear.  Please run smart short/long tests and see
> what they say.

Both completed without a problem;

zangetsu ~ # smartctl -l selftest /dev/sda
smartctl version 5.37 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       357         -
# 2  Short offline       Completed without error       00%       355         -

If you want me to try something else please just say :)

Cheers
-- 
S.Çağlar Onur <[EMAIL PROTECTED]>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!



Re: SATA exceptions

2007-07-05 Thread Tejun Heo

Hello,

S.Çağlar Onur wrote:
> [ 4260.278427] ata1.00: cmd ca/00:08:d0:88:bc/00:00:00:00:00/ee tag 0 cdb 0x0 
> data 4096 out
> [ 4260.278430]  res 51/40:01:d7:88:bc/00:00:0e:00:00/ee Emask 0x9 
> (media error)

That's media error on sector 247236823 on WRITE.  Media errors on write
are bad signs - it usually means the drive even failed to remap the
sector because extra space ran out.  I'm not sure this is the case here
tho - the smart log is clear.  Please run smart short/long tests and see
what they say.
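
For readers following along, the sector number above can be recovered from the `res` line by hand. The sketch below does it in Python, assuming the usual libata layout of these dumps (command or status, then feature or error, sector count and the three low LBA bytes, then the five "hob" bytes, then the device register) and assuming a 28-bit command, which 0xca (WRITE DMA) is. The function and field names are illustrative only.

```python
LOW_FIELDS = ("feature", "nsect", "lbal", "lbam", "lbah")


def parse_taskfile(dump: str) -> dict:
    """Split a 'cmd' or 'res' dump such as 51/40:01:d7:88:bc/00:00:0e:00:00/ee."""
    head, low, high, device = dump.strip().split("/")
    # 'command' holds the command byte for cmd lines and the status byte for
    # res lines; 'feature' holds the error register for res lines.
    tf = {"command": int(head, 16), "device": int(device, 16)}
    for name, byte in zip(LOW_FIELDS, low.split(":")):
        tf[name] = int(byte, 16)
    for name, byte in zip(LOW_FIELDS, high.split(":")):
        tf["hob_" + name] = int(byte, 16)
    return tf


def lba28(tf: dict) -> int:
    """28-bit LBA: low nibble of the device register plus lbah:lbam:lbal."""
    return ((tf["device"] & 0x0F) << 24) | (tf["lbah"] << 16) \
        | (tf["lbam"] << 8) | tf["lbal"]


cmd = parse_taskfile("ca/00:08:d0:88:bc/00:00:00:00:00/ee")
res = parse_taskfile("51/40:01:d7:88:bc/00:00:0e:00:00/ee")
print(lba28(cmd), cmd["nsect"])  # start sector and sector count of the write
print(lba28(res))                # sector reported in the error result
```

Read this way, the write covers 8 sectors starting at 247236816 (4096 bytes at 512 bytes per sector, matching "data 4096 out"), and the sector reported in the result, 247236823, is the last sector of that request.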

-- 
tejun

Re: SATA exceptions with 2.6.20-rc5

2007-02-09 Thread Björn Steinbrink

On 2007.02.04 02:13:51 +0100, Björn Steinbrink wrote:
> On 2007.02.02 23:48:14 -0600, Robert Hancock wrote:
> > There's a patch in -mm (sata_nv-use-adma-for-nodata-commands.patch) 
> > which should hopefully avoid this problem for the cache flush commands, 
> > at least - can you try that one out? You'll have to apply the other 
> > sata_nv patches in -mm first, i.e. this order:
> > 
> > http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-cleanup-adma-error-handling-v2.patch
> > http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-cleanup-adma-error-handling-v2-cleanup.patch
> > http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-use-adma-for-nodata-commands.patch
> 
> Got 2.6.20-rc7 with them applied now (the rejects seemed trivial enough
> for me to fix them). Let's see how that works out...

After about 1.5 days of uptime, an involuntary reboot and another 3
days of uptime, no sign of an exception. No stress testing was done,
but a few disk intensive actions did happen, at least more than with
that -rc6 that did throw an exception at me.

Björn

Re: SATA exceptions with 2.6.20-rc5

2007-02-03 Thread Björn Steinbrink

On 2007.02.02 23:48:14 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >On 2007.01.24 01:39:23 +0100, Björn Steinbrink wrote:
> >>On 2007.01.23 17:18:43 -0600, Robert Hancock wrote:
> >>>Larry Walton wrote:
> >>>>The last patch (sata_nv-force-int-dev-in-interrupt.patch) 
> >>>>seems to have fix the problem.  Much appreciated, 
> >>>>thank you. I'd consider it a must have in 2.6.20.
> >>>Can any of the rest of you that have been seeing this problem also 
> >>>confirm that this fixes it?
> >>Seems to work for me, uptime is about an hour now and no exception yet.
> >>Had the stress test running for only about 10 minutes, but I usually got
> >>an exception within an hour even during plain irssi usage, so I'm quite
> >>confident that the patch fixes it.
> >
> >Or maybe not :( Just got an exception on 2.6.20-rc6. Took 4 days of
> >uptime to trigger, so it's just a lot harder to trigger now.
> 
> Same exception details as before?

Yes, exactly the same.

> There's a patch in -mm (sata_nv-use-adma-for-nodata-commands.patch) 
> which should hopefully avoid this problem for the cache flush commands, 
> at least - can you try that one out? You'll have to apply the other 
> sata_nv patches in -mm first, i.e. this order:
> 
> http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-cleanup-adma-error-handling-v2.patch
> http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-cleanup-adma-error-handling-v2-cleanup.patch
> http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-use-adma-for-nodata-commands.patch

Got 2.6.20-rc7 with them applied now (the rejects seemed trivial enough
for me to fix them). Let's see how that works out...

Björn

Re: SATA exceptions with 2.6.20-rc5

2007-02-02 Thread Robert Hancock


Björn Steinbrink wrote:
>On 2007.01.24 01:39:23 +0100, Björn Steinbrink wrote:
>>On 2007.01.23 17:18:43 -0600, Robert Hancock wrote:
>>>Larry Walton wrote:
>>>>The last patch (sata_nv-force-int-dev-in-interrupt.patch) 
>>>>seems to have fix the problem.  Much appreciated, 
>>>>thank you. I'd consider it a must have in 2.6.20.
>>>Can any of the rest of you that have been seeing this problem also 
>>>confirm that this fixes it?
>>Seems to work for me, uptime is about an hour now and no exception yet.
>>Had the stress test running for only about 10 minutes, but I usually got
>>an exception within an hour even during plain irssi usage, so I'm quite
>>confident that the patch fixes it.
>
>Or maybe not :( Just got an exception on 2.6.20-rc6. Took 4 days of
>uptime to trigger, so it's just a lot harder to trigger now.

Same exception details as before?

There's a patch in -mm (sata_nv-use-adma-for-nodata-commands.patch) 
which should hopefully avoid this problem for the cache flush commands, 
at least - can you try that one out? You'll have to apply the other 
sata_nv patches in -mm first, i.e. this order:


http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-cleanup-adma-error-handling-v2.patch
http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-cleanup-adma-error-handling-v2-cleanup.patch
http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-use-adma-for-nodata-commands.patch
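
Since the order matters, here is a small Python sketch that prints the download and apply commands for the three patches in the sequence listed above. It assumes the commands are run from the top of a 2.6.20-rc kernel source tree and that the patches apply with `patch -p1`, as is conventional for kernel patches; the function name is made up for illustration, and the script only builds the command strings rather than running them.

```python
BASE_URL = (
    "http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/"
    "2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/"
)

# Order matters: the two cleanup patches must go in before the ADMA change.
PATCHES = [
    "sata_nv-cleanup-adma-error-handling-v2.patch",
    "sata_nv-cleanup-adma-error-handling-v2-cleanup.patch",
    "sata_nv-use-adma-for-nodata-commands.patch",
]


def apply_commands(patches=PATCHES, base_url=BASE_URL):
    """Return the shell commands to fetch and apply each patch, in order."""
    commands = []
    for name in patches:
        commands.append(f"wget {base_url}{name}")
        commands.append(f"patch -p1 < {name}")
    return commands


print("\n".join(apply_commands()))
```

If a hunk is rejected on a tree other than 2.6.20-rc6-mm3, the reject has to be resolved by hand before moving on to the next patch.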

--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/


Re: SATA exceptions with 2.6.20-rc5

2007-02-02 Thread Björn Steinbrink

On 2007.01.24 01:39:23 +0100, Björn Steinbrink wrote:
> On 2007.01.23 17:18:43 -0600, Robert Hancock wrote:
> > Larry Walton wrote:
> > >The last patch (sata_nv-force-int-dev-in-interrupt.patch) 
> > >seems to have fix the problem.  Much appreciated, 
> > >thank you. I'd consider it a must have in 2.6.20.
> > 
> > Can any of the rest of you that have been seeing this problem also 
> > confirm that this fixes it?
> 
> Seems to work for me, uptime is about an hour now and no exception yet.
> Had the stress test running for only about 10 minutes, but I usually got
> an exception within an hour even during plain irssi usage, so I'm quite
> confident that the patch fixes it.

Or maybe not :( Just got an exception on 2.6.20-rc6. Took 4 days of
uptime to trigger, so it's just a lot harder to trigger now.

Björn

Re: SATA exceptions with 2.6.20-rc5

2007-01-24 Thread Björn Steinbrink

On 2007.01.24 09:24:00 +0100, Ian Kumlien wrote:
> On tis, 2007-01-23 at 17:18 -0600, Robert Hancock wrote:
> > Larry Walton wrote:
> > > The last patch (sata_nv-force-int-dev-in-interrupt.patch) 
> > > seems to have fix the problem.  Much appreciated, 
> > > thank you. I'd consider it a must have in 2.6.20.
> > 
> > Can any of the rest of you that have been seeing this problem also 
> > confirm that this fixes it?
> 
> I applied it yesterday and today my dmesg contains three:
> BUG: at mm/truncate.c:60 cancel_dirty_page()

David Chinner sent two patches regarding that bug yesterday.
http://lkml.org/lkml/2007/1/23/190
http://lkml.org/lkml/2007/1/23/192

Björn

Re: SATA exceptions with 2.6.20-rc5

2007-01-24 Thread Ian Kumlien

On tis, 2007-01-23 at 17:18 -0600, Robert Hancock wrote:
> Larry Walton wrote:
> > The last patch (sata_nv-force-int-dev-in-interrupt.patch) 
> > seems to have fix the problem.  Much appreciated, 
> > thank you. I'd consider it a must have in 2.6.20.
> 
> Can any of the rest of you that have been seeing this problem also 
> confirm that this fixes it?

I applied it yesterday and today my dmesg contains three:
BUG: at mm/truncate.c:60 cancel_dirty_page()

Call Trace:
 [8029f3e5] cancel_dirty_page+0x43/0x71
 [802ec1ab] reiserfs_cut_from_item+0x5f8/0x61d
 [802074fc] find_get_page+0x21/0x47
 [802ec51d] reiserfs_do_truncate+0x34d/0x495
 [802d9d47] reiserfs_truncate_file+0x199/0x2aa
 [802df9c5] reiserfs_file_release+0x261/0x281
 [80211b02] __fput+0xb1/0x17d
 [802218e0] filp_close+0x5d/0x65
 [8021bef5] sys_close+0x8c/0xcf
 [8025725e] system_call+0x7e/0x83

Which never happened before... I dunno if they are related though, but
they weren't there before...

(It does fix the timeout problem)
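
For anyone comparing traces like the one above: each frame is printed as `symbol+offset/size`, i.e. the function name, how far into it the address lies, and the function's total length, all in hex. A minimal Python sketch for splitting a frame into those parts follows; the regular expression and function name are illustrative, not taken from any kernel tool.

```python
import re

# symbol+0xOFFSET/0xSIZE, as printed in kernel call traces
FRAME = re.compile(
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_frame(line):
    """Return (symbol, offset, size) for one trace line, or None if absent."""
    match = FRAME.search(line)
    if match is None:
        return None
    return match["symbol"], int(match["offset"], 16), int(match["size"], 16)


for line in (" [] cancel_dirty_page+0x43/0x71",
             " [] reiserfs_do_truncate+0x34d/0x495",
             " [] __fput+0xb1/0x17d"):
    print(parse_frame(line))
```

Here the call path runs from `sys_close` through the reiserfs truncate code into `cancel_dirty_page`, which is where the BUG check at mm/truncate.c:60 fires.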

-- 
Ian Kumlien  -- http://pomac.netswarm.net



Re: SATA exceptions with 2.6.20-rc5

2007-01-23 Thread Björn Steinbrink

On 2007.01.23 17:18:43 -0600, Robert Hancock wrote:
> Larry Walton wrote:
> >The last patch (sata_nv-force-int-dev-in-interrupt.patch) 
> >seems to have fix the problem.  Much appreciated, 
> >thank you. I'd consider it a must have in 2.6.20.
> 
> Can any of the rest of you that have been seeing this problem also 
> confirm that this fixes it?

Seems to work for me, uptime is about an hour now and no exception yet.
Had the stress test running for only about 10 minutes, but I usually got
an exception within an hour even during plain irssi usage, so I'm quite
confident that the patch fixes it.

Thanks,
Björn

Re: SATA exceptions with 2.6.20-rc5

2007-01-23 Thread Robert Hancock


Larry Walton wrote:
> The last patch (sata_nv-force-int-dev-in-interrupt.patch) 
> seems to have fix the problem.  Much appreciated, 
> thank you. I'd consider it a must have in 2.6.20.


Can any of the rest of you that have been seeing this problem also 
confirm that this fixes it?


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/


Re: SATA exceptions with 2.6.20-rc5

2007-01-23 Thread Larry Walton

The last patch (sata_nv-force-int-dev-in-interrupt.patch) 
seems to have fix the problem.  Much appreciated, 
thank you. I'd consider it a must have in 2.6.20.


-- 
*--* Mail: [EMAIL PROTECTED]
*--* Voice: 206.892.6269
*--* Cell: 206.225.0154
*--* HTTP://real.com
--
- - - - - - - R e a l - - - - - - - -




Re: SATA exceptions with 2.6.20-rc5

2007-01-23 Thread Björn Steinbrink

On 2007.01.23 17:18:43 -0600, Robert Hancock wrote:
> Larry Walton wrote:
> > The last patch (sata_nv-force-int-dev-in-interrupt.patch)
> > seems to have fixed the problem.  Much appreciated,
> > thank you. I'd consider it a must-have in 2.6.20.
>
> Can any of the rest of you that have been seeing this problem also
> confirm that this fixes it?

Seems to work for me, uptime is about an hour now and no exception yet.
Had the stress test running for only about 10 minutes, but I usually got
an exception within an hour even during plain irssi usage, so I'm quite
confident that the patch fixes it.

Thanks,
Björn

Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Robert Hancock


Björn Steinbrink wrote:
> Hm, I don't think it is unhappy about looking at NV_INT_STATUS_CK804.
> I'm running 2.6.20-rc5 with the INT_DEV check removed for 8 hours now
> without a single problem and that should still look at
> NV_INT_STATUS_CK804, right?
> I just noticed that my last email might not have been clear enough. The
> exceptions happened when I re-enabled the return statement in addition
> to the debug message. Without the INT_DEV check, it is completely fine
> AFAICT.


Indeed, it seems to be just the NV_INT_DEV check that is problematic. 
Here's a patch that's likely better to test, it forces the NV_INT_DEV 
flag on when a command is active, and also fixes that questionable code 
in nv_host_intr that I mentioned.


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

--- linux-2.6.20-rc5/drivers/ata/sata_nv.c	2007-01-19 19:18:53.0 -0600
+++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c	2007-01-22 22:33:43.0 -0600
@@ -700,7 +700,6 @@ static void nv_adma_check_cpb(struct ata
 static int nv_host_intr(struct ata_port *ap, u8 irq_stat)
 {
 	struct ata_queued_cmd *qc = ata_qc_from_tag(ap, ap->active_tag);
-	int handled;
 
 	/* freeze if hotplugged */
 	if (unlikely(irq_stat & (NV_INT_ADDED | NV_INT_REMOVED))) {
@@ -719,13 +718,7 @@ static int nv_host_intr(struct ata_port 
 	}
 
 	/* handle interrupt */
-	handled = ata_host_intr(ap, qc);
-	if (unlikely(!handled)) {
-		/* spurious, clear it */
-		ata_check_status(ap);
-	}
-
-	return 1;
+	return ata_host_intr(ap, qc);
 }
 
 static irqreturn_t nv_adma_interrupt(int irq, void *dev_instance)
@@ -752,6 +745,11 @@ static irqreturn_t nv_adma_interrupt(int
 			if (pp->flags & NV_ADMA_PORT_REGISTER_MODE) {
 				u8 irq_stat = readb(host->mmio_base + NV_INT_STATUS_CK804)
 					>> (NV_INT_PORT_SHIFT * i);
+				if(ata_tag_valid(ap->active_tag))
+					/** NV_INT_DEV indication seems unreliable at times
+					    at least in ADMA mode. Force it on always when a
+					    command is active, to prevent losing interrupts. */
+					irq_stat |= NV_INT_DEV;
 				handled += nv_host_intr(ap, irq_stat);
 				continue;
 			}
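
For anyone who wants to see the idea of the last hunk in isolation, here is a
rough standalone sketch (not the driver code). It only models the bit
arithmetic: pull one port's bits out of the shared status byte, then force the
device-interrupt bit on when a command is outstanding so the "not our
interrupt" check cannot drop it. The SKETCH_* constants and the helper names
are made up for this illustration; the real values are whatever sata_nv.c
defines.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values only; the real definitions live in sata_nv.c. */
#define SKETCH_INT_DEV        0x01u
#define SKETCH_INT_PORT_SHIFT 4

/* Shift the shared status byte so the given port's bits end up lowest. */
static inline uint8_t port_irq_stat(uint8_t shared_status, unsigned port)
{
    return (uint8_t)(shared_status >> (SKETCH_INT_PORT_SHIFT * port));
}

/* Workaround from the patch: trust "a command is active" over the DEV bit. */
static inline uint8_t effective_irq_stat(uint8_t irq_stat, bool command_active)
{
    if (command_active)
        irq_stat |= SKETCH_INT_DEV;
    return irq_stat;
}

/* The "bail out if not our interrupt" test, expressed as a predicate. */
static inline bool would_handle(uint8_t irq_stat)
{
    return (irq_stat & SKETCH_INT_DEV) != 0;
}
```

With these assumed values, a status of 0x10 seen on the first port would be
dropped by the plain check, while the same status with a command active is
passed on to the normal interrupt handling.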

Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Björn Steinbrink

On 2007.01.22 19:24:22 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >>>Running a kernel with the return statement replaced by a line that prints
> >>>the irq_stat instead.
> >>>
> >>>Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.
> >>40 minutes stress test now and no exception yet. What's interesting is
> >>that ata1 saw exactly one interrupt with irq_stat 0x0, all others that
> >>might have gotten dropped are as above.
> >>I'll keep it running for some time and will then re-enable the return
> >>statement to see if there's a relation between the irq_stat 0x0 and the
> >>exception.
> >
> >No, doesn't seem to be related, did get 2 exceptions, but no irq_stat
> >0x0 for ata1. Syslog/dmesg has nothing new either, still the same
> >pattern of dismissed irq_stats.
> 
> I've finally managed to reproduce this problem on my box, by doing:
> 
> watch --interval=0.1 /sbin/hdparm -I /dev/sda
> 
> on one drive and then running bonnie++ on /dev/sdb connected to the 
> other port on the same controller device. Usually within a few minutes 
> one of the IDENTIFY commands would time out in the same way you guys 
> have been seeing.
> 
> Through some various trials and tribulations, the only conclusion I can 
> come to is that this controller really doesn't like that 
> NV_INT_STATUS_CK804 register being looked at in ADMA mode. I tried 
> adding some debug code to the qc_issue function that would check to see 
> if the BUSY flag in altstatus went high or that register showed an 
> interrupt within a certain time afterwards, however that really seemed 
> to hose things, the system wouldn't even boot.

Hm, I don't think it is unhappy about looking at NV_INT_STATUS_CK804.
I've been running 2.6.20-rc5 with the INT_DEV check removed for 8 hours now
without a single problem, and that should still look at
NV_INT_STATUS_CK804, right?
I just noticed that my last email might not have been clear enough. The
exceptions happened when I re-enabled the return statement in addition
to the debug message. Without the INT_DEV check, it is completely fine
AFAICT.

> Try out this patch, it just calls the ata_host_intr function where 
> appropriate without using nv_host_intr which looks at the 
> NV_INT_STATUS_CK804 register. This is what the original ADMA patch from 
> Mr. Mysterious NVIDIA Person did, I'm guessing there may be a reason for 
> that. With this patch I can get through a whole bonnie++ run with the 
> repeated IDENTIFY requests running without seeing the error.

I'll see if I can schedule a test run for tomorrow, I currently need
this box.

Thanks,
Björn

Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Robert Hancock


Alistair John Strachan wrote:
> On Tuesday 23 January 2007 01:24, Robert Hancock wrote:
> > As a final aside, this is another case where the hardware docs for this
> > controller would really be useful, in order to know whether we are
> > actually supposed to be reading that register in ADMA mode or not. I
> > sent a query to Allen Martin at NVIDIA asking if there's a way I could
> > get access to the documents, but I haven't heard anything yet.
>
> Obviously, NVIDIA's response is disappointing, but thank you for putting the
> time in to debug this problem. Definitely sounds like a hardware defect, I'm
> just glad there's a workaround.
>
> Will we see this fix in 2.6.20?

Hopefully, assuming it actually does fix the problem for those that have
been seeing it.


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/


Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Alistair John Strachan

On Tuesday 23 January 2007 01:24, Robert Hancock wrote:
> As a final aside, this is another case where the hardware docs for this
> controller would really be useful, in order to know whether we are
> actually supposed to be reading that register in ADMA mode or not. I
> sent a query to Allen Martin at NVIDIA asking if there's a way I could
> get access to the documents, but I haven't heard anything yet.

Obviously, NVIDIA's response is disappointing, but thank you for putting the 
time in to debug this problem. Definitely sounds like a hardware defect, I'm 
just glad there's a workaround.

Will we see this fix in 2.6.20?

-- 
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.

Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Robert Hancock


Björn Steinbrink wrote:
>>> Running a kernel with the return statement replaced by a line that prints
>>> the irq_stat instead.
>>>
>>> Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.
>> 40 minutes stress test now and no exception yet. What's interesting is
>> that ata1 saw exactly one interrupt with irq_stat 0x0, all others that
>> might have gotten dropped are as above.
>> I'll keep it running for some time and will then re-enable the return
>> statement to see if there's a relation between the irq_stat 0x0 and the
>> exception.
>
> No, doesn't seem to be related, did get 2 exceptions, but no irq_stat
> 0x0 for ata1. Syslog/dmesg has nothing new either, still the same
> pattern of dismissed irq_stats.


I've finally managed to reproduce this problem on my box, by doing:

watch --interval=0.1 /sbin/hdparm -I /dev/sda

on one drive and then running bonnie++ on /dev/sdb connected to the 
other port on the same controller device. Usually within a few minutes 
one of the IDENTIFY commands would time out in the same way you guys 
have been seeing.


Through some various trials and tribulations, the only conclusion I can 
come to is that this controller really doesn't like that 
NV_INT_STATUS_CK804 register being looked at in ADMA mode. I tried 
adding some debug code to the qc_issue function that would check to see 
if the BUSY flag in altstatus went high or that register showed an 
interrupt within a certain time afterwards, however that really seemed 
to hose things, the system wouldn't even boot.


Try out this patch, it just calls the ata_host_intr function where 
appropriate without using nv_host_intr which looks at the 
NV_INT_STATUS_CK804 register. This is what the original ADMA patch from 
Mr. Mysterious NVIDIA Person did, I'm guessing there may be a reason for 
that. With this patch I can get through a whole bonnie++ run with the 
repeated IDENTIFY requests running without seeing the error.


As an aside, there seems to be some dubious code in nv_host_intr, if 
ata_host_intr returns 0 for handled when a command is outstanding, it 
goes and calls ata_check_status anyway. This is rather dangerous since 
if an interrupt showed up right after ata_host_intr but before 
ata_check_status, the ata_check_status would clear it and we would 
forget about it. I tried fixing just that issue and still had this 
problem however. I suspect that code is truly broken and needs further 
thought, but this patch avoids calling it in the ADMA case, at any rate.
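
To make the window described above concrete, here is a toy model in plain C.
It is only a sketch under simplifying assumptions: a port is reduced to a
single pending-interrupt latch, toy_host_intr() and toy_check_status() are
made-up stand-ins for ata_host_intr() and ata_check_status(), and the
"interrupt arrives in the window" case is injected by a flag rather than by
real hardware timing.

```c
#include <stdbool.h>

/* Toy stand-in for a port: one pending-interrupt latch and a completion count. */
struct toy_port {
    bool irq_pending;
    int completions;
};

/* Stand-in for ata_host_intr(): completes a command only if an interrupt is pending. */
static inline int toy_host_intr(struct toy_port *p)
{
    if (!p->irq_pending)
        return 0;
    p->irq_pending = false;
    p->completions++;
    return 1;
}

/* Stand-in for ata_check_status(): reading status clears the pending interrupt. */
static inline void toy_check_status(struct toy_port *p)
{
    p->irq_pending = false;
}

/* Flow as described: nothing handled, so clear the supposedly spurious interrupt. */
static inline int toy_old_flow(struct toy_port *p, bool irq_arrives_in_window)
{
    int handled = toy_host_intr(p);
    if (irq_arrives_in_window)
        p->irq_pending = true;   /* the device raises its interrupt right here */
    if (!handled)
        toy_check_status(p);     /* ...and this read silently discards it */
    return 1;
}

/* One possible alternative: report what was handled and leave the latch alone. */
static inline int toy_new_flow(struct toy_port *p, bool irq_arrives_in_window)
{
    int handled = toy_host_intr(p);
    if (irq_arrives_in_window)
        p->irq_pending = true;
    return handled;
}
```

In the old flow the late interrupt is gone and no command was completed, which
is what would later show up as a timeout; in the alternative flow the latch
stays set and the next pass through the handler completes the command.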


As a final aside, this is another case where the hardware docs for this 
controller would really be useful, in order to know whether we are 
actually supposed to be reading that register in ADMA mode or not. I 
sent a query to Allen Martin at NVIDIA asking if there's a way I could 
get access to the documents, but I haven't heard anything yet.


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

--- linux-2.6.20-rc5/drivers/ata/sata_nv.c	2007-01-19 19:18:53.0 -0600
+++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c	2007-01-22 18:35:09.0 -0600
@@ -750,9 +750,9 @@ static irqreturn_t nv_adma_interrupt(int
 
 			/* if in ATA register mode, use standard ata interrupt handler */
 			if (pp->flags & NV_ADMA_PORT_REGISTER_MODE) {
-				u8 irq_stat = readb(host->mmio_base + NV_INT_STATUS_CK804)
-					>> (NV_INT_PORT_SHIFT * i);
-				handled += nv_host_intr(ap, irq_stat);
+				struct ata_queued_cmd *qc = ata_qc_from_tag(ap, ap->active_tag);
+				if(qc && !(qc->tf.flags & ATA_TFLAG_POLLING))
+					handled += ata_host_intr(ap, qc);
 				continue;
 			}

Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Eric D. Mudama

On 1/15/07, Jeff Garzik <[EMAIL PROTECTED]> wrote:
> Jens Axboe wrote:
> > On Mon, Jan 15 2007, Jeff Garzik wrote:
> >> Jens Axboe wrote:
> >>> I'd be surprised if the device would not obey the 7 second timeout rule
> >>> that seems to be set in stone and not allow more dirty in-drive cache
> >>> than it could flush out in approximately that time.
> >> AFAIK Windows flush-cache timeout is 30 seconds, not 7 as with other
> >> commands...
> >
> > Ok, 7 seconds for FLUSH_CACHE would have been nice for us too though, as
> > it would pretty much guarantee lower latencies for random writes and
> > write back caching. The concern is the barrier code, of course. I guess
> > I should do some timings on potential worst case patterns some day. Alan
> > may have done that sometime in the past, iirc.
>
> FWIW:  According to the drive guys (Eric M, among others), FLUSH CACHE
> will "probably" be under 30 seconds, but pathological cases might even
> extend beyond that.
>
> Definitely more than 7 seconds in less-than-pathological cases,
> unfortunately...

The mentioned Maxtor model (6Yxxx) isn't susceptible to the
large-buffer long completion times, due to architectural differences
and availability of only small buffers.  Any "real" long-completion
flush on this device would, I believe, involve damage to the disk that
hinders the ability to seek, settle, or write.  (e.g. 30-second
flushes are easy to hit if you mount the disk on a shaker-table with
sufficient amplitude)

Later in the thread I think people have pretty much isolated it as not
the disk's problem, but just wanted to point this out.

I assume that large enough customers can buy enterprise-type command
completion ("all commands within X seconds") from most any disk
vendor.  However, these firmwares require much smarter or more active
drivers or block layers, to handle the higher error rate when the data
on the device is valid, but it will take longer than allowed by the
arbitrary enterprise rules.  Most customers who are buying this many
devices have software engineers customizing the drivers or disk
management applications to handle this differing behavior.

--eric

Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Björn Steinbrink

On 2007.01.22 17:57:08 +0100, Björn Steinbrink wrote:
> On 2007.01.22 17:12:40 +0100, Björn Steinbrink wrote:
> > On 2007.01.21 18:17:01 -0600, Robert Hancock wrote:
> > > Hmm, another miss, apparently.. Has anyone tried removing these lines
> > > > from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?
> > > 
> > > /* bail out if not our interrupt */
> > > if (!(irq_stat & NV_INT_DEV))
> > > return 0;
> > 
> > Running a kernel with the return statement replaced by a line that prints
> > the irq_stat instead.
> > 
> > Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.
> 
> 40 minutes stress test now and no exception yet. What's interesting is
> that ata1 saw exactly one interrupt with irq_stat 0x0, all others that
> might have gotten dropped are as above.
> I'll keep it running for some time and will then re-enable the return
> statement to see if there's a relation between the irq_stat 0x0 and the
> exception.

No, doesn't seem to be related, did get 2 exceptions, but no irq_stat
0x0 for ata1. Syslog/dmesg has nothing new either, still the same
pattern of dismissed irq_stats.

Björn

Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Björn Steinbrink

On 2007.01.22 17:12:40 +0100, Björn Steinbrink wrote:
> On 2007.01.21 18:17:01 -0600, Robert Hancock wrote:
> > Björn Steinbrink wrote:
> > >On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> > >>Björn Steinbrink wrote:
> > >>>All kernels were bad using that approach. So back to square 1. :/
> > >>>
> > >>>Björn
> > >>>
> > >>OK guys, here's a new patch to try against 2.6.20-rc5:
> > >>
> > >>Right now when switching between ADMA mode and legacy mode (i.e. when 
> > >>going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just 
> > >>set the ADMA GO register bit appropriately and continue with no delay. 
> > >>It looks like in some cases the controller doesn't respond to this 
> > >>immediately, it takes some nanoseconds for the controller's status 
> > >>registers to reflect the change that was made. It's possible that if we 
> > >>were trying to issue commands during this time, the controller might not 
> > >>react properly. This patch adds some code to wait for the status 
> > >>register to change to the state we asked for before continuing.
> > >
> > >Just got two exceptions with your patch, none of the debug messages were
> > >issued.
> > >
> > >Björn
> > 
> > Hmm, another miss, apparently.. Has anyone tried removing these lines
> > > from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?
> > 
> > /* bail out if not our interrupt */
> > if (!(irq_stat & NV_INT_DEV))
> > return 0;
> 
> Running a kernel with the return statement replaced by a line that prints
> the irq_stat instead.
> 
> Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.

40 minutes of stress testing now and no exception yet. What's interesting is
that ata1 saw exactly one interrupt with irq_stat 0x0, all others that
might have gotten dropped are as above.
I'll keep it running for some time and will then re-enable the return
statement to see if there's a relation between the irq_stat 0x0 and the
exception.

Björn

Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Björn Steinbrink

On 2007.01.21 18:17:01 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> >>Björn Steinbrink wrote:
> >>>All kernels were bad using that approach. So back to square 1. :/
> >>>
> >>>Björn
> >>>
> >>OK guys, here's a new patch to try against 2.6.20-rc5:
> >>
> >>Right now when switching between ADMA mode and legacy mode (i.e. when 
> >>going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just 
> >>set the ADMA GO register bit appropriately and continue with no delay. 
> >>It looks like in some cases the controller doesn't respond to this 
> >>immediately, it takes some nanoseconds for the controller's status 
> >>registers to reflect the change that was made. It's possible that if we 
> >>were trying to issue commands during this time, the controller might not 
> >>react properly. This patch adds some code to wait for the status 
> >>register to change to the state we asked for before continuing.
> >
> >Just got two exceptions with your patch, none of the debug messages were
> >issued.
> >
> >Björn
> 
> Hmm, another miss, apparently.. Has anyone tried removing these lines
> from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?
> 
> /* bail out if not our interrupt */
> if (!(irq_stat & NV_INT_DEV))
> return 0;

Running a kernel with the return statement replaced by a line that prints
the irq_stat instead.

Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.

Björn

Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Chr

On Monday, 22. January 2007 03:39, Tejun Heo wrote:
> Hello,
> 
> Chr wrote:
> > Ok, you won't believe this... I opened my case and rewired my drives... 
> > And guess what, my second (aka the "good") HDD is now failing! 
> > I guess, my mainboard has a (but maybe two, or three :( ) "bad" 
> > sata-port(s)!  
> 
> Or, you have power related problem.  Try to rewire the power lines or 
> connect harddrives to a separate powersupply.  It's often useful to 
> change one component at a time and watch which change the problem 
> follows.  Anyways, you seem to be suffering transmission failures, not a 
> driver problem.
> 
> Thanks.
> 

Yes and no, it's probably not a power problem, I've tried another
PSU with the same result :( . Furthermore, the RAID0 setup makes
it impossible to try only one drive alone :(. 

Anyway, the WD2500KS is known to have some strange bugs in the FW,
e.g. it reports 255°C right after a cold start
( http://www.bugtrack.almico.com/view.php?id=468 ).

Thanks,
Chr.
 

Re: SATA exceptions triggered by XFS (since 2.6.18)

2007-01-22 Thread Paolo Ornati

On Mon, 22 Jan 2007 18:35:05 +0900
Tejun Heo <[EMAIL PROTECTED]> wrote:

> Yeap, certainly.  I'll ask people first before actually proceeding with 
> the blacklisting.  I'm just getting a bit tired of tides of NCQ firmware 
> problems.

Another interesting thing: it seems that I'm unable to reproduce the
problem mounting XFS with "nobarrier" (using sda queue_depth = 31).

So it looks like a problem with NCQ combined with cache flush command...

-- 
Paolo Ornati
Linux 2.6.20-rc5 on x86_64

Re: SATA exceptions triggered by XFS (since 2.6.18)

2007-01-22 Thread Paolo Ornati

On Mon, 22 Jan 2007 18:35:05 +0900
Tejun Heo <[EMAIL PROTECTED]> wrote:

> Yeap, certainly.  I'll ask people first before actually proceeding with 
> the blacklisting.  I'm just getting a bit tired of tides of NCQ firmware 
> problems.
> 
> Anyways, for the time being, you can easily turn off NCQ using sysfs. 
> Please take a look at http://linux-ata.org/faq.html

ok

-- 
Paolo Ornati
Linux 2.6.20-rc5 on x86_64

Re: SATA exceptions triggered by XFS (since 2.6.18)

2007-01-22 Thread Tejun Heo


Paolo Ornati wrote:
>>> === START OF INFORMATION SECTION ===
>>> Model Family: Seagate Barracuda 7200.7 and 7200.7 Plus family
>>> Device Model: ST380817AS
>> I'll blacklist it.  Thanks.
>
> Ok. It will be better if someone else with the same HD could confirm.
>
> It looks so strange that an HD that works fine, and should support NCQ,
> has such big trouble that I can "freeze" it in less than a second by
> using XFS (while with ext3 I cannot, or at least it's very hard).


Yeap, certainly.  I'll ask people first before actually proceeding with 
the blacklisting.  I'm just getting a bit tired of tides of NCQ firmware 
problems.


Anyways, for the time being, you can easily turn off NCQ using sysfs. 
Please take a look at http://linux-ata.org/faq.html
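
For reference, a minimal sketch of the sysfs approach in C. It assumes the
attribute is the usual /sys/block/<dev>/device/queue_depth file and that
writing a depth of 1 is what effectively turns NCQ off for that device; the
helper name write_queue_depth is made up here, and the FAQ linked above
remains the authoritative description.

```c
#include <stdio.h>

/*
 * Write a queue depth to a sysfs-style attribute file.
 * Returns 0 on success, -1 if the file cannot be opened or written.
 * Needs root when pointed at a real sysfs attribute.
 */
static inline int write_queue_depth(const char *attr_path, int depth)
{
    FILE *f = fopen(attr_path, "w");
    if (!f)
        return -1;

    int ok = fprintf(f, "%d\n", depth) > 0;

    /* fclose() flushes the buffer, so a failed write can surface here too. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

Under that assumption, something like
write_queue_depth("/sys/block/sda/device/queue_depth", 1) would limit sda to
one outstanding command, and writing 31 back would restore the depth used in
the tests earlier in this thread.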


--
tejun

Re: SATA exceptions triggered by XFS (since 2.6.18)

2007-01-22 Thread Paolo Ornati

On Mon, 22 Jan 2007 11:46:01 +0900
Tejun Heo <[EMAIL PROTECTED]> wrote:

> > I don't know. It's a two years old ST380817AS.
> > 
> > # smartctl -a -d ata /dev/sda
> > 
> > smartctl version 5.36 [x86_64-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
> > Home page is http://smartmontools.sourceforge.net/
> > 
> > === START OF INFORMATION SECTION ===
> > Model Family: Seagate Barracuda 7200.7 and 7200.7 Plus family
> > Device Model: ST380817AS
> 
> I'll blacklist it.  Thanks.

Ok. It will be better if someone else with the same HD could confirm.

It looks so strange that an HD that works fine, and should support NCQ,
has such big trouble that I can "freeze" it in less than a second by
using XFS (while with ext3 I cannot, or at least it's very hard).

-- 
Paolo Ornati
Linux 2.6.20-rc5 on x86_64

Re: SATA exceptions triggered by XFS (since 2.6.18)

2007-01-22 Thread Paolo Ornati

On Mon, 22 Jan 2007 01:53:21 +0059
Jiri Slaby <[EMAIL PROTECTED]> wrote:

> >>   7 Seek_Error_Rate         0x000f   083   060   030    Pre-fail  Always       -       204305750
> >>   1 Raw_Read_Error_Rate     0x000f   059   049   006    Pre-fail  Always       -       215927244
> >> 195 Hardware_ECC_Recovered  0x001a   059   049   000    Old_age   Always       -       215927244
> > 
> > Wow! that HDD is really in a bad condition.
> 
> I don't think so, this seems to be normal for Seagate drives...

I agree.

For Chr: I don't think these big raw numbers are counters; look at the
normalized values instead, and see that they are greater than the THRESH
values (so they are good).

The meaning of raw-numbers is vendor specific.
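
As a small illustration of that rule of thumb, here is a sketch in C that
ignores the raw value entirely and compares only the normalized VALUE column
against THRESH. It assumes the common convention that an attribute counts as
failing when VALUE is at or below a non-zero THRESH; the struct and function
names are invented for this example and are not smartctl's.

```c
#include <stdbool.h>

/* One row of the attribute table, normalized columns only. */
struct smart_attr {
    int id;      /* attribute ID, e.g. 7 for Seek_Error_Rate */
    int value;   /* normalized current value */
    int worst;   /* lowest normalized value seen so far */
    int thresh;  /* failure threshold; 0 is treated as "never fails" */
};

/* Assumed convention: failing when value <= thresh and thresh is non-zero. */
static inline bool smart_attr_failing(const struct smart_attr *a)
{
    return a->thresh > 0 && a->value <= a->thresh;
}

/* Same test against the worst value, to see whether it ever failed. */
static inline bool smart_attr_failed_in_past(const struct smart_attr *a)
{
    return a->thresh > 0 && a->worst <= a->thresh;
}
```

Fed with the three rows quoted above (083/060/030, 059/049/006 and
059/049/000), neither check trips, which matches the reading that the drive
is fine despite the huge raw numbers.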

-- 
Paolo Ornati
Linux 2.6.20-rc5 on x86_64

Re: SATA exceptions triggered by XFS (since 2.6.18)

2007-01-22 Thread Paolo Ornati

On Mon, 22 Jan 2007 01:53:21 +0059
Jiri Slaby [EMAIL PROTECTED] wrote:

7 Seek_Error_Rate 0x000f   083   060   030Pre-fail  Always   
  -   204305750
1 Raw_Read_Error_Rate 0x000f   059   049   006Pre-fail  Always   
  -   215927244
  195 Hardware_ECC_Recovered  0x001a   059   049   000Old_age   Always   
  -   215927244 
  
  Wow! that HDD is really in a bad condition.
 
 I don't think so, this seems to be normal for Seagate drives...

I agree.

For Chr: I don't think these big raw-numbers are counters, look at the
normalized values instead, and see that they are greater than TRESH
values (so they are good).

The meaning of raw-numbers is vendor specific.

-- 
Paolo Ornati
Linux 2.6.20-rc5 on x86_64
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: SATA exceptions triggered by XFS (since 2.6.18)

2007-01-22 Thread Paolo Ornati

On Mon, 22 Jan 2007 11:46:01 +0900
Tejun Heo [EMAIL PROTECTED] wrote:

  I don't know. It's a two years old ST380817AS.
  
  # smartctl -a -d ata /dev/sda
  
  smartctl version 5.36 [x86_64-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
  Home page is http://smartmontools.sourceforge.net/
  
  === START OF INFORMATION SECTION ===
  Model Family: Seagate Barracuda 7200.7 and 7200.7 Plus family
  Device Model: ST380817AS
 
 I'll blacklist it.  Thanks.

Ok. It will be better if someone else with the same HD could confirm.

It looks so strange that an HD that works fine, and should support NCQ,
have so big troubles that I can freeze it in less than a second by
using XFS (while with ext3 I cannot, or at least it's very hard).

-- 
Paolo Ornati
Linux 2.6.20-rc5 on x86_64
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: SATA exceptions triggered by XFS (since 2.6.18)

2007-01-22 Thread Tejun Heo


Paolo Ornati wrote:

=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.7 and 7200.7 Plus family
Device Model: ST380817AS

I'll blacklist it.  Thanks.


Ok. It will be better if someone else with the same HD could confirm.

It looks so strange that an HD that works fine, and should support NCQ,
have so big troubles that I can freeze it in less than a second by
using XFS (while with ext3 I cannot, or at least it's very hard).


Yeap, certainly.  I'll ask people first before actually proceeding with 
the blacklisting.  I'm just getting a bit tired of tides of NCQ firmware 
problems.


Anyways, for the time being, you can easily turn off NCQ using sysfs. 
Please take a look at http://linux-ata.org/faq.html


--
tejun
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: SATA exceptions triggered by XFS (since 2.6.18)

2007-01-22 Thread Paolo Ornati

On Mon, 22 Jan 2007 18:35:05 +0900
Tejun Heo [EMAIL PROTECTED] wrote:

 Yeap, certainly.  I'll ask people first before actually proceeding with 
 the blacklisting.  I'm just getting a bit tired of tides of NCQ firmware 
 problems.
 
 Anyways, for the time being, you can easily turn off NCQ using sysfs. 
 Please take a look at http://linux-ata.org/faq.html

ok

-- 
Paolo Ornati
Linux 2.6.20-rc5 on x86_64
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: SATA exceptions triggered by XFS (since 2.6.18)

2007-01-22 Thread Paolo Ornati

On Mon, 22 Jan 2007 18:35:05 +0900
Tejun Heo [EMAIL PROTECTED] wrote:

 Yeap, certainly.  I'll ask people first before actually proceeding with 
 the blacklisting.  I'm just getting a bit tired of tides of NCQ firmware 
 problems.

Another interesting thing: it seems that I'm unable to reproduce the
problem mounting XFS with nobarrier (using sda queue_depth = 31).

So it looks like a problem with NCQ combined with cache flush command...

-- 
Paolo Ornati
Linux 2.6.20-rc5 on x86_64
-
To unsubscribe from this list: send the line unsubscribe linux-kernel in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Chr

On Monday, 22. January 2007 03:39, Tejun Heo wrote:
 Hello,
 
 Chr wrote:
  Ok, you won't believe this... I opened my case and rewired my drives... 
  And guess what, my second (aka the good) HDD is now failing! 
  I guess, my mainboard has a (but maybe two, or three :( ) bad 
  sata-port(s)!  
 
 Or, you have power related problem.  Try to rewire the power lines or 
 connect harddrives to a separate powersupply.  It's often useful to 
 change one component at a time and watch which change the problem 
 follows.  Anyways, you seem to be suffering transmission failures, not a 
 driver problem.
 
 Thanks.
 

Yes and no, it's probably not a power problem, I've tried another
PSU with the same result :( . Futhermore, the RAID0 setup makes
it impossible to try only one drive alone :(. 

Anyway,the WD2500KS is known to have some strange bugs in the FW.
e.g.: It reports 255°C right after a cold start. 
( http://www.bugtrack.almico.com/view.php?id=468 ).

Thanks,
Chr.
 

Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Björn Steinbrink

On 2007.01.21 18:17:01 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> > On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> > > Björn Steinbrink wrote:
> > > > All kernels were bad using that approach. So back to square 1. :/
> > > >
> > > > Björn
> > >
> > > OK guys, here's a new patch to try against 2.6.20-rc5:
> > >
> > > Right now when switching between ADMA mode and legacy mode (i.e. when
> > > going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just
> > > set the ADMA GO register bit appropriately and continue with no delay.
> > > It looks like in some cases the controller doesn't respond to this
> > > immediately, it takes some nanoseconds for the controller's status
> > > registers to reflect the change that was made. It's possible that if we
> > > were trying to issue commands during this time, the controller might not
> > > react properly. This patch adds some code to wait for the status
> > > register to change to the state we asked for before continuing.
> >
> > Just got two exceptions with your patch, none of the debug messages were
> > issued.
> >
> > Björn
>
> Hmm, another miss, apparently.. Has anyone tried removing these lines
> from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?
>
> /* bail out if not our interrupt */
> if (!(irq_stat & NV_INT_DEV))
>         return 0;

Running a kernel with the return statement replaced by a line that prints
the irq_stat instead.

Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.

Björn

Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Björn Steinbrink

On 2007.01.22 17:12:40 +0100, Björn Steinbrink wrote:
> On 2007.01.21 18:17:01 -0600, Robert Hancock wrote:
> > Björn Steinbrink wrote:
> > > On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> > > > Björn Steinbrink wrote:
> > > > > All kernels were bad using that approach. So back to square 1. :/
> > > > >
> > > > > Björn
> > > >
> > > > OK guys, here's a new patch to try against 2.6.20-rc5:
> > > >
> > > > Right now when switching between ADMA mode and legacy mode (i.e. when
> > > > going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just
> > > > set the ADMA GO register bit appropriately and continue with no delay.
> > > > It looks like in some cases the controller doesn't respond to this
> > > > immediately, it takes some nanoseconds for the controller's status
> > > > registers to reflect the change that was made. It's possible that if we
> > > > were trying to issue commands during this time, the controller might not
> > > > react properly. This patch adds some code to wait for the status
> > > > register to change to the state we asked for before continuing.
> > >
> > > Just got two exceptions with your patch, none of the debug messages were
> > > issued.
> > >
> > > Björn
> >
> > Hmm, another miss, apparently.. Has anyone tried removing these lines
> > from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?
> >
> > /* bail out if not our interrupt */
> > if (!(irq_stat & NV_INT_DEV))
> >         return 0;
>
> Running a kernel with the return statement replaced by a line that prints
> the irq_stat instead.
>
> Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.

40 minutes stress test now and no exception yet. What's interesting is
that ata1 saw exactly one interrupt with irq_stat 0x0, all others that
might have been dropped are as above.
I'll keep it running for some time and will then re-enable the return
statement to see if there's a relation between the irq_stat 0x0 and the
exception.

Björn

Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Björn Steinbrink

On 2007.01.22 17:57:08 +0100, Björn Steinbrink wrote:
> On 2007.01.22 17:12:40 +0100, Björn Steinbrink wrote:
> > On 2007.01.21 18:17:01 -0600, Robert Hancock wrote:
> > > Hmm, another miss, apparently.. Has anyone tried removing these lines
> > > from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?
> > >
> > > /* bail out if not our interrupt */
> > > if (!(irq_stat & NV_INT_DEV))
> > >         return 0;
> >
> > Running a kernel with the return statement replaced by a line that prints
> > the irq_stat instead.
> >
> > Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.
>
> 40 minutes stress test now and no exception yet. What's interesting is
> that ata1 saw exactly one interrupt with irq_stat 0x0, all others that
> might have been dropped are as above.
> I'll keep it running for some time and will then re-enable the return
> statement to see if there's a relation between the irq_stat 0x0 and the
> exception.

No, doesn't seem to be related, did get 2 exceptions, but no irq_stat
0x0 for ata1. Syslog/dmesg has nothing new either, still the same
pattern of dismissed irq_stats.

Björn

Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Eric D. Mudama


On 1/15/07, Jeff Garzik [EMAIL PROTECTED] wrote:

> Jens Axboe wrote:
> > On Mon, Jan 15 2007, Jeff Garzik wrote:
> > > Jens Axboe wrote:
> > > > I'd be surprised if the device would not obey the 7 second timeout rule
> > > > that seems to be set in stone and not allow more dirty in-drive cache
> > > > than it could flush out in approximately that time.
> > > AFAIK Windows flush-cache timeout is 30 seconds, not 7 as with other
> > > commands...
> >
> > Ok, 7 seconds for FLUSH_CACHE would have been nice for us too though, as
> > it would pretty much guarantee lower latencies for random writes and
> > write back caching. The concern is the barrier code, of course. I guess
> > I should do some timings on potential worst case patterns some day. Alan
> > may have done that sometime in the past, iirc.
>
> FWIW:  According to the drive guys (Eric M, among others), FLUSH CACHE
> will probably be under 30 seconds, but pathological cases might even
> extend beyond that.
>
> Definitely more than 7 seconds in less-than-pathological cases,
> unfortunately...


The mentioned Maxtor model (6Yxxx) isn't susceptible to the
large-buffer long completion times, due to architectural differences
and availability of only small buffers.  Any real long-completion
flush on this device would, I believe, involve damage to the disk that
hinders the ability to seek, settle, or write.  (e.g. 30-second
flushes are easy to hit if you mount the disk on a shaker-table with
sufficient amplitude)

Later in the thread I think people have pretty much isolated it as not
the disk's problem, but just wanted to point this out.

I assume that large enough customers can buy enterprise-type command
completion (all commands within X seconds) from most any disk
vendor.  However, these firmwares require much smarter or more active
drivers or block layers, to handle the higher error rate when the data
on the device is valid, but it will take longer than allowed by the
arbitrary enterprise rules.  Most customers who are buying this many
devices have software engineers customizing the drivers or disk
management applications to handle this differing behavior.

--eric
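
For readers following the timeout discussion, here is a minimal sketch
of choosing a completion timeout per command. It is not libata code:
cmd_timeout_secs and the two *_TIMEOUT_SECS constants are hypothetical
names, and the 7 and 30 second figures are simply the ones quoted in
this thread. The opcodes 0xE7 (FLUSH CACHE) and 0xEA (FLUSH CACHE EXT)
are the standard ATA values.

```c
#include <stdint.h>

/* Standard ATA opcodes for the two cache-flush commands. */
#define ATA_CMD_FLUSH      0xE7
#define ATA_CMD_FLUSH_EXT  0xEA

/* Figures quoted in the thread, not taken from a specification. */
#define DEFAULT_TIMEOUT_SECS  7
#define FLUSH_TIMEOUT_SECS    30

/* Hypothetical helper: pick a completion timeout for an ATA command. */
static unsigned int cmd_timeout_secs(uint8_t command)
{
	switch (command) {
	case ATA_CMD_FLUSH:
	case ATA_CMD_FLUSH_EXT:
		/* A flush may have to write out the whole dirty cache. */
		return FLUSH_TIMEOUT_SECS;
	default:
		return DEFAULT_TIMEOUT_SECS;
	}
}
```

With this, cmd_timeout_secs(0xE7) gives the longer flush timeout, while
an ordinary command such as 0x25 (READ DMA EXT) gets the default.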

Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Robert Hancock


Björn Steinbrink wrote:

> > > Running a kernel with the return statement replaced by a line that prints
> > > the irq_stat instead.
> > >
> > > Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.
> >
> > 40 minutes stress test now and no exception yet. What's interesting is
> > that ata1 saw exactly one interrupt with irq_stat 0x0, all others that
> > might have been dropped are as above.
> > I'll keep it running for some time and will then re-enable the return
> > statement to see if there's a relation between the irq_stat 0x0 and the
> > exception.
>
> No, doesn't seem to be related, did get 2 exceptions, but no irq_stat
> 0x0 for ata1. Syslog/dmesg has nothing new either, still the same
> pattern of dismissed irq_stats.


I've finally managed to reproduce this problem on my box, by doing:

watch --interval=0.1 /sbin/hdparm -I /dev/sda

on one drive and then running bonnie++ on /dev/sdb connected to the 
other port on the same controller device. Usually within a few minutes 
one of the IDENTIFY commands would time out in the same way you guys 
have been seeing.


Through some various trials and tribulations, the only conclusion I can 
come to is that this controller really doesn't like that 
NV_INT_STATUS_CK804 register being looked at in ADMA mode. I tried 
adding some debug code to the qc_issue function that would check to see 
if the BUSY flag in altstatus went high or that register showed an 
interrupt within a certain time afterwards, however that really seemed 
to hose things, the system wouldn't even boot.


Try out this patch, it just calls the ata_host_intr function where 
appropriate without using nv_host_intr which looks at the 
NV_INT_STATUS_CK804 register. This is what the original ADMA patch from 
Mr. Mysterious NVIDIA Person did, I'm guessing there may be a reason for 
that. With this patch I can get through a whole bonnie++ run with the 
repeated IDENTIFY requests running without seeing the error.


As an aside, there seems to be some dubious code in nv_host_intr: if
ata_host_intr returns 0 (not handled) while a command is outstanding, it
goes and calls ata_check_status anyway. This is rather dangerous, since
if an interrupt showed up right after ata_host_intr but before
ata_check_status, the ata_check_status call would clear it and we would
forget about it. I tried fixing just that issue and still had this
problem, however. I suspect that code is truly broken and needs further
thought, but this patch avoids calling it in the ADMA case, at any rate.


As a final aside, this is another case where the hardware docs for this 
controller would really be useful, in order to know whether we are 
actually supposed to be reading that register in ADMA mode or not. I 
sent a query to Allen Martin at NVIDIA asking if there's a way I could 
get access to the documents, but I haven't heard anything yet.


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove nospam from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

--- linux-2.6.20-rc5/drivers/ata/sata_nv.c  2007-01-19 19:18:53.0 -0600
+++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c 2007-01-22 18:35:09.0 -0600
@@ -750,9 +750,9 @@ static irqreturn_t nv_adma_interrupt(int
 
 		/* if in ATA register mode, use standard ata interrupt handler */
 		if (pp->flags & NV_ADMA_PORT_REGISTER_MODE) {
-			u8 irq_stat = readb(host->mmio_base + NV_INT_STATUS_CK804)
-				>> (NV_INT_PORT_SHIFT * i);
-			handled += nv_host_intr(ap, irq_stat);
+			struct ata_queued_cmd *qc = ata_qc_from_tag(ap, ap->active_tag);
+			if(qc && !(qc->tf.flags & ATA_TFLAG_POLLING))
+				handled += ata_host_intr(ap, qc);
 			continue;
 		}
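
The race described in the "As an aside" paragraph can be illustrated
with a small user-space model. This is only a sketch under stated
assumptions: every toy_* name is hypothetical, and the only hardware
behaviour modelled is that reading the status register acknowledges a
pending interrupt. The late_irq flag simulates an interrupt arriving in
the window between the handler call and the status read.

```c
#include <stdbool.h>

/* Toy model of a port: reading the status register acknowledges
 * (clears) any pending device interrupt. */
struct toy_port {
	bool irq_pending;
	int completions;	/* interrupts the handler really processed */
};

/* Models the handler proper: processes an interrupt if one is pending. */
static int toy_host_intr(struct toy_port *p)
{
	if (!p->irq_pending)
		return 0;	/* nothing to do: reported as "not handled" */
	p->irq_pending = false;
	p->completions++;
	return 1;
}

/* Models a bare status read: clears the interrupt without processing it. */
static void toy_check_status(struct toy_port *p)
{
	p->irq_pending = false;
}

/* Questionable pattern: on "not handled", read status to clear a
 * presumed spurious interrupt. */
static int toy_old_pattern(struct toy_port *p, bool late_irq)
{
	int handled = toy_host_intr(p);

	if (late_irq)
		p->irq_pending = true;	/* interrupt arrives in the window */
	if (!handled)
		toy_check_status(p);	/* may swallow the late interrupt */
	return 1;
}

/* Alternative: return the handler's result and leave status alone. */
static int toy_new_pattern(struct toy_port *p, bool late_irq)
{
	int handled = toy_host_intr(p);

	if (late_irq)
		p->irq_pending = true;
	return handled;
}
```

In the old pattern a late interrupt is cleared without ever being
counted as a completion; in the alternative it stays pending and the
next handler invocation picks it up.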

Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Alistair John Strachan

On Tuesday 23 January 2007 01:24, Robert Hancock wrote:
> As a final aside, this is another case where the hardware docs for this
> controller would really be useful, in order to know whether we are
> actually supposed to be reading that register in ADMA mode or not. I
> sent a query to Allen Martin at NVIDIA asking if there's a way I could
> get access to the documents, but I haven't heard anything yet.

Obviously, NVIDIA's response is disappointing, but thank you for putting the 
time in to debug this problem. Definitely sounds like a hardware defect, I'm 
just glad there's a workaround.

Will we see this fix in 2.6.20?

-- 
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.

Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Robert Hancock


Alistair John Strachan wrote:

> On Tuesday 23 January 2007 01:24, Robert Hancock wrote:
> > As a final aside, this is another case where the hardware docs for this
> > controller would really be useful, in order to know whether we are
> > actually supposed to be reading that register in ADMA mode or not. I
> > sent a query to Allen Martin at NVIDIA asking if there's a way I could
> > get access to the documents, but I haven't heard anything yet.
>
> Obviously, NVIDIA's response is disappointing, but thank you for putting the
> time in to debug this problem. Definitely sounds like a hardware defect, I'm
> just glad there's a workaround.
>
> Will we see this fix in 2.6.20?


Hopefully, assuming it actually does fix the problem for those that have 
been seeing it..


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove nospam from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/


Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Björn Steinbrink

On 2007.01.22 19:24:22 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> > > > Running a kernel with the return statement replaced by a line that prints
> > > > the irq_stat instead.
> > > >
> > > > Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.
> > > 40 minutes stress test now and no exception yet. What's interesting is
> > > that ata1 saw exactly one interrupt with irq_stat 0x0, all others that
> > > might have been dropped are as above.
> > > I'll keep it running for some time and will then re-enable the return
> > > statement to see if there's a relation between the irq_stat 0x0 and the
> > > exception.
> >
> > No, doesn't seem to be related, did get 2 exceptions, but no irq_stat
> > 0x0 for ata1. Syslog/dmesg has nothing new either, still the same
> > pattern of dismissed irq_stats.
>
> I've finally managed to reproduce this problem on my box, by doing:
>
> watch --interval=0.1 /sbin/hdparm -I /dev/sda
>
> on one drive and then running bonnie++ on /dev/sdb connected to the
> other port on the same controller device. Usually within a few minutes
> one of the IDENTIFY commands would time out in the same way you guys
> have been seeing.
>
> Through some various trials and tribulations, the only conclusion I can
> come to is that this controller really doesn't like that
> NV_INT_STATUS_CK804 register being looked at in ADMA mode. I tried
> adding some debug code to the qc_issue function that would check to see
> if the BUSY flag in altstatus went high or that register showed an
> interrupt within a certain time afterwards, however that really seemed
> to hose things, the system wouldn't even boot.

Hm, I don't think it is unhappy about looking at NV_INT_STATUS_CK804.
I'm running 2.6.20-rc5 with the INT_DEV check removed for 8 hours now
without a single problem and that should still look at
NV_INT_STATUS_CK804, right?
I just noticed that my last email might not have been clear enough. The
exceptions happened when I re-enabled the return statement in addition
to the debug message. Without the INT_DEV check, it is completely fine
AFAICT.

> Try out this patch, it just calls the ata_host_intr function where
> appropriate without using nv_host_intr which looks at the
> NV_INT_STATUS_CK804 register. This is what the original ADMA patch from
> Mr. Mysterious NVIDIA Person did, I'm guessing there may be a reason for
> that. With this patch I can get through a whole bonnie++ run with the
> repeated IDENTIFY requests running without seeing the error.

I'll see if I can schedule a test run for tomorrow, I currently need
this box.

Thanks,
Björn

Re: SATA exceptions with 2.6.20-rc5

2007-01-22 Thread Robert Hancock


Björn Steinbrink wrote:

> Hm, I don't think it is unhappy about looking at NV_INT_STATUS_CK804.
> I'm running 2.6.20-rc5 with the INT_DEV check removed for 8 hours now
> without a single problem and that should still look at
> NV_INT_STATUS_CK804, right?
> I just noticed that my last email might not have been clear enough. The
> exceptions happened when I re-enabled the return statement in addition
> to the debug message. Without the INT_DEV check, it is completely fine
> AFAICT.


Indeed, it seems to be just the NV_INT_DEV check that is problematic.
Here's a patch that's likely better to test: it forces the NV_INT_DEV
flag on when a command is active, and also fixes that questionable code
in nv_host_intr that I mentioned.


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove nospam from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

--- linux-2.6.20-rc5/drivers/ata/sata_nv.c  2007-01-19 19:18:53.0 -0600
+++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c 2007-01-22 22:33:43.0 -0600
@@ -700,7 +700,6 @@ static void nv_adma_check_cpb(struct ata
 static int nv_host_intr(struct ata_port *ap, u8 irq_stat)
 {
 	struct ata_queued_cmd *qc = ata_qc_from_tag(ap, ap->active_tag);
-	int handled;
 
 	/* freeze if hotplugged */
 	if (unlikely(irq_stat & (NV_INT_ADDED | NV_INT_REMOVED))) {
@@ -719,13 +718,7 @@ static int nv_host_intr(struct ata_port 
 	}
 
 	/* handle interrupt */
-	handled = ata_host_intr(ap, qc);
-	if (unlikely(!handled)) {
-		/* spurious, clear it */
-		ata_check_status(ap);
-	}
-
-	return 1;
+	return ata_host_intr(ap, qc);
 }
 
 static irqreturn_t nv_adma_interrupt(int irq, void *dev_instance)
@@ -752,6 +745,11 @@ static irqreturn_t nv_adma_interrupt(int
 		if (pp->flags & NV_ADMA_PORT_REGISTER_MODE) {
 			u8 irq_stat = readb(host->mmio_base + NV_INT_STATUS_CK804)
 				>> (NV_INT_PORT_SHIFT * i);
+			if(ata_tag_valid(ap->active_tag))
+				/** NV_INT_DEV indication seems unreliable at times
+				    at least in ADMA mode. Force it on always when a
+				    command is active, to prevent losing interrupts. */
+				irq_stat |= NV_INT_DEV;
 			handled += nv_host_intr(ap, irq_stat);
 			continue;
 		}
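
The effect of the last hunk can be tried out in user space. The sketch
below is an illustration only, not driver code: port_irq_stat and
should_handle are hypothetical helpers, and the values 0x01 for
NV_INT_DEV and 4 for NV_INT_PORT_SHIFT are assumptions made for the
example rather than definitions confirmed in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative values; treat them as assumptions, not as the driver's
 * authoritative definitions. */
#define NV_INT_DEV         0x01
#define NV_INT_PORT_SHIFT  4

/* Extract one port's bits from the shared interrupt status byte. */
static uint8_t port_irq_stat(uint8_t shared_status, unsigned int port)
{
	return (uint8_t)(shared_status >> (NV_INT_PORT_SHIFT * port));
}

/* Decide whether to run the port's interrupt handler. When a command is
 * active, act as if NV_INT_DEV were set, so that an unreliable status
 * bit cannot make us drop a real completion interrupt. */
static bool should_handle(uint8_t shared_status, unsigned int port,
			  bool command_active)
{
	uint8_t irq_stat = port_irq_stat(shared_status, port);

	if (command_active)
		irq_stat |= NV_INT_DEV;
	return (irq_stat & NV_INT_DEV) != 0;
}
```

With a status byte of 0x00 the handler is skipped for an idle port but
still runs when a command is active, which is the behaviour the patch
is after.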

Re: SATA exceptions triggered by XFS (since 2.6.18)

2007-01-21 Thread Tejun Heo


Paolo Ornati wrote:

> I don't know. It's a two years old ST380817AS.
>
> # smartctl -a -d ata /dev/sda
>
> smartctl version 5.36 [x86_64-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
>
> === START OF INFORMATION SECTION ===
> Model Family:     Seagate Barracuda 7200.7 and 7200.7 Plus family
> Device Model:     ST380817AS


I'll blacklist it.  Thanks.

--
tejun

Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Tejun Heo


Hello,

Chr wrote:
> Ok, you won't believe this... I opened my case and rewired my drives...
> And guess what, my second (aka the "good") HDD is now failing!
> I guess, my mainboard has a (but maybe two, or three :( ) "bad" sata-port(s)!


Or, you have power related problem.  Try to rewire the power lines or 
connect harddrives to a separate powersupply.  It's often useful to 
change one component at a time and watch which change the problem 
follows.  Anyways, you seem to be suffering transmission failures, not a 
driver problem.


Thanks.

--
tejun

Re: SATA exceptions triggered by XFS (since 2.6.18)

2007-01-21 Thread Jiri Slaby

Chr wrote:
>>   7 Seek_Error_Rate         0x000f   083   060   030    Pre-fail  Always       -       204305750
>>   1 Raw_Read_Error_Rate     0x000f   059   049   006    Pre-fail  Always       -       215927244
>> 195 Hardware_ECC_Recovered  0x001a   059   049   000    Old_age   Always       -       215927244
> 
> Wow! that HDD is really in a bad condition.

I don't think so, this seems to be normal for Seagate drives...

regards,
-- 
http://www.fi.muni.cz/~xslaby/Jiri Slaby
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8  22A0 32CC 55C3 39D4 7A7E
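
To make the point about large raw values concrete: smartctl judges an
attribute by its normalized VALUE against THRESH, not by RAW_VALUE,
whose meaning is vendor-specific. A minimal sketch of that comparison
follows; struct smart_attr and smart_attr_failed are hypothetical names
chosen for illustration, not part of smartmontools.

```c
#include <stdbool.h>

/* One row of the SMART attribute table: normalized value and the
 * vendor-set failure threshold. The raw value is vendor-specific and is
 * deliberately left out of the health check. */
struct smart_attr {
	int id;
	int value;
	int thresh;
};

/* An attribute counts as failed once its normalized value drops to or
 * below the threshold. */
static bool smart_attr_failed(const struct smart_attr *a)
{
	return a->value <= a->thresh;
}
```

For the rows quoted above, Raw_Read_Error_Rate (059 against 006) and
Seek_Error_Rate (083 against 030) are both still above their thresholds
despite the nine-digit raw counters.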

Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Robert Hancock


Björn Steinbrink wrote:

> On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> > Björn Steinbrink wrote:
> > > All kernels were bad using that approach. So back to square 1. :/
> > >
> > > Björn
> >
> > OK guys, here's a new patch to try against 2.6.20-rc5:
> >
> > Right now when switching between ADMA mode and legacy mode (i.e. when
> > going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just
> > set the ADMA GO register bit appropriately and continue with no delay.
> > It looks like in some cases the controller doesn't respond to this
> > immediately, it takes some nanoseconds for the controller's status
> > registers to reflect the change that was made. It's possible that if we
> > were trying to issue commands during this time, the controller might not
> > react properly. This patch adds some code to wait for the status
> > register to change to the state we asked for before continuing.
>
> Just got two exceptions with your patch, none of the debug messages were
> issued.
>
> Björn


Hmm, another miss, apparently.. Has anyone tried removing these lines
from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?

/* bail out if not our interrupt */
if (!(irq_stat & NV_INT_DEV))
return 0;

--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/



Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Robert Hancock


Björn Steinbrink wrote:

> On 2007.01.21 23:08:11 +0100, Björn Steinbrink wrote:
> > On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> > > Björn Steinbrink wrote:
> > > > All kernels were bad using that approach. So back to square 1. :/
> > > >
> > > > Björn
> > >
> > > OK guys, here's a new patch to try against 2.6.20-rc5:
> > >
> > > Right now when switching between ADMA mode and legacy mode (i.e. when
> > > going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just
> > > set the ADMA GO register bit appropriately and continue with no delay.
> > > It looks like in some cases the controller doesn't respond to this
> > > immediately, it takes some nanoseconds for the controller's status
> > > registers to reflect the change that was made. It's possible that if we
> > > were trying to issue commands during this time, the controller might not
> > > react properly. This patch adds some code to wait for the status
> > > register to change to the state we asked for before continuing.
> >
> > I went for the "I feel lucky" route and did just add mmio reads after the
> > mmio writes, posting them. Rationale being that if it is a write posting
> > issue, the debug patch would/could actually hide it AFAICT.
> > It's the "I feel lucky" route, because my whole "knowledge" about mmio
> > and write posting originates from the few things I read up on when you
> > discovered the comment about write posting in the generic ata code.
>
> Uhm, yeah, exception occurred about the time that I hit "send".
>
> Björn


Yeah, I don't think just adding reads to flush posted writes is enough 
here - it seems to need more delay than that, and it also wasn't always 
in the idle state even before we would write the register..
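
The "wait for the status register to change" idea can be sketched in
user space as a bounded poll. This is an illustration under stated
assumptions, not the debug patch itself: wait_for_status, read_status_fn,
struct fake_ctrl and fake_read_status are hypothetical names, and the
register is simulated by a callback instead of a real MMIO read.

```c
#include <stdbool.h>
#include <stdint.h>

/* Callback that reads the (simulated) status register. */
typedef uint16_t (*read_status_fn)(void *ctx);

/* Poll until (status & mask) == want, or give up after max_polls reads.
 * Returns true if the desired state was observed. */
static bool wait_for_status(read_status_fn read_status, void *ctx,
			    uint16_t mask, uint16_t want,
			    unsigned int max_polls)
{
	for (unsigned int i = 0; i < max_polls; i++) {
		if ((read_status(ctx) & mask) == want)
			return true;
		/* A real driver would delay briefly here (e.g. ndelay/udelay). */
	}
	return false;
}

/* Simulated controller whose status bit flips only after a few reads. */
struct fake_ctrl {
	unsigned int reads_until_ready;
};

static uint16_t fake_read_status(void *ctx)
{
	struct fake_ctrl *c = ctx;

	if (c->reads_until_ready > 0) {
		c->reads_until_ready--;
		return 0x0000;
	}
	return 0x0001;
}
```

Bounding the number of polls matters: a controller that never reaches
the requested state should produce an error path, not a hang.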



Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Björn Steinbrink

On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >All kernels were bad using that approach. So back to square 1. :/
> >
> >Björn
> >
> 
> OK guys, here's a new patch to try against 2.6.20-rc5:
> 
> Right now when switching between ADMA mode and legacy mode (i.e. when 
> going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just 
> set the ADMA GO register bit appropriately and continue with no delay. 
> It looks like in some cases the controller doesn't respond to this 
> immediately, it takes some nanoseconds for the controller's status 
> registers to reflect the change that was made. It's possible that if we 
> were trying to issue commands during this time, the controller might not 
> react properly. This patch adds some code to wait for the status 
> register to change to the state we asked for before continuing.

Just got two exceptions with your patch, none of the debug messages were
issued.

Björn

Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Björn Steinbrink

On 2007.01.21 23:08:11 +0100, Björn Steinbrink wrote:
> On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> > Björn Steinbrink wrote:
> > >All kernels were bad using that approach. So back to square 1. :/
> > >
> > >Björn
> > >
> > 
> > OK guys, here's a new patch to try against 2.6.20-rc5:
> > 
> > Right now when switching between ADMA mode and legacy mode (i.e. when 
> > going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just 
> > set the ADMA GO register bit appropriately and continue with no delay. 
> > It looks like in some cases the controller doesn't respond to this 
> > immediately, it takes some nanoseconds for the controller's status 
> > registers to reflect the change that was made. It's possible that if we 
> > were trying to issue commands during this time, the controller might not 
> > react properly. This patch adds some code to wait for the status 
> > register to change to the state we asked for before continuing.
> 
> I went for the "I feel lucky" route and did just add mmio reads after the
> mmio writes, posting them. Rationale being that if it is a write posting
> issue, the debug patch would/could actually hide it AFAICT.
> It's the "I feel lucky" route, because my whole "knowledge" about mmio
> and write posting originates from the few things I read up on when you
> discovered the comment about write posting in the generic ata code.

Uhm, yeah, exception occurred about the time that I hit "send".

Björn

Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Björn Steinbrink

On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >All kernels were bad using that approach. So back to square 1. :/
> >
> >Björn
> >
> 
> OK guys, here's a new patch to try against 2.6.20-rc5:
> 
> Right now when switching between ADMA mode and legacy mode (i.e. when 
> going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just 
> set the ADMA GO register bit appropriately and continue with no delay. 
> It looks like in some cases the controller doesn't respond to this 
> immediately, it takes some nanoseconds for the controller's status 
> registers to reflect the change that was made. It's possible that if we 
> were trying to issue commands during this time, the controller might not 
> react properly. This patch adds some code to wait for the status 
> register to change to the state we asked for before continuing.

I went for the "I feel lucky" route and did just add mmio reads after the
mmio writes, posting them. Rationale being that if it is a write posting
issue, the debug patch would/could actually hide it AFAICT.
It's the "I feel lucky" route, because my whole "knowledge" about mmio
and write posting originates from the few things I read up on when you
discovered the comment about write posting in the generic ata code.

Björn

Re: SATA exceptions triggered by XFS (since 2.6.18)

2007-01-21 Thread Chr

On Sunday, 21. January 2007 20:25, Paolo Ornati wrote:
> On Sun, 21 Jan 2007 11:32:02 -0600
> Robert Hancock <[EMAIL PROTECTED]> wrote:
> 
> > It looks like what you're getting is an actual NCQ write timing out. 
> > That makes the bisect result not very interesting since obviously it 
> > wouldn't have issued any NCQ writes before NCQ support was
> > implemented. Seeing as how it's also an entirely different driver I
> > imagine it's a different problem than what I've been looking at.
> > 
> > Maybe that drive just has some issues with NCQ? I would be surprised
> > at that with a Seagate though..
> 
> I don't know. It's a two years old ST380817AS.
> 
> 
> # smartctl -a -d ata /dev/sda
> 
> smartctl version 5.36 [x86_64-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
> 
> === START OF INFORMATION SECTION ===
> Model Family:     Seagate Barracuda 7200.7 and 7200.7 Plus family
> Device Model:     ST380817AS
> Serial Number:    4MR08EK8
> Firmware Version: 3.42
> User Capacity:    80,026,361,856 bytes
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   6
> ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 2
> Local Time is:    Sun Jan 21 20:15:40 2007 CET
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
> 
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> 
> General SMART Values:
> Offline data collection status:  (0x82) Offline data collection activity
>                                         was completed without error.
>                                         Auto Offline Data Collection: Enabled.
> Self-test execution status:      (   0) The previous self-test routine completed
>                                         without error or no self-test has ever
>                                         been run.
> Total time to complete Offline
> data collection:                 ( 430) seconds.
> Offline data collection
> capabilities:                    (0x5b) SMART execute Offline immediate.
>                                         Auto Offline data collection on/off support.
>                                         Suspend Offline collection upon new
>                                         command.
>                                         Offline surface scan supported.
>                                         Self-test supported.
>                                         No Conveyance Self-test supported.
>                                         Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
>                                         power-saving mode.
>                                         Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
>                                         No General Purpose Logging support.
> Short self-test routine
> recommended polling time:        (   1) minutes.
> Extended self-test routine
> recommended polling time:        (  47) minutes.
> 
> SMART Attributes Data Structure revision number: 10
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME  FLAG VALUE WORST THRESH TYPE  UPDATED  
> WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate 0x000f   059   049   006Pre-fail  Always  
>  -   215927244
>   3 Spin_Up_Time0x0003   098   098   000Pre-fail  Always  
>  -   0
>   4 Start_Stop_Count0x0032   098   098   020Old_age   Always  
>  -   2182
>   5 Reallocated_Sector_Ct   0x0033   100   100   036Pre-fail  Always  
>  -   0
>   7 Seek_Error_Rate 0x000f   083   060   030Pre-fail  Always  
>  -   204305750
>   9 Power_On_Hours  0x0032   097   097   000Old_age   Always  
>  -   3494
>  10 Spin_Retry_Count0x0013   100   100   097Pre-fail  Always  
>  -   0
>  12 Power_Cycle_Count   0x0032   098   098   020Old_age   Always  
>  -   2541
> 194 Temperature_Celsius 0x0022   024   040   000Old_age   Always  
>  -   24 (Lifetime Min/Max 0/15)
> 195 Hardware_ECC_Recovered  0x001a   059   049   000Old_age   Always  
>  -   215927244
> 197 Current_Pending_Sector  0x0012   100   100   000Old_age   Always  
>  -   1
> 198 Offline_Uncorrectable   0x0010   100   100   000Old_age   Offline 
>  -   1
> 199 UDMA_CRC_Error_Count0x003e   200   200   000Old_age   Always  
>  -   0
> 200 Multi_Zone_Error_Rate   0x   100   253   000Old_age   Offline 
>  -   0
> 202 TA_Increase_Count   0x0032   100   253   000Old_age   Always  
>  -   0
> 
> SMART Error Log Version: 1
> ATA Error Count: 12 (device log contains only the most recent five errors)
>   CR = Command Register [HEX]
>   FR = Features Register [HEX]
>   SC

Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Chr

On Sunday, 21. January 2007 19:01, Björn Steinbrink wrote:
> On 2007.01.21 18:34:40 +0100, Chr wrote:
>
> I run those two in parallel:
> while /bin/true; do ls -lR / > /dev/null 2>&1; done
> while /bin/true; do echo 255 > /proc/sys/vm/drop_caches; sleep 1; done
>
> Not sure if running them in parallel is necessary, but I don't want to
> change the test setup ;) Takes between 1 and 40 minutes to trigger it.
> Most of the time it's around 15 minutes now, doing more random stuff in
> addition to that seems to trigger it even easier (like reading mail,
> rebuilding the kernel etc.).
>
> I'm down to 2 commits after 2.6.19 now, only bad kernels, so I tend to
> say that 2.6.19 with 2.6.20-rc5's sata_nv.c will also fail for me, but I
> thought I might finish bisection just to be sure.
>
> > But, this time it looks slightly different:
> > ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> > ata3.00: tag 0 cmd 0xec Emask 0x4 stat 0x40 err 0x0 (timeout)
> >
> > [Rest of the error message + SMART error snipped]
>
> I get the same exception every time, doesn't change for me. And neither
> do I get any SMART errors or something.
>
> Thanks,
> Björn

OK, you won't believe this... I opened my case and rewired my drives,
and guess what: my second (aka the "good") HDD is now failing!
I guess my mainboard has one (or maybe two or three :( ) "bad" SATA port(s)!

But one small question remains: when I opened my case, I saw that my drives
are plugged into SATA jacks 1 and 2... The BIOS also says they're on 1 and 2.
Yet Linux says they're on ports 3 & 4!

It's always ata3.00!
"ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata3.00: tag 0 cmd 0xea Emask 0x4 stat 0x40 err 0x0 (timeout)
ata3: soft resetting port
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata3.00: configured for UDMA/133
ata3: EH complete
SCSI device sda: 490234752 512-byte hdwr sectors (251000 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back"


Thanks,
Chr.
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/

Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Robert Hancock


Björn Steinbrink wrote:
> All kernels were bad using that approach. So back to square 1. :/
>
> Björn

OK guys, here's a new patch to try against 2.6.20-rc5:

Right now, when switching between ADMA mode and legacy mode (i.e. when
going from doing normal DMA reads/writes to doing a FLUSH CACHE), we just
set the ADMA GO register bit appropriately and continue with no delay.
It looks like in some cases the controller doesn't respond to this
immediately; it takes some nanoseconds for the controller's status
registers to reflect the change that was made. It's possible that if we
issue commands during this window, the controller does not react
properly. This patch adds code that waits for the status register to
reach the state we asked for before continuing.


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

--- linux-2.6.20-rc5/drivers/ata/sata_nv.c	2007-01-19 19:18:53.0 -0600
+++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c	2007-01-21 13:35:17.0 -0600
@@ -509,14 +509,38 @@ static void nv_adma_register_mode(struct
 {
 	void __iomem *mmio = nv_adma_ctl_block(ap);
 	struct nv_adma_port_priv *pp = ap->private_data;
-	u16 tmp;
+	u16 tmp, status;
+	int count = 0;
 
 	if (pp->flags & NV_ADMA_PORT_REGISTER_MODE)
 		return;
 
+	status = readw(mmio + NV_ADMA_STAT);
+	while(!(status & NV_ADMA_STAT_IDLE) && count < 20) {
+		ndelay(50);
+		status = readw(mmio + NV_ADMA_STAT);
+		count++;
+	}
+	if(count == 20)
+		ata_port_printk(ap, KERN_WARNING,
+			"timeout waiting for ADMA IDLE, stat=0x%hx\n",
+			status);
+
 	tmp = readw(mmio + NV_ADMA_CTL);
 	writew(tmp & ~NV_ADMA_CTL_GO, mmio + NV_ADMA_CTL);
 
+	count = 0;
+	status = readw(mmio + NV_ADMA_STAT);
+	while(!(status & NV_ADMA_STAT_LEGACY) && count < 20) {
+		ndelay(50);
+		status = readw(mmio + NV_ADMA_STAT);
+		count++;
+	}
+	if(count == 20)
+		ata_port_printk(ap, KERN_WARNING,
+			"timeout waiting for ADMA LEGACY, stat=0x%hx\n",
+			status);
+
 	pp->flags |= NV_ADMA_PORT_REGISTER_MODE;
 }
 
@@ -524,7 +548,8 @@ static void nv_adma_mode(struct ata_port
 {
 	void __iomem *mmio = nv_adma_ctl_block(ap);
 	struct nv_adma_port_priv *pp = ap->private_data;
-	u16 tmp;
+	u16 tmp, status;
+	int count = 0;
 
 	if (!(pp->flags & NV_ADMA_PORT_REGISTER_MODE))
 		return;
@@ -534,6 +559,18 @@ static void nv_adma_mode(struct ata_port
 	tmp = readw(mmio + NV_ADMA_CTL);
 	writew(tmp | NV_ADMA_CTL_GO, mmio + NV_ADMA_CTL);
 
+	status = readw(mmio + NV_ADMA_STAT);
+	while(((status & NV_ADMA_STAT_LEGACY) ||
+	      !(status & NV_ADMA_STAT_IDLE)) && count < 20) {
+		ndelay(50);
+		status = readw(mmio + NV_ADMA_STAT);
+		count++;
+	}
+	if(count == 20)
+		ata_port_printk(ap, KERN_WARNING,
+			"timeout waiting for ADMA LEGACY clear and IDLE, stat=0x%hx\n",
+			status);
+
 	pp->flags &= ~NV_ADMA_PORT_REGISTER_MODE;
 }

Re: SATA exceptions triggered by XFS (since 2.6.18)

2007-01-21 Thread Paolo Ornati

On Sun, 21 Jan 2007 11:32:02 -0600
Robert Hancock <[EMAIL PROTECTED]> wrote:

> It looks like what you're getting is an actual NCQ write timing out. 
> That makes the bisect result not very interesting since obviously it 
> wouldn't have issued any NCQ writes before NCQ support was
> implemented. Seeing as how it's also an entirely different driver I
> imagine it's a different problem than what I've been looking at.
> 
> Maybe that drive just has some issues with NCQ? I would be surprised
> at that with a Seagate though..

I don't know. It's a two years old ST380817AS.

# smartctl -a -d ata /dev/sda

smartctl version 5.36 [x86_64-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.7 and 7200.7 Plus family
Device Model:     ST380817AS
Serial Number:    4MR08EK8
Firmware Version: 3.42
User Capacity:    80,026,361,856 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 2
Local Time is:    Sun Jan 21 20:15:40 2007 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 430) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off
                                        support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  47) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   059   049   006    Pre-fail  Always       -       215927244
  3 Spin_Up_Time            0x0003   098   098   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   098   098   020    Old_age   Always       -       2182
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   083   060   030    Pre-fail  Always       -       204305750
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       3494
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   098   098   020    Old_age   Always       -       2541
194 Temperature_Celsius     0x0022   024   040   000    Old_age   Always       -       24 (Lifetime Min/Max 0/15)
195 Hardware_ECC_Recovered  0x001a   059   049   000    Old_age   Always       -       215927244
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x       100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0
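
To compare counters such as Current_Pending_Sector or UDMA_CRC_Error_Count
between test runs, the raw value can be pulled out of the attribute table
with a small filter. This is only a sketch: smart_raw is a made-up helper
name (not part of smartmontools), and it assumes the ten-column layout that
smartctl 5.36 prints above, where RAW_VALUE starts in the tenth field.

```bash
#!/bin/bash
# smart_raw NAME: read smartctl attribute rows on stdin and print the first
# word of RAW_VALUE (10th whitespace-separated field) for the row whose
# ATTRIBUTE_NAME equals NAME. Hypothetical helper, not a smartmontools tool.
smart_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Typical use (needs root and a real disk, so it is left commented out):
#   smartctl -A /dev/sda | smart_raw Current_Pending_Sector
```

Running it before and after a stress run shows whether a counter moved,
without having to eyeball the whole table.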

SMART Error Log Version: 1
ATA Error Count: 12 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error

Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Björn Steinbrink

On 2007.01.21 09:36:18 +0100, Björn Steinbrink wrote:
> On 2007.01.21 00:39:20 -0600, Robert Hancock wrote:
> > Björn Steinbrink wrote:
> > >On 2007.01.20 22:34:27 -0500, Jeff Garzik wrote:
> > >>Robert Hancock wrote:
> > >>>change in 2.6.20-rc is either causing or triggering this problem. It 
> > >>>would be useful if you could try git bisect between 2.6.19 and 
> > >>>2.6.20-rc5, keeping the latest sata_nv.c each time, and see if that 
> > >>
> > >>Yes, 'git bisect' would be the next step in figuring out this puzzle.
> > >>
> > >>Anybody up for it?
> > >
> > >I'll go for it, but could I get an explanation how that could lead to a
> > >different result than my last bisection? I see the difference of keeping
> > >sata_nv.c but my brain can't wrap around it right now (woke up in the
> > >middle of the night and still not up to speed...).
> > 
> > Whatever the problem is, only seems to show up when ADMA is enabled, and 
> > so the patch that added ADMA support shows up as the culprit from your 
> > git bisect. However, from what Chr is reporting, 2.6.19 with the ADMA 
> > support added in doesn't seem to have the problem, so presumably 
> > something else that changed in the 2.6.20-rc series is triggering it. 
> > Doing a bisect while keeping the driver code itself the same will 
> > hopefully identify what that change is..
> 
> Ah, right... sata_nv.c of course interacts with the outside world, d'oh!
> 
> Up to now, I only got bad kernels, latest tested being:
> 94fcda1f8ab5e0cacc381c5ca1cc9aa6ad523576
> 
> Which, unless I missed a commit in the diff, only USB changes,
> continuing anyway.

All kernels were bad using that approach. So back to square 1. :/

Björn

Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Björn Steinbrink

On 2007.01.21 18:34:40 +0100, Chr wrote:
> On Sunday, 21. January 2007 09:36, Björn Steinbrink wrote:
> > On 2007.01.21 00:39:20 -0600, Robert Hancock wrote:
> >
> > Ah, right... sata_nv.c of course interacts with the outside world, d'oh!
> >
> > Up to now, I only got bad kernels, latest tested being:
> > 94fcda1f8ab5e0cacc381c5ca1cc9aa6ad523576
> >
> > Which, unless I missed a commit in the diff, only USB changes,
> > continuing anyway.
> >
> > Just to make sure, here's my little helper for this bisect run, I hope
> > it does what you expected:
> >
> > #!/bin/bash
> > cp ../sata_nv.c.orig drivers/ata/sata_nv.c
> > git bisect good
> > cp drivers/ata/sata_nv.c ../sata_nv.c.orig
> > cp ../sata_nv.c drivers/ata/
> > make oldconfig
> > make -j4
> >
> > Where "../sata_nv.c" is the version from 2.6.20-rc5. The copying is done
> > to avoid conflicts and keep git happy. Of course there's also a version
> > for bad kernels ;) No idea, why I didn't make that an argument to the
> > script...
> >
> > Thanks,
> > Björn
> 
> Ar, 2.6.19 (with 2.6.20-rc5 adma stuff) is affected too (BTW, what do you 
> do to trigger the exceptions? Because, it takes hours to "reproduces" this
> silly *).

I run those two in parallel:
while /bin/true; do ls -lR / > /dev/null 2>&1; done
while /bin/true; do echo 255 > /proc/sys/vm/drop_caches; sleep 1; done

Not sure if running them in parallel is necessary, but I don't want to
change the test setup ;) Takes between 1 and 40 minutes to trigger it.
Most of the time it's around 15 minutes now, doing more random stuff in
addition to that seems to trigger it even easier (like reading mail,
rebuilding the kernel etc.).
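
Since the trigger time varies between 1 and 40 minutes, it may help to let
a script stop the load and report the elapsed time. The following is a
rough sketch, not what was actually used in this thread: both function
names are invented, it must run as root, and it assumes the kernel ring
buffer does not wrap during the run (otherwise the count could go down).

```bash
#!/bin/bash
# count_ata_exceptions: count libata exception reports in kernel log text
# read from stdin. Hypothetical helper name.
count_ata_exceptions() {
    # grep -c exits non-zero on zero matches but still prints "0".
    grep -cE 'ata[0-9]+(\.[0-9]+)?: exception Emask' || true
}

# stress_until_exception: run the two load loops from above in the
# background until the kernel log gains a new exception line.
# Only defined here; call it by hand as root.
stress_until_exception() {
    local before ls_pid drop_pid
    before=$(dmesg | count_ata_exceptions)
    SECONDS=0
    ( while true; do ls -lR / > /dev/null 2>&1; done ) &
    ls_pid=$!
    ( while true; do echo 255 > /proc/sys/vm/drop_caches; sleep 1; done ) &
    drop_pid=$!
    while [ "$(dmesg | count_ata_exceptions)" -le "$before" ]; do
        sleep 5
    done
    kill "$ls_pid" "$drop_pid"
    echo "new ATA exception after ${SECONDS}s"
}
```

The counting function can also be fed a saved log file, which makes it easy
to compare runs on different kernels.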

I'm down to 2 commits after 2.6.19 now, only bad kernels, so I tend to
say that 2.6.19 with 2.6.20-rc5's sata_nv.c will also fail for me, but I
thought I might finish bisection just to be sure.

> But, this time it looks slightly different:
> ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> ata3.00: tag 0 cmd 0xec Emask 0x4 stat 0x40 err 0x0 (timeout)

> [Rest of the error message + SMART error snipped]

I get the same exception every time, doesn't change for me. And neither
do I get any SMART errors or something.

Thanks,
Björn

Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Chr

On Sunday, 21. January 2007 09:36, Björn Steinbrink wrote:
> On 2007.01.21 00:39:20 -0600, Robert Hancock wrote:
>
> Ah, right... sata_nv.c of course interacts with the outside world, d'oh!
>
> Up to now, I only got bad kernels, latest tested being:
> 94fcda1f8ab5e0cacc381c5ca1cc9aa6ad523576
>
> Which, unless I missed a commit in the diff, only USB changes,
> continuing anyway.
>
> Just to make sure, here's my little helper for this bisect run, I hope
> it does what you expected:
>
> #!/bin/bash
> cp ../sata_nv.c.orig drivers/ata/sata_nv.c
> git bisect good
> cp drivers/ata/sata_nv.c ../sata_nv.c.orig
> cp ../sata_nv.c drivers/ata/
> make oldconfig
> make -j4
>
> Where "../sata_nv.c" is the version from 2.6.20-rc5. The copying is done
> to avoid conflicts and keep git happy. Of course there's also a version
> for bad kernels ;) No idea, why I didn't make that an argument to the
> script...
>
> Thanks,
> Björn

Argh, 2.6.19 (with the 2.6.20-rc5 ADMA code) is affected too. (BTW, what do
you do to trigger the exceptions? It takes me hours to reproduce this
silly *.)

But, this time it looks slightly different:
ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata3.00: tag 0 cmd 0xec Emask 0x4 stat 0x40 err 0x0 (timeout)
ata3: soft resetting port
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
!!!
ata3.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x1)
ata3.00: revalidation failed (errno=-5)
ata3: failed to recover some devices, retrying in 5 secs
!!!
ata3: hard resetting port
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata3.00: configured for UDMA/133
ata3: EH complete
SCSI device sda: 488395055 512-byte hdwr sectors (250058 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back
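
The Emask / err_mask values in this log are bit masks. As far as I know they
follow the AC_ERR_* flags in include/linux/libata.h; the sketch below
decodes them under that assumption. The function name and the short labels
are made up for illustration, and only the low nine bits are covered.

```bash
#!/bin/bash
# decode_emask MASK: print the error classes set in a libata error mask.
# Bit order assumed from the AC_ERR_* flags (bit 0 = DEV ... bit 8 = OTHER);
# the lowercase labels are this sketch's own shorthand.
decode_emask() {
    local mask=$(( $1 ))
    local names=(dev hsm timeout media ata_bus host_bus system invalid other)
    local out=() i
    for i in "${!names[@]}"; do
        if (( mask & (1 << i) )); then
            out+=("${names[i]}")
        fi
    done
    echo "${out[*]:-none}"
}

decode_emask 0x4   # the per-command Emask in the log: timeout
decode_emask 0x1   # err_mask of the failed INIT_DEV_PARAMS: dev
```

Read this way, the first failure is the host timing out on the command,
while the failed revalidation is an error reported by the device itself.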

Oh, and I got this nice SMART Error: 

ID# ATTRIBUTE_NAME          FLAG     ...  RAW_VALUE
199 UDMA_CRC_Error_Count    0x003e   ...  -   12

SMART Error Log Version: 1
ATA Error Count: 1
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register [HEX]
ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 5603 hours (233 days + 11 hours)
  When the command that caused the error occurred, the device was in an unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 3f 00 00 00 af

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --  ----------------  --------------------
  91 00 3f 00 00 00 0f 00      05:30:59.655  INITIALIZE DEVICE PARAMETERS [OBS-6]
  ec 00 01 01 00 00 00 00      05:30:59.654  IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00      05:30:56.191  IDENTIFY DEVICE
  ca 00 28 02 ee 9a 0c 00      05:30:56.190  WRITE DMA
  ca 00 10 e8 4c 10 0a 00      05:30:56.190  WRITE DMA
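
The ER and ST bytes in the register dump can be decoded bit by bit. A small
sketch, using the classic ATA bit names (some of them are marked obsolete
or command-dependent in ATA/ATAPI-6); the function names are made up:

```bash
#!/bin/bash
# decode_bits HEX NAME7 ... NAME0: print the names of the bits set in the
# two-digit hex value HEX, given one name per bit from bit 7 down to bit 0.
decode_bits() {
    local value=$(( 0x$1 )); shift
    local bit=7 name out=()
    for name in "$@"; do
        if (( value & (1 << bit) )); then
            out+=("$name")
        fi
        bit=$(( bit - 1 ))
    done
    echo "${out[*]:-none}"
}

# Classic ATA Status and Error register layouts (bit 7 first).
decode_ata_status() { decode_bits "$1" BSY DRDY DF DSC DRQ CORR IDX ERR; }
decode_ata_error()  { decode_bits "$1" ICRC UNC MC IDNF MCR ABRT NM AMNF; }

decode_ata_status 51   # DRDY DSC ERR
decode_ata_error 04    # ABRT
```

So ST=51 / ER=04 reads as "device ready, command ended with an error, the
error being an abort". That would fit the drive simply rejecting the
INITIALIZE DEVICE PARAMETERS command, which matches the "INIT_DEV_PARAMS
failed, err_mask=0x1" line in the kernel log.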

Maybe it really is the HDD!

OT: http://www.nvidia.com/object/680i_hotfix.html

Chr.

Re: SATA exceptions triggered by XFS (since 2.6.18)

2007-01-21 Thread Robert Hancock


Paolo Ornati wrote:
> On Sun, 21 Jan 2007 15:29:32 +0100
> Paolo Ornati <[EMAIL PROTECTED]> wrote:
>
>> Sorry for starting a new thread, but I've deleted the messages from my
>> mail-box, and I'm not sure it's the same problem as here:
>> http://lkml.org/lkml/2007/1/14/108
>>
>> Today I've decided to try XFS... and just doing anything on it
>> (extracting a tarball, for example) makes my SATA HD go crazy ;)
>>
>> I don't remember seeing this with Ext3.
>>
>> [  877.839920] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x2 frozen
>> [  877.839929] ata1.00: cmd 61/02:00:64:98:98/00:00:00:00:00/40 tag 0 cdb 0x0 data 1024 out
>> [  877.839931]          res 40/00:00:00:4f:c2/00:00:00:4f:c2/00 Emask 0x4 (timeout)
>> [  878.142367] ata1: soft resetting port
>> [  878.351791] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
>> [  878.354384] ata1.00: configured for UDMA/133
>> [  878.354392] ata1: EH complete
>> [  878.355696] SCSI device sda: 156301488 512-byte hdwr sectors (80026 MB)
>> [  878.355716] sda: Write Protect is off
>> [  878.355718] sda: Mode Sense: 00 3a 00 00
>> [  878.355745] SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
>>
>> It takes nothing to reproduce it.
>>
>> ..
>
> git-bisect points to this commit:
>
> --
>
> 12fad3f965830d71f6454f02b2af002a64cec4d3 is first bad commit
> commit 12fad3f965830d71f6454f02b2af002a64cec4d3
> Author: Tejun Heo <[EMAIL PROTECTED]>
> Date:   Mon May 15 21:03:55 2006 +0900
>
>     [PATCH] ahci: implement NCQ suppport
>
>     Implement NCQ support.


It looks like what you're getting is an actual NCQ write timing out. 
That makes the bisect result not very interesting since obviously it 
wouldn't have issued any NCQ writes before NCQ support was implemented. 
Seeing as how it's also an entirely different driver I imagine it's a 
different problem than what I've been looking at.


Maybe that drive just has some issues with NCQ? I would be surprised at 
that with a Seagate though..
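
For reference, the "cmd 61/02:00:64:98:98/00:00:00:00:00/40" line in the
quoted log can be picked apart to confirm that it is an NCQ write. The
sketch below assumes the field order libata uses when printing a taskfile
(command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:
hob_lbam:hob_lbah/device) and the NCQ convention that the sector count
lives in the feature registers and the tag in bits 7:3 of nsect. The
function name is made up.

```bash
#!/bin/bash
# decode_ncq_cmd TASKFILE: decode the "cmd ..." taskfile string that libata
# prints for a failed NCQ command (0x60 READ / 0x61 WRITE FPDMA QUEUED).
# Note: a sector count of 0 would mean 65536 sectors; not handled here.
decode_ncq_cmd() {
    local cmd feat nsect lbal lbam lbah hfeat hnsect hlbal hlbam hlbah dev
    IFS='/:' read -r cmd feat nsect lbal lbam lbah \
        hfeat hnsect hlbal hlbam hlbah dev <<< "$1"
    # 48-bit LBA: high-order bytes first, low-order bytes last.
    local lba=$(( (0x$hlbah << 40) | (0x$hlbam << 32) | (0x$hlbal << 24) |
                  (0x$lbah << 16) | (0x$lbam << 8) | 0x$lbal ))
    # NCQ keeps the sector count in the feature registers.
    local sectors=$(( (0x$hfeat << 8) | 0x$feat ))
    # NCQ keeps the tag in bits 7:3 of the sector count register.
    local tag=$(( 0x$nsect >> 3 ))
    echo "cmd=0x$cmd tag=$tag lba=$lba sectors=$sectors bytes=$(( sectors * 512 ))"
}

decode_ncq_cmd 61/02:00:64:98:98/00:00:00:00:00/40
# prints: cmd=0x61 tag=0 lba=10000484 sectors=2 bytes=1024
```

Command 0x61 is WRITE FPDMA QUEUED, and the 2 sectors (1024 bytes) agree
with the "data 1024 out" part of the same log line.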


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/


Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Björn Steinbrink

On 2007.01.21 00:39:20 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >On 2007.01.20 22:34:27 -0500, Jeff Garzik wrote:
> >>Robert Hancock wrote:
> >>>change in 2.6.20-rc is either causing or triggering this problem. It 
> >>>would be useful if you could try git bisect between 2.6.19 and 
> >>>2.6.20-rc5, keeping the latest sata_nv.c each time, and see if that 
> >>
> >>Yes, 'git bisect' would be the next step in figuring out this puzzle.
> >>
> >>Anybody up for it?
> >
> >I'll go for it, but could I get an explanation how that could lead to a
> >different result than my last bisection? I see the difference of keeping
> >sata_nv.c but my brain can't wrap around it right now (woke up in the
> >middle of the night and still not up to speed...).
> 
> Whatever the problem is, only seems to show up when ADMA is enabled, and 
> so the patch that added ADMA support shows up as the culprit from your 
> git bisect. However, from what Chr is reporting, 2.6.19 with the ADMA 
> support added in doesn't seem to have the problem, so presumably 
> something else that changed in the 2.6.20-rc series is triggering it. 
> Doing a bisect while keeping the driver code itself the same will 
> hopefully identify what that change is..

Ah, right... sata_nv.c of course interacts with the outside world, d'oh!

Up to now, I only got bad kernels, latest tested being:
94fcda1f8ab5e0cacc381c5ca1cc9aa6ad523576

Which, unless I missed a commit in the diff, only USB changes,
continuing anyway.

Just to make sure, here's my little helper for this bisect run, I hope
it does what you expected:

#!/bin/bash
cp ../sata_nv.c.orig drivers/ata/sata_nv.c
git bisect good
cp drivers/ata/sata_nv.c ../sata_nv.c.orig
cp ../sata_nv.c drivers/ata/
make oldconfig
make -j4

Where "../sata_nv.c" is the version from 2.6.20-rc5. The copying is done
to avoid conflicts and keep git happy. Of course there's also a version
for bad kernels ;) No idea, why I didn't make that an argument to the
script...
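
For what it's worth, a version that takes the verdict as an argument could
look roughly like this. It is only a sketch under the same assumptions as
the script above (run from the top of the kernel tree, with ../sata_nv.c
being the 2.6.20-rc5 driver and ../sata_nv.c.orig the saved copy of the
current commit's driver); bisect_step and DRY_RUN are names invented here.

```bash
#!/bin/bash
# bisect_step good|bad: one round of the bisection with sata_nv.c pinned.
# With DRY_RUN=1 the commands are only printed, not executed.
bisect_step() {
    local verdict=$1 run=
    case "$verdict" in
        good|bad) ;;
        *) echo "usage: bisect_step good|bad" >&2; return 2 ;;
    esac
    if [ -n "${DRY_RUN:-}" ]; then
        run=echo
    fi
    # Put this commit's own driver back so git sees a clean tree.
    $run cp ../sata_nv.c.orig drivers/ata/sata_nv.c || return
    # Record the verdict; git then checks out the next commit to test.
    $run git bisect "$verdict" || return
    # Save the new commit's driver, then pin the 2.6.20-rc5 one over it.
    $run cp drivers/ata/sata_nv.c ../sata_nv.c.orig || return
    $run cp ../sata_nv.c drivers/ata/ || return
    $run make oldconfig || return
    $run make -j4
}
```

After booting and testing each build, `bisect_step good` or
`bisect_step bad` replaces the two separate scripts.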

Thanks,
Björn

Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Björn Steinbrink

On 2007.01.21 18:34:40 +0100, Chr wrote:
> On Sunday, 21. January 2007 09:36, Björn Steinbrink wrote:
> > On 2007.01.21 00:39:20 -0600, Robert Hancock wrote:
> >
> > Ah, right... sata_nv.c of course interacts with the outside world, d'oh!
> >
> > Up to now, I only got bad kernels, latest tested being:
> > 94fcda1f8ab5e0cacc381c5ca1cc9aa6ad523576
> >
> > Which, unless I missed a commit in the diff, only USB changes,
> > continuing anyway.
> >
> > Just to make sure, here's my little helper for this bisect run, I hope
> > it does what you expected:
> >
> > #!/bin/bash
> > cp ../sata_nv.c.orig drivers/ata/sata_nv.c
> > git bisect good
> > cp drivers/ata/sata_nv.c ../sata_nv.c.orig
> > cp ../sata_nv.c drivers/ata/
> > make oldconfig
> > make -j4
> >
> > Where ../sata_nv.c is the version from 2.6.20-rc5. The copying is done
> > to avoid conflicts and keep git happy. Of course there's also a version
> > for bad kernels ;) No idea, why I didn't make that an argument to the
> > script...
> >
> > Thanks,
> > Björn
>
> Ar, 2.6.19 (with 2.6.20-rc5 adma stuff) is affected too (BTW, what do you
> do to trigger the exceptions? Because, it takes hours to reproduce this
> silly *).

I run those two in parallel:
while /bin/true; do ls -lR / > /dev/null 2>&1; done
while /bin/true; do echo 255 > /proc/sys/vm/drop_caches; sleep 1; done

Not sure if running them in parallel is necessary, but I don't want to
change the test setup ;) Takes between 1 and 40 minutes to trigger it.
Most of the time it's around 15 minutes now, doing more random stuff in
addition to that seems to trigger it even easier (like reading mail,
rebuilding the kernel etc.).
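
A small sketch for timing how long the two loops take to trigger the problem, assuming the exception shows up in the kernel log in the form quoted in this thread. The function names are hypothetical, and since dmesg shows the whole ring buffer, an exception from earlier in the same boot would match immediately.

```bash
#!/bin/bash
# Hypothetical helpers; the pattern matches libata lines such as
# "ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen".

# Filter stdin down to libata exception lines; exit status is non-zero
# when no such line is present.
find_ata_exceptions() {
    grep -E 'ata[0-9]+(\.[0-9]+)?: exception Emask'
}

# Poll the kernel log every 10 seconds until an exception appears,
# then report how long that took.
wait_for_exception() {
    start=$(date +%s)
    until dmesg | find_ata_exceptions; do
        sleep 10
    done
    echo "exception seen after $(( $(date +%s) - start )) seconds"
}
```

Start the two loops in other terminals, then call wait_for_exception; it prints the matching log line followed by the elapsed time.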

I'm down to 2 commits after 2.6.19 now, only bad kernels, so I tend to
say that 2.6.19 with 2.6.20-rc5's sata_nv.c will also fail for me, but I
thought I might finish bisection just to be sure.
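
The bisect helper quoted earlier in this message hard-codes "git bisect good", with a second copy for bad kernels. Below is a sketch of the single-script variant that takes the verdict as an argument; the function name bisect_step is made up, and the paths are the ones from the quoted script.

```bash
#!/bin/bash
# Hypothetical variant of the quoted helper: pass "good" or "bad".
# Run from the top of the kernel tree, with ../sata_nv.c holding the
# 2.6.20-rc5 driver and ../sata_nv.c.orig the current revision's own copy.

bisect_step() {
    case "$1" in
        good|bad) ;;
        *) echo "usage: bisect_step good|bad" >&2; return 2 ;;
    esac
    # Put the revision's own sata_nv.c back so git sees a clean tree.
    cp ../sata_nv.c.orig drivers/ata/sata_nv.c || return 1
    # Record the verdict; git then checks out the next revision to test.
    git bisect "$1" || return 1
    # Save the new revision's driver, then drop in the 2.6.20-rc5 one.
    cp drivers/ata/sata_nv.c ../sata_nv.c.orig || return 1
    cp ../sata_nv.c drivers/ata/ || return 1
    make oldconfig && make -j4
}
```

After testing a kernel, call bisect_step good or bisect_step bad; any other argument is rejected before a file is touched.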

> But, this time it looks slightly different:
> ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> ata3.00: tag 0 cmd 0xec Emask 0x4 stat 0x40 err 0x0 (timeout)
>
> [Rest of the error message + SMART error snipped]

I get the same exception every time, doesn't change for me. And neither
do I get any SMART errors or something.

Thanks,
Björn

Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Björn Steinbrink

On 2007.01.21 09:36:18 +0100, Björn Steinbrink wrote:
> On 2007.01.21 00:39:20 -0600, Robert Hancock wrote:
> > Björn Steinbrink wrote:
> > > On 2007.01.20 22:34:27 -0500, Jeff Garzik wrote:
> > > > Robert Hancock wrote:
> > > > > change in 2.6.20-rc is either causing or triggering this problem. It
> > > > > would be useful if you could try git bisect between 2.6.19 and
> > > > > 2.6.20-rc5, keeping the latest sata_nv.c each time, and see if that
> > > >
> > > > Yes, 'git bisect' would be the next step in figuring out this puzzle.
> > > >
> > > > Anybody up for it?
> > >
> > > I'll go for it, but could I get an explanation how that could lead to a
> > > different result than my last bisection? I see the difference of keeping
> > > sata_nv.c but my brain can't wrap around it right now (woke up in the
> > > middle of the night and still not up to speed...).
> >
> > Whatever the problem is, only seems to show up when ADMA is enabled, and
> > so the patch that added ADMA support shows up as the culprit from your
> > git bisect. However, from what Chr is reporting, 2.6.19 with the ADMA
> > support added in doesn't seem to have the problem, so presumably
> > something else that changed in the 2.6.20-rc series is triggering it.
> > Doing a bisect while keeping the driver code itself the same will
> > hopefully identify what that change is..
>
> Ah, right... sata_nv.c of course interacts with the outside world, d'oh!
>
> Up to now, I only got bad kernels, latest tested being:
> 94fcda1f8ab5e0cacc381c5ca1cc9aa6ad523576
>
> Which, unless I missed a commit in the diff, only USB changes,
> continuing anyway.

All kernels were bad using that approach. So back to square 1. :/

Björn

Re: SATA exceptions triggered by XFS (since 2.6.18)

2007-01-21 Thread Paolo Ornati

On Sun, 21 Jan 2007 11:32:02 -0600
Robert Hancock [EMAIL PROTECTED] wrote:

> It looks like what you're getting is an actual NCQ write timing out.
> That makes the bisect result not very interesting since obviously it
> wouldn't have issued any NCQ writes before NCQ support was
> implemented. Seeing as how it's also an entirely different driver I
> imagine it's a different problem than what I've been looking at.
>
> Maybe that drive just has some issues with NCQ? I would be surprised
> at that with a Seagate though..

I don't know. It's a two-year-old ST380817AS.


# smartctl -a -d ata /dev/sda

smartctl version 5.36 [x86_64-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.7 and 7200.7 Plus family
Device Model:     ST380817AS
Serial Number:    4MR08EK8
Firmware Version: 3.42
User Capacity:    80,026,361,856 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 2
Local Time is:    Sun Jan 21 20:15:40 2007 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 430) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  47) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   059   049   006    Pre-fail  Always       -       215927244
  3 Spin_Up_Time            0x0003   098   098   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   098   098   020    Old_age   Always       -       2182
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   083   060   030    Pre-fail  Always       -       204305750
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       3494
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   098   098   020    Old_age   Always       -       2541
194 Temperature_Celsius     0x0022   024   040   000    Old_age   Always       -       24 (Lifetime Min/Max 0/15)
195 Hardware_ECC_Recovered  0x001a   059   049   000    Old_age   Always       -       215927244
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x       100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 12 (device log contains only the most recent five errors)
CR = Command Register [HEX]
FR = Features Register [HEX]
SC = Sector Count Register [HEX]
SN = Sector Number Register [HEX]
CL = Cylinder Low Register [HEX]
CH = Cylinder High Register [HEX]
DH = Device/Head Register [HEX]
DC = Device Command Register [HEX]
ER = Error register

Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Robert Hancock


Björn Steinbrink wrote:
> All kernels were bad using that approach. So back to square 1. :/
>
> Björn



OK guys, here's a new patch to try against 2.6.20-rc5:

Right now when switching between ADMA mode and legacy mode (i.e. when 
going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just 
set the ADMA GO register bit appropriately and continue with no delay. 
It looks like in some cases the controller doesn't respond to this 
immediately, it takes some nanoseconds for the controller's status 
registers to reflect the change that was made. It's possible that if we 
were trying to issue commands during this time, the controller might not 
react properly. This patch adds some code to wait for the status 
register to change to the state we asked for before continuing.


--
Robert Hancock  Saskatoon, SK, Canada
To email, remove nospam from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

--- linux-2.6.20-rc5/drivers/ata/sata_nv.c  2007-01-19 19:18:53.0 -0600
+++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c 2007-01-21 13:35:17.0 -0600
@@ -509,14 +509,38 @@ static void nv_adma_register_mode(struct
 {
 	void __iomem *mmio = nv_adma_ctl_block(ap);
 	struct nv_adma_port_priv *pp = ap->private_data;
-	u16 tmp;
+	u16 tmp, status;
+	int count = 0;
 
 	if (pp->flags & NV_ADMA_PORT_REGISTER_MODE)
 		return;
 
+	status = readw(mmio + NV_ADMA_STAT);
+	while(!(status & NV_ADMA_STAT_IDLE) && count < 20) {
+		ndelay(50);
+		status = readw(mmio + NV_ADMA_STAT);
+		count++;
+	}
+	if(count == 20)
+		ata_port_printk(ap, KERN_WARNING,
+			"timeout waiting for ADMA IDLE, stat=0x%hx\n",
+			status);
+
 	tmp = readw(mmio + NV_ADMA_CTL);
 	writew(tmp & ~NV_ADMA_CTL_GO, mmio + NV_ADMA_CTL);
 
+	count = 0;
+	status = readw(mmio + NV_ADMA_STAT);
+	while(!(status & NV_ADMA_STAT_LEGACY) && count < 20) {
+		ndelay(50);
+		status = readw(mmio + NV_ADMA_STAT);
+		count++;
+	}
+	if(count == 20)
+		ata_port_printk(ap, KERN_WARNING,
+			 "timeout waiting for ADMA LEGACY, stat=0x%hx\n",
+			 status);
+
 	pp->flags |= NV_ADMA_PORT_REGISTER_MODE;
 }
 
@@ -524,7 +548,8 @@ static void nv_adma_mode(struct ata_port
 {
 	void __iomem *mmio = nv_adma_ctl_block(ap);
 	struct nv_adma_port_priv *pp = ap->private_data;
-	u16 tmp;
+	u16 tmp, status;
+	int count = 0;
 
 	if (!(pp->flags & NV_ADMA_PORT_REGISTER_MODE))
 		return;
@@ -534,6 +559,18 @@ static void nv_adma_mode(struct ata_port
 	tmp = readw(mmio + NV_ADMA_CTL);
 	writew(tmp | NV_ADMA_CTL_GO, mmio + NV_ADMA_CTL);
 
+	status = readw(mmio + NV_ADMA_STAT);
+	while(((status & NV_ADMA_STAT_LEGACY) ||
+	      !(status & NV_ADMA_STAT_IDLE)) && count < 20) {
+		ndelay(50);
+		status = readw(mmio + NV_ADMA_STAT);
+		count++;
+	}
+	if(count == 20)
+		ata_port_printk(ap, KERN_WARNING,
+			"timeout waiting for ADMA LEGACY clear and IDLE, stat=0x%hx\n",
+			status);
+
 	pp->flags &= ~NV_ADMA_PORT_REGISTER_MODE;
 }

Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Chr

On Sunday, 21. January 2007 19:01, Björn Steinbrink wrote:
> On 2007.01.21 18:34:40 +0100, Chr wrote:
>
> I run those two in parallel:
> while /bin/true; do ls -lR / > /dev/null 2>&1; done
> while /bin/true; do echo 255 > /proc/sys/vm/drop_caches; sleep 1; done
>
> Not sure if running them in parallel is necessary, but I don't want to
> change the test setup ;) Takes between 1 and 40 minutes to trigger it.
> Most of the time it's around 15 minutes now, doing more random stuff in
> addition to that seems to trigger it even easier (like reading mail,
> rebuilding the kernel etc.).
>
> I'm down to 2 commits after 2.6.19 now, only bad kernels, so I tend to
> say that 2.6.19 with 2.6.20-rc5's sata_nv.c will also fail for me, but I
> thought I might finish bisection just to be sure.
>
> > But, this time it looks slightly different:
> > ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> > ata3.00: tag 0 cmd 0xec Emask 0x4 stat 0x40 err 0x0 (timeout)
> >
> > [Rest of the error message + SMART error snipped]
>
> I get the same exception every time, doesn't change for me. And neither
> do I get any SMART errors or something.
>
> Thanks,
> Björn

Ok, you won't believe this... I opened my case and rewired my drives...
And guess what, my second (aka the good) HDD is now failing!
I guess my mainboard has one (or maybe two or three :( ) bad SATA port(s)!

But one small question remains: when I opened my case, I saw that my drives
are plugged into SATA jacks 1 and 2... The BIOS also says they're on 1 and 2.
Now, Linux says they're on ports 3 & 4!



it's always ata3.00!
ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata3.00: tag 0 cmd 0xea Emask 0x4 stat 0x40 err 0x0 (timeout)
ata3: soft resetting port
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata3.00: configured for UDMA/133
ata3: EH complete
SCSI device sda: 490234752 512-byte hdwr sectors (251000 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back


Thanks,
Chr.

Re: SATA exceptions triggered by XFS (since 2.6.18)

2007-01-21 Thread Chr

On Sunday, 21. January 2007 20:25, Paolo Ornati wrote:
> On Sun, 21 Jan 2007 11:32:02 -0600
> Robert Hancock [EMAIL PROTECTED] wrote:
>
> > It looks like what you're getting is an actual NCQ write timing out.
> > That makes the bisect result not very interesting since obviously it
> > wouldn't have issued any NCQ writes before NCQ support was
> > implemented. Seeing as how it's also an entirely different driver I
> > imagine it's a different problem than what I've been looking at.
> >
> > Maybe that drive just has some issues with NCQ? I would be surprised
> > at that with a Seagate though..
>
> I don't know. It's a two years old ST380817AS.

Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Björn Steinbrink

On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> > All kernels were bad using that approach. So back to square 1. :/
> >
> > Björn
>
> OK guys, here's a new patch to try against 2.6.20-rc5:
>
> Right now when switching between ADMA mode and legacy mode (i.e. when
> going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just
> set the ADMA GO register bit appropriately and continue with no delay.
> It looks like in some cases the controller doesn't respond to this
> immediately, it takes some nanoseconds for the controller's status
> registers to reflect the change that was made. It's possible that if we
> were trying to issue commands during this time, the controller might not
> react properly. This patch adds some code to wait for the status
> register to change to the state we asked for before continuing.

I went for the "I feel lucky" route and did just add mmio reads after the
mmio writes, posting them. Rationale being that if it is a write posting
issue, the debug patch would/could actually hide it AFAICT.
It's the "I feel lucky" route, because my whole knowledge about mmio
and write posting originates from the few things I read up on when you
discovered the comment about write posting in the generic ata code.

Björn

Re: SATA exceptions with 2.6.20-rc5

2007-01-21 Thread Björn Steinbrink

On 2007.01.21 23:08:11 +0100, Björn Steinbrink wrote:
> On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> > Björn Steinbrink wrote:
> > > All kernels were bad using that approach. So back to square 1. :/
> > >
> > > Björn
> >
> > OK guys, here's a new patch to try against 2.6.20-rc5:
> >
> > Right now when switching between ADMA mode and legacy mode (i.e. when
> > going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just
> > set the ADMA GO register bit appropriately and continue with no delay.
> > It looks like in some cases the controller doesn't respond to this
> > immediately, it takes some nanoseconds for the controller's status
> > registers to reflect the change that was made. It's possible that if we
> > were trying to issue commands during this time, the controller might not
> > react properly. This patch adds some code to wait for the status
> > register to change to the state we asked for before continuing.
>
> I went for the "I feel lucky" route and did just add mmio reads after the
> mmio writes, posting them. Rationale being that if it is a write posting
> issue, the debug patch would/could actually hide it AFAICT.
> It's the "I feel lucky" route, because my whole knowledge about mmio
> and write posting originates from the few things I read up on when you
> discovered the comment about write posting in the generic ata code.

Uhm, yeah, exception occurred about the time that I hit send.

Björn
