Re: SATA exceptions
On Fri, 13 Jul 2007, Tejun Heo wrote:
> >> OS and driver can't really do much about the reallocation event. Some
> >> number of reallocations is okay but if you see it going up constantly, you
> >> probably have a dying disk.
> >
> > Hmm... cut the power while writing is doable from OS and might force
> > reallocations?
>
> Hmmm... We don't have any pending write when power goes out and I don't
> think emergency unload can directly increase reallocation count. It can
> shorten lifespan of the head tho.
>
> > You might want to check if number of reallocated sectors increases
> > with shutdowns/reboots.
>
> I'm curious too.

It seems reboot/shutdown has no effect on reallocated sectors. After 5 reboots and 5 shutdowns the count didn't change at all.

zangetsu ~ # smartctl -a /dev/sda | grep Reall
  5 Reallocated_Sector_Ct   0x0033   067   067   010    Pre-fail  Always       -       314
196 Reallocated_Event_Count 0x0032   067   067   000    Old_age   Always       -       314

Cheers
-- 
S.Çağlar Onur <[EMAIL PROTECTED]>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!
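The reboot check above compares the raw values of attributes 5 and 196 before and after. A minimal Python sketch of that comparison follows; the helper names `raw_values` and `growth` are hypothetical, and it assumes the usual ten-column `smartctl` attribute layout with the raw value in the last column.

```python
def raw_values(smartctl_text, wanted=(5, 196)):
    """Return {attribute_id: raw_value} for the wanted SMART attributes."""
    found = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns
        # (assumed layout: ID, name, flag, value, worst, thresh, type,
        # updated, when_failed, raw).
        if len(fields) >= 10 and fields[0].isdigit() and int(fields[0]) in wanted:
            found[int(fields[0])] = int(fields[9])
    return found


def growth(before, after):
    """Per-attribute increase in raw value between two snapshots."""
    return {attr: after[attr] - before[attr] for attr in before if attr in after}


# Snapshot taken from the message above.
BEFORE = """\
  5 Reallocated_Sector_Ct   0x0033   067   067   010    Pre-fail  Always       -       314
196 Reallocated_Event_Count 0x0032   067   067   000    Old_age   Always       -       314
"""
# The message reports no change after 5 reboots and 5 shutdowns.
AFTER = BEFORE

print(growth(raw_values(BEFORE), raw_values(AFTER)))  # {5: 0, 196: 0}
```

Saving one snapshot per boot and feeding consecutive pairs to `growth` would show whether the counters move with power cycles or only with use.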
Re: SATA exceptions
Pavel Machek wrote:
>>>> Your SMART log shows 309 reallocated sectors. That seems somewhat high..
>>> Ah, sorry for misinterpreting the content :). It's a quite new piece of
>>> hardware (at most ~1.5 months old) and "Reallocated_Event_Count"
>>> constantly increases (currently it has increased to 313). Although I'm
>>> not 100 percent sure, these errors only occurred with kernels > 2.6.18
>>> (or 2.6.18 didn't report them, because according to kern.log they are
>>> only visible with 2.6.22+).
>> OS and driver can't really do much about the reallocation event. Some
>> number of reallocations is okay but if you see it going up constantly, you
>> probably have a dying disk.
>
> Hmm... cut the power while writing is doable from OS and might force
> reallocations?

Hmmm... We don't have any pending writes when power goes out, and I don't think an emergency unload can directly increase the reallocation count. It can shorten the lifespan of the head, though.

> You might want to check if number of reallocated sectors increases
> with shutdowns/reboots.

I'm curious too.

-- 
tejun
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: SATA exceptions
Hi!

> >> Your SMART log shows 309 reallocated sectors. That seems somewhat high..
> >
> > Ah, sorry for misinterpreting the content :). It's a quite new piece of
> > hardware (at most ~1.5 months old) and "Reallocated_Event_Count"
> > constantly increases (currently it has increased to 313). Although I'm
> > not 100 percent sure, these errors only occurred with kernels > 2.6.18
> > (or 2.6.18 didn't report them, because according to kern.log they are
> > only visible with 2.6.22+).
>
> OS and driver can't really do much about the reallocation event. Some
> number of reallocations is okay but if you see it going up constantly, you
> probably have a dying disk.

Hmm... cutting the power while writing is doable from the OS and might force reallocations?

You might want to check whether the number of reallocated sectors increases with shutdowns/reboots.

							Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
Re: SATA exceptions
Mark Lord wrote:
> I'm not even sure how to interpret those numbers.
> It seems rather odd that nearly all fields are either "100" or "253",
> so those are probably pre-programmed numbers rather than actual counts.
> The raw value at the end of the line (for the various "Reallocated*" fields)
> is probably the real value here.

I dunno exactly either. Different vendors seem to use different metrics anyway, but an increasing raw number on the reallocation counter is pretty easy to interpret.

-- 
tejun
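To make the distinction between the normalized columns and the raw column concrete, here is a rough Python sketch that splits one attribute row into its fields. `SmartAttr`, `parse_row` and `has_failed` are hypothetical names, the column order is assumed to match the `smartctl` header quoted in this thread, and the "VALUE at or below THRESH means failed" rule is the general SMART convention rather than anything specific to this drive.

```python
from typing import NamedTuple


class SmartAttr(NamedTuple):
    attr_id: int
    name: str
    value: int   # vendor-normalized figure, not a count
    worst: int   # lowest normalized value seen so far
    thresh: int  # failure threshold for the normalized value
    raw: str     # vendor-specific raw field; a plain count for Reallocated_*


def parse_row(line):
    # Split into at most 10 fields so a trailing raw field such as
    # "53 (Lifetime Min/Max 0/15381)" stays in one piece.
    f = line.split(None, 9)
    return SmartAttr(int(f[0]), f[1], int(f[3]), int(f[4]), int(f[5]), f[9].strip())


def has_failed(attr):
    # General convention: an attribute has failed once VALUE <= THRESH.
    return attr.value <= attr.thresh


# Rows as posted in the thread (2.6.18 machine, then the affected machine).
healthy = parse_row("  5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0")
worn = parse_row("  5 Reallocated_Sector_Ct   0x0033   067   067   010    Pre-fail  Always       -       314")

for attr in (healthy, worn):
    print(attr.name, "normalized:", attr.value, "raw:", attr.raw, "failed:", has_failed(attr))
```

Read this way, the 253 on the first machine is a normalized figure sitting next to a raw count of 0, while the second machine has dropped to 67 with 314 reallocations and is still above its threshold of 10.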
Re: SATA exceptions
Tejun Heo wrote:
> Hello,
>
> S.Çağlar Onur wrote:
>> On Sat, 07 Jul 2007, Robert Hancock wrote:
>>> It's not the free space on the drive that matters, it's the number of
>>> free sectors in the spare sector pool on the drive, which is invisible
>>> to software.
>>>
>>> Your SMART log shows 309 reallocated sectors. That seems somewhat high..
>>
>> Ah, sorry for misinterpreting the content :). It's a quite new piece of
>> hardware (at most ~1.5 months old) and "Reallocated_Event_Count"
>> constantly increases (currently it has increased to 313). Although I'm
>> not 100 percent sure, these errors only occurred with kernels > 2.6.18
>> (or 2.6.18 didn't report them, because according to kern.log they are
>> only visible with 2.6.22+).
>
> OS and driver can't really do much about the reallocation event. Some
> number of reallocations is okay but if you see it going up constantly, you
> probably have a dying disk.

Or, as I learned the hard way, if you have the problem on all drives sharing a power supply, a power issue.

>> We bought 3 HP Pavilion dv2385ea laptops; one of them only runs 2.6.18,
>> and its smartctl output follows as a reference:
>>
>>   5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail
>> 196 Reallocated_Event_Count 0x0032   253   253   000    Old_age
>
> Hmm... This is pretty high too. Do the counts increase on this machine too?

-- 
Bill Davidsen <[EMAIL PROTECTED]>
  "We have more to fear from the bungling of the incompetent than from
  the machinations of the wicked."  - from Slashdot
Re: SATA exceptions
Hi;

On Mon, 09 Jul 2007, Tejun Heo wrote:
> > On Sat, 07 Jul 2007, Robert Hancock wrote:
> >> It's not the free space on the drive that matters, it's the number of
> >> free sectors in the spare sector pool on the drive, which is invisible
> >> to software.
> >>
> >> Your SMART log shows 309 reallocated sectors. That seems somewhat high..
> >
> > Ah, sorry for misinterpreting the content :). It's a quite new piece of
> > hardware (at most ~1.5 months old) and "Reallocated_Event_Count"
> > constantly increases (currently it has increased to 313). Although I'm
> > not 100 percent sure, these errors only occurred with kernels > 2.6.18
> > (or 2.6.18 didn't report them, because according to kern.log they are
> > only visible with 2.6.22+).
>
> OS and driver can't really do much about the reallocation event. Some
> number of reallocations is okay but if you see it going up constantly, you
> probably have a dying disk.

Hmm, that's really interesting. Then it means three ~1.5-month-old laptops are dying of the same disease :) or they were already defective somehow (or we are damaging them, but mine has been sitting happily on my table all that time :P).

> > We bought 3 HP Pavilion dv2385ea laptops; one of them only runs 2.6.18,
> > and its smartctl output follows as a reference:
> >
> >   5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail
> > 196 Reallocated_Event_Count 0x0032   253   253   000    Old_age
>
> Hmm... This is pretty high too. Do the counts increase on this machine
> too?

Yes, it seems so (I'm adding Onur and İsmail to CC as the owners of the other machines). Here are the SMART logs for the three separate machines. Interestingly, İsmail and I run 2.6.22 (over 300 reallocations occurred for both of us) and Onur uses 2.6.18 (0 reallocations occurred for him).

[1] http://cekirdek.pardus.org.tr/~caglar/SATA/smart.caglar
[2] http://cekirdek.pardus.org.tr/~caglar/SATA/smart.ismail
[3] http://cekirdek.pardus.org.tr/~caglar/SATA/smart.onur

Cheers
-- 
S.Çağlar Onur <[EMAIL PROTECTED]>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!
Re: SATA exceptions
Hello,

S.Çağlar Onur wrote:
> On Sat, 07 Jul 2007, Robert Hancock wrote:
>> It's not the free space on the drive that matters, it's the number of
>> free sectors in the spare sector pool on the drive, which is invisible
>> to software.
>>
>> Your SMART log shows 309 reallocated sectors. That seems somewhat high..
>
> Ah, sorry for misinterpreting the content :). It's a quite new piece of
> hardware (at most ~1.5 months old) and "Reallocated_Event_Count"
> constantly increases (currently it has increased to 313). Although I'm
> not 100 percent sure, these errors only occurred with kernels > 2.6.18
> (or 2.6.18 didn't report them, because according to kern.log they are
> only visible with 2.6.22+).

The OS and driver can't really do much about the reallocation event. Some number of reallocations is okay, but if you see it going up constantly, you probably have a dying disk.

> We bought 3 HP Pavilion dv2385ea laptops; one of them only runs 2.6.18,
> and its smartctl output follows as a reference:
>
>   5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail
> 196 Reallocated_Event_Count 0x0032   253   253   000    Old_age

Hmm... This is pretty high too. Do the counts increase on this machine too?

-- 
tejun
Re: SATA exceptions
Hi;

On Sat, 07 Jul 2007, Robert Hancock wrote:
> It's not the free space on the drive that matters, it's the number of
> free sectors in the spare sector pool on the drive, which is invisible
> to software.
>
> Your SMART log shows 309 reallocated sectors. That seems somewhat high..

Ah, sorry for misinterpreting the content :). It's a quite new piece of hardware (at most ~1.5 months old) and "Reallocated_Event_Count" constantly increases (currently it has increased to 313). Although I'm not 100 percent sure, these errors only occurred with kernels > 2.6.18 (or 2.6.18 didn't report them, because according to kern.log they are only visible with 2.6.22+).

We bought 3 HP Pavilion dv2385ea laptops; one of them only runs 2.6.18, and its smartctl output follows as a reference:

smartctl version 5.37 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG HM160JI
Serial Number:    S0W6J10P331479
Firmware Version: AD100-16
User Capacity:    160.041.885.696 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 0
Local Time is:    Sun Jul  8 00:22:21 2007 EEST

==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (5391) seconds.
Offline data collection
capabilities:                    (0x51) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  89) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   253   253   025    Pre-fail  Always       -       2880
  4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2648
  5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   253   253   015    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   253   253   000    Old_age   Always       -       236
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       1
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       2
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       57
187 Unknown_Attribute       0x0032   253   253   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   253   253   000    Old_age   Always       -       0
190 Temperature_Celsius     0x0022   047   040   040    Old_age   Always   In_the_past 1008009269
191 G-Sense_Error_Rate      0x0012   100   100   000    Old_age   Always       -       5396
192 Power-Off_Retract_Count 0x0012   100   100   000    Old_age   Always       -       40
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       2575
194 Temperature_Celsius     0x0022   047   040   000    Old_age   Always       -       53 (Lifetime Min/Max 0/15381)
195 Hardware_ECC_Recovered  0x001a   100   100   000    Old_age
Re: SATA exceptions
S.Çağlar Onur wrote:
> On Fri, 06 Jul 2007, Tejun Heo wrote:
>> S.Çağlar Onur wrote:
>>> [ 4260.278427] ata1.00: cmd ca/00:08:d0:88:bc/00:00:00:00:00/ee tag 0 cdb 0x0 data 4096 out
>>> [ 4260.278430]          res 51/40:01:d7:88:bc/00:00:0e:00:00/ee Emask 0x9 (media error)
>>
>> That's media error on sector 247236823 on WRITE. Media errors on write
>> are bad signs - it usually means the drive even failed to remap the
>> sector because extra space ran out.
>
> Hmm, more than 50GB is empty on disk :)

It's not the free space on the drive that matters, it's the number of free sectors in the spare sector pool on the drive, which is invisible to software.

Your SMART log shows 309 reallocated sectors. That seems somewhat high..

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
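For readers wondering where "sector 247236823" comes from, the following Python sketch recovers it from the `res` line quoted above. The byte order of the taskfile dump is an assumption here (command or status, then feature or error, sector count and the three low LBA bytes, then five high-order bytes, then the device byte); `parse_taskfile` and `lba28` are hypothetical helper names, and the 28-bit formula is assumed to apply because 0xca is a 28-bit write command.

```python
def parse_taskfile(tf):
    """Split a libata taskfile dump into named integer fields.

    Assumed layout, not stated in the thread:
    first / feature:nsect:lbal:lbam:lbah / five high-order bytes / device
    """
    first, low, high, device = tf.split("/")
    feature, nsect, lbal, lbam, lbah = (int(b, 16) for b in low.split(":"))
    return {
        "first": int(first, 16),      # command (cmd line) or status (res line)
        "feature": feature,           # feature (cmd line) or error (res line)
        "nsect": nsect,
        "lbal": lbal,
        "lbam": lbam,
        "lbah": lbah,
        "hob": [int(b, 16) for b in high.split(":")],
        "device": int(device, 16),
    }


def lba28(tf):
    """28-bit LBA: low nibble of the device byte, then lbah, lbam, lbal."""
    t = parse_taskfile(tf)
    return ((t["device"] & 0x0F) << 24) | (t["lbah"] << 16) | (t["lbam"] << 8) | t["lbal"]


CMD = "ca/00:08:d0:88:bc/00:00:00:00:00/ee"  # issued command, from the log
RES = "51/40:01:d7:88:bc/00:00:0e:00:00/ee"  # result taskfile, from the log

print(lba28(RES))               # 247236823, the sector named in the thread
print(lba28(RES) - lba28(CMD))  # 7
```

Under these assumptions the command wrote 8 sectors (matching "data 4096 out" at 512 bytes per sector) starting at 247236816, and the drive reported the error on the last sector of that range.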
Re: SATA exceptions
S.Çağlar Onur wrote: 06 Tem 2007 Cum tarihinde, Tejun Heo şunları yazmıştı: S.Çağlar Onur wrote: [ 4260.278427] ata1.00: cmd ca/00:08:d0:88:bc/00:00:00:00:00/ee tag 0 cdb 0x0 data 4096 out [ 4260.278430] res 51/40:01:d7:88:bc/00:00:0e:00:00/ee Emask 0x9 (media error) That's media error on sector 247236823 on WRITE. Media errors on write are bad signs - it usually means the drive even failed to remap the sector because extra space ran out. Hmm, more than 50GB is empty on disk :) It's not the free space on the drive that matters, it's the number of free sectors in the spare sector pool on the drive, which is invisible to software. Your SMART log shows 309 reallocated sectors. That seems somewhat high.. -- Robert Hancock Saskatoon, SK, Canada To email, remove nospam from [EMAIL PROTECTED] Home Page: http://www.roberthancock.com/ - To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: SATA exceptions
Hi;

On Sat, 07 Jul 2007, Robert Hancock wrote:
> It's not the free space on the drive that matters, it's the number of
> free sectors in the spare sector pool on the drive, which is invisible
> to software. Your SMART log shows 309 reallocated sectors. That seems
> somewhat high..

Ah, sorry for misinterpreting the content :). It's quite a new piece of
hardware (at most ~1.5 months old) and Reallocated_Event_Count constantly
increases (currently it has increased to 313). Although I'm not 100 percent
sure, these errors only occurred with kernels newer than 2.6.18 (or 2.6.18
didn't report them, because according to kern.log they are only visible
with 2.6.22+).

We bought 3 HP Pavilion dv2385ea and one of them only runs 2.6.18; its
smartctl output follows as a reference:

smartctl version 5.37 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG HM160JI
Serial Number:    S0W6J10P331479
Firmware Version: AD100-16
User Capacity:    160.041.885.696 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 0
Local Time is:    Sun Jul 8 00:22:21 2007 EEST

==> WARNING: May need -F samsung or -F samsung2 enabled; see manual for details.

SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (5391) seconds.
Offline data collection
capabilities:                    (0x51) SMART execute Offline immediate.
                                        No Auto Offline data collection support.
                                        Suspend Offline collection upon new
                                        command.
                                        No Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        (  89) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0007   253   253   025    Pre-fail  Always       -       2880
  4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2648
  5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0025   253   253   015    Pre-fail  Offline      -       0
  9 Power_On_Hours          0x0032   253   253   000    Old_age   Always       -       236
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       1
 11 Calibration_Retry_Count 0x0012   100   100   000    Old_age   Always       -       2
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       57
187 Unknown_Attribute       0x0032   253   253   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   253   253   000    Old_age   Always       -       0
190 Temperature_Celsius     0x0022   047   040   040    Old_age   Always   In_the_past 1008009269
191 G-Sense_Error_Rate      0x0012   100   100   000    Old_age   Always       -       5396
192 Power-Off_Retract_Count 0x0012   100   100   000    Old_age   Always       -       40
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       2575
194 Temperature_Celsius     0x0022   047   040   000    Old_age   Always       -       53 (Lifetime Min/Max 0/15381)
195
Re: SATA exceptions
Hi;

On Fri, 06 Jul 2007, Tejun Heo wrote:
> S.Çağlar Onur wrote:
> > [ 4260.278427] ata1.00: cmd ca/00:08:d0:88:bc/00:00:00:00:00/ee tag 0 cdb 0x0 data 4096 out
> > [ 4260.278430] res 51/40:01:d7:88:bc/00:00:0e:00:00/ee Emask 0x9 (media error)
>
> That's media error on sector 247236823 on WRITE. Media errors on write
> are bad signs - it usually means the drive even failed to remap the
> sector because extra space ran out.

Hmm, more than 50GB is empty on disk :)

> I'm not sure this is the case here tho - the smart log is clear. Please
> run smart short/long tests and see what they say.

Both completed without a problem:

zangetsu ~ # smartctl -l selftest /dev/sda
smartctl version 5.37 [i686-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%       357         -
# 2  Short offline       Completed without error       00%       355         -

If you want me to try something else, please just say :)

Cheers
-- 
S.Çağlar Onur <[EMAIL PROTECTED]>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!
Re: SATA exceptions
Hello,

S.Çağlar Onur wrote:
> [ 4260.278427] ata1.00: cmd ca/00:08:d0:88:bc/00:00:00:00:00/ee tag 0 cdb 0x0 data 4096 out
> [ 4260.278430] res 51/40:01:d7:88:bc/00:00:0e:00:00/ee Emask 0x9 (media error)

That's a media error on sector 247236823 on WRITE. Media errors on write
are bad signs - it usually means the drive even failed to remap the
sector because extra space ran out. I'm not sure this is the case here
tho - the smart log is clear. Please run smart short/long tests and see
what they say.

-- 
tejun
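For readers wondering where sector 247236823 comes from: the sketch below decodes the `cmd` and `res` strings from the log above. It assumes the twelve bytes are printed in the order command/feature:nsect:lbal:lbam:lbah/hob_feature:hob_nsect:hob_lbal:hob_lbam:hob_lbah/device (in a `res` line the first two bytes would be status and error), and that command 0xca is a 28-bit LBA write, so the top four LBA bits sit in the low nibble of the device byte. The helper names `parse_taskfile` and `lba28` are made up for this example.

```python
# Assumed field order of a libata "cmd"/"res" taskfile dump.
FIELD_NAMES = (
    "command", "feature", "nsect", "lbal", "lbam", "lbah",
    "hob_feature", "hob_nsect", "hob_lbal", "hob_lbam", "hob_lbah",
    "device",
)


def parse_taskfile(fields):
    """Split e.g. '51/40:01:d7:88:bc/00:00:0e:00:00/ee' into named bytes."""
    values = [int(b, 16) for group in fields.split("/") for b in group.split(":")]
    if len(values) != len(FIELD_NAMES):
        raise ValueError("expected 12 taskfile bytes, got %d" % len(values))
    return dict(zip(FIELD_NAMES, values))


def lba28(tf):
    """Assemble a 28-bit LBA: device[3:0] | lbah | lbam | lbal."""
    return (
        ((tf["device"] & 0x0F) << 24)
        | (tf["lbah"] << 16)
        | (tf["lbam"] << 8)
        | tf["lbal"]
    )


cmd = parse_taskfile("ca/00:08:d0:88:bc/00:00:00:00:00/ee")
res = parse_taskfile("51/40:01:d7:88:bc/00:00:0e:00:00/ee")

first_sector = lba28(cmd)    # start of the 8-sector write
failed_sector = lba28(res)   # sector reported in the result taskfile
```

Under those assumptions `failed_sector` comes out as 247236823, the number Tejun quotes, and it is the last sector (start + 7) of the 8-sector, 4096-byte write.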
Re: SATA exceptions with 2.6.20-rc5
On 2007.02.04 02:13:51 +0100, Björn Steinbrink wrote:
> On 2007.02.02 23:48:14 -0600, Robert Hancock wrote:
> > There's a patch in -mm (sata_nv-use-adma-for-nodata-commands.patch)
> > which should hopefully avoid this problem for the cache flush commands,
> > at least - can you try that one out? You'll have to apply the other
> > sata_nv patches in -mm first, i.e. this order:
> >
> > http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-cleanup-adma-error-handling-v2.patch
> > http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-cleanup-adma-error-handling-v2-cleanup.patch
> > http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-use-adma-for-nodata-commands.patch
>
> Got 2.6.20-rc7 with them applied now (the rejects seemed trivial enough
> for me to fix them). Let's see how that works out...

After about 1.5 days of uptime, an involuntary reboot and another 3 days
of uptime, no sign of an exception. No stress testing was done, but a few
disk intensive actions did happen, at least more than with that -rc6 that
did throw an exception at me.

Björn
Re: SATA exceptions with 2.6.20-rc5
On 2007.02.02 23:48:14 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >On 2007.01.24 01:39:23 +0100, Björn Steinbrink wrote:
> >>On 2007.01.23 17:18:43 -0600, Robert Hancock wrote:
> >>>Larry Walton wrote:
> >>>>The last patch (sata_nv-force-int-dev-in-interrupt.patch)
> >>>>seems to have fix the problem. Much appreciated,
> >>>>thank you. I'd consider it a must have in 2.6.20.
> >>>Can any of the rest of you that have been seeing this problem also
> >>>confirm that this fixes it?
> >>Seems to work for me, uptime is about an hour now and no exception yet.
> >>Had the stress test running for only about 10 minutes, but I usually got
> >>an exception within an hour even during plain irssi usage, so I'm quite
> >>confident that the patch fixes it.
> >
> >Or maybe not :( Just got an exception on 2.6.20-rc6. Took 4 days of
> >uptime to trigger, so it's just a lot harder to trigger now.
>
> Same exception details as before?

Yes, exactly the same.

> There's a patch in -mm (sata_nv-use-adma-for-nodata-commands.patch)
> which should hopefully avoid this problem for the cache flush commands,
> at least - can you try that one out? You'll have to apply the other
> sata_nv patches in -mm first, i.e. this order:
>
> http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-cleanup-adma-error-handling-v2.patch
> http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-cleanup-adma-error-handling-v2-cleanup.patch
> http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-use-adma-for-nodata-commands.patch

Got 2.6.20-rc7 with them applied now (the rejects seemed trivial enough
for me to fix them). Let's see how that works out...

Björn
Re: SATA exceptions with 2.6.20-rc5
Björn Steinbrink wrote:
>On 2007.01.24 01:39:23 +0100, Björn Steinbrink wrote:
>>On 2007.01.23 17:18:43 -0600, Robert Hancock wrote:
>>>Larry Walton wrote:
>>>>The last patch (sata_nv-force-int-dev-in-interrupt.patch)
>>>>seems to have fix the problem. Much appreciated,
>>>>thank you. I'd consider it a must have in 2.6.20.
>>>Can any of the rest of you that have been seeing this problem also
>>>confirm that this fixes it?
>>Seems to work for me, uptime is about an hour now and no exception yet.
>>Had the stress test running for only about 10 minutes, but I usually got
>>an exception within an hour even during plain irssi usage, so I'm quite
>>confident that the patch fixes it.
>
>Or maybe not :( Just got an exception on 2.6.20-rc6. Took 4 days of
>uptime to trigger, so it's just a lot harder to trigger now.

Same exception details as before?

There's a patch in -mm (sata_nv-use-adma-for-nodata-commands.patch)
which should hopefully avoid this problem for the cache flush commands,
at least - can you try that one out? You'll have to apply the other
sata_nv patches in -mm first, i.e. this order:

http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-cleanup-adma-error-handling-v2.patch
http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-cleanup-adma-error-handling-v2-cleanup.patch
http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/sata_nv-use-adma-for-nodata-commands.patch

-- 
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
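Since the order matters, here is a small sketch that turns the three URLs above into the sequence of commands one would run from the top of the kernel tree once the files are downloaded. It only builds the command strings and does not run anything. `patch -p1` is assumed to be the right strip level for these broken-out patches, and the helper name `patch_commands` is hypothetical.

```python
import posixpath
from urllib.parse import urlparse

BASE = ("http://www2.kernel.org/pub/linux/kernel/people/akpm/patches/"
        "2.6/2.6.20-rc6/2.6.20-rc6-mm3/broken-out/")

# Same order as given in the message above.
PATCH_URLS = [
    BASE + "sata_nv-cleanup-adma-error-handling-v2.patch",
    BASE + "sata_nv-cleanup-adma-error-handling-v2-cleanup.patch",
    BASE + "sata_nv-use-adma-for-nodata-commands.patch",
]


def patch_commands(urls):
    """Return one 'patch -p1 < file' command per URL, preserving order."""
    commands = []
    for url in urls:
        filename = posixpath.basename(urlparse(url).path)
        commands.append("patch -p1 < %s" % filename)
    return commands
```

As Björn notes elsewhere in the thread, applying these to a different -rc than the one they were cut against can produce rejects that need fixing by hand.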
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.24 01:39:23 +0100, Björn Steinbrink wrote:
> On 2007.01.23 17:18:43 -0600, Robert Hancock wrote:
> > Larry Walton wrote:
> > >The last patch (sata_nv-force-int-dev-in-interrupt.patch)
> > >seems to have fix the problem. Much appreciated,
> > >thank you. I'd consider it a must have in 2.6.20.
> >
> > Can any of the rest of you that have been seeing this problem also
> > confirm that this fixes it?
>
> Seems to work for me, uptime is about an hour now and no exception yet.
> Had the stress test running for only about 10 minutes, but I usually got
> an exception within an hour even during plain irssi usage, so I'm quite
> confident that the patch fixes it.

Or maybe not :( Just got an exception on 2.6.20-rc6. Took 4 days of
uptime to trigger, so it's just a lot harder to trigger now.

Björn
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.24 09:24:00 +0100, Ian Kumlien wrote:
> On Tue, 2007-01-23 at 17:18 -0600, Robert Hancock wrote:
> > Larry Walton wrote:
> > > The last patch (sata_nv-force-int-dev-in-interrupt.patch)
> > > seems to have fix the problem. Much appreciated,
> > > thank you. I'd consider it a must have in 2.6.20.
> >
> > Can any of the rest of you that have been seeing this problem also
> > confirm that this fixes it?
>
> I applied it yesterday and today my dmesg contains three:
> BUG: at mm/truncate.c:60 cancel_dirty_page()

David Chinner sent two patches regarding that bug yesterday.

http://lkml.org/lkml/2007/1/23/190
http://lkml.org/lkml/2007/1/23/192

Björn
Re: SATA exceptions with 2.6.20-rc5
On Tue, 2007-01-23 at 17:18 -0600, Robert Hancock wrote:
> Larry Walton wrote:
> > The last patch (sata_nv-force-int-dev-in-interrupt.patch)
> > seems to have fix the problem. Much appreciated,
> > thank you. I'd consider it a must have in 2.6.20.
>
> Can any of the rest of you that have been seeing this problem also
> confirm that this fixes it?

I applied it yesterday and today my dmesg contains three of these:

BUG: at mm/truncate.c:60 cancel_dirty_page()

Call Trace:
 [8029f3e5] cancel_dirty_page+0x43/0x71
 [802ec1ab] reiserfs_cut_from_item+0x5f8/0x61d
 [802074fc] find_get_page+0x21/0x47
 [802ec51d] reiserfs_do_truncate+0x34d/0x495
 [802d9d47] reiserfs_truncate_file+0x199/0x2aa
 [802df9c5] reiserfs_file_release+0x261/0x281
 [80211b02] __fput+0xb1/0x17d
 [802218e0] filp_close+0x5d/0x65
 [8021bef5] sys_close+0x8c/0xcf
 [8025725e] system_call+0x7e/0x83

This never happened before... I don't know if they are related, though.

(It does fix the timeout problem)

-- 
Ian Kumlien pomac () vapor ! com -- http://pomac.netswarm.net
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.23 17:18:43 -0600, Robert Hancock wrote:
> Larry Walton wrote:
> >The last patch (sata_nv-force-int-dev-in-interrupt.patch)
> >seems to have fix the problem. Much appreciated,
> >thank you. I'd consider it a must have in 2.6.20.
>
> Can any of the rest of you that have been seeing this problem also
> confirm that this fixes it?

Seems to work for me, uptime is about an hour now and no exception yet.
Had the stress test running for only about 10 minutes, but I usually got
an exception within an hour even during plain irssi usage, so I'm quite
confident that the patch fixes it.

Thanks,
Björn
Re: SATA exceptions with 2.6.20-rc5
Larry Walton wrote:
>The last patch (sata_nv-force-int-dev-in-interrupt.patch)
>seems to have fix the problem. Much appreciated,
>thank you. I'd consider it a must have in 2.6.20.

Can any of the rest of you that have been seeing this problem also
confirm that this fixes it?

-- 
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
Re: SATA exceptions with 2.6.20-rc5
The last patch (sata_nv-force-int-dev-in-interrupt.patch)
seems to have fixed the problem. Much appreciated,
thank you. I'd consider it a must-have in 2.6.20.

-- 
*--* Mail: [EMAIL PROTECTED]
*--* Voice: 206.892.6269
*--* Cell: 206.225.0154
*--* HTTP://real.com
-- - - - - - - - R e a l - - - - - - - -
Re: SATA exceptions with 2.6.20-rc5
Björn Steinbrink wrote:
> Hm, I don't think it is unhappy about looking at NV_INT_STATUS_CK804.
> I'm running 2.6.20-rc5 with the INT_DEV check removed for 8 hours now
> without a single problem and that should still look at
> NV_INT_STATUS_CK804, right?
>
> I just noticed that my last email might not have been clear enough. The
> exceptions happened when I re-enabled the return statement in addition
> to the debug message. Without the INT_DEV check, it is completely fine
> AFAICT.

Indeed, it seems to be just the NV_INT_DEV check that is problematic.
Here's a patch that's likely better to test, it forces the NV_INT_DEV
flag on when a command is active, and also fixes that questionable code
in nv_host_intr that I mentioned.

-- 
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

--- linux-2.6.20-rc5/drivers/ata/sata_nv.c 2007-01-19 19:18:53.0 -0600
+++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c 2007-01-22 22:33:43.0 -0600
@@ -700,7 +700,6 @@ static void nv_adma_check_cpb(struct ata
 static int nv_host_intr(struct ata_port *ap, u8 irq_stat)
 {
 	struct ata_queued_cmd *qc = ata_qc_from_tag(ap, ap->active_tag);
-	int handled;
 
 	/* freeze if hotplugged */
 	if (unlikely(irq_stat & (NV_INT_ADDED | NV_INT_REMOVED))) {
@@ -719,13 +718,7 @@ static int nv_host_intr(struct ata_port
 	}
 
 	/* handle interrupt */
-	handled = ata_host_intr(ap, qc);
-	if (unlikely(!handled)) {
-		/* spurious, clear it */
-		ata_check_status(ap);
-	}
-
-	return 1;
+	return ata_host_intr(ap, qc);
 }
 
 static irqreturn_t nv_adma_interrupt(int irq, void *dev_instance)
@@ -752,6 +745,11 @@ static irqreturn_t nv_adma_interrupt(int
 			if (pp->flags & NV_ADMA_PORT_REGISTER_MODE) {
 				u8 irq_stat = readb(host->mmio_base + NV_INT_STATUS_CK804)
 					>> (NV_INT_PORT_SHIFT * i);
+				if(ata_tag_valid(ap->active_tag))
+					/** NV_INT_DEV indication seems unreliable at times
+					    at least in ADMA mode. Force it on always when a
+					    command is active, to prevent losing interrupts. */
+					irq_stat |= NV_INT_DEV;
 				handled += nv_host_intr(ap, irq_stat);
 				continue;
 			}
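To make the effect of the last hunk easier to see, here is a toy user-space model of the decision it changes. This is not kernel code. The bit values (`NV_INT_DEV = 0x01`, `NV_INT_PORT_SHIFT = 4`) are assumptions about sata_nv.c, and the "bail out when NV_INT_DEV is clear" check in `old_handler` follows the behaviour described in this thread rather than the driver source itself. The function names are invented for the illustration.

```python
# Assumed constants, for illustration only.
NV_INT_DEV = 0x01          # "device interrupt" bit in a port's status bits
NV_INT_PORT_SHIFT = 4      # each port owns 4 bits of the status register


def port_irq_stat(status_reg, port):
    """Mimic: u8 irq_stat = readb(...) >> (NV_INT_PORT_SHIFT * port)."""
    return (status_reg >> (NV_INT_PORT_SHIFT * port)) & 0xFF


def old_handler(irq_stat, command_active):
    """Pre-patch model: dismiss the interrupt when NV_INT_DEV is clear.

    Returns True if the active command's completion would be processed.
    """
    if not (irq_stat & NV_INT_DEV):
        return False       # dismissed; an active command is left to time out
    return command_active


def patched_handler(irq_stat, command_active):
    """Post-patch model: force NV_INT_DEV on while a command is active."""
    if command_active:
        irq_stat |= NV_INT_DEV
    if not (irq_stat & NV_INT_DEV):
        return False
    return command_active


# A status read with NV_INT_DEV clear while a command is outstanding:
lost_before = not old_handler(0x00, command_active=True)
handled_after = patched_handler(0x00, command_active=True)
```

In this model the old path drops the completion whenever the controller fails to raise NV_INT_DEV, which would match the timeouts people reported, while the patched path processes it as long as a command is outstanding.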
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.22 19:24:22 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >>>Running a kernel with the return statement replace by a line that prints
> >>>the irq_stat instead.
> >>>
> >>>Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.
> >>40 minutes stress test now and no exception yet. What's interesting is
> >>that ata1 saw exactly one interrupt with irq_stat 0x0, all others that
> >>might have get dropped are as above.
> >>I'll keep it running for some time and will then re-enable the return
> >>statement to see if there's a relation between the irq_stat 0x0 and the
> >>exception.
> >
> >No, doesn't seem to be related, did get 2 exceptions, but no irq_stat
> >0x0 for ata1. Syslog/dmesg has nothing new either, still the same
> >pattern of dismissed irq_stats.
>
> I've finally managed to reproduce this problem on my box, by doing:
>
> watch --interval=0.1 /sbin/hdparm -I /dev/sda
>
> on one drive and then running bonnie++ on /dev/sdb connected to the
> other port on the same controller device. Usually within a few minutes
> one of the IDENTIFY commands would time out in the same way you guys
> have been seeing.
>
> Through some various trials and tribulations, the only conclusion I can
> come to is that this controller really doesn't like that
> NV_INT_STATUS_CK804 register being looked at in ADMA mode. I tried
> adding some debug code to the qc_issue function that would check to see
> if the BUSY flag in altstatus went high or that register showed an
> interrupt within a certain time afterwards, however that really seemed
> to hose things, the system wouldn't even boot.

Hm, I don't think it is unhappy about looking at NV_INT_STATUS_CK804.
I'm running 2.6.20-rc5 with the INT_DEV check removed for 8 hours now
without a single problem and that should still look at
NV_INT_STATUS_CK804, right?

I just noticed that my last email might not have been clear enough. The
exceptions happened when I re-enabled the return statement in addition
to the debug message. Without the INT_DEV check, it is completely fine
AFAICT.

> Try out this patch, it just calls the ata_host_intr function where
> appropriate without using nv_host_intr which looks at the
> NV_INT_STATUS_CK804 register. This is what the original ADMA patch from
> Mr. Mysterious NVIDIA Person did, I'm guessing there may be a reason for
> that. With this patch I can get through a whole bonnie++ run with the
> repeated IDENTIFY requests running without seeing the error.

I'll see if I can schedule a test run for tomorrow, I currently need
this box.

Thanks,
Björn
Re: SATA exceptions with 2.6.20-rc5
Alistair John Strachan wrote:
> On Tuesday 23 January 2007 01:24, Robert Hancock wrote:
>> As a final aside, this is another case where the hardware docs for this
>> controller would really be useful, in order to know whether we are
>> actually supposed to be reading that register in ADMA mode or not. I
>> sent a query to Allen Martin at NVIDIA asking if there's a way I could
>> get access to the documents, but I haven't heard anything yet.
>
> Obviously, NVIDIA's response is disappointing, but thank you for putting the
> time in to debug this problem. Definitely sounds like a hardware defect, I'm
> just glad there's a workaround.
>
> Will we see this fix in 2.6.20?

Hopefully, assuming it actually does fix the problem for those that have
been seeing it.

--
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
Re: SATA exceptions with 2.6.20-rc5
On Tuesday 23 January 2007 01:24, Robert Hancock wrote:
> As a final aside, this is another case where the hardware docs for this
> controller would really be useful, in order to know whether we are
> actually supposed to be reading that register in ADMA mode or not. I
> sent a query to Allen Martin at NVIDIA asking if there's a way I could
> get access to the documents, but I haven't heard anything yet.

Obviously, NVIDIA's response is disappointing, but thank you for putting the
time in to debug this problem. Definitely sounds like a hardware defect, I'm
just glad there's a workaround.

Will we see this fix in 2.6.20?

--
Cheers,
Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.
Re: SATA exceptions with 2.6.20-rc5
Björn Steinbrink wrote:
>>>Running a kernel with the return statement replaced by a line that prints
>>>the irq_stat instead.
>>>
>>>Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.
>>40 minutes stress test now and no exception yet. What's interesting is
>>that ata1 saw exactly one interrupt with irq_stat 0x0, all others that
>>might have been dropped are as above.
>>I'll keep it running for some time and will then re-enable the return
>>statement to see if there's a relation between the irq_stat 0x0 and the
>>exception.
>
>No, doesn't seem to be related, did get 2 exceptions, but no irq_stat
>0x0 for ata1. Syslog/dmesg has nothing new either, still the same
>pattern of dismissed irq_stats.

I've finally managed to reproduce this problem on my box, by doing:

watch --interval=0.1 /sbin/hdparm -I /dev/sda

on one drive and then running bonnie++ on /dev/sdb connected to the
other port on the same controller device. Usually within a few minutes
one of the IDENTIFY commands would time out in the same way you guys
have been seeing.

Through some various trials and tribulations, the only conclusion I can
come to is that this controller really doesn't like that
NV_INT_STATUS_CK804 register being looked at in ADMA mode. I tried
adding some debug code to the qc_issue function that would check to see
if the BUSY flag in altstatus went high or that register showed an
interrupt within a certain time afterwards, however that really seemed
to hose things, the system wouldn't even boot.

Try out this patch, it just calls the ata_host_intr function where
appropriate without using nv_host_intr which looks at the
NV_INT_STATUS_CK804 register. This is what the original ADMA patch from
Mr. Mysterious NVIDIA Person did, I'm guessing there may be a reason for
that. With this patch I can get through a whole bonnie++ run with the
repeated IDENTIFY requests running without seeing the error.

As an aside, there seems to be some dubious code in nv_host_intr, if
ata_host_intr returns 0 for handled when a command is outstanding, it
goes and calls ata_check_status anyway. This is rather dangerous since
if an interrupt showed up right after ata_host_intr but before
ata_check_status, the ata_check_status would clear it and we would
forget about it. I tried fixing just that issue and still had this
problem however. I suspect that code is truly broken and needs further
thought, but this patch avoids calling it in the ADMA case, at any rate.

As a final aside, this is another case where the hardware docs for this
controller would really be useful, in order to know whether we are
actually supposed to be reading that register in ADMA mode or not. I
sent a query to Allen Martin at NVIDIA asking if there's a way I could
get access to the documents, but I haven't heard anything yet.

--
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

--- linux-2.6.20-rc5/drivers/ata/sata_nv.c 2007-01-19 19:18:53.0 -0600
+++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c 2007-01-22 18:35:09.0 -0600
@@ -750,9 +750,9 @@ static irqreturn_t nv_adma_interrupt(int
 		/* if in ATA register mode, use standard ata interrupt handler */
 		if (pp->flags & NV_ADMA_PORT_REGISTER_MODE) {
-			u8 irq_stat = readb(host->mmio_base + NV_INT_STATUS_CK804)
-				>> (NV_INT_PORT_SHIFT * i);
-			handled += nv_host_intr(ap, irq_stat);
+			struct ata_queued_cmd *qc = ata_qc_from_tag(ap, ap->active_tag);
+			if(qc && !(qc->tf.flags & ATA_TFLAG_POLLING))
+				handled += ata_host_intr(ap, qc);
 			continue;
 		}
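For anyone following along without the kernel tree at hand, here is a minimal user-space sketch of the decision the patch above makes in register mode. The struct layouts, the TFLAG_POLLING value and the helper names (port_model, active_cmd, should_dispatch) are invented for illustration and are not the libata definitions; only the shape of the test (an active command exists and it is not being polled) is taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the libata structures; not the kernel's layout. */
#define TFLAG_POLLING 0x1u   /* illustrative value, plays the role of ATA_TFLAG_POLLING */
#define TAG_NONE      (-1)   /* illustrative "no active command" marker */
#define MAX_TAGS      32

struct cmd_model {
	unsigned int tf_flags;
};

struct port_model {
	int active_tag;                    /* tag of the outstanding command, or TAG_NONE */
	struct cmd_model cmds[MAX_TAGS];
};

/* Plays the role of ata_qc_from_tag(): NULL when nothing is outstanding. */
static struct cmd_model *active_cmd(struct port_model *p)
{
	if (p->active_tag < 0 || p->active_tag >= MAX_TAGS)
		return NULL;
	return &p->cmds[p->active_tag];
}

/* True when the interrupt should be handed to the generic handler. */
static bool should_dispatch(struct port_model *p)
{
	struct cmd_model *qc = active_cmd(p);

	return qc != NULL && !(qc->tf_flags & TFLAG_POLLING);
}
```

The point of the model is that the status register never enters the decision: whether to call the generic handler depends only on driver-side state.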
Re: SATA exceptions with 2.6.20-rc5
On 1/15/07, Jeff Garzik <[EMAIL PROTECTED]> wrote:
Jens Axboe wrote:
> On Mon, Jan 15 2007, Jeff Garzik wrote:
>> Jens Axboe wrote:
>>> I'd be surprised if the device would not obey the 7 second timeout rule
>>> that seems to be set in stone and not allow more dirty in-drive cache
>>> than it could flush out in approximately that time.
>> AFAIK Windows flush-cache timeout is 30 seconds, not 7 as with other
>> commands...
>
> Ok, 7 seconds for FLUSH_CACHE would have been nice for us too though, as
> it would pretty much guarantee lower latencies for random writes and
> write back caching. The concern is the barrier code, of course. I guess
> I should do some timings on potential worst case patterns some day. Alan
> may have done that sometime in the past, iirc.

FWIW: According to the drive guys (Eric M, among others), FLUSH CACHE
will "probably" be under 30 seconds, but pathological cases might even
extend beyond that. Definitely more than 7 seconds in
less-than-pathological cases, unfortunately...

The mentioned Maxtor model (6Yxxx) isn't susceptible to the
large-buffer long completion times, due to architectural differences
and availability of only small buffers. Any "real" long-completion
flush on this device would, I believe, involve damage to the disk that
hinders the ability to seek, settle, or write. (e.g. 30-second flushes
are easy to hit if you mount the disk on a shaker-table with
sufficient amplitude)

Later in the thread I think people have pretty much isolated it as not
the disk's problem, but I just wanted to point this out.

I assume that large enough customers can buy enterprise-type command
completion ("all commands within X seconds") from most any disk
vendor. However, these firmwares require much smarter or more active
drivers or block layers, to handle the higher error rate that results
when the data on the device is valid but the command would take longer
than the arbitrary enterprise rules allow. Most customers who are
buying this many devices have software engineers customizing the
drivers or disk management applications to handle this differing
behavior.

--eric
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.22 17:57:08 +0100, Björn Steinbrink wrote:
> On 2007.01.22 17:12:40 +0100, Björn Steinbrink wrote:
> > On 2007.01.21 18:17:01 -0600, Robert Hancock wrote:
> > > Hmm, another miss, apparently.. Has anyone tried removing these lines
> > > from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?
> > >
> > > /* bail out if not our interrupt */
> > > if (!(irq_stat & NV_INT_DEV))
> > > 	return 0;
> >
> > Running a kernel with the return statement replaced by a line that prints
> > the irq_stat instead.
> >
> > Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.
>
> 40 minutes stress test now and no exception yet. What's interesting is
> that ata1 saw exactly one interrupt with irq_stat 0x0, all others that
> might have been dropped are as above.
> I'll keep it running for some time and will then re-enable the return
> statement to see if there's a relation between the irq_stat 0x0 and the
> exception.

No, doesn't seem to be related, did get 2 exceptions, but no irq_stat
0x0 for ata1. Syslog/dmesg has nothing new either, still the same
pattern of dismissed irq_stats.

Björn
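The bail-out test quoted above can be modelled in a few lines of plain C. This is only a sketch: the constant values (NV_INT_DEV as 0x01, NV_INT_PORT_SHIFT as 4) are written from memory of sata_nv.c and should be treated as assumptions, and the helper names are made up. It also assumes that ata1 is port 0 of the controller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values as remembered from sata_nv.c; treat them as assumptions. */
#define NV_INT_DEV        0x01u
#define NV_INT_PORT_SHIFT 4

/* Shift the shared status byte down so that this port's bits are lowest. */
static uint8_t port_irq_stat(uint8_t raw, unsigned int port)
{
	return (uint8_t)(raw >> (NV_INT_PORT_SHIFT * port));
}

/* Mirrors the "bail out if not our interrupt" test in nv_host_intr(). */
static bool is_our_interrupt(uint8_t irq_stat)
{
	return (irq_stat & NV_INT_DEV) != 0;
}
```

Under those assumptions, a raw byte of 0x10 stays 0x10 for port 0, where the device bit is clear and the interrupt is dismissed, and becomes 0x01 for port 1, where the device bit is set. That would be consistent with the 0x10 readings reported for ata1 being the other port's device interrupt, but that reading is an interpretation, not something the thread confirms.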
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.22 17:12:40 +0100, Björn Steinbrink wrote:
> On 2007.01.21 18:17:01 -0600, Robert Hancock wrote:
> > Björn Steinbrink wrote:
> > >On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> > >>Björn Steinbrink wrote:
> > >>>All kernels were bad using that approach. So back to square 1. :/
> > >>>
> > >>>Björn
> > >>>
> > >>OK guys, here's a new patch to try against 2.6.20-rc5:
> > >>
> > >>Right now when switching between ADMA mode and legacy mode (i.e. when
> > >>going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just
> > >>set the ADMA GO register bit appropriately and continue with no delay.
> > >>It looks like in some cases the controller doesn't respond to this
> > >>immediately, it takes some nanoseconds for the controller's status
> > >>registers to reflect the change that was made. It's possible that if we
> > >>were trying to issue commands during this time, the controller might not
> > >>react properly. This patch adds some code to wait for the status
> > >>register to change to the state we asked for before continuing.
> > >
> > >Just got two exceptions with your patch, none of the debug messages were
> > >issued.
> > >
> > >Björn
> >
> > Hmm, another miss, apparently.. Has anyone tried removing these lines
> > from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?
> >
> > /* bail out if not our interrupt */
> > if (!(irq_stat & NV_INT_DEV))
> > 	return 0;
>
> Running a kernel with the return statement replaced by a line that prints
> the irq_stat instead.
>
> Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.

40 minutes stress test now and no exception yet. What's interesting is
that ata1 saw exactly one interrupt with irq_stat 0x0, all others that
might have been dropped are as above.
I'll keep it running for some time and will then re-enable the return
statement to see if there's a relation between the irq_stat 0x0 and the
exception.

Björn
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.21 18:17:01 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> >>Björn Steinbrink wrote:
> >>>All kernels were bad using that approach. So back to square 1. :/
> >>>
> >>>Björn
> >>>
> >>OK guys, here's a new patch to try against 2.6.20-rc5:
> >>
> >>Right now when switching between ADMA mode and legacy mode (i.e. when
> >>going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just
> >>set the ADMA GO register bit appropriately and continue with no delay.
> >>It looks like in some cases the controller doesn't respond to this
> >>immediately, it takes some nanoseconds for the controller's status
> >>registers to reflect the change that was made. It's possible that if we
> >>were trying to issue commands during this time, the controller might not
> >>react properly. This patch adds some code to wait for the status
> >>register to change to the state we asked for before continuing.
> >
> >Just got two exceptions with your patch, none of the debug messages were
> >issued.
> >
> >Björn
>
> Hmm, another miss, apparently.. Has anyone tried removing these lines
> from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?
>
> /* bail out if not our interrupt */
> if (!(irq_stat & NV_INT_DEV))
> 	return 0;

Running a kernel with the return statement replaced by a line that prints
the irq_stat instead.

Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2.

Björn
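The earlier patch described in the quoted text (wait for the status register to reflect the mode switch before continuing) is not reproduced in this part of the thread. The following is therefore not that patch, only a generic user-space sketch of the idea: poll a status word until it shows the requested state, and give up after a bounded number of reads. The accessor type, the fake controller and all names here are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register accessor: returns the current status word. */
typedef uint16_t (*read_status_fn)(void *ctx);

/*
 * Poll until (status & mask) == want, giving up after max_polls reads.
 * Returns true if the requested state was observed.
 */
static bool wait_for_status(read_status_fn read_status, void *ctx,
			    uint16_t mask, uint16_t want,
			    unsigned int max_polls)
{
	for (unsigned int i = 0; i < max_polls; i++) {
		if ((read_status(ctx) & mask) == want)
			return true;
		/* a real driver would delay briefly here, e.g. with ndelay() */
	}
	return false;
}

/* Fake controller for trying the loop out: the bit appears after `lag` reads. */
struct fake_ctrl {
	unsigned int reads;
	unsigned int lag;
};

static uint16_t fake_read(void *ctx)
{
	struct fake_ctrl *c = ctx;

	return (c->reads++ >= c->lag) ? 0x0001 : 0x0000;
}
```

Bounding the loop matters: if the controller never reports the new state, the caller gets a failure it can log instead of spinning forever, which is presumably what the debug messages mentioned in the reply were for.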
Re: SATA exceptions with 2.6.20-rc5
On Monday, 22. January 2007 03:39, Tejun Heo wrote:
> Hello,
>
> Chr wrote:
> > Ok, you won't believe this... I opened my case and rewired my drives...
> > And guess what, my second (aka the "good") HDD is now failing!
> > I guess, my mainboard has a (but maybe two, or three :( ) "bad"
> > sata-port(s)!
>
> Or, you have power related problem. Try to rewire the power lines or
> connect harddrives to a separate powersupply. It's often useful to
> change one component at a time and watch which change the problem
> follows. Anyways, you seem to be suffering transmission failures, not a
> driver problem.
>
> Thanks.
>

Yes and no, it's probably not a power problem, I've tried another PSU
with the same result :( .
Furthermore, the RAID0 setup makes it impossible to try only one drive
alone :(.

Anyway, the WD2500KS is known to have some strange bugs in the FW.
E.g. it reports 255°C right after a cold start.
( http://www.bugtrack.almico.com/view.php?id=468 ).

Thanks,
Chr.
Re: SATA exceptions triggered by XFS (since 2.6.18)
On Mon, 22 Jan 2007 18:35:05 +0900 Tejun Heo <[EMAIL PROTECTED]> wrote:
> Yeap, certainly. I'll ask people first before actually proceeding with
> the blacklisting. I'm just getting a bit tired of tides of NCQ firmware
> problems.

Another interesting thing: it seems that I'm unable to reproduce the
problem when mounting XFS with "nobarrier" (using sda queue_depth = 31).

So it looks like a problem with NCQ combined with the cache flush command...

--
Paolo Ornati
Linux 2.6.20-rc5 on x86_64
Re: SATA exceptions triggered by XFS (since 2.6.18)
On Mon, 22 Jan 2007 18:35:05 +0900 Tejun Heo <[EMAIL PROTECTED]> wrote:
> Yeap, certainly. I'll ask people first before actually proceeding with
> the blacklisting. I'm just getting a bit tired of tides of NCQ firmware
> problems.
>
> Anyways, for the time being, you can easily turn off NCQ using sysfs.
> Please take a look at http://linux-ata.org/faq.html

ok

--
Paolo Ornati
Linux 2.6.20-rc5 on x86_64
Re: SATA exceptions triggered by XFS (since 2.6.18)
Paolo Ornati wrote:
>>> === START OF INFORMATION SECTION ===
>>> Model Family: Seagate Barracuda 7200.7 and 7200.7 Plus family
>>> Device Model: ST380817AS
>> I'll blacklist it. Thanks.
>
> Ok. It will be better if someone else with the same HD could confirm.
>
> It looks so strange that an HD that works fine, and should support NCQ,
> has such big trouble that I can "freeze" it in less than a second by
> using XFS (while with ext3 I cannot, or at least it's very hard).

Yeap, certainly. I'll ask people first before actually proceeding with
the blacklisting. I'm just getting a bit tired of tides of NCQ firmware
problems.

Anyways, for the time being, you can easily turn off NCQ using sysfs.
Please take a look at http://linux-ata.org/faq.html

--
tejun
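For reference, the sysfs knob mentioned above is, as far as I know, the per-device queue_depth attribute, and the FAQ's recipe amounts to writing 1 to it from a root shell. A C equivalent is sketched below; set_queue_depth is a made-up helper name, and the path in the usage note assumes the drive is sda.

```c
#include <stdio.h>

/*
 * Write a queue depth to a sysfs-style attribute file.
 * Returns 0 on success, -1 on failure.
 */
static int set_queue_depth(const char *attr_path, int depth)
{
	FILE *f = fopen(attr_path, "w");

	if (f == NULL)
		return -1;
	if (fprintf(f, "%d\n", depth) < 0) {
		fclose(f);
		return -1;
	}
	/* fclose() flushes the buffer, so a rejected write shows up here */
	return fclose(f) == 0 ? 0 : -1;
}
```

Run as root, set_queue_depth("/sys/block/sda/device/queue_depth", 1) should limit the drive to one outstanding command, which effectively turns NCQ off; writing 31 again restores the depth mentioned elsewhere in this thread.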
Re: SATA exceptions triggered by XFS (since 2.6.18)
On Mon, 22 Jan 2007 11:46:01 +0900 Tejun Heo <[EMAIL PROTECTED]> wrote:
> > I don't know. It's a two years old ST380817AS.
> >
> > # smartctl -a -d ata /dev/sda
> >
> > smartctl version 5.36 [x86_64-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
> > Home page is http://smartmontools.sourceforge.net/
> >
> > === START OF INFORMATION SECTION ===
> > Model Family: Seagate Barracuda 7200.7 and 7200.7 Plus family
> > Device Model: ST380817AS
>
> I'll blacklist it. Thanks.

Ok. It will be better if someone else with the same HD could confirm.

It looks so strange that an HD that works fine, and should support NCQ,
has such big trouble that I can "freeze" it in less than a second by
using XFS (while with ext3 I cannot, or at least it's very hard).

--
Paolo Ornati
Linux 2.6.20-rc5 on x86_64
Re: SATA exceptions triggered by XFS (since 2.6.18)
On Mon, 22 Jan 2007 01:53:21 +0059 Jiri Slaby <[EMAIL PROTECTED]> wrote:
> >>   7 Seek_Error_Rate         0x000f   083   060   030    Pre-fail  Always       -       204305750
> >>   1 Raw_Read_Error_Rate     0x000f   059   049   006    Pre-fail  Always       -       215927244
> >> 195 Hardware_ECC_Recovered  0x001a   059   049   000    Old_age   Always       -       215927244
> >
> > Wow! that HDD is really in a bad condition.
>
> I don't think so, this seems to be normal for Seagate drives...

I agree.

For Chr: I don't think these big raw numbers are counters. Look at the
normalized values instead, and see that they are greater than the THRESH
values (so they are good).

The meaning of the raw numbers is vendor specific.

--
Paolo Ornati
Linux 2.6.20-rc5 on x86_64
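The rule of thumb above (compare the normalized value against the threshold and ignore the raw number) can be written down as a small check. This is a sketch with made-up type and function names; the three table entries are the values quoted in the message, read as VALUE, WORST, THRESH in smartctl's column order.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct smart_attr {
	int id;
	const char *name;
	uint8_t value;   /* normalized current value */
	uint8_t worst;   /* lowest normalized value seen so far */
	uint8_t thresh;  /* failure threshold */
	uint64_t raw;    /* vendor-specific, deliberately not interpreted */
};

/* Healthy as long as the normalized value stays above the threshold. */
static bool smart_attr_ok(const struct smart_attr *a)
{
	return a->value > a->thresh;
}

/* The three attributes quoted in the message above. */
static const struct smart_attr quoted_attrs[] = {
	{   7, "Seek_Error_Rate",        83, 60, 30, 204305750ULL },
	{   1, "Raw_Read_Error_Rate",    59, 49,  6, 215927244ULL },
	{ 195, "Hardware_ECC_Recovered", 59, 49,  0, 215927244ULL },
};
```

All three entries pass the check (83 > 30, 59 > 6, 59 > 0), which matches the conclusion that the drive looks fine despite the large raw numbers.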
Re: SATA exceptions with 2.6.20-rc5
Björn Steinbrink wrote:

Running a kernel with the return statement replaced by a line that prints the irq_stat instead. Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2. 40 minutes of stress testing now and no exception yet. What's interesting is that ata1 saw exactly one interrupt with irq_stat 0x0, all others that might have gotten dropped are as above. I'll keep it running for some time and will then re-enable the return statement to see if there's a relation between the irq_stat 0x0 and the exception.

No, doesn't seem to be related, did get 2 exceptions, but no irq_stat 0x0 for ata1. Syslog/dmesg has nothing new either, still the same pattern of dismissed irq_stats.

I've finally managed to reproduce this problem on my box, by doing:

watch --interval=0.1 /sbin/hdparm -I /dev/sda

on one drive and then running bonnie++ on /dev/sdb connected to the other port on the same controller device. Usually within a few minutes one of the IDENTIFY commands would time out in the same way you guys have been seeing.

Through some various trials and tribulations, the only conclusion I can come to is that this controller really doesn't like that NV_INT_STATUS_CK804 register being looked at in ADMA mode. I tried adding some debug code to the qc_issue function that would check to see if the BUSY flag in altstatus went high or that register showed an interrupt within a certain time afterwards, however that really seemed to hose things, the system wouldn't even boot.

Try out this patch, it just calls the ata_host_intr function where appropriate without using nv_host_intr which looks at the NV_INT_STATUS_CK804 register. This is what the original ADMA patch from Mr. Mysterious NVIDIA Person did, I'm guessing there may be a reason for that. With this patch I can get through a whole bonnie++ run with the repeated IDENTIFY requests running without seeing the error.

As an aside, there seems to be some dubious code in nv_host_intr: if ata_host_intr returns 0 for handled when a command is outstanding, it goes and calls ata_check_status anyway. This is rather dangerous since if an interrupt showed up right after ata_host_intr but before ata_check_status, the ata_check_status would clear it and we would forget about it. I tried fixing just that issue and still had this problem however. I suspect that code is truly broken and needs further thought, but this patch avoids calling it in the ADMA case, at any rate.

As a final aside, this is another case where the hardware docs for this controller would really be useful, in order to know whether we are actually supposed to be reading that register in ADMA mode or not. I sent a query to Allen Martin at NVIDIA asking if there's a way I could get access to the documents, but I haven't heard anything yet.

--
Robert Hancock      Saskatoon, SK, Canada
To email, remove nospam from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

--- linux-2.6.20-rc5/drivers/ata/sata_nv.c 2007-01-19 19:18:53.0 -0600
+++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c 2007-01-22 18:35:09.0 -0600
@@ -750,9 +750,9 @@ static irqreturn_t nv_adma_interrupt(int
 
 			/* if in ATA register mode, use standard ata interrupt handler */
 			if (pp->flags & NV_ADMA_PORT_REGISTER_MODE) {
-				u8 irq_stat = readb(host->mmio_base + NV_INT_STATUS_CK804)
-					>> (NV_INT_PORT_SHIFT * i);
-				handled += nv_host_intr(ap, irq_stat);
+				struct ata_queued_cmd *qc = ata_qc_from_tag(ap, ap->active_tag);
+				if(qc && !(qc->tf.flags & ATA_TFLAG_POLLING))
+					handled += ata_host_intr(ap, qc);
 				continue;
 			}
Re: SATA exceptions with 2.6.20-rc5
On Tuesday 23 January 2007 01:24, Robert Hancock wrote:

As a final aside, this is another case where the hardware docs for this controller would really be useful, in order to know whether we are actually supposed to be reading that register in ADMA mode or not. I sent a query to Allen Martin at NVIDIA asking if there's a way I could get access to the documents, but I haven't heard anything yet.

Obviously, NVIDIA's response is disappointing, but thank you for putting the time in to debug this problem. Definitely sounds like a hardware defect, I'm just glad there's a workaround.

Will we see this fix in 2.6.20?

--
Cheers, Alistair.

Final year Computer Science undergraduate.
1F2 55 South Clerk Street, Edinburgh, UK.
Re: SATA exceptions with 2.6.20-rc5
Alistair John Strachan wrote:
On Tuesday 23 January 2007 01:24, Robert Hancock wrote:

As a final aside, this is another case where the hardware docs for this controller would really be useful, in order to know whether we are actually supposed to be reading that register in ADMA mode or not. I sent a query to Allen Martin at NVIDIA asking if there's a way I could get access to the documents, but I haven't heard anything yet.

Obviously, NVIDIA's response is disappointing, but thank you for putting the time in to debug this problem. Definitely sounds like a hardware defect, I'm just glad there's a workaround.

Will we see this fix in 2.6.20?

Hopefully, assuming it actually does fix the problem for those that have been seeing it..

--
Robert Hancock      Saskatoon, SK, Canada
To email, remove nospam from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.22 19:24:22 -0600, Robert Hancock wrote:
Björn Steinbrink wrote:

Running a kernel with the return statement replaced by a line that prints the irq_stat instead. Currently I'm seeing lots of 0x10 on ata1 and 0x0 on ata2. 40 minutes of stress testing now and no exception yet. What's interesting is that ata1 saw exactly one interrupt with irq_stat 0x0, all others that might have gotten dropped are as above. I'll keep it running for some time and will then re-enable the return statement to see if there's a relation between the irq_stat 0x0 and the exception.

No, doesn't seem to be related, did get 2 exceptions, but no irq_stat 0x0 for ata1. Syslog/dmesg has nothing new either, still the same pattern of dismissed irq_stats.

I've finally managed to reproduce this problem on my box, by doing:

watch --interval=0.1 /sbin/hdparm -I /dev/sda

on one drive and then running bonnie++ on /dev/sdb connected to the other port on the same controller device. Usually within a few minutes one of the IDENTIFY commands would time out in the same way you guys have been seeing.

Through some various trials and tribulations, the only conclusion I can come to is that this controller really doesn't like that NV_INT_STATUS_CK804 register being looked at in ADMA mode. I tried adding some debug code to the qc_issue function that would check to see if the BUSY flag in altstatus went high or that register showed an interrupt within a certain time afterwards, however that really seemed to hose things, the system wouldn't even boot.

Hm, I don't think it is unhappy about looking at NV_INT_STATUS_CK804. I've been running 2.6.20-rc5 with the INT_DEV check removed for 8 hours now without a single problem and that should still look at NV_INT_STATUS_CK804, right?

I just noticed that my last email might not have been clear enough. The exceptions happened when I re-enabled the return statement in addition to the debug message. Without the INT_DEV check, it is completely fine AFAICT.

Try out this patch, it just calls the ata_host_intr function where appropriate without using nv_host_intr which looks at the NV_INT_STATUS_CK804 register. This is what the original ADMA patch from Mr. Mysterious NVIDIA Person did, I'm guessing there may be a reason for that. With this patch I can get through a whole bonnie++ run with the repeated IDENTIFY requests running without seeing the error.

I'll see if I can schedule a test run for tomorrow, I currently need this box.

Thanks,
Björn
Re: SATA exceptions with 2.6.20-rc5
Björn Steinbrink wrote:

Hm, I don't think it is unhappy about looking at NV_INT_STATUS_CK804. I'm running 2.6.20-rc5 with the INT_DEV check removed for 8 hours now without a single problem and that should still look at NV_INT_STATUS_CK804, right?

I just noticed that my last email might not have been clear enough. The exceptions happened when I re-enabled the return statement in addition to the debug message. Without the INT_DEV check, it is completely fine AFAICT.

Indeed, it seems to be just the NV_INT_DEV check that is problematic. Here's a patch that's likely better to test, it forces the NV_INT_DEV flag on when a command is active, and also fixes that questionable code in nv_host_intr that I mentioned.

--
Robert Hancock      Saskatoon, SK, Canada
To email, remove nospam from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

--- linux-2.6.20-rc5/drivers/ata/sata_nv.c 2007-01-19 19:18:53.0 -0600
+++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c 2007-01-22 22:33:43.0 -0600
@@ -700,7 +700,6 @@ static void nv_adma_check_cpb(struct ata
 static int nv_host_intr(struct ata_port *ap, u8 irq_stat)
 {
 	struct ata_queued_cmd *qc = ata_qc_from_tag(ap, ap->active_tag);
-	int handled;
 
 	/* freeze if hotplugged */
 	if (unlikely(irq_stat & (NV_INT_ADDED | NV_INT_REMOVED))) {
@@ -719,13 +718,7 @@ static int nv_host_intr(struct ata_port
 	}
 
 	/* handle interrupt */
-	handled = ata_host_intr(ap, qc);
-	if (unlikely(!handled)) {
-		/* spurious, clear it */
-		ata_check_status(ap);
-	}
-
-	return 1;
+	return ata_host_intr(ap, qc);
 }
 
 static irqreturn_t nv_adma_interrupt(int irq, void *dev_instance)
@@ -752,6 +745,11 @@ static irqreturn_t nv_adma_interrupt(int
 			if (pp->flags & NV_ADMA_PORT_REGISTER_MODE) {
 				u8 irq_stat = readb(host->mmio_base + NV_INT_STATUS_CK804)
 					>> (NV_INT_PORT_SHIFT * i);
+				if(ata_tag_valid(ap->active_tag))
+					/** NV_INT_DEV indication seems unreliable at times
+					    at least in ADMA mode. Force it on always when a
+					    command is active, to prevent losing interrupts. */
+					irq_stat |= NV_INT_DEV;
 				handled += nv_host_intr(ap, irq_stat);
 				continue;
 			}
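
To make the effect of the workaround easier to follow, here is a small standalone C model of the decision it changes. This is only a sketch and not kernel code: the helper names (port_irq_stat, effective_irq_stat, would_dismiss) are made up for illustration, and the values used for NV_INT_DEV (0x01) and NV_INT_PORT_SHIFT (4) are assumptions that should be checked against sata_nv.c. It also assumes that ata1 is port index 0, in which case a reported irq_stat of 0x10 has no device-interrupt bit in that port's own low bits and would be dismissed by the "bail out if not our interrupt" check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration only; see sata_nv.c for the real ones. */
#define NV_INT_DEV        0x01u /* device interrupt bit in a port's status */
#define NV_INT_PORT_SHIFT 4     /* each port owns one 4-bit group */

/* Status bits as seen for one port index; mirrors the
 * "readb(...) >> (NV_INT_PORT_SHIFT * i)" expression in the patch. */
static uint8_t port_irq_stat(uint8_t raw, unsigned port)
{
	return (uint8_t)(raw >> (NV_INT_PORT_SHIFT * port));
}

/* The workaround: while a command is outstanding, always treat the
 * interrupt as a device interrupt, whatever the register says. */
static uint8_t effective_irq_stat(uint8_t irq_stat, bool command_active)
{
	if (command_active)
		irq_stat |= NV_INT_DEV;
	return irq_stat;
}

/* True when the "bail out if not our interrupt" check would drop it. */
static bool would_dismiss(uint8_t irq_stat)
{
	return !(irq_stat & NV_INT_DEV);
}
```

With these assumptions, would_dismiss(port_irq_stat(0x10, 0)) is true, while would_dismiss(effective_irq_stat(port_irq_stat(0x10, 0), true)) is false, which is the case the patch is trying to rescue.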
Re: SATA exceptions triggered by XFS (since 2.6.18)
Paolo Ornati wrote:

I don't know. It's a two years old ST380817AS.

# smartctl -a -d ata /dev/sda
smartctl version 5.36 [x86_64-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family: Seagate Barracuda 7200.7 and 7200.7 Plus family
Device Model: ST380817AS

I'll blacklist it.

Thanks.

--
tejun
Re: SATA exceptions with 2.6.20-rc5
Hello,

Chr wrote:

Ok, you won't believe this... I opened my case and rewired my drives... And guess what, my second (aka the "good") HDD is now failing! I guess, my mainboard has a (but maybe two, or three :( ) "bad" sata-port(s)!

Or, you have a power-related problem. Try to rewire the power lines or connect the hard drives to a separate power supply. It's often useful to change one component at a time and watch which change the problem follows. Anyways, you seem to be suffering from transmission failures, not a driver problem.

Thanks.

--
tejun
Re: SATA exceptions triggered by XFS (since 2.6.18)
Chr wrote:
>>   7 Seek_Error_Rate        0x000f 083 060 030 Pre-fail Always - 204305750
>>   1 Raw_Read_Error_Rate    0x000f 059 049 006 Pre-fail Always - 215927244
>> 195 Hardware_ECC_Recovered 0x001a 059 049 000 Old_age  Always - 215927244
>
> Wow! that HDD is really in a bad condition.

I don't think so, this seems to be normal for Seagate drives...

regards,
--
http://www.fi.muni.cz/~xslaby/            Jiri Slaby
faculty of informatics, masaryk university, brno, cz
e-mail: jirislaby gmail com, gpg pubkey fingerprint:
B674 9967 0407 CE62 ACC8 22A0 32CC 55C3 39D4 7A7E
Re: SATA exceptions with 2.6.20-rc5
Björn Steinbrink wrote:
On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
Björn Steinbrink wrote:

All kernels were bad using that approach. So back to square 1. :/

Björn

OK guys, here's a new patch to try against 2.6.20-rc5:

Right now when switching between ADMA mode and legacy mode (i.e. when going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just set the ADMA GO register bit appropriately and continue with no delay. It looks like in some cases the controller doesn't respond to this immediately, it takes some nanoseconds for the controller's status registers to reflect the change that was made. It's possible that if we were trying to issue commands during this time, the controller might not react properly. This patch adds some code to wait for the status register to change to the state we asked for before continuing.

Just got two exceptions with your patch, none of the debug messages were issued.

Björn

Hmm, another miss, apparently.. Has anyone tried removing these lines from nv_host_intr in 2.6.20-rc5 sata_nv.c and see what that does?

	/* bail out if not our interrupt */
	if (!(irq_stat & NV_INT_DEV))
		return 0;

--
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
Re: SATA exceptions with 2.6.20-rc5
Björn Steinbrink wrote:
On 2007.01.21 23:08:11 +0100, Björn Steinbrink wrote:
On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
Björn Steinbrink wrote:

All kernels were bad using that approach. So back to square 1. :/

Björn

OK guys, here's a new patch to try against 2.6.20-rc5:

Right now when switching between ADMA mode and legacy mode (i.e. when going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just set the ADMA GO register bit appropriately and continue with no delay. It looks like in some cases the controller doesn't respond to this immediately, it takes some nanoseconds for the controller's status registers to reflect the change that was made. It's possible that if we were trying to issue commands during this time, the controller might not react properly. This patch adds some code to wait for the status register to change to the state we asked for before continuing.

I went for the "I feel lucky" route and did just add mmio reads after the mmio writes, posting them. Rationale being that if it is a write posting issue, the debug patch would/could actually hide it AFAICT. It's the "I feel lucky" route, because my whole "knowledge" about mmio and write posting originates from the few things I read up on when you discovered the comment about write posting in the generic ata code.

Uhm, yeah, exception occurred about the time that I hit "send".

Björn

Yeah, I don't think just adding reads to flush posted writes is enough here - it seems to need more delay than that, and it also wasn't always in the idle state even before we would write the register..
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >All kernels were bad using that approach. So back to square 1. :/
> >
> >Björn
> >
>
> OK guys, here's a new patch to try against 2.6.20-rc5:
>
> Right now when switching between ADMA mode and legacy mode (i.e. when
> going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just
> set the ADMA GO register bit appropriately and continue with no delay.
> It looks like in some cases the controller doesn't respond to this
> immediately, it takes some nanoseconds for the controller's status
> registers to reflect the change that was made. It's possible that if we
> were trying to issue commands during this time, the controller might not
> react properly. This patch adds some code to wait for the status
> register to change to the state we asked for before continuing.

Just got two exceptions with your patch, none of the debug messages were issued.

Björn
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.21 23:08:11 +0100, Björn Steinbrink wrote:
> On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> > Björn Steinbrink wrote:
> > >All kernels were bad using that approach. So back to square 1. :/
> > >
> > >Björn
> > >
> >
> > OK guys, here's a new patch to try against 2.6.20-rc5:
> >
> > Right now when switching between ADMA mode and legacy mode (i.e. when
> > going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just
> > set the ADMA GO register bit appropriately and continue with no delay.
> > It looks like in some cases the controller doesn't respond to this
> > immediately, it takes some nanoseconds for the controller's status
> > registers to reflect the change that was made. It's possible that if we
> > were trying to issue commands during this time, the controller might not
> > react properly. This patch adds some code to wait for the status
> > register to change to the state we asked for before continuing.
>
> I went for the "I feel lucky" route and did just add mmio reads after the
> mmio writes, posting them. Rationale being that if it is a write posting
> issue, the debug patch would/could actually hide it AFAICT.
> It's the "I feel lucky" route, because my whole "knowledge" about mmio
> and write posting originates from the few things I read up on when you
> discovered the comment about write posting in the generic ata code.

Uhm, yeah, exception occurred about the time that I hit "send".

Björn
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >All kernels were bad using that approach. So back to square 1. :/
> >
> >Björn
> >
>
> OK guys, here's a new patch to try against 2.6.20-rc5:
>
> Right now when switching between ADMA mode and legacy mode (i.e. when
> going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just
> set the ADMA GO register bit appropriately and continue with no delay.
> It looks like in some cases the controller doesn't respond to this
> immediately, it takes some nanoseconds for the controller's status
> registers to reflect the change that was made. It's possible that if we
> were trying to issue commands during this time, the controller might not
> react properly. This patch adds some code to wait for the status
> register to change to the state we asked for before continuing.

I went for the "I feel lucky" route and did just add mmio reads after the mmio writes, posting them. Rationale being that if it is a write posting issue, the debug patch would/could actually hide it AFAICT.

It's the "I feel lucky" route, because my whole "knowledge" about mmio and write posting originates from the few things I read up on when you discovered the comment about write posting in the generic ata code.

Björn
Re: SATA exceptions triggered by XFS (since 2.6.18)
On Sunday, 21. January 2007 20:25, Paolo Ornati wrote:
> On Sun, 21 Jan 2007 11:32:02 -0600
> Robert Hancock <[EMAIL PROTECTED]> wrote:
>
> > It looks like what you're getting is an actual NCQ write timing out.
> > That makes the bisect result not very interesting since obviously it
> > wouldn't have issued any NCQ writes before NCQ support was
> > implemented. Seeing as how it's also an entirely different driver I
> > imagine it's a different problem than what I've been looking at.
> >
> > Maybe that drive just has some issues with NCQ? I would be surprised
> > at that with a Seagate though..
>
> I don't know. It's a two years old ST380817AS.
>
>
> # smartctl -a -d ata /dev/sda
>
> smartctl version 5.36 [x86_64-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
>
> === START OF INFORMATION SECTION ===
> Model Family:     Seagate Barracuda 7200.7 and 7200.7 Plus family
> Device Model:     ST380817AS
> Serial Number:    4MR08EK8
> Firmware Version: 3.42
> User Capacity:    80,026,361,856 bytes
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   6
> ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 2
> Local Time is:    Sun Jan 21 20:15:40 2007 CET
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status:  (0x82) Offline data collection activity
>                                         was completed without error.
>                                         Auto Offline Data Collection: Enabled.
> Self-test execution status:      (   0) The previous self-test routine completed
>                                         without error or no self-test has ever
>                                         been run.
> Total time to complete Offline
> data collection:                 ( 430) seconds.
> Offline data collection
> capabilities:                    (0x5b) SMART execute Offline immediate.
>                                         Auto Offline data collection on/off support.
>                                         Suspend Offline collection upon new
>                                         command.
>                                         Offline surface scan supported.
>                                         Self-test supported.
>                                         No Conveyance Self-test supported.
>                                         Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
>                                         power-saving mode.
>                                         Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
>                                         No General Purpose Logging support.
> Short self-test routine
> recommended polling time:        (   1) minutes.
> Extended self-test routine
> recommended polling time:        (  47) minutes.
>
> SMART Attributes Data Structure revision number: 10
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f 059   049   006    Pre-fail Always  -           215927244
>   3 Spin_Up_Time            0x0003 098   098   000    Pre-fail Always  -           0
>   4 Start_Stop_Count        0x0032 098   098   020    Old_age  Always  -           2182
>   5 Reallocated_Sector_Ct   0x0033 100   100   036    Pre-fail Always  -           0
>   7 Seek_Error_Rate         0x000f 083   060   030    Pre-fail Always  -           204305750
>   9 Power_On_Hours          0x0032 097   097   000    Old_age  Always  -           3494
>  10 Spin_Retry_Count        0x0013 100   100   097    Pre-fail Always  -           0
>  12 Power_Cycle_Count       0x0032 098   098   020    Old_age  Always  -           2541
> 194 Temperature_Celsius     0x0022 024   040   000    Old_age  Always  -           24 (Lifetime Min/Max 0/15)
> 195 Hardware_ECC_Recovered  0x001a 059   049   000    Old_age  Always  -           215927244
> 197 Current_Pending_Sector  0x0012 100   100   000    Old_age  Always  -           1
> 198 Offline_Uncorrectable   0x0010 100   100   000    Old_age  Offline -           1
> 199 UDMA_CRC_Error_Count    0x003e 200   200   000    Old_age  Always  -           0
> 200 Multi_Zone_Error_Rate   0x     100   253   000    Old_age  Offline -           0
> 202 TA_Increase_Count       0x0032 100   253   000    Old_age  Always  -           0
>
> SMART Error Log Version: 1
> ATA Error Count: 12 (device log contains only the most recent five errors)
> CR = Command Register [HEX]
> FR = Features Register [HEX]
> SC
Re: SATA exceptions with 2.6.20-rc5
On Sunday, 21. January 2007 19:01, Björn Steinbrink wrote:
> On 2007.01.21 18:34:40 +0100, Chr wrote:
>
> I run those two in parallel:
> while /bin/true; do ls -lR / > /dev/null 2>&1; done
> while /bin/true; do echo 255 > /proc/sys/vm/drop_caches; sleep 1; done
>
> Not sure if running them in parallel is necessary, but I don't want to
> change the test setup ;) Takes between 1 and 40 minutes to trigger it.
> Most of the time it's around 15 minutes now, doing more random stuff in
> addition to that seems to trigger it even easier (like reading mail,
> rebuilding the kernel etc.).
>
> I'm down to 2 commits after 2.6.19 now, only bad kernels, so I tend to
> say that 2.6.19 with 2.6.20-rc5's sata_nv.c will also fail for me, but I
> thought I might finish bisection just to be sure.
>
> > But, this time it looks slightly different:
> > ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> > ata3.00: tag 0 cmd 0xec Emask 0x4 stat 0x40 err 0x0 (timeout)
> >
> > [Rest of the error message + SMART error snipped]
>
> I get the same exception every time, doesn't change for me. And neither
> do I get any SMART errors or something.
>
> Thanks,
> Björn

Ok, you won't believe this... I opened my case and rewired my drives... And guess what, my second (aka the "good") HDD is now failing! I guess, my mainboard has a (but maybe two, or three :( ) "bad" sata-port(s)!

But, one small question remains: when I opened my case, I saw that my drives are plugged into SATA jacks 1 and 2... The BIOS also says they're on 1 and 2. Now, Linux says they're on ports 3 & 4! It's always ata3.00!

"ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata3.00: tag 0 cmd 0xea Emask 0x4 stat 0x40 err 0x0 (timeout)
ata3: soft resetting port
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata3.00: configured for UDMA/133
ata3: EH complete
SCSI device sda: 490234752 512-byte hdwr sectors (251000 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back"

Thanks,
Chr.
Re: SATA exceptions with 2.6.20-rc5
Björn Steinbrink wrote:

All kernels were bad using that approach. So back to square 1. :/

Björn

OK guys, here's a new patch to try against 2.6.20-rc5:

Right now when switching between ADMA mode and legacy mode (i.e. when going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just set the ADMA GO register bit appropriately and continue with no delay. It looks like in some cases the controller doesn't respond to this immediately, it takes some nanoseconds for the controller's status registers to reflect the change that was made. It's possible that if we were trying to issue commands during this time, the controller might not react properly. This patch adds some code to wait for the status register to change to the state we asked for before continuing.

--
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

--- linux-2.6.20-rc5/drivers/ata/sata_nv.c 2007-01-19 19:18:53.0 -0600
+++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c 2007-01-21 13:35:17.0 -0600
@@ -509,14 +509,38 @@ static void nv_adma_register_mode(struct
 {
 	void __iomem *mmio = nv_adma_ctl_block(ap);
 	struct nv_adma_port_priv *pp = ap->private_data;
-	u16 tmp;
+	u16 tmp, status;
+	int count = 0;
 
 	if (pp->flags & NV_ADMA_PORT_REGISTER_MODE)
 		return;
 
+	status = readw(mmio + NV_ADMA_STAT);
+	while(!(status & NV_ADMA_STAT_IDLE) && count < 20) {
+		ndelay(50);
+		status = readw(mmio + NV_ADMA_STAT);
+		count++;
+	}
+	if(count == 20)
+		ata_port_printk(ap, KERN_WARNING,
+			"timeout waiting for ADMA IDLE, stat=0x%hx\n",
+			status);
+
 	tmp = readw(mmio + NV_ADMA_CTL);
 	writew(tmp & ~NV_ADMA_CTL_GO, mmio + NV_ADMA_CTL);
 
+	count = 0;
+	status = readw(mmio + NV_ADMA_STAT);
+	while(!(status & NV_ADMA_STAT_LEGACY) && count < 20) {
+		ndelay(50);
+		status = readw(mmio + NV_ADMA_STAT);
+		count++;
+	}
+	if(count == 20)
+		ata_port_printk(ap, KERN_WARNING,
+			 "timeout waiting for ADMA LEGACY, stat=0x%hx\n",
+			 status);
+
 	pp->flags |= NV_ADMA_PORT_REGISTER_MODE;
 }
 
@@ -524,7 +548,8 @@ static void nv_adma_mode(struct ata_port
 {
 	void __iomem *mmio = nv_adma_ctl_block(ap);
 	struct nv_adma_port_priv *pp = ap->private_data;
-	u16 tmp;
+	u16 tmp, status;
+	int count = 0;
 
 	if (!(pp->flags & NV_ADMA_PORT_REGISTER_MODE))
 		return;
@@ -534,6 +559,18 @@ static void nv_adma_mode(struct ata_port
 
 	tmp = readw(mmio + NV_ADMA_CTL);
 	writew(tmp | NV_ADMA_CTL_GO, mmio + NV_ADMA_CTL);
 
+	status = readw(mmio + NV_ADMA_STAT);
+	while(((status & NV_ADMA_STAT_LEGACY) ||
+	      !(status & NV_ADMA_STAT_IDLE)) && count < 20) {
+		ndelay(50);
+		status = readw(mmio + NV_ADMA_STAT);
+		count++;
+	}
+	if(count == 20)
+		ata_port_printk(ap, KERN_WARNING,
+			"timeout waiting for ADMA LEGACY clear and IDLE, stat=0x%hx\n",
+			status);
+
 	pp->flags &= ~NV_ADMA_PORT_REGISTER_MODE;
 }
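
The wait loops in the patch above all follow the same bounded-polling pattern: read a status register, and retry a limited number of times until the wanted bits show up. The following is a minimal userspace sketch of that pattern, not the driver's code. The names poll_status, read_status_fn and fake_status are hypothetical, the register is replaced by a callback, and the delay between reads is left as a comment. One deliberate difference: the sketch returns success or failure explicitly, whereas the patch infers a timeout from count == 20, which would also warn if the status only became good on the very last retry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical callback standing in for readw(mmio + NV_ADMA_STAT). */
typedef uint16_t (*read_status_fn)(void *ctx);

/* Poll until (status & mask) == want, giving up after max_tries reads.
 * Returns true on success; *last holds the final status either way. */
static bool poll_status(read_status_fn read_status, void *ctx,
			uint16_t mask, uint16_t want,
			int max_tries, uint16_t *last)
{
	assert(max_tries > 0);
	for (int i = 0; i < max_tries; i++) {
		*last = read_status(ctx);
		if ((*last & mask) == want)
			return true;
		/* a real driver would delay here, e.g. ndelay(50) as in the patch */
	}
	return false;
}

/* Fake register for demonstration: reads 0x0000 for the first
 * *remaining calls, then 0x0001 from then on. */
static uint16_t fake_status(void *ctx)
{
	int *remaining = ctx;

	if (*remaining > 0) {
		(*remaining)--;
		return 0x0000;
	}
	return 0x0001;
}
```

For example, with an int set to 3 passed as ctx, poll_status(fake_status, &ctx, 0x0001, 0x0001, 20, &last) succeeds on the fourth read; with the int set to 50 it gives up after 20 reads and returns false, which is where the patch would print its warning.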
Re: SATA exceptions triggered by XFS (since 2.6.18)
On Sun, 21 Jan 2007 11:32:02 -0600
Robert Hancock <[EMAIL PROTECTED]> wrote:

> It looks like what you're getting is an actual NCQ write timing out.
> That makes the bisect result not very interesting since obviously it
> wouldn't have issued any NCQ writes before NCQ support was
> implemented. Seeing as how it's also an entirely different driver I
> imagine it's a different problem than what I've been looking at.
>
> Maybe that drive just has some issues with NCQ? I would be surprised
> at that with a Seagate though..

I don't know. It's a two-year-old ST380817AS.

# smartctl -a -d ata /dev/sda
smartctl version 5.36 [x86_64-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.7 and 7200.7 Plus family
Device Model:     ST380817AS
Serial Number:    4MR08EK8
Firmware Version: 3.42
User Capacity:    80,026,361,856 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   6
ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 2
Local Time is:    Sun Jan 21 20:15:40 2007 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 ( 430) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  47) minutes.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   059   049   006    Pre-fail  Always       -       215927244
  3 Spin_Up_Time            0x0003   098   098   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   098   098   020    Old_age   Always       -       2182
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   083   060   030    Pre-fail  Always       -       204305750
  9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       3494
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   098   098   020    Old_age   Always       -       2541
194 Temperature_Celsius     0x0022   024   040   000    Old_age   Always       -       24 (Lifetime Min/Max 0/15)
195 Hardware_ECC_Recovered  0x001a   059   049   000    Old_age   Always       -       215927244
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x       100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

SMART Error Log Version: 1
ATA Error Count: 12 (device log contains only the most recent five errors)
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error
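
When comparing SMART counters such as Current_Pending_Sector between test runs, the raw value of a single attribute can be pulled out of the attribute table with awk. This is only a minimal sketch: `smart_raw` is a hypothetical helper name (not part of smartmontools), and it assumes the usual whitespace-separated layout in which the first field is the ID and the tenth is RAW_VALUE.

```bash
# smart_raw ID
# Read a smartctl attribute table on stdin and print the RAW_VALUE column
# (field 10) of the attribute whose numeric ID is given.
# Hypothetical helper; assumes the standard ten-column attribute layout.
# Exits non-zero when the attribute is not present.
smart_raw() {
    awk -v id="$1" '$1 == id { print $10; found = 1 } END { exit !found }'
}
```

Typical use would be something like `smartctl -A /dev/sda | smart_raw 197`, run before and after a test so the two numbers can be compared.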
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.21 09:36:18 +0100, Björn Steinbrink wrote:
> On 2007.01.21 00:39:20 -0600, Robert Hancock wrote:
> > Björn Steinbrink wrote:
> > >On 2007.01.20 22:34:27 -0500, Jeff Garzik wrote:
> > >>Robert Hancock wrote:
> > >>>change in 2.6.20-rc is either causing or triggering this problem. It
> > >>>would be useful if you could try git bisect between 2.6.19 and
> > >>>2.6.20-rc5, keeping the latest sata_nv.c each time, and see if that
> > >>
> > >>Yes, 'git bisect' would be the next step in figuring out this puzzle.
> > >>
> > >>Anybody up for it?
> > >
> > >I'll go for it, but could I get an explanation how that could lead to a
> > >different result than my last bisection? I see the difference of keeping
> > >sata_nv.c but my brain can't wrap around it right now (woke up in the
> > >middle of the night and still not up to speed...).
> >
> > Whatever the problem is, only seems to show up when ADMA is enabled, and
> > so the patch that added ADMA support shows up as the culprit from your
> > git bisect. However, from what Chr is reporting, 2.6.19 with the ADMA
> > support added in doesn't seem to have the problem, so presumably
> > something else that changed in the 2.6.20-rc series is triggering it.
> > Doing a bisect while keeping the driver code itself the same will
> > hopefully identify what that change is..
>
> Ah, right... sata_nv.c of course interacts with the outside world, d'oh!
>
> Up to now, I only got bad kernels, latest tested being:
> 94fcda1f8ab5e0cacc381c5ca1cc9aa6ad523576
>
> Which, unless I missed a commit in the diff, only USB changes,
> continuing anyway.

All kernels were bad using that approach. So back to square 1. :/

Björn
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.21 18:34:40 +0100, Chr wrote:
> On Sunday, 21. January 2007 09:36, Björn Steinbrink wrote:
> > On 2007.01.21 00:39:20 -0600, Robert Hancock wrote:
> >
> > Ah, right... sata_nv.c of course interacts with the outside world, d'oh!
> >
> > Up to now, I only got bad kernels, latest tested being:
> > 94fcda1f8ab5e0cacc381c5ca1cc9aa6ad523576
> >
> > Which, unless I missed a commit in the diff, only USB changes,
> > continuing anyway.
> >
> > Just to make sure, here's my little helper for this bisect run, I hope
> > it does what you expected:
> >
> > #!/bin/bash
> > cp ../sata_nv.c.orig drivers/ata/sata_nv.c
> > git bisect good
> > cp drivers/ata/sata_nv.c ../sata_nv.c.orig
> > cp ../sata_nv.c drivers/ata/
> > make oldconfig
> > make -j4
> >
> > Where "../sata_nv.c" is the version from 2.6.20-rc5. The copying is done
> > to avoid conflicts and keep git happy. Of course there's also a version
> > for bad kernels ;) No idea, why I didn't make that an argument to the
> > script...
> >
> > Thanks,
> > Björn
>
> Ar, 2.6.19 (with 2.6.20-rc5 adma stuff) is affected too (BTW, what do you
> do to trigger the exceptions? Because, it takes hours to "reproduce" this
> silly *).

I run those two in parallel:

while /bin/true; do ls -lR / > /dev/null 2>&1; done
while /bin/true; do echo 255 > /proc/sys/vm/drop_caches; sleep 1; done

Not sure if running them in parallel is necessary, but I don't want to
change the test setup ;) It takes between 1 and 40 minutes to trigger it.
Most of the time it's around 15 minutes now; doing more random stuff in
addition to that seems to trigger it even more easily (like reading mail,
rebuilding the kernel etc.).

I'm down to 2 commits after 2.6.19 now, only bad kernels, so I tend to
say that 2.6.19 with 2.6.20-rc5's sata_nv.c will also fail for me, but I
thought I might finish the bisection just to be sure.

> But, this time it looks slightly different:
> ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> ata3.00: tag 0 cmd 0xec Emask 0x4 stat 0x40 err 0x0 (timeout)
> [Rest of the error message + SMART error snipped]

I get the same exception every time, it doesn't change for me. And I
don't get any SMART errors either.

Thanks,
Björn
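
Since the trigger time varies so much (1 to 40 minutes), it can help to count exception reports in the kernel log instead of watching the console. The following is only a sketch under the assumption that each report contains a line of the form `ataN.NN: exception Emask ...`, as in the logs quoted in this thread; `count_ata_exceptions` is a hypothetical helper name.

```bash
# count_ata_exceptions
# Read kernel log text on stdin and print how many libata exception
# reports ("ataN.NN: exception Emask ...") it contains.
# Hypothetical helper; the pattern is taken from the logs in this thread.
count_ata_exceptions() {
    # grep -c prints 0 but exits 1 when nothing matches; hide that status.
    grep -c 'ata[0-9][0-9]*\.[0-9][0-9]*: exception Emask' || true
}
```

It could be used as `dmesg | count_ata_exceptions` from a loop that stops the two stress loops once the count goes above zero.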
Re: SATA exceptions with 2.6.20-rc5
On Sunday, 21. January 2007 09:36, Björn Steinbrink wrote:
> On 2007.01.21 00:39:20 -0600, Robert Hancock wrote:
>
> Ah, right... sata_nv.c of course interacts with the outside world, d'oh!
>
> Up to now, I only got bad kernels, latest tested being:
> 94fcda1f8ab5e0cacc381c5ca1cc9aa6ad523576
>
> Which, unless I missed a commit in the diff, only USB changes,
> continuing anyway.
>
> Just to make sure, here's my little helper for this bisect run, I hope
> it does what you expected:
>
> #!/bin/bash
> cp ../sata_nv.c.orig drivers/ata/sata_nv.c
> git bisect good
> cp drivers/ata/sata_nv.c ../sata_nv.c.orig
> cp ../sata_nv.c drivers/ata/
> make oldconfig
> make -j4
>
> Where "../sata_nv.c" is the version from 2.6.20-rc5. The copying is done
> to avoid conflicts and keep git happy. Of course there's also a version
> for bad kernels ;) No idea, why I didn't make that an argument to the
> script...
>
> Thanks,
> Björn

Ar, 2.6.19 (with 2.6.20-rc5 adma stuff) is affected too (BTW, what do you
do to trigger the exceptions? Because, it takes hours to "reproduce" this
silly *).

But, this time it looks slightly different:

ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata3.00: tag 0 cmd 0xec Emask 0x4 stat 0x40 err 0x0 (timeout)
ata3: soft resetting port
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
!!! ata3.00: failed to IDENTIFY (INIT_DEV_PARAMS failed, err_mask=0x1)
ata3.00: revalidation failed (errno=-5)
ata3: failed to recover some devices, retrying in 5 secs !!!
ata3: hard resetting port
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata3.00: configured for UDMA/133
ata3: EH complete
SCSI device sda: 488395055 512-byte hdwr sectors (250058 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back

Oh, and I got this nice SMART Error:

ID# ATTRIBUTE_NAME          FLAG     RAW VALUE
199 UDMA_CRC_Error_Count    0x003e   ... - 12

SMART Error Log Version: 1
ATA Error Count: 1
        CR = Command Register [HEX]
        FR = Features Register [HEX]
        SC = Sector Count Register [HEX]
        SN = Sector Number Register [HEX]
        CL = Cylinder Low Register [HEX]
        CH = Cylinder High Register [HEX]
        DH = Device/Head Register [HEX]
        DC = Device Command Register [HEX]
        ER = Error register [HEX]
        ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 1 occurred at disk power-on lifetime: 5603 hours (233 days + 11 hours)
  When the command that caused the error occurred, the device was in an
  unknown state.

  After command completion occurred, registers were:
  ER ST SC SN CL CH DH
  -- -- -- -- -- -- --
  04 51 3f 00 00 00 af

  Commands leading to the command that caused the error were:
  CR FR SC SN CL CH DH DC  Powered_Up_Time  Command/Feature_Name
  -- -- -- -- -- -- -- --
  91 00 3f 00 00 00 0f 00  05:30:59.655     INITIALIZE DEVICE PARAMETERS [OBS-6]
  ec 00 01 01 00 00 00 00  05:30:59.654     IDENTIFY DEVICE
  ec 00 00 00 00 00 00 00  05:30:56.191     IDENTIFY DEVICE
  ca 00 28 02 ee 9a 0c 00  05:30:56.190     WRITE DMA
  ca 00 10 e8 4c 10 0a 00  05:30:56.190     WRITE DMA

Maybe it's really the HDD!

OT: http://www.nvidia.com/object/680i_hotfix.html

Chr.
Re: SATA exceptions triggered by XFS (since 2.6.18)
Paolo Ornati wrote:
> On Sun, 21 Jan 2007 15:29:32 +0100
> Paolo Ornati <[EMAIL PROTECTED]> wrote:
>
> > Sorry for starting a new thread, but I've deleted the messages from my
> > mail-box, and I'm not sure it's the same problem as here:
> > http://lkml.org/lkml/2007/1/14/108
> >
> > Today I've decided to try XFS... and just doing anything on it
> > (extracting a tarball, for example) makes my SATA HD go crazy ;)
> >
> > I don't remember seeing this with Ext3.
> >
> > [ 877.839920] ata1.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x2 frozen
> > [ 877.839929] ata1.00: cmd 61/02:00:64:98:98/00:00:00:00:00/40 tag 0 cdb 0x0 data 1024 out
> > [ 877.839931]          res 40/00:00:00:4f:c2/00:00:00:4f:c2/00 Emask 0x4 (timeout)
> > [ 878.142367] ata1: soft resetting port
> > [ 878.351791] ata1: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
> > [ 878.354384] ata1.00: configured for UDMA/133
> > [ 878.354392] ata1: EH complete
> > [ 878.355696] SCSI device sda: 156301488 512-byte hdwr sectors (80026 MB)
> > [ 878.355716] sda: Write Protect is off
> > [ 878.355718] sda: Mode Sense: 00 3a 00 00
> > [ 878.355745] SCSI device sda: write cache: enabled, read cache: enabled, doesn't support DPO or FUA
> >
> > It takes nothing to reproduce it.
> ..
>
> git-bisect points to this commit:
> --
> 12fad3f965830d71f6454f02b2af002a64cec4d3 is first bad commit
> commit 12fad3f965830d71f6454f02b2af002a64cec4d3
> Author: Tejun Heo <[EMAIL PROTECTED]>
> Date:   Mon May 15 21:03:55 2006 +0900
>
>     [PATCH] ahci: implement NCQ suppport
>
>     Implement NCQ support.

It looks like what you're getting is an actual NCQ write timing out.
That makes the bisect result not very interesting, since obviously it
wouldn't have issued any NCQ writes before NCQ support was implemented.
Seeing as how it's also an entirely different driver, I imagine it's a
different problem than what I've been looking at.

Maybe that drive just has some issues with NCQ? I would be surprised at
that with a Seagate though..

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/
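
One way to separate the two kinds of report seen in this thread is the SAct field of the exception line. As far as I understand the libata report, SAct is the bitmask of NCQ tags outstanding at the time, so `SAct 0x1` (Paolo's log) points at a queued command while `SAct 0x0` (the sata_nv logs) points at a non-queued one. The sketch below rests on that assumption, and `exception_kind` is a hypothetical helper name.

```bash
# exception_kind
# Read a libata "exception Emask ... SAct 0x..." line on stdin and print
# "ncq" if SAct is non-zero, "non-ncq" if it is zero.
# Hypothetical helper; assumes SAct is the mask of outstanding NCQ tags.
# Only the first matching line is considered; exits 1 if there is none.
exception_kind() {
    sact=$(sed -n 's/.*SAct \(0x[0-9a-fA-F]*\).*/\1/p' | head -n 1)
    [ -n "$sact" ] || return 1
    if [ "$(($sact))" -ne 0 ]; then
        echo ncq
    else
        echo non-ncq
    fi
}
```

For example, piping the first line of Paolo's log through `exception_kind` classifies it as a queued command, which matches the reading above that an actual NCQ write timed out.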
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.21 00:39:20 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> >On 2007.01.20 22:34:27 -0500, Jeff Garzik wrote:
> >>Robert Hancock wrote:
> >>>change in 2.6.20-rc is either causing or triggering this problem. It
> >>>would be useful if you could try git bisect between 2.6.19 and
> >>>2.6.20-rc5, keeping the latest sata_nv.c each time, and see if that
> >>
> >>Yes, 'git bisect' would be the next step in figuring out this puzzle.
> >>
> >>Anybody up for it?
> >
> >I'll go for it, but could I get an explanation how that could lead to a
> >different result than my last bisection? I see the difference of keeping
> >sata_nv.c but my brain can't wrap around it right now (woke up in the
> >middle of the night and still not up to speed...).
>
> Whatever the problem is, only seems to show up when ADMA is enabled, and
> so the patch that added ADMA support shows up as the culprit from your
> git bisect. However, from what Chr is reporting, 2.6.19 with the ADMA
> support added in doesn't seem to have the problem, so presumably
> something else that changed in the 2.6.20-rc series is triggering it.
> Doing a bisect while keeping the driver code itself the same will
> hopefully identify what that change is..

Ah, right... sata_nv.c of course interacts with the outside world, d'oh!

Up to now, I only got bad kernels, latest tested being:
94fcda1f8ab5e0cacc381c5ca1cc9aa6ad523576

Which, unless I missed a commit in the diff, only USB changes,
continuing anyway.

Just to make sure, here's my little helper for this bisect run, I hope
it does what you expected:

#!/bin/bash
cp ../sata_nv.c.orig drivers/ata/sata_nv.c
git bisect good
cp drivers/ata/sata_nv.c ../sata_nv.c.orig
cp ../sata_nv.c drivers/ata/
make oldconfig
make -j4

Where "../sata_nv.c" is the version from 2.6.20-rc5. The copying is done
to avoid conflicts and keep git happy. Of course there's also a version
for bad kernels ;) No idea, why I didn't make that an argument to the
script...

Thanks,
Björn
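
The helper above could take the verdict as an argument instead of existing in a "good" and a "bad" copy. What follows is a hypothetical variant, not the script that was actually used: the function names are made up for illustration, and the paths are the ones from the original helper.

```bash
#!/bin/bash
# Hypothetical variant of the bisect helper that takes the verdict
# ("good" or "bad") as its only argument.

valid_verdict() {
    case "$1" in
        good|bad) return 0 ;;
        *)        return 1 ;;
    esac
}

bisect_step() {
    valid_verdict "$1" || { echo "usage: $0 good|bad" >&2; return 1; }
    # Put back the driver that belongs to the current revision so the
    # tree is clean when git switches to the next one.
    cp ../sata_nv.c.orig drivers/ata/sata_nv.c || return 1
    git bisect "$1" || return 1
    # Save the new revision's driver, then drop in the 2.6.20-rc5 one.
    cp drivers/ata/sata_nv.c ../sata_nv.c.orig || return 1
    cp ../sata_nv.c drivers/ata/ || return 1
    make oldconfig && make -j4
}

# Only act when a verdict was given, e.g. "./bisect-step.sh bad".
if [ "$#" -gt 0 ]; then
    bisect_step "$1"
fi
```

Rejecting anything other than `good` or `bad` before the first `cp` means a typo cannot leave the tree with the wrong sata_nv.c in place.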
Re: SATA exceptions with 2.6.20-rc5
Björn Steinbrink wrote:
> All kernels were bad using that approach. So back to square 1. :/
>
> Björn

OK guys, here's a new patch to try against 2.6.20-rc5:

Right now when switching between ADMA mode and legacy mode (i.e. when
going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just
set the ADMA GO register bit appropriately and continue with no delay.
It looks like in some cases the controller doesn't respond to this
immediately, it takes some nanoseconds for the controller's status
registers to reflect the change that was made. It's possible that if we
were trying to issue commands during this time, the controller might not
react properly. This patch adds some code to wait for the status
register to change to the state we asked for before continuing.

-- 
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from [EMAIL PROTECTED]
Home Page: http://www.roberthancock.com/

--- linux-2.6.20-rc5/drivers/ata/sata_nv.c	2007-01-19 19:18:53.0 -0600
+++ linux-2.6.20-rc5debug/drivers/ata/sata_nv.c	2007-01-21 13:35:17.0 -0600
@@ -509,14 +509,38 @@ static void nv_adma_register_mode(struct
 {
 	void __iomem *mmio = nv_adma_ctl_block(ap);
 	struct nv_adma_port_priv *pp = ap->private_data;
-	u16 tmp;
+	u16 tmp, status;
+	int count = 0;
 
 	if (pp->flags & NV_ADMA_PORT_REGISTER_MODE)
 		return;
 
+	status = readw(mmio + NV_ADMA_STAT);
+	while(!(status & NV_ADMA_STAT_IDLE) && count < 20) {
+		ndelay(50);
+		status = readw(mmio + NV_ADMA_STAT);
+		count++;
+	}
+	if(count == 20)
+		ata_port_printk(ap, KERN_WARNING,
+			"timeout waiting for ADMA IDLE, stat=0x%hx\n",
+			status);
+
 	tmp = readw(mmio + NV_ADMA_CTL);
 	writew(tmp & ~NV_ADMA_CTL_GO, mmio + NV_ADMA_CTL);
 
+	count = 0;
+	status = readw(mmio + NV_ADMA_STAT);
+	while(!(status & NV_ADMA_STAT_LEGACY) && count < 20) {
+		ndelay(50);
+		status = readw(mmio + NV_ADMA_STAT);
+		count++;
+	}
+	if(count == 20)
+		ata_port_printk(ap, KERN_WARNING,
+			"timeout waiting for ADMA LEGACY, stat=0x%hx\n",
+			status);
+
 	pp->flags |= NV_ADMA_PORT_REGISTER_MODE;
 }
 
@@ -524,7 +548,8 @@ static void nv_adma_mode(struct ata_port
 {
 	void __iomem *mmio = nv_adma_ctl_block(ap);
 	struct nv_adma_port_priv *pp = ap->private_data;
-	u16 tmp;
+	u16 tmp, status;
+	int count = 0;
 
 	if (!(pp->flags & NV_ADMA_PORT_REGISTER_MODE))
 		return;
@@ -534,6 +559,18 @@ static void nv_adma_mode(struct ata_port
 	tmp = readw(mmio + NV_ADMA_CTL);
 	writew(tmp | NV_ADMA_CTL_GO, mmio + NV_ADMA_CTL);
 
+	status = readw(mmio + NV_ADMA_STAT);
+	while(((status & NV_ADMA_STAT_LEGACY) ||
+	      !(status & NV_ADMA_STAT_IDLE)) && count < 20) {
+		ndelay(50);
+		status = readw(mmio + NV_ADMA_STAT);
+		count++;
+	}
+	if(count == 20)
+		ata_port_printk(ap, KERN_WARNING,
+			"timeout waiting for ADMA LEGACY clear and IDLE, stat=0x%hx\n",
+			status);
+
 	pp->flags &= ~NV_ADMA_PORT_REGISTER_MODE;
 }
 
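
The wait loops in the patch follow a common pattern: read the status register, and re-read it a bounded number of times until the wanted bits show up. A minimal user-space sketch of that pattern follows. It is not the driver code: `fake_port`, `fake_readw`, `poll_status` and the `FAKE_STAT_*` bit values are made up for illustration, and the simulated register simply settles after a configurable number of reads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical status bits; the real NV_ADMA_STAT_* values live in sata_nv.c. */
#define FAKE_STAT_IDLE   0x0001u
#define FAKE_STAT_LEGACY 0x0002u

/* Simulated device: reports `pending` for `reads_left` reads, then `settled`. */
struct fake_port {
	uint16_t pending;
	uint16_t settled;
	int reads_left;
	int reads_done;
};

/* Stand-in for readw(mmio + NV_ADMA_STAT) in the patch. */
static uint16_t fake_readw(struct fake_port *p)
{
	p->reads_done++;
	if (p->reads_left > 0) {
		p->reads_left--;
		return p->pending;
	}
	return p->settled;
}

/*
 * Poll until (status & mask) == want, or until max_tries re-reads were made.
 * Returns true on success; *last receives the final status that was read.
 */
static bool poll_status(struct fake_port *p, uint16_t mask, uint16_t want,
			int max_tries, uint16_t *last)
{
	int count = 0;
	uint16_t status;

	assert(max_tries >= 0);
	status = fake_readw(p);
	while ((status & mask) != want && count < max_tries) {
		/* the patch calls ndelay(50) at this point */
		status = fake_readw(p);
		count++;
	}
	*last = status;
	return (status & mask) == want;
}
```

With `mask = FAKE_STAT_IDLE` and `want = FAKE_STAT_IDLE` this corresponds to the first loop in the patch; `mask = IDLE | LEGACY` with `want = IDLE` corresponds to the "LEGACY clear and IDLE" loop. One design difference: the sketch reports a timeout based on the final status rather than on `count == 20`, so a change that lands exactly on the last re-read is not reported as a timeout.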
Re: SATA exceptions with 2.6.20-rc5
On Sunday, 21. January 2007 19:01, Björn Steinbrink wrote:
> On 2007.01.21 18:34:40 +0100, Chr wrote:
> I run those two in parallel:
>
> while /bin/true; do ls -lR / > /dev/null 2>&1; done
> while /bin/true; do echo 255 > /proc/sys/vm/drop_caches; sleep 1; done
>
> Not sure if running them in parallel is necessary, but I don't want to
> change the test setup ;) Takes between 1 and 40 minutes to trigger it.
> Most of the time it's around 15 minutes now, doing more random stuff in
> addition to that seems to trigger it even easier (like reading mail,
> rebuilding the kernel etc.).
>
> I'm down to 2 commits after 2.6.19 now, only bad kernels, so I tend to
> say that 2.6.19 with 2.6.20-rc5's sata_nv.c will also fail for me, but
> I thought I might finish bisection just to be sure.
>
> But, this time it looks slightly different:
>
> ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
> ata3.00: tag 0 cmd 0xec Emask 0x4 stat 0x40 err 0x0 (timeout)
> [Rest of the error message + SMART error snipped]
>
> I get the same exception every time, doesn't change for me. And neither
> do I get any SMART errors or something.
>
> Thanks,
> Björn

Ok, you won't believe this... I opened my case and rewired my drives...
And guess what, my second (aka the good) HDD is now failing! I guess my
mainboard has one (but maybe two, or three :( ) bad SATA port(s)!

But one small question remains: when I opened my case, I saw that my
drives are plugged into SATA jacks 1 and 2... The BIOS also says they're
on 1 and 2. Now, Linux says they're on ports 3 & 4! It's always ata3.00!

ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x2 frozen
ata3.00: tag 0 cmd 0xea Emask 0x4 stat 0x40 err 0x0 (timeout)
ata3: soft resetting port
ata3: SATA link up 1.5 Gbps (SStatus 113 SControl 300)
ata3.00: configured for UDMA/133
ata3: EH complete
SCSI device sda: 490234752 512-byte hdwr sectors (251000 MB)
sda: Write Protect is off
sda: Mode Sense: 00 3a 00 00
SCSI device sda: drive cache: write back

Thanks,
Chr.
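
For readers decoding the exception lines in this message: the `cmd` and `stat` fields are the ATA command opcode and the ATA status register. The sketch below is only an illustration of how those two fields can be read, not code from libata. The helper names `ata_cmd_name` and `ata_status_names` are invented here, and the opcode table covers just the flush and identify commands that come up in this thread.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* ATA status register bits. */
#define ATA_STAT_BSY  0x80u	/* device busy */
#define ATA_STAT_DRDY 0x40u	/* device ready */
#define ATA_STAT_DF   0x20u	/* device fault */
#define ATA_STAT_DRQ  0x08u	/* data request */
#define ATA_STAT_ERR  0x01u	/* error, see error register */

/* Name a few opcodes relevant to this thread (hypothetical helper). */
static const char *ata_cmd_name(uint8_t cmd)
{
	switch (cmd) {
	case 0xe7: return "FLUSH CACHE";
	case 0xea: return "FLUSH CACHE EXT";
	case 0xec: return "IDENTIFY DEVICE";
	default:   return "not in this sketch";
	}
}

/* Write the names of the set status bits into buf, space separated. */
static void ata_status_names(uint8_t stat, char *buf, size_t len)
{
	static const struct { uint8_t bit; const char *name; } bits[] = {
		{ ATA_STAT_BSY, "BSY" }, { ATA_STAT_DRDY, "DRDY" },
		{ ATA_STAT_DF, "DF" }, { ATA_STAT_DRQ, "DRQ" },
		{ ATA_STAT_ERR, "ERR" },
	};
	size_t used = 0;

	assert(len > 0);
	buf[0] = '\0';
	for (size_t i = 0; i < sizeof(bits) / sizeof(bits[0]); i++) {
		if (!(stat & bits[i].bit))
			continue;
		int n = snprintf(buf + used, len - used, "%s%s",
				 used ? " " : "", bits[i].name);
		if (n < 0 || (size_t)n >= len - used)
			break;	/* out of room: stop, output may be cut short */
		used += (size_t)n;
	}
}
```

Read this way, `cmd 0xea ... stat 0x40` is a FLUSH CACHE EXT that timed out while the status register showed only DRDY (not busy, no error bit), and the earlier quoted `cmd 0xec` is an IDENTIFY DEVICE. That the failing command here is a cache flush fits the ADMA/legacy mode switch discussed elsewhere in the thread, though the log alone does not prove the cause.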
- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to [EMAIL PROTECTED] More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: SATA exceptions triggered by XFS (since 2.6.18)
On Sunday, 21. January 2007 20:25, Paolo Ornati wrote:
> On Sun, 21 Jan 2007 11:32:02 -0600
> Robert Hancock [EMAIL PROTECTED] wrote:
>
> > It looks like what you're getting is an actual NCQ write timing out.
> > That makes the bisect result not very interesting since obviously it
> > wouldn't have issued any NCQ writes before NCQ support was implemented.
> >
> > Seeing as how it's also an entirely different driver I imagine it's a
> > different problem than what I've been looking at.
> >
> > Maybe that drive just has some issues with NCQ? I would be surprised at
> > that with a Seagate though..
>
> I don't know. It's a two years old ST380817AS.
>
> # smartctl -a -d ata /dev/sda
> smartctl version 5.36 [x86_64-pc-linux-gnu] Copyright (C) 2002-6 Bruce Allen
> Home page is http://smartmontools.sourceforge.net/
>
> === START OF INFORMATION SECTION ===
> Model Family:     Seagate Barracuda 7200.7 and 7200.7 Plus family
> Device Model:     ST380817AS
> Serial Number:    4MR08EK8
> Firmware Version: 3.42
> User Capacity:    80,026,361,856 bytes
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   6
> ATA Standard is:  ATA/ATAPI-6 T13 1410D revision 2
> Local Time is:    Sun Jan 21 20:15:40 2007 CET
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
>
> General SMART Values:
> Offline data collection status:  (0x82) Offline data collection activity
>                                         was completed without error.
>                                         Auto Offline Data Collection: Enabled.
> Self-test execution status:      (   0) The previous self-test routine completed
>                                         without error or no self-test has ever
>                                         been run.
> Total time to complete Offline
> data collection:                 ( 430) seconds.
> Offline data collection
> capabilities:                    (0x5b) SMART execute Offline immediate.
>                                         Auto Offline data collection on/off support.
>                                         Suspend Offline collection upon new
>                                         command.
>                                         Offline surface scan supported.
>                                         Self-test supported.
>                                         No Conveyance Self-test supported.
>                                         Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
>                                         power-saving mode.
>                                         Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
>                                         No General Purpose Logging support.
> Short self-test routine
> recommended polling time:        (   1) minutes.
> Extended self-test routine
> recommended polling time:        (  47) minutes.
>
> SMART Attributes Data Structure revision number: 10
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000f   059   049   006    Pre-fail  Always       -       215927244
>   3 Spin_Up_Time            0x0003   098   098   000    Pre-fail  Always       -       0
>   4 Start_Stop_Count        0x0032   098   098   020    Old_age   Always       -       2182
>   5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x000f   083   060   030    Pre-fail  Always       -       204305750
>   9 Power_On_Hours          0x0032   097   097   000    Old_age   Always       -       3494
>  10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
>  12 Power_Cycle_Count       0x0032   098   098   020    Old_age   Always       -       2541
> 194 Temperature_Celsius     0x0022   024   040   000    Old_age   Always       -       24 (Lifetime Min/Max 0/15)
> 195 Hardware_ECC_Recovered  0x001a   059   049   000    Old_age   Always       -       215927244
> 197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       1
> 198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       1
> 199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
> 200 Multi_Zone_Error_Rate   0x       100   253   000    Old_age   Offline      -       0
> 202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0
>
> SMART Error Log Version: 1
> ATA Error Count: 12 (device log contains only the most recent five errors)
>         CR = Command Register [HEX]
>         FR = Features Register [HEX]
>         SC = Sector Count Register [HEX]
>         SN = Sector Number Register [HEX]
>         CL = Cylinder Low Register [HEX]
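
As a reading aid for the attribute table above: smartctl treats an attribute as failing when its normalized VALUE has dropped to or below THRESH, and WORST records the lowest normalized value seen so far. The following sketch applies that rule to a few rows copied from the table. The struct and function names are made up for illustration, and treating a threshold of 0 as "never fails" is a simplifying assumption of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct smart_attr {
	uint8_t id;
	const char *name;
	uint8_t value;	/* normalized current value */
	uint8_t worst;	/* lowest normalized value recorded */
	uint8_t thresh;	/* failure threshold */
};

/* A few rows copied from the smartctl table in this message. */
static const struct smart_attr sample[] = {
	{   1, "Raw_Read_Error_Rate",     59,  49,  6 },
	{   5, "Reallocated_Sector_Ct",  100, 100, 36 },
	{   7, "Seek_Error_Rate",         83,  60, 30 },
	{  10, "Spin_Retry_Count",       100, 100, 97 },
	{ 197, "Current_Pending_Sector", 100, 100,  0 },
};
#define SAMPLE_COUNT (sizeof(sample) / sizeof(sample[0]))

/* Failing now: normalized value at or below a nonzero threshold. */
static bool smart_failing_now(const struct smart_attr *a)
{
	return a->thresh != 0 && a->value <= a->thresh;
}

/* Failed at some point: worst value at or below a nonzero threshold. */
static bool smart_failed_in_past(const struct smart_attr *a)
{
	return a->thresh != 0 && a->worst <= a->thresh;
}

/* Index of the row with the least headroom above its threshold, or -1. */
static int smart_tightest(const struct smart_attr *attrs, size_t n)
{
	int best = -1;
	int best_margin = 0;

	for (size_t i = 0; i < n; i++) {
		if (attrs[i].thresh == 0)
			continue;	/* no usable threshold for this row */
		int margin = (int)attrs[i].value - (int)attrs[i].thresh;
		if (best < 0 || margin < best_margin) {
			best = (int)i;
			best_margin = margin;
		}
	}
	return best;
}
```

By this rule none of the sampled rows is failing, which agrees with the PASSED self-assessment. The normalized values do not tell the whole story, though: Current_Pending_Sector and Offline_Uncorrectable both show a raw count of 1 while their normalized values are still 100.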
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> Björn Steinbrink wrote:
> > All kernels were bad using that approach. So back to square 1. :/
> >
> > Björn
>
> OK guys, here's a new patch to try against 2.6.20-rc5:
>
> Right now when switching between ADMA mode and legacy mode (i.e. when
> going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just
> set the ADMA GO register bit appropriately and continue with no delay.
> It looks like in some cases the controller doesn't respond to this
> immediately, it takes some nanoseconds for the controller's status
> registers to reflect the change that was made. It's possible that if we
> were trying to issue commands during this time, the controller might not
> react properly. This patch adds some code to wait for the status
> register to change to the state we asked for before continuing.

I went for the "I feel lucky" route and just added MMIO reads after the
MMIO writes, so that the posted writes get flushed. The rationale is that
if it is a write posting issue, the debug patch would/could actually hide
it AFAICT. It's the "I feel lucky" route because my whole knowledge about
MMIO and write posting originates from the few things I read up on when
you discovered the comment about write posting in the generic ATA code.

Björn
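
To make the write posting idea concrete: a write to a device register may sit in a bridge buffer for a while, and a read from the same device cannot overtake it, so reading back forces the write to arrive first. The toy model below illustrates only that ordering rule in user space. `toy_bridge`, `toy_writew` and `toy_readw` are invented names, and the one-entry buffer is a deliberate simplification, not how any real bridge is built.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model: one device register behind a bridge with a one-entry write buffer. */
struct toy_bridge {
	uint16_t device_reg;	/* what the device has actually latched */
	uint16_t posted_val;	/* write still sitting in the bridge */
	bool posted;		/* true while posted_val is undelivered */
};

/* Deliver a buffered write to the device, if there is one. */
static void toy_deliver(struct toy_bridge *b)
{
	if (b->posted) {
		b->device_reg = b->posted_val;
		b->posted = false;
	}
}

/* A write returns immediately; the device may not have seen it yet. */
static void toy_writew(struct toy_bridge *b, uint16_t val)
{
	toy_deliver(b);		/* one-entry buffer: older write goes out first */
	b->posted_val = val;
	b->posted = true;
}

/* A read cannot pass the buffered write, so it flushes it as a side effect. */
static uint16_t toy_readw(struct toy_bridge *b)
{
	toy_deliver(b);
	return b->device_reg;
}
```

In this model, a `toy_writew()` followed directly by more work leaves the device with the old register value, while a `toy_writew()` followed by a `toy_readw()` guarantees the device has the new value before the caller continues. This is presumably also why a debug patch that polls a status register after each write could mask a posting problem: its reads would flush the write as a side effect.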
Re: SATA exceptions with 2.6.20-rc5
On 2007.01.21 23:08:11 +0100, Björn Steinbrink wrote:
> On 2007.01.21 13:58:01 -0600, Robert Hancock wrote:
> > Björn Steinbrink wrote:
> > > All kernels were bad using that approach. So back to square 1. :/
> > >
> > > Björn
> >
> > OK guys, here's a new patch to try against 2.6.20-rc5:
> >
> > Right now when switching between ADMA mode and legacy mode (i.e. when
> > going from doing normal DMA reads/writes to doing a FLUSH CACHE) we just
> > set the ADMA GO register bit appropriately and continue with no delay.
> > It looks like in some cases the controller doesn't respond to this
> > immediately, it takes some nanoseconds for the controller's status
> > registers to reflect the change that was made. It's possible that if we
> > were trying to issue commands during this time, the controller might not
> > react properly. This patch adds some code to wait for the status
> > register to change to the state we asked for before continuing.
>
> I went for the "I feel lucky" route and did just add mmio reads after
> the mmio writes, posting them. Rationale being that if it is a write
> posting issue, the debug patch would/could actually hide it AFAICT.
> It's the "I feel lucky" route, because my whole knowledge about mmio and
> write posting originates from the few things I read up on when you
> discovered the comment about write posting in the generic ata code.

Uhm, yeah, the exception occurred about the time that I hit send.

Björn