Chris,

All agreed. Further comment inlined:

(Should have mentioned more prominently that the hardware problem has
been worked around by limiting the sata to 3Gbit/s on bootup.)


On 28/09/13 21:51, Chris Murphy wrote:
>
> On Sep 28, 2013, at 1:26 PM, Martin <m_bt...@ml1.co.uk> wrote:
>
>> Writing data via rsync at the 6Gbit/s sata rate caused IO errors
>> for just THREE sectors...
>>
>> Yet btrfsck bombs out with LOTs of errors…
>
> Any fs will bomb out on write errors.

Indeed. However, are not the sata errors reported back to btrfs so that
it knows whatever parts haven't been updated?

Is there not a mechanism to then go "read-only"?

Also, should not the journal limit the damage?


>> How best to recover from this?
>
> Why you're getting I/O errors at SATA 6Gbps link speed needs to be
> understood. Is it a bad cable? Bad SATA port? Drive or controller
> firmware bug? Or libata driver bug?

I systematically eliminated such as leads, PSU, and NCQ. Limiting libata
to only use 3Gbit/s is the one change that gives a consistent fix. The
HDD and motherboard both support 6Gbit/s, but hey-ho, that's an
experiment I can try again some other time when I have another HDD/SSD
to test in there.

In any case, for the existing HDD - motherboard combination, using sata2
rather than sata3 speeds shouldn't noticeably impact performance. (Other
than sata2 works reliably and so is infinitely better for this case!)


>> Lots of sata error noise omitted.
>
> And entire dmesg might still be useful. I don't know if the list will
> handle the whole dmesg in one email, but it's worth a shot (reply to
> an email in the thread, don't change the subject).

I can email directly if of use/interest. Let me know offlist.


> do a smartctl -x on the drive, chances are it's recording PHY Event

(smartctl -x errors shown further down...)

Nothing untoward noticed:

# smartctl -a /dev/sdc

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (AF, SATA 6Gb/s)
Device Model:     WDC WD20EARX-00PASB0
Serial Number:    WD-...
LU WWN Device Id: ...
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes [2.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Sep 28 23:35:57 2013 BST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

[...]

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       9
  3 Spin_Up_Time            0x0027   253   159   021    Pre-fail  Always       -       1983
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       55
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       800
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       53
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       31
193 Load_Cycle_Count        0x0032   199   199   000    Old_age   Always       -       3115
194 Temperature_Celsius     0x0022   118   110   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0


# smartctl -x /dev/sdc

... also shows the errors it saw:

(Just the last 4 copied which look timed for when the HDD was last
exposed to 6Gbit/s sata)

Error 46 [21] occurred at disk power-on lifetime: 755 hours (31 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  01 -- 51 00 08 00 00 6c 1a 4b b0 e0 00  Error: AMNF 8 sectors at LBA = 0x6c1a4bb0 = 1813662640

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  25 00 00 00 08 00 00 6c 1a 4b b0 e0 08     10:51:07.192  READ DMA EXT
  35 00 00 00 08 00 00 6c 1a 4b a8 e0 08     10:51:07.192  WRITE DMA EXT
  25 00 00 00 08 00 00 6c 1a 4b a8 e0 08     10:51:07.192  READ DMA EXT
  35 00 00 00 08 00 00 6c 1a 4b a8 e0 08     10:51:07.192  WRITE DMA EXT
  25 00 00 00 08 00 00 6c 1a 4b a8 e0 08     10:51:07.157  READ DMA EXT

Error 45 [20] occurred at disk power-on lifetime: 755 hours (31 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  01 -- 51 04 00 00 00 6c 1a 4b b0 e0 00  Error: AMNF 1024 sectors at LBA = 0x6c1a4bb0 = 1813662640

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  25 00 00 04 00 00 00 6c 1a 48 00 e0 08     10:51:03.450  READ DMA EXT
  ef 00 10 00 02 00 00 00 00 00 00 a0 08     10:51:03.449  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 00 00 00 00 00 e0 08     10:51:03.449  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 00 00 00 00 00 a0 08     10:51:03.446  IDENTIFY DEVICE
  ef 00 03 00 46 00 00 00 00 00 00 a0 08     10:51:03.446  SET FEATURES [Set transfer mode]

Error 44 [19] occurred at disk power-on lifetime: 755 hours (31 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  01 -- 51 04 00 00 00 6c 1a 4b b0 e0 00  Error: AMNF 1024 sectors at LBA = 0x6c1a4bb0 = 1813662640

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  25 00 00 04 00 00 00 6c 1a 48 00 e0 08     10:51:00.453  READ DMA EXT
  ef 00 10 00 02 00 00 00 00 00 00 a0 08     10:51:00.452  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 00 00 00 00 00 e0 08     10:51:00.452  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 00 00 00 00 00 a0 08     10:51:00.449  IDENTIFY DEVICE
  ef 00 03 00 46 00 00 00 00 00 00 a0 08     10:51:00.449  SET FEATURES [Set transfer mode]

Error 43 [18] occurred at disk power-on lifetime: 755 hours (31 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  01 -- 51 04 00 00 00 6c 1a 4b b0 e0 00  Error: AMNF 1024 sectors at LBA = 0x6c1a4bb0 = 1813662640

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  25 00 00 04 00 00 00 6c 1a 48 00 e0 08     10:50:57.455  READ DMA EXT
  ef 00 10 00 02 00 00 00 00 00 00 a0 08     10:50:57.455  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 00 00 00 00 00 e0 08     10:50:57.455  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 00 00 00 00 00 a0 08     10:50:57.452  IDENTIFY DEVICE
  ef 00 03 00 46 00 00 00 00 00 00 a0 08     10:50:57.452  SET FEATURES [Set transfer mode]

Error 42 [17] occurred at disk power-on lifetime: 755 hours (31 days + 11 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  01 -- 51 04 00 00 00 6c 1a 4b b0 e0 00  Error: AMNF 1024 sectors at LBA = 0x6c1a4bb0 = 1813662640

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  25 00 00 04 00 00 00 6c 1a 48 00 e0 08     10:50:54.459  READ DMA EXT
  ef 00 10 00 02 00 00 00 00 00 00 a0 08     10:50:54.458  SET FEATURES [Enable SATA feature]
  27 00 00 00 00 00 00 00 00 00 00 e0 08     10:50:54.458  READ NATIVE MAX ADDRESS EXT [OBS-ACS-3]
  ec 00 00 00 00 00 00 00 00 00 00 a0 08     10:50:54.455  IDENTIFY DEVICE
  ef 00 03 00 46 00 00 00 00 00 00 a0 08     10:50:54.455  SET FEATURES [Set transfer mode]


>> Running btrfsck twice gives the same result, giving a failure
>> with:
>
> Well honestly at this point I expect file system corruption as it's
> entirely possible that before the hardware dropped the link speed
> down to SATA 3Gbps, there was corrupt data already sent to the drive
> and that's not something Btrfs can know about until trying to read
> the data back in. So *shrug* - I don't see Btrfs as a way to totally
> mitigate hardware problems. It's the same problem with bad RAM, and
> Btrfs doesn't like that either.

Indeed. Hence trapping 'unexpectedness' where reasonable to then go
read-only...

(I guess a hard compromise though whilst still debugging bugs
'unexpectedness'! But still good to have in mind. ;-) )

Regards,
Martin
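
For anyone wanting to cross-check the "registers were" rows in the smartctl -x error log against the LBA that smartctl prints, here is a minimal Python sketch. It assumes the column layout shown in that output (ER, a placeholder, ST, two COUNT bytes, six LBA_48 bytes, then DV and DC, most significant byte first) and the 512/4096 byte sector sizes reported by smartctl -a. The function and constant names are made up for illustration and are not part of smartmontools.

```python
LOGICAL_SECTOR = 512    # bytes, from "Sector Sizes" in the smartctl -a output
PHYSICAL_SECTOR = 4096  # bytes, from the same line


def decode_error_row(row: str) -> dict:
    """Decode one 'registers were' row of the smartctl -x error log.

    Assumed column layout (hypothetical helper, not a smartmontools API):
    ER -- ST | COUNT (2 bytes) | LBA_48 (6 bytes) | DV DC,
    with the most significant byte printed first.
    """
    fields = row.split()
    if len(fields) != 13:
        raise ValueError(f"expected 13 columns, got {len(fields)}")
    count = int("".join(fields[3:5]), 16)   # sector count, 16 bits
    lba = int("".join(fields[5:11]), 16)    # 48-bit logical block address
    byte_offset = lba * LOGICAL_SECTOR
    return {
        "error": int(fields[0], 16),
        "status": int(fields[2], 16),
        "count": count,
        "lba": lba,
        "byte_offset": byte_offset,
        # True when the LBA starts exactly on a 4096-byte physical sector
        "aligned_4k": byte_offset % PHYSICAL_SECTOR == 0,
    }


if __name__ == "__main__":
    # Register row copied from Error 46 in the log above.
    print(decode_error_row("01 -- 51 00 08 00 00 6c 1a 4b b0 e0 00"))
```

For the Error 46 row this gives LBA 1813662640 with a count of 8 sectors, which agrees with smartctl's own "AMNF 8 sectors at LBA = 0x6c1a4bb0" decode.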