On 12/21/2014 11:34 AM, constantine wrote:
Some months ago I had 6 uncorrectable errors. I deleted the files that
contained them and then after scrubbing I had 0 uncorrectable errors.
After some weeks I encountered new uncorrectable errors.
Question 1:
Why do I have uncorrectable errors on a RAID-1 filesystem in the first place?
These are disk/platter/hardware errors. They happen for one of two
reasons. (most likely) There is a flaw, new or existing, on the platter
itself and data just cannot live in that spot. (less likely) You
suffered an environmental hazard (a hard jolt) while a sector was being
written and the drive is just choking on the digital wreckage.
Question 2:
How do I properly correct them? (Again by deleting their files? :( )
You have to _force_ the system to write the sector. If the disk can
correct the sector (not a hardware flaw), the problem goes away forever.
If it can't, the drive will remap the sector to a spare sector and the
problem will seem to go away forever.
Here is a decent tutorial:
http://smartmontools.sourceforge.net/badblockhowto.html
Which variant of the procedure you need will vary by hardware, so read
the whole thing.
_BUT_ on my system I had to use hdparm to write the sectors instead of
just using dd. Math is involved to find the LBA, and you have to use the
"yes I really know what I am doing" option to force the write at the low
level.
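
One piece of that math, sketched along the lines of the linked tutorial, is mapping an absolute LBA reported by smartctl to a block inside the partition that holds it. This is only a sketch: it assumes 512-byte logical sectors, and the partition start (2048) and block size (4096) are hypothetical values you would replace with what fdisk and your filesystem actually report.

```python
def lba_to_fs_block(lba, part_start, sector_size=512, block_size=4096):
    """Map an absolute disk LBA to (block index, byte offset in that block),
    both relative to the start of the partition containing the LBA."""
    if lba < part_start:
        raise ValueError("LBA lies before the start of the partition")
    # Distance from the partition start, converted from sectors to bytes.
    byte_offset = (lba - part_start) * sector_size
    return byte_offset // block_size, byte_offset % block_size


# part_start=2048 is a hypothetical value; read the real one from fdisk -l.
block, offset = lba_to_fs_block(19530182, part_start=2048)
print(block, offset)  # prints 2441016 3072
```

The LBA you hand to the low-level write is the absolute one from the self-test log; the partition-relative block is what you need when you want to find out which file lived there.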
[Quick version: run smartctl --test=long (or a selective range test if
you know the range). The test will stop on the first read error. Force-write
the "LBA of first error" block with hdparm, or use sg_reassign (the SCSI
route in the tutorial). Repeat until the long test reads the entire
drive.]
My current smartctl --all /dev/sda shows that recent remapping exercise.
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%     36605         -
# 2  Selective offline   Completed without error       00%     36603         -
# 3  Selective offline   Aborted by host               90%     36603         -
# 4  Selective offline   Completed without error       00%     36603         -
# 5  Selective offline   Completed: read failure       90%     36603         19530186
# 6  Selective offline   Completed: read failure       90%     36603         19530182
# 7  Extended offline    Completed: read failure       90%     36602         19530182
# 8  Extended offline    Completed: read failure       90%     36602         19530182
# 9  Extended offline    Completed: read failure       90%     36592         19530182
#10  Extended offline    Completed: read failure       90%     36094         19530182
#11  Extended offline    Completed without error       00%      4222         -
6 of 6 failed self-tests are outdated by newer successful extended offline self-test # 1
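
If you would rather not pick the failing LBAs out of that log by eye, something like the following could do it. It is a minimal sketch that assumes the log text has one entry per line, as smartctl prints it before a mail client wraps it; the function name and the sample rows (copied from the log above) are only for illustration.

```python
import re

# One log row: "#<num> <description and status> <remaining>% <hours> <lba or ->"
ROW = re.compile(r"^#\s*(\d+)\s+(.*?)\s+(\d+)%\s+(\d+)\s+(\S+)\s*$")


def failing_lbas(selftest_log):
    """Return the distinct LBA_of_first_error values, newest entry first."""
    seen = []
    for line in selftest_log.splitlines():
        m = ROW.match(line.strip())
        if not m:
            continue  # header, summary, or wrapped line
        lba = m.group(5)
        # Successful and aborted tests show "-" instead of a number.
        if lba.isdigit() and int(lba) not in seen:
            seen.append(int(lba))
    return seen


SAMPLE = """\
# 1  Extended offline    Completed without error       00%     36605         -
# 5  Selective offline   Completed: read failure       90%     36603         19530186
# 6  Selective offline   Completed: read failure       90%     36603         19530182
# 7  Extended offline    Completed: read failure       90%     36602         19530182
"""
print(failing_lbas(SAMPLE))  # prints [19530186, 19530182]
```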
The good news is that since you are using RAID1 and checksums you
shouldn't need to delete any files. Just coerce the write and then btrfs
scrub your filesystem; the checksum verification should detect the
degraded copy and rewrite it from the good copy in the mirror.
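
The idea behind that recovery can be shown with a toy model. This is not how btrfs is implemented: it uses Python's zlib.crc32 as a stand-in checksum, keeps the "mirror" as a plain list of byte strings, and scrub_block is a made-up name. It only illustrates why one good copy plus a stored checksum is enough to repair the other copy, and why a block is "uncorrectable" when no copy matches.

```python
import zlib


def scrub_block(copies, expected_crc):
    """Rewrite every copy that fails the checksum from one that passes.

    Returns the number of copies rewritten; raises if no copy is good."""
    good = next((c for c in copies if zlib.crc32(c) == expected_crc), None)
    if good is None:
        raise RuntimeError("uncorrectable: no copy matches the checksum")
    repaired = 0
    for i, c in enumerate(copies):
        if zlib.crc32(c) != expected_crc:
            copies[i] = good  # overwrite the bad copy with the good one
            repaired += 1
    return repaired


data = b"important file contents"
crc = zlib.crc32(data)
# The second copy stands for a sector that was forcibly overwritten with zeros.
mirror = [data, b"\x00" * len(data)]
print(scrub_block(mirror, crc))  # prints 1
print(mirror[0] == mirror[1])    # prints True
```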
Question 3:
How do I prevent this from happening?
If the disk only shows an error or two it's probably still in the normal
range. If you have to spare out a lot of sectors then your disk may be
reaching end-of-life and so likely needs replacing.
ALL DISKS FAIL EVENTUALLY so you don't "prevent it from happening". You
use RAID1 (etc.) and backups to prevent data loss, and you periodically
run the tests and check the output so a failing disk doesn't catch you
by surprise.
That is, you can't prevent eventual disk loss; your job is to prevent
data loss. So good on you for the RAID1.
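
For the "check the output" part, a small helper along these lines could pull the sector-sparing counters out of the smartctl -A attribute table. It is a sketch under assumptions: the three attribute names are the ones that appear in the listings further down, the table is expected one row per line (unwrapped), and what counts as "a lot" of spared sectors remains your judgment call. The sample rows are copied from the third disk in those listings.

```python
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def watched_raw_values(attr_table):
    """Map each watched attribute name to the leading integer of RAW_VALUE."""
    found = {}
    for line in attr_table.splitlines():
        fields = line.split()
        # A data row has ID, name, flag, value, worst, thresh, type,
        # updated, when_failed, and at least one raw-value token.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            if fields[9].isdigit():
                found[fields[1]] = int(fields[9])
    return found


SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       144
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""
values = watched_raw_values(SAMPLE)
print({name: raw for name, raw in values.items() if raw > 0})
# prints {'Reallocated_Sector_Ct': 144}
```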
Thanks a lot!
constantine
PS.
The disks can be considered old (some with > 15000 hrs online), but
SMART long tests complete without errors. I have this filesystem:
I don't see the smart test results in any of these blocks. Are you sure
you are looking at the correct part of the results? You should have been
showing us the table after the heading "SMART Self-test log structure
revision number 1" if you are trying to show us tests completing without
errors.
See the smartctl --all and/or --xall output, i.e. _lower_ _case_ "a" or "x",
not upper case "A" ("attributes"); the test results will be near the bottom.
The "attribute" section is interesting but not dispositive of recent
test results. It only shows non-test event counters.
# btrfs fi show /mnt/thefilesystem
Label: 'thefilesystem' uuid: 1d1d0850-d1bc-4c76-96a1-17d168ff2431
Total devices 5 FS bytes used 6.11TiB
devid 1 size 2.73TiB used 2.63TiB path /dev/sda1
devid 2 size 3.64TiB used 3.54TiB path /dev/sdg1
devid 3 size 1.82TiB used 1.72TiB path /dev/sdd1
devid 4 size 1.82TiB used 1.72TiB path /dev/sdc1
devid 5 size 2.73TiB used 2.63TiB path /dev/sdh1
Btrfs v3.17.3
# btrfs fi df /mnt/thefilesystem
Data, RAID1: total=6.10TiB, used=6.10TiB
System, RAID1: total=32.00MiB, used=896.00KiB
Metadata, RAID1: total=10.00GiB, used=8.98GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
===================
SMART information from each of the disks:
# for i in a g d c h ; do smartctl -A /dev/sd$i; done
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.16.7-1-bfs] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   177   175   021    Pre-fail  Always       -       6108
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       201
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       5836
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       185
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       118
193 Load_Cycle_Count        0x0032   189   189   000    Old_age   Always       -       33154
194 Temperature_Celsius     0x0022   114   098   000    Old_age   Always       -       36
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.16.7-1-bfs] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   179   175   021    Pre-fail  Always       -       8050
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       141
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   094   094   000    Old_age   Always       -       4842
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       140
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       91
193 Load_Cycle_Count        0x0032   194   194   000    Old_age   Always       -       18614
194 Temperature_Celsius     0x0022   114   100   000    Old_age   Always       -       38
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.16.7-1-bfs] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   102   099   006    Pre-fail  Always       -       4738696
  3 Spin_Up_Time            0x0003   092   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       836
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       144
  7 Seek_Error_Rate         0x000f   078   060   030    Pre-fail  Always       -       69594766
  9 Power_On_Hours          0x0032   077   077   000    Old_age   Always       -       20554
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       721
183 Runtime_Bad_Block       0x0032   092   092   000    Old_age   Always       -       8
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   099   000    Old_age   Always       -       14
189 High_Fly_Writes         0x003a   097   097   000    Old_age   Always       -       3
190 Airflow_Temperature_Cel 0x0022   068   042   045    Old_age   Always   In_the_past 32 (0 15 39 23 0)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       320
193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       947
194 Temperature_Celsius     0x0022   032   058   000    Old_age   Always       -       32 (0 13 0 0 0)
195 Hardware_ECC_Recovered  0x001a   014   003   000    Old_age   Always       -       4738696
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       19390 (116 2 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       2165686930
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       1913785108
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.16.7-1-bfs] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       1
  3 Spin_Up_Time            0x0027   182   178   021    Pre-fail  Always       -       5900
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       310
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   086   086   000    Old_age   Always       -       10839
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       275
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       175
193 Load_Cycle_Count        0x0032   123   123   000    Old_age   Always       -       233706
194 Temperature_Celsius     0x0022   120   102   000    Old_age   Always       -       30
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-3.16.7-1-bfs] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       154070800
  3 Spin_Up_Time            0x0003   094   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       198
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   077   060   030    Pre-fail  Always       -       4346841135
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       9283
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       185
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0 0 0
189 High_Fly_Writes         0x003a   098   098   000    Old_age   Always       -       2
190 Airflow_Temperature_Cel 0x0022   065   046   045    Old_age   Always       -       35 (Min/Max 23/45)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       129
193 Load_Cycle_Count        0x0032   098   098   000    Old_age   Always       -       5879
194 Temperature_Celsius     0x0022   035   054   000    Old_age   Always       -       35 (0 19 0 0 0)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       8753h+05m+40.278s
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       36640474598
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       94882096088
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html