Disk "failed" while doing scrub

Dāvis Mosāns Sun, 12 Jul 2015 23:27:12 -0700

Hello,

Short version: while running a scrub on a 5-disk btrfs filesystem, /dev/sdd
"failed", and there was also an error on another disk (/dev/sdh).


Because the filesystem still mounts, I assume I should run "btrfs device
delete /dev/sdd /mntpoint" and then restore the damaged files from backup.
Are all affected files listed in the journal? There are messages about "x
callbacks suppressed", so I'm not sure. If they aren't, how do I get a
full list of damaged files?
I also wonder whether there are any tools to recover partial file fragments
and reconstruct a file (with the missing fragments filled with nulls).
I assume there's no point in running "btrfs check
--check-data-csum" because scrub already checks that?
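
For collecting the affected files from the journal, a minimal Python sketch
follows. It assumes the journal text is available as a string and that the
messages use the same "BTRFS: i/o error at logical ..." wording as the excerpt
in this post; the name damaged_files is made up for illustration. Because of
the "callbacks suppressed" rate limiting, the result can only be a lower
bound, not a full list.

```python
import re

# Pattern for the scrub message format seen in the journal excerpt of this post.
# Assumption: the path contains no ")" character.
IO_ERROR_RE = re.compile(
    r"BTRFS: i/o error at logical (?P<logical>\d+) on dev (?P<dev>\S+), "
    r"sector (?P<sector>\d+), root (?P<root>\d+), inode (?P<inode>\d+), "
    r"offset (?P<offset>\d+), length (?P<length>\d+), links \d+ "
    r"\(path: (?P<path>[^)]*)\)"
)


def damaged_files(journal_text):
    """Group logged i/o errors by (root, inode).

    Returns {(root, inode): {"paths": set of str, "offsets": set of int}}.
    """
    # Collapse all whitespace so messages wrapped across lines still match.
    flat = " ".join(journal_text.split())
    found = {}
    for m in IO_ERROR_RE.finditer(flat):
        key = (int(m["root"]), int(m["inode"]))
        entry = found.setdefault(key, {"paths": set(), "offsets": set()})
        entry["paths"].add(m["path"])
        entry["offsets"].add(int(m["offset"]))
    return found
```

Each key is one file (subvolume root plus inode), and the offsets show which
4096-byte ranges of that file were reported as unreadable.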

From the journal:

kernel: drivers/scsi/mvsas/mv_sas.c 1863:Release slot [1] tag[1], task
[ffff88007efb8800]:
kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active 00000002,  slot [1].
kernel: sas: sas_ata_task_done: SAS error 8a
kernel: sas: Enter sas_scsi_recover_host busy: 1 failed: 1
kernel: sas: ata9: end_device-7:2: cmd error handler
kernel: sas: ata7: end_device-7:0: dev error handler
kernel: sas: ata14: end_device-7:7: dev error handler
kernel: ata9.00: exception Emask 0x0 SAct 0x800 SErr 0x0 action 0x0
kernel: ata9.00: failed command: READ FPDMA QUEUED
kernel: ata9.00: cmd 60/00:00:00:3d:a1/04:00:ab:00:00/40 tag 11 ncq 524288 in
                                            res
41/40:00:48:40:a1/00:04:ab:00:00/00 Emask 0x409 (media error) <F>
kernel: ata9.00: status: { DRDY ERR }
kernel: ata9.00: error: { UNC }
kernel: ata9.00: configured for UDMA/133
kernel: sd 7:0:2:0: [sdd] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x00
driverbyte=0x08
kernel: sd 7:0:2:0: [sdd] tag#0 Sense Key : 0x3 [current] [descriptor]
kernel: sd 7:0:2:0: [sdd] tag#0 ASC=0x11 ASCQ=0x4
kernel: sd 7:0:2:0: [sdd] tag#0 CDB: opcode=0x28 28 00 ab a1 3d 00 00 04 00 00
kernel: blk_update_request: I/O error, dev sdd, sector 2879471688
kernel: ata9: EH complete
kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
kernel: drivers/scsi/mvsas/mv_sas.c 1863:Release slot [1] tag[1], task
[ffff88007efb9a00]:
kernel: drivers/scsi/mvsas/mv_94xx.c 625:command active 00000003,  slot [1].
kernel: sas: sas_ata_task_done: SAS error 8a
kernel: sas: Enter sas_scsi_recover_host busy: 2 failed: 2
kernel: sas: trying to find task 0xffff8801e0cadb00
kernel: sas: sas_scsi_find_task: aborting task 0xffff8801e0cadb00
kernel: sas: sas_scsi_find_task: task 0xffff8801e0cadb00 is aborted
kernel: sas: sas_eh_handle_sas_errors: task 0xffff8801e0cadb00 is aborted
kernel: sas: ata9: end_device-7:2: cmd error handler
kernel: sas: ata8: end_device-7:1: cmd error handler
kernel: sas: ata7: end_device-7:0: dev error handler
kernel: sas: ata8: end_device-7:1: dev error handler
kernel: ata8.00: exception Emask 0x0 SAct 0x40000 SErr 0x0 action 0x6 frozen
kernel: ata8.00: failed command: READ FPDMA QUEUED
kernel: ata8.00: cmd 60/00:00:00:1b:36/04:00:bf:00:00/40 tag 18 ncq 524288 in
                                            res
40/00:08:00:58:11/00:00:a6:00:00/40 Emask 0x4 (timeout)
kernel: ata8.00: status: { DRDY }
kernel: ata8: hard resetting link
kernel: sas: ata9: end_device-7:2: dev error handler
kernel: sas: ata14: end_device-7:7: dev error handler
kernel: ata9: log page 10h reported inactive tag 26
kernel: ata9.00: exception Emask 0x1 SAct 0x400000 SErr 0x0 action 0x6
kernel: ata9.00: failed command: READ FPDMA QUEUED
kernel: ata9.00: cmd 60/08:00:48:40:a1/00:00:ab:00:00/40 tag 22 ncq 4096 in
                                            res
01/04:a8:40:40:a1/00:00:ab:00:00/40 Emask 0x3 (HSM violation)
kernel: ata9.00: status: { ERR }
kernel: ata9.00: error: { ABRT }
kernel: ata9: hard resetting link
kernel: sas: sas_form_port: phy1 belongs to port1 already(1)!
kernel: ata9.00: both IDENTIFYs aborted, assuming NODEV
kernel: ata9.00: revalidation failed (errno=-2)
kernel: drivers/scsi/mvsas/mv_sas.c 1428:mvs_I_T_nexus_reset for device[1]:rc= 0
kernel: ata8.00: configured for UDMA/133
kernel: ata8.00: device reported invalid CHS sector 0
kernel: ata8: EH complete
kernel: ata9: hard resetting link
kernel: ata9.00: both IDENTIFYs aborted, assuming NODEV
kernel: ata9.00: revalidation failed (errno=-2)
kernel: ata9: hard resetting link
kernel: ata9.00: both IDENTIFYs aborted, assuming NODEV
kernel: ata9.00: revalidation failed (errno=-2)
kernel: ata9.00: disabled
kernel: ata9: EH complete
kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
kernel: sd 7:0:2:0: [sdd] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x04
driverbyte=0x00
kernel: sd 7:0:2:0: [sdd] tag#0 CDB: opcode=0x28 28 00 ab a1 40 48 00 00 08 00
kernel: blk_update_request: I/O error, dev sdd, sector 2879471688
kernel: sd 7:0:2:0: [sdd] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x04
driverbyte=0x00
kernel: sd 7:0:2:0: [sdd] tag#0 CDB: opcode=0x28 28 00 ab a1 45 00 00 06 00 00
kernel: BTRFS: unable to fixup (regular) error at logical
7390602616832 on dev /dev/sdd
kernel: BTRFS: unable to fixup (regular) error at logical
7390602891264 on dev /dev/sdd
kernel: scsi_io_completion: 186117 callbacks suppressed
kernel: sd 7:0:2:0: [sdd] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x04
driverbyte=0x00
kernel: sd 7:0:2:0: [sdd] tag#0 CDB: opcode=0x2a 2a 00 00 14 78 c0 00 00 20 00
kernel: blk_update_request: 186156 callbacks suppressed
kernel: blk_update_request: I/O error, dev sdd, sector 1341632
kernel: sd 7:0:2:0: [sdd] tag#1 UNKNOWN(0x2003) Result: hostbyte=0x04
driverbyte=0x00
kernel: sd 7:0:2:0: [sdd] tag#1 CDB: opcode=0x2a 2a 00 00 14 7a 80 00 00 20 00
kernel: blk_update_request: I/O error, dev sdd, sector 2879472896
kernel: BTRFS: i/o error at logical 7386235424768 on dev /dev/sdd,
sector 2891849768, root 3034, inode 5633529, offset 11878400, length
4096, links 1 (path: [...])
kernel: BTRFS: i/o error at logical 7386235039744 on dev /dev/sdd,
sector 2891849016, root 3034, inode 5633529, offset 11493376, length
4096, links 1 (path: [...])
kernel: btrfs_dev_stat_print_on_error: 78908 callbacks suppressed
kernel: BTRFS: bdev /dev/sdd errs: wr 347, rd 1644871, flush 0, corrupt 0, gen 0
kernel: BTRFS: bdev /dev/sdd errs: wr 356, rd 1644871, flush 0, corrupt 0, gen 0
kernel: BTRFS: error (device sdh) in write_all_supers:3454: errno=-5
IO failure (errors while submitting device barriers.)
kernel: BTRFS info (device sdh): forced readonly
kernel: BTRFS warning (device sdh): Skipping commit of aborted transaction.
kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 5 PID: 3756 at fs/btrfs/super.c:260
__btrfs_abort_transaction+0x54/0x130 [btrfs]()
kernel: BTRFS: Transaction aborted (error -5)
kernel: Modules linked in: nf_conntrack_netbios_ns
nf_conntrack_broadcast xt_tcpudp ip6t_rpfilter ip6t_REJECT [...]
kernel:  nvidia(PO) tda8290 tuner aes_x86_64 lrw saa7134
snd_hda_codec_realtek gf128mul edac_core glue_helper [...]
kernel:
kernel: CPU: 5 PID: 3756 Comm: btrfs-transacti Tainted: P           O
  4.0.7-2-ARCH #1
kernel: Hardware name: Gigabyte Technology Co., Ltd.
GA-990FXA-UD3/GA-990FXA-UD3, BIOS FFe 11/08/2013
kernel:  0000000000000000 000000005f5d9ca7 ffff88006090fc18 ffffffff81574ec3
kernel:  0000000000000000 ffff88006090fc70 ffff88006090fc58 ffffffff81074e7a
kernel:  0000000000000000 ffff8800ce8e6c60 00000000fffffffb ffff8800bbaa4800
kernel: Call Trace:
kernel:  [<ffffffff81574ec3>] dump_stack+0x4c/0x6e
kernel:  [<ffffffff81074e7a>] warn_slowpath_common+0x8a/0xc0
kernel:  [<ffffffff81074f05>] warn_slowpath_fmt+0x55/0x70
kernel:  [<ffffffffa0253bb4>] __btrfs_abort_transaction+0x54/0x130 [btrfs]
kernel:  [<ffffffffa0282ceb>] cleanup_transaction+0x7b/0x300 [btrfs]
kernel:  [<ffffffff810b6ce0>] ? wake_atomic_t_function+0x60/0x60
kernel:  [<ffffffffa0284162>] btrfs_commit_transaction+0x932/0xc10 [btrfs]
kernel:  [<ffffffffa027f3a5>] transaction_kthread+0x1d5/0x240 [btrfs]
kernel:  [<ffffffffa027f1d0>] ? btrfs_cleanup_transaction+0x5a0/0x5a0 [btrfs]
kernel:  [<ffffffff810934b8>] kthread+0xd8/0xf0
kernel:  [<ffffffff810933e0>] ? kthread_worker_fn+0x170/0x170
kernel:  [<ffffffff8157a718>] ret_from_fork+0x58/0x90
kernel:  [<ffffffff810933e0>] ? kthread_worker_fn+0x170/0x170
kernel: ---[ end trace 8ecc49ef203bd88c ]---
kernel: BTRFS: error (device sdh) in cleanup_transaction:1686:
errno=-5 IO failure
kernel: BTRFS info (device sdh): delayed_refs has NO entry
kernel: scrub_handle_errored_block: 92600 callbacks suppressed
kernel: BTRFS: i/o error at logical 7390928568320 on dev /dev/sdd,
sector 2892627456, root 3034, inode 5637106, offset 614400, length
4096, links 1 (path: [...])
kernel: BTRFS: i/o error at logical 7390928175104 on dev /dev/sdd,
sector 2892626688, root 3034, inode 5637106, offset 483328, length
4096, links 1 (path: [...])
kernel: scrub_handle_errored_block: 77404 callbacks suppressed
kernel: BTRFS: unable to fixup (regular) error at logical
7390928568320 on dev /dev/sdd
kernel: BTRFS: unable to fixup (regular) error at logical
7390928175104 on dev /dev/sdd
smartd[723]: Device: /dev/sdd [SAT], not capable of SMART self-check
smartd[723]: Device: /dev/sdd [SAT], failed to read SMART Attribute Data
smartd[723]: Device: /dev/sdd [SAT], Read SMART Self Test Log Failed
smartd[723]: Device: /dev/sdd [SAT], Read Summary SMART Error Log failed
kernel: scsi_io_completion: 8110 callbacks suppressed
kernel: sd 7:0:2:0: [sdd] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x04
driverbyte=0x00
kernel: sd 7:0:2:0: [sdd] tag#0 CDB: opcode=0x28 28 00 e8 e0 88 00 00 00 08 00
kernel: blk_update_request: 8115 callbacks suppressed
kernel: blk_update_request: I/O error, dev sdd, sector 3907028992
kernel: sd 7:0:2:0: [sdd] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x04
driverbyte=0x00
kernel: sd 7:0:2:0: [sdd] tag#0 CDB: opcode=0x28 28 00 e8 e0 88 00 00 00 08 00
kernel: blk_update_request: I/O error, dev sdd, sector 3907028992
kernel: Buffer I/O error on dev sdd, logical block 488378624, async page read


Long story:

I had a Seagate disk that died while it was still under warranty, so I
got a replacement. The disk they returned wasn't new but repaired.
I haven't used it much, but it seems it won't last long, as it has already
developed uncorrectable sectors.
When I received it, I ran a full SMART test and checked all sectors;
everything passed and looked good. Then I copied my data onto it
and used it for a while, only to find:

smartd[592]: Device: /dev/sdd [SAT], 16 Currently unreadable (pending) sectors
smartd[592]: Device: /dev/sdd [SAT], 16 Offline uncorrectable sectors

Then I ran a scrub:

scrub status for 1ec5b839-acc6-4f70-be9d-6f9e6118c71c
       scrub started at Sun Jul 12 13:36:11 2015 and was aborted after 02:43:21
       total bytes scrubbed: 6.24TiB with 1648151 errors
       error details: read=1648151
       corrected errors: 704, uncorrectable errors: 1647447,
unverified errors: 0

It caused the drive to become unrecognizable by Linux, and it seems it also
triggered an error on a different disk (/dev/sdh),
which forced the filesystem read-only; after that it would not mount:

kernel: sd 7:0:2:0: [sdd] tag#0 UNKNOWN(0x2003) Result: hostbyte=0x04
driverbyte=0x00
kernel: sd 7:0:2:0: [sdd] tag#0 CDB: opcode=0x28 28 00 00 00 00 80 00 00 08 00
kernel: blk_update_request: I/O error, dev sdd, sector 128
kernel: BTRFS info (device sdh): enabling auto defrag
kernel: BTRFS info (device sdh): disk space caching is enabled
kernel: BTRFS: has skinny extents
kernel: BTRFS: failed to read chunk tree on sdh
mount[17625]: mount: wrong fs type, bad option, bad superblock on /dev/sdh,
mount[17625]: missing codepage or helper program, or other error
mount[17625]: In some cases useful info is found in syslog - try
mount[17625]: dmesg | tail or so.
kernel: BTRFS: open_ctree failed
kernel: sd 7:0:2:0: [sdd] Synchronizing SCSI cache
kernel: sd 7:0:2:0: [sdd] Synchronize Cache(10) failed: Result:
hostbyte=0x04 driverbyte=0x00
kernel: sd 7:0:2:0: [sdd] Stopping disk
kernel: sd 7:0:2:0: [sdd] Start/Stop Unit failed: Result:
hostbyte=0x04 driverbyte=0x00

I pulled out the /dev/sdd drive and plugged it back in:

kernel: mvsas 0000:07:00.0: Phy2 : No sig fis
kernel: sas: phy-7:2 added to port-7:2, phy_mask:0x4 ( 200000000000000)
kernel: sas: DOING DISCOVERY on port 2, pid:16744
kernel: sas: DONE DISCOVERY on port 2, pid:16744, result:0
kernel: sas: Enter sas_scsi_recover_host busy: 0 failed: 0
kernel: ata20.00: ATA-8: ST2000DM001-9YN164, CC9F, max UDMA/133
kernel: ata20.00: 3907029168 sectors, multi 0: LBA48 NCQ (depth 31/32)
kernel: ata20.00: configured for UDMA/133
kernel: sas: --- Exit sas_scsi_recover_host: busy: 0 failed: 0 tries: 1
kernel: scsi 7:0:8:0: Direct-Access     ATA      ST2000DM001-9YN1 CC9F
PQ: 0 ANSI: 5
kernel: sd 7:0:8:0: [sdd] 3907029168 512-byte logical blocks: (2.00 TB/1.81 TiB)
kernel: sd 7:0:8:0: [sdd] 4096-byte physical blocks
kernel: sd 7:0:8:0: [sdd] Write Protect is off
kernel: sd 7:0:8:0: [sdd] Mode Sense: 00 3a 00 00
kernel: sd 7:0:8:0: [sdd] Write cache: enabled, read cache: enabled,
doesn't support DPO or FUA
kernel: sd 7:0:8:0: [sdd] Attached SCSI disk
smartd[723]: Device: /dev/sdd [SAT], SMART Usage Attribute: 187
Reported_Uncorrect changed from 100 to 98
smartd[723]: Device: /dev/sdd [SAT], previous self-test completed with
error (read test element)
smartd[723]: Device: /dev/sdd [SAT], Self-Test Log error count
increased from 0 to 2
smartd[723]: Device: /dev/sdd [SAT], ATA error count increased from 0 to 2

Everything seemed "ok" again. I ran a short SMART self-test, which
failed for the first time (although the disk's overall SMART status still
says PASSED), then resumed the scrub, and it completed:

scrub status for 1ec5b839-acc6-4f70-be9d-6f9e6118c71c
scrub device /dev/sdc (id 1) history
       scrub resumed at Sun Jul 12 18:07:06 2015 and finished after 04:34:02
       total bytes scrubbed: 2.35TiB with 0 errors
scrub device /dev/sdd (id 2) history
       scrub resumed at Sun Jul 12 18:07:06 2015 and finished after 02:56:23
       total bytes scrubbed: 1.44TiB with 1648151 errors
       error details: read=1648151
       corrected errors: 704, uncorrectable errors: 1647447,
unverified errors: 0
scrub device /dev/sde (id 3) history
       scrub started at Sun Jul 12 13:36:11 2015 and finished after 02:35:46
       total bytes scrubbed: 1.43TiB with 0 errors
scrub device /dev/sdg (id 4) history
       scrub started at Sun Jul 12 13:36:11 2015 and finished after 02:40:01
       total bytes scrubbed: 1.44TiB with 0 errors
scrub device /dev/sdh (id 5) history
       scrub started at Sun Jul 12 13:36:11 2015 and finished after 01:14:34
       total bytes scrubbed: 537.82GiB with 0 errors
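
The per-device history above can be cross-checked with a small Python sketch.
It assumes the output layout shown in this post; per_device_errors is a
hypothetical helper name, not part of btrfs-progs.

```python
import re

DEVICE_RE = re.compile(r"scrub device (\S+) \(id (\d+)\) history")
TOTAL_RE = re.compile(r"with (\d+) errors")
SPLIT_RE = re.compile(r"corrected errors: (\d+), uncorrectable errors: (\d+)")


def per_device_errors(status_text):
    """Return {device: {"total", "corrected", "uncorrectable"}} from scrub status text."""
    result = {}
    current = None
    for line in status_text.splitlines():
        m = DEVICE_RE.search(line)
        if m:
            # A new per-device section starts here.
            current = m.group(1)
            result[current] = {"total": 0, "corrected": 0, "uncorrectable": 0}
            continue
        if current is None:
            continue
        m = TOTAL_RE.search(line)
        if m:
            result[current]["total"] = int(m.group(1))
        m = SPLIT_RE.search(line)
        if m:
            result[current]["corrected"] = int(m.group(1))
            result[current]["uncorrectable"] = int(m.group(2))
    return result
```

For /dev/sdd this gives 704 corrected plus 1647447 uncorrectable, which adds up
to the reported total of 1648151 read errors; the other four devices report 0.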

"btrfs device stats" doesn't show any errors:

[/dev/sdc].write_io_errs   0
[/dev/sdc].read_io_errs    0
[/dev/sdc].flush_io_errs   0
[/dev/sdc].corruption_errs 0
[/dev/sdc].generation_errs 0
[/dev/sdd].write_io_errs   0
[/dev/sdd].read_io_errs    0
[/dev/sdd].flush_io_errs   0
[/dev/sdd].corruption_errs 0
[/dev/sdd].generation_errs 0
[/dev/sde].write_io_errs   0
[/dev/sde].read_io_errs    0
[/dev/sde].flush_io_errs   0
[/dev/sde].corruption_errs 0
[/dev/sde].generation_errs 0
[/dev/sdg].write_io_errs   0
[/dev/sdg].read_io_errs    0
[/dev/sdg].flush_io_errs   0
[/dev/sdg].corruption_errs 0
[/dev/sdg].generation_errs 0
[/dev/sdh].write_io_errs   0
[/dev/sdh].read_io_errs    0
[/dev/sdh].flush_io_errs   0
[/dev/sdh].corruption_errs 0
[/dev/sdh].generation_errs 0


The other disk, /dev/sdh, doesn't show any signs of having gone bad,
so most likely the controller misbehaved when sdd threw errors.
When scrub reports error counts, what exactly counts as an error? A
file fragment?
Also, is there an easy way to locate those unreadable sectors and
rewrite them so the HDD reallocates them?
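
As a starting point for locating the sectors, here is a minimal Python sketch.
It decodes the READ(10) CDB that the kernel logged for the failing command and
computes the 4096-byte physical block that contains the LBA. The sector sizes
are taken from the smartctl output in this post; the function names are
hypothetical.

```python
LOGICAL_SECTOR = 512    # "512 bytes logical" in the smartctl output
PHYSICAL_SECTOR = 4096  # "4096 bytes physical"


def decode_rw10(cdb_hex):
    """Decode a 10-byte SCSI READ(10)/WRITE(10) CDB into (opcode, lba, blocks)."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] not in (0x28, 0x2A):
        raise ValueError("expected a 10-byte READ(10) or WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2..5: starting LBA
    blocks = int.from_bytes(cdb[7:9], "big")   # bytes 7..8: transfer length
    return cdb[0], lba, blocks


def physical_block_range(lba):
    """Return (first_lba, last_lba, byte_offset) of the physical block holding lba."""
    per_physical = PHYSICAL_SECTOR // LOGICAL_SECTOR
    first = lba - lba % per_physical
    return first, first + per_physical - 1, first * LOGICAL_SECTOR


# CDB copied from the journal line for the failed 8-sector read on sdd.
opcode, lba, blocks = decode_rw10("28 00 ab a1 40 48 00 00 08 00")
first, last, byte_offset = physical_block_range(lba)
print(hex(opcode), lba, blocks, first, last, byte_offset)
```

With the CDB from the journal this decodes to LBA 2879471688, the same LBA that
the SMART error log and the self-test log report. Overwriting that range
destroys whatever was stored there, so it only makes sense for data that will
be restored from backup anyway.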

Thanks :)

Here's the full SMART info for /dev/sdd:

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Barracuda 7200.14 (AF)
Device Model:     ST2000DM001-9YN164
Serial Number:    W2404VST
LU WWN Device Id: 5 000c50 044a7a68a
Firmware Version: CC9F
User Capacity:    2 000 398 934 016 bytes [2,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    7200 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 4
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Jul 13 07:40:14 2015 EEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Unavailable
APM level is:     128 (minimum power consumption without standby)
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Unavailable

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                       was never started.
                                       Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 121) The previous self-test completed having
                                       the read element of the test failed.
Total time to complete Offline
data collection:                (  592) seconds.
Offline data collection
capabilities:                    (0x73) SMART execute Offline immediate.
                                       Auto Offline data collection
on/off support.
                                       Suspend Offline collection upon new
                                       command.
                                       No Offline surface scan supported.
                                       Self-test supported.
                                       Conveyance Self-test supported.
                                       Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                       power-saving mode.
                                       Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                       General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        ( 254) minutes.
Conveyance self-test routine
recommended polling time:        (   2) minutes.
SCT capabilities:              (0x3081) SCT Status supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
 1 Raw_Read_Error_Rate     POSR--   117   100   006    -    166724616
 3 Spin_Up_Time            PO----   092   092   000    -    0
 4 Start_Stop_Count        -O--CK   100   100   020    -    626
 5 Reallocated_Sector_Ct   PO--CK   100   100   036    -    0
 7 Seek_Error_Rate         POSR--   060   060   030    -    1306645
 9 Power_On_Hours          -O--CK   097   097   000    -    3154
10 Spin_Retry_Count        PO--C-   100   100   097    -    0
12 Power_Cycle_Count       -O--CK   100   100   020    -    433
183 Runtime_Bad_Block       -O--CK   100   100   000    -    0
184 End-to-End_Error        -O--CK   100   100   099    -    0
187 Reported_Uncorrect      -O--CK   098   098   000    -    2
188 Command_Timeout         -O--CK   100   099   000    -    4 4 4
189 High_Fly_Writes         -O-RCK   100   100   000    -    0
190 Airflow_Temperature_Cel -O---K   070   058   045    -    30 (0 1 34 29 0)
191 G-Sense_Error_Rate      -O--CK   100   100   000    -    0
192 Power-Off_Retract_Count -O--CK   100   100   000    -    335
193 Load_Cycle_Count        -O--CK   096   096   000    -    9566
194 Temperature_Celsius     -O---K   030   042   000    -    30 (128 0 0 0 0)
197 Current_Pending_Sector  -O--C-   100   100   000    -    16
198 Offline_Uncorrectable   ----C-   100   100   000    -    16
199 UDMA_CRC_Error_Count    -OSRCK   200   200   000    -    0
240 Head_Flying_Hours       ------   100   253   000    -    367h+26m+14.504s
241 Total_LBAs_Written      ------   100   253   000    -    38608136381115
242 Total_LBAs_Read         ------   100   253   000    -    7979572945843
                           ||||||_ K auto-keep
                           |||||__ C event count
                           ||||___ R error rate
                           |||____ S speed/performance
                           ||_____ O updated online
                           |______ P prefailure warning
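
A sketch of pulling the sector-health attributes out of the table above,
assuming smartctl's column layout exactly as printed here; the helper name
watched_attributes and the choice of IDs are illustrative only.

```python
# Attribute IDs that relate to unreadable or remapped sectors in this table.
WATCHED_IDS = {5, 187, 197, 198}


def watched_attributes(table_text):
    """Return {id: {"name", "value", "thresh", "raw"}} for the watched IDs."""
    rows = {}
    for line in table_text.splitlines():
        fields = line.split()
        # Data rows look like: ID NAME FLAGS VALUE WORST THRESH FAIL RAW...
        if len(fields) < 8 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        if attr_id not in WATCHED_IDS:
            continue
        rows[attr_id] = {
            "name": fields[1],
            "value": int(fields[3]),
            "thresh": int(fields[5]),
            "raw": int(fields[7]),
        }
    return rows
```

For this drive, attributes 197 and 198 both have a raw count of 16 while their
normalized VALUE is still 100 against a THRESH of 000, which may be why the
overall-health result still reads PASSED.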

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      5  Comprehensive SMART error log
0x03       GPL     R/O      5  Ext. Comprehensive SMART error log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      1  Extended self-test log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  SATA NCQ Queued Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x21       GPL     R/O      1  Write stream error log
0x22       GPL     R/O      1  Read stream error log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xa1       GPL,SL  VS      20  Device vendor specific log
0xa2       GPL     VS    4496  Device vendor specific log
0xa8       GPL,SL  VS      20  Device vendor specific log
0xa9       GPL,SL  VS       1  Device vendor specific log
0xab       GPL     VS       1  Device vendor specific log
0xb0       GPL     VS    5067  Device vendor specific log
0xbd       GPL     VS     512  Device vendor specific log
0xbe-0xbf  GPL     VS   65535  Device vendor specific log
0xc0       GPL,SL  VS       1  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (5 sectors)
Device Error Count: 2
       CR     = Command Register
       FEATR  = Features Register
       COUNT  = Count (was: Sector Count) Register
       LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
       LH     = LBA High (was: Cylinder High) Register    ]   LBA
       LM     = LBA Mid (was: Cylinder Low) Register      ] Register
       LL     = LBA Low (was: Sector Number) Register     ]
       DV     = Device (was: Device/Head) Register
       DC     = Device Control Register
       ER     = Error register
       ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 [1] occurred at disk power-on lifetime: 3139 hours (130 days + 19 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER -- ST COUNT  LBA_48  LH LM LL DV DC
 -- -- -- == -- == == == -- -- -- -- --
 40 -- 51 00 00 00 00 ab a1 40 48 00 00  Error: UNC at LBA =
0xaba14048 = 2879471688

 Commands leading to the command that caused the error were:
 CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
 -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
 60 00 00 00 08 00 00 ab a1 40 48 40 00     02:54:39.784  READ FPDMA QUEUED
 60 00 00 00 08 00 00 ab a1 40 40 40 00     02:54:39.783  READ FPDMA QUEUED
 60 00 00 00 08 00 00 ab a1 40 38 40 00     02:54:39.783  READ FPDMA QUEUED
 60 00 00 00 08 00 00 ab a1 40 30 40 00     02:54:39.782  READ FPDMA QUEUED
 60 00 00 00 08 00 00 ab a1 40 28 40 00     02:54:39.782  READ FPDMA QUEUED

Error 1 [0] occurred at disk power-on lifetime: 3139 hours (130 days + 19 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER -- ST COUNT  LBA_48  LH LM LL DV DC
 -- -- -- == -- == == == -- -- -- -- --
 40 -- 51 00 00 00 00 ab a1 40 48 00 00  Error: UNC at LBA =
0xaba14048 = 2879471688

 Commands leading to the command that caused the error were:
 CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
 -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
 60 00 00 04 00 00 00 ab a0 14 00 40 00     02:54:36.512  READ FPDMA QUEUED
 60 00 00 04 00 00 00 ab a0 10 00 40 00     02:54:36.500  READ FPDMA QUEUED
 60 00 00 04 00 00 00 ab a0 0c 00 40 00     02:54:36.498  READ FPDMA QUEUED
 60 00 00 04 00 00 00 ab a0 08 00 40 00     02:54:36.497  READ FPDMA QUEUED
 60 00 00 04 00 00 00 ab 9f f9 00 40 00     02:54:36.402  READ FPDMA QUEUED

SMART Error Log Version: 1
ATA Error Count: 2
       CR = Command Register [HEX]
       FR = Features Register [HEX]
       SC = Sector Count Register [HEX]
       SN = Sector Number Register [HEX]
       CL = Cylinder Low Register [HEX]
       CH = Cylinder High Register [HEX]
       DH = Device/Head Register [HEX]
       DC = Device Command Register [HEX]
       ER = Error register [HEX]
       ST = Status register [HEX]
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.

Error 2 occurred at disk power-on lifetime: 3139 hours (130 days + 19 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 60 00 08 ff ff ff 4f 00      02:54:39.784  READ FPDMA QUEUED
 60 00 08 ff ff ff 4f 00      02:54:39.783  READ FPDMA QUEUED
 60 00 08 ff ff ff 4f 00      02:54:39.783  READ FPDMA QUEUED
 60 00 08 ff ff ff 4f 00      02:54:39.782  READ FPDMA QUEUED
 60 00 08 ff ff ff 4f 00      02:54:39.782  READ FPDMA QUEUED

Error 1 occurred at disk power-on lifetime: 3139 hours (130 days + 19 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER ST SC SN CL CH DH
 -- -- -- -- -- -- --
 40 51 00 ff ff ff 0f  Error: UNC at LBA = 0x0fffffff = 268435455

 Commands leading to the command that caused the error were:
 CR FR SC SN CL CH DH DC   Powered_Up_Time  Command/Feature_Name
 -- -- -- -- -- -- -- --  ----------------  --------------------
 60 00 00 ff ff ff 4f 00      02:54:36.512  READ FPDMA QUEUED
 60 00 00 ff ff ff 4f 00      02:54:36.500  READ FPDMA QUEUED
 60 00 00 ff ff ff 4f 00      02:54:36.498  READ FPDMA QUEUED
 60 00 00 ff ff ff 4f 00      02:54:36.497  READ FPDMA QUEUED
 60 00 00 ff ff ff 4f 00      02:54:36.402  READ FPDMA QUEUED

SMART Extended Self-test Log Version: 1 (1 sectors)
Num  Test_Description    Status                  Remaining
LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%      3139
      2879471688
# 2  Short offline       Completed: read failure       90%      3139
      2879471688
# 3  Short offline       Completed without error       00%      3049         -
# 4  Conveyance offline  Completed without error       00%      2996         -
# 5  Short offline       Completed without error       00%      2239         -
# 6  Extended offline    Completed without error       00%      2238         -
# 7  Short offline       Completed without error       00%      1550         -
# 8  Short offline       Completed without error       00%      1550         -
# 9  Short offline       Completed without error       00%        69         -
#10  Short offline       Completed without error       00%         9         -

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining
LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%      3139
      2879471688
# 2  Short offline       Completed: read failure       90%      3139
      2879471688
# 3  Short offline       Completed without error       00%      3049         -
# 4  Conveyance offline  Completed without error       00%      2996         -
# 5  Short offline       Completed without error       00%      2239         -
# 6  Extended offline    Completed without error       00%      2238         -
# 7  Short offline       Completed without error       00%      1550         -
# 8  Short offline       Completed without error       00%      1550         -
# 9  Short offline       Completed without error       00%        69         -
#10  Short offline       Completed without error       00%         9         -

SMART Selective self-test log data structure revision number 1
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Not_testing
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
 After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  3
SCT Version (vendor specific):       522 (0x020a)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                    30 Celsius
Power Cycle Min/Max Temperature:     29/34 Celsius
Lifetime    Min/Max Temperature:      9/42 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Data Table command not supported

SCT Error Recovery Control command not supported

Device Statistics (GP/SMART Log 0x04) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x000a  2            1  Device-to-host register FISes sent due to a COMRESET
0x0001  2            0  Command failed due to ICRC error
0x0003  2            0  R_ERR response for device-to-host data FIS
0x0004  2            0  R_ERR response for host-to-device data FIS
0x0006  2            0  R_ERR response for device-to-host non-data FIS
0x0007  2            0  R_ERR response for host-to-device non-data FIS





SMART info for /dev/sdh

=== START OF INFORMATION SECTION ===
Model Family:     SAMSUNG SpinPoint F3
Device Model:     SAMSUNG HD103SJ
Serial Number:    S246JDWZ113593
LU WWN Device Id: 5 0024e9 002bf43c5
Firmware Version: 1AJ100E4
User Capacity:    1 000 204 886 016 bytes [1,00 TB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    7200 rpm
Form Factor:      3.5 inches
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ATA8-ACS T13/1699-D revision 6
SATA Version is:  SATA 2.6, 3.0 Gb/s
Local Time is:    Mon Jul 13 07:53:49 2015 EEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
AAM feature is:   Disabled
APM feature is:   Disabled
Rd look-ahead is: Enabled
Write cache is:   Enabled
ATA Security is:  Disabled, NOT FROZEN [SEC1]
Wt Cache Reorder: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                       was completed without error.
                                       Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                       without error or no self-test has ever
                                       been run.
Total time to complete Offline
data collection:                ( 9420) seconds.
Offline data collection
capabilities:                    (0x5b) SMART execute Offline immediate.
                                       Auto Offline data collection
on/off support.
                                       Suspend Offline collection upon new
                                       command.
                                       Offline surface scan supported.
                                       Self-test supported.
                                       No Conveyance Self-test supported.
                                       Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                       power-saving mode.
                                       Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                       General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 157) minutes.
SCT capabilities:              (0x003f) SCT Status supported.
                                       SCT Error Recovery Control supported.
                                       SCT Feature Control supported.
                                       SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
 1 Raw_Read_Error_Rate     POSR-K   100   100   051    -    1
 2 Throughput_Performance  -OS--K   055   055   000    -    8621
 3 Spin_Up_Time            PO---K   073   071   025    -    8314
 4 Start_Stop_Count        -O--CK   091   091   000    -    9745
 5 Reallocated_Sector_Ct   PO--CK   252   252   010    -    0
 7 Seek_Error_Rate         -OSR-K   252   252   051    -    0
 8 Seek_Time_Performance   --S--K   252   252   015    -    0
 9 Power_On_Hours          -O--CK   100   100   000    -    20675
10 Spin_Retry_Count        -O--CK   252   252   051    -    0
11 Calibration_Retry_Count -O--CK   252   252   000    -    0
12 Power_Cycle_Count       -O--CK   097   097   000    -    3297
191 G-Sense_Error_Rate      -O---K   100   100   000    -    42
192 Power-Off_Retract_Count -O---K   252   252   000    -    0
194 Temperature_Celsius     -O----   064   043   000    -    32 (Min/Max 4/57)
195 Hardware_ECC_Recovered  -O-RCK   100   100   000    -    0
196 Reallocated_Event_Count -O--CK   252   252   000    -    0
197 Current_Pending_Sector  -O--CK   252   252   000    -    0
198 Offline_Uncorrectable   ----CK   252   252   000    -    0
199 UDMA_CRC_Error_Count    -OS-CK   100   100   000    -    2
200 Multi_Zone_Error_Rate   -O-R-K   100   100   000    -    101
223 Load_Retry_Count        -O--CK   252   252   000    -    0
225 Load_Cycle_Count        -O--CK   100   100   000    -    9897
                           ||||||_ K auto-keep
                           |||||__ C event count
                           ||||___ R error rate
                           |||____ S speed/performance
                           ||_____ O updated online
                           |______ P prefailure warning

General Purpose Log Directory Version 1
SMART           Log Directory Version 1 [multi-sector log support]
Address    Access  R/W   Size  Description
0x00       GPL,SL  R/O      1  Log Directory
0x01           SL  R/O      1  Summary SMART error log
0x02           SL  R/O      2  Comprehensive SMART error log
0x03       GPL     R/O      2  Ext. Comprehensive SMART error log
0x06           SL  R/O      1  SMART self-test log
0x07       GPL     R/O      2  Extended self-test log
0x08       GPL     R/O      2  Power Conditions log
0x09           SL  R/W      1  Selective self-test log
0x10       GPL     R/O      1  SATA NCQ Queued Error log
0x11       GPL     R/O      1  SATA Phy Event Counters log
0x80-0x9f  GPL,SL  R/W     16  Host vendor specific log
0xbb       GPL     VS       4  Device vendor specific log
0xbc       GPL     VS       2  Device vendor specific log
0xe0       GPL,SL  R/W      1  SCT Command/Status
0xe1       GPL,SL  R/W      1  SCT Data Transfer

SMART Extended Comprehensive Error Log Version: 1 (2 sectors)
Device Error Count: 2
       CR     = Command Register
       FEATR  = Features Register
       COUNT  = Count (was: Sector Count) Register
       LBA_48 = Upper bytes of LBA High/Mid/Low Registers ]  ATA-8
       LH     = LBA High (was: Cylinder High) Register    ]   LBA
       LM     = LBA Mid (was: Cylinder Low) Register      ] Register
       LL     = LBA Low (was: Sector Number) Register     ]
       DV     = Device (was: Device/Head) Register
       DC     = Device Control Register
       ER     = Error register
       ST     = Status register
Powered_Up_Time is measured from power on, and printed as
DDd+hh:mm:SS.sss where DD=days, hh=hours, mm=minutes,
SS=sec, and sss=millisec. It "wraps" after 49.710 days.
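
The 49.710-day wrap quoted above is what a 32-bit millisecond counter would give. A minimal sketch of that arithmetic, assuming the timestamp is stored that way (the log does not say so itself); `format_powered_up` is a hypothetical helper, not part of smartctl:

```python
# Assumption: Powered_Up_Time is an unsigned 32-bit count of milliseconds.
WRAP_MS = 2 ** 32

# Milliseconds -> seconds -> days.
wrap_days = WRAP_MS / 1000 / 86400


def format_powered_up(ms):
    """Render milliseconds since power-on as [DDd+]hh:mm:SS.sss."""
    ms %= WRAP_MS  # the counter rolls over at 2**32 ms
    days, rem = divmod(ms, 86_400_000)
    hours, rem = divmod(rem, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    prefix = f"{days:02d}d+" if days else ""
    return f"{prefix}{hours:02d}:{minutes:02d}:{seconds:02d}.{millis:03d}"


print(round(wrap_days, 3))      # 49.71
print(format_powered_up(1927))  # 00:00:01.927
```

With a counter like this, timestamps in the error log only order commands within a single power cycle of less than about 49.7 days.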

Error 2 [1] occurred at disk power-on lifetime: 4244 hours (176 days + 20 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER -- ST COUNT  LBA_48  LH LM LL DV DC
 -- -- -- == -- == == == -- -- -- -- --
 84 -- 51 93 e8 00 00 00 00 00 00 e0 00  Error: ICRC, ABRT 37864 sectors at LBA = 0x00000000 = 0

 Commands leading to the command that caused the error were:
 CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
 -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
 35 00 00 01 00 00 00 61 18 92 e8 e0 08     00:00:01.927  WRITE DMA EXT
 25 00 00 01 00 00 00 1b ce e8 60 e0 08     00:00:01.927  READ DMA EXT
 25 00 00 01 00 00 00 1b ce e7 60 e0 08     00:00:01.927  READ DMA EXT
 25 00 00 01 00 00 00 1b ce e6 60 e0 08     00:00:01.927  READ DMA EXT
 25 00 00 01 00 00 00 1b ce e5 60 e0 08     00:00:01.927  READ DMA EXT

Error 1 [0] occurred at disk power-on lifetime: 2234 hours (93 days + 2 hours)
 When the command that caused the error occurred, the device was active or idle.

 After command completion occurred, registers were:
 ER -- ST COUNT  LBA_48  LH LM LL DV DC
 -- -- -- == -- == == == -- -- -- -- --
 84 -- 51 e5 ee 00 00 00 00 00 00 e0 00  Error: ICRC, ABRT 58862 sectors at LBA = 0x00000000 = 0

 Commands leading to the command that caused the error were:
 CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
 -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
 35 00 00 00 06 00 00 00 35 e5 e8 e0 08     00:00:17.173  WRITE DMA EXT
 35 00 00 00 08 00 00 06 d5 77 10 e0 08     00:00:17.173  WRITE DMA EXT
 35 00 00 00 03 00 00 00 82 12 48 e0 08     00:00:17.173  WRITE DMA EXT
 35 00 00 00 07 00 00 06 d5 77 10 e0 08     00:00:17.171  WRITE DMA EXT
 35 00 00 00 03 00 00 00 82 12 48 e0 08     00:00:17.171  WRITE DMA EXT
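
The "Error: ICRC, ABRT N sectors" text in the two entries above can be reproduced from the register bytes. A rough sketch, assuming the two COUNT bytes are read as one 16-bit value (high byte first) and using the standard ATA error-register bit positions; the function names are made up for illustration:

```python
# ATA error register bits used here (subset only).
ERROR_BITS = {
    0x80: "ICRC",  # interface CRC error
    0x40: "UNC",   # uncorrectable data
    0x10: "IDNF",  # ID not found
    0x04: "ABRT",  # command aborted
}


def decode_error_register(er_hex):
    """Return the names of the known bits set in the ER byte."""
    value = int(er_hex, 16)
    return [name for bit, name in ERROR_BITS.items() if value & bit]


def decode_count(high_hex, low_hex):
    """Combine the two COUNT bytes into one 16-bit sector count."""
    return (int(high_hex, 16) << 8) | int(low_hex, 16)


print(decode_error_register("84"))  # ['ICRC', 'ABRT']
print(decode_count("93", "e8"))     # 37864 (Error 2)
print(decode_count("e5", "ee"))     # 58862 (Error 1)
```

ICRC is an interface CRC error on the link between host and drive, not a media error; attribute 199 (UDMA_CRC_Error_Count) has a raw value of 2, matching these two entries.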

SMART Error Log Version: 1
No Errors Logged

SMART Extended Self-test Log Version: 1 (2 sectors)
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     20661         -
# 2  Extended offline    Completed without error       00%     19724         -
# 3  Short offline       Completed without error       00%     19721         -
# 4  Short offline       Aborted by host               90%     19404         -
# 5  Short offline       Completed without error       00%     18910         -
# 6  Short offline       Completed without error       00%     15792         -
# 7  Short offline       Completed without error       00%     15792         -

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     20661         -
# 2  Extended offline    Completed without error       00%     19724         -
# 3  Short offline       Completed without error       00%     19721         -
# 4  Short offline       Aborted by host               90%     19404         -
# 5  Short offline       Completed without error       00%     18910         -
# 6  Short offline       Completed without error       00%     15792         -
# 7  Short offline       Completed without error       00%     15792         -

SMART Selective self-test log data structure revision number 0
Note: revision number not 1 implies that no selective self-test has ever been run
SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
   1        0        0  Completed [00% left] (0-65535)
   2        0        0  Not_testing
   3        0        0  Not_testing
   4        0        0  Not_testing
   5        0        0  Not_testing
Selective self-test flags (0x0):
 After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

SCT Status Version:                  2
SCT Version (vendor specific):       256 (0x0100)
SCT Support Level:                   1
Device State:                        Active (0)
Current Temperature:                    32 Celsius
Power Cycle Min/Max Temperature:     24/38 Celsius
Lifetime    Min/Max Temperature:      7/57 Celsius
Under/Over Temperature Limit Count:   0/0

SCT Temperature History Version:     2
Temperature Sampling Period:         5 minutes
Temperature Logging Interval:        5 minutes
Min/Max recommended Temperature:     -5/80 Celsius
Min/Max Temperature Limit:           -10/85 Celsius
Temperature History Size (Index):    128 (106)

Index    Estimated Time   Temperature Celsius
107    2015-07-12 21:15    35  ****************
108    2015-07-12 21:20    34  ***************
105    2015-07-13 07:45    33  **************
106    2015-07-13 07:50    32  *************

SCT Error Recovery Control:
          Read: Disabled
         Write: Disabled

Device Statistics (GP/SMART Log 0x04) not supported

SATA Phy Event Counters (GP Log 0x11)
ID      Size     Value  Description
0x0001  4            0  Command failed due to ICRC error
0x0002  4            0  R_ERR response for data FIS
0x0003  4            0  R_ERR response for device-to-host data FIS
0x0004  4            0  R_ERR response for host-to-device data FIS
0x0005  4            0  R_ERR response for non-data FIS
0x0006  4            0  R_ERR response for device-to-host non-data FIS
0x0007  4            0  R_ERR response for host-to-device non-data FIS
0x0008  4            0  Device-to-host non-data FIS retries
0x0009  4            1  Transition from drive PhyRdy to drive PhyNRdy
0x000a  4            2  Device-to-host register FISes sent due to a COMRESET
0x000b  4            0  CRC errors within host-to-device FIS
0x000d  4            0  Non-CRC errors within host-to-device FIS
0x000f  4            0  R_ERR response for host-to-device data FIS, CRC
0x0010  4            0  R_ERR response for host-to-device data FIS, non-CRC
0x0012  4            0  R_ERR response for host-to-device non-data FIS, CRC
0x0013  4            0  R_ERR response for host-to-device non-data FIS, non-CRC
0x8e00  4            0  Vendor specific
0x8e01  4            0  Vendor specific
0x8e02  4            0  Vendor specific
0x8e03  4            0  Vendor specific
0x8e04  4            0  Vendor specific
0x8e05  4            0  Vendor specific
0x8e06  4            0  Vendor specific
0x8e07  4            0  Vendor specific
0x8e08  4            0  Vendor specific
0x8e09  4            0  Vendor specific
0x8e0a  4            0  Vendor specific
0x8e0b  4            0  Vendor specific
0x8e0c  4            0  Vendor specific
0x8e0d  4            0  Vendor specific
0x8e0e  4            0  Vendor specific
0x8e0f  4            0  Vendor specific
0x8e10  4            0  Vendor specific
0x8e11  4            0  Vendor specific
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
