Re: [gentoo-user] dying hard drive
On Thu, Jul 22, 2010 at 1:11 AM, Mick wrote:
> On Thursday 22 July 2010 05:14:08 David Relson wrote:
>> /var/log/messages has indicated a slew of XFS problems on an external
>> USB hard drive (see attachment). These look pretty fatal. Anybody
>> think the file system is recoverable?
>
> You'll have to try to recover it, to see if it is possible: xfs is vulnerable
> to power interruptions, so a faulty USB cable can cause corruption.

I had exactly this problem with a USB HDD formatted with xfs. The USB cable that it came with was rubbish... the drive would disconnect & reconnect on its own for no apparent reason, and corruption happened of course. I replaced it with another cable and it worked fine after that.

A few months later the power supply started to fail, it would occasionally not provide enough power and the drive would go offline or start beeping/clicking. At first I thought the disk was bad (clicking is never good) but it was actually the sound of the drive trying to spin up and failing. Eventually the power brick couldn't even spin up the drive at all. I replaced the power supply and now the drive works fine again, for now...
Re: [gentoo-user] dying hard drive
On Thursday 22 July 2010 05:14:08 David Relson wrote:
> /var/log/messages has indicated a slew of XFS problems on an external
> USB hard drive (see attachment). These look pretty fatal. Anybody
> think the file system is recoverable?

You'll have to try to recover it to see whether it is possible: xfs is vulnerable to power interruptions, so a faulty USB cable can cause corruption. I haven't had a corrupted xfs filesystem for years now, so I put my initial experiences down to early (buggy) versions of the drivers. In my case, I was not able to recover and I had to reformat and start again. After a couple of early mortality cases the fs in question carried on for 4 years without a single problem.

Try xfs_check and xfs_repair with the drive unmounted, but first use xfsdump/xfsrestore or dd to make a backup, just in case.

> Also, palimpsest is reporting (graphically) that my external hard drive is
> about to die. Can I save its report to a text file???

Sorry, can't help with that because I'm not familiar with the application. You could use sys-apps/smartmontools if you want a console application that you can copy and paste from.

--
Regards,
Mick
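For what it's worth, the order of operations suggested above (unmount, image the partition, check, only then repair) could be sketched roughly as below. This is a dry-run sketch, not a tested recovery procedure: it only prints commands. /dev/sdb1 and /dev/sdb are taken from the kernel log in the original post; the image and report paths and the function names are hypothetical. A USB enclosure may also need an extra device-type option for smartctl, which is left out here.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands instead of running them, so nothing
# touches a disk until the output has been reviewed.

# plan_recovery PARTITION IMAGE
plan_recovery() {
    part=$1
    image=$2
    echo "umount $part"
    # conv=noerror,sync keeps going past read errors and pads short reads
    echo "dd if=$part of=$image bs=64k conv=noerror,sync"
    # -n is xfs_repair's no-modify mode: report problems, change nothing
    echo "xfs_repair -n $part"
    echo "xfs_repair $part"
}

# plan_smart_report DISK FILE: save the SMART report as plain text
plan_smart_report() {
    echo "smartctl -a $1 > $2"
}

# /dev/sdb1 and /dev/sdb come from the kernel log; the paths are hypothetical.
plan_recovery /dev/sdb1 /mnt/backup/sdb1.img
plan_smart_report /dev/sdb /tmp/smart-report.txt
```

Printing the plan first is a deliberate choice: the repair step is the only one that modifies the filesystem, so it comes last and only after the image exists.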
[gentoo-user] dying hard drive
/var/log/messages has indicated a slew of XFS problems on an external USB hard drive (see attachment). These look pretty fatal. Anybody think the file system is recoverable?

Also, palimpsest is reporting (graphically) that my external hard drive is about to die. Can I save its report to a text file???

Jul 21 23:53:23 osage kernel: usb 2-1: new high speed USB device using ehci_hcd and address 2
Jul 21 23:53:23 osage kernel: usb 2-1: New USB device found, idVendor=0bc2, idProduct=3001
Jul 21 23:53:23 osage kernel: usb 2-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
Jul 21 23:53:23 osage kernel: usb 2-1: Product: FreeAgent
Jul 21 23:53:23 osage kernel: usb 2-1: Manufacturer: Seagate
Jul 21 23:53:23 osage kernel: usb 2-1: SerialNumber: 2GEX0DP4
Jul 21 23:53:23 osage kernel: scsi4 : usb-storage 2-1:1.0
Jul 21 23:53:24 osage kernel: scsi 4:0:0:0: Direct-Access Seagate FreeAgent102D PQ: 0 ANSI: 4
Jul 21 23:53:24 osage kernel: sd 4:0:0:0: Attached scsi generic sg1 type 0
Jul 21 23:53:28 osage kernel: sd 4:0:0:0: [sdb] 976773168 512-byte logical blocks: (500 GB/465 GiB)
Jul 21 23:53:28 osage kernel: sd 4:0:0:0: [sdb] Write Protect is off
Jul 21 23:53:28 osage kernel: sd 4:0:0:0: [sdb] Mode Sense: 1c 00 00 00
Jul 21 23:53:28 osage kernel: sd 4:0:0:0: [sdb] Assuming drive cache: write through
Jul 21 23:53:28 osage kernel: sd 4:0:0:0: [sdb] Assuming drive cache: write through
Jul 21 23:53:28 osage kernel: sdb: sdb1
Jul 21 23:53:28 osage kernel: sd 4:0:0:0: [sdb] Assuming drive cache: write through
Jul 21 23:53:28 osage kernel: sd 4:0:0:0: [sdb] Attached SCSI disk
Jul 21 23:54:18 osage kernel: XFS: bad magic number
Jul 21 23:54:18 osage kernel: XFS: SB validate failed
Jul 21 23:54:36 osage kernel: XFS mounting filesystem sdb1
Jul 21 23:54:36 osage kernel: Starting XFS recovery on filesystem: sdb1 (logdev: internal)
Jul 21 23:55:12 osage kernel: XFS internal error XFS_WANT_CORRUPTED_GOTO at line 1544 of file fs/xfs/xfs_alloc.c. Caller 0x81122bf8
Jul 21 23:55:12 osage kernel: Pid: 4415, comm: mount Not tainted 2.6.34-gentoo-r1 #1
Jul 21 23:55:12 osage kernel: Call Trace:
Jul 21 23:55:12 osage kernel: [] ? xfs_free_extent+0x7d/0x94
Jul 21 23:55:12 osage kernel: [] ? xfs_free_ag_extent+0x42e/0x662
Jul 21 23:55:12 osage kernel: [] ? xfs_free_extent+0x7d/0x94
Jul 21 23:55:12 osage kernel: [] ? xfs_trans_get_efd+0x21/0x29
Jul 21 23:55:12 osage kernel: [] ? xlog_recover_process_efi+0x113/0x171
Jul 21 23:55:12 osage kernel: [] ? xlog_recover_process_efis+0x4d/0x8a
Jul 21 23:55:12 osage kernel: [] ? xlog_recover_finish+0x14/0xac
Jul 21 23:55:12 osage kernel: [] ? xfs_mountfs+0x48f/0x556
Jul 21 23:55:12 osage kernel: [] ? kmem_zalloc+0xd/0x28
Jul 21 23:55:12 osage kernel: [] ? xfs_mru_cache_create+0x111/0x14c
Jul 21 23:55:12 osage kernel: [] ? xfs_fs_fill_super+0x199/0x300
Jul 21 23:55:12 osage kernel: [] ? get_sb_bdev+0x125/0x16d
Jul 21 23:55:12 osage kernel: [] ? xfs_fs_fill_super+0x0/0x300
Jul 21 23:55:12 osage kernel: [] ? vfs_kern_mount+0xaa/0x179
Jul 21 23:55:12 osage kernel: [] ? do_kern_mount+0x43/0xe1
Jul 21 23:55:12 osage kernel: [] ? do_mount+0x766/0x7e2
Jul 21 23:55:12 osage kernel: [] ? copy_from_user+0x13/0x25
Jul 21 23:55:12 osage kernel: [] ? sys_mount+0x84/0xc5
Jul 21 23:55:12 osage kernel: [] ? system_call_fastpath+0x16/0x1b
Jul 21 23:55:12 osage kernel: Filesystem "sdb1": XFS internal error xfs_trans_cancel at line 1161 of file fs/xfs/xfs_trans.c. Caller 0x8114f0b9
Jul 21 23:55:12 osage kernel:
Jul 21 23:55:12 osage kernel: Pid: 4415, comm: mount Not tainted 2.6.34-gentoo-r1 #1
Jul 21 23:55:12 osage kernel: Call Trace:
Jul 21 23:55:12 osage kernel: [] ? xlog_recover_process_efi+0x163/0x171
Jul 21 23:55:12 osage kernel: [] ? xfs_trans_cancel+0x56/0xd3
Jul 21 23:55:12 osage kernel: [] ? xlog_recover_process_efi+0x163/0x171
Jul 21 23:55:12 osage kernel: [] ? xlog_recover_process_efis+0x4d/0x8a
Jul 21 23:55:12 osage udevd-work[3570]: '/bin/mount -a' unexpected exit with status 0x000b
Jul 21 23:55:12 osage kernel: [] ? xlog_recover_finish+0x14/0xac
Jul 21 23:55:12 osage kernel: [] ? xfs_mountfs+0x48f/0x556
Jul 21 23:55:12 osage kernel: [] ? kmem_zalloc+0xd/0x28
Jul 21 23:55:12 osage kernel: [] ? xfs_mru_cache_create+0x111/0x14c
Jul 21 23:55:12 osage kernel: [] ? xfs_fs_fill_super+0x199/0x300
Jul 21 23:55:12 osage kernel: [] ? get_sb_bdev+0x125/0x16d
Jul 21 23:55:12 osage kernel: [] ? xfs_fs_fill_super+0x0/0x300
Jul 21 23:55:12 osage kernel: [] ? vfs_kern_mount+0xaa/0x179
Jul 21 23:55:12 osage kernel: [] ? do_kern_mount+0x43/0xe1
Jul 21 23:55:12 osage kernel: [] ? do_mount+0x766/0x7e2
Jul 21 23:55:12 osage kernel: [] ? copy_from_user+0x13/0x25
Jul 21 23:55:12 osage kernel: [] ? sys_mount+0x84/0xc5
Jul 21 23:55:12 osage kernel: [] ? system_call_fastpath+0x16/0x1b
Jul 21 23:55:12 osage kernel: xfs_force_shutdown(sdb1,0x8) called from line 1162 of file fs/xfs/xfs_trans.c. Return address = 0x81156cd5
Jul 21 23:55:12 osage kernel:
Re: [gentoo-user] dying hard drive?
On Fri, Jan 13, 2006 at 06:15:20PM -0700, Richard Fish wrote:
> I was able to resurrect a drive with a similar problem with:
>     dd if=/dev/zero of=/dev/hda bs=32k
> You can then check that the drive is working with:
>     dd if=/dev/hda of=/dev/null bs=32k
>
> If either command fails, then it is time to replace the drive. In
> my case, that drive was still working perfectly 18 months later
> when I sold it to someone else.

I don't think that's going to work for me:

    # dd if=/dev/zero of=/dev/hda bs=32k
    dd: writing `/dev/hda': No space left on device
    4884091+0 records in
    4884090+0 records out
    # dd if=/dev/hda of=/dev/null bs=32k
    dd: reading `/dev/hda': Input/output error
    3229627+1 records in
    3229627+1 records out

D'oh! Time to find that RMA form!

Thanks for the help,
Matt

--
Matt Garman
email at: http://raw-sewage.net/index.php?file=email
--
gentoo-user@gentoo.org mailing list
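As a rough cross-check of the two dd runs above, the record counts can be converted back into byte and sector positions. The sketch below assumes the bs=32k used in the commands, 512-byte sectors, the 160,041,885,696-byte capacity from the smartctl output in the original post, and the LBAsect=206696214 value from the kernel log; the function names are made up for illustration.

```shell
#!/bin/sh
BS=32768      # dd block size used above (bs=32k)
SECTOR=512    # assumed sector size

# Whole BS-sized records that fit in a device of the given byte capacity.
full_records() { echo $(( $1 / BS )); }

# First sector of the record that follows N complete records.
first_sector_after() { echo $(( $1 * BS / SECTOR )); }

full_records 160041885696                                # prints 4884090
first_sector_after 3229627                               # prints 206696128
echo $(( 206696214 - $(first_sector_after 3229627) ))    # prints 86
```

On these numbers, the "No space left on device" message from the write pass lines up with dd simply reaching the end of the drive after 4884090 full records, while the read pass gave up in the record starting at sector 206696128, 86 sectors before the LBA the kernel complained about earlier. That suggests the same region is still unreadable, though the arithmetic alone does not prove it.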
Re: [gentoo-user] dying hard drive?
On 1/13/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
>
> I keep getting hard drive errors in my kernel log/dmesg that have me
> worried. From /var/log/kernel/current:
>
> Jan 13 11:42:31 [kernel] hda: dma_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
> - Last output repeated 7 times -
> Jan 13 11:42:39 [kernel] hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=206696214, high=12, low=5369622, sector=206695927
> Jan 13 11:42:39 [kernel] ide: failed opcode was: unknown
> Jan 13 11:42:40 [kernel] hda: dma_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }

These mean the blocks are corrupt, and cannot be read. Whatever was on those blocks is now lost.

> On the drive. Apparently, an error was found (details below). I'm
> not sure if this drive is actually dying, though, as the following
> article (by the smartmontools author) suggests that one or two
> errors on a drive is nothing to worry about. Also, the SMART
> overall-health self-assessment test comes back as PASSED.

I was able to resurrect a drive with a similar problem with:

    dd if=/dev/zero of=/dev/hda bs=32k

!DANGER! The above command will destroy all data on the drive... but by writing to those sectors you can cause the drive to remap them to sectors reserved for that purpose.

You can then check that the drive is working with:

    dd if=/dev/hda of=/dev/null bs=32k

If either command fails, then it is time to replace the drive. In my case, that drive was still working perfectly 18 months later when I sold it to someone else.

In any case, time to make sure you have a good backup.

-Richard
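To try out the write-then-read-back sequence without risking a real disk, something like the sketch below can be pointed at a scratch file first. It is only an illustration of the command flow: a regular file cannot show sector remapping, and the read_pass name and the scratch file are hypothetical stand-ins for /dev/hda.

```shell
#!/bin/sh
# read_pass TARGET: read TARGET end to end in 32k blocks, as in the message,
# and report whether dd got through without an I/O error.
read_pass() {
    if dd if="$1" of=/dev/null bs=32k 2>/dev/null; then
        echo "read pass OK: $1"
    else
        echo "read pass FAILED: $1"
        return 1
    fi
}

# Harmless stand-in for a drive: a 1 MiB scratch file, not /dev/hda.
scratch=$(mktemp)
dd if=/dev/zero of="$scratch" bs=32k count=32 2>/dev/null   # the "write" pass
read_pass "$scratch"
rm -f "$scratch"
```

On a real drive the write pass is the destructive step, so the read-only pass is the only part worth running before a backup exists.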
Re: [gentoo-user] dying hard drive?
On Fri, Jan 13, 2006 at 03:39:46PM -0600, Penguin Lover [EMAIL PROTECTED] squawked:
>
> I keep getting hard drive errors in my kernel log/dmesg that have me
> worried. From /var/log/kernel/current:
>
> Jan 13 11:42:31 [kernel] hda: dma_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
> - Last output repeated 7 times -
> Jan 13 11:42:39 [kernel] hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=206696214, high=12, low=5369622, sector=206695927
> Jan 13 11:42:39 [kernel] ide: failed opcode was: unknown
> Jan 13 11:42:40 [kernel] hda: dma_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
>

Do you run SMARTD? If you do, did it complain? (grep SMART /var/log/everything/*)

Usually UncorrectableError means that some spots on your hard drive are not readable. And if it keeps complaining, it might be a sign that something is wrong with your drive. (Of course, it could also be flaky connectors.)

Maybe you can take a look at
http://www.samsung.com/Products/HardDiskDrive/troubleshooting/index.htm

A lot of the time you get one or two bad sectors due to environmental issues: a power blip for one, and my roommate slamming the door too hard on his way out for another. If that is the case, most hard drive vendors provide a diagnostic tool that allows you to remap those couple of sectors to the spare ones on the disk. (Yes, they have a few extra on the hard drive just for that purpose.)

>
> The drive is a 160 GB PATA Samsung. It's about two or three years
> old, running 24x7 (although lightly). The drive has three
> partitions, all are ext3.
>
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                   Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Extended offline    Completed: read failure        00%            11486  262886799
> # 2  Short offline       Completed without error        00%            11483  -

W
--
Statistics are like a Bikini: showing interesting details but hiding the important stuff.
Sortir en Pantoufles: up 62 days, 14:29
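As a side note on reading those kernel lines: the status byte is a bit field, and the names in braces are just the bits that are set. The sketch below decodes it in plain sh; the bit values follow the ATA status register layout and the names follow what the old IDE driver prints, as far as I recall them, so treat the table as an assumption rather than gospel. decode_status is a made-up helper name.

```shell
#!/bin/sh
# decode_status BYTE: list the names of the bits set in an ATA status byte.
decode_status() {
    status=$(( $1 ))
    out=""
    for pair in 128:Busy 64:DriveReady 32:DeviceFault 16:SeekComplete \
                8:DataRequest 4:CorrectedError 2:Index 1:Error; do
        bit=${pair%%:*}     # numeric mask before the colon
        name=${pair#*:}     # label after the colon
        if [ $(( status & bit )) -ne 0 ]; then
            out="$out $name"
        fi
    done
    echo "${out# }"
}

decode_status 0x59   # prints: DriveReady SeekComplete DataRequest Error
```

0x59 is 0x40 + 0x10 + 0x08 + 0x01, which matches the four names the kernel printed; the Error bit is the one that makes the separate error=0x40 { UncorrectableError } line worth reading.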
Re: [gentoo-user] dying hard drive?
[EMAIL PROTECTED] wrote:
> I keep getting hard drive errors in my kernel log/dmesg that have me
> worried. From /var/log/kernel/current:
>
> Jan 13 11:42:31 [kernel] hda: dma_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
> - Last output repeated 7 times -
> Jan 13 11:42:39 [kernel] hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=206696214, high=12, low=5369622, sector=206695927
> Jan 13 11:42:39 [kernel] ide: failed opcode was: unknown
> Jan 13 11:42:40 [kernel] hda: dma_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }

Exactly the same message I noticed less than 1hr before my Maxtor DiamondMax 9 packed in just before Xmas. Annoyingly, my drive wouldn't mount the main data partition, but everything else seemed intact. I managed to recover all my data from the drive using dd once I had a new drive.

I'd recommend backing up anything that's essential on the drive and preparing for it to give up the ghost.

> The drive is a 160 GB PATA Samsung. It's about two or three years
> old, running 24x7 (although lightly). The drive has three
> partitions, all are ext3.
>
> When I started seeing the above messages, I ran
>     fsck.ext3 -f -v -c -c /dev/hda?
> on all three partitions. Note that the "-c" flag includes the bad
> blocks check. I also ran
>     smartctl -t long /dev/hda
> on the drive. Apparently, an error was found (details below). I'm
> not sure if this drive is actually dying, though, as the following
> article (by the smartmontools author) suggests that one or two
> errors on a drive is nothing to worry about. Also, the SMART
> overall-health self-assessment test comes back as PASSED.
>
> http://www.linuxjournal.com/article/6983
>
> But the constant kernel messages, along with the error in the "long"
> SMART test, concern me. At this point, I'm not really sure what my
> next steps should be, so I'm looking for any suggestions or advice.
>
> Thanks!
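Since the extended ("long") self-test comes up in this message, here is a small sketch for pulling the failing LBA out of a self-test log so it can be kept alongside the backup notes. The two sample rows are the ones quoted elsewhere in this thread; first_error_lbas is a hypothetical helper, and the column layout is assumed to match what smartctl 5.33 printed there.

```shell
#!/bin/sh
# first_error_lbas: print the LBA_of_first_error column for every self-test
# entry that recorded one ("-" in that column means no error was found).
first_error_lbas() {
    awk '/^# *[0-9]+ / && $NF != "-" { print $NF }'
}

# Sample rows as quoted in this thread.
first_error_lbas <<'EOF'
# 1  Extended offline    Completed: read failure       00%     11486         262886799
# 2  Short offline       Completed without error       00%     11483         -
EOF
```

With a live drive the same filter could presumably be fed from `smartctl -l selftest /dev/hda`, which prints just the self-test log.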
[gentoo-user] dying hard drive?
I keep getting hard drive errors in my kernel log/dmesg that have me worried. From /var/log/kernel/current:

Jan 13 11:42:31 [kernel] hda: dma_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }
- Last output repeated 7 times -
Jan 13 11:42:39 [kernel] hda: dma_intr: error=0x40 { UncorrectableError }, LBAsect=206696214, high=12, low=5369622, sector=206695927
Jan 13 11:42:39 [kernel] ide: failed opcode was: unknown
Jan 13 11:42:40 [kernel] hda: dma_intr: status=0x59 { DriveReady SeekComplete DataRequest Error }

The drive is a 160 GB PATA Samsung. It's about two or three years old, running 24x7 (although lightly). The drive has three partitions, all are ext3.

When I started seeing the above messages, I ran

    fsck.ext3 -f -v -c -c /dev/hda?

on all three partitions. Note that the "-c" flag includes the bad blocks check. I also ran

    smartctl -t long /dev/hda

on the drive. Apparently, an error was found (details below). I'm not sure if this drive is actually dying, though, as the following article (by the smartmontools author) suggests that one or two errors on a drive is nothing to worry about. Also, the SMART overall-health self-assessment test comes back as PASSED.

http://www.linuxjournal.com/article/6983

But the constant kernel messages, along with the error in the "long" SMART test, concern me. At this point, I'm not really sure what my next steps should be, so I'm looking for any suggestions or advice.

Thanks!
Matt

# smartctl -a /dev/hda
smartctl version 5.33 [i686-pc-linux-gnu] Copyright (C) 2002-4 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG SP1614N
Serial Number:    0642J1FW903226
Firmware Version: TM100-24
User Capacity:    160,041,885,696 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   7
ATA Standard is:  ATA/ATAPI-7 T13 1532D revision 0
Local Time is:    Fri Jan 13 15:24:27 2006 CST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      ( 245) Self-test routine in progress...
                                        50% of test remaining.
Total time to complete Offline
data collection:                 (5760) seconds.
Offline data collection
capabilities:                    (0x1b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        No Conveyance Self-test supported.
                                        No Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        No General Purpose Logging support.
Short self-test routine
recommended polling time:        (   1) minutes.
Extended self-test routine
recommended polling time:        (  96) minutes.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   051    Pre-fail  Always       -       1
  3 Spin_Up_Time            0x0007   061   061   000    Pre-fail  Always       -       6528
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       73
  5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   253   253   051    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0024   253   253   000    Old_age   Offline      -       0
  9 Power_On_Half_Minutes   0x0032   098   098   000    Old_age   Always       -       11505h+32m
 10 Spin_Retry_Count        0x0013   253   253   049    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       50
194 Temperature_Celsius     0x0022   163   127   000    Old_age   Always       -       25
195 Hardware_ECC_Recovered  0x000a   100   100   000    Old_age   Always       -       265460048
196 Reallocated_Event_Count 0x0012   100   100   000
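One way to read the attribute table is to compare each normalised VALUE with its THRESH: smartctl treats an attribute as failed once the value drops to the threshold or below. The sketch below does that comparison with awk on three rows copied from the table above; check_attrs is a hypothetical helper, and the column positions (VALUE in column 4, THRESH in column 6) are assumed from this particular output format.

```shell
#!/bin/sh
# check_attrs: read attribute rows on stdin and compare the normalised
# VALUE (column 4) with THRESH (column 6).
check_attrs() {
    awk '$1 ~ /^[0-9]+$/ && NF >= 10 {
        status = ($4 + 0 <= $6 + 0) ? "AT/BELOW THRESHOLD" : "ok"
        printf "%-24s value=%d thresh=%d %s\n", $2, $4, $6, status
    }'
}

# Three Pre-fail rows from the table above.
check_attrs <<'EOF'
  1 Raw_Read_Error_Rate     0x000b   100   100   051    Pre-fail  Always       -       1
  5 Reallocated_Sector_Ct   0x0033   253   253   010    Pre-fail  Always       -       0
 10 Spin_Retry_Count        0x0013   253   253   049    Pre-fail  Always       -       0
EOF
```

All three rows come out well above their thresholds, which fits the PASSED overall assessment; it does not say anything about the unreadable sectors the kernel and the extended self-test reported, so the two views should be read together.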