I have a disk that has failed; there seem to be damaged areas that 
cause errors when specific files are accessed.  This disk was one of a two-disk 
mirror running raidframe.  The disk has been replaced and the original machine 
is back up and running again.
        However, as I use a second computer to investigate the failed disk, I 
have been puzzled to find that this second computer locks up and stops 
responding when I try copying files that include damaged areas of the disk.  

        This second computer has an installation of OpenBSD 4.6, with the 
kernel recompiled to support raidframe (so I can access the data on the 
partition); I have also adjusted the drive numbering so that the failed drive 
believes it is the only disk present in its mirror.  On this second computer, 
the operating system is on a completely different physical disk; the failed 
disk is not necessary for a completely functional system.
        Even though this computer doesn't use the failed disk for its root 
filesystem, it still freezes up and stops responding when the bad sectors are 
accessed.
        I even tried using the "dump" and "dd" utilities to access the disk 
through a raw, unmounted partition, but the host computer still freezes up and 
stops responding after adding a few lines to /var/log/messages.

        I was expecting the error messages, but I was not expecting the host 
system to freeze up; even the mouse stops responding.  It's irritating to have 
to reboot the computer each time I access one of the damaged sectors.
        I wondered whether the drive controller hardware never returns control 
to the operating system once the disk error has occurred too many times.  But 
the error messages do end up in /var/log/messages, so control does return to 
the operating system for at least a little while.

        And yes, repeatedly accessing the same file generates the error 
messages referring to the same sectors.

1.  How can I attempt to access the damaged sectors without causing the entire 
computer to freeze up and stop responding?

2.  I have used stat, ncheck, and fsdb to find and examine the inodes for 
various files.  Is there a utility to show which sectors of the filesystem 
and/or the drive are actually used by various files?

3.  How can I identify all the files that contain bad sectors without freezing 
up the computer on each file that contains one?

# mount
/dev/wd1a on / type ffs (local)
/dev/wd1e on /usr type ffs (local, read-only)
/dev/wd1g on /mnt3 type ffs (local, read-only)
/dev/wd1f on /mnt type ffs (local, read-only)
# fsck -f /dev/rraid2d
** /dev/rraid2d
** File system is already clean
** Last Mounted on /home-big
** Phase 1 - Check Blocks and Sizes
** Phase 2 - Check Pathnames
** Phase 3 - Check Connectivity
** Phase 4 - Check Reference Counts
** Phase 5 - Check Cyl groups
452600 files, 69774853 used, 43730370 free (26658 frags, 5462964 blocks, 0.0% fragmentation)

# mount -r /dev/raid2d /mnt2
# mount
/dev/wd1a on / type ffs (local)
/dev/wd1e on /usr type ffs (local, read-only)
/dev/wd1g on /mnt3 type ffs (local, read-only)
/dev/wd1f on /mnt type ffs (local, read-only)
/dev/raid2d on /mnt2 type ffs (local, read-only)

# dd conv=noerror,notrunc,sync \
> if=/mnt2/.../20198332.txt of=/dev/null count=1

        The computer stopped responding, but these messages appeared on the 
console and were in /var/log/messages after rebooting:
/var/log/messages
Jan 26 08:23:15 one /bsd: wd0f: uncorrectable data error reading fsbn 40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying
Jan 26 08:23:18 one /bsd: wd0: transfer error, downgrading to Ultra-DMA mode 4
Jan 26 08:23:18 one /bsd: wd0(pciide2:1:1): using PIO mode 4, Ultra-DMA mode 4
Jan 26 08:23:18 one /bsd: wd0f: uncorrectable data error reading fsbn 40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying
Jan 26 08:23:20 one /bsd: wd0: transfer error, downgrading to Ultra-DMA mode 3
Jan 26 08:23:20 one /bsd: wd0(pciide2:1:1): using PIO mode 4, Ultra-DMA mode 3
Jan 26 08:23:20 one /bsd: wd0f: uncorrectable data error reading fsbn 40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying
Jan 26 08:23:22 one /bsd: wd0: transfer error, downgrading to Ultra-DMA mode 2
Jan 26 08:23:22 one /bsd: wd0(pciide2:1:1): using PIO mode 4, Ultra-DMA mode 2
Jan 26 08:23:22 one /bsd: wd0f: uncorrectable data error reading fsbn 40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying
Jan 26 08:23:25 one /bsd: wd0f: uncorrectable data error reading fsbn 40104976 of 40104952-40104983 (wd0 bn 67174501; cn 4181 tn 106 sn 58), retrying
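
        As a rough sketch of how the distinct failing sector numbers could be 
collected from log entries like the ones above: the helper below (bad_sectors 
is a made-up name, not an existing utility) reads log text on standard input 
and prints each unique pair of partition-relative sector (fsbn) and whole-disk 
sector (wd0 bn).  It assumes every entry sits on a single line in the real 
/var/log/messages and matches the wording shown here.

```shell
#!/bin/sh
# Sketch only: bad_sectors is a hypothetical helper, not part of OpenBSD.
# Assumes one log entry per line, worded like the wd driver messages above.

# Read log text on stdin; print "fsbn disk_bn" pairs, unique and sorted.
bad_sectors() {
    sed -n 's/.*uncorrectable data error reading fsbn \([0-9][0-9]*\) of [0-9-]* (wd[0-9]* bn \([0-9][0-9]*\);.*/\1 \2/p' |
        sort -u
}

# Example usage (log path as given in this post):
#   bad_sectors < /var/log/messages
```

Because this only reads the log file, it does not touch the failed disk at all.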

        And the error messages are repeatable (especially the failed block 
numbers) if I repeat the command:
Jan 26 10:40:19 one /bsd: wd0f: uncorrectable data error reading fsbn 40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying
Jan 26 10:40:21 one /bsd: wd0f: uncorrectable data error reading fsbn 40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying
Jan 26 10:40:24 one /bsd: wd0f: uncorrectable data error reading fsbn 40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying
Jan 26 10:40:26 one /bsd: wd0: transfer error, downgrading to Ultra-DMA mode 4
Jan 26 10:40:26 one /bsd: wd0(pciide2:1:1): using PIO mode 4, Ultra-DMA mode 4
Jan 26 10:40:26 one /bsd: wd0f: uncorrectable data error reading fsbn 40104952 of 40104952-40104983 (wd0 bn 67174477; cn 4181 tn 106 sn 34), retrying
Jan 26 10:40:29 one /bsd: wd0f: uncorrectable data error reading fsbn 40104976 of 40104952-40104983 (wd0 bn 67174501; cn 4181 tn 106 sn 58), retrying
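
        The two kinds of block number in these messages can be related with a 
little arithmetic.  The sketch below uses only numbers copied from the log 
lines above; the variable names are mine.  The difference between "wd0 bn" and 
"fsbn" comes out the same for both failing sectors, which is presumably the 
sector at which partition f starts on wd0 (the disklabel for wd0 would confirm 
or refute this).  The byte count assumes 512-byte sectors.

```shell
#!/bin/sh
# Numbers copied from the wd0f error messages above.
fsbn_first=40104952   # first sector of the failing transfer, relative to wd0f
fsbn_last=40104983    # last sector of the failing transfer
disk_bn_1=67174477    # "wd0 bn" logged together with fsbn 40104952
fsbn_2=40104976       # second failing sector, relative to wd0f
disk_bn_2=67174501    # "wd0 bn" logged together with fsbn 40104976

# Whole-disk sector minus partition-relative sector.
offset_1=$((disk_bn_1 - fsbn_first))
offset_2=$((disk_bn_2 - fsbn_2))

# Size of the transfer that fails (assumes 512-byte sectors).
span_sectors=$((fsbn_last - fsbn_first + 1))
span_bytes=$((span_sectors * 512))

echo "offset from pair 1: $offset_1"
echo "offset from pair 2: $offset_2"
echo "failing transfer: $span_sectors sectors, $span_bytes bytes"
```

Both offsets evaluate to 27069525, and the failing transfer covers 32 sectors, 
or 16384 bytes.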

        However, none of these commands causes any problem (no error messages 
and no freezing up):
# dd conv=noerror,notrunc,sync \
> if=/dev/wd0f skip=40104951 of=/dev/null count=1
1+0 records in
1+0 records out
512 bytes transferred in 0.014 secs (34222 bytes/sec)
# dd conv=noerror,notrunc,sync \
> if=/dev/wd0f skip=40104952 of=/dev/null count=1
1+0 records in
1+0 records out
512 bytes transferred in 0.000 secs (2612245 bytes/sec)
# dd conv=noerror,notrunc,sync \
> if=/dev/wd0f skip=67174501 of=/dev/null count=1
1+0 records in
1+0 records out
512 bytes transferred in 0.011 secs (43813 bytes/sec)
# dd conv=noerror,notrunc,sync \
> if=/dev/raid2d skip=40104952 of=/dev/null count=1
dd: /dev/raid2d: Device busy
# umount /mnt2
# dd conv=noerror,notrunc,sync \
> if=/dev/raid2d skip=40104952 of=/dev/null count=1
1+0 records in
1+0 records out
512 bytes transferred in 0.013 secs (37083 bytes/sec)
#
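
        One thing worth noting about the commands above: dd counts skip= in 
blocks from the start of whatever device is named in if=, so skip=67174501 on 
/dev/wd0f addresses a sector that far into partition f, whereas the logged 
"wd0 bn 67174501" is counted from the start of the whole disk.  The sketch 
below only prints dd command lines and does not run them.  The helper names 
are hypothetical, the offset 27069525 is the difference 67174477 - 40104952 
taken from the log, and the use of the raw devices /dev/rwd0c (whole disk) and 
/dev/rwd0f is an assumption on my part.

```shell
#!/bin/sh
# Sketch only: these helpers print dd command lines, they do not execute them.
PART_OFFSET=27069525   # 67174477 - 40104952, taken from the log messages

# Convert a sector number relative to wd0f into a whole-disk sector number.
part_to_disk() {
    echo $(($1 + PART_OFFSET))
}

# Convert a whole-disk sector number into one relative to wd0f.
disk_to_part() {
    echo $(($1 - PART_OFFSET))
}

# dd_cmd DEVICE SECTOR: print a dd command that reads one 512-byte sector.
dd_cmd() {
    echo "dd if=$1 of=/dev/null bs=512 skip=$2 count=1"
}

# The same logged sector, addressed through the whole disk and through wd0f.
# Raw device names are an assumption; the commands above used block devices.
dd_cmd /dev/rwd0c 67174501
dd_cmd /dev/rwd0f "$(disk_to_part 67174501)"
```

Whether reading these sectors directly would also lock up the machine is 
something this sketch cannot tell; it only keeps the two numbering schemes 
apart.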

        Here is the dmesg:
OpenBSD 4.6 (RAID110125) #0: Tue Jan 25 03:11:29 MST 2011
    r...@one.my.domain:/usr/src/sys/arch/i386/compile/RAID110125
cpu0: Intel(R) Pentium(R) 4 CPU 3.40GHz ("GenuineIntel" 686-class) 3.40 GHz
cpu0: FPU,V86,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CFLUSH,DS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,SBF,SSE3,MWAIT,DS-CPL,EST,CNXT-ID,CX16,xTPR
real mem  = 1073246208 (1023MB)
avail mem = 1028554752 (980MB)
mainbus0 at root
bios0 at mainbus0: AT/286+ BIOS, date 03/10/05, BIOS32 rev. 0 @ 0xfaad0, SMBIOS rev. 2.3 @ 0xf0100 (25 entries)
bios0: vendor Phoenix Technologies Ltd. version "F2" date 03/10/2005
bios0: Gigabyte Technology Co., Ltd. 0000000000
acpi0 at bios0: rev 0
acpi0: tables DSDT FACP MCFG APIC SSDT SSDT
acpi0: wakeup devices PEX0(S5) PEX1(S5) PEX2(S5) PEX3(S5) HUB0(S5) UAR1(S1) PS2M(S1) PS2K(S1) USB0(S4) USB1(S4) USB2(S4) USB3(S4) USBE(S4) AC97(S5) MC97(S5) AZAL(S5) PCI0(S5)
acpitimer0 at acpi0: 3579545 Hz, 24 bits
acpimadt0 at acpi0 addr 0xfee00000: PC-AT compat
cpu0 at mainbus0: apid 0 (boot processor)
cpu0: apic clock running at 199MHz
ioapic0 at mainbus0: apid 2 pa 0xfec00000, version 20, 24 pins
acpiprt0 at acpi0: bus 0 (PCI0)
acpiprt1 at acpi0: bus 2 (PEX0)
acpiprt2 at acpi0: bus 3 (PEX1)
acpiprt3 at acpi0: bus -1 (PEX2)
acpiprt4 at acpi0: bus -1 (PEX3)
acpiprt5 at acpi0: bus 4 (HUB0)
acpicpu0 at acpi0: FVS, 3400, 2800 MHz
acpibtn0 at acpi0: PWRB
bios0: ROM list: 0xc0000/0xec00 0xd0000/0x1800 0xef000/0x1000!
pci0 at mainbus0 bus 0: configuration mode 1 (bios)
pchb0 at pci0 dev 0 function 0 "Intel 82925X Host" rev 0x05
ppb0 at pci0 dev 1 function 0 "Intel 82925X PCIE" rev 0x05: apic 2 int 16 (irq 5)
pci1 at ppb0 bus 1
vga1 at pci1 dev 0 function 0 "NVIDIA GeForce 7600 GS" rev 0xa1
wsdisplay0 at vga1 mux 1: console (80x25, vt100 emulation)
wsdisplay0: screen 1-5 added (80x25, vt100 emulation)
azalia0 at pci0 dev 27 function 0 "Intel 82801FB HD Audio" rev 0x03: apic 2 int 16 (irq 5)
azalia0: codecs: Realtek ALC260
audio0 at azalia0
ppb1 at pci0 dev 28 function 0 "Intel 82801FB PCIE" rev 0x03: apic 2 int 16 (irq 5)
pci2 at ppb1 bus 2
ppb2 at pci0 dev 28 function 1 "Intel 82801FB PCIE" rev 0x03: apic 2 int 17 (irq 10)
pci3 at ppb2 bus 3
bge0 at pci3 dev 0 function 0 "Broadcom BCM5751" rev 0x01, BCM5750 A1 (0x4001): apic 2 int 17 (irq 10), address 00:14:85:1d:03:a8
brgphy0 at bge0 phy 1: BCM5750 10/100/1000baseT PHY, rev. 0
uhci0 at pci0 dev 29 function 0 "Intel 82801FB USB" rev 0x03: apic 2 int 23 (irq 3)
uhci1 at pci0 dev 29 function 1 "Intel 82801FB USB" rev 0x03: apic 2 int 19 (irq 11)
uhci2 at pci0 dev 29 function 2 "Intel 82801FB USB" rev 0x03: apic 2 int 18 (irq 11)
uhci3 at pci0 dev 29 function 3 "Intel 82801FB USB" rev 0x03: apic 2 int 16 (irq 5)
ehci0 at pci0 dev 29 function 7 "Intel 82801FB USB" rev 0x03: apic 2 int 23 (irq 3)
usb0 at ehci0: USB revision 2.0
uhub0 at usb0 "Intel EHCI root hub" rev 2.00/1.00 addr 1
ppb3 at pci0 dev 30 function 0 "Intel 82801BA Hub-to-PCI" rev 0xd3
pci4 at ppb3 bus 4
"TI TSB43AB23 FireWire" rev 0x00 at pci4 dev 5 function 0 not configured
ichpcib0 at pci0 dev 31 function 0 "Intel 82801FB LPC" rev 0x03: PM disabled
pciide0 at pci0 dev 31 function 1 "Intel 82801FB IDE" rev 0x03: DMA, channel 0 configured to compatibility, channel 1 configured to compatibility
atapiscsi0 at pciide0 channel 0 drive 0
scsibus0 at atapiscsi0: 2 targets
cd0 at scsibus0 targ 0 lun 0: <LITE-ON, DVDRW SHW-160P6S, PS08> ATAPI 5/cdrom removable
cd0(pciide0:0:0): using PIO mode 4, Ultra-DMA mode 4
pciide0: channel 1 disabled (no drives)
pciide2 at pci0 dev 31 function 2 "Intel 82801FR SATA" rev 0x03: DMA, channel 0 configured to native-PCI, channel 1 configured to native-PCI
pciide2: using apic 2 int 19 (irq 11) for native-PCI interrupt
wd1 at pciide2 channel 0 drive 0: <ST31000520AS>
wd1: 16-sector PIO, LBA48, 953869MB, 1953525168 sectors
wd1(pciide2:0:0): using PIO mode 4, Ultra-DMA mode 5
wd0 at pciide2 channel 1 drive 1: <SAMSUNG SP2504C>
wd0: 16-sector PIO, LBA48, 238475MB, 488397168 sectors
wd0(pciide2:1:1): using PIO mode 4, Ultra-DMA mode 5
ichiic0 at pci0 dev 31 function 3 "Intel 82801FB SMBus" rev 0x03: apic 2 int 19 (irq 11)
iic0 at ichiic0
spdmem0 at iic0 addr 0x50: 1GB DDR2 SDRAM non-parity PC2-4200CL3
usb1 at uhci0: USB revision 1.0
uhub1 at usb1 "Intel UHCI root hub" rev 1.00/1.00 addr 1
usb2 at uhci1: USB revision 1.0
uhub2 at usb2 "Intel UHCI root hub" rev 1.00/1.00 addr 1
usb3 at uhci2: USB revision 1.0
uhub3 at usb3 "Intel UHCI root hub" rev 1.00/1.00 addr 1
usb4 at uhci3: USB revision 1.0
uhub4 at usb4 "Intel UHCI root hub" rev 1.00/1.00 addr 1
isa0 at ichpcib0
isadma0 at isa0
com0 at isa0 port 0x3f8/8 irq 4: ns16550a, 16 byte fifo
pckbc0 at isa0 port 0x60/5
pckbd0 at pckbc0 (kbd slot)
pckbc0: using irq 1 for kbd slot
wskbd0 at pckbd0: console keyboard, using wsdisplay0
pmsi0 at pckbc0 (aux slot)
pckbc0: using irq 12 for aux slot
wsmouse0 at pmsi0 mux 0
pcppi0 at isa0 port 0x61
midi0 at pcppi0: <PC speaker>
spkr0 at pcppi0
lpt0 at isa0 port 0x378/4 irq 7
it0 at isa0 port 0x2e/2: IT8712F rev 7, EC port 0x290
npx0 at isa0 port 0xf0/16: reported by CPUID; using exception 16
fdc0 at isa0 port 0x3f0/6 irq 6 drq 2
mtrr: Pentium Pro MTRR support
Kernelized RAIDframe activated
umass0 at uhub0 port 8 configuration 1 interface 0 "Generic USB2.0 Card Reader" rev 2.00/1.9c addr 2
umass0: using SCSI over Bulk-Only
scsibus1 at umass0: 2 targets, initiator 0
sd0 at scsibus1 targ 1 lun 0: <Generic, IC1210 CF, 1.9C> SCSI0 0/direct removable
sd0: drive offline
sd1 at scsibus1 targ 1 lun 1: <Generic, IC1210 MS, 1.9C> SCSI0 0/direct removable
sd1: drive offline
sd2 at scsibus1 targ 1 lun 2: <Generic, IC1210 MMC/SD, 1.9C> SCSI0 0/direct removable
sd2: drive offline
sd3 at scsibus1 targ 1 lun 3: <Generic, IC1210 SM, 1.9C> SCSI0 0/direct removable
sd3: drive offline
cd0(atapiscsi0:0:0): Check Condition (error 0x70) on opcode 0x0
    SENSE KEY: Not Ready
     ASC/ASCQ: Medium Not Present
softraid0 at root
root on wd1a swap on wd1b dump on wd1b
WARNING: / was not properly unmounted
raidlookup on device: /dev/wd2f failed !
Hosed component: /dev/wd2f.
Hosed component: /dev/wd2f.
raid2: Component /dev/wd0f being configured at row: 0 col: 0
         Row: 0 Column: 0 Num Rows: 1 Num Columns: 2
         Version: 2 Serial Number: 2007112802 Mod Counter: 364
         Clean: No Status: 0
/dev/wd0f is not clean !
raid2: Ignoring /dev/wd2f.
raid2 at root

        Here is the raid configuration file:
# cat /etc/raid-stuff/raid2-big.conf
START array
# numRow numCol numSpare
1 2 0

START disks
/dev/wd0f
/dev/wd2f

START layout
# sectPerSU SUsPerParityUnit SUsPerReconUnit RAID_level_0
128 1 1 1

START queue
fifo 100
