Hello.

In a server of mine (7.3p4/i386) I replaced a 1TB Hitachi SATA drive (which worked perfectly) with two brand new Western Digital 2TB disks. Now I'm having critical problems, ranging from the disks getting stuck to the box rebooting. These are not the main disks in the box, so they are currently unmounted; I wasn't even able to run newfs on them, since every process that tries to use these disks hangs after a while (and can't be killed either).

The box is based on an Intel S5000 motherboard, and the drives are attached to the motherboard's SATA ports in a hot-swap enclosure.



First, what I think might be the relevant part of dmesg:

FreeBSD 7.3-RELEASE-p4 #1: Wed Dec 15 11:53:13 CET 2010
r...@xxxxx.xxxxxxxx.xx:/usr/obj/usr/src/sys/XXXXX i386
Timecounter "i8254" frequency 1193182 Hz quality 0
CPU: Intel(R) Xeon(R) CPU           E5405  @ 2.00GHz (2004.99-MHz 686-class CPU)
Origin = "GenuineIntel"  Id = 0x10676  Stepping = 6
Features=0xbfebfbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,CLFLUSH,DTS,ACPI,MMX,FXSR,SSE,SSE2,SS,HTT,TM,PBE>
Features2=0xce33d<SSE3,DTES64,MON,DS_CPL,VMX,TM2,SSSE3,CX16,xTPR,PDCM,DCA,SSE4.1>
AMD Features=0x20100000<NX,LM>
AMD Features2=0x1<LAHF>
Cores per package: 4
real memory  = 2143289344 (2044 MB)
avail memory = 2090176512 (1993 MB)
ACPI APIC Table: <INTEL  S5000PSL>
FreeBSD/SMP: Multiprocessor System Detected: 4 CPUs
...
acpi0: <INTEL S5000PSL> on motherboard
acpi0: [ITHREAD]
acpi0: Power Button (fixed)
acpi0: reservation of 0, a0000 (3) failed
Timecounter "ACPI-fast" frequency 3579545 Hz quality 1000
acpi_timer0: <24-bit timer at 3.579545MHz> port 0x408-0x40b on acpi0
acpi_hpet0: <High Precision Event Timer> iomem 0xfed00000-0xfed003ff on acpi0
Timecounter "HPET" frequency 14318180 Hz quality 900
acpi_button0: <Sleep Button> on acpi0
acpi_button1: <Power Button> on acpi0
pcib0: <ACPI Host-PCI bridge> port 0xca2,0xca3,0xcf8-0xcff on acpi0
pci0: <ACPI PCI bus> on pcib0
pcib1: <ACPI PCI-PCI bridge> at device 2.0 on pci0
pci1: <ACPI PCI bus> on pcib1
pcib2: <ACPI PCI-PCI bridge> irq 16 at device 0.0 on pci1
pci2: <ACPI PCI bus> on pcib2
pcib3: <ACPI PCI-PCI bridge> irq 16 at device 0.0 on pci2
pci3: <ACPI PCI bus> on pcib3
pcib4: <PCI-PCI bridge> at device 0.0 on pci3
pci4: <PCI bus> on pcib4
...
pcib5: <PCI-PCI bridge> at device 0.2 on pci3
pci5: <PCI bus> on pcib5
pcib6: <ACPI PCI-PCI bridge> irq 18 at device 2.0 on pci2
pci6: <ACPI PCI bus> on pcib6
...
pcib7: <ACPI PCI-PCI bridge> at device 0.3 on pci1
pci7: <ACPI PCI bus> on pcib7
...
pcib8: <PCI-PCI bridge> at device 3.0 on pci0
pci8: <PCI bus> on pcib8
pcib9: <ACPI PCI-PCI bridge> at device 4.0 on pci0
pci9: <ACPI PCI bus> on pcib9
pcib10: <ACPI PCI-PCI bridge> at device 5.0 on pci0
pci10: <ACPI PCI bus> on pcib10
pcib11: <ACPI PCI-PCI bridge> at device 6.0 on pci0
pci11: <ACPI PCI bus> on pcib11
pcib12: <PCI-PCI bridge> at device 7.0 on pci0
pci12: <PCI bus> on pcib12
pci0: <base peripheral> at device 8.0 (no driver attached)
pcib13: <ACPI PCI-PCI bridge> irq 16 at device 28.0 on pci0
pci13: <ACPI PCI bus> on pcib13
...
pcib14: <ACPI PCI-PCI bridge> at device 30.0 on pci0
pci14: <ACPI PCI bus> on pcib14
...
atapci1: <Intel 63XXESB2 SATA300 controller> port 0x40d8-0x40df,0x40f4-0x40f7,0x40d0-0x40d7,0x40f0-0x40f3,0x4020-0x403f mem 0xb9000000-0xb90003ff irq 20 at device 31.2 on pci0
atapci1: [ITHREAD]
atapci1: AHCI called from vendor specific driver
atapci1: AHCI Version 01.10 controller with 6 ports detected
ata2: <ATA channel 0> on atapci1
ata2: [ITHREAD]
ata3: <ATA channel 1> on atapci1
ata3: [ITHREAD]
ata4: <ATA channel 2> on atapci1
ata4: [ITHREAD]
ata5: <ATA channel 3> on atapci1
ata5: [ITHREAD]
ata6: <ATA channel 4> on atapci1
ata6: [ITHREAD]
ata7: <ATA channel 5> on atapci1
ata7: [ITHREAD]
...
ad4: 1907729MB <WDC WD20EARS-00MVWB0 51.0AB51> at ata2-master SATA300
ad8: 1907729MB <WDC WD20EARS-00MVWB0 51.0AB51> at ata4-master SATA300
...
GEOM_STRIPE: Device backup created (id=912470894).
GEOM_STRIPE: Disk ad4 attached to backup.
GEOM_STRIPE: Disk ad8 attached to backup.
GEOM_STRIPE: Device backup activated.
...



Following are some samples of the messages I get in the logs:
ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad4: WARNING - SMART taskqueue timeout - completing request directly
ad8: WARNING - SMART taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
ad8: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
ad4: FAILURE - SMART timed out
ad8: WARNING - SMART freeing taskqueue zombie request
ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
ad8: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad8: WARNING - SET_MULTI taskqueue timeout - completing request directly
ad8: FAILURE - SMART timed out
ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
ad8: WARNING - ATA_IDENTIFY taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
ad4: FAILURE - SET_MULTI timed out
ad8: WARNING - ATA_IDENTIFY freeing taskqueue zombie request
ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES SET TRANSFER MODE requeued due to channel reset
ad4: WARNING - SETFEATURES SET TRANSFER MODE requeued due to channel reset
ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad8: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad8: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
ad4: WARNING - SMART taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad8: WARNING - SET_MULTI taskqueue timeout - completing request directly
ad8: FAILURE - ATA_IDENTIFY timed out LBA=0
ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
ad8: WARNING - ATAPI_IDENTIFY taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
ad4: FAILURE - SMART timed out
ad8: WARNING - ATAPI_IDENTIFY freeing taskqueue zombie request
ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
ad8: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad8: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad8: WARNING - SETFEATURES ENABLE WCACHE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing request directly
ad4: WARNING - SETFEATURES ENABLE RCACHE taskqueue timeout - completing request directly
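
For anyone who wants to see which commands time out most often in messages like the ones above, here is a minimal Python sketch that tallies them per device. It is only an illustration: the function name `tally_ata_messages` and the list of message suffixes are my own assumptions, derived from the samples in this mail and not from the ata(4) driver source. Continuation lines of wrapped messages are simply skipped.

```python
import re
from collections import Counter

# Start of a kernel message such as
# "ad4: WARNING - SET_MULTI taskqueue timeout - completing request directly".
MESSAGE_RE = re.compile(r"^(ad\d+): (WARNING|FAILURE) - (.+)$")

# Phrases that follow the command name (assumed from the samples in this mail).
SUFFIXES = (
    " taskqueue timeout",
    " freeing taskqueue zombie request",
    " requeued due to channel reset",
    " timed out",
)


def tally_ata_messages(lines):
    """Count (device, severity, command) triples in kernel log lines."""
    counts = Counter()
    for line in lines:
        match = MESSAGE_RE.match(line.strip())
        if not match:
            continue  # wrapped continuation line or unrelated text
        device, severity, rest = match.groups()
        command = rest
        for suffix in SUFFIXES:
            index = rest.find(suffix)
            if index != -1:
                command = rest[:index]  # keep only the command name
                break
        counts[(device, severity, command)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "ad4: WARNING - SETFEATURES SET TRANSFER MODE taskqueue timeout - completing ",
        "request directly",
        "ad4: FAILURE - SMART timed out",
        "ad8: FAILURE - ATA_IDENTIFY timed out LBA=0",
    ]
    for key, count in sorted(tally_ata_messages(sample).items()):
        print(count, *key)
```

Feeding it the contents of /var/log/messages (for example `tally_ata_messages(open("/var/log/messages"))`) should give a rough per-disk summary.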


smartctl -a gives:
smartctl 5.40 2010-10-16 r3189 [FreeBSD 7.3-RELEASE-p4 i386] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Green (Adv. Format) family
Device Model:     WDC WD20EARS-00MVWB0
Serial Number:    WD-WMAZA4718261
Firmware Version: 51.0AB51
User Capacity:    2,000,398,934,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Thu Jun  2 14:08:08 2011 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (  41) The self-test routine was interrupted
                                        by the host with a hard or soft reset.
Total time to complete Offline
data collection:                 (38580) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 255) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x3035) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   100   253   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   253   253   021    Pre-fail  Always       -       1058
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       9
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       22
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       8
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       7
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       67
194 Temperature_Celsius     0x0022   118   114   000    Old_age   Always       -       32
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       11

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended captive    Interrupted (host reset)      90%        21         -
# 2  Extended captive    Interrupted (host reset)      90%        21         -
# 3  Conveyance captive  Completed without error       00%        20         -
# 4  Short captive       Completed without error       00%        20         -
# 5  Short captive       Interrupted (host reset)      90%         1         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

(This is for one drive; the output for the other is almost identical.)
Note that I can't complete a long test, because the box crashes, dumps and reboots before it finishes.



Following is a backtrace from one of the crash dumps:
# kgdb kernel.debug /var/crash/vmcore.14
GNU gdb 6.1.1 [FreeBSD]
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB.  Type "show warranty" for details.
This GDB was configured as "i386-marcel-freebsd"...

Unread portion of the kernel message buffer:
ad8: WARNING - SET_MULTI taskqueue timeout - completing request directly
ad8: WARNING - SET_MULTI requeued due to channel reset
ad8: FAILURE - SET_MULTI timed out


Fatal trap 12: page fault while in kernel mode
cpuid = 0; apic id = 00
fault virtual address   = 0x188
fault code              = supervisor read, page not present
instruction pointer     = 0x20:0xc05553d4
stack pointer           = 0x28:0xe8efca8c
frame pointer           = 0x28:0xe8efcaa4
code segment            = base 0x0, limit 0xfffff, type 0x1b
                        = DPL 0, pres 1, def32 1, gran 1
processor eflags        = interrupt enabled, resume, IOPL = 0
current process         = 8125 (smartctl)
trap number             = 12
panic: page fault
cpuid = 0
Uptime: 37m18s
Physical memory: 2033 MB
Dumping 151 MB: 136 120 104 88 72 56 40 24 8

Reading symbols from /boot/kernel/splash_bmp.ko...Reading symbols from /boot/kernel/splash_bmp.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/splash_bmp.ko
Reading symbols from /boot/kernel/geom_stripe.ko...Reading symbols from /boot/kernel/geom_stripe.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/geom_stripe.ko
Reading symbols from /boot/kernel/acpi.ko...Reading symbols from /boot/kernel/acpi.ko.symbols...done.
done.
Loaded symbols for /boot/kernel/acpi.ko
#0  doadump () at pcpu.h:196
196             __asm __volatile("movl %%fs:0,%0" : "=r" (td));
(kgdb) bt
#0  doadump () at pcpu.h:196
#1  0xc0563d48 in boot (howto=260) at /usr/src/sys/kern/kern_shutdown.c:418
#2  0xc0564025 in panic (fmt=Variable "fmt" is not available.
) at /usr/src/sys/kern/kern_shutdown.c:574
#3  0xc0732764 in trap_fatal (frame=0xe8efca4c, eva=392) at /usr/src/sys/i386/i386/trap.c:950
#4  0xc07329b4 in trap_pfault (frame=0xe8efca4c, usermode=0, eva=392) at /usr/src/sys/i386/i386/trap.c:863
#5  0xc0733351 in trap (frame=0xe8efca4c) at /usr/src/sys/i386/i386/trap.c:541
#6  0xc0718abb in calltrap () at /usr/src/sys/i386/i386/exception.s:166
#7  0xc05553d4 in _mtx_lock_sleep (m=0xc5d84acc, tid=3328172032, opts=0, file=0x0, line=0) at /usr/src/sys/kern/kern_mutex.c:339
#8  0xc056300b in _sema_post (sema=0xc5d84acc, file=0x0, line=0) at /usr/src/sys/kern/kern_sema.c:79
#9  0xc047cf0c in ata_completed (context=0xc5d84a80, dummy=0) at /usr/src/sys/dev/ata/ata-queue.c:490
#10 0xc047c7d5 in ata_queue_request (request=0xc5d84a80) at /usr/src/sys/dev/ata/ata-queue.c:112
#11 0xc046439f in ata_device_ioctl (dev=0xc507d200, cmd=3224920420, data=0xc5cc12c0 "¡") at /usr/src/sys/dev/ata/ata-all.c:493
#12 0xc04769e9 in ad_ioctl (disk=0xc53cac00, cmd=3224920420, data=0xc5cc12c0, flag=1, td=0xc65fe000) at /usr/src/sys/dev/ata/ata-disk.c:373
#13 0xc050d83b in g_disk_ioctl (pp=0xc5572d00, cmd=3224920420, data=0xc5cc12c0, fflag=1, td=0xc65fe000) at /usr/src/sys/geom/geom_disk.c:231
#14 0xc050cc3e in g_dev_ioctl (dev=0xc5556600, cmd=3224920420, data=0xc5cc12c0 "¡", fflag=1, td=0xc65fe000) at /usr/src/sys/geom/geom_dev.c:332
#15 0xc0502dbf in devfs_ioctl_f (fp=0xc64aba18, com=3224920420, data=0xc5cc12c0, cred=0xc63ee100, td=0xc65fe000) at /usr/src/sys/fs/devfs/devfs_vnops.c:602
#16 0xc059d075 in kern_ioctl (td=0xc65fe000, fd=3, com=3224920420, data=0xc5cc12c0 "¡") at file.h:269
#17 0xc059d1ad in ioctl (td=0xc65fe000, uap=0xe8efccfc) at /usr/src/sys/kern/sys_generic.c:571
#18 0xc0732cf5 in syscall (frame=0xe8efcd38) at /usr/src/sys/i386/i386/trap.c:1101
#19 0xc0718b20 in Xint0x80_syscall () at /usr/src/sys/i386/i386/exception.s:262
#20 0x00000033 in ?? ()
Previous frame inner to this frame (corrupt stack?)
(kgdb)



Please, I'm really desperate; any help is appreciated.
Is this a known problem? Should I upgrade? Are there any settings I can try? Patches?


 Bye & Thanks
        av.