[releng_7 tinderbox] failure on i386/pc98
TB --- 2010-05-16 01:22:51 - tinderbox 2.6 running on freebsd-stable.sentex.ca
TB --- 2010-05-16 01:22:51 - starting RELENG_7 tinderbox run for i386/pc98
TB --- 2010-05-16 01:22:51 - cleaning the object tree
TB --- 2010-05-16 01:23:07 - cvsupping the source tree
TB --- 2010-05-16 01:23:07 - /usr/bin/csup -z -r 3 -g -L 1 -h localhost -s /tinderbox/RELENG_7/i386/pc98/supfile
TB --- 2010-05-16 01:23:15 - building world
TB --- 2010-05-16 01:23:15 - MAKEOBJDIRPREFIX=/obj
TB --- 2010-05-16 01:23:15 - PATH=/usr/bin:/usr/sbin:/bin:/sbin
TB --- 2010-05-16 01:23:15 - TARGET=pc98
TB --- 2010-05-16 01:23:15 - TARGET_ARCH=i386
TB --- 2010-05-16 01:23:15 - TZ=UTC
TB --- 2010-05-16 01:23:15 - __MAKE_CONF=/dev/null
TB --- 2010-05-16 01:23:15 - cd /src
TB --- 2010-05-16 01:23:15 - /usr/bin/make -B buildworld
>>> World build started on Sun May 16 01:23:16 UTC 2010
>>> Rebuilding the temporary build tree
>>> stage 1.1: legacy release compatibility shims
>>> stage 1.2: bootstrap tools
>>> stage 2.1: cleaning up the object tree
>>> stage 2.2: rebuilding the object tree
>>> stage 2.3: build tools
>>> stage 3: cross tools
>>> stage 4.1: building includes
>>> stage 4.2: building libraries
>>> stage 4.3: make dependencies
>>> stage 4.4: building everything
>>> World build completed on Sun May 16 02:27:10 UTC 2010
TB --- 2010-05-16 02:27:10 - generating LINT kernel config
TB --- 2010-05-16 02:27:10 - cd /src/sys/pc98/conf
TB --- 2010-05-16 02:27:10 - /usr/bin/make -B LINT
TB --- 2010-05-16 02:27:10 - building LINT kernel
TB --- 2010-05-16 02:27:10 - MAKEOBJDIRPREFIX=/obj
TB --- 2010-05-16 02:27:10 - PATH=/usr/bin:/usr/sbin:/bin:/sbin
TB --- 2010-05-16 02:27:10 - TARGET=pc98
TB --- 2010-05-16 02:27:10 - TARGET_ARCH=i386
TB --- 2010-05-16 02:27:10 - TZ=UTC
TB --- 2010-05-16 02:27:10 - __MAKE_CONF=/dev/null
TB --- 2010-05-16 02:27:10 - cd /src
TB --- 2010-05-16 02:27:10 - /usr/bin/make -B buildkernel KERNCONF=LINT
>>> Kernel build for LINT started on Sun May 16 02:27:10 UTC 2010
>>> stage 1: configuring the kernel
>>> stage 2.1: cleaning up the object tree
>>> stage 2.2: rebuilding the object tree
>>> stage 2.3: build tools
>>> stage 3.1: making dependencies
>>> stage 3.2: building everything
[...]
echo elink_reset elink_idseq > export_syms
awk -f /src/sys/conf/kmod_syms.awk elink.kld export_syms | xargs -J% objcopy % elink.kld
ld -Bshareable -d -warn-common -o elink.ko elink.kld
objcopy --strip-debug elink.ko
===> em (all)
cc -O2 -fno-strict-aliasing -pipe -DPC98 -D_KERNEL -DKLD_MODULE -std=c99 -nostdinc -I/src/sys/modules/em/../../dev/e1000 -DHAVE_KERNEL_OPTION_HEADERS -include /obj/pc98/src/sys/LINT/opt_global.h -I. -I@ -I@/contrib/altq -finline-limit=8000 --param inline-unit-growth=100 --param large-function-growth=1000 -fno-common -I/obj/pc98/src/sys/LINT -mno-align-long-strings -mpreferred-stack-boundary=2 -mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3 -ffreestanding -Wall -Wredundant-decls -Wnested-externs -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Winline -Wcast-qual -Wundef -Wno-pointer-sign -fformat-extensions -c /src/sys/modules/em/../../dev/e1000/if_em.c
/src/sys/modules/em/../../dev/e1000/if_em.c:1350: error: conflicting types for 'em_poll'
/src/sys/modules/em/../../dev/e1000/if_em.c:287: error: previous declaration of 'em_poll' was here
*** Error code 1
Stop in /src/sys/modules/em.
*** Error code 1
Stop in /src/sys/modules.
*** Error code 1
Stop in /obj/pc98/src/sys/LINT.
*** Error code 1
Stop in /src.
*** Error code 1
Stop in /src.
TB --- 2010-05-16 02:43:35 - WARNING: /usr/bin/make returned exit code 1
TB --- 2010-05-16 02:43:35 - ERROR: failed to build lint kernel
TB --- 2010-05-16 02:43:35 - 4100.66 user 415.24 system 4844.64 real

http://tinderbox.freebsd.org/tinderbox-releng_7-RELENG_7-i386-pc98.full
___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
[releng_7 tinderbox] failure on i386/i386
TB --- 2010-05-16 00:48:02 - tinderbox 2.6 running on freebsd-stable.sentex.ca
TB --- 2010-05-16 00:48:02 - starting RELENG_7 tinderbox run for i386/i386
TB --- 2010-05-16 00:48:02 - cleaning the object tree
TB --- 2010-05-16 00:48:29 - cvsupping the source tree
TB --- 2010-05-16 00:48:29 - /usr/bin/csup -z -r 3 -g -L 1 -h localhost -s /tinderbox/RELENG_7/i386/i386/supfile
TB --- 2010-05-16 00:48:38 - building world
TB --- 2010-05-16 00:48:38 - MAKEOBJDIRPREFIX=/obj
TB --- 2010-05-16 00:48:38 - PATH=/usr/bin:/usr/sbin:/bin:/sbin
TB --- 2010-05-16 00:48:38 - TARGET=i386
TB --- 2010-05-16 00:48:38 - TARGET_ARCH=i386
TB --- 2010-05-16 00:48:38 - TZ=UTC
TB --- 2010-05-16 00:48:38 - __MAKE_CONF=/dev/null
TB --- 2010-05-16 00:48:38 - cd /src
TB --- 2010-05-16 00:48:38 - /usr/bin/make -B buildworld
>>> World build started on Sun May 16 00:48:39 UTC 2010
>>> Rebuilding the temporary build tree
>>> stage 1.1: legacy release compatibility shims
>>> stage 1.2: bootstrap tools
>>> stage 2.1: cleaning up the object tree
>>> stage 2.2: rebuilding the object tree
>>> stage 2.3: build tools
>>> stage 3: cross tools
>>> stage 4.1: building includes
>>> stage 4.2: building libraries
>>> stage 4.3: make dependencies
>>> stage 4.4: building everything
>>> World build completed on Sun May 16 01:52:32 UTC 2010
TB --- 2010-05-16 01:52:32 - generating LINT kernel config
TB --- 2010-05-16 01:52:32 - cd /src/sys/i386/conf
TB --- 2010-05-16 01:52:32 - /usr/bin/make -B LINT
TB --- 2010-05-16 01:52:32 - building LINT kernel
TB --- 2010-05-16 01:52:32 - MAKEOBJDIRPREFIX=/obj
TB --- 2010-05-16 01:52:32 - PATH=/usr/bin:/usr/sbin:/bin:/sbin
TB --- 2010-05-16 01:52:32 - TARGET=i386
TB --- 2010-05-16 01:52:32 - TARGET_ARCH=i386
TB --- 2010-05-16 01:52:32 - TZ=UTC
TB --- 2010-05-16 01:52:32 - __MAKE_CONF=/dev/null
TB --- 2010-05-16 01:52:32 - cd /src
TB --- 2010-05-16 01:52:32 - /usr/bin/make -B buildkernel KERNCONF=LINT
>>> Kernel build for LINT started on Sun May 16 01:52:32 UTC 2010
>>> stage 1: configuring the kernel
>>> stage 2.1: cleaning up the object tree
>>> stage 2.2: rebuilding the object tree
>>> stage 2.3: build tools
>>> stage 3.1: making dependencies
>>> stage 3.2: building everything
[...]
echo elink_reset elink_idseq > export_syms
awk -f /src/sys/conf/kmod_syms.awk elink.kld export_syms | xargs -J% objcopy % elink.kld
ld -Bshareable -d -warn-common -o elink.ko elink.kld
objcopy --strip-debug elink.ko
===> em (all)
cc -O2 -fno-strict-aliasing -pipe -D_KERNEL -DKLD_MODULE -std=c99 -nostdinc -I/src/sys/modules/em/../../dev/e1000 -DHAVE_KERNEL_OPTION_HEADERS -include /obj/src/sys/LINT/opt_global.h -I. -I@ -I@/contrib/altq -finline-limit=8000 --param inline-unit-growth=100 --param large-function-growth=1000 -fno-common -I/obj/src/sys/LINT -mno-align-long-strings -mpreferred-stack-boundary=2 -mno-mmx -mno-3dnow -mno-sse -mno-sse2 -mno-sse3 -ffreestanding -Wall -Wredundant-decls -Wnested-externs -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Winline -Wcast-qual -Wundef -Wno-pointer-sign -fformat-extensions -c /src/sys/modules/em/../../dev/e1000/if_em.c
/src/sys/modules/em/../../dev/e1000/if_em.c:1350: error: conflicting types for 'em_poll'
/src/sys/modules/em/../../dev/e1000/if_em.c:287: error: previous declaration of 'em_poll' was here
*** Error code 1
Stop in /src/sys/modules/em.
*** Error code 1
Stop in /src/sys/modules.
*** Error code 1
Stop in /obj/src/sys/LINT.
*** Error code 1
Stop in /src.
*** Error code 1
Stop in /src.
TB --- 2010-05-16 02:12:38 - WARNING: /usr/bin/make returned exit code 1
TB --- 2010-05-16 02:12:38 - ERROR: failed to build lint kernel
TB --- 2010-05-16 02:12:38 - 4320.77 user 414.50 system 5076.45 real

http://tinderbox.freebsd.org/tinderbox-releng_7-RELENG_7-i386-i386.full
[releng_7 tinderbox] failure on amd64/amd64
TB --- 2010-05-15 23:34:21 - tinderbox 2.6 running on freebsd-stable.sentex.ca
TB --- 2010-05-15 23:34:21 - starting RELENG_7 tinderbox run for amd64/amd64
TB --- 2010-05-15 23:34:21 - cleaning the object tree
TB --- 2010-05-15 23:34:45 - cvsupping the source tree
TB --- 2010-05-15 23:34:45 - /usr/bin/csup -z -r 3 -g -L 1 -h localhost -s /tinderbox/RELENG_7/amd64/amd64/supfile
TB --- 2010-05-15 23:34:54 - building world
TB --- 2010-05-15 23:34:54 - MAKEOBJDIRPREFIX=/obj
TB --- 2010-05-15 23:34:54 - PATH=/usr/bin:/usr/sbin:/bin:/sbin
TB --- 2010-05-15 23:34:54 - TARGET=amd64
TB --- 2010-05-15 23:34:54 - TARGET_ARCH=amd64
TB --- 2010-05-15 23:34:54 - TZ=UTC
TB --- 2010-05-15 23:34:54 - __MAKE_CONF=/dev/null
TB --- 2010-05-15 23:34:54 - cd /src
TB --- 2010-05-15 23:34:54 - /usr/bin/make -B buildworld
>>> World build started on Sat May 15 23:34:55 UTC 2010
>>> Rebuilding the temporary build tree
>>> stage 1.1: legacy release compatibility shims
>>> stage 1.2: bootstrap tools
>>> stage 2.1: cleaning up the object tree
>>> stage 2.2: rebuilding the object tree
>>> stage 2.3: build tools
>>> stage 3: cross tools
>>> stage 4.1: building includes
>>> stage 4.2: building libraries
>>> stage 4.3: make dependencies
>>> stage 4.4: building everything
>>> stage 5.1: building 32 bit shim libraries
>>> World build completed on Sun May 16 01:04:55 UTC 2010
TB --- 2010-05-16 01:04:55 - generating LINT kernel config
TB --- 2010-05-16 01:04:55 - cd /src/sys/amd64/conf
TB --- 2010-05-16 01:04:55 - /usr/bin/make -B LINT
TB --- 2010-05-16 01:04:55 - building LINT kernel
TB --- 2010-05-16 01:04:55 - MAKEOBJDIRPREFIX=/obj
TB --- 2010-05-16 01:04:55 - PATH=/usr/bin:/usr/sbin:/bin:/sbin
TB --- 2010-05-16 01:04:55 - TARGET=amd64
TB --- 2010-05-16 01:04:55 - TARGET_ARCH=amd64
TB --- 2010-05-16 01:04:55 - TZ=UTC
TB --- 2010-05-16 01:04:55 - __MAKE_CONF=/dev/null
TB --- 2010-05-16 01:04:55 - cd /src
TB --- 2010-05-16 01:04:55 - /usr/bin/make -B buildkernel KERNCONF=LINT
>>> Kernel build for LINT started on Sun May 16 01:04:55 UTC 2010
>>> stage 1: configuring the kernel
>>> stage 2.1: cleaning up the object tree
>>> stage 2.2: rebuilding the object tree
>>> stage 2.3: build tools
>>> stage 3.1: making dependencies
>>> stage 3.2: building everything
[...]
ld -d -warn-common -r -d -o if_ed.ko if_ed.o if_ed_novell.o if_ed_wd80x3.o if_ed_rtl80x9.o if_ed_isa.o if_ed_3c503.o if_ed_hpp.o if_ed_sic.o if_ed_pccard.o if_ed_pci.o
:> export_syms
awk -f /src/sys/conf/kmod_syms.awk if_ed.ko export_syms | xargs -J% objcopy % if_ed.ko
objcopy --strip-debug if_ed.ko
===> em (all)
cc -O2 -fno-strict-aliasing -pipe -D_KERNEL -DKLD_MODULE -std=c99 -nostdinc -I/src/sys/modules/em/../../dev/e1000 -DHAVE_KERNEL_OPTION_HEADERS -include /obj/amd64/src/sys/LINT/opt_global.h -I. -I@ -I@/contrib/altq -finline-limit=8000 --param inline-unit-growth=100 --param large-function-growth=1000 -fno-common -fno-omit-frame-pointer -I/obj/amd64/src/sys/LINT -mcmodel=kernel -mno-red-zone -mfpmath=387 -mno-sse -mno-sse2 -mno-mmx -mno-3dnow -msoft-float -fno-asynchronous-unwind-tables -ffreestanding -Wall -Wredundant-decls -Wnested-externs -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Winline -Wcast-qual -Wundef -Wno-pointer-sign -fformat-extensions -c /src/sys/modules/em/../../dev/e1000/if_em.c
/src/sys/modules/em/../../dev/e1000/if_em.c:1350: error: conflicting types for 'em_poll'
/src/sys/modules/em/../../dev/e1000/if_em.c:287: error: previous declaration of 'em_poll' was here
*** Error code 1
Stop in /src/sys/modules/em.
*** Error code 1
Stop in /src/sys/modules.
*** Error code 1
Stop in /obj/amd64/src/sys/LINT.
*** Error code 1
Stop in /src.
*** Error code 1
Stop in /src.
TB --- 2010-05-16 01:22:51 - WARNING: /usr/bin/make returned exit code 1
TB --- 2010-05-16 01:22:51 - ERROR: failed to build lint kernel
TB --- 2010-05-16 01:22:51 - 5484.84 user 580.52 system 6509.85 real

http://tinderbox.freebsd.org/tinderbox-releng_7-RELENG_7-amd64-amd64.full
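The RELENG_7 reports above for pc98, i386 and amd64 all stop at the same two compiler diagnostics in if_em.c. As a rough sketch, assuming the full log linked in each report keeps one message per line, those diagnostics could be pulled out with a few lines of Python. The function name `compiler_errors` and the regular expression are illustrative only and are not part of the tinderbox scripts; the sample text is copied from the reports.

```python
import re

# gcc-style diagnostic: <path>:<line>: error: <message>
GCC_ERROR = re.compile(r"^(?P<path>[^:\s]+):(?P<line>\d+): error: (?P<msg>.+)$")


def compiler_errors(log_text):
    """Return (file name, line number, message) for each gcc error line."""
    found = []
    for raw in log_text.splitlines():
        match = GCC_ERROR.match(raw.strip())
        if match:
            file_name = match.group("path").rsplit("/", 1)[-1]
            found.append((file_name, int(match.group("line")), match.group("msg")))
    return found


# Sample lines copied from the failure reports.
SAMPLE = """\
/src/sys/modules/em/../../dev/e1000/if_em.c:1350: error: conflicting types for 'em_poll'
/src/sys/modules/em/../../dev/e1000/if_em.c:287: error: previous declaration of 'em_poll' was here
*** Error code 1
"""

for file_name, line, msg in compiler_errors(SAMPLE):
    print(f"{file_name}:{line}: {msg}")
```

Run against each architecture's log, identical tuples would suggest a single source-level problem (here, a declaration and definition of em_poll that disagree) rather than three separate ones.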
Re: Read / write timeouts on SATA disks connected to ICH9
On Sat, May 15, 2010 at 11:16:33PM +0200, Pieter de Boer wrote:
> >Attached the SMART output of both disks I replaced about a month ago. It
> >appears I replaced perfectly fine drives with the current disks with
> >errors ;( One of the old disks is in a USB-enclosure now, so 'da0'.

Regarding the Western Digital RE3 disk (serial WD-WMASY5474089):

The disk looks fine. The only thing of interest here is the temperature,
which is extremely high (47C). If this is the drive which is located in
a (non-fan-cooled) enclosure, that would explain it. There are no
UDMA/CRC errors, so I'm not of the belief that there were bad cables in
use either. Finally, there's no sign of the disk powering on/off
excessively either. In summary, I can't explain how this disk would
fall off the bus given its condition.

Regarding the Western Digital RE3 disk (serial WD-WMASY5474727):

Similar to the first RE3 disk; everything here looks great, including
disk temperature.

I do wish the FreeBSD ATA layer would give full diagnostic messages when
encountering these conditions. The request buffer could be printed, and
the response (error) could also be printed. SCSI CAM's error output is
what I'd be hoping for (sans SK/ASC/ASCQ, which AFAIK ATA doesn't have).
Yes, I know this is available if you use ahci.ko, but this isn't
available to the OP.

Anyway, if heavy disk/controller load appears to be causing these
problems, you could have power-related issues. Possibly the combination
of two disks + heavy I/O causes enough power draw that the ICH9 starts
to behave oddly. Voltages which deviate too much can cause odd things
to happen to hardware.

If you have the time/money, you might try replacing the PSU in your
system to see if there's any improvement; your BIOS should be able to
provide you Hardware Monitoring statistics (voltages). Write these down
before and after the PSU swap. You don't need to go crazy and buy a
1000W PSU or anything, but 450-750W is pretty normal these days.

-- 
| Jeremy Chadwick                                   j...@parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                  Mountain View, CA, USA |
| Making life hard for others since 1977.              PGP: 4BD6C0CB |
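The checks described in this message (temperature, UDMA CRC errors, remapped or pending sectors) can be sketched in Python. This is only a minimal sketch, assuming the whitespace-separated attribute table that smartctl prints; the helper names, the 45 C limit and the choice of attributes are assumptions made for illustration, not smartmontools defaults. The sample rows are taken from the WD-WMASY5474089 output posted in this thread.

```python
def parse_attributes(text):
    """Map attribute ID -> (name, raw value) from a smartctl attribute table."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split(None, 9)
        # Data rows have ten columns and start with a numeric ID.
        if len(fields) == 10 and fields[0].isdigit():
            raw = int(fields[9].split()[0])  # assumes the raw value starts with an integer
            attrs[int(fields[0])] = (fields[1], raw)
    return attrs


def review(attrs, max_temp_c=45):
    """Return notes for attributes that deserve a second look."""
    notes = []
    for attr_id in (5, 197, 199):  # reallocated, pending, UDMA CRC errors
        if attr_id in attrs and attrs[attr_id][1] != 0:
            name, raw = attrs[attr_id]
            notes.append(f"{name} is non-zero ({raw})")
    # max_temp_c is a hypothetical limit chosen for this example.
    if 194 in attrs and attrs[194][1] > max_temp_c:
        notes.append(f"temperature is {attrs[194][1]} C (limit {max_temp_c} C)")
    return notes


SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       74
194 Temperature_Celsius     0x0022   100   094   000    Old_age   Always       -       47
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

for note in review(parse_attributes(SAMPLE)):
    print(note)
```

With these rows only the temperature is flagged, which matches the reading given above. Taking the same snapshot before and after a timeout and comparing the two dictionaries would show which raw values moved.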
Re: Read / write timeouts on SATA disks connected to ICH9
> Attached the SMART output of both disks I replaced about a month ago. It
> appears I replaced perfectly fine drives with the current disks with
> errors ;( One of the old disks is in a USB-enclosure now, so 'da0'.

Let's send those attachments, then.

-- 
Pieter

smartctl 5.39 2009-12-09 r2995 [FreeBSD 8.0-STABLE i386] (local build)
Copyright (C) 2002-9 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital RE3 Serial ATA family
Device Model:     WDC WD5002ABYS-18B1B0
Serial Number:    WD-WMASY5474089
Firmware Version: 02.03B03
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Sat May 15 21:53:04 2010 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                                        was completed without error.
                                        Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever
                                        been run.
Total time to complete Offline
data collection:                 (9480) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 112) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x303f) SCT Status supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   179   179   021    Pre-fail  Always       -       4033
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       89
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   093   093   000    Old_age   Always       -       5536
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       74
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       71
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       89
194 Temperature_Celsius     0x0022   100   094   000    Old_age   Always       -       47
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      5487         -
# 2  Extended offline    Completed without error       00%       492
Re: Read / write timeouts on SATA disks connected to ICH9
Hi,

> That could be caused by a multitude of other known things. For example,
> some Western Digital "Green" drives (including the Enterprise class
> ones) are known to perform head parking/offloading excessively, which
> could result in the drive spending more time doing that than actually
> serving overall I/O requests. There are some other reports of Samsung
> Spinpoint drives experiencing other issues (I've since forgotten and
> would have to dig up the threads). If you could provide full SMART
> stats for that drive, it might help.

Attached the SMART output of both disks I replaced about a month ago. It
appears I replaced perfectly fine drives with the current disks with
errors ;( One of the old disks is in a USB-enclosure now, so 'da0'.

> Yes, it's a DOS-based utility (like most firmware upgraders these days).
> I can provide it if you'd like. I've been meaning to spend some time
> trying to reverse-engineer the binary to figure out what ATA commands it
> sends to the disk to toggle/adjust the feature (so that one could do it
> in real-time rather than have to boot into DOS).

I'd like to try that tool. Since the old WD disks are now lying around
at home, I have some time to get a DOS boot working to try it out. A
FreeBSD implementation of the WD tool and possibly other brands would be
really useful indeed.

>> At a certain point in time I had read errors from specific LBAs on
>> ad4. Using dd I was able to pinpoint those to single sectors.
>
> This isn't very effective (dd will read large chunks/amounts of data
> (read: multiple LBAs) from the underlying disk at once, rather than the
> disk itself performing a per-LBA test). My opinion is that the "dd
> method" should only be used on drives which don't support selective LBA
> scanning via SMART.

Will dd read multiple LBAs even when using 'bs=512'? The process I used
was reading using bs=8192, then zooming in on the LBAs mentioned in the
errors in dmesg with bs=512 to find the actual LBA.

A selective scan on ad4 did not reveal any errors today: it 'completed
without error'. On ad6 it's a whole lot slower; at the time of writing
it's at 2/3.

> All HD vendors have their own quirks/ordeals right now. You basically
> just have to go with one who works well for you, then if things start
> going downhill, switch to another. None of them are perfect.

I figured as much. What irritates though is that I've had consistent
problems with 4 disks in this specific system, but not (such) issues
with any other disk in other systems I've had. I generally replace disks
when I grow out of them, not because they break down.

> What this indicates to me is that if a disk falls off the bus on an ICH9
> controller in Enhanced (non-AHCI) mode, FreeBSD starts seeing an absurd
> number of interrupts generated from the ICH9. My guess is FreeBSD isn't
> doing something correctly with the controller when this happens; maybe
> certain commands aren't being sent back to the controller or handling of
> certain events is being done improperly when it comes to ICH9 (or
> possibly earlier ICH revisions too).
>
> This should be *very* easy to reproduce.

Unfortunately I'm not really in a position to help reproduce this or
test possible fixes; downtime is currently very unwelcome. Although one
of the previous disks indeed fell off the bus entirely (couldn't get it
back with atacontrol either), that hasn't happened again so far. I only
see timeouts (and a few days ago read errors on ad4) which gmirror
doesn't like. I guess those aren't that simple to reproduce (apart from
on my system ;).

> If you see any of your disks on the ICH9 controller fall off the bus or
> report ATA errors (doesn't matter what kind), please make note of the
> timestamp (should be in the kernel log), and ASAP run "smartctl -a" on
> the disk. You should compare attributes before and after the event.
>
> You might also want to consider using smartd, which can log SMART
> attribute changes on its own. Note that you might have to tune the
> arguments in smartd.conf to ignore some attributes which fluctuate
> naturally (such as drive temperature and seek error rate).

I've configured smartd to poll both disks every 5 minutes. I -think- the
issues happen specifically under load: the periodic scripts of the host
and its 4 jails appear to trigger it sometimes. At that time I'm
normally trying to get some sleep, so smartd will have to do for now.
Although I'll run a "smartctl -a" asap anyway.

-- 
Pieter
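The zoom-in described in this message (a coarse pass with bs=8192, then a second pass with bs=512 over the suspect block) comes down to simple offset arithmetic, sketched here in Python. It assumes 512-byte sectors and that dd's block counter starts at the beginning of the device; the function names, block number 1000 and the /dev/ad4 path are hypothetical values chosen for the example.

```python
SECTOR_SIZE = 512  # bytes per LBA; an assumption, check the drive's reported sector size


def lbas_in_block(block_index, bs=8192, sector_size=SECTOR_SIZE):
    """LBAs covered by one dd block of size bs, counted from the start of the device."""
    per_block = bs // sector_size
    first = block_index * per_block
    return range(first, first + per_block)


def zoom_command(block_index, device, bs=8192, sector_size=SECTOR_SIZE):
    """dd invocation that re-reads one coarse block sector by sector."""
    lbas = lbas_in_block(block_index, bs, sector_size)
    return (f"dd if={device} of=/dev/null bs={sector_size} "
            f"skip={lbas.start} count={len(lbas)}")


# Hypothetical example: the bs=8192 pass failed on block 1000 of /dev/ad4.
print(zoom_command(1000, "/dev/ad4"))
```

One bs=8192 block spans 16 sectors under these assumptions, so the second pass only has to read 16 LBAs. Whether a failing read then maps to exactly one sector still depends on how the driver and drive split requests, which this arithmetic does not address.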
Re: Kernel panic when unpluggin AC adaptor
On Thu, May 13, 2010 at 7:25 PM, Giovanni Trematerra wrote:
> On Thu, May 13, 2010 at 1:09 AM, Brandon Gooch wrote:
>> On Wed, May 12, 2010 at 9:41 AM, Attilio Rao wrote:
>>> 2010/5/12 David DEMELIER:
>>>> I removed the patch, and built the kernel (I updated the src this
>>>> morning) and it does not panic now. It's really odd. If it
>>>> reappears soon I will tell you.
>>>
>>> I looked at the code with Giovanni and I have the feeling that the
>>> race with the idle thread may still be fatal.
>>> We need to fix that.
>>>
>>> Attilio
>>>
>>
>> That seems to be the case, as my laptop shows about an 80-85 % chance
>> of experiencing a panic if left idle for long-ish periods of time (2
>> to 4 hours). I usually rebuild world or big ports overnight, and more
>> often than not I wake up to a panicked machine, same situation every
>> time:
>>
>> ...
>> rman_get_bushandle() at rman_get_bushandle+0x1
>> sched_idletd() at sched_idletd+0x123
>> fork_exit() at fork_exit+0x12a
>> fork_trampoline() at fork_trampoline+0xe
>> ...
>>
>> The kernel/userland is rebuilt, the ports are finished compiling --
>> it's in the time AFTER the completion of all tasks that the machine
>> gets bored and tries to kill itself :)
>>
>> I have seen the AC adapter plug/unplug "hang" in the past on this
>> laptop, but I never made the connection between the events, as
>> nowadays my laptop usually stays plugged in :(
>>
>> Attilio, I hope you can track this one down, let me know if I can do
>> anything to help or test...
>>
>
> Attilio and I came up with this patch. It seems ready for stress
> testing and review
> Please test and report back.
>
> Thank you
>
> P.S: all the faults are only mine.

I tried the patch, and my kernel panics on boot.

I have 8.5MB(!) of JPG images (6 of them) if anyone needs to see them.
I'm looking for a place to post them, but if anyone wants, I can send
via e-mail...

-Brandon
Re: Read / write timeouts on SATA disks connected to ICH9
On Sat, May 15, 2010 at 09:04:11AM +0200, Pieter de Boer wrote:
> Thanks for your elaborate reply, it was very useful to see smartctl
> output explained a bit :) I still think there's something else in
> play beside disk failure. I've checked one of the drives I replaced
> earlier, but that one doesn't have any of the errors in its SMART
> output you described, although it did drop out of the mirror
> multiple times during its lifetime.

That could be caused by a multitude of other known things. For example,
some Western Digital "Green" drives (including the Enterprise class
ones) are known to perform head parking/offloading excessively, which
could result in the drive spending more time doing that than actually
serving overall I/O requests. There are some other reports of Samsung
Spinpoint drives experiencing other issues (I've since forgotten and
would have to dig up the threads). If you could provide full SMART
stats for that drive, it might help.

> >The WD Caviar Black drives have a useful feature called TLER -- it's
> >disabled by default, for reasons which I don't want to get into here --
> >which can force the drive to internally give up after X seconds (it's
> >user-selectable) when dealing with such remapping/errors. The idea is
> >to keep the drive from being deemed dead from the OS/controller's point
> >of view. I believe Seagate, Hitachi, or Samsung (I forget which) have
> >this feature as well, but it's not called TLER.
>
> I've read about this feature, but didn't have the time to try to get
> it turned on (iirc you'd need a specific Western Digital DOS-based
> util or something).

Yes, it's a DOS-based utility (like most firmware upgraders these days).
I can provide it if you'd like. I've been meaning to spend some time
trying to reverse-engineer the binary to figure out what ATA commands it
sends to the disk to toggle/adjust the feature (so that one could do it
in real-time rather than have to boot into DOS).

> >If you want to find out the exact LBA that has the problem (there may be
> >more than one), I can step you through performing a selective LBA scan
> >using SMART, since this model of disk does support such. It's easy to
> >do, easy to understand the results, and can be done while the drive is
> >in operation (though I would recommend trying to keep disk I/O to a
> >minimum during this test). Let me know.
>
> At a certain point in time I had read errors from specific LBA's on
> ad4. Using dd I was able to pinpoint those to single sectors.

This isn't very effective (dd will read large chunks/amounts of data
(read: multiple LBAs) from the underlying disk at once, rather than the
disk itself performing a per-LBA test). My opinion is that the "dd
method" should only be used on drives which don't support selective LBA
scanning via SMART.

> Overwriting those sectors with what was on ad6 made them readable
> again. What is odd is that the 'remapped sector' count of ad4 is 0.

What may have happened is that the drive took a while to read certain
LBAs (long enough for the OS/controller to time out), but that internal
drive ECC was used to correct the reads and the sectors therefore *did
not* need to be remapped. I do see that Attribute 1 on ad4 is non-zero,
which could indicate said situation, but WD doesn't provide Attribute
195 (ECC recovery rate), which could help here.

SMART implementations are usually quite good (particularly in recent WD
drives), but I have seen situations where certain counters are,
erroneously, not being incremented or changed. I've seen a couple brand
new disks come out of the factory with non-zero values (indicating
someone at the fab forgot to clear them before shipping).

I'd love to get my hands on a WD utility that zeros out the counters and
re-flashes the drive firmware to rule out any oddities. It's been
proven already that WD will re-use the same F/W version number despite
some code being changed. There was a FreeBSD user who got a F/W fix
from WD for the head offloading/parking ordeal (see above, re: WD GP),
and the firmware version between the old and the new were the same.
Tracking stuff like this down is basically impossible unless MD5/SHAs of
the firmware files can be provided (good luck).

All HD vendors have their own quirks/ordeals right now. You basically
just have to go with one who works well for you, then if things start
going downhill, switch to another. None of them are perfect.

> Still I'd like to know how to perform such a scan.

smartctl -t select,0-max

This will start a selective LBA scan from LBA 0 to the end of the disk.
If any error is encountered, the scan stops and the error -- including
the LBA where an error was seen -- is output in the SMART self-test and
SMART selective self-test logs. You can then write down the LBA, and
then re-run the above command replacing "0" with the LBA+1 where the
error was seen.

Here's an example of what a failed selective scan looks like (taken from
a Hitachi disk I just dealt w
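The resume step described in this message (re-run the scan with "0" replaced by the failing LBA plus one) can be written down as a tiny Python helper. This is only a sketch: the function name, the LBA value 123456 and the /dev/ad4 device argument are made up for the example, and only the `-t select,N-max` form comes from the message itself.

```python
def resume_scan_command(error_lba, device):
    """smartctl arguments that restart a selective scan just past a failed LBA."""
    return ["smartctl", "-t", f"select,{error_lba + 1}-max", device]


# Hypothetical values: a scan of /dev/ad4 that stopped at LBA 123456.
print(" ".join(resume_scan_command(123456, "/dev/ad4")))
```

Repeating this after each stop, and noting every reported LBA, would eventually cover the whole disk and leave a list of the sectors that failed.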
Re: Read / write timeouts on SATA disks connected to ICH9
Pieter de Boer wrote:
> Hi there,
>
>> what kind of disk I/O is going on. If actual I/O is very little, then
>> something weird is going on with regards to the number of interrupts
>> being seen on IRQ 23. mav@ might have some ideas, otherwise I'd
>> recommend rebooting the machine and seeing if the number drops. If so,
>> it may be that the OS has some sort of bug where a disk timing out or
>> falling off the bus causes interrupt problems. (It's too bad you don't
>> have AHCI on this system. It handles stuff like this much more
>> elegantly...)
>
> Well, due to a UFS snapshot panic the box was rebooted, and now I only
> see around 1500 interrupts per second, while syncing the mirror.

I have seen high interrupt counts on 7.x systems after pulling one drive
in a gmirror out and putting it back in [1], even if it was successfully
disconnected by gmirror remove + atacontrol detach and reconnected by
atacontrol attach + gmirror insert. It was not 100% reproducible, but it
seems the bug is still there in 8.x.

[1] http://lists.freebsd.org/pipermail/freebsd-stable/2008-October/046003.html

Miroslav Lachman
Re: Read / write timeouts on SATA disks connected to ICH9
> Interesting. Which version of FreeBSD is this system running? I guess
> you didn't experience any of the timeouts I'm seeing?

8-STABLE as of the 11th of this month, or thereabouts. No, I've never
seen a disk timeout on that box.

> Yeah, this R300 was bought second-hand and unfortunately the owner
> pulled the RAID card out. It's something to consider, getting one of
> those cards. Do you use the RAID-features of the drive and if so, does
> that work well? I'm a bit hesitant to use hardware raid; it would be a
> big plus if the RAID disks could also be used stand-alone if need be
> (which is easy with gmirror because of its metadata being stored in the
> drive's last sector).

Does your system have hot-swap drive bays and the SAS backplane? If it
at least has hot-swap bays, then you could always add the backplane,
cable, and controller.

I'm using the hardware mirroring on the SAS 6/iR card (with a pair of
WD3000HLFS drives, since the previous owner took the factory drives out
before selling the system). I haven't tried taking one of those drives
and seeing if it will boot on a standalone SATA port. I have removed
both drives, installed a scratch drive, and installed Windows on it to
run one of the Dell update installers (not all of them come in DOS or
Linux flavors). The controller didn't mind the swap a bit (or the swap
back to the 2 RAID drives). That's a lot better than the old amr-based
RAID cards.

        Terry Kennedy             http://www.tmk.com
        te...@tmk.com             New York, NY USA
Re: Read / write timeouts on SATA disks connected to ICH9
> Hi there, what kind of disk I/O is going on. If actual I/O is very little, then something weird is going on with regards to the number of interrupts being seen on IRQ 23. mav@ might have some ideas, otherwise I'd recommend rebooting the machine and seeing if the number drops. If so, it may be that the OS has some sort of bug where a disk timing out or falling off the bus causes interrupt problems. (It's too bad you don't have AHCI on this system. It handles stuff like this much more elegantly...)

Well, due to a UFS snapshot panic the box was rebooted, and now I only see around 1500 interrupts per second, while syncing the mirror.

-- Pieter
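A side note on measuring this: the rate column of vmstat -i appears to be an average since boot, so a "current" figure such as the 1500 per second above is easier to confirm by taking two samples a known number of seconds apart and dividing the difference in totals by the interval. The following is only a minimal Python sketch of that arithmetic. The function names are hypothetical, the sample text in the demo is invented for illustration, and the parser assumes the three-column layout (name, total, rate) seen elsewhere in this thread.

```python
def parse_totals(vmstat_output: str) -> dict:
    """Map each interrupt source to its cumulative total.

    Assumes lines shaped like "irq23: atapci0 <total> <rate>"; the header
    line and anything else that does not end in two integers is skipped.
    """
    totals = {}
    for line in vmstat_output.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        if not (fields[-2].isdigit() and fields[-1].isdigit()):
            continue
        name = " ".join(fields[:-2])
        totals[name] = int(fields[-2])
    return totals


def current_rates(before: str, after: str, interval_seconds: float) -> dict:
    """Interrupts per second between two samples taken interval_seconds apart."""
    if interval_seconds <= 0:
        raise ValueError("interval must be positive")
    first = parse_totals(before)
    second = parse_totals(after)
    return {
        name: (second[name] - first[name]) / interval_seconds
        for name in second
        if name in first
    }


if __name__ == "__main__":
    # Invented sample text, not real output from any machine in this thread.
    sample_a = "interrupt total rate\nirq23: atapci0 1000 10\n"
    sample_b = "interrupt total rate\nirq23: atapci0 16000 12\n"
    print(current_rates(sample_a, sample_b, 10.0))
```

In practice the two samples would be captured output of vmstat -i taken a few seconds apart; a source that appears in only one sample is ignored.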
www/apache22: purpose of WITHOUT_APACHE_OPTIONS?
Hi,

I was quite surprised that I need to set WITHOUT_APACHE_OPTIONS to have any command line options honored by the makefile. All other ports seem to override the config options (that may or may not be set) with the WITH and WITHOUT variables specified on the make command line or through pkgtools.conf.

What's the reason for this difference?

Thanks, Stefan

-- Stefan Bethke Fon +49 151 14070811
Re: Read / write timeouts on SATA disks connected to ICH9
Hi Terry,

> I have a bunch of R300's here. From one that is using the on-board SATA and 2 drives in a gmirror setup (very similar to the OP) after 18 hours of uptime:
>
> [0:2] speedtest:~> vmstat -i
> interrupt                          total       rate
> irq23: atapci0                    254116          3

Interesting. Which version of FreeBSD is this system running? I guess you didn't experience any of the timeouts I'm seeing?

> I also have another R300 with Dell's "SAS 6/iR" card (a re-branded LSI 1068-something, seen as "mpt" by FreeBSD). While Dell only sells that as part of a package deal with the hot-swap backplane and redundant power supplies, there's no reason you couldn't pick one up on eBay and add it yourself. You'll need some sort of breakout cable to get from the big connector on the SAS 6 to individual SATA ports.

Yeah, this R300 was bought second-hand and unfortunately the owner pulled the RAID card out. It's something to consider, getting one of those cards. Do you use the RAID-features of the drive and if so, does that work well? I'm a bit hesitant to use hardware RAID; it would be a big plus if the RAID disks could also be used stand-alone if need be (which is easy with gmirror because of its metadata being stored in the drive's last sector).

Thanks, Pieter
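For comparing the two vmstat -i figures in this thread: the quoted numbers fit the rate column being the total divided by the uptime in seconds, truncated to an integer. That reading is an assumption drawn from the numbers themselves, not something stated in the thread. A minimal Python sketch of the arithmetic, with hypothetical function names:

```python
def average_rate(total_interrupts: int, uptime_seconds: int) -> int:
    """Average interrupts per second since boot, truncated to an integer."""
    if uptime_seconds <= 0:
        raise ValueError("uptime must be positive")
    return total_interrupts // uptime_seconds


def implied_uptime_hours(total_interrupts: int, rate: int) -> float:
    """Approximate uptime implied by a total/rate pair.

    The rate is truncated, so the result is only a rough estimate.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    return total_interrupts / rate / 3600.0


if __name__ == "__main__":
    # Totals and rates as quoted in this thread.
    print(average_rate(254116, 18 * 3600))           # R300 after about 18 hours
    print(implied_uptime_hours(371021299, 10423))    # the box showing roughly 10K/s
```

If the assumption holds, the second call gives a rough idea of how long the high-interrupt system had been up when its sample was taken.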
Re: Read / write timeouts on SATA disks connected to ICH9
Hi Jeremy,

> Lots to say about all of this.

Thanks for your elaborate reply, it was very useful to see smartctl output explained a bit :) I still think there's something else in play besides disk failure. I've checked one of the drives I replaced earlier, but that one doesn't have any of the errors in its SMART output you described, although it did drop out of the mirror multiple times during its lifetime.

> The WD Caviar Black drives have a useful feature called TLER -- it's disabled by default, for reasons which I don't want to get into here -- which can force the drive to internally give up after X seconds (it's user-selectable) when dealing with such remapping/errors. The idea is to keep the drive from being deemed dead from the OS/controller's point of view. I believe Seagate, Hitachi, or Samsung (I forget which) have this feature as well, but it's not called TLER.

I've read about this feature, but didn't have the time to try to get it turned on (iirc you'd need a specific Western Digital DOS-based util or something).

> If you want to find out the exact LBA that has the problem (there may be more than one), I can step you through performing a selective LBA scan using SMART, since this model of disk does support such. It's easy to do, easy to understand the results, and can be done while the drive is in operation (though I would recommend trying to keep disk I/O to a minimum during this test). Let me know.

At a certain point in time I had read errors from specific LBAs on ad4. Using dd I was able to pinpoint those to single sectors. Overwriting those sectors with what was on ad6 made them readable again. What is odd is that the 'remapped sector' count of ad4 is 0. Still, I'd like to know how to perform such a scan.

> Finally, your vmstat -i output:
>
>> # vmstat -i
>> interrupt                          total       rate
>> irq23: atapci0                 371021299      10423
>
> Good to know there's no IRQ sharing going on, but what does worry me is the interrupt rate (10K interrupts/second). That seems *extremely* high, but it also depends on what kind of disk I/O is happening on this system -- especially since you have 2 disks attached to the same controller.

The rate is higher than 1 also at idle. During a gmirror sync from ad6 to ad4, it's about 10670.

> "iostat 1", "iostat -x 1", or "gstat" might come in handy to tell you what kind of disk I/O is going on. If actual I/O is very little, then something weird is going on with regards to the number of interrupts being seen on IRQ 23. mav@ might have some ideas, otherwise I'd recommend rebooting the machine and seeing if the number drops. If so, it may be that the OS has some sort of bug where a disk timing out or falling off the bus causes interrupt problems. (It's too bad you don't have AHCI on this system. It handles stuff like this much more elegantly...)

If mav@ or anyone else doesn't have another insight in the interrupt rate, I guess a reboot will at least show if it's persistent or related to the errors. I'll try to do a reboot when convenient (probably Sunday morning or something).

Thanks, Pieter
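The dd approach mentioned above (reading a suspect range one sector at a time to see which sectors fail) can be sketched in Python roughly as follows. This is not the SMART selective scan that was offered, only a plain read loop. The function name, the 512-byte sector size, and the idea of opening the disk as an unbuffered binary file are all assumptions for illustration; the demo runs against an in-memory buffer rather than a real device.

```python
import io
from typing import BinaryIO, List

SECTOR_SIZE = 512  # assumed; check the drive's actual sector size


def find_unreadable_sectors(dev: BinaryIO, first_lba: int, count: int,
                            sector_size: int = SECTOR_SIZE) -> List[int]:
    """Read count sectors starting at first_lba and return the LBAs that fail.

    A sector counts as unreadable if the read raises OSError or returns
    fewer bytes than a full sector (for example past the end of the device).
    """
    bad = []
    for lba in range(first_lba, first_lba + count):
        try:
            dev.seek(lba * sector_size)
            data = dev.read(sector_size)
        except OSError:
            bad.append(lba)
            continue
        if len(data) != sector_size:
            bad.append(lba)
    return bad


if __name__ == "__main__":
    # Stand-in for a disk: four zero-filled sectors held in memory.
    fake_disk = io.BytesIO(b"\0" * (4 * SECTOR_SIZE))
    print(find_unreadable_sectors(fake_disk, 0, 4))
    print(find_unreadable_sectors(fake_disk, 2, 4))
```

Against a real disk, dev would be something like the device node opened read-only in binary mode with buffering disabled, so that each read maps to one aligned sector; that detail is untested here and should be treated as a guess.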
[releng_7 tinderbox] failure on sparc64/sparc64
TB --- 2010-05-15 05:57:23 - tinderbox 2.6 running on freebsd-stable.sentex.ca
TB --- 2010-05-15 05:57:23 - starting RELENG_7 tinderbox run for sparc64/sparc64
TB --- 2010-05-15 05:57:23 - cleaning the object tree
TB --- 2010-05-15 05:57:51 - cvsupping the source tree
TB --- 2010-05-15 05:57:51 - /usr/bin/csup -z -r 3 -g -L 1 -h localhost -s /tinderbox/RELENG_7/sparc64/sparc64/supfile
TB --- 2010-05-15 05:58:04 - building world
TB --- 2010-05-15 05:58:04 - MAKEOBJDIRPREFIX=/obj
TB --- 2010-05-15 05:58:04 - PATH=/usr/bin:/usr/sbin:/bin:/sbin
TB --- 2010-05-15 05:58:04 - TARGET=sparc64
TB --- 2010-05-15 05:58:04 - TARGET_ARCH=sparc64
TB --- 2010-05-15 05:58:04 - TZ=UTC
TB --- 2010-05-15 05:58:04 - __MAKE_CONF=/dev/null
TB --- 2010-05-15 05:58:04 - cd /src
TB --- 2010-05-15 05:58:04 - /usr/bin/make -B buildworld
>>> World build started on Sat May 15 05:58:05 UTC 2010
>>> Rebuilding the temporary build tree
>>> stage 1.1: legacy release compatibility shims
>>> stage 1.2: bootstrap tools
>>> stage 2.1: cleaning up the object tree
>>> stage 2.2: rebuilding the object tree
>>> stage 2.3: build tools
>>> stage 3: cross tools
>>> stage 4.1: building includes
>>> stage 4.2: building libraries
>>> stage 4.3: make dependencies
>>> stage 4.4: building everything
>>> World build completed on Sat May 15 06:58:39 UTC 2010
TB --- 2010-05-15 06:58:39 - generating LINT kernel config
TB --- 2010-05-15 06:58:39 - cd /src/sys/sparc64/conf
TB --- 2010-05-15 06:58:39 - /usr/bin/make -B LINT
TB --- 2010-05-15 06:58:39 - building LINT kernel
TB --- 2010-05-15 06:58:39 - MAKEOBJDIRPREFIX=/obj
TB --- 2010-05-15 06:58:39 - PATH=/usr/bin:/usr/sbin:/bin:/sbin
TB --- 2010-05-15 06:58:39 - TARGET=sparc64
TB --- 2010-05-15 06:58:39 - TARGET_ARCH=sparc64
TB --- 2010-05-15 06:58:39 - TZ=UTC
TB --- 2010-05-15 06:58:39 - __MAKE_CONF=/dev/null
TB --- 2010-05-15 06:58:39 - cd /src
TB --- 2010-05-15 06:58:39 - /usr/bin/make -B buildkernel KERNCONF=LINT
>>> Kernel build for LINT started on Sat May 15 06:58:39 UTC 2010
>>> stage 1: configuring the kernel
>>> stage 2.1: cleaning up the object tree
>>> stage 2.2: rebuilding the object tree
>>> stage 2.3: build tools
>>> stage 3.1: making dependencies
[...]
===> em (depend)
@ -> /src/sys
machine -> /src/sys/sparc64/include
awk -f @/tools/makeobjops.awk @/kern/device_if.m -h
awk -f @/tools/makeobjops.awk @/kern/bus_if.m -h
awk -f @/tools/makeobjops.awk @/dev/pci/pci_if.m -h
ln -sf /obj/sparc64/src/sys/LINT/opt_inet.h opt_inet.h
make: don't know how to make if_lem.c. Stop
*** Error code 2
Stop in /src/sys/modules.
*** Error code 1
Stop in /obj/sparc64/src/sys/LINT.
*** Error code 1
Stop in /src.
*** Error code 1
Stop in /src.
TB --- 2010-05-15 07:00:06 - WARNING: /usr/bin/make returned exit code 1
TB --- 2010-05-15 07:00:06 - ERROR: failed to build lint kernel
TB --- 2010-05-15 07:00:06 - 3200.38 user 331.62 system 3763.34 real

http://tinderbox.freebsd.org/tinderbox-releng_7-RELENG_7-sparc64-sparc64.full