Re: [releng_7 tinderbox] failure on amd64/amd64
DUH, forgot to add the file, lol. Fix coming shortly

Jack

On Fri, May 14, 2010 at 9:54 PM, Jeremy Chadwick wrote:
> On Fri, May 14, 2010 at 11:40:23PM -0400, FreeBSD Tinderbox wrote:
> > ===> em (depend)
> > @ -> /src/sys
> > machine -> /src/sys/amd64/include
> > awk -f @/tools/makeobjops.awk @/kern/device_if.m -h
> > awk -f @/tools/makeobjops.awk @/kern/bus_if.m -h
> > awk -f @/tools/makeobjops.awk @/dev/pci/pci_if.m -h
> > ln -sf /obj/amd64/src/sys/LINT/opt_inet.h opt_inet.h
> > make: don't know how to make if_lem.c. Stop
> > *** Error code 2
> >
> > Stop in /src/sys/modules.
> > *** Error code 1
> >
> > Stop in /obj/amd64/src/sys/LINT.
> > *** Error code 1
>
> Jack, did you break em(4) (or lem in this case) again? :-)
>
> --
> | Jeremy Chadwick j...@parodius.com |
> | Parodius Networking http://www.parodius.com/ |
> | UNIX Systems Administrator Mountain View, CA, USA |
> | Making life hard for others since 1977. PGP: 4BD6C0CB |

___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
[releng_7 tinderbox] failure on ia64/ia64
TB --- 2010-05-15 04:29:04 - tinderbox 2.6 running on freebsd-stable.sentex.ca TB --- 2010-05-15 04:29:04 - starting RELENG_7 tinderbox run for ia64/ia64 TB --- 2010-05-15 04:29:04 - cleaning the object tree TB --- 2010-05-15 04:29:28 - cvsupping the source tree TB --- 2010-05-15 04:29:28 - /usr/bin/csup -z -r 3 -g -L 1 -h localhost -s /tinderbox/RELENG_7/ia64/ia64/supfile TB --- 2010-05-15 04:29:40 - building world TB --- 2010-05-15 04:29:40 - MAKEOBJDIRPREFIX=/obj TB --- 2010-05-15 04:29:40 - PATH=/usr/bin:/usr/sbin:/bin:/sbin TB --- 2010-05-15 04:29:40 - TARGET=ia64 TB --- 2010-05-15 04:29:40 - TARGET_ARCH=ia64 TB --- 2010-05-15 04:29:40 - TZ=UTC TB --- 2010-05-15 04:29:40 - __MAKE_CONF=/dev/null TB --- 2010-05-15 04:29:40 - cd /src TB --- 2010-05-15 04:29:40 - /usr/bin/make -B buildworld >>> World build started on Sat May 15 04:29:41 UTC 2010 >>> Rebuilding the temporary build tree >>> stage 1.1: legacy release compatibility shims >>> stage 1.2: bootstrap tools >>> stage 2.1: cleaning up the object tree >>> stage 2.2: rebuilding the object tree >>> stage 2.3: build tools >>> stage 3: cross tools >>> stage 4.1: building includes >>> stage 4.2: building libraries >>> stage 4.3: make dependencies >>> stage 4.4: building everything >>> World build completed on Sat May 15 05:55:48 UTC 2010 TB --- 2010-05-15 05:55:48 - generating LINT kernel config TB --- 2010-05-15 05:55:48 - cd /src/sys/ia64/conf TB --- 2010-05-15 05:55:48 - /usr/bin/make -B LINT TB --- 2010-05-15 05:55:48 - building LINT kernel TB --- 2010-05-15 05:55:48 - MAKEOBJDIRPREFIX=/obj TB --- 2010-05-15 05:55:48 - PATH=/usr/bin:/usr/sbin:/bin:/sbin TB --- 2010-05-15 05:55:48 - TARGET=ia64 TB --- 2010-05-15 05:55:48 - TARGET_ARCH=ia64 TB --- 2010-05-15 05:55:48 - TZ=UTC TB --- 2010-05-15 05:55:48 - __MAKE_CONF=/dev/null TB --- 2010-05-15 05:55:48 - cd /src TB --- 2010-05-15 05:55:48 - /usr/bin/make -B buildkernel KERNCONF=LINT >>> Kernel build for LINT started on Sat May 15 05:55:49 UTC 2010 >>> stage 1: 
configuring the kernel >>> stage 2.1: cleaning up the object tree >>> stage 2.2: rebuilding the object tree >>> stage 2.3: build tools >>> stage 3.1: making dependencies [...] ===> em (depend) @ -> /src/sys machine -> /src/sys/ia64/include awk -f @/tools/makeobjops.awk @/kern/device_if.m -h awk -f @/tools/makeobjops.awk @/kern/bus_if.m -h awk -f @/tools/makeobjops.awk @/dev/pci/pci_if.m -h ln -sf /obj/ia64/src/sys/LINT/opt_inet.h opt_inet.h make: don't know how to make if_lem.c. Stop *** Error code 2 Stop in /src/sys/modules. *** Error code 1 Stop in /obj/ia64/src/sys/LINT. *** Error code 1 Stop in /src. *** Error code 1 Stop in /src. TB --- 2010-05-15 05:57:23 - WARNING: /usr/bin/make returned exit code 1 TB --- 2010-05-15 05:57:23 - ERROR: failed to build lint kernel TB --- 2010-05-15 05:57:23 - 4602.17 user 363.85 system 5299.09 real http://tinderbox.freebsd.org/tinderbox-releng_7-RELENG_7-ia64-ia64.full ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: [releng_7 tinderbox] failure on amd64/amd64
On Fri, May 14, 2010 at 11:40:23PM -0400, FreeBSD Tinderbox wrote:
> ===> em (depend)
> @ -> /src/sys
> machine -> /src/sys/amd64/include
> awk -f @/tools/makeobjops.awk @/kern/device_if.m -h
> awk -f @/tools/makeobjops.awk @/kern/bus_if.m -h
> awk -f @/tools/makeobjops.awk @/dev/pci/pci_if.m -h
> ln -sf /obj/amd64/src/sys/LINT/opt_inet.h opt_inet.h
> make: don't know how to make if_lem.c. Stop
> *** Error code 2
>
> Stop in /src/sys/modules.
> *** Error code 1
>
> Stop in /obj/amd64/src/sys/LINT.
> *** Error code 1

Jack, did you break em(4) (or lem in this case) again? :-)

--
| Jeremy Chadwick j...@parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, USA |
| Making life hard for others since 1977. PGP: 4BD6C0CB |
[releng_7 tinderbox] failure on i386/pc98
TB --- 2010-05-15 03:40:23 - tinderbox 2.6 running on freebsd-stable.sentex.ca TB --- 2010-05-15 03:40:23 - starting RELENG_7 tinderbox run for i386/pc98 TB --- 2010-05-15 03:40:23 - cleaning the object tree TB --- 2010-05-15 03:40:43 - cvsupping the source tree TB --- 2010-05-15 03:40:43 - /usr/bin/csup -z -r 3 -g -L 1 -h localhost -s /tinderbox/RELENG_7/i386/pc98/supfile TB --- 2010-05-15 03:40:52 - building world TB --- 2010-05-15 03:40:52 - MAKEOBJDIRPREFIX=/obj TB --- 2010-05-15 03:40:52 - PATH=/usr/bin:/usr/sbin:/bin:/sbin TB --- 2010-05-15 03:40:52 - TARGET=pc98 TB --- 2010-05-15 03:40:52 - TARGET_ARCH=i386 TB --- 2010-05-15 03:40:52 - TZ=UTC TB --- 2010-05-15 03:40:52 - __MAKE_CONF=/dev/null TB --- 2010-05-15 03:40:52 - cd /src TB --- 2010-05-15 03:40:52 - /usr/bin/make -B buildworld >>> World build started on Sat May 15 03:40:54 UTC 2010 >>> Rebuilding the temporary build tree >>> stage 1.1: legacy release compatibility shims >>> stage 1.2: bootstrap tools >>> stage 2.1: cleaning up the object tree >>> stage 2.2: rebuilding the object tree >>> stage 2.3: build tools >>> stage 3: cross tools >>> stage 4.1: building includes >>> stage 4.2: building libraries >>> stage 4.3: make dependencies >>> stage 4.4: building everything >>> World build completed on Sat May 15 04:44:43 UTC 2010 TB --- 2010-05-15 04:44:43 - generating LINT kernel config TB --- 2010-05-15 04:44:43 - cd /src/sys/pc98/conf TB --- 2010-05-15 04:44:43 - /usr/bin/make -B LINT TB --- 2010-05-15 04:44:43 - building LINT kernel TB --- 2010-05-15 04:44:43 - MAKEOBJDIRPREFIX=/obj TB --- 2010-05-15 04:44:43 - PATH=/usr/bin:/usr/sbin:/bin:/sbin TB --- 2010-05-15 04:44:43 - TARGET=pc98 TB --- 2010-05-15 04:44:43 - TARGET_ARCH=i386 TB --- 2010-05-15 04:44:43 - TZ=UTC TB --- 2010-05-15 04:44:43 - __MAKE_CONF=/dev/null TB --- 2010-05-15 04:44:43 - cd /src TB --- 2010-05-15 04:44:43 - /usr/bin/make -B buildkernel KERNCONF=LINT >>> Kernel build for LINT started on Sat May 15 04:44:43 UTC 2010 >>> stage 1: 
configuring the kernel >>> stage 2.1: cleaning up the object tree >>> stage 2.2: rebuilding the object tree >>> stage 2.3: build tools >>> stage 3.1: making dependencies [...] @ -> /src/sys machine -> /src/sys/pc98/include i386 -> /src/sys/i386/include awk -f @/tools/makeobjops.awk @/kern/device_if.m -h awk -f @/tools/makeobjops.awk @/kern/bus_if.m -h awk -f @/tools/makeobjops.awk @/dev/pci/pci_if.m -h ln -sf /obj/pc98/src/sys/LINT/opt_inet.h opt_inet.h make: don't know how to make if_lem.c. Stop *** Error code 2 Stop in /src/sys/modules. *** Error code 1 Stop in /obj/pc98/src/sys/LINT. *** Error code 1 Stop in /src. *** Error code 1 Stop in /src. TB --- 2010-05-15 04:46:43 - WARNING: /usr/bin/make returned exit code 1 TB --- 2010-05-15 04:46:43 - ERROR: failed to build lint kernel TB --- 2010-05-15 04:46:43 - 3309.74 user 364.98 system 3980.45 real http://tinderbox.freebsd.org/tinderbox-releng_7-RELENG_7-i386-pc98.full ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
[releng_7 tinderbox] failure on i386/i386
TB --- 2010-05-15 03:21:53 - tinderbox 2.6 running on freebsd-stable.sentex.ca TB --- 2010-05-15 03:21:53 - starting RELENG_7 tinderbox run for i386/i386 TB --- 2010-05-15 03:21:53 - cleaning the object tree TB --- 2010-05-15 03:22:18 - cvsupping the source tree TB --- 2010-05-15 03:22:18 - /usr/bin/csup -z -r 3 -g -L 1 -h localhost -s /tinderbox/RELENG_7/i386/i386/supfile TB --- 2010-05-15 03:22:30 - building world TB --- 2010-05-15 03:22:30 - MAKEOBJDIRPREFIX=/obj TB --- 2010-05-15 03:22:30 - PATH=/usr/bin:/usr/sbin:/bin:/sbin TB --- 2010-05-15 03:22:30 - TARGET=i386 TB --- 2010-05-15 03:22:30 - TARGET_ARCH=i386 TB --- 2010-05-15 03:22:30 - TZ=UTC TB --- 2010-05-15 03:22:30 - __MAKE_CONF=/dev/null TB --- 2010-05-15 03:22:30 - cd /src TB --- 2010-05-15 03:22:30 - /usr/bin/make -B buildworld >>> World build started on Sat May 15 03:22:31 UTC 2010 >>> Rebuilding the temporary build tree >>> stage 1.1: legacy release compatibility shims >>> stage 1.2: bootstrap tools >>> stage 2.1: cleaning up the object tree >>> stage 2.2: rebuilding the object tree >>> stage 2.3: build tools >>> stage 3: cross tools >>> stage 4.1: building includes >>> stage 4.2: building libraries >>> stage 4.3: make dependencies >>> stage 4.4: building everything >>> World build completed on Sat May 15 04:26:42 UTC 2010 TB --- 2010-05-15 04:26:42 - generating LINT kernel config TB --- 2010-05-15 04:26:42 - cd /src/sys/i386/conf TB --- 2010-05-15 04:26:42 - /usr/bin/make -B LINT TB --- 2010-05-15 04:26:42 - building LINT kernel TB --- 2010-05-15 04:26:42 - MAKEOBJDIRPREFIX=/obj TB --- 2010-05-15 04:26:42 - PATH=/usr/bin:/usr/sbin:/bin:/sbin TB --- 2010-05-15 04:26:42 - TARGET=i386 TB --- 2010-05-15 04:26:42 - TARGET_ARCH=i386 TB --- 2010-05-15 04:26:42 - TZ=UTC TB --- 2010-05-15 04:26:42 - __MAKE_CONF=/dev/null TB --- 2010-05-15 04:26:42 - cd /src TB --- 2010-05-15 04:26:42 - /usr/bin/make -B buildkernel KERNCONF=LINT >>> Kernel build for LINT started on Sat May 15 04:26:42 UTC 2010 >>> stage 1: 
configuring the kernel >>> stage 2.1: cleaning up the object tree >>> stage 2.2: rebuilding the object tree >>> stage 2.3: build tools >>> stage 3.1: making dependencies [...] ===> em (depend) @ -> /src/sys machine -> /src/sys/i386/include awk -f @/tools/makeobjops.awk @/kern/device_if.m -h awk -f @/tools/makeobjops.awk @/kern/bus_if.m -h awk -f @/tools/makeobjops.awk @/dev/pci/pci_if.m -h ln -sf /obj/src/sys/LINT/opt_inet.h opt_inet.h make: don't know how to make if_lem.c. Stop *** Error code 2 Stop in /src/sys/modules. *** Error code 1 Stop in /obj/src/sys/LINT. *** Error code 1 Stop in /src. *** Error code 1 Stop in /src. TB --- 2010-05-15 04:29:04 - WARNING: /usr/bin/make returned exit code 1 TB --- 2010-05-15 04:29:04 - ERROR: failed to build lint kernel TB --- 2010-05-15 04:29:04 - 3362.72 user 358.23 system 4030.56 real http://tinderbox.freebsd.org/tinderbox-releng_7-RELENG_7-i386-i386.full ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
[releng_7 tinderbox] failure on amd64/amd64
TB --- 2010-05-15 02:06:57 - tinderbox 2.6 running on freebsd-stable.sentex.ca TB --- 2010-05-15 02:06:57 - starting RELENG_7 tinderbox run for amd64/amd64 TB --- 2010-05-15 02:06:57 - cleaning the object tree TB --- 2010-05-15 02:07:27 - cvsupping the source tree TB --- 2010-05-15 02:07:27 - /usr/bin/csup -z -r 3 -g -L 1 -h localhost -s /tinderbox/RELENG_7/amd64/amd64/supfile TB --- 2010-05-15 02:07:38 - building world TB --- 2010-05-15 02:07:38 - MAKEOBJDIRPREFIX=/obj TB --- 2010-05-15 02:07:38 - PATH=/usr/bin:/usr/sbin:/bin:/sbin TB --- 2010-05-15 02:07:38 - TARGET=amd64 TB --- 2010-05-15 02:07:38 - TARGET_ARCH=amd64 TB --- 2010-05-15 02:07:38 - TZ=UTC TB --- 2010-05-15 02:07:38 - __MAKE_CONF=/dev/null TB --- 2010-05-15 02:07:38 - cd /src TB --- 2010-05-15 02:07:38 - /usr/bin/make -B buildworld >>> World build started on Sat May 15 02:07:39 UTC 2010 >>> Rebuilding the temporary build tree >>> stage 1.1: legacy release compatibility shims >>> stage 1.2: bootstrap tools >>> stage 2.1: cleaning up the object tree >>> stage 2.2: rebuilding the object tree >>> stage 2.3: build tools >>> stage 3: cross tools >>> stage 4.1: building includes >>> stage 4.2: building libraries >>> stage 4.3: make dependencies >>> stage 4.4: building everything >>> stage 5.1: building 32 bit shim libraries >>> World build completed on Sat May 15 03:38:13 UTC 2010 TB --- 2010-05-15 03:38:13 - generating LINT kernel config TB --- 2010-05-15 03:38:13 - cd /src/sys/amd64/conf TB --- 2010-05-15 03:38:13 - /usr/bin/make -B LINT TB --- 2010-05-15 03:38:13 - building LINT kernel TB --- 2010-05-15 03:38:13 - MAKEOBJDIRPREFIX=/obj TB --- 2010-05-15 03:38:13 - PATH=/usr/bin:/usr/sbin:/bin:/sbin TB --- 2010-05-15 03:38:13 - TARGET=amd64 TB --- 2010-05-15 03:38:13 - TARGET_ARCH=amd64 TB --- 2010-05-15 03:38:13 - TZ=UTC TB --- 2010-05-15 03:38:13 - __MAKE_CONF=/dev/null TB --- 2010-05-15 03:38:13 - cd /src TB --- 2010-05-15 03:38:13 - /usr/bin/make -B buildkernel KERNCONF=LINT >>> Kernel build for LINT 
started on Sat May 15 03:38:13 UTC 2010 >>> stage 1: configuring the kernel >>> stage 2.1: cleaning up the object tree >>> stage 2.2: rebuilding the object tree >>> stage 2.3: build tools >>> stage 3.1: making dependencies [...] ===> em (depend) @ -> /src/sys machine -> /src/sys/amd64/include awk -f @/tools/makeobjops.awk @/kern/device_if.m -h awk -f @/tools/makeobjops.awk @/kern/bus_if.m -h awk -f @/tools/makeobjops.awk @/dev/pci/pci_if.m -h ln -sf /obj/amd64/src/sys/LINT/opt_inet.h opt_inet.h make: don't know how to make if_lem.c. Stop *** Error code 2 Stop in /src/sys/modules. *** Error code 1 Stop in /obj/amd64/src/sys/LINT. *** Error code 1 Stop in /src. *** Error code 1 Stop in /src. TB --- 2010-05-15 03:40:22 - WARNING: /usr/bin/make returned exit code 1 TB --- 2010-05-15 03:40:22 - ERROR: failed to build lint kernel TB --- 2010-05-15 03:40:22 - 4650.35 user 524.41 system 5605.26 real http://tinderbox.freebsd.org/tinderbox-releng_7-RELENG_7-amd64-amd64.full ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
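All of the tinderbox reports above fail the same way: make has no rule for if_lem.c while building the em module. When several architectures report at once, pulling the failing target and the "Stop in" directories out of each log makes that easy to confirm. A minimal Python sketch, assuming only the literal wording seen in these reports; the function name and the sample string are made up for illustration:

```python
import re

# Patterns match the literal wording seen in the tinderbox reports above.
MISSING_TARGET = re.compile(r"make: don't know how to make (\S+)\. Stop")
STOP_DIR = re.compile(r"Stop in (\S+)\.")


def summarize_failure(log: str) -> dict:
    """Return the targets make could not build and the directories it stopped in."""
    return {
        "missing": MISSING_TARGET.findall(log),
        "stopped_in": STOP_DIR.findall(log),
    }


# Excerpt of the amd64 report, joined into one string for the example.
sample = (
    "ln -sf /obj/amd64/src/sys/LINT/opt_inet.h opt_inet.h "
    "make: don't know how to make if_lem.c. Stop *** Error code 2 "
    "Stop in /src/sys/modules. *** Error code 1 "
    "Stop in /obj/amd64/src/sys/LINT. *** Error code 1"
)

print(summarize_failure(sample))
```

Running the same function over each architecture's log and comparing the "missing" lists would show whether the failures share a single cause.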
RE: Read / write timeouts on SATA disks connected to ICH9
On Fri May 14 22:42:38 UTC 2010, Jeremy Chadwick wrote:
> Finally, your vmstat -i output:
>
> > # vmstat -i
> > interrupt total rate
> > irq23: atapci0 371021299 10423
>
> Good to know there's no IRQ sharing going on, but what does worry me is
> the interrupt rate (10K interrupts/second). That seems *extremely*
> high, but it also depends on what kind of disk I/O is happening on this
> system -- especially since you have 2 disks attached to the same
> controller.

I have a bunch of R300's here. From one that is using the on-board SATA and 2 drives in a gmirror setup (very similar to the OP) after 18 hours of uptime:

[0:2] speedtest:~> vmstat -i
interrupt                          total       rate
irq23: atapci0                    254116          3

I haven't specifically done any stress testing on this box, though I did do a "make -j8 buildworld" during the initial gmirror synchronization. 8-}

The drives are a pair of Dell-labeled 160GB "SAMSUNG HE161HJ 1AC01121" that shipped with the box.

I also have another R300 with Dell's "SAS 6/iR" card (a re-branded LSI 1068-something, seen as "mpt" by FreeBSD). While Dell only sells that as part of a package deal with the hot-swap backplane and redundant power supplies, there's no reason you couldn't pick one up on eBay and add it yourself. You'll need some sort of breakout cable to get from the big connector on the SAS 6 to individual SATA ports.

Terry Kennedy http://www.tmk.com te...@tmk.com New York, NY USA
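When comparing the two vmstat -i rows above, it helps to keep in mind that the rate column appears to be an average since boot (total divided by seconds of uptime) rather than a current rate. That is an assumption here and worth checking against vmstat(8). A minimal Python sketch under that assumption, with made-up function names:

```python
def average_rate(total: int, uptime_seconds: int) -> int:
    """Average interrupts per second since boot, truncated like an integer column."""
    return total // uptime_seconds


def implied_uptime_seconds(total: int, rate: int) -> float:
    """Uptime implied by a row, assuming rate is roughly total / uptime."""
    return total / rate


# Terry's box: 254116 interrupts after 18 hours of uptime.
print(average_rate(254116, 18 * 3600))  # 3, matching the rate column

# The OP's row: 371021299 interrupts at a reported rate of 10423 per second.
hours = implied_uptime_seconds(371021299, 10423) / 3600
print(f"implied uptime: about {hours:.1f} hours")
```

If the assumption holds, the OP's machine had been up roughly ten hours when the counters were read, so the 10K/second figure is a sustained average and not a momentary burst.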
Re: Read / write timeouts on SATA disks connected to ICH9
On Fri, May 14, 2010 at 11:09:28PM +0200, Pieter de Boer wrote:
> The ad4 SMART output is showing errors, as this disk is indeed
> broken now. It wasn't before and it is a replacement of another disk
> that wasn't broken either. Grmbl, I now see reallocated sectors on
> ad6 as well, in the smartctl output. So both disks look wonky;
> although afaik that's not the main issue here.

Lots to say about all of this. Focusing on drive ad4 (Western Digital):

The disk has 1 uncorrected sector (Attribute 198). This means the drive tried to remap it and was not successful. This could have happened any time during the lifetime of the drive. There are no pending sector reallocations (Attribute 197) (meaning there aren't others which are bad which the drive is waiting to attempt remapping for), and there are no remapped sectors (Attribute 5). There have been no successful reallocation attempts during the drive's lifetime (Attribute 196).

In general, I would say this is acceptable. If Attribute 198 was higher, or you had other pending sectors which needed to be remapped, I'd say replace the disk.

UDMA/CRC error count (Attribute 199) is zero. That's good -- it means that most likely cabling issues can be ruled out, since the attribute tracks the number of communication errors between the controller and the disk PCB.

Drive temperature looks good, so nothing to worry about there.

The drive itself has detected numerous error conditions in the SMART error log during its lifetime -- a total of 48, but SMART only lists the most recent 5. The drive has been online for a total of 827 hours (Attribute 9), which we can use to determine how recently the drive experienced said errors.
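The attribute reading above can be written down as a small checklist. The following Python sketch is only an illustration: the raw values are typed in by hand from the message (nothing is queried from a drive), the replace rule is one reading of "if Attribute 198 was higher, or you had other pending sectors", and the function names are hypothetical. The 817-hour figure is the power-on time at which the logged errors were recorded.

```python
# Raw values for ad4 as described in the message, keyed by SMART attribute ID.
AD4_RAW = {5: 0, 9: 827, 196: 0, 197: 0, 198: 1, 199: 0}


def triage(raw: dict) -> list:
    """Turn raw SMART attribute values into short notes, following the message's reasoning."""
    notes = []
    if raw[198] > 0:
        notes.append(f"{raw[198]} uncorrectable sector(s) (attribute 198)")
    if raw[197] > 0:
        notes.append(f"{raw[197]} sector(s) pending reallocation (attribute 197)")
    if raw[5] > 0 or raw[196] > 0:
        notes.append("sectors have been reallocated (attributes 5/196)")
    if raw[199] > 0:
        notes.append("CRC errors seen, check cabling (attribute 199)")
    # Assumed rule of thumb: more than one uncorrectable sector, or any pending one.
    if raw[198] > 1 or raw[197] > 0:
        notes.append("consider replacing the disk")
    return notes


def hours_since_error(raw: dict, error_power_on_hours: int) -> int:
    """How long ago an error was logged, measured in power-on hours (attribute 9)."""
    return raw[9] - error_power_on_hours


print(triage(AD4_RAW))
print(hours_since_error(AD4_RAW, 817))  # 10
```

With these numbers the logged errors are only about ten power-on hours old, which is why they are worth lining up against the timeouts FreeBSD reported.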
Let's examine the first 3:

> Error 48 occurred at disk power-on lifetime: 817 hours (34 days + 1 hours)
> 40 51 00 9d 84 0e e0 Error: UNC at LBA = 0x000e849d = 951453
> c8 00 20 00 84 0e 00 00 00:45:18.204 READ DMA
>
> Error 47 occurred at disk power-on lifetime: 817 hours (34 days + 1 hours)
> 40 51 00 0c 9d 0e e0 Error: UNC at LBA = 0x000e9d0c = 957708
> c8 00 80 00 9b 0e 00 00 00:03:08.605 READ DMA
>
> Error 46 occurred at disk power-on lifetime: 817 hours (34 days + 1 hours)
> 40 51 00 9d 84 0e e0 Error: UNC at LBA = 0x000e849d = 951453
> c8 00 80 80 82 0e 00 00 00:03:05.176 READ DMA

Okay, it's probably safe to assume these are all signs of the uncorrected sector.

When a drive attempts an LBA remap -- which in this case it did, but failed -- it can spend quite a bit of time doing that; in some cases minutes, not seconds. The drive essentially "locks up" during this time (from the perspective of the SATA controller) -- it's literally spending all of its time trying to read and re-read the LBA/sector in different ways, hoping to get the data out of it (and/or correct it with ECC) so that it can be written to a spare block and then internally the bad LBA won't ever be used again.

What the OS ends up seeing in this situation is disk timeouts. This is completely normal.

The WD Caviar Black drives have a useful feature called TLER -- it's disabled by default, for reasons which I don't want to get into here -- which can force the drive to internally give up after X seconds (it's user-selectable) when dealing with such remapping/errors. The idea is to keep the drive from being deemed dead from the OS/controller's point of view. I believe Seagate, Hitachi, or Samsung (I forget which) have this feature as well, but it's not called TLER.

Anyway, so this is probably the cause of one detachment/timeout you've seen FreeBSD report.
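The LBA that smartctl prints for each error can be rebuilt from the seven register bytes on the line before it. The sketch below assumes those bytes are in the order ER ST SC SN CL CH DH (error, status, sector count, LBA low, LBA mid, LBA high, device), which is how smartctl labels them in its error log, and that the low nibble of the device byte carries LBA bits 27:24 as in 28-bit addressing. The function name is made up for illustration.

```python
UNC_BIT = 0x40  # bit 6 of the ATA Error register: uncorrectable data error


def decode_error_registers(line: str) -> dict:
    """Decode 'ER ST SC SN CL CH DH' hex bytes into a 28-bit LBA and a UNC flag."""
    er, st, sc, sn, cl, ch, dh = (int(tok, 16) for tok in line.split())
    lba = ((dh & 0x0F) << 24) | (ch << 16) | (cl << 8) | sn
    return {"lba": lba, "unc": bool(er & UNC_BIT)}


# Register lines copied from errors 48 and 47 quoted above.
for regs in ("40 51 00 9d 84 0e e0", "40 51 00 0c 9d 0e e0"):
    info = decode_error_registers(regs)
    flag = "UNC" if info["unc"] else "no UNC"
    print(f"{regs} -> {info['lba']:#010x} = {info['lba']} ({flag})")
```

Both decoded values agree with the "UNC at LBA" figures smartctl printed (951453 and 957708), and errors 48 and 46 point at the same sector.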
Let's move on to the 2 remaining errors:

> Error 45 occurred at disk power-on lifetime: 817 hours (34 days + 1 hours)
> 40 51 08 20 47 6c e0 Error: UNC at LBA = 0x006c4720 = 7096096
> c4 ff 08 ff 46 6c 00 00 00:01:09.459 READ MULTIPLE
>
> Error 44 occurred at disk power-on lifetime: 817 hours (34 days + 1 hours)
> 40 51 08 21 8e 67 e0 Error: UNC at LBA = 0x00678e21 = 6786593
> c4 ff 04 3f 2f 00 00 00 00:01:00.724 READ MULTIPLE

These two happened around the same time (within 10 seconds of one another). I'm under the impression that these are *probably* the result of the above uncorrected sector issue, but I'm not 100% certain. Here's why I think that:

- The errors occurred within the same hour mark (817) as the previous 3 errors,
- The errors happened only 2 minutes prior to the preceding 3,
- The drive was in the process of executing READ MULTIPLE (cmd 0xc4), which tells the disk to read multiple logical sectors within 1 pass.

The ATA-8 specification states that READ MULTIPLE is a PIO command. I'm not sure how/why FreeBSD would be submitting this to a disk unless the communication protocol had been downgraded from DMA to PIO. mav@ might have some insights on this, as well as how to decode some of the SMART error data shown. It looks like the 48-bit read input block is written in reverse order (word 5 to word 0).

If you want to find out the exact L
Re: Read / write timeouts on SATA disks connected to ICH9
> > My question: does anyone have experience with FreeBSD on a Dell R300
> > or can anyone give me some help in trying to fix the timeouts?
>
> Could you please do the following:
>
> - Provide output from "vmstat -i"
> - Provide output from "dmesg | grep -i ata"
> - Install ports/sysutils/smartmontools (5.40 or later) and provide full
>   output from commands "smartctl -a /dev/ad4" and "smartctl -a /dev/ad6"

The ad4 SMART output is showing errors, as this disk is indeed broken now. It wasn't before and it is a replacement of another disk that wasn't broken either. Grmbl, I now see reallocated sectors on ad6 as well, in the smartctl output. So both disks look wonky; although afaik that's not the main issue here.

I've attached the smartctl output as separate files. smartmontools 5.40 does not appear to exist; I used 5.39.1, the latest port version.

Attached also the vmstat -i and dmesg output.

-- Pieter

smartctl 5.39.1 2010-01-28 r3054 [FreeBSD 8.0-RELEASE-p1 i386] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Caviar Black family
Device Model:     WDC WD5001AALS-00L3B2
Serial Number:    WD-WCASYA964063
Firmware Version: 01.03B01
User Capacity:    500,107,862,016 bytes
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Fri May 14 23:01:49 2010 CEST
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status: (0x85) Offline data collection activity was aborted by an interrupting command from host. Auto Offline Data Collection: Enabled.
Self-test execution status: ( 241) Self-test routine in progress... 10% of test remaining.
Total time to complete Offline data collection: (11160) seconds.
Offline data collection capabilities: (0x7b)
    SMART execute Offline immediate.
    Auto Offline data collection on/off support.
    Suspend Offline collection upon new command.
    Offline surface scan supported.
    Self-test supported.
    Conveyance Self-test supported.
    Selective Self-test supported.
SMART capabilities: (0x0003)
    Saves SMART data before entering power-saving mode.
    Supports SMART auto save timer.
Error logging capability: (0x01)
    Error logging supported.
    General Purpose Logging supported.
Short self-test routine recommended polling time: ( 2) minutes.
Extended self-test routine recommended polling time: ( 131) minutes.
Conveyance self-test routine recommended polling time: ( 5) minutes.
SCT capabilities: (0x3037)
    SCT Status supported.
    SCT Feature Control supported.
    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always   -           78
  3 Spin_Up_Time            0x0027   184   168   021    Pre-fail  Always   -           3791
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always   -           992
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always   -           0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always   -           0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always   -           827
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always   -           0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always   -           0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always   -           990
192 Power-Off_Retract_Count 0x0032   199   199   000    Old_age   Always   -           989
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always   -           992
194 Temperature_Celsius     0x0022   125   109   000    Old_age   Always   -           22
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always   -           0
197 Current_Pending_Sector  0x0032   200   198   000    Old_age   Always   -
Re: Read / write timeouts on SATA disks connected to ICH9
On Fri, May 14, 2010 at 07:42:33PM +0200, Pieter de Boer wrote:
> Hi list,
>
> I'm running FreeBSD 8.0-RELEASE-p1 on a Dell R300 which has a ICH9
> SATA controller on-board (do not have the RAID controller).
>
> The system has 2 disks in a gmirror setup. Every now and then,
> probably under some load, one of the disks gets read or write
> timeouts like:
> May 5 03:01:37 aberdeen kernel: ad4: timeout waiting to issue command
> May 5 03:01:37 aberdeen kernel: ad4: error issuing WRITE_DMA48 command
> May 5 03:01:37 aberdeen kernel: GEOM_MIRROR: Request failed
> (error=5). ad4[WRITE(offset=200404975104, length=16384)]
> May 5 03:01:37 aberdeen kernel: GEOM_MIRROR: Device gm0: provider
> ad4 disconnected.
>
> or:
>
> May 13 14:41:26 aberdeen kernel: ad6: TIMEOUT - READ_DMA48 retrying
> (1 retry left) LBA=975513887
>
> Sometimes the read/write succeeds after a few retries, but sometimes
> it does not, so geom_mirror throws the disk out of the mirror.
>
> Tonight ad6 was thrown out of the mirror and ad4 then gave actual
> read errors, resulting in a big mess :(
>
> My question: does anyone have experience with FreeBSD on a Dell R300
> or can anyone give me some help in trying to fix the timeouts?

Could you please do the following:

- Provide output from "vmstat -i"
- Provide output from "dmesg | grep -i ata"
- Install ports/sysutils/smartmontools (5.40 or later) and provide full output from commands "smartctl -a /dev/ad4" and "smartctl -a /dev/ad6"

--
| Jeremy Chadwick j...@parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, USA |
| Making life hard for others since 1977. PGP: 4BD6C0CB |
RE: Crash dump problem - sleeping thread owns a non-sleepable lock during crash dump write
> Oops, you're right that other CPUs are running.
>
> The stop_cpus() call is only made if kdb is entered. doadump() is called
> out of boot() which comes later. At Isilon we've been running with a patch
> that does stop_cpus() pretty close to the front of panic(9).

This is interesting, and changing the behavior will probably allow the crash dump for the original problem (repeatable crash in the bce driver) to be analyzed.

At the moment, I'm more interested in dealing with the original problem of the crash in bce. Right now, I'm running this vendor's product under Linux compatibility mode. The vendor is hard at work building a native FreeBSD version of their product. One of two things is going to happen here: 1) the crash doesn't happen in native mode due to different code paths being taken, and I lose the ability to reproduce the crash when the box goes into production, or 2) the crash continues to happen and the vendor gets the impression FreeBSD is unstable and not worth supporting. I'd like to avoid that.

So, any ideas on how to troubleshoot the panic in bce?

Thanks,
Terry Kennedy http://www.tmk.com te...@tmk.com New York, NY USA
Re: Read / write timeouts on SATA disks connected to ICH9
Adam Vande More wrote:
> > May 5 03:01:37 aberdeen kernel: ad4: timeout waiting to issue command
> > May 5 03:01:37 aberdeen kernel: ad4: error issuing WRITE_DMA48 command
> > May 5 03:01:37 aberdeen kernel: GEOM_MIRROR: Request failed (error=5).
> > ad4[WRITE(offset=200404975104, length=16384)]
> > May 5 03:01:37 aberdeen kernel: GEOM_MIRROR: Device gm0: provider ad4
> > disconnected.
>
> Have you tried replacing/checking the cables? Does it always happen to ad4?
> Your drive could be dying, try swapping it out and see if the errors persist.

It happens to both drives, and it also happened to both of the drives that I replaced with these a month ago. I didn't replace the cables back then, but they were correctly attached then and still are. It would also be odd for both cables to be broken at the same time.

-- Pieter
Re: Read / write timeouts on SATA disks connected to ICH9
On Fri, May 14, 2010 at 12:42 PM, Pieter de Boer wrote:
> I'm running FreeBSD 8.0-RELEASE-p1 on a Dell R300 which has a ICH9 SATA
> controller on-board (do not have the RAID controller).
>
> The system has 2 disks in a gmirror setup. Every now and then, probably
> under some load, one of the disks gets read or write timeouts like:
> May 5 03:01:37 aberdeen kernel: ad4: timeout waiting to issue command
> May 5 03:01:37 aberdeen kernel: ad4: error issuing WRITE_DMA48 command
> May 5 03:01:37 aberdeen kernel: GEOM_MIRROR: Request failed (error=5).
> ad4[WRITE(offset=200404975104, length=16384)]
> May 5 03:01:37 aberdeen kernel: GEOM_MIRROR: Device gm0: provider ad4
> disconnected.
>

Have you tried replacing/checking the cables? Does it always happen to ad4? Your drive could be dying, try swapping it out and see if the errors persist.

--
Adam Vande More
Read / write timeouts on SATA disks connected to ICH9
Hi list,

I'm running FreeBSD 8.0-RELEASE-p1 on a Dell R300 which has an ICH9 SATA controller on-board (I do not have the RAID controller).

The system has 2 disks in a gmirror setup. Every now and then, probably under some load, one of the disks gets read or write timeouts like:

May 5 03:01:37 aberdeen kernel: ad4: timeout waiting to issue command
May 5 03:01:37 aberdeen kernel: ad4: error issuing WRITE_DMA48 command
May 5 03:01:37 aberdeen kernel: GEOM_MIRROR: Request failed (error=5). ad4[WRITE(offset=200404975104, length=16384)]
May 5 03:01:37 aberdeen kernel: GEOM_MIRROR: Device gm0: provider ad4 disconnected.

or:

May 13 14:41:26 aberdeen kernel: ad6: TIMEOUT - READ_DMA48 retrying (1 retry left) LBA=975513887

Sometimes the read/write succeeds after a few retries, but sometimes it does not, so geom_mirror throws the disk out of the mirror.

Tonight ad6 was thrown out of the mirror and ad4 then gave actual read errors, resulting in a big mess :(

My question: does anyone have experience with FreeBSD on a Dell R300 or can anyone give me some help in trying to fix the timeouts?

I was told using AHCI could be better for SATA disks, but apparently (http://permalink.gmane.org/gmane.linux.kernel.pci/8267) the BIOS does not support turning that on, so that does not appear to be an option.

Thanks,
Pieter
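The two kinds of log line above report positions in different units: GEOM_MIRROR gives a byte offset and length, while the ata driver gives an LBA. To compare them, or to line them up with LBAs from a SMART error log, the byte range can be converted to sectors. A minimal Python sketch, assuming 512-byte logical sectors and that the offset is relative to the start of ad4; both are assumptions, and the function name is made up for illustration:

```python
SECTOR_SIZE = 512  # assumed logical sector size in bytes


def byte_range_to_sectors(offset: int, length: int, sector_size: int = SECTOR_SIZE):
    """Return (first_lba, sector_count) covering a byte range on the device."""
    first_lba, remainder = divmod(offset, sector_size)
    # Round up so a range that is not sector-aligned is still fully covered.
    sector_count = -(-(remainder + length) // sector_size)
    return first_lba, sector_count


# Values from the GEOM_MIRROR line: WRITE(offset=200404975104, length=16384)
first, count = byte_range_to_sectors(200404975104, 16384)
print(first, count)  # 391415967 32
```

Under those assumptions the failed write covered 32 sectors starting at LBA 391415967 on ad4.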
Re: Crash dump problem - sleeping thread owns a non-sleepable lock during crash dump write
Matthew Fleming wrote:

> > As an aside, this is a quad-core in one package CPU (an X3363). On both
> > this box and a similar one with an X5470, console messages continue to
> > print out after "the system has been halted - press any key to reboot" -
> > in particular, the shutdown makes a bunch of the "behind the scenes"
> > management stuff like the virtual keyboard and monitor appear. Plugging or
> > unplugging USB devices will go through the whole deal of detecting and
> > making their service available.
>
> Oops, you're right that other CPUs are running. The stop_cpus() call is
> only made if kdb is entered. doadump() is called out of boot() which comes
> later.
>
> At Isilon we've been running with a patch that does stop_cpus() pretty
> close to the front of panic(9).
>
> As a design decision it seems reasonable to call stop_cpus() early in
> panic(9) simply because most causes for panic mean something unexpected,
> and the sooner the other CPUs aren't running the more likely it is that
> they don't do more damage, leaving the system in a more useful state for
> dump or {g,d}db analysis. This should be done before dump or entering kdb.
>
> I'm cc'ing -current@ since I would like a small discussion of moving the
> stop_cpus() to earlier in panic. If this change is agreeable I can roll up
> a patch and test it on CURRENT. I'm not sure yet how much of the other
> panic-related changes we have made at Isilon would be required.

Work along these lines has been done at Panasas. We were planning on putting it back to the community. There turn out to be lots of edge cases from changing this that we're still sorting through.
RE: Crash dump problem - sleeping thread owns a non-sleepable lock during crash dump write
> As an aside, this is a quad-core in one package CPU (an X3363). On both
> this box and a similar one with an X5470, console messages continue to
> print out after "the system has been halted - press any key to reboot" -
> in particular, the shutdown makes a bunch of the "behind the scenes"
> management stuff like the virtual keyboard and monitor appear. Plugging or
> unplugging USB devices will go through the whole deal of detecting and
> making their service available.

Oops, you're right that other CPUs are running. The stop_cpus() call is only made if kdb is entered. doadump() is called out of boot() which comes later.

At Isilon we've been running with a patch that does stop_cpus() pretty close to the front of panic(9).

As a design decision it seems reasonable to call stop_cpus() early in panic(9) simply because most causes for panic mean something unexpected, and the sooner the other CPUs aren't running the more likely it is that they don't do more damage, leaving the system in a more useful state for dump or {g,d}db analysis. This should be done before dump or entering kdb.

I'm cc'ing -current@ since I would like a small discussion of moving the stop_cpus() to earlier in panic. If this change is agreeable I can roll up a patch and test it on CURRENT. I'm not sure yet how much of the other panic-related changes we have made at Isilon would be required.

Thanks,
matthew
Re: Any chance of someone commiting the patch in bin/131861 ?
> Do you feel strongly about merging the fix to 8 or 7 or both?

Not really - it's such a small change that it would seem a shame not to commit it to earlier releases, but then I used 7 throughout its lifetime with the bug only being a minor annoyance.

-pete.
Re: Enabling watchdog
On Fri, May 14, 2010 at 3:15 PM, Jeremy Chadwick wrote: > > I'm a bit confused at this point, Doug. At what point did the OP state > he has IPMI support or IPMI cards in his system? > He said he had a Dell PowerEdge 2950 - iirc these all have IPMI. Cheers Tom ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: Any chance of someone commiting the patch in bin/131861 ?
On Fri, 14.05.2010 at 10:17:23 +0100, Pete French wrote:
> > Postfix will re-write this as part of sanitization, so I had to revert
> > to creating mbox files by hand. Anyway, could you please test the
> > following patch with a wider variety of mails?
>
> I've been testing your patch for a few weeks now as my main email
> client, and I haven't encountered any problems - it also does fix
> the reply issues I was originally having. Do you want to attach it
> to the PR ? After that maybe someone could commit it - I am pretty
> certain it doesn't actually break any existing behaviour.
>
> cheers,
>
> -pete.

I'll try to get a review by some fellow FreeBSD dev who is more familiar with our mail(1) history and then commit the change eventually. Do you feel strongly about merging the fix to 8 or 7 or both?

Regards,
Uli
Re: Enabling watchdog
On Fri, May 14, 2010 at 03:21:42PM +0100, Tom Evans wrote: > On Fri, May 14, 2010 at 3:15 PM, Jeremy Chadwick > wrote: > > > > I'm a bit confused at this point, Doug. At what point did the OP state > > he has IPMI support or IPMI cards in his system? > > > > He said he had a Dell PowerEdge 2950 - iirc these all have IPMI. Ah, thanks Tom! I had no idea. It surprised me when the conversation turned from software watchdogs to IPMI hardware watchdogs. :-) -- | Jeremy Chadwick j...@parodius.com | | Parodius Networking http://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, USA | | Making life hard for others since 1977. PGP: 4BD6C0CB | ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: Enabling watchdog
Tom Evans writes:
| On Fri, May 14, 2010 at 3:15 PM, Jeremy Chadwick wrote:
| >
| > I'm a bit confused at this point, Doug. At what point did the OP state
| > he has IPMI support or IPMI cards in his system?
|
| He said he had a Dell PowerEdge 2950 - iirc these all have IPMI.

... and although HW WD doesn't have to be in IPMI, I know for a fact it is on the base config. of a Dell PE2950 and has been since the PE2650. However, on the 2650 I saw false trips. It was one of the reasons I wrote ipmi(4). Eventually, I need to get in sync with jhb to add kernel back-trace support to it. I have some code at work to do it but it needs some work to ensure it works in every case etc.

BTW, there is code/patches floating around to control the LCD on these Dell machines via ipmitool and on the r710 control attributes of the LCD. Unfortunately the ipmitool folks haven't picked it up.

Doug A.
Re: Crash dump problem - sleeping thread owns a non-sleepable lock during crash dump write
On Fri, May 14, 2010 at 09:56:47AM -0400, Terry Kennedy wrote:
> As an aside, this is a quad-core in one package CPU (an X3363). On both
> this box and a similar one with an X5470, console messages continue to
> print out after "the system has been halted - press any key to reboot" -
> in particular, the shutdown makes a bunch of the "behind the scenes"
> management stuff like the virtual keyboard and monitor appear. Plugging or
> unplugging USB devices will go through the whole deal of detecting and
> making their service available.
>
> I know the other CPUs are considered to still be running (hence the
> "halting other CPUs" when you press a key to reboot), but this is the
> first time I've seen device detection, attachment, etc. show up on the
> console after a shutdown.
>
> Is this behavior to be expected, or is it as unexpected as it was to
> me? Systems are Dell Poweredge R300's, 8-STABLE amd64.

I've seen this behaviour before (on non-Dell hardware). I'm under the impression there's an interrupt handler that isn't being unloaded, and that the driver framework within the kernel does not "unload" on FreeBSD.

What exactly does FreeBSD do on a system halt? I'm under the impression the OS should be unloading its interrupt handlers and then execute the HLT opcode on each processor/core.

I don't have a tendency to halt my Supermicro systems, but shutdown -r now or shutdown -p now is pretty common. I've noticed an overall improvement with regards to the shutdown procedure and how long things take during the final phases (after filesystems are unmounted, etc.) with the below sysctl set (in /etc/sysctl.conf, but you can set it in real-time via command-line).

hw.acpi.handle_reboot=1

--
| Jeremy Chadwick j...@parodius.com |
| Parodius Networking http://www.parodius.com/ |
| UNIX Systems Administrator Mountain View, CA, USA |
| Making life hard for others since 1977. PGP: 4BD6C0CB |
RE: Crash dump problem - sleeping thread owns a non-sleepable lock during crash dump write
> > Hmm. You could try changing the code to not do a nested panic in that
> > case. You would update subr_turnstile.c to just return if panicstr is
> > not NULL rather than calling panic. However, there is still a good
> > chance you will end up deadlocking in that case. I have another patch I
> > can send you next week that prevents blocking on mutexes during a panic
> > which may also help.
>
> It would be instructive to know exactly why we were in turnstile(9) but
> it's likely due to mtx contention.
>
> AIX has some code at the beginning of all the locking operations to avoid
> taking locks if we were running code out of kdb, though getting that worked
> out was slightly tricky with our variant of mtx_assert(9). I seem to recall
> there was also some "lockbusting" code that forcibly reset all owned locks
> to have no owner, at least in some paths.
>
> Given that the system is single-cpu and should be single-threaded when
> dumping, this seems to me to be something worth working through to get
> more reliable dumps. Except for mtx_assert(9) I can't think of a reason
> to take locks once we start dumping or when in the debugger.

As an aside, this is a quad-core in one package CPU (an X3363). On both this box and a similar one with an X5470, console messages continue to print out after "the system has been halted - press any key to reboot" - in particular, the shutdown makes a bunch of the "behind the scenes" management stuff like the virtual keyboard and monitor appear. Plugging or unplugging USB devices will go through the whole deal of detecting and making their service available.

I know the other CPUs are considered to still be running (hence the "halting other CPUs" when you press a key to reboot), but this is the first time I've seen device detection, attachment, etc. show up on the console after a shutdown.

Is this behavior to be expected, or is it as unexpected as it was to me? Systems are Dell Poweredge R300's, 8-STABLE amd64.
> As an aside, with terribly corrupted locks I've seen double panics when the
> attempt to print the lock name faulted in strlen(9) called for printf(9),
> due to a bad lockname pointer. We have been able to get enough info off
> these crashes to debug them, but it's useful to remember that the system
> may be in a very unstable state depending on why it panics.

True. In these crashes, the system is doing essentially nothing except the one application (which, unfortunately, I don't have the source code for). The second crash happened right after booting the system, logging in, and firing off the application. It left an identical footprint (other than the 0x10 byte offset due to a recompiled kernel) from the first one, where the system had been up for 13+ hours. So, in this case I don't think there was a bunch of corruption piling up which triggered the fault, but instead the one simple operation and right away - splat!

As I mentioned in the original posting, I'd be glad to give a developer complete access to the system via the remote console (Dell DRAC 5 web interface) and to the underlying FreeBSD if it'll help pin down the problem.

Another thing I could try (would take a couple days until I could get someone to the site) would be to try this using a bge port instead of the bce one. That might help pin it down to either something in the bce-specific code path, or somewhere else in the stack.

Thanks,
Terry Kennedy http://www.tmk.com te...@tmk.com
New York, NY USA
Re: Enabling watchdog
On Fri, May 14, 2010 at 07:16:28AM -0700, Doug Ambrisko wrote: > rihad writes: > | On 05/14/2010 04:13 AM, Doug Ambrisko wrote: > | > rihad writes: > | > | Hi, I'm thinking of enabling the watchdog on our Dell PowerEdge 2950 / > | > | FreeBSD 8.0 amd64, so that it reboots the machine in case of lockups. > | > | Right now it doesn't work: > | > | > | > | # watchdog > | > | watchdog: patting the dog: Operation not supported > | > | # > | > | Looking through the kernel configuration I found two relevant settings: > | > | In /sys/conf/NOTES: > | > | # > | > | # Add software watchdog routines. > | > | # > | > | options SW_WATCHDOG > | > | > | > | and in /sys/amd64/conf/NOTES: > | > | # > | > | # Watchdog routines. > | > | # > | > | options MP_WATCHDOG > | > | > | > | Which of them should I rebuild the kernel with? BTW, the existing kernel > | > | is built with the default "options SCHED_ULE" to make good use of > | > | multiple CPUs, does watchdog work with it? > | > > | > If no one has said yet, kldload ipmi then run watchdogd. ... or compile > | > it into the kernel. This will enable the IPMI HW watchdog. If it > triggers, > | > it will appear in the IPMI SEL (ipmitool sel list). > | > | Thanks. So did I understand it right that I should first install > | sysutils/ipmitool, then start polling "ipmitool sel list" in a shell > | script from a cron job run once a minute, and reboot in case IPMI > | triggers? But if it's a kernel lockup, none of the user level code might > | run at all. Any way to fall back to a hard and fast kernel level machine > | reset? > > Nope, when you load the ipmi driver it provides a HW watchdog via ipmi > and works with watchdogd. Now if you want to know if your machines > rebooted due to the watchdog then check the ipmi sel for the watchdog > event. I'm a bit confused at this point, Doug. At what point did the OP state he has IPMI support or IPMI cards in his system? 
-- | Jeremy Chadwick j...@parodius.com | | Parodius Networking http://www.parodius.com/ | | UNIX Systems Administrator Mountain View, CA, USA | | Making life hard for others since 1977. PGP: 4BD6C0CB | ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
Re: Enabling watchdog
rihad writes: | On 05/14/2010 04:13 AM, Doug Ambrisko wrote: | > rihad writes: | > | Hi, I'm thinking of enabling the watchdog on our Dell PowerEdge 2950 / | > | FreeBSD 8.0 amd64, so that it reboots the machine in case of lockups. | > | Right now it doesn't work: | > | | > | # watchdog | > | watchdog: patting the dog: Operation not supported | > | # | > | Looking through the kernel configuration I found two relevant settings: | > | In /sys/conf/NOTES: | > | # | > | # Add software watchdog routines. | > | # | > | options SW_WATCHDOG | > | | > | and in /sys/amd64/conf/NOTES: | > | # | > | # Watchdog routines. | > | # | > | options MP_WATCHDOG | > | | > | Which of them should I rebuild the kernel with? BTW, the existing kernel | > | is built with the default "options SCHED_ULE" to make good use of | > | multiple CPUs, does watchdog work with it? | > | > If no one has said yet, kldload ipmi then run watchdogd. ... or compile | > it into the kernel. This will enable the IPMI HW watchdog. If it triggers, | > it will appear in the IPMI SEL (ipmitool sel list). | | Thanks. So did I understand it right that I should first install | sysutils/ipmitool, then start polling "ipmitool sel list" in a shell | script from a cron job run once a minute, and reboot in case IPMI | triggers? But if it's a kernel lockup, none of the user level code might | run at all. Any way to fall back to a hard and fast kernel level machine | reset? Nope, when you load the ipmi driver it provides a HW watchdog via ipmi and works with watchdogd. Now if you want to know if your machines rebooted due to the watchdog then check the ipmi sel for the watchdog event. Doug A. ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
RE: Crash dump problem - sleeping thread owns a non-sleepable lock during crash dump write
> > The crash was a "page fault while in kernel mode" with the current process
> > being the interrupt service routine for the bce0 GigE. Things progressed
> > reasonably until partway through the dump, when the system locked up with a
> > "Sleeping thread (tid 100028, pid 12) owns a non-sleepable lock". That's the
> > same PID as reported in the main crash.
>
> Hmm. You could try changing the code to not do a nested panic in that
> case. You would update subr_turnstile.c to just return if panicstr is
> not NULL rather than calling panic. However, there is still a good
> chance you will end up deadlocking in that case. I have another patch I
> can send you next week that prevents blocking on mutexes during a panic
> which may also help.

It would be instructive to know exactly why we were in turnstile(9) but it's likely due to mtx contention.

AIX has some code at the beginning of all the locking operations to avoid taking locks if we were running code out of kdb, though getting that worked out was slightly tricky with our variant of mtx_assert(9). I seem to recall there was also some "lockbusting" code that forcibly reset all owned locks to have no owner, at least in some paths.

Given that the system is single-cpu and should be single-threaded when dumping, this seems to me to be something worth working through to get more reliable dumps. Except for mtx_assert(9) I can't think of a reason to take locks once we start dumping or when in the debugger.

As an aside, with terribly corrupted locks I've seen double panics when the attempt to print the lock name faulted in strlen(9) called for printf(9), due to a bad lockname pointer. We have been able to get enough info off these crashes to debug them, but it's useful to remember that the system may be in a very unstable state depending on why it panics.

Thanks,
matthew
Re: Mount root error / New device numbering?
On Fri, May 14, 2010 at 00:32, Fred Souza wrote:
> Good to know, I never really paid much attention to those details (I
> will from now on). Thank you a lot for the help, Jeremy. I will try
> your suggestions in the morning and post back to tell what did I find
> out.

Like I said, here are my findings:

Jeremy's pointers were very correct, the difference in numbering seems to be just an ata(4) change. Manually changing entries in /etc/fstab does fix it, and I found out that the kernel panic I was getting was merely a simple detail I overlooked: the 3rd-party nvidia driver had been compiled on -RELEASE and was causing the kernel panics on -STABLE. Simply disabling its loading at the boot loader prompt, then booting with /etc/fstab properly updated and then reinstalling the nvidia-driver port (`portmaster nvidia-driver`) fixed it. Just to be on the safe side, I also reinstalled the other 3rd-party kernel module I use (fusefs-ntfs3g), even though it wasn't giving me any errors.

I did try the -STABLE snapshot image as Jeremy suggested, that's how I figured he was right about the numbering difference being an ata(4) change. I preferred to just manually change the previous install's /etc/fstab, though (but maybe there was a better way of doing this with the -STABLE snapshot DVD).

The interrupt storm on irq21 is still happening, and I'm going to work on that next. Mounting any non-audio CD/DVD stops it, so I'll keep doing that until I actually find a fix for the issue.

So thank you very much, Jeremy. Your pointers were very helpful in fixing my problem.

Best regards,
Fred
Re: regression in dc(4) from 7.2 to RELENG_8
on 14/05/2010 09:42 Chris Buechler said the following:
> one of our users has reported a regression in dc(4) on RELENG_8, the
> cards work fine on 7.2 and previous versions, but no longer function at
> all with RELENG_8 as of about a week ago.
> http://forum.pfsense.org/index.php/topic,24964.msg129488.html#msg129488

Perhaps this might be a cardbus issue (or even a more general issue) rather than a dc(4) issue. But first please try this patch reversed:

--- a/sys/dev/dc/if_dc.c
+++ b/sys/dev/dc/if_dc.c
@@ -331,7 +331,6 @@ static driver_t dc_driver = {
 static devclass_t dc_devclass;
-DRIVER_MODULE(dc, cardbus, dc_driver, dc_devclass, 0, 0);
 DRIVER_MODULE(dc, pci, dc_driver, dc_devclass, 0, 0);
 DRIVER_MODULE(miibus, dc, miibus_driver, miibus_devclass, 0, 0);

> dmesg from it working, from 7.2:
> cbb0: at device 11.0 on pci0
> cardbus0: on cbb0
> pccard0: <16-bit PCCard bus> on cbb0
> cbb0: [ITHREAD]
> cbb1: at device 11.1 on pci0
> cardbus1: on cbb1
> pccard1: <16-bit PCCard bus> on cbb1
> cbb1: [ITHREAD]
> dc0: port 0x1080-0x10ff mem
> 0x8800-0x880007ff,0x88001000-0x880017ff irq 11 at device 0.0 on
> cardbus0
> miibus1: on dc0
> tdkphy0: PHY 0 on miibus1
> tdkphy0: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
> dc0: Ethernet address: 00:xx:xx:xx:xx:56
> dc0: [ITHREAD]
> dc1: port 0x1100-0x117f mem
> 0x88002000-0x880027ff,0x88003000-0x880037ff irq 11 at device 0.0 on
> cardbus1
> miibus2: on dc1
> tdkphy1: PHY 0 on miibus2
> tdkphy1: 10baseT, 10baseT-FDX, 100baseTX, 100baseTX-FDX, auto
> dc1: Ethernet address: 00:xx:xx:xx:xx:66
> dc1: [ITHREAD]
>
> Not working, RELENG_8:
> cbb0: at device 11.0 on pci0
> cardbus0: on cbb0
> pccard0: <16-bit PCCard bus> on cbb0
> cbb0: [FILTER]
> cbb1: at device 11.1 on pci0
> cardbus1: on cbb1
> pccard1: <16-bit PCCard bus> on cbb1
> cbb1: [FILTER]
> cardbus0: Unable to allocate resource to read CIS.
> cardbus0: Unable to allocate resources for CIS
> cardbus0: Unable to allocate resource to read CIS.
> cardbus0: Unable to allocate resources for CIS
> dc0: port 0x1080-0x10ff mem
> 0x8800-0x880007ff,0x88001000-0x880017ff irq 11 at device 0.0 on
> cardbus0
> dc0: No station address in CIS!
> device_attach: dc0 attach returned 6
> cardbus1: Unable to allocate resource to read CIS.
> cardbus1: Unable to allocate resources for CIS
> cardbus1: Unable to allocate resource to read CIS.
> cardbus1: Unable to allocate resources for CIS
> dc1: port 0x1080-0x10ff mem
> 0x88002000-0x880027ff,0x88003000-0x880037ff irq 11 at device 0.0 on
> cardbus1
> dc1: No station address in CIS!
> device_attach: dc1 attach returned 6
>
> We can apply patches to our builds for this person and others to test
> and confirm the fix, before it's committed into FreeBSD.
>
> Chris

--
Andriy Gapon
Re: Crash dump problem - sleeping thread owns a non-sleepable lock during crash dump write
> > The crash was a "page fault while in kernel mode" with the current process
> > being the interrupt service routine for the bce0 GigE. Things progressed
> > reasonably until partway through the dump, when the system locked up with a
> > "Sleeping thread (tid 100028, pid 12) owns a non-sleepable lock". That's the
> > same PID as reported in the main crash.
>
> Hmm. You could try changing the code to not do a nested panic in that case.
> You would update subr_turnstile.c to just return if panicstr is not NULL
> rather than calling panic. However, there is still a good chance you will
> end up deadlocking in that case. I have another patch I can send you next
> week that prevents blocking on mutexes during a panic which may also help.

Ok, I'll be glad to try that.

> > 3) Is there any way to rig the system to obtain more info if this happens
> > again? Right now I'm using an embedded remote console server, but I could
> > switch the system to a serial port if enabling the kernel debugger might help.
> > But I think that the sleeping thread bit would happen even at the debugger
> > prompt, wouldn't it?
>
> Include DDB and enable the 'trace_on_panic' sysctl knob perhaps.

Hmmm. Do you think it will get very far before the sleeping thread business locks it up?

> > Is it possible to correlate the source line in the kernel with the instruction
> > pointer in the panic?
>
> If you are booted into the same kernel with the same modules loaded, you can
> probably run 'kgdb' as root and do 'l *'.

I did that and discovered that the 0x20: prefix is probably unwanted:

(kgdb) l *0x20:0x801e3c06
A syntax error in expression, near `:0x801e3c06'.
(kgdb) l *0x801e3c06
0x801e3c06 is in bce_start_locked (/usr/src/sys/dev/bce/if_bce.c:6996).
6991    }
6992
6993    count++;
6994
6995    /* Send a copy of the frame to any BPF listeners. */
6996    ETHER_BPF_MTAP(ifp, m_head);
6997    }
6998
6999    /* Exit if no packets were dequeued. */
7000    if (count == 0) {
(kgdb)

This kernel does have BPF compiled in, but I don't think it was in use at the time.

Any further suggestions to look at? (Remember, this system is in another state from me and all I have is remote access to the framebuffer - I'd have to go there and set up a serial console to be able to talk to the debugger if it crashes.)

Thanks,
Terry Kennedy http://www.tmk.com te...@tmk.com
New York, NY USA
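The kgdb session above fails because the panic message prints the instruction pointer as segment:offset (0x20:0x801e3c06), while the 'l *' expression only accepts the offset. As a small sketch of that clean-up step, the helper below drops the segment part before building the command. The function name `kgdb_list_command` is made up for this example; only the address comes from the thread.

```python
def kgdb_list_command(instruction_pointer):
    """Build a kgdb 'l *ADDR' command from a panic-style instruction pointer.

    Accepts either 'segment:offset' (as printed in the panic message) or a
    bare offset, and keeps only the offset part.
    """
    offset = instruction_pointer.rsplit(":", 1)[-1].strip()
    int(offset, 16)  # raises ValueError if the offset is not hexadecimal
    return "l *" + offset


# Address taken from the kgdb session quoted above.
print(kgdb_list_command("0x20:0x801e3c06"))
```

The resulting string is what was typed at the second (kgdb) prompt in the session above; the kernel and modules still have to match the ones that panicked for the listing to be meaningful.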
Re: Crash dump problem - sleeping thread owns a non-sleepable lock during crash dump write
Terry Kennedy wrote:
> I'm reposting this over here at the suggestion of the Forums moderator.
> The original post is at http://forums.freebsd.org/showthread.php?t=14163
>
> Got an interesting crash just now (well, as interesting as a crash on a
> soon-to-be production system can be). This is 8-STABLE/amd64, last cvsup'd
> early in the morning of May 9th. The system didn't complete the crash dump,
> so it needed a manual reset to get it going again.
>
> The crash was a "page fault while in kernel mode" with the current process
> being the interrupt service routine for the bce0 GigE. Things progressed
> reasonably until partway through the dump, when the system locked up with a
> "Sleeping thread (tid 100028, pid 12) owns a non-sleepable lock". That's the
> same PID as reported in the main crash.

Hmm. You could try changing the code to not do a nested panic in that case. You would update subr_turnstile.c to just return if panicstr is not NULL rather than calling panic. However, there is still a good chance you will end up deadlocking in that case. I have another patch I can send you next week that prevents blocking on mutexes during a panic which may also help.

> 3) Is there any way to rig the system to obtain more info if this happens
> again? Right now I'm using an embedded remote console server, but I could
> switch the system to a serial port if enabling the kernel debugger might help.
> But I think that the sleeping thread bit would happen even at the debugger
> prompt, wouldn't it?

Include DDB and enable the 'trace_on_panic' sysctl knob perhaps.

> I just booted the new kernel and tried this again, and got another crash.
> The message is identical to the first, except that the instruction pointer
> changed by 0x10 (presumably due to code differences between the old and new
> kernels) and it got 6MB further writing the crash dump.
>
> Since it seems I can reproduce this at will, I'll be glad to either perform
> additional information-gathering or give a developer access to the box for
> testing purposes.
>
> Is it possible to correlate the source line in the kernel with the instruction
> pointer in the panic?

If you are booted into the same kernel with the same modules loaded, you can probably run 'kgdb' as root and do 'l *'.

--
John Baldwin
Re: Enabling watchdog
rihad wrote: On 05/14/2010 04:13 AM, Doug Ambrisko wrote: rihad writes: | Hi, I'm thinking of enabling the watchdog on our Dell PowerEdge 2950 / | FreeBSD 8.0 amd64, so that it reboots the machine in case of lockups. | Right now it doesn't work: | | # watchdog | watchdog: patting the dog: Operation not supported | # | Looking through the kernel configuration I found two relevant settings: | In /sys/conf/NOTES: | # | # Add software watchdog routines. | # | options SW_WATCHDOG | | and in /sys/amd64/conf/NOTES: | # | # Watchdog routines. | # | options MP_WATCHDOG | | Which of them should I rebuild the kernel with? BTW, the existing kernel | is built with the default "options SCHED_ULE" to make good use of | multiple CPUs, does watchdog work with it? If no one has said yet, kldload ipmi then run watchdogd. ... or compile it into the kernel. This will enable the IPMI HW watchdog. If it triggers, it will appear in the IPMI SEL (ipmitool sel list). Thanks. So did I understand it right that I should first install sysutils/ipmitool, then start polling "ipmitool sel list" in a shell script from a cron job run once a minute, and reboot in case IPMI triggers? But if it's a kernel lockup, none of the user level code might run at all. Any way to fall back to a hard and fast kernel level machine reset? No, watchdogd and the IPMI driver will manage the watchdog. You can use 'sel elist' after a reboot to see if the reboot was triggered via the watchdog. -- John Baldwin ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
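To illustrate why no cron polling is needed, here is a toy model of the arrangement described above: a userland daemon keeps "patting" a hardware timer, and the hardware resets the machine on its own once the pats stop. This is not the ipmi(4) or watchdogd implementation; the class, the 16-second timeout, and the 8-second pat interval are all made-up values for the sketch.

```python
class ToyWatchdog:
    """Toy model of a hardware watchdog timer (not the ipmi(4) driver)."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_pat = 0

    def pat(self, now):
        """Called by the userland daemon while the system is healthy."""
        self.last_pat = now

    def expired(self, now):
        """Hardware side: has the timer run out since the last pat?"""
        return now - self.last_pat >= self.timeout_seconds


def simulate(timeout_seconds, pat_times, end_time):
    """Return the second at which the watchdog would fire, or None."""
    dog = ToyWatchdog(timeout_seconds)
    pats = set(pat_times)
    for now in range(end_time + 1):
        if now in pats:
            dog.pat(now)
        elif dog.expired(now):
            return now
    return None


# Healthy system: the daemon pats every 8 seconds, so the timer never fires.
print(simulate(16, range(0, 65, 8), 64))
# Lockup after second 16: nothing pats any more, so the hardware fires alone.
print(simulate(16, [0, 8, 16], 64))
```

The point of the model is that the reset decision lives entirely on the hardware side, so it still happens when no userland or kernel code is able to run; checking the SEL afterwards is only for finding out why the machine rebooted.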
Re: ipv6_ifconfig__alias not working
Thanks for the hints Matthew!

Cleaning up my config I found the culprit. I had copied ipv6_network_interfaces="gif0" from some guide, which of course defeated all my efforts to configure ipv6 on the other interfaces.

The ipv6_addrs_ knob doesn't seem to work (this is 8.0-p2), and I can't find any references to it in the subr files either. Saw that there are quite a few changes in -head though.

Kind regards,
Spil.

On Fri, May 14, 2010 at 11:40 AM, Matthew Seaman wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
>
> On 14/05/2010 10:07:23, Spil Oss wrote:
>
>> I'm trying to set ipv6 aliases for my jails in my rc.conf but it
>> doesn't seem to work as advertised. I have a /48 range assigned to me
>> (for this example 2001:dead:beef) and am trying to assign ipv6
>> addresses to a jail. The jails will all have ipv6 addresses in the
>> 2001:dead:beef:1 range.
>>
>> From man rc.conf "Aliases should be set as
>> ipv6_ifconfig__alias"
>>
>> My bge0 config in /etc/rc.conf:
>> ifconfig_bge0=
>> ipv4_addrs_bge0="10.10.2.1/24 10.10.2.2/24 10.10.2.3/24 10.10.2.5/24
>> 10.10.2.6/24"
>> ipv6_ifconfig_bge0_alias0="
>> rtadvd_interfaces="wlan0 bge0"
>>
>> Additional ipv6 config in /etc/rc.conf
>> ipv6_enable="YES"
>> ipv6_gateway_enable="YES"
>>
>> The "2001:dead:beef:1::5/64" address is not assigned to bge0.
>> There must be some stupid mistake I'm making in my config. Is it
>> perhaps the ifconfig_bge0 line that screws up my config?
>
> Hmmm... for consistencies' sake you should probably be using:
>
> ipv6_ifconfig_bge0="2001:dead:beef:::1/64"
> ipv6_ifconfig_bge0_alias0="2001:dead:beef:1::5/64"
>
> or, to make things absolutely parallel to your IPv4 settings:
>
> ipv6_addrs_bge0="2001:dead:beef:::1/64 2001:dead:beef:1::5/64"
>
> Cheers,
>
> Matthew
>
> - --
> Dr Matthew J Seaman MA, D.Phil.
7 Priory Courtyard > Flat 3 > PGP: http://www.infracaninophile.co.uk/pgpkey Ramsgate > Kent, CT11 9PW > -BEGIN PGP SIGNATURE- > Version: GnuPG/MacGPG2 v2.0.14 (Darwin) > Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/ > > iEYEARECAAYFAkvtGpoACgkQ8Mjk52CukIyauACeIVpsDf2VfGT0IpJXf0DQ2wLc > ROQAoIomIPblYcDCtYDU1pjDakzHMbWN > =OwJ5 > -END PGP SIGNATURE- > ___ > freebsd-stable@freebsd.org mailing list > http://lists.freebsd.org/mailman/listinfo/freebsd-stable > To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org" > ___ freebsd-stable@freebsd.org mailing list http://lists.freebsd.org/mailman/listinfo/freebsd-stable To unsubscribe, send any mail to "freebsd-stable-unsubscr...@freebsd.org"
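[Editor's sketch] Why a copied ipv6_network_interfaces="gif0" line hides the bge0 alias can be illustrated with a small dry run. This is a simplified sketch, not the real /etc/network.subr: the function name plan_ipv6 is hypothetical, and it only assumes what the message above reports, namely that IPv6 settings are applied solely to the interfaces named in ipv6_network_interfaces, plus the convention that aliases are numbered upward from 0 until the first unset one.

```shell
#!/bin/sh
# Dry run: print the ifconfig commands implied by rc.conf-style variables.
# Nothing is actually configured; plan_ipv6 is an illustrative name only.

ipv6_network_interfaces="gif0"                      # line copied from a guide
ipv6_ifconfig_bge0_alias0="2001:dead:beef:1::5/64"  # alias from the thread

plan_ipv6() {
    # $1: space-separated list of interfaces to consider
    for ifn in $1; do
        # Primary address, if ipv6_ifconfig_<interface> is set.
        eval addr=\"\$ipv6_ifconfig_${ifn}\"
        if [ -n "$addr" ]; then
            echo "ifconfig $ifn inet6 $addr"
        fi
        # Aliases: alias0, alias1, ... stopping at the first unset one.
        n=0
        while :; do
            eval addr=\"\$ipv6_ifconfig_${ifn}_alias${n}\"
            [ -z "$addr" ] && break
            echo "ifconfig $ifn inet6 $addr alias"
            n=$((n + 1))
        done
    done
}

# With the copied setting, bge0 is never visited, so nothing is printed.
plan_ipv6 "$ipv6_network_interfaces"

# Once bge0 is in the list, its alias is picked up.
plan_ipv6 "gif0 bge0"
```

The first call prints nothing because only gif0 is walked; the second prints the alias command for bge0. Removing the copied line, or adding bge0 to it, corresponds to the fix described above.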
Re: ipv6_ifconfig_<interface>_alias<n> not working
On 14/05/2010 10:07:23, Spil Oss wrote:

> I'm trying to set ipv6 aliases for my jails in my rc.conf but it
> doesn't seem to work as advertised. I have a /48 range assigned to me
> (for this example 2001:dead:beef) and am trying to assign ipv6
> addresses to a jail. The jails will all have ipv6 addresses in the
> 2001:dead:beef:1 range.
>
> From man rc.conf "Aliases should be set as ipv6_ifconfig_<interface>_alias<n>"
>
> My bge0 config in /etc/rc.conf:
> ifconfig_bge0="inet6 2001:dead:beef:::1/64 up"
> ipv4_addrs_bge0="10.10.2.1/24 10.10.2.2/24 10.10.2.3/24 10.10.2.5/24
> 10.10.2.6/24"
> ipv6_ifconfig_bge0_alias0="2001:dead:beef:1::5/64"
> rtadvd_interfaces="wlan0 bge0"
>
> Additional ipv6 config in /etc/rc.conf
> ipv6_enable="YES"
> ipv6_gateway_enable="YES"
>
> The "2001:dead:beef:1::5/64" address is not assigned to bge0.
> There must be some stupid mistake I'm making in my config. Is it
> perhaps the ifconfig_bge0 line that screws up my config?

Hmmm... for consistency's sake you should probably be using:

ipv6_ifconfig_bge0="2001:dead:beef:::1/64"
ipv6_ifconfig_bge0_alias0="2001:dead:beef:1::5/64"

or, to make things absolutely parallel to your IPv4 settings:

ipv6_addrs_bge0="2001:dead:beef:::1/64 2001:dead:beef:1::5/64"

Cheers,

Matthew

-- 
Dr Matthew J Seaman MA, D.Phil.                   7 Priory Courtyard
                                                  Flat 3
PGP: http://www.infracaninophile.co.uk/pgpkey     Ramsgate
                                                  Kent, CT11 9PW
ipv6_ifconfig_<interface>_alias<n> not working
Hi,

I'm trying to set ipv6 aliases for my jails in my rc.conf but it
doesn't seem to work as advertised. I have a /48 range assigned to me
(for this example 2001:dead:beef) and am trying to assign ipv6
addresses to a jail. The jails will all have ipv6 addresses in the
2001:dead:beef:1 range.

From man rc.conf "Aliases should be set as ipv6_ifconfig_<interface>_alias<n>"

My bge0 config in /etc/rc.conf:
ifconfig_bge0="inet6 2001:dead:beef:::1/64 up"
ipv4_addrs_bge0="10.10.2.1/24 10.10.2.2/24 10.10.2.3/24 10.10.2.5/24
10.10.2.6/24"
ipv6_ifconfig_bge0_alias0="2001:dead:beef:1::5/64"
rtadvd_interfaces="wlan0 bge0"

Additional ipv6 config in /etc/rc.conf
ipv6_enable="YES"
ipv6_gateway_enable="YES"

The "2001:dead:beef:1::5/64" address is not assigned to bge0.
There must be some stupid mistake I'm making in my config. Is it
perhaps the ifconfig_bge0 line that screws up my config?

Kind regards,
Spil.
Re: Any chance of someone committing the patch in bin/131861 ?
> Postfix will re-write this as part of sanitization, so I had to revert
> to creating mbox files by hand. Anyway, could you please test the
> following patch with a wider variety of mails?

I've been testing your patch for a few weeks now in my main email client,
and I haven't encountered any problems. It also fixes the reply issues I
was originally having.

Do you want to attach it to the PR? After that maybe someone could commit
it. I am pretty certain it doesn't actually break any existing behaviour.

cheers,

-pete.