IPMI hardware watchdogs Re: dell r420/r320 stable/9
On Jul 26, 2012, at 8:50 PM, Sean Bruno wrote:

> For the time being I had to revert the following from my stable/9 tree. Otherwise I would get a kernel panic on shutdown from ipmi(4).
>
> http://svnweb.freebsd.org/base?view=revision&revision=237839
> http://svnweb.freebsd.org/base?view=revision&revision=221121

On a somewhat related note: We noticed recently that you can't pet or disable the IPMI hardware watchdog once SCHEDULER_STOPPED() is true. This means it can fire unexpectedly while you're dumping core or rebooting, depending on how long the timeout was on the last pet before the panic. The ipmi driver will need to process the command differently if the scheduler is stopped. I haven't had time to look at a fix yet.

-Andrew

--
Andrew Boyer abo...@averesystems.com

___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org
Re: IPMI hardware watchdogs Re: dell r420/r320 stable/9
On Jul 27, 2012, at 10:42 AM, Attilio Rao wrote:

> On Fri, Jul 27, 2012 at 3:33 PM, Andrew Boyer abo...@averesystems.com wrote:
>> On Jul 26, 2012, at 8:50 PM, Sean Bruno wrote:
>>> For the time being I had to revert the following from my stable/9 tree. Otherwise I would get a kernel panic on shutdown from ipmi(4).
>>>
>>> http://svnweb.freebsd.org/base?view=revision&revision=237839
>>> http://svnweb.freebsd.org/base?view=revision&revision=221121
>>
>> On a somewhat related note: We noticed recently that you can't pet or disable the IPMI hardware watchdog once SCHEDULER_STOPPED() is true. This means it can fire unexpectedly while you're dumping core or rebooting, depending on how long the timeout was on the last pet before the panic. The ipmi driver will need to process the command differently if the scheduler is stopped. I haven't had time to look at a fix yet.
>
> I recall I fixed that internally for SV, but the key here is that we need to find a unified (or default) policy. More specifically, do we want the watchdog to also cover the kernel dump (because of possible deadlocks when dumping)? If the answer is yes, we likely need to pat the watchdog from within the dumping cycle itself. If the answer is no, then we can just disable it when entering the panic path. But anyway, we need to identify a default policy that makes sense first.
>
> Attilio

For our use case, we need the system to reset if the dump hangs.

As the code stands now, you can't disable the HW watchdog from the panic path. Prior to stopping the scheduler early in panic(), you don't know the lock state, so you can't safely initiate the IPMI command. (It hung the first time I tried it.) After stopping the scheduler, you can't pet it or turn it off.

-Andrew

--
Andrew Boyer abo...@averesystems.com
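The two policies discussed above (let the last pet run out, or pat the watchdog from inside the dump loop) can be compared with a small user-space model. This is only a sketch under stated assumptions: it is not ipmi(4) or dumpsys() code, and the function name, the per-chunk timing array, and the "pat once per chunk" rule are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model, not kernel code.  Returns the index of the dump
 * chunk during which the simulated watchdog fires, or -1 if the dump
 * completes first.
 */
int
dump_with_watchdog(const int *chunk_secs, size_t nchunks,
    int timeout_secs, int pat_each_chunk)
{
	int remaining = timeout_secs;	/* time left from the last pet */
	size_t i;

	assert(chunk_secs != NULL && timeout_secs > 0);
	for (i = 0; i < nchunks; i++) {
		if (pat_each_chunk)
			remaining = timeout_secs;	/* pat before each chunk */
		if (chunk_secs[i] >= remaining)
			return ((int)i);	/* watchdog resets the box here */
		remaining -= chunk_secs[i];
	}
	return (-1);
}
```

With patting enabled, a healthy dump finishes no matter how long it takes in total, while a single hung chunk still triggers the reset, which is the behavior asked for in the reply above.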
Re: FreeBSD 8-STABLE on R620 w/ X520-DA2/Intel 82599
Please post the output of pciconf -lvc for these devices.

-Andrew

On Jun 29, 2012, at 10:50 AM, Rick Miller wrote:

> Hi All,
>
> I have 2 hosts, HP DL360 G8 and Dell R620. Both have the X520-DA2/Intel 82599 10G Fiber NIC. Both also have the same FreeBSD 8-STABLE image. The Dell displays the following in dmesg and we are unable to configure the ix0 or ix1 interfaces, whereas the HP works just fine. Wondering if anyone else has experienced this?
>
> pci4: <network, ethernet> at device 0.0 (no driver attached)
> pci4: <network, ethernet> at device 0.1 (no driver attached)
>
> --
> Take care
> Rick Miller

--
Andrew Boyer abo...@averesystems.com
Re: Intel X520-DA2 Supported in stable/8?
You can probably turn hw.ixgbe.num_queues down to 2 or 4 and cut your mbuf consumption dramatically without noticing any loss of performance.

-A

On Jun 22, 2012, at 6:19 PM, Rick Miller wrote:

> On Fri, Jun 22, 2012 at 5:21 PM, Jack Vogel jfvo...@gmail.com wrote:
>> Increase your system mbuf pool size, you do not want that failure to happen.
>
> Thanks, Jack. I saw a thread where you discussed this. You are referring to kern.ipc.nmbclusters, correct? Should I also adjust the following?
>
> hw.ixgbe.rxd
> hw.ixgbe.txd
> hw.ixgbe.num_queues
> hw.intr_storm_threshold

--
Andrew Boyer abo...@averesystems.com
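To see why fewer queues cuts mbuf consumption, here is a rough back-of-the-envelope helper. It assumes each receive descriptor pins one mbuf cluster, so the receive-side demand is roughly ports x queues x descriptors per ring. The function name and the sample numbers (2 ports, 2048 descriptors) are illustrative assumptions, not values from this thread or driver defaults.

```c
#include <assert.h>

/*
 * Hypothetical estimate: clusters held by the RX rings, assuming one
 * cluster per RX descriptor (compare against kern.ipc.nmbclusters).
 */
unsigned long
rx_clusters_needed(unsigned int ports, unsigned int queues, unsigned int rxd)
{
	assert(queues > 0 && rxd > 0);	/* a ring needs at least one slot */
	return ((unsigned long)ports * queues * rxd);
}
```

Under those assumptions, dropping from 8 queues to 2 on a two-port card with 2048 descriptors per ring reduces the estimate from 32768 clusters to 8192.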
Re: Intel X520-DA2 Supported in stable/8?
The ixgbe driver creates devices named ix0, etc. I believe you need to run 'ifconfig ix0 up' before it will attempt to get link.

-Andrew

On Jun 22, 2012, at 3:45 PM, Rick Miller wrote:

> On Fri, Jun 22, 2012 at 3:13 PM, Rick Miller vmil...@hostileadmin.com wrote:
>> Hi All,
>>
>> Wondering if the Intel X520-DA2 10G Fibre NIC is supported in stable/8. Hardware notes don't specify it, but I have a system up and the interfaces appear to be loaded by the ix driver. However, status indicates no carrier.
>
> Ok, brain fart. Please forgive my ineptitude. I once sent an email inquiring about the Intel 82599, which is this NIC. Responses to that mail say it's supported by the ixgbe driver. My stable/8 installation (5/21/2012) probes it with an ix driver that I cannot find any info on. The ixgbe man page indicates it only supports 82598-based controllers. Not sure what to think here...

--
Andrew Boyer abo...@averesystems.com
Can someone do something about the extra svn mergeinfo?
For example:

http://svnweb.freebsd.org/base/stable/8/sys/dev/e1000/?view=log

This makes it very hard to figure out which changes are actually relevant to e1000.

Thank you,
Andrew

--
Andrew Boyer abo...@averesystems.com
Re: hang during dump (reproducible)
On Feb 10, 2012, at 9:50 PM, Jake Holland wrote:

> Many thanks to Attilio Rao, Kostik Belousov, and Andriy Gapon. And anybody else involved.
>
> However, when I looked at the commit I noticed this:
>
> $ svn log -r228424 svn://svn.freebsd.org/base
> ...
> MFC after: 3 months (or never)
>
> I'm not sure whether never is still considered an option, but it would be useful for me if 8.3 release, when it comes, does not hang this way during panic.
>
> But thanks for the patch, regardless.

Agreed - if this commit could be MFC'd for 8.3 it would be much appreciated.

-Andrew

--
Andrew Boyer abo...@averesystems.com
Kernel panics under 8.2 due to ATA timeouts
Hello Alexander,

I have a system that appears to have a flaky SATA controller (one of the Intel ESB2 variants) and it seems to be exposing a weakness in the ATA driver (not using ATA_CAM).

If a command with ATA_R_DIRECT set times out, the channel gets reinitialized, but from the soft interrupt context. It panics when it tries to sleep in ata_queue_request(). Timeouts work if ATA_R_DIRECT isn't set because in that case it uses a taskqueue to complete the request.

Here is the backtrace:

#0 kdb_enter (why=0x80962cfa panic, msg=0xa Address 0xa out of bounds) at ../../../kern/subr_kdb.c:349
#1 0x805d6d0b in panic (fmt=Variable fmt is not available.) at ../../../kern/kern_shutdown.c:689
#2 0x8061bc53 in sleepq_add (wchan=0xff00052c3e58, lock=0xff00052c3e38, wmesg=0x808fa213 ATA request done, flags=1, queue=0) at ../../../kern/subr_sleepqueue.c:320
#3 0x80590c95 in _cv_timedwait (cvp=0xff00052c3e58, lock=0xff00052c3e38, timo=4) at ../../../kern/kern_condvar.c:313
#4 0x805d61af in _sema_timedwait (sema=0xff00052c3e38, timo=4, file=0x808fa1f6 ../../../dev/ata/ata-queue.c, line=118) at ../../../kern/kern_sema.c:123
#5 0x8028559f in ata_queue_request (request=0xff00052c3dc0) at ../../../dev/ata/ata-queue.c:117
#6 0x80286628 in ata_controlcmd (dev=0xff0002e83d00, command=239 '?', feature=Variable feature is not available.) at ../../../dev/ata/ata-queue.c:153
#7 0x8027ffd3 in ata_setmode (dev=0xff0002e83d00) at ../../../dev/ata/ata-all.c:637
#8 0x802a0af9 in ad_init (dev=0xff0002e83d00) at ../../../dev/ata/ata-disk.c:405
#9 0x802a0c29 in ad_reinit (dev=0xff0002e83d00) at ../../../dev/ata/ata-disk.c:221
#10 0x80280cad in ata_reinit (dev=0xff0002902800) at ata_if.h:79
#11 0x802856c4 in ata_completed (context=Variable context is not available.) at ../../../dev/ata/ata-queue.c:313
#12 0x80285ffb in ata_finish (request=0xff00054ec8c0) at ../../../dev/ata/ata-queue.c:265
#13 0x805ed419 in softclock (arg=Variable arg is not available.) at ../../../kern/kern_timeout.c:430

This is very repeatable. I'm not sure what's the best fix - always use a taskqueue on timeouts? Don't reinit if direct commands fail?

-Andrew

--
Andrew Boyer abo...@averesystems.com
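The first fix floated above ("always use a taskqueue on timeouts") amounts to a small change in how the completion path is chosen. The following user-space sketch only models that decision; the struct, enum, and function names are hypothetical and do not correspond to the real ata(4) code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the real struct ata_request. */
enum completion_path {
	COMPLETE_INLINE,	/* finish in the caller's context */
	COMPLETE_DEFERRED	/* hand off to a thread that may sleep */
};

struct model_request {
	bool direct;		/* models ATA_R_DIRECT being set */
	bool timed_out;		/* completion came from the timeout handler */
};

/*
 * Direct requests complete inline unless they timed out; a timeout may
 * lead to a channel reinit that sleeps, so it is always deferred.
 */
enum completion_path
choose_completion_path(const struct model_request *r)
{
	assert(r != NULL);
	if (r->direct && !r->timed_out)
		return (COMPLETE_INLINE);
	return (COMPLETE_DEFERRED);
}
```

The design point is that the softclock context in frame #13 cannot sleep, so anything that can reach ata_reinit() has to run from a context that can.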
mfip and smartctl Re: smartctl / mpt on 9.0-RC1
On Nov 7, 2011, at 6:24 AM, Marat N.Afanasyev wrote:

> this is an output on mfi controller with mfip loaded:
>
> # smartctl -a /dev/pass1
> smartctl 5.41 2011-06-09 r3365 [FreeBSD 8.2-RELEASE amd64] (local build)
> Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
>
> Vendor:               SEAGATE
> Product:              ST3146356SS
> Revision:             0007
> User Capacity:        146,815,737,856 bytes [146 GB]
> Logical block size:   512 bytes
> Logical Unit id:      0x5000c50028f8a56f
> Serial number:        3QN4PWHS9130JLKB
> Device type:          31
> Transport protocol:   SAS
> Local Time is:        Mon Nov 7 15:20:27 2011 MSK
> Device supports SMART and is Enabled
> Temperature Warning Enabled
> SMART Health Status: OK
>
> Current Drive Temperature:     26 C
> Drive Trip Temperature:        68 C
>
> Error counter log:
>            Errors Corrected by           Total   Correction     Gigabytes    Total
>                ECC          rereads/    errors   algorithm      processed    uncorrected
>            fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
> read:    9382124        0         0   9382124    9382124       3436.782           0
> write:         0        0         0         0          0       8978.360           0
> verify:   663433        0         0    663433     663433        332.651           0
>
> Non-medium error count:        7
>
> [GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
> No self-tests have been logged
> Long (extended) Self Test duration: 1740 seconds [29.0 minutes]
>
> btw, 3dm can tell about reallocated sector count on sas somehow, while smartctl cannot, even on supported controller :(

Notice how the device type is 31? The mfip driver masks off the SCSI INQUIRY peripheral device type bits to prevent CAM from attaching da* devices to the disks. See sys/dev/mfi/mfi_cam.c, search for T_DIRECT.

That confuses smartctl and prevents it from displaying information like the Grown Defect List. I added a local hack to smartctl to interpret a peripheral device type of 0x1f (unknown or missing) as 0x0 (disk), but I don't think the hack is appropriate for general consumption.

What we need is a better way for mfi and aac to block CAM from attaching without corrupting the inquiry results.

-Andrew

> --
> SY, Marat

--
Andrew Boyer abo...@averesystems.com
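The kind of remapping described above can be sketched in a few lines. This is not the actual local smartctl change, which was not posted; it is a minimal illustration assuming the standard INQUIRY layout, where the low five bits of byte 0 carry the peripheral device type. The macro and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define INQ_TYPE_MASK	0x1f	/* bits 4-0 of INQUIRY byte 0 */
#define INQ_TYPE_DISK	0x00	/* direct-access block device */
#define INQ_TYPE_NONE	0x1f	/* unknown or no device type */

/*
 * Hypothetical helper: report "disk" when the device type has been
 * masked to 0x1f, and pass every other type through unchanged.
 */
uint8_t
effective_device_type(const uint8_t *inq, size_t len)
{
	uint8_t type;

	assert(inq != NULL && len >= 1);
	type = inq[0] & INQ_TYPE_MASK;
	return (type == INQ_TYPE_NONE ? INQ_TYPE_DISK : type);
}
```

The drawback noted in the message applies to this sketch as well: it would also treat a device that really has no type as a disk, which is why it is not suitable as a general fix.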
USB/coredump hangs in 8 and 9
Re: panic: bufwrite: buffer is not busy??? (originally on freebsd-net)
Re: debugging frequent kernel panics on 8.2-RELEASE (originally on freebsd-stable)
Re: System hang in USB umass module while processing panic (originally on freebsd-usb)

Hello Andriy and Hans,

Sorry for tying in so many discussions on this topic, but I think I have an explanation for the problems we have been reporting* with hanging coredumps on multicore systems on 8.2-RELEASE, and it has implications for Andriy's proposed scheduler patch** and for USB.

In today's 8.X and 9.X branches, nothing that I can find stops the other CPUs when the kernel panics, but many parts of the locking code get disabled (grep on 'panicstr'). The 'bufwrite: buffer is not busy???' panic is caused by the syncer encountering an error. If that happens when it's on the dumping CPU everything hangs. If it's running on a different CPU, it will be blocked and hidden by the panic_cpu spinlock in panic(), and the dump continues, polling every attached keyboard for a Ctrl-C.

But, the new 8.X USB stack relies on multithreading. (The new stack is the variable that broke coredumps for us in the 7.1-8.2 transition, I think.) SVN 224223 fixes a hang that would happen when dumpsys() polls the USB keyboard (IPMI KVM, in our case). That helps, but it only gets as far as usb_process(), where it hangs in a loop around a cv_wait() call. This is easy to reproduce by adding code to the watchdog to break into the debugger if panicstr is set.

I am experimenting with Andriy's patch** to stop the scheduler and it seems to be most of the way there, stopping the CPUs and disabling the rest of locking. There are a few places that still reference panicstr, but that's minor.

These are the changes I made to the patch:

* Changed ukbd_do_poll() to return immediately if SCHEDULER_STOPPED() is true, so that we don't hang up in USB. ukbd_yield() locks up in DROP_GIANT(), and if you skip ukbd_yield(), usbd_transfer_poll() locks up trying to drop mutexes.
* Changed the call to spinlock_enter() back to critical_enter(), so that interrupts stay enabled and the hardclock still functions.
* Added code in the beginning of panic() to switch to CPU 0, so that we're able to service the hardclock interrupts and so that watchdog panics get through.

This has worked 100% for me so far, although anyone using a USB keyboard or dump device would still be out of luck.

Thoughts? It seems like stopping all of the other CPUs is the right thing to do on a panic (what are they doing otherwise?). Are the USB issues fixable? If Andriy's patch gets committed it might just involve short-circuiting all of the locking in the polling path, but I haven't gotten that far yet. I bet dumping to NFS will have the same problem.

Thanks,
Andrew

* - http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/155421
** - http://people.freebsd.org/~avg/stop_scheduler_on_panic.8.x.diff

--
Andrew Boyer abo...@averesystems.com
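The first change in the list above (return early from the keyboard poll once the scheduler is stopped) can be illustrated with a user-space model. Everything here is a hypothetical stand-in: the struct replaces the kernel's SCHEDULER_STOPPED() check, the counter replaces real mutex operations, and the function is not ukbd_do_poll().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct panic_state {
	bool scheduler_stopped;	/* stand-in for SCHEDULER_STOPPED() */
	int lock_ops;		/* simulated lock operations so far */
};

/* Returns 0 after a normal poll, -1 when polling was skipped. */
int
model_kbd_poll(struct panic_state *st)
{
	assert(st != NULL);
	if (st->scheduler_stopped)
		return (-1);	/* skip the lock-dependent polling path */
	st->lock_ops++;		/* models dropping a mutex to yield */
	st->lock_ops++;		/* models re-acquiring it afterwards */
	return (0);
}
```

The cost of this approach is the one already stated in the message: with the poll short-circuited, a USB keyboard cannot interrupt the dump.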
Re: Heads up: you'll need to do a fresh config KERNEL etc
On May 19, 2011, at 6:53 PM, Rick Macklem wrote:

> Assuming that you are using the regular 8.n client (and not the new one), there have been some commits related to krpc bugs that could have fixed cases which would have caused poor perf., although all of those (except one where a client would hang on a TCP reconnect attempt) are in 8.2.

Are you referring to r221934? If not, which change? (Trying to make sure I have them all...)

Thanks,
Andrew

--
Andrew Boyer abo...@averesystems.com
Re: About panic: bufwrite: buffer is not busy???
On Feb 20, 2011, at 10:46 AM, Jeremy Chadwick wrote:

> On Sun, Feb 20, 2011 at 10:30:52AM -0500, Mike Tancsa wrote:
>> On 2/20/2011 9:33 AM, Andrey Smagin wrote:
>>> On a week-old -current I have the same problem; my box panicked every 2-15 minutes. I resolved the problem by unplugging the network connectors from 2 Intel em (82574L) cards. I thought last time that it was an mpd5-related panic, but mpd5 works with another re interface integrated on the motherboard. I think it may be an em-related panic, or em+mpd5.
>>
>> The latest panic I saw didn't have anything to do with em. Are you sure your crashes are because of the NIC driver?
>
> Not to mention, the error string the OP provided (see Subject) is only contained in one file: sys/ufs/ffs/ffs_vfsops.c, function ffs_bufwrite(). So, that would be some kind of weird filesystem-related issue, not NIC-specific. I have no idea how to debug said problem.

The issue is the file system activity occurring in parallel with the coredump, which is strange. It seems like everything else should be halted before the dump begins but I couldn't find a place in the code that actually tries to stop the other CPUs.

My question isn't about the initial panic (I was using the sysctl to provoke one), but about the secondary panic. This is on 8-core systems.

-Andrew

--
Andrew Boyer abo...@averesystems.com
Re: About panic: bufwrite: buffer is not busy???
Moving this to -current and -stable and following up...

Something is broken with coredumps on stable/8 amd64. I tried a vanilla 8.2-RC3 and yesterday's csup of stable/8; neither can dump a core with 'sysctl debug.kdb.panic=1'.

For the 8.2-RC3 / amd64 / GENERIC install, I used the memstick image, installed on ad7 (a 250GB SATA drive), used the default partition map, and set dumpdev to AUTO.

I added enough tracing to show that the second panic is due to the syncer process flushing buffers to the other filesystems in parallel with the dump. I've seen this panic and a similar one, 'buffer not locked', coming from ffs_write(). One time out of about 30 the core ran to completion, but slowly (~1MB/sec). Other times the dump just locks up completely with no other output.

Does anyone know what might have changed to expose this problem? I don't ever see it under 7.1.

Thanks,
Andrew

On Feb 3, 2011, at 12:11 AM, Eugene Grosbein wrote:

> On 02.02.2011 00:50, Gleb Smirnoff wrote:
>> E> Uptime: 8h3m51s
>> E> Dumping 4087 MB (3 chunks)
>> E> chunk 0: 1MB (150 pages) ... ok
>> E> chunk 1: 3575MB (915088 pages) 3559 3543panic: bufwrite: buffer is not busy???
>> E> cpuid = 3
>> E> Uptime: 8h3m52s
>> E> Automatic reboot in 15 seconds - press a key on the console to abort
>>
>> Can you add KDB_TRACE option to kernel? Your boxes for some reason can't dump core, but with this option we will have at least trace.
>
> I see Mike Tancsa's box has bufwrite: buffer is not busy??? problem too. Has anyone a thought how to fix generation of crashdumps?
>
> Eugene Grosbein
> ___
> freebsd-...@freebsd.org mailing list
> http://lists.freebsd.org/mailman/listinfo/freebsd-net
> To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org

--
Andrew Boyer abo...@averesystems.com