IPMI hardware watchdogs Re: dell r420/r320 stable/9

2012-07-27 Thread Andrew Boyer

On Jul 26, 2012, at 8:50 PM, Sean Bruno wrote:

 For the time being I had to revert the following from my stable/9 tree.
 Otherwise I would get a kernel panic on shutdown from ipmi(4).
 
 http://svnweb.freebsd.org/base?view=revision&revision=237839
 http://svnweb.freebsd.org/base?view=revision&revision=221121
 


On a somewhat related note: We noticed recently that you can't pet or disable 
the IPMI hardware watchdog once SCHEDULER_STOPPED() is true.  This means it can 
fire unexpectedly while you're dumping core or rebooting, depending on how long 
the timeout was on the pet before the panic.  The ipmi driver will need to 
process the command differently if the scheduler is stopped.  I haven't had 
time to look at a fix yet.
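
A minimal userland sketch of what "process the command differently" could
look like: when the scheduler is stopped, skip the queue-and-sleep path and
busy-poll the device for completion instead.  Everything here (fake_bmc,
watchdog_set, the scheduler_stopped flag standing in for SCHEDULER_STOPPED())
is a hypothetical model, not code from the ipmi(4) driver.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's SCHEDULER_STOPPED() check. */
bool scheduler_stopped = false;

/* Hypothetical model of a BMC that needs a few status polls per command. */
struct fake_bmc {
	int polls_until_ready;	/* status reads left before the reply is ready */
	int watchdog_timeout;	/* seconds; 0 means the watchdog is disabled */
};

enum submit_path { PATH_SLEEP, PATH_POLLED, PATH_FAILED };

/* Busy-wait on the device; needs no scheduler, sleep, or wakeup. */
bool
bmc_poll_until_ready(struct fake_bmc *bmc, int max_polls)
{
	for (int i = 0; i < max_polls; i++) {
		if (bmc->polls_until_ready == 0)
			return (true);
		bmc->polls_until_ready--;
	}
	return (bmc->polls_until_ready == 0);
}

enum submit_path
watchdog_set(struct fake_bmc *bmc, int timeout)
{
	if (!scheduler_stopped) {
		/*
		 * Normal path: a real driver would queue the request and
		 * sleep until a worker thread completes it.  Modeled here
		 * as immediate completion.
		 */
		bmc->polls_until_ready = 0;
		bmc->watchdog_timeout = timeout;
		return (PATH_SLEEP);
	}
	/* Panic path: nothing can sleep or be woken, so poll instead. */
	if (!bmc_poll_until_ready(bmc, 1000))
		return (PATH_FAILED);
	bmc->watchdog_timeout = timeout;
	return (PATH_POLLED);
}
```

With scheduler_stopped set, watchdog_set(&bmc, 0) takes the polled path and
disables the modeled watchdog; if the device never becomes ready within the
poll budget, the timeout is left untouched.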

-Andrew

--
Andrew Boyer    abo...@averesystems.com




___
freebsd-stable@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-stable
To unsubscribe, send any mail to freebsd-stable-unsubscr...@freebsd.org


Re: IPMI hardware watchdogs Re: dell r420/r320 stable/9

2012-07-27 Thread Andrew Boyer

On Jul 27, 2012, at 10:42 AM, Attilio Rao wrote:

 On Fri, Jul 27, 2012 at 3:33 PM, Andrew Boyer abo...@averesystems.com wrote:
 
 On Jul 26, 2012, at 8:50 PM, Sean Bruno wrote:
 
 For the time being I had to revert the following from my stable/9 tree.
 Otherwise I would get a kernel panic on shutdown from ipmi(4).
 
 http://svnweb.freebsd.org/base?view=revision&revision=237839
 http://svnweb.freebsd.org/base?view=revision&revision=221121
 
 
 On a somewhat related note: We noticed recently that you can't pet or 
 disable the IPMI hardware watchdog once SCHEDULER_STOPPED() is true.  This 
 means it can fire unexpectedly while you're dumping core or rebooting, 
 depending on how long the timeout was on the pet before the panic.  The ipmi 
 driver will need to process the command differently if the scheduler is 
 stopped.  I haven't had time to look at a fix yet.
 
 I recall I fixed that internally for SV, but the key here is that we
 need to find a unified (or default) policy.
 More specifically, do we want the watchdog to also cover the kernel dump
 part (because of possible deadlocks when dumping)?  If the answer is
 yes, we likely need to pat the watchdog from within the dumping cycle
 itself.  If the answer is no, then we can just disable it when entering
 the panic path.  But anyway, we need to identify a default policy that
 makes sense first.
 
 Attilio
 

For our use case, we need the system to reset if the dump hangs.

As the code stands now, you can't disable the HW watchdog from the panic path.  
Prior to stopping the scheduler early in panic(), you don't know the lock 
state, so you can't safely initiate the IPMI command.  (It hung the first time 
I tried it.)  After stopping the scheduler, you can't pet it to turn it off.
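
To make the "reset if the dump hangs" policy concrete, here is a small
simulation of patting the watchdog from inside the dump loop.  It is only a
sketch under assumed names (wd_model, wd_pat, dump_with_watchdog) and a
simulated clock; it does not touch real hardware or any kernel API.

```c
#include <stdbool.h>

/* Hypothetical model: simulated time plus the watchdog's expiry time. */
struct wd_model {
	int now;	/* simulated seconds since the panic */
	int deadline;	/* time at which the hardware would reset the box */
};

/* Re-arm the watchdog so it expires 'timeout' seconds from now. */
void
wd_pat(struct wd_model *wd, int timeout)
{
	wd->deadline = wd->now + timeout;
}

/*
 * Write the dump chunk by chunk, patting the watchdog after each chunk.
 * chunk_secs[i] is how long chunk i takes in the simulation.
 * Returns true if the dump finished, false if the watchdog fired first.
 */
bool
dump_with_watchdog(struct wd_model *wd, const int *chunk_secs, int nchunks,
    int timeout)
{
	wd_pat(wd, timeout);
	for (int i = 0; i < nchunks; i++) {
		wd->now += chunk_secs[i];
		if (wd->now >= wd->deadline)
			return (false);	/* chunk hung: hardware resets */
		wd_pat(wd, timeout);
	}
	return (true);
}
```

The point of the design is that a healthy dump keeps pushing the deadline
out, while a single stuck chunk lets the timer expire and reset the system.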

-Andrew



Re: FreeBSD 8-STABLE on R620 w/ X520-DA2/Intel 82599

2012-06-29 Thread Andrew Boyer
Please post the output of pciconf -lvc for these devices.

-Andrew

On Jun 29, 2012, at 10:50 AM, Rick Miller wrote:

 Hi All,
 
 I have 2 hosts, HP DL360 G8 and Dell R620.  Both have the
 X520-DA2/Intel 82599 10G Fiber NIC.  Both also have the same FreeBSD
 8-STABLE image.  The Dell displays the following in dmesg and we are
 unable to configure the ix0 or ix1 interfaces where the HP works just
 fine.  Wondering if anyone else has experienced this?
 
 pci4: network, ethernet at device 0.0 (no driver attached)
 pci4: network, ethernet at device 0.1 (no driver attached)
 
 
 -- 
 Take care
 Rick Miller



Re: Intel X520-DA2 Supported in stable/8?

2012-06-25 Thread Andrew Boyer
You can probably turn hw.ixgbe.num_queues down to 2 or 4 and cut your mbuf 
consumption dramatically without noticing any loss of performance.
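
As a rough illustration of why fewer queues means fewer mbufs: assuming the
driver keeps one receive cluster posted per RX descriptor on every queue of
every port (an approximation, not taken from the ixgbe source), the standing
cluster demand scales linearly with hw.ixgbe.num_queues.  The helper name and
the sample values are hypothetical.

```c
/*
 * Estimate how many mbuf clusters the receive rings hold at rest.
 * Assumption: one cluster per RX descriptor, per queue, per port.
 */
unsigned long
estimate_rx_clusters(unsigned ports, unsigned num_queues, unsigned rxd)
{
	return ((unsigned long)ports * num_queues * rxd);
}
```

For example, with 2 ports and an assumed 2048 descriptors per ring, 8 queues
would pin 32768 clusters while 2 queues would pin 8192, which is the kind of
number to compare against kern.ipc.nmbclusters.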

-A

On Jun 22, 2012, at 6:19 PM, Rick Miller wrote:

 On Fri, Jun 22, 2012 at 5:21 PM, Jack Vogel jfvo...@gmail.com wrote:
 Increase your system mbuf pool size, you do not want that failure to happen.
 
 Thanks, Jack.  I saw a thread where you discussed this.  You are
 referring to kern.ipc.nmbclusters, correct?
 
 Should I also adjust the following?
 
 hw.ixgbe.rxd
 hw.ixgbe.txd
 hw.ixgbe.num_queues
 hw.intr_storm_threshold



Re: Intel X520-DA2 Supported in stable/8?

2012-06-22 Thread Andrew Boyer
The ixgbe driver creates devices named ix0, etc.

I believe you need to run 'ifconfig ix0 up' before it will attempt to get link.

-Andrew

On Jun 22, 2012, at 3:45 PM, Rick Miller wrote:

 On Fri, Jun 22, 2012 at 3:13 PM, Rick Miller vmil...@hostileadmin.com wrote:
 Hi All,
 
 Wondering if the Intel X520-DA2 10G Fibre NIC is supported in
 stable/8.  Hardware notes don't specify it, but I have a system up and
 the interfaces appear to be loaded by the ix driver.  However, status
 indicates no carrier.
 
 Ok, brain fart.  Please forgive my ineptitude.  I once sent an email
 inquiring about the Intel 82599, which is this NIC.  Responses to that
 mail say it's supported by the ixgbe driver.  My stable/8 installation
 (5/21/2012) probes it with an ix driver that I cannot find any info
 on.  The ixgbe manpage indicates it only supports 82598-based
 controllers.  Not sure what to think here...



Can someone do something about the extra svn mergeinfo?

2012-03-20 Thread Andrew Boyer
For example:
http://svnweb.freebsd.org/base/stable/8/sys/dev/e1000/?view=log

This makes it very hard to figure out which changes are actually relevant to 
e1000.

Thank you,
 Andrew



Re: hang during dump (reproducible)

2012-02-14 Thread Andrew Boyer

On Feb 10, 2012, at 9:50 PM, Jake Holland wrote:
 
 Many thanks to Attilio Rao, Kostik Belousov, and Andriy Gapon. And anybody 
 else involved.
 
 However, when I looked at the commit I noticed this:
 $ svn log -r228424 svn://svn.freebsd.org/base
  ...
 MFC after:  3 months (or never)
 
 I'm not sure whether never is still considered an option, but it would be 
 useful for me if 8.3 release, when it comes, does not hang this way during 
 panic. But thanks for the patch, regardless.
 

Agreed - if this commit could be MFC'd for 8.3 it would be much appreciated.

-Andrew



Kernel panics under 8.2 due to ATA timeouts

2012-01-30 Thread Andrew Boyer
Hello Alexander,
I have a system that appears to have a flaky SATA controller (one of the Intel 
ESB2 variants) and it seems to be exposing a weakness in the ATA driver (not 
using ATA_CAM).  If a command with ATA_R_DIRECT set times out, the channel gets 
reinitialized, but from the soft interrupt context.  It panics when it tries to 
sleep in ata_queue_request().

Timeouts work if ATA_R_DIRECT isn't set because in that case it uses a 
taskqueue to complete the request.

Here is the backtrace:
 #0  kdb_enter (why=0x80962cfa "panic", msg=0xa <Address 0xa out of
      bounds>) at ../../../kern/subr_kdb.c:349
 #1  0x805d6d0b in panic (fmt=Variable "fmt" is not available.
      ) at ../../../kern/kern_shutdown.c:689
 #2  0x8061bc53 in sleepq_add (wchan=0xff00052c3e58,
      lock=0xff00052c3e38, wmesg=0x808fa213 "ATA request done",
      flags=1, queue=0) at ../../../kern/subr_sleepqueue.c:320
 #3  0x80590c95 in _cv_timedwait (cvp=0xff00052c3e58,
      lock=0xff00052c3e38, timo=4) at ../../../kern/kern_condvar.c:313
 #4  0x805d61af in _sema_timedwait (sema=0xff00052c3e38,
      timo=4, file=0x808fa1f6 "../../../dev/ata/ata-queue.c",
      line=118) at ../../../kern/kern_sema.c:123
 #5  0x8028559f in ata_queue_request (request=0xff00052c3dc0) at
      ../../../dev/ata/ata-queue.c:117
 #6  0x80286628 in ata_controlcmd (dev=0xff0002e83d00, command=239
      '?', feature=Variable "feature" is not available.
      ) at ../../../dev/ata/ata-queue.c:153
 #7  0x8027ffd3 in ata_setmode (dev=0xff0002e83d00) at
      ../../../dev/ata/ata-all.c:637
 #8  0x802a0af9 in ad_init (dev=0xff0002e83d00) at
      ../../../dev/ata/ata-disk.c:405
 #9  0x802a0c29 in ad_reinit (dev=0xff0002e83d00) at
      ../../../dev/ata/ata-disk.c:221
 #10 0x80280cad in ata_reinit (dev=0xff0002902800) at ata_if.h:79
 #11 0x802856c4 in ata_completed (context=Variable "context" is not
      available.
      ) at ../../../dev/ata/ata-queue.c:313
 #12 0x80285ffb in ata_finish (request=0xff00054ec8c0) at
      ../../../dev/ata/ata-queue.c:265
 #13 0x805ed419 in softclock (arg=Variable "arg" is not available.
      ) at ../../../kern/kern_timeout.c:430

This is very repeatable.  I'm not sure what's the best fix - always use a 
taskqueue on timeouts?  Don't reinit if direct commands fail?
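
For discussion, the first option ("always use a taskqueue on timeouts") can
be sketched as a small decision function.  This is a userland model only:
REQ_DIRECT and REQ_TIMEOUT are hypothetical stand-ins (REQ_DIRECT for
ATA_R_DIRECT), and choose_finish() is not a function in the ata(4) driver.

```c
#define REQ_DIRECT	0x1	/* hypothetical stand-in for ATA_R_DIRECT */
#define REQ_TIMEOUT	0x2	/* hypothetical: the request timed out */

enum finish_action {
	COMPLETE_INLINE,	/* finish in the caller's context */
	COMPLETE_DEFERRED	/* hand off to a taskqueue thread */
};

enum finish_action
choose_finish(unsigned flags)
{
	/*
	 * A timed-out request leads to a channel reinit, which may sleep,
	 * so it must never be completed inline from callout context.
	 */
	if (flags & REQ_TIMEOUT)
		return (COMPLETE_DEFERRED);
	/* Direct requests that did not time out can still finish inline. */
	if (flags & REQ_DIRECT)
		return (COMPLETE_INLINE);
	return (COMPLETE_DEFERRED);
}
```

The only behavioral change relative to what is described above is the first
test: a direct request that timed out is pushed to the deferred path, where
sleeping in the reinit is allowed.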

-Andrew



mfip and smartctl Re: smartctl / mpt on 9.0-RC1

2011-11-07 Thread Andrew Boyer
On Nov 7, 2011, at 6:24 AM, Marat N.Afanasyev wrote:
 
 this is an output on mfi controller with mfip loaded:
 
 # smartctl -a /dev/pass1
 smartctl 5.41 2011-06-09 r3365 [FreeBSD 8.2-RELEASE amd64] (local build)
 Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net
 
 Vendor:               SEAGATE
 Product:              ST3146356SS
 Revision:             0007
 User Capacity:        146,815,737,856 bytes [146 GB]
 Logical block size:   512 bytes
 Logical Unit id:      0x5000c50028f8a56f
 Serial number:        3QN4PWHS9130JLKB
 Device type:          31
 Transport protocol:   SAS
 Local Time is:        Mon Nov  7 15:20:27 2011 MSK
 Device supports SMART and is Enabled
 Temperature Warning Enabled
 SMART Health Status: OK
 
 Current Drive Temperature:     26 C
 Drive Trip Temperature:        68 C
 
 Error counter log:
            Errors Corrected by           Total   Correction     Gigabytes    Total
                ECC          rereads/    errors   algorithm      processed    uncorrected
            fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
 read:    9382124        0         0   9382124    9382124       3436.782           0
 write:         0        0         0         0          0       8978.360           0
 verify:   663433        0         0    663433     663433        332.651           0
 
 Non-medium error count:        7
 
 [GLTSD (Global Logging Target Save Disable) set. Enable Save with '-S on']
 No self-tests have been logged
 Long (extended) Self Test duration: 1740 seconds [29.0 minutes]
 
 btw, 3dm can tell about reallocated sector count on sas somehow, while 
 smartctl cannot, even on supported controller :(

Notice how the device type is 31?  The mfip driver masks off the SCSI 
INQUIRY peripheral device type bits to prevent CAM from attaching da* devices to 
the disks.  See sys/dev/mfi/mfi_cam.c, search for T_DIRECT.  That confuses 
smartctl and prevents it from displaying information like the Grown Defect List.

I added a local hack to smartctl to treat a peripheral device type of 0x1f 
(unknown or missing) as 0x0 (disk), but I don't think the hack is appropriate 
for general consumption.  What we need is a better way for mfi and aac to block 
CAM from attaching without corrupting the inquiry results.
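
For anyone following along, the bits in question are the low five bits of
byte 0 of the standard INQUIRY data; the top three bits are the peripheral
qualifier.  Below is a small standalone sketch of the masking and of the kind
of remapping my hack does.  The PDT_* names and helper functions are made up
for illustration and are not the identifiers used in mfi_cam.c or smartctl.

```c
#include <stdint.h>

#define PDT_MASK	0x1f	/* low 5 bits of INQUIRY byte 0 */
#define PDT_DIRECT	0x00	/* direct-access device (disk) */
#define PDT_NODEVICE	0x1f	/* unknown or no device type */

/* Extract the peripheral device type from INQUIRY byte 0. */
uint8_t
inquiry_pdt(uint8_t byte0)
{
	return ((uint8_t)(byte0 & PDT_MASK));
}

/* Extract the peripheral qualifier (top 3 bits) from INQUIRY byte 0. */
uint8_t
inquiry_qualifier(uint8_t byte0)
{
	return ((uint8_t)(byte0 >> 5));
}

/* Driver side: keep the qualifier, overwrite the type with 0x1f. */
uint8_t
hide_device_type(uint8_t byte0)
{
	return ((uint8_t)((byte0 & 0xe0) | PDT_NODEVICE));
}

/* Tool side: treat "unknown" (0x1f) as a disk, leave other types alone. */
uint8_t
effective_pdt(uint8_t pdt)
{
	return (pdt == PDT_NODEVICE ? PDT_DIRECT : pdt);
}
```

A disk reports byte 0 as 0x00; after hide_device_type() the type reads back
as 0x1f, which is the "Device type: 31" in the output above.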

-Andrew

 -- 
 SY, Marat
 



USB/coredump hangs in 8 and 9

2011-08-12 Thread Andrew Boyer
Re: panic: bufwrite: buffer is not busy??? (originally on freebsd-net)
Re: debugging frequent kernel panics on 8.2-RELEASE (originally on 
freebsd-stable)
Re: System hang in USB umass module while processing panic  (originally on 
freebsd-usb)

Hello Andriy and Hans,

Sorry for tying in so many discussions on this topic, but I think I have an 
explanation for the problems we have been reporting* with hanging coredumps on 
multicore systems on 8.2-RELEASE, and it has implications for Andriy's proposed 
scheduler patch** and for USB.

In today's 8.X and 9.X branches, nothing that I can find stops the other CPUs 
when the kernel panics, but many parts of the locking code get disabled (grep 
on 'panicstr').  The 'bufwrite: buffer is not busy???' panic is caused by the 
syncer encountering an error.  If that happens when it's on the dumping CPU 
everything hangs.  If it's running on a different CPU, it will be blocked and 
hidden by the panic_cpu spinlock in panic(), and the dump continues, polling 
every attached keyboard for a Ctrl-C.

But, the new 8.X USB stack relies on multithreading.  (The new stack is the 
variable that broke coredumps for us in the 7.1-8.2 transition, I think.)  SVN 
224223 fixes a hang that would happen when dumpsys() polls the USB keyboard 
(IPMI KVM, in our case).  That helps, but it only gets as far as usb_process(), 
where it hangs in a loop around a cv_wait() call.  This is easy to reproduce by 
adding code to the watchdog to break into the debugger if panicstr is set.

I am experimenting with Andriy's patch** to stop the scheduler and it seems to 
be most of the way there, stopping the CPUs and disabling the rest of locking.  
There are a few places that still reference panicstr, but that's minor.  These 
are the changes I made to the patch:
 * Changed ukbd_do_poll() to return immediately if SCHEDULER_STOPPED() is true, 
so that we don't hang up in USB.  ukbd_yield()  locks up in DROP_GIANT(), and 
if you skip ukbd_yield(), usbd_transfer_poll() locks up trying to drop mutexes.
 * Changed the call to spinlock_enter() back to critical_enter(), so that 
interrupts stay enabled and the hardclock still functions.
 * Added code in the beginning of panic() to switch to CPU 0, so that we're 
able to service the hardclock interrupts and so that watchdog panics get 
through.

This has worked 100% for me so far, although anyone using a USB keyboard or 
dump device would still be out of luck.

Thoughts?  It seems like stopping all of the other CPUs is the right thing to 
do on a panic (what are they doing otherwise?).  Are the USB issues fixable?  
If Andriy's patch gets committed it might just involve short-circuiting all of 
the locking in the polling path, but I haven't gotten that far yet.  I bet 
dumping to NFS will have the same problem.

Thanks,
  Andrew

* - http://www.freebsd.org/cgi/query-pr.cgi?pr=kern/155421
** - http://people.freebsd.org/~avg/stop_scheduler_on_panic.8.x.diff


Re: Heads up: you'll need to do a fresh config KERNEL etc

2011-06-02 Thread Andrew Boyer

On May 19, 2011, at 6:53 PM, Rick Macklem wrote:
 Assuming that you are using the regular 8.n client (and not the new
 one), there have been some commits related to krpc bugs that could have
 fixed cases which would have caused poor perf., although all of those
 (except one where a client would hang on a TCP reconnect attempt) are in
 8.2.

Are you referring to r221934?  If not, which change?

(Trying to make sure I have them all...)

Thanks,
  Andrew



Re: About panic: bufwrite: buffer is not busy???

2011-02-20 Thread Andrew Boyer

On Feb 20, 2011, at 10:46 AM, Jeremy Chadwick wrote:

 On Sun, Feb 20, 2011 at 10:30:52AM -0500, Mike Tancsa wrote:
 On 2/20/2011 9:33 AM, Andrey Smagin wrote:
 On this week's -current I have the same problem: my box panicked every 2-15 
 min.  I resolved the problem by unplugging the network connectors from 2 
 Intel em (82574L) cards.  Last time I thought it was an mpd5-related panic, 
 but mpd5 works with another interface, an re integrated on the motherboard.  
 I think it may be an em-related panic, or em+mpd5.
 
 The latest panic I saw didn't have anything to do with em.  Are you sure
 your crashes are because of the NIC driver?
 
 Not to mention, the error string the OP provided (see Subject) is only
 contained in one file: sys/ufs/ffs/ffs_vfsops.c, function
 ffs_bufwrite().  So, that would be some kind of weird filesystem-related
 issue, not NIC-specific.  I have no idea how to debug said problem.
 

The issue is the file system activity occurring in parallel with the coredump, 
which is strange.  It seems like everything else should be halted before the 
dump begins but I couldn't find a place in the code that actually tries to stop 
the other CPUs.

My question isn't about the initial panic (I was using the sysctl to provoke 
one), but about the secondary panic.

This is on 8-core systems.

-Andrew



Re: About panic: bufwrite: buffer is not busy???

2011-02-16 Thread Andrew Boyer
Moving this to -current and -stable and following up...

Something is broken with coredumps on stable/8 amd64.  I tried a vanilla 
8.2-RC3 and yesterday's csup of stable/8; neither can dump a core with 'sysctl 
debug.kdb.panic=1'.

For the 8.2-RC3 / amd64 / GENERIC install, I used the memstick image, installed 
on ad7 (a 250GB SATA drive), used the default partition map, and set dumpdev to 
AUTO.

I added enough tracing to show that the second panic is due to the syncer 
process flushing buffers to the other filesystems in parallel with the dump.  
I've seen this panic and a similar one 'buffer not locked' coming from 
ffs_write().  One time out of about 30 the core ran to completion, but slowly 
(~1MB/sec).  Other times the dump just locks up completely with no other output.

Does anyone know what might have changed to expose this problem?

I don't ever see it under 7.1.

Thanks,
 Andrew

On Feb 3, 2011, at 12:11 AM, Eugene Grosbein wrote:

 On 02.02.2011 00:50, Gleb Smirnoff wrote:
 
 E Uptime: 8h3m51s
 E Dumping 4087 MB (3 chunks)
 E   chunk 0: 1MB (150 pages) ... ok
 E   chunk 1: 3575MB (915088 pages) 3559 3543panic: bufwrite: buffer is not 
 busy???
 E cpuid = 3
 E Uptime: 8h3m52s
 E Automatic reboot in 15 seconds - press a key on the console to abort
 Can you add KDB_TRACE option to kernel? Your boxes for some reason can't
 dump core, but with this option we will have at least trace.
 
 I see Mike Tancsa's box has bufwrite: buffer is not busy??? problem too.
 Has anyone a thought how to fix generation of crashdumps?
 
 Eugene Grosbein
 
 
 ___
 freebsd-...@freebsd.org mailing list
 http://lists.freebsd.org/mailman/listinfo/freebsd-net
 To unsubscribe, send any mail to freebsd-net-unsubscr...@freebsd.org
