Re: current + mpt = panic: Bad link elm 0xffffff80002d6480 next-prev != elm
On 2010-07-21 at 20:40, Svein Skogen (Listmail account) wrote:
> On 21.07.2010 18:33, Ståle Kristoffersen wrote:
> snip
>> I -might- have solved my problem. It has now run for 24h without timeouts, and with a bit of load on it. I think I might have run into the Seagate + NCQ problem, even though Seagate's webpage told me my drives were not affected (according to the serial numbers). I did however update the following
>>
>> num  drives        firmware
>> 6x   ST31000340AS  SD15
>> 4x   ST31500341AS  SD17
>
> I have 8 of the last type (31500341AS), mine running CC1H firmware, connected to my MFI. Not a single glitch so far.

I also have 8 of those :) Part of my problem is that they are all connected to a SAS expander, and when one drive gets in trouble everything is reset, so I can't see which drive is causing the problems. That's why I flashed every drive I could find an update for.

>> to firmware SD1B (old SD17) and SD1A (old SD15), and that looks like it has done the trick. I'll report back in a week or so if the problem has not reappeared.
>
> Hope it's fixed for you. I'm still keeping an eye on the MPT code to see if someone changes something that CAN be affecting my timeout issues/reset, and if I see something promising, I'm willing to dump out the entire server to tapes, and test run (I have sufficient spare tapes to actually test without losing data), but such a job will take me a week to prepare, and another to test. Quite a bit of time for something that may solve my problem... ;)

It still runs fine now after 6 days, so I'm optimistic :) Not a single timeout. Good luck with your tape drive.

-- 
Ståle Kristoffersen
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org
Re: current + mpt = panic: Bad link elm 0xffffff80002d6480 next-prev != elm
On 2010-07-20 at 14:16, Svein Skogen (Listmail account) wrote:
> Sorry for the late response here, but what you're describing matches fairly well what I saw with RELENG_8 (just after 8.0 was released), but luckily I didn't have any disks on my MPT, just my tape autoloader. Random timeouts, and then bus resets (that made tape IO unreliable).
>
> The bad news is that I had the exact same trouble with OpenSolaris (134), and something similar with Linux (can't remember versions), at the time. I never did find a solution, and ended up throwing Windows on the box, just to get reliable backups. My MPT is a 3801 LSI1068e based card running the latest BIOS.

Hmm, that does not sound good. Did Windows work on the same hardware without problems?

I -might- have solved my problem. It has now run for 24h without timeouts, and with a bit of load on it. I think I might have run into the Seagate + NCQ problem, even though Seagate's webpage told me my drives were not affected (according to the serial numbers). I did however update the following

num  drives        firmware
6x   ST31000340AS  SD15
4x   ST31500341AS  SD17

to firmware SD1B (old SD17) and SD1A (old SD15), and that looks like it has done the trick. I'll report back in a week or so if the problem has not reappeared.

-- 
Ståle Kristoffersen
Re: current + mpt = panic: Bad link elm 0xffffff80002d6480 next-prev != elm
On 2010-07-20 at 12:17, Marius Strobl wrote:
> On Mon, Jul 19, 2010 at 07:06:54PM +0200, Ståle Kristoffersen wrote:
>> On 2010-07-18 at 14:20, Marius Strobl wrote:
>>>>> Downgrading now...
>>>>
>>>> And it crashed again, with current from r209598...
>>>
>>> Ok, this at least means that your problem isn't caused by the recent changes to mpt(4) as the pre-r209599 version only differed from the 8-STABLE one in a cosmetic change at that time.
>>
>> I have another data point: I cvsup'ed to the latest current again, and rebuilt without INVARIANTS and WITNESS, and now it seems to survive the timeouts.
>
> That's more or less expected as the sanity check issuing the panic just isn't compiled in then. However, my understanding was that with STABLE you don't get the timeouts in the first place, or do you see them there also?

I got the timeouts with STABLE as well; that was the reason for me to try out CURRENT. I'm sorry I didn't mention that earlier. My main concern is to get rid of the timeouts, but a panic on one can't be right.

How can I debug this further? I can get timeouts fairly consistently by putting a bit of load on the drives. If it would help I can also provide remote access.

I'm trying to update the firmware on some of the drives now to see if that helps with the timeouts.

-- 
Ståle Kristoffersen
Re: current + mpt = panic: Bad link elm 0xffffff80002d6480 next-prev != elm
On 2010-07-18 at 14:20, Marius Strobl wrote:
>>> Downgrading now...
>>
>> And it crashed again, with current from r209598...
>
> Ok, this at least means that your problem isn't caused by the recent changes to mpt(4) as the pre-r209599 version only differed from the 8-STABLE one in a cosmetic change at that time.

I have another data point: I cvsup'ed to the latest current again, and rebuilt without INVARIANTS and WITNESS, and now it seems to survive the timeouts.

-- 
Ståle Kristoffersen
Re: current + mpt = panic: Bad link elm 0xffffff80002d6480 next-prev != elm
On 2010-07-16 at 12:31, Ståle Kristoffersen wrote:
> On 2010-07-15 at 19:52, Ståle Kristoffersen wrote:
>> On 2010-07-15 at 18:00, Marius Strobl wrote:
>>> On Thu, Jul 15, 2010 at 02:34:23PM +0200, Ståle Kristoffersen wrote:
>>>> Upgraded from stable to current yesterday and very quickly received a panic. It did however not dump its core, so I was unable to debug it.
>>>>
>>>> Today it did panic again, and I took a picture: (Sorry about the bad quality)
>>>> http://folk.uio.no/stalk/mpt/IMG_1403.JPG
>>>>
>>>> And from the backtrace:
>>>> http://folk.uio.no/stalk/mpt/IMG_1404.JPG
>>>>
>>>> Both times I had the "mpt0: request timed out" just before the panic.
>>>>
>>>> I'm not sure why it's not dumping its core (It was working under stable, and I have dumpdev=AUTO and dumpdir=/var/crash in rc.conf)
>>>
>>> What revision were you using?
>>
>> Not sure exactly what revision I was using, is there an easy way to figure that out? I ran cvsupdate around 13:00 CEST yesterday.
>>
>>> Does using current as of r209598 make a difference?
>>
>> Downgrading now...
>
> And it crashed again, with current from r209598...

It still keeps on crashing :/

I grabbed the output of show alllocks:
http://folk.uio.no/stalk/mpt/IMAG0047.jpg

To me it looks like maybe there is a race condition or something that makes the TAILQ_REMOVE call in mpt_scsi_tmf_reply_handler() work on an element that has already been removed, but this is an uneducated guess ;) I do not understand enough of the driver to follow the flow of the requests around the driver.

-- 
Ståle Kristoffersen
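The guess above (a second TAILQ_REMOVE on an element that has already left the list) can be illustrated in userland. The panic text in the subject line looks like the forward-link sanity check that sys/queue.h applies to tail queues in debugging kernels: an element's successor must point back at that element. What follows is only a sketch under stated assumptions, not code from mpt(4): struct req, next_link_ok(), remove_checked() and double_remove_demo() are hypothetical names, and the check is rewritten by hand so that it also works with a sys/queue.h that has no built-in checks. It further assumes that TAILQ_REMOVE leaves the removed element's own pointers untouched, which holds when queue trashing is not enabled.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical request type; it only stands in for a driver request. */
struct req {
	int id;
	TAILQ_ENTRY(req) links;
};
TAILQ_HEAD(req_list, req);

/*
 * Forward-link check: if elm has a successor, the successor's back
 * pointer must point at elm's own tqe_next field. This only covers the
 * "next" side; an element removed while it was the tail has no
 * successor and would pass.
 */
static int
next_link_ok(struct req *elm)
{
	struct req *next = TAILQ_NEXT(elm, links);

	return (next == NULL ||
	    next->links.tqe_prev == &elm->links.tqe_next);
}

/* Remove elm only if the check passes; return -1 where a kernel would panic. */
static int
remove_checked(struct req_list *head, struct req *elm)
{
	if (!next_link_ok(elm))
		return (-1);
	TAILQ_REMOVE(head, elm, links);
	return (0);
}

/* Returns 1 if the first removal succeeds and the second one is caught. */
static int
double_remove_demo(void)
{
	struct req_list head;
	struct req a = { .id = 1 }, b = { .id = 2 }, c = { .id = 3 };
	int first, second;

	TAILQ_INIT(&head);
	TAILQ_INSERT_TAIL(&head, &a, links);
	TAILQ_INSERT_TAIL(&head, &b, links);
	TAILQ_INSERT_TAIL(&head, &c, links);

	/* First removal: c still points back at b, so the check passes. */
	first = remove_checked(&head, &b);

	/*
	 * Second removal: b's stale tqe_next still names c, but c now
	 * points back at a, so the check fails.
	 */
	second = remove_checked(&head, &b);

	return (first == 0 && second == -1);
}
```

In this sketch the second removal is rejected instead of corrupting the list. A kernel built without the check would go ahead and rewrite the neighbours' pointers from the stale element, which is consistent with the box surviving the timeouts once INVARIANTS was removed, without that making the double removal itself harmless.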
Re: current + mpt = panic: Bad link elm 0xffffff80002d6480 next-prev != elm
On 2010-07-15 at 19:52, Ståle Kristoffersen wrote:
> On 2010-07-15 at 18:00, Marius Strobl wrote:
>> On Thu, Jul 15, 2010 at 02:34:23PM +0200, Ståle Kristoffersen wrote:
>>> Upgraded from stable to current yesterday and very quickly received a panic. It did however not dump its core, so I was unable to debug it.
>>>
>>> Today it did panic again, and I took a picture: (Sorry about the bad quality)
>>> http://folk.uio.no/stalk/mpt/IMG_1403.JPG
>>>
>>> And from the backtrace:
>>> http://folk.uio.no/stalk/mpt/IMG_1404.JPG
>>>
>>> Both times I had the "mpt0: request timed out" just before the panic.
>>>
>>> I'm not sure why it's not dumping its core (It was working under stable, and I have dumpdev=AUTO and dumpdir=/var/crash in rc.conf)
>>
>> What revision were you using?
>
> Not sure exactly what revision I was using, is there an easy way to figure that out? I ran cvsupdate around 13:00 CEST yesterday.
>
>> Does using current as of r209598 make a difference?
>
> Downgrading now...

And it crashed again, with current from r209598...

-- 
Ståle Kristoffersen
current + mpt = panic: Bad link elm 0xffffff80002d6480 next-prev != elm
Upgraded from stable to current yesterday and very quickly received a panic. It did however not dump its core, so I was unable to debug it.

Today it did panic again, and I took a picture: (Sorry about the bad quality)
http://folk.uio.no/stalk/mpt/IMG_1403.JPG

And from the backtrace:
http://folk.uio.no/stalk/mpt/IMG_1404.JPG

Both times I had the "mpt0: request timed out" just before the panic.

I'm not sure why it's not dumping its core (It was working under stable, and I have dumpdev=AUTO and dumpdir=/var/crash in rc.conf)

-- 
Ståle Kristoffersen
Re: current + mpt = panic: Bad link elm 0xffffff80002d6480 next-prev != elm
On 2010-07-15 at 14:34, Ståle Kristoffersen wrote:
> Upgraded from stable to current yesterday and very quickly received a panic. It did however not dump its core, so I was unable to debug it.
>
> Today it did panic again, and I took a picture: (Sorry about the bad quality)
> http://folk.uio.no/stalk/mpt/IMG_1403.JPG
>
> And from the backtrace:
> http://folk.uio.no/stalk/mpt/IMG_1404.JPG
>
> Both times I had the "mpt0: request timed out" just before the panic.
>
> I'm not sure why it's not dumping its core (It was working under stable, and I have dumpdev=AUTO and dumpdir=/var/crash in rc.conf)

Just to be complete: I also get these LORs at boot:

lock order reversal:
 1st 0xff80a5108b38 bufwait (bufwait) @ /usr/src/sys/kern/vfs_bio.c:2607
 2nd 0xff0002dc6000 dirhash (dirhash) @ /usr/src/sys/ufs/ufs/ufs_dirhash.c:283
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
_witness_debugger() at _witness_debugger+0x2e
witness_checkorder() at witness_checkorder+0x81e
_sx_xlock() at _sx_xlock+0x55
ufsdirhash_acquire() at ufsdirhash_acquire+0x33
ufsdirhash_remove() at ufsdirhash_remove+0x16
ufs_dirremove() at ufs_dirremove+0x1a4
ufs_remove() at ufs_remove+0x92
VOP_REMOVE_APV() at VOP_REMOVE_APV+0x93
kern_unlinkat() at kern_unlinkat+0x2cb
syscallenter() at syscallenter+0x1b5
syscall() at syscall+0x4c
Xfast_syscall() at Xfast_syscall+0xe2
--- syscall (10, FreeBSD ELF64, unlink), rip = 0x80072f3cc, rsp = 0x7fffdb08, rbp = 0x7fffef58 ---

lock order reversal:
 1st 0xff00407a4458 ufs (ufs) @ /usr/src/sys/kern/vfs_mount.c:1058
 2nd 0xff00407aedb8 devfs (devfs) @ /usr/src/sys/kern/vfs_subr.c:2090
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
_witness_debugger() at _witness_debugger+0x2e
witness_checkorder() at witness_checkorder+0x81e
__lockmgr_args() at __lockmgr_args+0xd11
vop_stdlock() at vop_stdlock+0x39
VOP_LOCK1_APV() at VOP_LOCK1_APV+0x9b
_vn_lock() at _vn_lock+0x47
vget() at vget+0x7b
devfs_allocv() at devfs_allocv+0x100
devfs_root() at devfs_root+0x48
vfs_donmount() at vfs_donmount+0xfb2
nmount() at nmount+0x63
syscallenter() at syscallenter+0x1b5
syscall() at syscall+0x4c
Xfast_syscall() at Xfast_syscall+0xe2
--- syscall (378, FreeBSD ELF64, nmount), rip = 0x8007b2b4c, rsp = 0x7fffdd28, rbp = 0x800c09048 ---

-- 
Ståle Kristoffersen
Re: current + mpt = panic: Bad link elm 0xffffff80002d6480 next-prev != elm
On 2010-07-15 at 18:00, Marius Strobl wrote:
> On Thu, Jul 15, 2010 at 02:34:23PM +0200, Ståle Kristoffersen wrote:
>> Upgraded from stable to current yesterday and very quickly received a panic. It did however not dump its core, so I was unable to debug it.
>>
>> Today it did panic again, and I took a picture: (Sorry about the bad quality)
>> http://folk.uio.no/stalk/mpt/IMG_1403.JPG
>>
>> And from the backtrace:
>> http://folk.uio.no/stalk/mpt/IMG_1404.JPG
>>
>> Both times I had the "mpt0: request timed out" just before the panic.
>>
>> I'm not sure why it's not dumping its core (It was working under stable, and I have dumpdev=AUTO and dumpdir=/var/crash in rc.conf)
>
> What revision were you using?

Not sure exactly what revision I was using, is there an easy way to figure that out? I ran cvsupdate around 13:00 CEST yesterday.

> Does using current as of r209598 make a difference?

Downgrading now...

-- 
Ståle Kristoffersen
staal...@ifi.uio.no