Re: current + mpt = panic: Bad link elm 0xffffff80002d6480 next->prev != elm

2010-07-26 Thread Ståle Kristoffersen
On 2010-07-21 at 20:40, Svein Skogen (Listmail account) wrote:
 On 21.07.2010 18:33, Ståle Kristoffersen wrote:

snip

  I -might- have solved my problem. It has now run for 24h without timeouts,
  and with a bit of load on it. I think I might have run into the Seagate +
  NCQ problem, even though Seagate's webpage told me my drives were not
  affected (according to the serial numbers). I did however update the following
  count  model         old firmware
  6x     ST31000340AS  SD15
  4x     ST31500341AS  SD17
 
 I have 8 of the last type (ST31500341AS); mine are running CC1H firmware,
 connected to my MFI. Not a single glitch so far.

I also have 8 of those :) Part of my problem is that they are all connected
to a SAS expander, and when one drive gets in trouble everything is reset,
so I can't see which drive is causing the problems. That's why I flashed
every drive I could find an update for.

  to firmware SD1B (old SD17) and SD1A (old SD15), and that looks like it has
  done the trick. I'll report back in a week or so if the problem has not
  reappeared.
 
 Hope it's fixed for you. I'm still keeping an eye on the MPT code to see
 if someone changes something that CAN be affecting my timeout
 issues/reset, and if I see something promising, I'm willing to dump out
 the entire server to tapes, and test run (I have sufficient spare tapes
 to actually test without losing data), but such a job will take me a
 week to prepare, and another to test. Quite a bit of time for something
 that may solve my problem... ;)

It still runs fine now after 6 days, so I'm optimistic :)
Not a single timeout.
Good luck with your tape drive.
-- 
Ståle Kristoffersen
___
freebsd-current@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to freebsd-current-unsubscr...@freebsd.org


Re: current + mpt = panic: Bad link elm 0xffffff80002d6480 next->prev != elm

2010-07-21 Thread Ståle Kristoffersen
On 2010-07-20 at 14:16, Svein Skogen (Listmail account) wrote:
 Sorry for the late response here, but what you're describing matches
 fairly well what I saw with RELENG_8 (just after 8.0 was released), but
 luckily I didn't have any disks on my MPT, just my tape autoloader.
 
 Random timeouts, and then bus resets (that made tape IO unreliable).
 
 The bad news is that I had the exact same trouble with OpenSolaris
 (134), and something similar with Linux (can't remember versions), at
 the time.
 
 I never did find a solution, and ended up throwing Windows on the box,
 just to get reliable backups.
 
 My MPT is a 3801 LSI1068e based card running the latest bios.

Hmm, that does not sound good. Did Windows work on the same hardware
without problems?

I -might- have solved my problem. It has now run for 24h without timeouts,
and with a bit of load on it. I think I might have run into the Seagate +
NCQ problem, even though Seagate's webpage told me my drives were not
affected (according to the serial numbers). I did however update the following
count  model         old firmware
6x     ST31000340AS  SD15
4x     ST31500341AS  SD17

to firmware SD1B (old SD17) and SD1A (old SD15), and that looks like it has
done the trick. I'll report back in a week or so if the problem has not
reappeared.

-- 
Ståle Kristoffersen


Re: current + mpt = panic: Bad link elm 0xffffff80002d6480 next->prev != elm

2010-07-20 Thread Ståle Kristoffersen
On 2010-07-20 at 12:17, Marius Strobl wrote:
 On Mon, Jul 19, 2010 at 07:06:54PM +0200, Ståle Kristoffersen wrote:
  On 2010-07-18 at 14:20, Marius Strobl wrote:
     Downgrading now...
    
    And it crashed again, with current from r209598...
   
   
   Ok, this at least means that your problem isn't caused by the recent
   changes to mpt(4) as the pre-r209599 version only differed from the
   8-STABLE one in a cosmetic change at that time.
  
  I have another data point: I cvsup'ed to the latest current again and
  rebuilt without INVARIANTS and WITNESS, and now it seems to survive the
  timeouts.
 
 That's more or less expected as the sanity check issuing the panic
 just isn't compiled in then. However, my understanding was that with
 STABLE you don't get the timeouts in the first place, or do you see
 them there also?

I got the timeouts with STABLE as well, that was the reason for me to
try out CURRENT. I'm sorry I didn't mention that earlier.

My main concern is to get rid of the timeouts, but a panic on one can't be
right. How can I debug this further? I can trigger the timeouts fairly
consistently by putting a bit of load on the drives. If it would help I can
also provide remote access.

I'm trying to update the firmware on some of the drives now to see if that
helps with the timeouts.

-- 
Ståle Kristoffersen


Re: current + mpt = panic: Bad link elm 0xffffff80002d6480 next->prev != elm

2010-07-19 Thread Ståle Kristoffersen
On 2010-07-18 at 14:20, Marius Strobl wrote:
   Downgrading now...
  
  And it crashed again, with current from r209598...
  
 
 Ok, this at least means that your problem isn't caused by the recent
 changes to mpt(4) as the pre-r209599 version only differed from the
 8-STABLE one in a cosmetic change at that time.

I have another data point: I cvsup'ed to the latest current again and
rebuilt without INVARIANTS and WITNESS, and now it seems to survive the
timeouts.
-- 
Ståle Kristoffersen


Re: current + mpt = panic: Bad link elm 0xffffff80002d6480 next->prev != elm

2010-07-18 Thread Ståle Kristoffersen
On 2010-07-16 at 12:31, Ståle Kristoffersen wrote:
 On 2010-07-15 at 19:52, Ståle Kristoffersen wrote:
  On 2010-07-15 at 18:00, Marius Strobl wrote:
   On Thu, Jul 15, 2010 at 02:34:23PM +0200, Ståle Kristoffersen wrote:
    Upgraded from stable to current yesterday and very quickly received a
    panic. It did however not dump its core, so I was unable to debug it.
    Today it did panic again, and I took a picture: (Sorry about the bad
    quality)
    
    http://folk.uio.no/stalk/mpt/IMG_1403.JPG
    
    And from the backtrace:
    http://folk.uio.no/stalk/mpt/IMG_1404.JPG
    
    Both times I had the "mpt0: request timed out" message just before the
    panic.
    
    I'm not sure why it's not dumping its core (It was working under stable,
    and I have dumpdev=AUTO and dumpdir=/var/crash in rc.conf)
   
   What revision were you using?
  
  Not sure exactly what revision I was using, is there an easy way to figure
  that out? I ran cvsupdate around 13:00 CEST yesterday.
  
   Does using current as of r209598 make a difference?
  
  Downgrading now...
 
 And it crashed again, with current from r209598...

It still keeps on crashing :/
I grabbed the output of show alllocks:
http://folk.uio.no/stalk/mpt/IMAG0047.jpg

To me it looks like there may be a race condition or something similar that
makes the TAILQ_REMOVE call in mpt_scsi_tmf_reply_handler() work on an
element that has already been removed, but this is an uneducated guess ;)
I do not understand enough of the driver to follow the flow of requests
through it.
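
To make that guess concrete, here is a minimal userland sketch of how a
link check of this kind catches a second removal of the same element. It is
not the mpt(4) code and not the real sys/queue.h macros; struct req,
struct req_list and all function names are made up for illustration. I am
assuming the kernel check compares the neighbours' back-pointers the same
way, which is what the panic string suggests.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request type; not the real mpt(4) request structure. */
struct req {
	struct req  *next;	/* next element, or NULL at the tail */
	struct req **prev;	/* address of the previous element's next pointer */
	int          id;
};

struct req_list {
	struct req  *first;
	struct req **last;	/* address of the last element's next pointer */
};

static void
list_init(struct req_list *l)
{
	assert(l != NULL);
	l->first = NULL;
	l->last = &l->first;
}

static void
list_insert_tail(struct req_list *l, struct req *e)
{
	assert(l != NULL && e != NULL);
	e->next = NULL;
	e->prev = l->last;
	*l->last = e;
	l->last = &e->next;
}

/* Return NULL if both neighbours still point back at e, else name the bad link. */
static const char *
link_error(struct req *e)
{
	if (e->next != NULL && e->next->prev != &e->next)
		return "next->prev != elm";
	if (*e->prev != e)
		return "prev->next != elm";
	return NULL;
}

/* Remove e if its links are consistent; return -1 where a debug kernel would panic. */
static int
list_remove_checked(struct req_list *l, struct req *e)
{
	if (link_error(e) != NULL)
		return -1;
	if (e->next != NULL)
		e->next->prev = e->prev;
	else
		l->last = e->prev;
	*e->prev = e->next;
	/* e keeps its stale next/prev pointers, as after a plain removal. */
	return 0;
}
```

With three requests a, b and c on the list, the first removal of b succeeds.
A second removal of b is rejected, because b still points at c while c now
points back at a. Without the check the second removal would go ahead on
stale pointers, which can corrupt the list once the neighbours have changed
or been freed in the meantime.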

-- 
Ståle Kristoffersen


Re: current + mpt = panic: Bad link elm 0xffffff80002d6480 next->prev != elm

2010-07-16 Thread Ståle Kristoffersen
On 2010-07-15 at 19:52, Ståle Kristoffersen wrote:
 On 2010-07-15 at 18:00, Marius Strobl wrote:
  On Thu, Jul 15, 2010 at 02:34:23PM +0200, Ståle Kristoffersen wrote:
   Upgraded from stable to current yesterday and very quickly received a
   panic. It did however not dump its core, so I was unable to debug it.
   Today it did panic again, and I took a picture: (Sorry about the bad
   quality)
   
   http://folk.uio.no/stalk/mpt/IMG_1403.JPG
   
   And from the backtrace:
   http://folk.uio.no/stalk/mpt/IMG_1404.JPG
   
   Both times I had the "mpt0: request timed out" message just before the
   panic.
   
   I'm not sure why it's not dumping its core (It was working under stable,
   and I have dumpdev=AUTO and dumpdir=/var/crash in rc.conf)
  
  What revision were you using?
 
 Not sure exactly what revision I was using, is there an easy way to figure
 that out? I ran cvsupdate around 13:00 CEST yesterday.
 
  Does using current as of r209598 make a difference?
 
 Downgrading now...

And it crashed again, with current from r209598...


-- 
Ståle Kristoffersen


current + mpt = panic: Bad link elm 0xffffff80002d6480 next->prev != elm

2010-07-15 Thread Ståle Kristoffersen
Upgraded from stable to current yesterday and very quickly received a
panic. It did however not dump its core, so I was unable to debug it.
Today it did panic again, and I took a picture: (Sorry about the bad
quality)

http://folk.uio.no/stalk/mpt/IMG_1403.JPG

And from the backtrace:
http://folk.uio.no/stalk/mpt/IMG_1404.JPG

Both times I had the "mpt0: request timed out" message just before the
panic.

I'm not sure why it's not dumping its core (It was working under stable,
and I have dumpdev=AUTO and dumpdir=/var/crash in rc.conf)
-- 
Ståle Kristoffersen


Re: current + mpt = panic: Bad link elm 0xffffff80002d6480 next->prev != elm

2010-07-15 Thread Ståle Kristoffersen
On 2010-07-15 at 14:34, Ståle Kristoffersen wrote:
 Upgraded from stable to current yesterday and very quickly received a
 panic. It did however not dump its core, so I was unable to debug it.
 Today it did panic again, and I took a picture: (Sorry about the bad
 quality)
 
 http://folk.uio.no/stalk/mpt/IMG_1403.JPG
 
 And from the backtrace:
 http://folk.uio.no/stalk/mpt/IMG_1404.JPG
 
 Both times I had the "mpt0: request timed out" message just before the
 panic.
 
 I'm not sure why it's not dumping its core (It was working under stable,
 and I have dumpdev=AUTO and dumpdir=/var/crash in rc.conf)

Just to be complete: I also get this LOR at boot:
lock order reversal:
 1st 0xff80a5108b38 bufwait (bufwait) @ /usr/src/sys/kern/vfs_bio.c:2607
 2nd 0xff0002dc6000 dirhash (dirhash) @ /usr/src/sys/ufs/ufs/ufs_dirhash.c:283
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
_witness_debugger() at _witness_debugger+0x2e
witness_checkorder() at witness_checkorder+0x81e
_sx_xlock() at _sx_xlock+0x55
ufsdirhash_acquire() at ufsdirhash_acquire+0x33
ufsdirhash_remove() at ufsdirhash_remove+0x16
ufs_dirremove() at ufs_dirremove+0x1a4
ufs_remove() at ufs_remove+0x92
VOP_REMOVE_APV() at VOP_REMOVE_APV+0x93
kern_unlinkat() at kern_unlinkat+0x2cb
syscallenter() at syscallenter+0x1b5
syscall() at syscall+0x4c
Xfast_syscall() at Xfast_syscall+0xe2
--- syscall (10, FreeBSD ELF64, unlink), rip = 0x80072f3cc, rsp = 0x7fffdb08, rbp = 0x7fffef58 ---
lock order reversal:
 1st 0xff00407a4458 ufs (ufs) @ /usr/src/sys/kern/vfs_mount.c:1058
 2nd 0xff00407aedb8 devfs (devfs) @ /usr/src/sys/kern/vfs_subr.c:2090
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2a
_witness_debugger() at _witness_debugger+0x2e
witness_checkorder() at witness_checkorder+0x81e
__lockmgr_args() at __lockmgr_args+0xd11
vop_stdlock() at vop_stdlock+0x39
VOP_LOCK1_APV() at VOP_LOCK1_APV+0x9b
_vn_lock() at _vn_lock+0x47
vget() at vget+0x7b
devfs_allocv() at devfs_allocv+0x100
devfs_root() at devfs_root+0x48
vfs_donmount() at vfs_donmount+0xfb2
nmount() at nmount+0x63
syscallenter() at syscallenter+0x1b5
syscall() at syscall+0x4c
Xfast_syscall() at Xfast_syscall+0xe2
--- syscall (378, FreeBSD ELF64, nmount), rip = 0x8007b2b4c, rsp = 0x7fffdd28, rbp = 0x800c09048 ---
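
For anyone not used to reading these reports: a lock order reversal means
two locks were taken in the opposite order from one seen earlier. Below is
a rough userland sketch of that bookkeeping, assuming a much simplified
model (direct pairs only, no transitive ordering, a single thread of
acquisitions). It is not how WITNESS is implemented; struct order_checker
and the oc_* functions are invented names, and the lock names in the usage
note are only borrowed from the report above.

```c
#include <string.h>

#define MAX_CLASSES 8
#define MAX_HELD    8

/* Hypothetical checker state; names are invented for this sketch. */
struct order_checker {
	const char   *names[MAX_CLASSES];
	int           nclasses;
	/* before[a][b] != 0: class a was held when class b was acquired */
	unsigned char before[MAX_CLASSES][MAX_CLASSES];
	int           held[MAX_HELD];
	int           nheld;
};

static void
oc_init(struct order_checker *oc)
{
	memset(oc, 0, sizeof(*oc));
}

/* Look up a lock class by name, registering it on first use. */
static int
oc_class(struct order_checker *oc, const char *name)
{
	for (int i = 0; i < oc->nclasses; i++)
		if (strcmp(oc->names[i], name) == 0)
			return i;
	if (oc->nclasses == MAX_CLASSES)
		return -1;
	oc->names[oc->nclasses] = name;
	return oc->nclasses++;
}

/*
 * Record an acquisition. Returns 1 if it reverses an order recorded
 * earlier, 0 if the order is new or consistent, -1 if a table is full.
 */
static int
oc_acquire(struct order_checker *oc, const char *name)
{
	int c = oc_class(oc, name);
	int reversal = 0;

	if (c < 0 || oc->nheld == MAX_HELD)
		return -1;
	for (int i = 0; i < oc->nheld; i++) {
		int h = oc->held[i];

		if (h == c)
			continue;		/* same class held twice: ignored here */
		if (oc->before[c][h])
			reversal = 1;		/* earlier we saw c taken before h */
		else
			oc->before[h][c] = 1;	/* remember: h before c */
	}
	oc->held[oc->nheld++] = c;
	return reversal;
}

/* Drop one held instance of the named class, if any. */
static void
oc_release(struct order_checker *oc, const char *name)
{
	int c = oc_class(oc, name);

	for (int i = oc->nheld - 1; i >= 0; i--) {
		if (oc->held[i] == c) {
			oc->held[i] = oc->held[--oc->nheld];
			return;
		}
	}
}
```

If the checker first sees "dirhash" held while "bufwait" is acquired, and
later sees "bufwait" held while "dirhash" is acquired, the second
oc_acquire() call returns 1. Which order actually came first on my machine
I cannot tell from the output above; the order in this note is only an
example.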

-- 
Ståle Kristoffersen


Re: current + mpt = panic: Bad link elm 0xffffff80002d6480 next->prev != elm

2010-07-15 Thread Ståle Kristoffersen
On 2010-07-15 at 18:00, Marius Strobl wrote:
 On Thu, Jul 15, 2010 at 02:34:23PM +0200, Ståle Kristoffersen wrote:
  Upgraded from stable to current yesterday and very quickly received a
  panic. It did however not dump its core, so I was unable to debug it.
  Today it did panic again, and I took a picture: (Sorry about the bad
  quality)
  
  http://folk.uio.no/stalk/mpt/IMG_1403.JPG
  
  And from the backtrace:
  http://folk.uio.no/stalk/mpt/IMG_1404.JPG
  
  Both times I had the "mpt0: request timed out" message just before the
  panic.
  
  I'm not sure why it's not dumping its core (It was working under stable,
  and I have dumpdev=AUTO and dumpdir=/var/crash in rc.conf)
 
 What revision were you using?

Not sure exactly what revision I was using, is there an easy way to figure
that out? I ran cvsupdate around 13:00 CEST yesterday.

 Does using current as of r209598 make a difference?

Downgrading now...

-- 
Ståle Kristoffersen
staal...@ifi.uio.no