Re: AMI MegaRAID lockup? not accepting commands.
>> Can you try instead the changes that I just committed to -current? I think that the problem shows up when the controller is heavily loaded; your patch will keep the load on the controller down, which may mask the 'real' bug.
>
> Just recently (this evening), I was able to get our controller to lock up with the latest patch. Previously, with that patch installed, I must not have been able to tickle the bug just right, and I believe that Mike based his decision to make that mod on my lack of a lockup, which had always happened quickly. That's what made me think that we'd solved it, but I guess I just got "lucky" on the previous lockups that happened very quickly, making me think it was more easily reproducible than it actually is.

I'm not entirely sure about that; I think there are probably several sets of problems here. Can you be more specific about "locking up" though? The "controller wedged" bug is almost certainly not the same as the "lost interrupt" bug.

> It sounds like Markus may be onto something.

I'm somewhat corralled here today, but I might get some time to apply his suggestions on Monday, especially if you're happy it works for you as well.

-- 
\\ Give a man a fish, and you feed him for a day. \\ Mike Smith
\\ Tell him he should learn how to fish himself,  \\ [EMAIL PROTECTED]
\\ and he'll hate you for a lifetime.             \\ [EMAIL PROTECTED]

To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-current" in the body of the message
Re: AMI MegaRAID lockup? not accepting commands.
Mike Smith wrote:

>> Just recently (this evening), I was able to get our controller to lock up with the latest patch. Previously, with that patch installed, I must not have been able to tickle the bug just right, and I believe that Mike based his decision to make that mod on my lack of a lockup, which had always happened quickly. That's what made me think that we'd solved it, but I guess I just got "lucky" on the previous lockups that happened very quickly, making me think it was more easily reproducible than it actually is.
>
> I'm not entirely sure about that; I think there are probably several sets of problems here. Can you be more specific about "locking up" though? The "controller wedged" bug is almost certainly not the same as the "lost interrupt" bug.

Here's a snippet of the messages from my syslog file:

[...]
Mar 24 12:35:19 cvsstage /kernel: amr0: controller wedged (not taking commands)
Mar 24 12:35:19 cvsstage /kernel: amr0: I/O error - dead
Mar 24 12:35:19 cvsstage /kernel: amr0: cmd 2 ident 17 drive 0
Mar 24 12:35:19 cvsstage /kernel: amr0: blkcount 12 lba 129695792
Mar 24 12:35:19 cvsstage /kernel: amr0: virtaddr 0xd1a5d000 length 6144
Mar 24 12:35:19 cvsstage /kernel: amr0: physaddr 7800 nsg 2
Mar 24 12:35:19 cvsstage /kernel: amr0: 1a11e000/4096
Mar 24 12:35:19 cvsstage /kernel: amr0: 1993f000/2048
Mar 24 12:35:19 cvsstage /kernel: amr0: controller wedged (not taking commands)
Mar 24 12:35:19 cvsstage /kernel: amr0: I/O error - dead
Mar 24 12:35:19 cvsstage /kernel: amr0: cmd 2 ident 17 drive 0
Mar 24 12:35:19 cvsstage /kernel: amr0: blkcount 12 lba 129826864
Mar 24 12:35:19 cvsstage /kernel: amr0: virtaddr 0xd1a4d000 length 6144
Mar 24 12:35:19 cvsstage /kernel: amr0: physaddr 7800 nsg 2
Mar 24 12:35:19 cvsstage /kernel: amr0: 71ce000/4096
Mar 24 12:35:19 cvsstage /kernel: amr0: 402f000/2048
Mar 24 12:35:19 cvsstage /kernel: amr0: controller wedged (not taking commands)
Mar 24 12:35:19 cvsstage /kernel: amr0: I/O error - dead
Mar 24 12:35:19 cvsstage /kernel: amr0: cmd 2 ident 17 drive 0
Mar 24 12:35:19 cvsstage /kernel: amr0: blkcount 12 lba 129630256
Mar 24 12:35:19 cvsstage /kernel: amr0: virtaddr 0xd1a3d000 length 6144
Mar 24 12:35:19 cvsstage /kernel: amr0: physaddr 7800 nsg 2
Mar 24 12:35:19 cvsstage /kernel: amr0: 1befe000/4096
Mar 24 12:35:19 cvsstage /kernel: amr0: 1869f000/2048
[...]

In a separate lockup, there are no messages to syslog, but all accesses to the card hang. A ps shows my 26 bonnie processes in either 'wswbuf0' or 'biord' (going from memory here, I may not have the exact state text correct). This is the one I believe we are calling the "lost interrupt" bug.

I'm running a patched 3/13 on this machine, and I can't readily do a full cvs update on it. I believe that 3/13 was before Poul made his B_READ changes, so I did not incorporate Poul's 1.8 revision of amr.c (because I assume it would be incorrect to do so without getting all of his changes throughout the rest of the kernel). However, I did get all of your changes at 1.9. I also incorporated Markus' patch, except that I set maxio to 253 instead of Markus' 127 or the 254 that the card reports (thinking that there was possibly an off-by-one issue, i.e., 254 available, numbered 0-253). It is this kernel that produced the messages above.

Just for sanity's sake, I'll try Markus' maxio of 127 and verify whether or not my 26 simultaneous bonnie processes can finish without locking it up. I agree that we are probably chasing more than one problem. Also, I don't necessarily think you should back out the "volatile" change; even though it did not fix this problem, I think it should still be there.

>> It sounds like Markus may be onto something.
>
> I'm somewhat corralled here today, but I might get some time to apply his suggestions on Monday, especially if you're happy it works for you as well.

What we're thinking about doing here is this: if scaling back the number of outstanding I/O requests hides or avoids the problem, then we may do that as a temporary fix, especially if we can still get good performance. We need to get this machine into production soon. Ultimately, I'd like to get another card that we can experiment with a bit more so that we can diagnose the real cause, and then be able to run the card at full steam.

I am still able to work on this, though, at least for a few days. One area I thought about spending some time on is where you track whether the card has interrupts enabled and, based on that, either issue commands with the expectation of getting an interrupt back or use polled mode. The next thing I was going to do was review that part of the code, thinking that the software state might have gotten out of sync with reality at some point. Also, I'm open to other suggestions if you think there's a more productive area I should spend time on.

Thanks for your help on
Re: AMI MegaRAID lockup? not accepting commands.
I've played around changing the spinloop to use DELAY (like the Linux model), but this didn't prevent the controller from either "just" locking up or crashing the whole machine with it. Changing various other places in a similar manner (like replacing the bcopy() in amr_quartz_get_work() with code similar to the Linux driver's, which waits for 0xFF to clear) didn't do the trick either.

However, when I forced the driver to not use the full number of concurrent commands as returned by the firmware, I seem to finally have found the one change that made the difference. The Linux code sets a hard limit for AMR_MAXCMD (MAX_COMMANDS in the Linux code) of 127 (my controller, a 466, returned 254), and it says the value can be tweaked between 0 and 253 (not 254...). So forcing sc->amr_maxio down to AMR_MAXCMD when AMR_MAXCMD is smaller, in amr_query_controller(), might cause some performance loss, but it made the code *significantly* more stable than before. I have now done two make worlds on the raid without a single hiccup. Before, I wasn't even able to copy the system over to the raid without sending the system into a reboot.

Possible explanation: people who introduced debugging statements slowed down the feeding of new commands to the controller, so the controller never used up the full set of concurrent commands. The lockup happens when too many concurrent commands are open (now, I haven't tried setting things to 253, I am glad things finally work :-)).

Hope this helps,
Markus

-- 
KPNQwest Switzerland Ltd
P.O. Box 9470, Zweierstrasse 35, CH-8036 Zuerich
Tel: +41-1-298-6030, Fax: +41-1-291-4642
Markus Wild, Manager Engineering, e-mail: [EMAIL PROTECTED]

Index: amr.c
===================================================================
RCS file: /home/ncvs/src/sys/dev/amr/amr.c,v
retrieving revision 1.8
diff -c -r1.8 amr.c
*** amr.c	2000/03/20 10:44:03	1.8
--- amr.c	2000/03/23 19:20:03
***************
*** 699,704 ****
--- 702,712 ----
      }
      sc->amr_maxdrives = 8;
      sc->amr_maxio = ae->ae_adapter.aa_maxio;
+     if (sc->amr_maxio > AMR_MAXCMD) {
+         device_printf(sc->amr_dev, "reducing maxio from %d to %d\n",
+                       sc->amr_maxio, AMR_MAXCMD);
+         sc->amr_maxio = AMR_MAXCMD;
+     }
      for (i = 0; i < ae->ae_ldrv.al_numdrives; i++) {
          sc->amr_drive[i].al_size = ae->ae_ldrv.al_size[i];
          sc->amr_drive[i].al_state = ae->ae_ldrv.al_state[i];
***************
*** 853,859 ****
      ac->ac_private = bp;
      ac->ac_data = bp->b_data;
      ac->ac_length = bp->b_bcount;
!     if (bp->b_iocmd == BIO_READ) {
          ac->ac_flags |= AMR_CMD_DATAIN;
          cmd = AMR_CMD_LREAD;
      } else {
--- 861,868 ----
      ac->ac_private = bp;
      ac->ac_data = bp->b_data;
      ac->ac_length = bp->b_bcount;
!     /*if (bp->b_iocmd == BIO_READ) { */
!     if (bp->b_flags & B_READ) {
          ac->ac_flags |= AMR_CMD_DATAIN;
          cmd = AMR_CMD_LREAD;
      } else {
Index: amrvar.h
===================================================================
RCS file: /home/ncvs/src/sys/dev/amr/amrvar.h,v
retrieving revision 1.2
diff -c -r1.2 amrvar.h
*** amrvar.h	1999/10/26 23:18:57	1.2
--- amrvar.h	2000/03/23 19:20:04
***************
*** 37,43 ****
  #define AMR_CFG_SIG     0xa0
  #define AMR_SIGNATURE   0x3344

! #define AMR_MAXCMD      255     /* ident = 0 not allowed */
  #define AMR_MAXLD       40

  #define AMR_BLKSIZE     512
--- 37,44 ----
  #define AMR_CFG_SIG     0xa0
  #define AMR_SIGNATURE   0x3344

! /*#define AMR_MAXCMD    255*/   /* ident = 0 not allowed */
! #define AMR_MAXCMD      127     /* ident = 0 not allowed */
  #define AMR_MAXLD       40

  #define AMR_BLKSIZE     512
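The clamp added by the patch above can be sketched in isolation as a small userland C function. This is only an illustration of the logic, not driver code: `clamp_maxio` and `LINUX_STYLE_MAXCMD` are hypothetical names, and 127 is simply the cap Markus took from the Linux driver.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical cap, mirroring the AMR_MAXCMD value of 127 used in the patch. */
#define LINUX_STYLE_MAXCMD 127

/*
 * Return the number of concurrent commands the driver should allow:
 * the firmware-reported value, unless it exceeds the cap.
 */
static int
clamp_maxio(int reported, int cap)
{
    assert(cap > 0);
    if (reported > cap) {
        printf("reducing maxio from %d to %d\n", reported, cap);
        return cap;
    }
    return reported;
}
```

With the values from this thread, `clamp_maxio(254, LINUX_STYLE_MAXCMD)` yields 127, while a controller that reports fewer commands than the cap is left alone.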
Re: AMI MegaRAID lockup? not accepting commands.
* [EMAIL PROTECTED] [EMAIL PROTECTED] [000323 12:47] wrote:
> [...]
> However, when I forced the driver to not use the full number of concurrent commands as returned by the firmware, I seem to finally have found the one change that made the difference.
> [...]
> The lockup happens when too many concurrent commands are open (now, I haven't tried setting things to 253, I am glad things finally work:-)).

dude, you rule!

I'm glad this looks like it's finally resolved. Can you let me know if it survives further stress testing?

I've found the easiest way to wedge the box is to perform a 'cvs up' (not cvsup) from a local repository over /usr/src or /usr/ports; this would always lock up my box with amr. If you have the time and disk space, that would be a much better stressor than just make world.

thanks,
-Alfred

> Hope this helps,
> Markus
> [...]

-- 
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]
Re: AMI MegaRAID lockup? not accepting commands.
> I've played around changing the spinloop to using DELAY (like the Linux model), but this didn't prevent the controller from either "just" locking up or crashing the whole machine with it. Changing various other places in a similar manner (like replacing the bcopy() in amr_quartz_get_work() with similar code as in the linux driver to wait for 0xFF to clear) didn't do the trick either.

Can you try instead the changes that I just committed to -current? I think that the problem shows up when the controller is heavily loaded; your patch will keep the load on the controller down, which may mask the 'real' bug.

-- 
Mike Smith
Re: AMI MegaRAID lockup? not accepting commands.
> I've found the easiest way to wedge the box is to perform a 'cvs up' (not cvsup) from a local repository over /usr/src or /usr/ports, this would always lockup my box with amr, if you have the time and disk space that would be a much better stressor than just make world.

I have done a cvs update on the whole root tree. The repository was the default one (certainly not local, but I'd say I have fairly decent connectivity, and there was quite some stress on the drive).

I had also done something that should be comparable. I had the initial system on a disk on my Adaptec controller, and as the first stress test I copied the whole system (~13GB) over to the raid with dump | restore (so /usr/src and /usr/ports were included). And I just unpacked the X11R6.4 sources, which also caused quite a bit of stress. Still running :)

I might add that I've enabled softdeps on all partitions, which could change the usage pattern slightly (I don't know whether for better or worse regarding crash likelihood).

About my source tree: I tried some other changes before, and some of them are still in my sources (and were not in the included diff). Since those changes by themselves didn't make a difference, I didn't include them. However, if someone still gets crashes with just the minimal diffs, I can post the complete diffs so that my sources can be fully reproduced.

Markus

-- 
Markus Wild
Re: AMI MegaRAID lockup? not accepting commands.
> Can you try instead the changes that I just committed to -current? I think that the problem shows up when the controller is heavily loaded; your patch will keep the load on the controller down, which may mask the 'real' bug.

I tried your approach (that was what I described with "fiddling with DELAY"). I even went further to clear that loop, but it didn't help. This is what I currently still have in there from these experiments:

    /* from linux: The "volatile" is due to gcc bugs */
    #define barrier() __asm__ __volatile__("": : :"memory")

    for (i = 1, done = 0, worked = 0; (i > 0) && !done; i--) {
        s = splbio();

        /* is the mailbox free? */
        if (sc->amr_mailbox->mb_busy == 0) {
            debug("got mailbox");
            sc->amr_mailbox64->mb64_segment = 0;
            bcopy(&ac->ac_mailbox, sc->amr_mailbox, AMR_MBOX_CMDSIZE);
            sc->amr_submit_command(sc);
            done = 1;
            sc->amr_workcount++;
            TAILQ_INSERT_TAIL(&sc->amr_work, ac, ac_link);

        /* not free, try to clean up while we wait */
        } else {
            debug("busy flag %x\n", sc->amr_mailbox->mb_busy);
            /* don't do this in here for now, it involves talking to the
             * controller to see whether there's work done, and since we
             * just saw that the controller is somewhat busy, that's perhaps
             * not such a good idea?
             */
            /* worked += amr_done(sc); */
        }
        splx(s);
        DELAY(100);
        barrier();
    }

    /* check here for work to be done */
    s = splbio();
    worked += amr_done(sc);
    splx(s);

This did *NOT* stop the controller from crashing. Ignore the comment above, I'll take this amr_done call back up, but I just wanted to REALLY be sure this loop wasn't the cause for the crash.

Markus

-- 
Markus Wild
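The shape of the loop above (poll a busy flag a bounded number of times, pausing between attempts) can be tried outside the kernel with a simulated mailbox. The sketch below is userland C under stated assumptions: `fake_mailbox`, `fake_controller_tick` and `wait_for_mailbox` are hypothetical names, the "controller" is a counter rather than hardware, and `delay_fn` merely stands in for a call such as DELAY(100).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared mailbox; only the busy flag matters. */
struct fake_mailbox {
    volatile unsigned char mb_busy;  /* would be written by the controller */
    int polls_until_free;            /* simulation only */
};

/* Simulated controller progress: clears mb_busy after a number of polls. */
static void
fake_controller_tick(struct fake_mailbox *mb)
{
    if (mb->polls_until_free > 0 && --mb->polls_until_free == 0)
        mb->mb_busy = 0;
}

/*
 * Poll the busy flag at most max_tries times, calling delay_fn (if any)
 * between attempts. Returns the number of polls used when the mailbox came
 * free, or -1 if it stayed busy for the whole budget (the "wedged" case).
 */
static int
wait_for_mailbox(struct fake_mailbox *mb, int max_tries, void (*delay_fn)(void))
{
    int tries;

    assert(mb != NULL && max_tries > 0);
    for (tries = 1; tries <= max_tries; tries++) {
        if (mb->mb_busy == 0)
            return tries;
        fake_controller_tick(mb);
        if (delay_fn != NULL)
            delay_fn();
    }
    return -1;
}
```

A mailbox that frees up after three polls is acquired on the fourth; one that never frees up exhausts the budget and returns -1, which is the point where a driver would have to report the controller as not taking commands.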
Re: AMI MegaRAID lockup? not accepting commands.
Mike Smith wrote:

>> I've played around changing the spinloop to using DELAY (like the Linux model), but this didn't prevent the controller from either "just" locking up or crashing the whole machine with it. Changing various other places in a similar manner (like replacing the bcopy() in amr_quartz_get_work() with similar code as in the linux driver to wait for 0xFF to clear) didn't do the trick either.
>
> Can you try instead the changes that I just committed to -current? I think that the problem shows up when the controller is heavily loaded; your patch will keep the load on the controller down, which may mask the 'real' bug.

Just recently (this evening), I was able to get our controller to lock up with the latest patch. Previously, with that patch installed, I must not have been able to tickle the bug just right, and I believe that Mike based his decision to make that mod on my lack of a lockup, which had always happened quickly. That's what made me think that we'd solved it, but I guess I just got "lucky" on the previous lockups that happened very quickly, making me think it was more easily reproducible than it actually is.

It sounds like Markus may be onto something.

-Brian

-- 
Brian Dean            [EMAIL PROTECTED]
SAS Institute Inc.    [EMAIL PROTECTED]
Re: AMI MegaRAID lockup? not accepting commands.
At 7:25 PM -0800 2000/3/20, Mike Smith wrote:

> Not that I consider this particularly optimal; busy-waiting for the controller is a terrible waste of the host CPU. A better solution would probably defer the command and try again a short time later, but let's see if this works first.

Since this is a device driver, I guess you can't usleep() and then check again? Is there anything else useful you could be doing during that period of time -- other than busy waiting?

-- 
  These are my opinions -- not to be taken as official Skynet policy
==
Brad Knowles, [EMAIL PROTECTED]               || Belgacom Skynet SA/NV
Systems Architect, Mail/News/FTP/Proxy Admin  || Rue Colonel Bourg, 124
Phone/Fax: +32-2-706.13.11/12.49              || B-1140 Brussels
http://www.skynet.be                          || Belgium
Re: AMI MegaRAID lockup? not accepting commands.
:At 7:25 PM -0800 2000/3/20, Mike Smith wrote:
:
:> Not that I consider this particularly optimal; busy-waiting for the
:> controller is a terrible waste of the host CPU. A better solution would
:> probably defer the command and try again a short time later, but let's
:> see if this works first.
:
: Since this is a device driver, I guess you can't usleep() and
:then check again? Is there anything else useful you could be doing
:during that period of time -- other than busy waiting?

    For situations that aren't in the critical path and don't happen often,
    it may be beneficial to do a voluntary context switch inside the loop.

    -Matt
    Matthew Dillon
    [EMAIL PROTECTED]
Re: AMI MegaRAID lockup? not accepting commands.
> At 7:25 PM -0800 2000/3/20, Mike Smith wrote:
>
>> Not that I consider this particularly optimal; busy-waiting for the controller is a terrible waste of the host CPU. A better solution would probably defer the command and try again a short time later, but let's see if this works first.
>
> Since this is a device driver, I guess you can't usleep() and then check again? Is there anything else useful you could be doing during that period of time -- other than busy waiting?

Well, I call amr_done() to collect completed commands. There's not much other housekeeping that's possible at that point.

-- 
Mike Smith
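The alternative Mike mentions, deferring a command and trying again a short time later instead of busy-waiting, could look roughly like the queue below. This is a userland sketch under stated assumptions, not the amr driver's actual design: `cmd_queue`, `enqueue`, `start_pending` and `fake_try_submit` are hypothetical names, and in a real driver the retry would presumably be triggered from the interrupt handler or a timeout.

```c
#include <assert.h>
#include <stddef.h>

#define QUEUE_MAX 8

/* Hypothetical ring buffer of pending command identifiers. */
struct cmd_queue {
    int ids[QUEUE_MAX];
    size_t head;    /* index of the oldest pending command */
    size_t count;   /* number of pending commands */
};

/* Append a command; returns 0 on success, -1 if the queue is full. */
static int
enqueue(struct cmd_queue *q, int id)
{
    if (q->count == QUEUE_MAX)
        return -1;
    q->ids[(q->head + q->count) % QUEUE_MAX] = id;
    q->count++;
    return 0;
}

/*
 * Submit pending commands in order until the mailbox reports busy.
 * A busy mailbox leaves the command queued for a later call instead of
 * spinning. Returns the number of commands submitted by this call.
 */
static int
start_pending(struct cmd_queue *q, int (*try_submit)(int id))
{
    int submitted = 0;

    while (q->count > 0) {
        if (!try_submit(q->ids[q->head]))
            break;                      /* busy: retry on the next call */
        q->head = (q->head + 1) % QUEUE_MAX;
        q->count--;
        submitted++;
    }
    return submitted;
}

/* Simulation only: the mailbox accepts free_slots commands, then is busy. */
static int free_slots;

static int
fake_try_submit(int id)
{
    (void)id;
    if (free_slots == 0)
        return 0;
    free_slots--;
    return 1;
}
```

The cost of this approach is an extra piece of state (the pending queue) and a dependable event to drive the retry; if the interrupt that should trigger the retry is itself lost, deferred commands would sit in the queue indefinitely.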
Re: AMI MegaRAID lockup? not accepting commands.
:> For situations that aren't in the critical path and don't happen often,
:> it may be beneficial to do a voluntary context switch inside the loop.
:
:Is it possible/legal to do this inside a strategy() routine?

    Yes, though it isn't playing nice if the caller was trying to issue an
    asynchronous I/O.

    -Matt
    Matthew Dillon
    [EMAIL PROTECTED]
Re: AMI MegaRAID lockup? not accepting commands.
> We have a system with a new AMI card in it controlling a pair of shelves from Dell (fbsd dated: 4.0-2313-SNAP). The relevant dmesg output is below: (complete dmesg at end)
>
> amr0: AMI MegaRAID mem 0xf6c0-0xf6ff irq 14 at device 10.1 on pci2
> amr0: firmware 1.01 bios 1p00 128MB memory
> amrd0: MegaRAID logical drive on amr0
> amrd0: 172780MB (353853440 sectors) RAID 5 (optimal)
>
> The adapter does not lockup while testing with bonnie and such.

Try running 20 or so bonnie processes in parallel; I can usually get it to lock up with this configuration. I'm wondering which controller you've got there though - I don't recognise the BIOS/firmware versions.

> However, we have a 50Gig CVS repository sitting on the raid volume. When we do a 'cvs co' of -HEAD, it causes it to lockup. The following messages are repeating continuously:
>
> Mar 19 16:02:59 cvs /kernel: amr0: controller wedged (not taking commands)

I'm not sure why this happens; the controller isn't coming ready even though we haven't hit any sort of limit that we're aware of. I've been considering some workarounds involving deferring the command until the controller gives us back an interrupt, but I'm still surprised that we get to this point at all.

Unfortunately, I'm not able to spend any time on this at the moment; if someone wants to do a little experimenting I'd be very happy to talk them through what I think should be done (will require some programming ability).

-- 
Mike Smith
Re: AMI MegaRAID lockup? not accepting commands.
> The controller is new. Dell calls it a Perc2/dc and it has 128Meg of memory installed in it. I'm not sitting in front of the machine right now. More detailed information is available when the machine is booted and you enter the BIOS setup on the adapter card.

Ok. From some rumours coming out of Dell, I get the impression that this is an Enterprise 1400 or 1500 with only two channels loaded. I guess I need a better way of telling these controllers apart. 8(

> [...]
> Well, we've been playing around in amr.c/amr_start in the following code sequence:
>
>     /* spin waiting for the mailbox */
>     debug("wait for mailbox");
>     for (i = 1, done = 0, worked = 0; (i > 0) && !done; i--) {
>         s = splbio();
>
>         /* is the mailbox free? */
>         if (sc->amr_mailbox->mb_busy == 0) {
>             debug("got mailbox");
>             sc->amr_mailbox64->mb64_segment = 0;
>             bcopy(&ac->ac_mailbox, sc->amr_mailbox, AMR_MBOX_CMDSIZE);
>             sc->amr_submit_command(sc);
>             done = 1;
>             sc->amr_workcount++;
>             TAILQ_INSERT_TAIL(&sc->amr_work, ac, ac_link);
>
>         /* not free, try to clean up while we wait */
>         } else {
> -->         printf("%s: busy flag %x\n", __FUNCTION__, sc->amr_mailbox->mb_busy);
>             debug("busy flag %x\n", sc->amr_mailbox->mb_busy);
>             worked = amr_done(sc);
>         }
>         splx(s);
>     }
>
> Note the addition of the printf statement in the else clause. Two interesting things happen. One, we are unable to cause the controller to lock up. Two, the following messages show up in syslog:
>
> Mar 20 12:55:15 cvsstage /kernel: amr_start: busy flag 1
> Mar 20 12:55:46 cvsstage last message repeated 1057 times
> Mar 20 12:57:47 cvsstage last message repeated 5574 times
> Mar 20 12:59:26 cvsstage last message repeated 5431 times
> Mar 20 12:59:26 cvsstage /kernel: amr_start: busy flag 0
>
> If I understand the sequence correctly, we enter splbio() and then check the mailbox. Most of the time, we take the else clause and the busy flag is 1 as it should be. However, once every 10 to 12 thousand loops, mb_busy is checked as being 1, but by the time we get to the else clause, it's 0.
>
> I wonder if there is some sort of timing issue since the addition of the printf allows the card to operate correctly. I haven't traced the kernel printf code, but it could change the spl level thus allowing the mb_busy flag to be modified.
>
> Comments?

The mb_busy flag is in system memory, but it's maintained by the card itself (it will bus-master and update it according to its internal state). Thus, when you see it printed as 0, somewhere between the test and the printf the controller has updated the flag and indicated it's busy.

You probably only see this quite rarely because the code path from the if() to the printf() is very short (a jump) while the code path the rest of the way 'round is much longer (through printf(), amr_done(), splx(), splbio() etc.).

Adding the printfs massively slows the loop down; you might try increasing the timeout (initial value of 'i') by an order of magnitude instead.

The real problem here is the spinloop - because the flag is in system memory, the loop runs entirely in the cache and thus executes insanely quickly. If it wasn't for the fact that this code is called both with interrupts enabled and disabled, I'd use a much shorter loop and simply defer the command if the controller didn't come ready almost immediately.

Some strategic use of DELAY() might also help. The Linux driver uses the following code:

/*==*/
/* Wait until the
Re: AMI MegaRAID lockup? not accepting commands.
Hi,

The controller is new. Dell calls it a Perc2/dc and it has 128Meg of memory installed in it. I'm not sitting in front of the machine right now. More detailed information is available when the machine is booted and you enter the BIOS setup on the adapter card.

We have a system with a new AMI card in it controlling a pair of shelves from Dell (fbsd dated: 4.0-2313-SNAP). The relevant dmesg output is below: (complete dmesg at end)

amr0: AMI MegaRAID mem 0xf6c0-0xf6ff irq 14 at device 10.1 on pci2
amr0: firmware 1.01 bios 1p00 128MB memory
amrd0: MegaRAID logical drive on amr0
amrd0: 172780MB (353853440 sectors) RAID 5 (optimal)

The adapter does not lock up while testing with bonnie and such.

Try running 20 or so bonnie processes in parallel; I can usually get it to lock up with this configuration. I'm wondering which controller you've got there though - I don't recognise the BIOS/firmware versions.

However, we have a 50Gig CVS repository sitting on the raid volume. When we do a 'cvs co' of -HEAD, it causes it to lock up. The following messages are repeating continuously:

Mar 19 16:02:59 cvs /kernel: amr0: controller wedged (not taking commands)

I'm not sure why this happens; the controller isn't coming ready even though we haven't hit any sort of limit that we're aware of. I've been considering some workarounds involving deferring the command until the controller gives us back an interrupt, but I'm still surprised that we get to this point at all.

Well, we've been playing around in amr.c/amr_start in the following code sequence:

    /* spin waiting for the mailbox */
    debug("wait for mailbox");
    for (i = 1, done = 0, worked = 0; (i > 0) && !done; i--) {
        s = splbio();

        /* is the mailbox free? */
        if (sc->amr_mailbox->mb_busy == 0) {
            debug("got mailbox");
            sc->amr_mailbox64->mb64_segment = 0;
            bcopy(&ac->ac_mailbox, sc->amr_mailbox, AMR_MBOX_CMDSIZE);
            sc->amr_submit_command(sc);
            done = 1;
            sc->amr_workcount++;
            TAILQ_INSERT_TAIL(&sc->amr_work, ac, ac_link);

        /* not free, try to clean up while we wait */
        } else {
-->         printf("%s: busy flag %x\n", __FUNCTION__, sc->amr_mailbox->mb_busy);
            debug("busy flag %x\n", sc->amr_mailbox->mb_busy);
            worked = amr_done(sc);
        }
        splx(s);
    }

Note the addition of the printf statement in the else clause. Two interesting things happen. One, we are unable to cause the controller to lock up. Two, the following messages show up in syslog:

Mar 20 12:55:15 cvsstage /kernel: amr_start: busy flag 1
Mar 20 12:55:46 cvsstage last message repeated 1057 times
Mar 20 12:57:47 cvsstage last message repeated 5574 times
Mar 20 12:59:26 cvsstage last message repeated 5431 times
Mar 20 12:59:26 cvsstage /kernel: amr_start: busy flag 0

If I understand the sequence correctly, we enter splbio() and then check the mailbox. Most of the time, we take the else clause and the busy flag is 1 as it should be. However, once every 10 to 12 thousand loops, mb_busy is checked as being 1, but by the time we get to the else clause, it's 0.

I wonder if there is some sort of timing issue, since the addition of the printf allows the card to operate correctly. I haven't traced the kernel printf code, but it could change the spl level, thus allowing the mb_busy flag to be modified.

Comments?

Unfortunately, I'm not able to spend any time on this at the moment; if someone wants to do a little experimenting I'd be very happy to talk them through what I think should be done (will require some programming ability).

We're more than willing to try. Just point us in the right direction.

-- 
\\ Give a man a fish, and you feed him for a day. \\ Mike Smith
\\ Tell him he should learn how to fish himself,  \\ [EMAIL PROTECTED]
\\ and he'll hate you for a lifetime.             \\ [EMAIL PROTECTED]

-John
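A note on the "busy flag 0" line reported above: the added printf reads mb_busy a second time, after the if test has already read it, so the printed value can differ from the tested one whenever the flag changes in between. The following is only a minimal userspace sketch of reading the flag once and reporting that same value. The struct fake_mailbox and the function mailbox_free() are hypothetical names made up for illustration; they are not the amr driver's real structures or API.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared mailbox; NOT the real amr layout. */
struct fake_mailbox {
    volatile unsigned char mb_busy;     /* written by the "controller" */
};

/*
 * Read the busy flag exactly once.  The value that was tested is handed
 * back through *seen, so any later diagnostic prints what the test saw
 * rather than a fresh (possibly different) read of the flag.
 * Returns 1 if the mailbox was free at the time of the read, else 0.
 */
static int
mailbox_free(const struct fake_mailbox *mb, unsigned char *seen)
{
    unsigned char busy;

    assert(mb != NULL && seen != NULL);
    busy = mb->mb_busy;                 /* the single read */
    *seen = busy;
    return (busy == 0);
}
```

A caller would test the return value and, in the not-free case, print *seen instead of dereferencing mb_busy again.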
Re: AMI MegaRAID lockup? not accepting commands.
A couple of clarifications on the last message:

Thus, when you see it printed as 0, somewhere between the test and the printf the controller has updated the flag and indicated it's busy.

That should of course be "not busy".

I'd be guessing that the current loop (100k iterations) is probably completing far sooner than 1s. You could confirm this by grabbing a timestamp at the beginning of amr_start and then checking again at the point where it bails out.

If that's the case, try cutting the initial value of i down to 10,000 and insert a DELAY(100) in the "did not get mailbox" case.

I didn't use DELAY() initially because I wasn't sure it would work correctly if called before interrupts are enabled. That was probably a stupid mistake; I would try the above suggestion first as I suspect it'll get you going.

Not that I consider this particularly optimal; busy-waiting for the controller is a terrible waste of the host CPU. A better solution would probably defer the command and try again a short time later, but let's see if this works first.

-- 
\\ Give a man a fish, and you feed him for a day. \\ Mike Smith
\\ Tell him he should learn how to fish himself,  \\ [EMAIL PROTECTED]
\\ and he'll hate you for a lifetime.             \\ [EMAIL PROTECTED]
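To make the suggestion above concrete: 10,000 tries with a 100 microsecond delay between them bounds the wait at roughly 10,000 x 100us = 1 second, where the bare spin loop has no fixed duration at all. What follows is only a rough userspace sketch of that shape, not a patch against amr.c. The names fake_mailbox, fake_controller, fake_delay() and wait_for_mailbox() are all hypothetical, and the delay callback merely stands in for the kernel's DELAY().

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the shared mailbox; NOT the real amr layout. */
struct fake_mailbox {
    volatile unsigned char mb_busy;
};

/* Simulated controller: frees the mailbox after a number of delays. */
struct fake_controller {
    struct fake_mailbox mb;
    int  polls_until_free;      /* delays left before mb_busy clears */
    long waited_usec;           /* total simulated delay so far */
};

/* Stand-in for the kernel DELAY(usec); here it only advances the simulation. */
typedef void (*delay_fn)(int usec, void *arg);

static void
fake_delay(int usec, void *arg)
{
    struct fake_controller *fc = arg;

    fc->waited_usec += usec;
    if (--fc->polls_until_free <= 0)
        fc->mb.mb_busy = 0;
}

/*
 * Poll the mailbox at most max_tries times, delaying delay_usec after
 * each busy result.  Returns the number of delays taken before the
 * mailbox came free, or -1 if it never did.
 */
static int
wait_for_mailbox(struct fake_mailbox *mb, int max_tries, int delay_usec,
    delay_fn delay, void *arg)
{
    int i;

    assert(mb != NULL && delay != NULL && max_tries > 0);
    for (i = max_tries; i > 0; i--) {
        if (mb->mb_busy == 0)
            return (max_tries - i);
        delay(delay_usec, arg);         /* the "did not get mailbox" case */
    }
    return (-1);
}
```

In the driver itself the -1 case would correspond to the point where amr_start gives up and reports the controller as wedged; the deferred-retry approach mentioned above would replace the loop entirely rather than tune it.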
Re: AMI MegaRAID lockup? not accepting commands.
On Sun, Mar 19, 2000 at 09:00:35PM -0500, John W. DeBoskey wrote:

amr0: AMI MegaRAID mem 0xf6c0-0xf6ff irq 14 at device 10.1 on pci2
amr0: firmware 1.01 bios 1p00 128MB memory
amrd0: MegaRAID logical drive on amr0
amrd0: 172780MB (353853440 sectors) RAID 5 (optimal)

The adapter does not lock up while testing with bonnie and such. However, we have a 50Gig CVS repository sitting on the raid volume. When we do a 'cvs co' of -HEAD, it causes it to lock up. The following messages are repeating continuously:

Mar 19 16:02:59 cvs /kernel: amr0: controller wedged (not taking commands)
Mar 19 16:03:00 cvs /kernel: amr0: I/O error - dead
Mar 19 16:03:00 cvs /kernel: amr0: cmd 2 ident 178 drive 0
Mar 19 16:03:00 cvs /kernel: amr0: blkcount 12 lba 59506736
Mar 19 16:03:00 cvs /kernel: amr0: virtaddr 0xd3089000 length 6144
Mar 19 16:03:00 cvs /kernel: amr0: physaddr c880 nsg 2
Mar 19 16:03:00 cvs /kernel: amr0: 1abea000/4096
Mar 19 16:03:00 cvs /kernel: amr0: 25d2b000/2048

Ditto, less the kernel messages. Every process wedged into biord and I couldn't reboot the machine or do much of anything (no DDB, g).

Rebooting this resulted in all of the disks being marked 'failed' by the controller, and I had to re-mark them 'online' and boot. A very long night (well, whatever 3am is..) of fscking followed. It was not fun.

amr0: AMI MegaRAID mem 0xf6c0-0xf6ff irq 10 at device 10.1 on pci2
amr0: firmware 3.13 bios 1.43 16MB memory
amrd0: MegaRAID logical drive on amr0
amrd0: 173390MB (355102720 sectors) RAID 5 (optimal)

-- 
Bill Fumerola - Network Architect
Computer Horizons Corp - CVM
e-mail: [EMAIL PROTECTED] / [EMAIL PROTECTED]
Office: 800-252-2421 x128 / Cell: 248-761-7272