Re: AMI MegaRAID lockup? not accepting commands.

2000-03-24 Thread Mike Smith

  Can you try instead the changes that I just committed to -current?  I 
  think that the problem shows up when the controller is heavily loaded; 
  your patch will keep the load on the controller down, which may mask the 
  'real' bug.

 Just recently (this evening), I was able to get our controller to lock
 up with the latest patch.  Previously, with that patch installed, I
 must not have been able to tickle the bug just right, and I believe
 that Mike based his decision to make that mod based on my lack of a
 lockup, which always happened quickly.  That's what made me think that
 we'd solved it, but I guess I just got "lucky" on the previous lockups
 that happened very quickly, making me think it was more easily
 reproducible than it actually is.

I'm not entirely sure about that; I think there are probably several sets 
of problems here.

Can you be more specific about "locking up" though?  The "controller 
wedged" bug is almost certainly not the same as the "lost interrupt" bug.

 It sounds like Markus may be onto something.

I'm somewhat corralled here today, but I might get some time to apply his 
suggestions on Monday, especially if you're happy it works for you as 
well.

-- 
\\ Give a man a fish, and you feed him for a day. \\  Mike Smith
\\ Tell him he should learn how to fish himself,  \\  [EMAIL PROTECTED]
\\ and he'll hate you for a lifetime. \\  [EMAIL PROTECTED]




To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-current" in the body of the message



Re: AMI MegaRAID lockup? not accepting commands.

2000-03-24 Thread Brian Dean

Mike Smith wrote:
  Just recently (this evening), I was able to get our controller to lock
  up with the latest patch.  Previously, with that patch installed, I
  must not have been able to tickle the bug just right, and I believe
  that Mike based his decision to make that mod based on my lack of a
  lockup, which always happened quickly.  That's what made me think that
  we'd solved it, but I guess I just got "lucky" on the previous lockups
  that happened very quickly, making me think it was more easily
  reproducible than it actually is.
 
 I'm not entirely sure about that; I think there are probably several sets 
 of problems here.
 
 Can you be more specific about "locking up" though?  The "controller 
 wedged" bug is almost certainly not the same as the "lost interrupt" bug.

Here's a snippet of the messages from my syslog file:

[...]
Mar 24 12:35:19 cvsstage /kernel: amr0: controller wedged (not taking commands)
Mar 24 12:35:19 cvsstage /kernel: amr0: I/O error - dead
Mar 24 12:35:19 cvsstage /kernel: amr0: cmd 2  ident 17  drive 0
Mar 24 12:35:19 cvsstage /kernel: amr0: blkcount 12  lba 129695792
Mar 24 12:35:19 cvsstage /kernel: amr0: virtaddr 0xd1a5d000  length 6144
Mar 24 12:35:19 cvsstage /kernel: amr0: physaddr 7800  nsg 2
Mar 24 12:35:19 cvsstage /kernel: amr0:   1a11e000/4096
Mar 24 12:35:19 cvsstage /kernel: amr0:   1993f000/2048
Mar 24 12:35:19 cvsstage /kernel: amr0: controller wedged (not taking commands)
Mar 24 12:35:19 cvsstage /kernel: amr0: I/O error - dead
Mar 24 12:35:19 cvsstage /kernel: amr0: cmd 2  ident 17  drive 0
Mar 24 12:35:19 cvsstage /kernel: amr0: blkcount 12  lba 129826864
Mar 24 12:35:19 cvsstage /kernel: amr0: virtaddr 0xd1a4d000  length 6144
Mar 24 12:35:19 cvsstage /kernel: amr0: physaddr 7800  nsg 2
Mar 24 12:35:19 cvsstage /kernel: amr0:   71ce000/4096
Mar 24 12:35:19 cvsstage /kernel: amr0:   402f000/2048
Mar 24 12:35:19 cvsstage /kernel: amr0: controller wedged (not taking commands)
Mar 24 12:35:19 cvsstage /kernel: amr0: I/O error - dead
Mar 24 12:35:19 cvsstage /kernel: amr0: cmd 2  ident 17  drive 0
Mar 24 12:35:19 cvsstage /kernel: amr0: blkcount 12  lba 129630256
Mar 24 12:35:19 cvsstage /kernel: amr0: virtaddr 0xd1a3d000  length 6144
Mar 24 12:35:19 cvsstage /kernel: amr0: physaddr 7800  nsg 2
Mar 24 12:35:19 cvsstage /kernel: amr0:   1befe000/4096
Mar 24 12:35:19 cvsstage /kernel: amr0:   1869f000/2048
[...]

In a separate lockup, nothing is logged to syslog, but all
accesses to the card hang.  A ps shows my 26 bonnie processes are
either in 'wswbuf0' or 'biord' (going from memory here, I may not
have the exact state text correct).  This is the one I believe we are
calling the "lost interrupt" bug.

I'm running a patched 3/13 on this machine, and I can't readily do a
full cvs update on it.  I believe that 3/13 was before Poul made his
B_READ changes, so I did not incorporate Poul's 1.8 revision for amr.c
(because I assume it would be incorrect to do so without getting all of
his changes throughout the rest of the kernel).  However, I did get
all of your changes at 1.9.

I also incorporated Markus' patch, with the exception that I set maxio
to 253 instead of 127 or 254 like the card reports (thinking possibly
that there was an off by one issue, i.e., 254 available, 0-253).  It
is this kernel that produced the messages above.  Just for sanity's
sake, I'll try Markus' maxio of 127 and verify whether or not my 26
simultaneous bonnie processes can finish without locking it up.

I agree that we are probably chasing more than one problem.  Also, I
don't necessarily think you should back out the "volatile" change;
even though it did not fix this problem, I think it should still be
there.

  It sounds like Markus may be onto something.
 
 I'm somewhat corralled here today, but I might get some time to apply his 
 suggestions on Monday, especially if you're happy it works for you as 
 well.

What we're thinking about doing here is that if scaling back the
number of outstanding io requests hides/avoids the problem, then we
may do that here as a temporary fix, especially if we can still get
good performance.  We have the need to get this machine into
production soon.  Ultimately, I'd like to get another card that we can
play with and experiment with a bit more so that we can diagnose the
real cause, and then be able to run the card at full steam.

I am still able to work on this, though, at least for a few days.  One
area I thought about spending some time was where you maintain whether
the card has interrupts enabled or not and based on this info, you
issue commands with the expectation of getting an interrupt back or
use polled mode.  The next thing I was going to check was to review
that part of the code thinking maybe that the software state might
possibly have gotten out of sync with reality at some point.  Also,
I'm open to other suggestions if you think there's a more productive
area I should spend time on.

Thanks for your help on 

Re: AMI MegaRAID lockup? not accepting commands.

2000-03-23 Thread mw

I've played around changing the spinloop to using DELAY (like the Linux model),
but this didn't prevent the controller from either "just" locking up or 
crashing the whole machine with it. Changing various other places in a similar
manner (like replacing the bcopy() in amr_quartz_get_work() with similar
code as in the linux driver to wait for 0xFF to clear) didn't do the trick
either. 

However, when I forced the driver to not use the full number of
concurrent commands as returned by the firmware, I seem to finally have
found the one change that made the difference. The Linux code sets a
hard limit of 127 (AMR_MAXCMD here, MAX_COMMANDS in the Linux code); my
controller, a 466, returned 254, and the Linux code says the value can be
tweaked between 0 and 253, not 254. So, forcing sc->amr_maxio down to
AMR_MAXCMD when the firmware reports more, in amr_query_controller(), might
cost some performance, but it made the code *significantly* more stable than
before. I did two make worlds on the raid now, and not one hiccup. Before, I
wasn't even able to copy the system over to the raid without sending the
machine into a reboot.

Possible explanation: people that introduced debugging statements slowed down
the feeding of new commands to the controller, so the controller didn't ever
use up the full set of concurrent commands. The lockup happens when too many
concurrent commands are open (now, I haven't tried setting things to 253, I
am glad things finally work:-)).
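
The idea of capping outstanding commands can be illustrated outside the driver. This is only a minimal sketch: the `cmd_gate` structure and the `gate_*` helpers are hypothetical names, not part of amr.c, and the cap of 127 simply mirrors the AMR_MAXCMD value used in the patch. It shows new submissions being refused once a fixed number of commands are in flight, whatever limit the firmware reports.

```c
#include <assert.h>

#define GATE_MAXCMD 127  /* hard cap, mirroring the AMR_MAXCMD value in the patch */

/* Hypothetical bookkeeping; this is not the real amr softc. */
struct cmd_gate {
    int maxio;      /* effective limit on outstanding commands */
    int inflight;   /* commands submitted but not yet completed */
};

/* Clamp the firmware-reported limit to the hard cap. */
static void gate_init(struct cmd_gate *g, int reported_maxio)
{
    g->maxio = reported_maxio > GATE_MAXCMD ? GATE_MAXCMD : reported_maxio;
    g->inflight = 0;
}

/* Returns 1 if the command may be submitted, 0 if it must be deferred. */
static int gate_try_submit(struct cmd_gate *g)
{
    if (g->inflight >= g->maxio)
        return 0;
    g->inflight++;
    return 1;
}

/* Called once per completed command. */
static void gate_complete(struct cmd_gate *g)
{
    assert(g->inflight > 0);
    g->inflight--;
}
```

A caller would check `gate_try_submit()` before touching the mailbox and call `gate_complete()` from the completion path, so the controller never sees more than the capped number of concurrent commands.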

Hope this helps,
Markus
-- 
KPNQwest Switzerland Ltd
P.O. Box 9470, Zweierstrasse 35, CH-8036 Zuerich
Tel: +41-1-298-6030, Fax: +41-1-291-4642
Markus Wild, Manager Engineering, e-mail: [EMAIL PROTECTED]


Index: amr.c
===================================================================
RCS file: /home/ncvs/src/sys/dev/amr/amr.c,v
retrieving revision 1.8
diff -c -r1.8 amr.c
*** amr.c   2000/03/20 10:44:03 1.8
--- amr.c   2000/03/23 19:20:03
***************
*** 699,704 ****
--- 702,712 ----
      }
      sc->amr_maxdrives = 8;
      sc->amr_maxio = ae->ae_adapter.aa_maxio;
+     if (sc->amr_maxio > AMR_MAXCMD) {
+         device_printf(sc->amr_dev, "reducing maxio from %d to %d\n",
+                       sc->amr_maxio, AMR_MAXCMD);
+         sc->amr_maxio = AMR_MAXCMD;
+     }
      for (i = 0; i < ae->ae_ldrv.al_numdrives; i++) {
          sc->amr_drive[i].al_size = ae->ae_ldrv.al_size[i];
          sc->amr_drive[i].al_state = ae->ae_ldrv.al_state[i];
***************
*** 853,859 ****
      ac->ac_private = bp;
      ac->ac_data = bp->b_data;
      ac->ac_length = bp->b_bcount;
!     if (bp->b_iocmd == BIO_READ) {
          ac->ac_flags |= AMR_CMD_DATAIN;
          cmd = AMR_CMD_LREAD;
      } else {
--- 861,868 ----
      ac->ac_private = bp;
      ac->ac_data = bp->b_data;
      ac->ac_length = bp->b_bcount;
! /*  if (bp->b_iocmd == BIO_READ) { */
!     if (bp->b_flags & B_READ) {
          ac->ac_flags |= AMR_CMD_DATAIN;
          cmd = AMR_CMD_LREAD;
      } else {
Index: amrvar.h
===================================================================
RCS file: /home/ncvs/src/sys/dev/amr/amrvar.h,v
retrieving revision 1.2
diff -c -r1.2 amrvar.h
*** amrvar.h    1999/10/26 23:18:57 1.2
--- amrvar.h    2000/03/23 19:20:04
***************
*** 37,43 ****
  #define AMR_CFG_SIG   0xa0
  #define AMR_SIGNATURE 0x3344
  
! #define AMR_MAXCMD    255 /* ident = 0 not allowed */
  #define AMR_MAXLD     40
  
  #define AMR_BLKSIZE   512
--- 37,44 ----
  #define AMR_CFG_SIG   0xa0
  #define AMR_SIGNATURE 0x3344
  
! /*#define AMR_MAXCMD  255*/   /* ident = 0 not allowed */
! #define AMR_MAXCMD    127 /* ident = 0 not allowed */
  #define AMR_MAXLD     40
  
  #define AMR_BLKSIZE   512



Re: AMI MegaRAID lockup? not accepting commands.

2000-03-23 Thread Alfred Perlstein

* [EMAIL PROTECTED] [EMAIL PROTECTED] [000323 12:47] wrote:
 I've played around changing the spinloop to using DELAY (like the Linux model),
 but this didn't prevent the controller from either "just" locking up or 
 crashing the whole machine with it. Changing various other places in a similar
 manner (like replacing the bcopy() in amr_quartz_get_work() with similar
 code as in the linux driver to wait for 0xFF to clear) didn't do the trick
 either. 
 
 However, when I forced the driver to not use the full number of
 concurrent commands as returned by the firmware, I seem to finally have
 found the one change that made the difference. The Linux code sets a
 hard limit of 127 (AMR_MAXCMD here, MAX_COMMANDS in the Linux code); my
 controller, a 466, returned 254, and the Linux code says the value can be
 tweaked between 0 and 253, not 254. So, forcing sc->amr_maxio down to
 AMR_MAXCMD when the firmware reports more, in amr_query_controller(), might
 cost some performance, but it made the code *significantly* more stable than
 before. I did two make worlds on the raid now, and not one hiccup. Before, I
 wasn't even able to copy the system over to the raid without sending the
 machine into a reboot.
 
 Possible explanation: people that introduced debugging statements slowed down
 the feeding of new commands to the controller, so the controller didn't ever
 use up the full set of concurrent commands. The lockup happens when too many
 concurrent commands are open (now, I haven't tried setting things to 253, I
 am glad things finally work:-)).

dude, you rule!  I'm glad this looks like it's finally resolved, can
you let me know if it survives further stress testing?

I've found the easiest way to wedge the box is to perform a 'cvs up'
(not cvsup) from a local repository over /usr/src or /usr/ports, this
would always lockup my box with amr, if you have the time and disk
space that would be a much better stressor than just make world.

thanks,
-Alfred

 


-- 
-Alfred Perlstein - [[EMAIL PROTECTED]|[EMAIL PROTECTED]]





Re: AMI MegaRAID lockup? not accepting commands.

2000-03-23 Thread Mike Smith

 I've played around changing the spinloop to using DELAY (like the Linux model),
 but this didn't prevent the controller from either "just" locking up or 
 crashing the whole machine with it. Changing various other places in a similar
 manner (like replacing the bcopy() in amr_quartz_get_work() with similar
 code as in the linux driver to wait for 0xFF to clear) didn't do the trick
 either. 

Can you try instead the changes that I just committed to -current?  I 
think that the problem shows up when the controller is heavily loaded; 
your patch will keep the load on the controller down, which may mask the 
'real' bug.

-- 
\\ Give a man a fish, and you feed him for a day. \\  Mike Smith
\\ Tell him he should learn how to fish himself,  \\  [EMAIL PROTECTED]
\\ and he'll hate you for a lifetime. \\  [EMAIL PROTECTED]







Re: AMI MegaRAID lockup? not accepting commands.

2000-03-23 Thread mw

 I've found the easiest way to wedge the box is to perform a 'cvs up'
 (not cvsup) from a local repository over /usr/src or /usr/ports, this
 would always lockup my box with amr, if you have the time and disk
 space that would be a much better stressor than just make world.

I have done a cvs update on the whole root tree, well, the repository was 
the default (certainly not local, but I'd say I have fairly decent 
connectivity, and there was quite some stress on the drive).

I had also done something that should be comparable. I had the initial system
on a disk on my Adaptec controller, and as the first stress test I copied over
the whole system (~13GB) with dump | restore to the raid (so, /usr/src and
/usr/ports were included there). And, I just unpacked X11R6.4 sources, which
also caused quite a bit of stress. Still running :)

I should add that I've enabled softdeps on all partitions; this could change
the usage pattern slightly (I don't know whether for better or worse regarding
crash likelihood).

About my source tree: I tried some other changes before, and some of them are 
still in my sources (and were not in the included diff). Since those changes 
by themselves didn't make a difference, I didn't include them. However, if 
someone should still get crashes with just the minimal diffs, I can include 
the complete diffs to fully reproduce my sources.

Markus
-- 
KPNQwest Switzerland Ltd
P.O. Box 9470, Zweierstrasse 35, CH-8036 Zuerich
Tel: +41-1-298-6030, Fax: +41-1-291-4642
Markus Wild, Manager Engineering, e-mail: [EMAIL PROTECTED]





Re: AMI MegaRAID lockup? not accepting commands.

2000-03-23 Thread mw

 Can you try instead the changes that I just committed to -current?  I 
 think that the problem shows up when the controller is heavily loaded; 
 your patch will keep the load on the controller down, which may mask the 
 'real' bug.

I tried your approach (that was what I described with "fiddling with DELAY").
I even went further to clear that loop, but it didn't help. This is 
what I currently still have in there from these experiments:

/* from linux: The "volatile" is due to gcc bugs */
#define barrier() __asm__ __volatile__("": : :"memory")

    for (i = 1, done = 0, worked = 0; (i > 0) && !done; i--) {
        s = splbio();

        /* is the mailbox free? */
        if (sc->amr_mailbox->mb_busy == 0) {
            debug("got mailbox");
            sc->amr_mailbox64->mb64_segment = 0;
            bcopy(&ac->ac_mailbox, sc->amr_mailbox, AMR_MBOX_CMDSIZE);
            sc->amr_submit_command(sc);
            done = 1;
            sc->amr_workcount++;
            TAILQ_INSERT_TAIL(&sc->amr_work, ac, ac_link);

        /* not free, try to clean up while we wait */
        } else {
            debug("busy flag %x\n", sc->amr_mailbox->mb_busy);
            /* don't do this in here for now, it involves talking to the
             * controller to see whether there's work done, and since we
             * just saw that the controller is somewhat busy, that's perhaps
             * not such a good idea? */
            /* worked += amr_done(sc); */
        }
        splx(s);

        DELAY(100);
        barrier();
    }

    /* check here for work to be done */
    s = splbio();
    worked += amr_done(sc);
    splx(s);


This did *NOT* stop the controller from crashing. Ignore the comment above,
I'll take this amr_done call back up, but I just wanted to REALLY be sure
this loop wasn't the cause for the crash.

Markus
-- 
KPNQwest Switzerland Ltd
P.O. Box 9470, Zweierstrasse 35, CH-8036 Zuerich
Tel: +41-1-298-6030, Fax: +41-1-291-4642
Markus Wild, Manager Engineering, e-mail: [EMAIL PROTECTED]






Re: AMI MegaRAID lockup? not accepting commands.

2000-03-23 Thread Brian Dean

Mike Smith wrote:
  I've played around changing the spinloop to using DELAY (like the Linux model),
  but this didn't prevent the controller from either "just" locking up or 
  crashing the whole machine with it. Changing various other places in a similar
  manner (like replacing the bcopy() in amr_quartz_get_work() with similar
  code as in the linux driver to wait for 0xFF to clear) didn't do the trick
  either. 
 
 Can you try instead the changes that I just committed to -current?  I 
 think that the problem shows up when the controller is heavily loaded; 
 your patch will keep the load on the controller down, which may mask the 
 'real' bug.

Just recently (this evening), I was able to get our controller to lock
up with the latest patch.  Previously, with that patch installed, I
must not have been able to tickle the bug just right, and I believe
that Mike based his decision to make that mod based on my lack of a
lockup, which always happened quickly.  That's what made me think that
we'd solved it, but I guess I just got "lucky" on the previous lockups
that happened very quickly, making me think it was more easily
reproducible than it actually is.

It sounds like Markus may be onto something.

-Brian
-- 
Brian Dean  [EMAIL PROTECTED]
SAS Institute Inc.  [EMAIL PROTECTED]





Re: AMI MegaRAID lockup? not accepting commands.

2000-03-21 Thread Brad Knowles

At 7:25 PM -0800 2000/3/20, Mike Smith wrote:

  Not that I consider this particularly optimal; busy-waiting for the
  controller is a terrible waste of the host CPU.  A better solution would
  probably defer the command and try again a short time later, but let's
  see if this works first.

Since this is a device driver, I guess you can't usleep() and 
then check again?  Is there anything else useful you could be doing 
during that period of time -- other than busy waiting?

--
   These are my opinions -- not to be taken as official Skynet policy
==
Brad Knowles, [EMAIL PROTECTED]|| Belgacom Skynet SA/NV
Systems Architect, Mail/News/FTP/Proxy Admin || Rue Colonel Bourg, 124
Phone/Fax: +32-2-706.13.11/12.49 || B-1140 Brussels
http://www.skynet.be || Belgium





Re: AMI MegaRAID lockup? not accepting commands.

2000-03-21 Thread Matthew Dillon

:At 7:25 PM -0800 2000/3/20, Mike Smith wrote:
:
:  Not that I consider this particularly optimal; busy-waiting for the
:  controller is a terrible waste of the host CPU.  A better solution would
:  probably defer the command and try again a short time later, but let's
:  see if this works first.
:
:   Since this is a device driver, I guess you can't usleep() and 
:then check again?  Is there anything else useful you could be doing 
:during that period of time -- other than busy waiting?
:
:--
:   These are my opinions -- not to be taken as official Skynet policy
:==
:Brad Knowles, [EMAIL PROTECTED]|| Belgacom Skynet SA/NV
:Systems Architect, Mail/News/FTP/Proxy Admin || Rue Colonel Bourg, 124

For situations that aren't in the critical path and don't happen often,
it may be beneficial to do a voluntary context switch inside the loop.
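
As a rough userspace analogy only (a kernel driver would use its own scheduling primitives, which are not shown here), the sketch below polls a busy flag and gives up the CPU between checks. The function name `poll_with_yield` is hypothetical; `sched_yield()` is the POSIX call standing in for a voluntary context switch.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Poll a busy flag, yielding the CPU between checks.
 * Returns the number of yields performed before the flag was seen clear,
 * or -1 if it was still set after max_tries checks. */
static int poll_with_yield(const volatile unsigned char *busy, int max_tries)
{
    assert(busy != NULL);
    for (int tries = 0; tries < max_tries; tries++) {
        if (*busy == 0)
            return tries;
        sched_yield();  /* userspace stand-in for a voluntary context switch */
    }
    return -1;
}
```

The bounded try count keeps the caller from waiting forever if the flag never clears, at which point it can report the device as wedged or defer the work.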

-Matt
Matthew Dillon 
[EMAIL PROTECTED]





Re: AMI MegaRAID lockup? not accepting commands.

2000-03-21 Thread Mike Smith

 At 7:25 PM -0800 2000/3/20, Mike Smith wrote:
 
   Not that I consider this particularly optimal; busy-waiting for the
   controller is a terrible waste of the host CPU.  A better solution would
   probably defer the command and try again a short time later, but let's
   see if this works first.
 
   Since this is a device driver, I guess you can't usleep() and 
 then check again?  Is there anything else useful you could be doing 
 during that period of time -- other than busy waiting?

Well, I call amr_done() to collect completed commands.  There's not much 
other housekeeping that's possible at that point.

-- 
\\ Give a man a fish, and you feed him for a day. \\  Mike Smith
\\ Tell him he should learn how to fish himself,  \\  [EMAIL PROTECTED]
\\ and he'll hate you for a lifetime. \\  [EMAIL PROTECTED]







Re: AMI MegaRAID lockup? not accepting commands.

2000-03-21 Thread Matthew Dillon


: For situations that aren't in the critical path and don't happen often,
: it may be beneficial to do a voluntary context switch inside the loop.
:
:Is it possible/legal to do this inside a strategy() routine?

Yes, though it isn't playing nice if the caller was trying to issue
an asynchronous I/O.

-Matt
Matthew Dillon 
[EMAIL PROTECTED]





Re: AMI MegaRAID lockup? not accepting commands.

2000-03-20 Thread Mike Smith

We have a system with a new AMI card in it controlling a pair
 of shelves from Dell (fbsd dated: 4.0-2313-SNAP).
 
The relevant dmesg output is below: (complete dmesg at end)
 
 amr0: AMI MegaRAID mem 0xf6c0-0xf6ff irq 14 at device 10.1 on pci2
 amr0: firmware 1.01 bios 1p00  128MB memory
 amrd0: MegaRAID logical drive on amr0
 amrd0: 172780MB (353853440 sectors) RAID 5 (optimal)
 
The adapter does not lockup while testing with bonnie and such.

Try running 20 or so bonnie processes in parallel; I can usually get it 
to lock up with this configuration.  I'm wondering which controller 
you've got there though - I don't recognise the BIOS/firmware versions.

 However, we have a 50Gig CVS repository sitting on the raid
 volume. When we do a 'cvs co' of -HEAD, it causes it to lockup.
 The following messages are repeating continuously:
 
 Mar 19 16:02:59 cvs /kernel: amr0: controller wedged (not taking commands)

I'm not sure why this happens; the controller isn't coming ready even 
though we haven't hit any sort of limit that we're aware of.  I've been 
considering some workarounds involving deferring the command until the 
controller gives us back an interrupt, but I'm still surprised that we 
get to this point at all.
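
The deferral workaround mentioned above might look roughly like the following. This is a minimal sketch under stated assumptions, not the driver's code: `struct ctrl` and the `ctrl_*` functions are hypothetical, and it assumes for simplicity that every interrupt means the mailbox is free again. Commands that find the mailbox busy are queued and resubmitted from the interrupt path instead of spinning.

```c
#include <assert.h>

#define DEFER_SLOTS 8   /* arbitrary queue size for the sketch */

/* Hypothetical controller model; none of these names come from amr.c. */
struct ctrl {
    int busy;                    /* nonzero while the mailbox is occupied */
    int deferred[DEFER_SLOTS];   /* FIFO ring of command ids awaiting the mailbox */
    int head, count;
    int last_submitted;          /* id most recently handed to the hardware */
};

/* Stand-in for writing the mailbox and ringing the doorbell. */
static void ctrl_submit_hw(struct ctrl *c, int id)
{
    assert(!c->busy);
    c->busy = 1;
    c->last_submitted = id;
}

/* Submit now if the mailbox is free and nothing is queued ahead of us,
 * otherwise queue the command.
 * Returns 1 if submitted, 0 if deferred, -1 if the queue is full. */
static int ctrl_start(struct ctrl *c, int id)
{
    if (!c->busy && c->count == 0) {
        ctrl_submit_hw(c, id);
        return 1;
    }
    if (c->count == DEFER_SLOTS)
        return -1;
    c->deferred[(c->head + c->count) % DEFER_SLOTS] = id;
    c->count++;
    return 0;
}

/* Interrupt path: the mailbox is assumed free again, so hand it the
 * oldest deferred command rather than spinning in the submit path. */
static void ctrl_intr(struct ctrl *c)
{
    c->busy = 0;
    if (c->count > 0) {
        int id = c->deferred[c->head];
        c->head = (c->head + 1) % DEFER_SLOTS;
        c->count--;
        ctrl_submit_hw(c, id);
    }
}
```

Queuing behind already-deferred commands (the `c->count == 0` check) preserves submission order, which matters if later commands depend on earlier ones.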

Unfortunately, I'm not able to spend any time on this at the moment; if 
someone wants to do a little experimenting I'd be very happy to talk them 
through what I think should be done (will require some programming 
ability).


-- 
\\ Give a man a fish, and you feed him for a day. \\  Mike Smith
\\ Tell him he should learn how to fish himself,  \\  [EMAIL PROTECTED]
\\ and he'll hate you for a lifetime. \\  [EMAIL PROTECTED]







Re: AMI MegaRAID lockup? not accepting commands.

2000-03-20 Thread Mike Smith

The controller is new. Dell calls it a Perc2/dc and it has 128Meg
 of memory installed in it. I'm not sitting in front of the
 machine right now. More detailed information is available
 when the machine is booted and you enter the bios setup
 on the adapter card.

Ok.  From some rumours coming out of Dell, I get the impression that this 
is an Enterprise 1400 or 1500 with only two channels loaded.  I guess I 
need a better way of telling these controllers apart. 8(

  We have a system with a new AMI card in it controlling a pair
   of shelves from Dell (fbsd dated: 4.0-2313-SNAP).
   
  The relevant dmesg output is below: (complete dmesg at end)
   
   amr0: AMI MegaRAID mem 0xf6c0-0xf6ff irq 14 at device 10.1 on pci2
   amr0: firmware 1.01 bios 1p00  128MB memory
   amrd0: MegaRAID logical drive on amr0
   amrd0: 172780MB (353853440 sectors) RAID 5 (optimal)
   
  The adapter does not lockup while testing with bonnie and such.
  
  Try running 20 or so bonnie processes in parallel; I can usually get it 
  to lock up with this configuration.  I'm wondering which controller 
  you've got there though - I don't recognise the BIOS/firmware versions.
  
   However, we have a 50Gig CVS repository sitting on the raid
   volume. When we do a 'cvs co' of -HEAD, it causes it to lockup.
   The following messages are repeating continuously:
   
   Mar 19 16:02:59 cvs /kernel: amr0: controller wedged (not taking commands)
  
  I'm not sure why this happens; the controller isn't coming ready even 
  though we haven't hit any sort of limit that we're aware of.  I've been 
  considering some workarounds involving deferring the command until the 
  controller gives us back an interrupt, but I'm still surprised that we 
  get to this point at all.
 
Well, we've been playing around in amr.c/amr_start in the following
 code sequence:
 
     /* spin waiting for the mailbox */
     debug("wait for mailbox");
     for (i = 1, done = 0, worked = 0; (i > 0) && !done; i--) {
         s = splbio();

         /* is the mailbox free? */
         if (sc->amr_mailbox->mb_busy == 0) {
             debug("got mailbox");
             sc->amr_mailbox64->mb64_segment = 0;
             bcopy(&ac->ac_mailbox, sc->amr_mailbox, AMR_MBOX_CMDSIZE);
             sc->amr_submit_command(sc);
             done = 1;
             sc->amr_workcount++;
             TAILQ_INSERT_TAIL(&sc->amr_work, ac, ac_link);

         /* not free, try to clean up while we wait */
         } else {
 -->         printf("%s: busy flag %x\n", __FUNCTION__, sc->amr_mailbox->mb_busy);
             debug("busy flag %x\n", sc->amr_mailbox->mb_busy);
             worked = amr_done(sc);
         }
         splx(s);
     }
 
 
 
 
Note the addition of the printf statement in the else clause. Two
 interesting things happen. One, we are unable to cause the controller
 to lock up. Two, the following messages showup in syslog:
 
 Mar 20 12:55:15 cvsstage /kernel: amr_start: busy flag 1
 Mar 20 12:55:46 cvsstage last message repeated 1057 times
 Mar 20 12:57:47 cvsstage last message repeated 5574 times
 Mar 20 12:59:26 cvsstage last message repeated 5431 times
 Mar 20 12:59:26 cvsstage /kernel: amr_start: busy flag 0
 
If I understand the sequence correctly, we enter splbio() and
 then check the mailbox. Most of the time, we take the else clause
 and the busy flag is 1 as it should be. However, once every 10 to 12
 thousand loops, mb_busy is checked as being 1, but by the time we
 get to the else clause, it's 0.
 
I wonder if there is some sort of timing issue since the
 addition of the printf allows the card to operate correctly. I
 haven't traced the kernel printf code, but it could change the
 spl level thus allowing the mb_busy flag to be modified.
 
Comments?

The mb_busy flag is in system memory, but it's maintained by the card 
itself (it will bus-master and update it according to its internal state).
Thus, when you see it printed as 0, somewhere between the test and the 
printf the controller has updated the flag to indicate it's no longer busy.

You probably only see this quite rarely because the code path from the 
if() to the printf() is very short (a jump) while the code path the rest of
the way 'round is much longer (through printf(), amr_done(), splx(),
splbio() etc.).
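
One way to keep the test and the logged value from disagreeing is to read the flag once and use that single value for both. The sketch below is illustrative only; `struct busy_sample` and `sample_busy` are hypothetical names, and a plain `volatile` byte stands in for memory that the controller rewrites behind the CPU's back.

```c
#include <assert.h>
#include <stddef.h>

/* Result of one look at the flag: the value seen and what it meant. */
struct busy_sample {
    unsigned char seen;  /* value read from the flag */
    int was_free;        /* 1 if that value meant "mailbox free" */
};

/* Read the device-owned flag exactly once and derive everything from
 * that one read, so the decision and the logged value cannot differ. */
static struct busy_sample sample_busy(const volatile unsigned char *flag)
{
    struct busy_sample s;

    assert(flag != NULL);
    s.seen = *flag;              /* single read of memory the device may rewrite */
    s.was_free = (s.seen == 0);
    return s;
}
```

With this pattern a debug message prints `s.seen`, so it can never show 0 on the "busy" branch, even if the controller clears the real flag a moment later.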

Adding the printfs massively slows the loop down; you might try 
increasing the timeout (initial value of 'i') by an order of magnitude 
instead.  The real problem here is the spinloop - because the flag is in 
system memory, the loop runs entirely in the cache and thus executes 
insanely quickly.  If it wasn't for the fact that this code is called 
both with interrupts enabled and disabled, I'd use a much shorter loop 
and simply defer the command if the controller didn't come ready almost 
immediately.  Some strategic use of DELAY() might also help.  The Linux 
driver uses the following code:

/*==*/
/* Wait until the 

Re: AMI MegaRAID lockup? not accepting commands.

2000-03-20 Thread John W. DeBoskey

Hi,

   The controller is new. Dell calls it a Perc2/dc and it has 128Meg
of memory installed in it. I'm not sitting in front of the
machine right now. More detailed information is available
when the machine is booted and you enter the bios setup
on the adapter card.

 We have a system with a new AMI card in it controlling a pair
  of shelves from Dell (fbsd dated: 4.0-2313-SNAP).
  
 The relevant dmesg output is below: (complete dmesg at end)
  
  amr0: AMI MegaRAID mem 0xf6c0-0xf6ff irq 14 at device 10.1 on pci2
  amr0: firmware 1.01 bios 1p00  128MB memory
  amrd0: MegaRAID logical drive on amr0
  amrd0: 172780MB (353853440 sectors) RAID 5 (optimal)
  
 The adapter does not lockup while testing with bonnie and such.
 
 Try running 20 or so bonnie processes in parallel; I can usually get it 
 to lock up with this configuration.  I'm wondering which controller 
 you've got there though - I don't recognise the BIOS/firmware versions.
 
  However, we have a 50Gig CVS repository sitting on the raid
  volume. When we do a 'cvs co' of -HEAD, it causes it to lockup.
  The following messages are repeating continuously:
  
  Mar 19 16:02:59 cvs /kernel: amr0: controller wedged (not taking commands)
 
 I'm not sure why this happens; the controller isn't coming ready even 
 though we haven't hit any sort of limit that we're aware of.  I've been 
 considering some workarounds involving deferring the command until the 
 controller gives us back an interrupt, but I'm still surprised that we 
 get to this point at all.

   Well, we've been playing around in amr.c/amr_start in the following
code sequence:

    /* spin waiting for the mailbox */
    debug("wait for mailbox");
    for (i = 1, done = 0, worked = 0; (i > 0) && !done; i--) {
        s = splbio();

        /* is the mailbox free? */
        if (sc->amr_mailbox->mb_busy == 0) {
            debug("got mailbox");
            sc->amr_mailbox64->mb64_segment = 0;
            bcopy(&ac->ac_mailbox, sc->amr_mailbox, AMR_MBOX_CMDSIZE);
            sc->amr_submit_command(sc);
            done = 1;
            sc->amr_workcount++;
            TAILQ_INSERT_TAIL(&sc->amr_work, ac, ac_link);

        /* not free, try to clean up while we wait */
        } else {
-->         printf("%s: busy flag %x\n", __FUNCTION__, sc->amr_mailbox->mb_busy);
            debug("busy flag %x\n", sc->amr_mailbox->mb_busy);
            worked = amr_done(sc);
        }
        splx(s);
    }




   Note the addition of the printf statement in the else clause. Two
interesting things happen. One, we are unable to cause the controller
to lock up. Two, the following messages show up in syslog:

Mar 20 12:55:15 cvsstage /kernel: amr_start: busy flag 1
Mar 20 12:55:46 cvsstage last message repeated 1057 times
Mar 20 12:57:47 cvsstage last message repeated 5574 times
Mar 20 12:59:26 cvsstage last message repeated 5431 times
Mar 20 12:59:26 cvsstage /kernel: amr_start: busy flag 0

   If I understand the sequence correctly, we enter splbio() and
then check the mailbox. Most of the time, we take the else clause
and the busy flag is 1 as it should be. However, once every 10 to 12
thousand loops, mb_busy is checked as being 1, but by the time we
get to the else clause, it's 0.

   I wonder if there is some sort of timing issue since the
addition of the printf allows the card to operate correctly. I
haven't traced the kernel printf code, but it could change the
spl level thus allowing the mb_busy flag to be modified.
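
   One way to make the diagnostic unambiguous is to read mb_busy once per
pass and use that same value for both the test and the message, so a
"busy flag 0" line can no longer appear in the busy branch. The following
is a minimal user-space sketch, not driver code: struct fake_mailbox and
poll_once() are names invented for illustration, and declaring the flag
volatile is an assumption about how a field written by the controller
would need to be treated.

```c
#include <stdio.h>

/* Hypothetical stand-in for the controller mailbox; not the real amr layout. */
struct fake_mailbox {
    volatile unsigned char mb_busy;   /* may be changed by the controller at any time */
};

/*
 * Read the flag exactly once, then test and report that same snapshot.
 * Sets *got_mailbox to 1 if the mailbox was free, 0 otherwise.
 * Returns the value that was actually tested.
 */
static unsigned char
poll_once(struct fake_mailbox *mb, int *got_mailbox)
{
    unsigned char busy = mb->mb_busy;   /* single read of the shared flag */

    *got_mailbox = (busy == 0);
    if (!*got_mailbox)
        printf("poll_once: busy flag %x\n", busy);   /* reports the tested value */
    return busy;
}
```

   With a single read, the printed value is by construction the value that
sent the code down the busy branch; whether the flag changes a moment later
is then a separate question from what the test saw.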

   Comments?

 
 Unfortunately, I'm not able to spend any time on this at the moment; if 
 someone wants to do a little experimenting I'd be very happy to talk them 
 through what I think should be done (will require some programming 
 ability).

   We're more than willing to try. Just point us in the right
direction.


-John







Re: AMI MegaRAID lockup? not accepting commands.

2000-03-20 Thread Mike Smith


A couple of clarifications on the last message:

 Thus, when you see it printed as 0, somewhere between the test and the 
 printf the controller has updated the flag and indicated it's busy. 

That should of course be "not busy".

 I'd be guessing that the current loop (100k iterations) is probably 
 completing far sooner than 1s.  You could confirm this by grabbing a 
 timestamp at the beginning of amr_start and then checking again at the 
 point where it bails out.  If that's the case, try cutting the initial 
 value of i down to 10,000 and insert a DELAY(100) in the "did not get 
 mailbox" case.

I didn't use DELAY() initially because I wasn't sure it would work 
correctly if called before interrupts are enabled.  That was probably a 
stupid mistake; I would try the above suggestion first as I suspect it'll 
get you going.
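
The suggested change amounts to a bounded poll: 10,000 attempts with a
100 microsecond pause after each miss gives a worst-case wait of
10,000 x 100 us = 1 s. The sketch below models that in user space so the
arithmetic can be checked. It is only an illustration under stated
assumptions: sim_delay() stands in for the kernel's DELAY() and merely
advances a simulated clock, and struct sim_mailbox, sim_mailbox_busy() and
wait_for_mailbox() are hypothetical names, not part of the driver.

```c
#include <stdbool.h>
#include <stdint.h>

#define POLL_TRIES     10000   /* initial value of i suggested in the thread */
#define POLL_DELAY_US  100     /* pause per failed attempt, as in DELAY(100) */

/* Hypothetical mailbox model: reports busy for busy_polls more reads. */
struct sim_mailbox {
    unsigned busy_polls;
};

static uint64_t sim_clock_us;   /* simulated elapsed time, not a real clock */

/* Stand-in for the kernel's DELAY(); only advances the simulated clock. */
static void
sim_delay(unsigned us)
{
    sim_clock_us += us;
}

static bool
sim_mailbox_busy(struct sim_mailbox *mb)
{
    if (mb->busy_polls == 0)
        return false;
    mb->busy_polls--;
    return true;
}

/*
 * Poll up to POLL_TRIES times, pausing POLL_DELAY_US after each miss.
 * Returns true if the mailbox came free; *waited_us receives the
 * simulated time spent waiting.
 */
static bool
wait_for_mailbox(struct sim_mailbox *mb, uint64_t *waited_us)
{
    uint64_t start = sim_clock_us;
    bool got = false;
    int i;

    for (i = POLL_TRIES; i > 0 && !got; i--) {
        if (!sim_mailbox_busy(mb))
            got = true;
        else
            sim_delay(POLL_DELAY_US);
    }
    *waited_us = sim_clock_us - start;
    return got;
}
```

Recording the start time and the elapsed time at the bail-out point is the
same measurement suggested for confirming how long the existing loop really
runs.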

Not that I consider this particularly optimal; busy-waiting for the 
controller is a terrible waste of the host CPU.  A better solution would 
probably defer the command and try again a short time later, but let's 
see if this works first.
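
For comparison, the deferral idea could look roughly like the sketch
below: try the mailbox once, and if it is busy, queue the command and
return instead of spinning; a later completion interrupt or timer then
retries the oldest queued command. This is a user-space model under loud
assumptions. struct cmd, struct ctrl, submit(), start_or_defer() and
retry_deferred() are all hypothetical names, the controller is simulated
by a boolean, and none of the locking or spl handling a real driver would
need is shown.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical command and controller state; not the real driver structures. */
struct cmd {
    int         id;
    struct cmd *next;
};

struct ctrl {
    bool        mailbox_busy;   /* stands in for mb_busy */
    struct cmd *defer_head;     /* FIFO of commands waiting for the mailbox */
    struct cmd *defer_tail;
    int         submitted;      /* commands handed to the simulated controller */
};

/* Hand one command to the simulated controller, which then becomes busy. */
static void
submit(struct ctrl *c, struct cmd *cm)
{
    (void)cm;
    c->mailbox_busy = true;
    c->submitted++;
}

/* Try once; if the mailbox is busy, queue the command instead of spinning. */
static bool
start_or_defer(struct ctrl *c, struct cmd *cm)
{
    if (!c->mailbox_busy && c->defer_head == NULL) {
        submit(c, cm);
        return true;
    }
    cm->next = NULL;
    if (c->defer_tail != NULL)
        c->defer_tail->next = cm;
    else
        c->defer_head = cm;
    c->defer_tail = cm;
    return false;
}

/* Called when the mailbox has come free: submit the oldest deferred command. */
static void
retry_deferred(struct ctrl *c)
{
    struct cmd *cm = c->defer_head;

    c->mailbox_busy = false;
    if (cm == NULL)
        return;
    c->defer_head = cm->next;
    if (c->defer_head == NULL)
        c->defer_tail = NULL;
    submit(c, cm);
}
```

The queue is kept in FIFO order, and a new command is also deferred while
older ones are still waiting, so commands reach the controller in the order
they were issued.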

-- 
\\ Give a man a fish, and you feed him for a day. \\  Mike Smith
\\ Tell him he should learn how to fish himself,  \\  [EMAIL PROTECTED]
\\ and he'll hate you for a lifetime. \\  [EMAIL PROTECTED]







Re: AMI MegaRAID lockup? not accepting commands.

2000-03-19 Thread Bill Fumerola

On Sun, Mar 19, 2000 at 09:00:35PM -0500, John W. DeBoskey wrote:

 amr0: AMI MegaRAID mem 0xf6c0-0xf6ff irq 14 at device 10.1 on pci2
 amr0: firmware 1.01 bios 1p00  128MB memory
 amrd0: MegaRAID logical drive on amr0
 amrd0: 172780MB (353853440 sectors) RAID 5 (optimal)
 
The adapter does not lock up while testing with bonnie and such.
 However, we have a 50Gig CVS repository sitting on the raid
 volume. When we do a 'cvs co' of -HEAD, it causes it to lock up.
 The following messages are repeating continuously:
 
 Mar 19 16:02:59 cvs /kernel: amr0: controller wedged (not taking commands)
 Mar 19 16:03:00 cvs /kernel: amr0: I/O error - dead
 Mar 19 16:03:00 cvs /kernel: amr0: cmd 2  ident 178  drive 0
 Mar 19 16:03:00 cvs /kernel: amr0: blkcount 12  lba 59506736
 Mar 19 16:03:00 cvs /kernel: amr0: virtaddr 0xd3089000  length 6144
 Mar 19 16:03:00 cvs /kernel: amr0: physaddr c880  nsg 2
 Mar 19 16:03:00 cvs /kernel: amr0:   1abea000/4096
 Mar 19 16:03:00 cvs /kernel: amr0:   25d2b000/2048

Ditto, less the kernel messages. Every process wedged into biord and I
couldn't reboot the machine or do much of anything (no DDB, g). Rebooting
resulted in all of the disks being marked 'failed' by the controller,
and I had to re-mark them 'online' and boot. A very long night (well, whatever
3am is...) of fscking followed.

It was not fun.

amr0: AMI MegaRAID mem 0xf6c0-0xf6ff irq 10 at device 10.1 on pci2
amr0: firmware 3.13 bios 1.43  16MB memory
amrd0: MegaRAID logical drive on amr0
amrd0: 173390MB (355102720 sectors) RAID 5 (optimal)

-- 
Bill Fumerola - Network Architect
Computer Horizons Corp - CVM
e-mail: [EMAIL PROTECTED] / [EMAIL PROTECTED]
Office: 800-252-2421 x128 / Cell: 248-761-7272




