Re: ahc problems (with vinum?)

1999-11-28 Thread Kenneth D. Merry

David Gilbert wrote...
> > "Joe" == Joe Greco <[EMAIL PROTECTED]> writes:
> 
> >> > Copyright (c) 1992-1999 FreeBSD Inc.
> >> > Copyright (c) 1982, 1986, 1989, 1991, 1993
> >> > The Regents of the University of California. All rights reserved.
> >> > FreeBSD 3.3-RELEASE #0: Mon Nov 22 13:38:07 CST 1999
> >> > root@host:/usr/src/sys/compile/DEMO
> >> 
> >> The first problem is that you're running 3.3-R with two 7890s.
> >> Justin worked around a bug in the 7890 in the Adaptec driver
> >> shortly after 3.3 came out.  I'd recommend at the very least
> >> updating your Adaptec driver, although depending on your
> >> circumstances, it might be easier to just update to the latest
> >> -stable.
> 
> Joe> Noted.  One is an onboard controller, part of the ASUS P2B-DS.
> Joe> This particular system was supposed to have a 3940, but I didn't
> Joe> have one so I crammed in two 2940-type controllers.  Would this
> Joe> also be an issue for a system with the onboard controller and a
> Joe> 3940-type controller?
> 
> In my case... this happens with single or multiple controllers and I'm
> not using the 3940's.  I am already running 3.3-STABLE (as of
> Thursday, I believe) because vinum improved quite a bit after
> 3.3-RELEASE.

So you shouldn't have a problem with the 7890 bug, since you have the newer
driver.  Joe's problem is actually with a bus on a non-Ultra2 2940.

> Joe> I thought a bus reset was supposed to deal with bus phase
> Joe> issues...?  But I'm admittedly an armchair SCSI quarterback.  I
> Joe> used to see Suns that had a heterogeneous SCSI array of mildly
> Joe> incompatible SCSI devices routinely go through the
> Joe> jam-reset-restart sequence.
> 
> One strange datapoint to add.  I was looking at the system in
> preparation for putting another SCSI controller into it so I could get 
> back the errors... and I removed the external terminator from the LVD
> chain to read its label.  Immediately, the screen started scrolling
> again with ahc messages --- This leads me to believe that everything
> is stuck waiting for the card to un-wedge.  Now... I'd already hit
> CTRL-ALT-DEL to see if I could unwedge things... so the system didn't
> come back at that point, but I thought it was interesting.

I believe LVD busses need to be properly terminated with an LVD terminator
in order to function.  So yanking the terminator is a bad idea. :)

> The following is my carefully typed sequence of messages from the
> console: (does not include messages after the terminator was
> removed).
> 
> (da5:ahc0:0:9;0): SCB 0xd5 - time out in dataout phase, SEQADDR == 0x5e
> (da5:ahc0:0:9;0): BDR message in message buffer
> (da5:ahc0:0:9;0): SCB 0xd5 - time out in dataout phase, SEQADDR == 0x5d
> (da5:ahc0:0:9;0): no longer in timeout, status 34b
> ahc0: Issued Channel A Bus Reset.  4 SCBs aborted

That generally means you have a cabling or termination problem.  It could
be a bent pin somewhere even.  Justin has had trouble with pins on his LVD
cables getting bent and causing weird problems.

So if you've got a spare cable setup, it might be a good idea to just swap
the cables out and see if the problem goes away.  Before you put in the new
cable set, make sure you check for any bent pins.

Specifically, the error above means that the bus probably got stuck for 60
seconds while the controller was trying to write data out to the drive.
So it is the fact that your bus is getting stuck in dataout phase that
leads me to believe that you've got a cabling or termination problem.
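
For anyone scanning a console log for these messages, here is a minimal sketch
in C that pulls the fields out of one timeout line.  The struct and function
names are made up for illustration, and the format string is based only on the
"timed out in ... phase" lines quoted in this thread, not on the driver
source.  Any leading "> " quote prefix has to be stripped first.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Fields from one ahc timeout line (names are illustrative only). */
struct ahc_timeout {
	char periph[16];	/* e.g. "da5" */
	char ctlr[16];		/* e.g. "ahc1" */
	int bus, target, lun;
	unsigned scb;		/* SCB index, printed in hex */
	char phase[16];		/* e.g. "datain" or "dataout" */
	unsigned seqaddr;	/* sequencer address, printed in hex */
};

/* Returns 1 if the line is a "timed out in <phase> phase" message, else 0. */
int parse_ahc_timeout(const char *line, struct ahc_timeout *t)
{
	int n;

	assert(line != NULL && t != NULL);
	n = sscanf(line,
	    "(%15[^:]:%15[^:]:%d:%d:%d): SCB 0x%x - timed out in %15s phase, "
	    "SEQADDR == 0x%x",
	    t->periph, t->ctlr, &t->bus, &t->target, &t->lun,
	    &t->scb, t->phase, &t->seqaddr);
	return n == 8;
}

/* Maps the phase to the advice given in this thread. */
const char *ahc_timeout_hint(const struct ahc_timeout *t)
{
	if (strcmp(t->phase, "datain") == 0 || strcmp(t->phase, "dataout") == 0)
		return "bus stuck in a data phase: check cabling and termination";
	return "no hint";
}
```

Feeding it a line such as "(da5:ahc1:0:4:0): SCB 0xdc - timed out in dataout
phase, SEQADDR == 0x5d" fills in target 4 and phase "dataout".  Lines that do
not match the pattern, such as the bus reset notice, are rejected.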

Ken
-- 
Kenneth Merry
[EMAIL PROTECTED]


To Unsubscribe: send mail to [EMAIL PROTECTED]
with "unsubscribe freebsd-stable" in the body of the message



Re: ahc problems (with vinum?)

1999-11-28 Thread Kenneth D. Merry

David Gilbert wrote...
> > "Kenneth" == Kenneth D Merry <[EMAIL PROTECTED]> writes:
> 
> Kenneth> It does, generally, but if you've got flaky cabling, it's
> Kenneth> hard to guarantee that the bus reset will fix all of your
> Kenneth> problems.
> 
> But since removing the terminator seems to unwedge things, it would
> make sense to look at what state we're getting stuck in.

You're getting stuck in dataout phase.

> In my case, the components were checked and certified by an ISO 9001
> VAR on Friday.  I have also tried several different Adaptec cards.

From personal experience, ISO 9001 doesn't mean they're competent. :)
(I think it just means they document things.)

And you've only tried different cards, not cables.

Ken
-- 
Kenneth Merry
[EMAIL PROTECTED]





Re: ahc problems (with vinum?)

1999-11-28 Thread Kenneth D. Merry

David Gilbert wrote...
> > "Kenneth" == Kenneth D Merry <[EMAIL PROTECTED]> writes:
> 
> Kenneth> I haven't heard of any problems with the Atlas IV that would
> Kenneth> cause anything other than Quantum's usual bogus queue full
> Kenneth> behavior, but of course that doesn't mean that there aren't
> Kenneth> any.
> 
> Hmmm... well... this is very repeatable.  This controller now has
> nothing but the ARRAY of LVD drives using the vendor's cable and
> terminator.  I have another LVD cable and terminator (that came with
> an HP DAT drive)... I'm going to try them...

Hopefully that'll fix it.

> Basically, cd'ing to a directory on the raid drive and doing
> 
> dump -0af - /usr | team 1m 8 | restore rf -
> 
> will cause the problem very repeatably.
> 
> (da5:ahc1:0:4:0): SCB 0xdc - timed out in dataout phase, SEQADDR == 0x5d
> (da5:ahc1:0:4:0): Other SCB Timeout
> (da8:ahc1:0:11:0): SCB 0xe6 - timed out in dataout phase, SEQADDR == 0x5d
> (da8:ahc1:0:11:0): BDR message in message buffer
> (da8:ahc1:0:11:0): SCB 0xe6 - timed out in dataout phase, SEQADDR == 0x5e
> (da8:ahc1:0:11:0): no longer in timeout, status = 34b
> ahc1: Issued Channel A Bus Reset. 4 SCBs aborted

Yep, looks like a cabling problem all right.

Ken
-- 
Kenneth Merry
[EMAIL PROTECTED]





Re: ahc problems (with vinum?)

1999-11-28 Thread Joe Greco

> > Noted.  One is an onboard controller, part of the ASUS P2B-DS.  This
> > particular system was supposed to have a 3940, but I didn't have one
> > so I crammed in two 2940-type controllers.  Would this also be an issue
> > for a system with the onboard controller and a 3940-type controller?
> 
> It will be an issue for any system with a 7890/1 in it.  I'm not sure if
> the same bug affects the 7896/7, so I can't say whether using a 3950 would
> fix the problem.
> 
> > > That isn't where your problems are showing up, however.  (Likely you
> > > haven't loaded your system enough to trigger the 7890 problem.)
> > 
> > Maybe/maybe not.  What might I expect to see from such a problem?
> 
> Well, I know you would probably get some data corruption.  I can't remember
> which list the thread was on, but you can search for "data corruption" and
> "aic7890" in the -current and -hackers list archives and see what turns up.

Ok.

> > I have certainly beat the $#!+ out of these systems in a variety of ways,
> > and have run into some odd things.  Most were traceable to SCSI issues.
> > Some didn't get classified.  I'm running vinum in a ten-filesystem config
> > on top of the 18 18GB drives, and I copy in data from another machine.  I
> > then have an application which mmap()'s the files, doing search and replace
> > ops on the data.  Running this app in parallel causes the system to hang
> > (eventually causing the watchdog to expire and reset the system).  Running
> > it serially on one fs at a time doesn't.  This is probably the most
> > worrisome of the issues I've seen.  If you have a recommended revision of
> > the ahc driver you'd like me to try, let me know.
> 
> Yes, you should run a version of the driver that has Justin's fix from
> September 20th.  Unfortunately, he didn't find the problem before 3.3 came
> out.

Ok.

> > > > (da10:ahc2:0:0:0): SCB 0x9 - timed out in datain phase, SEQADDR == 0x153
> > > > (da10:ahc2:0:0:0): no longer in timeout, status = 34b
> > > > ahc2: Issued Channel A Bus Reset. 3 SCBs aborted
> > > > (da10:ahc2:0:0:0): SCB 0xa - timed out in datain phase, SEQADDR == 0x110
> > > > (da10:ahc2:0:0:0): BDR message in message buffer
> > > > (da10:ahc2:0:0:0): SCB 0xa - timed out in datain phase, SEQADDR == 0x10f
> > > > (da10:ahc2:0:0:0): no longer in timeout, status = 34b
> > > > ahc2: Issued Channel A Bus Reset. 6 SCBs aborted
> > > > 4357+1 records in
> > > > 4357+1 records out
> > > > 456960 bytes transferred in 428.640450 secs (10660683 bytes/sec)
> > > 
> > > [ ... ]
> > > 
> > > "Timed out in {datain|dataout} phase" means that a transaction took longer
> > > than 60 seconds to complete, and the bus was stuck in datain/dataout phase
> > > at the time.
> > > 
> > > This is almost always the result of a cabling or termination problem.
> > > 
> > > So you'll probably want to replace the cable on your Ultra-Wide chain, and
> > > verify that the termination is correct.
> > 
> > It's more complex than that. :-)  These machines are intended for deployment
> > in remote areas, and realistically I may never see many of them ever again
> > after that point.  They are rackmount in Antec PC cases and Kingston 9-bay
> > drive arrays, the drives themselves are mounted in Antec 690 drive modules.
> > This allows for easy replacement/upgrade in the event of problems, and with
> > the exception of this one problem-child machine, has worked out fantastic
> > so far.  But it introduces multi-multi variables into the equation.  The
> > 3940-to-PC backplate cable, the external cables, the terminators, the
> > internal 9-position Kingston ribbon cable, any of the 9 receiving brackets,
> > any of the 9 drive modules, and any of the 9 drives can potentially be an
> > issue.  The Antec drive modules seem to be the typical source of flakiness,
> > about 1:20 seem to give problems.
> > 
> > Okay, now, stop rolling your eyes.  I know it is ugly from a SCSI
> > perspective, but it is very functional and very useful, not to mention very
> > nice and damn fast.  It's hard to build something like that which can also
> > be deployed in a remote location where you'll have to explain to someone who
> > has 1/2 a clue what you want replaced, and how.  I prefer the
> > no-screwdriver-required method.
> 
> Oh, I can certainly appreciate the idiot-proof approach.  In your
> situation, it makes a lot of sense.  However it'll make it a little more
> difficult to track down the problem.

Already tracked down and fixed as of last week, it just took some time since
the problem only manifested itself after really hammering on the thing for a
while.  Sorry I didn't make that clear.  :-)  It makes for a really sucky
debug cycle... try "x", hammer on system for hours, watch for errors.  You
know.  Bleah.

... Joe

---
Joe Greco - Systems Administrator [EMAIL PROTECTED]
Solaria Public Access UNIX - Milwaukee, WI 

Re: ahc problems (with vinum?)

1999-11-28 Thread David Gilbert

> "Kenneth" == Kenneth D Merry <[EMAIL PROTECTED]> writes:

Kenneth> David Gilbert wrote...
>> > "Kenneth" == Kenneth D Merry <[EMAIL PROTECTED]> writes:
>> 
Kenneth> It does, generally, but if you've got flaky cabling, it's
Kenneth> hard to guarantee that the bus reset will fix all of your
Kenneth> problems.
>>  But since removing the terminator seems to unwedge things, it
>> would make sense to look at what state we're getting stuck in.

Kenneth> You're getting stuck in dataout phase.

Ok... but...

I just went over to the machine with the intention of changing the
cables. I removed the terminator just to see... and everything
unwedged (although ... this is after I replaced the terminator).

Dave.

-- 

|David Gilbert, Velocet Communications.   | Two things can only be |
|Mail:   [EMAIL PROTECTED] |  equal if and only if they |
|http://www.velocet.net/~dgilbert |   are precisely opposite.  |





mmap bugs (was Re: ahc problems (with vinum?))

1999-11-28 Thread Mike Tancsa

At 02:45 PM 11/28/99 , Joe Greco wrote:
>I have certainly beat the $#!+ out of these systems in a variety of ways,
>and have run into some odd things.  Most were traceable to SCSI issues.
>Some didn't get classified.  I'm running vinum in a ten-filesystem config
>on top of the 18 18GB drives, and I copy in data from another machine.  I
>then have an application which mmap()'s the files, doing search and replace
>ops on the data.  Running this app in parallel causes the system to hang
>(eventually causing the watchdog to expire and reset the system).  Running
>it serially on one fs at a time doesn't.  This is probably the most
>worrisome of the issues I've seen.  If you have a recommended revision of
>the ahc driver you'd like me to try, let me know.

Can you post more details of the mmap bug you have come across?  It would
be nice if this were fixed for 3.4. [EMAIL PROTECTED] is coordinating
testing of RCs for 3.4.  Perhaps this is a problem that someone could
fix in time.

---Mike
**
Mike Tancsa   *  [EMAIL PROTECTED]
Sentex Communications Corp,   *  http://www.sentex.net/mike
Cambridge, Ontario*  519 651 3400
Canada*





Re: mmap bugs (was Re: ahc problems (with vinum?))

1999-11-28 Thread Joe Greco

> At 02:45 PM 11/28/99 , Joe Greco wrote:
> >I have certainly beat the $#!+ out of these systems in a variety of ways,
> >and have run into some odd things.  Most were traceable to SCSI issues.
> >Some didn't get classified.  I'm running vinum in a ten-filesystem config
> >on top of the 18 18GB drives, and I copy in data from another machine.  I
> >then have an application which mmap()'s the files, doing search and replace
> >ops on the data.  Running this app in parallel causes the system to hang
> >(eventually causing the watchdog to expire and reset the system).  Running
> >it serially on one fs at a time doesn't.  This is probably the most
> >worrisome of the issues I've seen.  If you have a recommended revision of
> >the ahc driver you'd like me to try, let me know.
> 
> Can you post more details of the mmap bug you have come across?  It would
> be nice if this were fixed for 3.4. [EMAIL PROTECTED] is coordinating
> testing of RCs for 3.4.  Perhaps this is a problem that someone could
> fix in time.

That's the problem, I don't really know what it is.  I'd sure love to see it
fixed, since anything that can hang a system in such a manner is unsettling,
but I don't really have much of an idea what's causing it.  It could be a
vinum thing, it could be some VM thing, it could be my crappy programming
(but userland programs should never puke the kernel).

I'll show you the program, the wrapper script, and a description of the
specific environment and use.  I'll also try to get around to doing some
additional debugging, but basically I've been seeing a soft system lockup
(userland processes appear to stop running, but console is responsive to
vty changes, pressing return results in an echo but the underlying program
doesn't seem to receive it and then further keystrokes are not echoed).
The kernel is still sane enough to be running my watchdog code, which will
eventually cause the system to reboot via software.  However, it does a
forced termination of the kernel since killing init doesn't work.

% cat filesed.c
/*
 * filesed.c
 *
 * (c) 1999 Joe Greco and sol.net Network Services.  All Rights Reserved.
 *
 * mmap a file, hunting for a string.  Replace with an identical-length
 * string.  Intended for scouring a spool and replacing Path: hosts after
 * a load-via-disk-copy.
 *
 * filesed 'from' 'to' file [file...]
 */

#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>


int filesed(char *file, char *from, char *to)
{
	int count = 0;
	int slen = strlen(from);
	struct stat statbuf;
	caddr_t map;
	char *here, *end, *ptr;
	int fd;

	if (stat(file, &statbuf) < 0) {
		perror(file);
		return(-1);
	}
	if ((fd = open(file, O_RDWR, 0)) < 0) {
		perror(file);
		return(-1);
	}
	/* Compare against MAP_FAILED; casting the pointer to int truncates on 64-bit. */
	if ((map = mmap(NULL, statbuf.st_size, PROT_READ|PROT_WRITE, MAP_SHARED,
	    fd, 0)) == MAP_FAILED) {
		perror(file);	/* report before close() can change errno */
		close(fd);
		return(-1);
	}

	/* Search and replace. */
	here = map;
	/* One past the last offset at which a match can still start. */
	end = map + statbuf.st_size - slen + 1;

	while (here < end) {
		ptr = memchr(here, *from, end - here);
		if (! ptr) {
			here = end;
		} else {
			if (! memcmp(ptr, from, slen)) {
				memcpy(ptr, to, slen);
				count++;
			}
			here = ptr + 1;
		}
	}

	if (munmap(map, statbuf.st_size) < 0) {
		perror(file);
	}
	close(fd);	/* the descriptor is not needed once the mapping is gone */
	if (count) {
		printf("%s: %d change%s\n", file, count, count == 1 ? "" : "s");
	} else {
		printf("%s: no changes\n", file);
	}
	return(0);
}


int main(int argc, char *argv[])
{
	int slen;
	char *from;
	char *to;

	if (argc < 4) {
		fprintf(stderr, "usage: filesed from to file [file ...]\n");
		exit(1);
	}

	from = argv[1];
	to = argv[2];
	slen = strlen(from);
	if (slen != (int)strlen(to)) {
		fprintf(stderr, "error: string lengths must be identical\n");
		exit(1);
	}
	if (! slen) {
		fprintf(stderr, "error: zero-length string unacceptable\n");
		exit(1);
	}
	argv += 3;
	argc -= 3;

	while (argc) {
		filesed(*argv, from, to);
		argv++;
		argc--;
	}
	return(0);
}
% cat fixpath.sh
#! /bin/sh -

case "${1}" in
spool*|bins*)   continue;;
*)  exit 1;;
esac

for i in /news/spool/news/N.*; do
find ${i} -type f -name 'B.*' -print | xargs ./filesed $1 $2 &
done
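
The matching logic in filesed.c can also be tried on an ordinary buffer, with
no mmap or vinum involved, which may help separate a userland logic problem
from the kernel hang.  The following is only a sketch under that assumption.
The function name is invented for illustration, and it lets a match start at
the very last possible offset in the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Replace every occurrence of `from` with the same-length `to` inside
 * buf[0..len).  Returns the number of replacements, or -1 if the strings
 * are empty or differ in length.  (Illustrative helper, not from filesed.c.)
 */
int replace_same_length(char *buf, size_t len, const char *from, const char *to)
{
	size_t slen, pos = 0;
	int count = 0;

	assert(buf != NULL && from != NULL && to != NULL);
	slen = strlen(from);
	if (slen == 0 || slen != strlen(to))
		return -1;

	/* A match can start no later than offset len - slen. */
	while (len >= slen && pos <= len - slen) {
		char *ptr = memchr(buf + pos, from[0], len - slen - pos + 1);

		if (ptr == NULL)
			break;
		if (memcmp(ptr, from, slen) == 0) {
			memcpy(ptr, to, slen);
			count++;
		}
		/* Resume one byte past the candidate, as filesed.c does. */
		pos = (size_t)(ptr - buf) + 1;
	}
	return count;
}
```

Calling it on the buffer "Path: old!x old" with from "old" and to "new"
rewrites both occurrences, including the one that ends exactly at the end of
the buffer.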

What happens is I've got a system that looks like this:

% df -k
Filesystem  1K-blocks UsedAvail Cap