Raid-5 Problem

2003-07-22 Thread Ryan Schaefer
I've got a Digital Ultimate Personal Workstation (AS1200) with dual
533 procs running Debian unstable. The machine has a five disk
software Raid-5 array using Linux Software Raid.

When I installed it, under whatever the default kernel is for Debian
Stable, everything worked great; the RAID array started up sync'd
fine. I've since upgraded to Debian Unstable and the 2.4.21 kernel. I
need to re-sync the array now after a power failure, and the following
happens:

The machine will come up all the way, begin syncing the array, and run
normally for a bit, then after a few minutes (3-5) the following will
come up:

nostromo:~# Unable to handle kernel paging request at virtual address 0240
CPU 0 swapper(0): Oops 1
pc = []  ra = []  ps = 0007Not tainted
v0 =   t0 =   t1 = fffc00227170
t2 =   t3 = 0002  t4 = 
t5 = fca6c484  t6 = fcaf8cc4  t7 = fc9d4000
s0 =   s1 = 0003  s2 = fc000ff69000
s3 = fc000ff690c0  s4 = 0003  s5 = fca77570
s6 = 
a0 = fc000fc3c080  a1 =   a2 = fc9d7ed8
a3 =   a4 =   a5 = fc9f88c0
t8 = 001f  t9 = 0014ca28b232  t10= 2800
t11= 0007  pv = fc826af0  at = 
gp = fffc00239c78  sp = fc9d7de8
Trace:fc818da4 fc819f8c fc82ce50 fc81a754 
fc813918 fc8151d0 fc82eb40 fc883a1c 
fc8151b4 fc883a1c fc8100e0 fc81001c
Code: 27ba0001  23bd2d3c  c3c4  47ff041f  2ffe  d3400049  c3b7
Kernel panic: Aiee, killing interrupt handler!
In interrupt handler - not syncing



If I log in immediately after the machine boots, umount the md device and run
raidstop on it, the machine will run flawlessly for (as far as I can
tell) forever.

I've tried going back to the 2.4.20 kernel and had the same
results. I've tried compiling the source for 2.4.21 and 2.4.22-pre7,
and neither will compile due to errors involving xor.c:




make[2]: Entering directory `/usr/src/kernel-source-2.4.21/drivers/md'
gcc -D__KERNEL__ -I/usr/src/kernel-source-2.4.21/include -Wall 
-Wstrict-prototypes -Wno-trigraphs -O2 -fno-strict-aliasing -fno-common 
-fomit-frame-pointer -pipe -mno-fp-regs -ffixed-8 -mcpu=ev56 -Wa,-mev6 -DMODULE 
-DMODVERSIONS -include 
/usr/src/kernel-source-2.4.21/include/linux/modversions.h  -nostdinc 
-iwithprefix include -DKBUILD_BASENAME=raid5  -c -o raid5.o raid5.c
gcc -D__KERNEL__ -I/usr/src/kernel-source-2.4.21/include -Wall 
-Wstrict-prototypes -Wno-trigraphs -O2 -fno-strict-aliasing -fno-common 
-fomit-frame-pointer -pipe -mno-fp-regs -ffixed-8 -mcpu=ev56 -Wa,-mev6 -DMODULE 
-DMODVERSIONS -include 
/usr/src/kernel-source-2.4.21/include/linux/modversions.h  -nostdinc 
-iwithprefix include -DKBUILD_BASENAME=xor  -DEXPORT_SYMTAB -c xor.c
In file included from xor.c:23:
/usr/src/kernel-source-2.4.21/include/asm/xor.h:35:5: missing terminating " character
In file included from xor.c:23:
/usr/src/kernel-source-2.4.21/include/asm/xor.h:36: error: request for member `text' in something not a structure or union
/usr/src/kernel-source-2.4.21/include/asm/xor.h:37: error: syntax error before numeric constant
/usr/src/kernel-source-2.4.21/include/asm/xor.h:62: error: syntax error at '#' token
/usr/src/kernel-source-2.4.21/include/asm/xor.h:87:17: invalid suffix "b" on integer constant
/usr/src/kernel-source-2.4.21/include/asm/xor.h:119: error: syntax error at '#' token
/usr/src/kernel-source-2.4.21/include/asm/xor.h:120: error: syntax error at '#' token
/usr/src/kernel-source-2.4.21/include/asm/xor.h:121: error: syntax error at '#' token
/usr/src/kernel-source-2.4.21/include/asm/xor.h:122: error: syntax error at '#' token
/usr/src/kernel-source-2.4.21/include/asm/xor.h:124: error: syntax error at '#' token
/usr/src/kernel-source-2.4.21/include/asm/xor.h:125: error: syntax error at '#' token
/usr/src/kernel-source-2.4.21/include/asm/xor.h:127: error: syntax error at '#' token
/usr/src/kernel-source-2.4.21/include/asm/xor.h:130: error: syntax error at '#' token
/usr/src/kernel-source-2.4.21/include/asm/xor.h:132: error: syntax error at '#' token
/usr/src/kernel-source-2.4.21/include/asm/xor.h:135: error: syntax error at '#' token
/usr/src/kernel-source-2.4.21/include/asm/xor.h:150: error: syntax error at '#' token
/usr/src/kernel-source-2.4.21/include/asm/xor.h:151: error: syntax error at '#' token
/usr/src/kernel-source-2.4.21/include/asm/xor.h:152: error: syntax error at '#' token
/usr/src/kernel-source-2.4.21/include/asm/xor.h:154: error: syntax error at '#' token
/usr/src/kernel-source-2.4.21/include/asm/xor.h:155: error: syntax error at '#' token
/usr/src/kernel-source-2.4.21/include/asm/xor.h:157: error: syntax error at '#' token
/usr/src/kernel-source-2.4.21/include/asm/xor.h:
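For readers less familiar with RAID-5: the xor.c that fails to build above supplies the XOR routines the md RAID-5 driver uses for parity. In a five-disk array each stripe holds four data chunks and one parity chunk (the parity chunk rotates between disks), and a resync after an unclean shutdown essentially recomputes each stripe's parity so that it matches the data again. The following is a minimal C sketch of that arithmetic only; the function names (`xor_into`, `compute_parity`, `rebuild_block`, `stripe_in_sync`) are hypothetical and are not the kernel's xor.c or raid5.c interface, which uses architecture-tuned routines (on Alpha, the ones in asm/xor.h).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: XOR-accumulate src into dst byte by byte.
 * The kernel does the same job with tuned, per-architecture code. */
static void xor_into(unsigned char *dst, const unsigned char *src, size_t len)
{
    for (size_t i = 0; i < len; i++)
        dst[i] ^= src[i];
}

/* Parity of one stripe: P = D0 ^ D1 ^ ... ^ D(ndata-1).
 * For a five-disk RAID-5, ndata is 4. */
static void compute_parity(unsigned char *parity,
                           const unsigned char *const data[], size_t ndata,
                           size_t len)
{
    assert(ndata >= 1);
    memset(parity, 0, len);
    for (size_t d = 0; d < ndata; d++)
        xor_into(parity, data[d], len);
}

/* Rebuild a lost data chunk from the parity and the surviving chunks.
 * data[missing] is never read. */
static void rebuild_block(unsigned char *out, size_t missing,
                          const unsigned char *parity,
                          const unsigned char *const data[], size_t ndata,
                          size_t len)
{
    assert(missing < ndata);
    memcpy(out, parity, len);
    for (size_t d = 0; d < ndata; d++)
        if (d != missing)
            xor_into(out, data[d], len);
}

/* A stripe is "in sync" when its stored parity equals the recomputed one.
 * scratch must hold len bytes. */
static int stripe_in_sync(const unsigned char *parity,
                          const unsigned char *const data[], size_t ndata,
                          size_t len, unsigned char *scratch)
{
    compute_parity(scratch, data, ndata, len);
    return memcmp(scratch, parity, len) == 0;
}
```

Because XOR is its own inverse, the same operation serves both for writing parity and for rebuilding a missing chunk, which is why a fast XOR routine sits on the hot path of every RAID-5 write and resync.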

Re: Raid-5 Problem

2003-07-23 Thread Adrian Zaugg
Dear Ryan

The same happens with RAID1 and maybe RAID0 software RAIDs (unable to
handle kernel paging request) when using or syncing them. Apparently
Craig Small did some testing in June 2002 and found the same. I
heard from the redhat-axp list that on RedHat 7.2 RAID1 is running
well:
"I am still on 2.4.9-32.5, and I've been running RAID1 for 2 years on it
with no problems." (Chu, E. Tue, 22 Jul 2003 12:32:17 -0700 (PDT)) Is it
the Debian patches to the kernel that cause the kernel panic? Does anyone
have software RAID running well under an "original" kernel?

There is a patch from Scott Bailey around which should fix the problem
with RAID1: see the kernel mailing list post "PATCH: raid1 on alpha" (S. Bailey, Thu,
13 Mar 2003 23:31:02 -0500), e.g. http://www.spinics.net/lists/raid/msg02526.html
I haven't patched my kernel yet, so I can't tell if it helps. I will
give it a try. But anyway, this doesn't help with your RAID5 problem...
Has anyone on the list used this patch successfully? (BTW: How do I
apply a patch like this? It is a patch for 2.4.20; does it work with
2.4.21, too? Do I only have to change the header line, if the code of
raid1.c is still the same? ...)

> I've tried changing back to the 2.4.20 kernel and had the same
> results. I've tried compiling the source for 2.4.21 and 2.4.22 pre7
> and neither will compile due to errors involving xor.c:
> I'm using the following gcc:
> 
> nostromo:/usr/src/kernel-source-2.4.21# gcc -v
> Reading specs from /usr/lib/gcc-lib/alpha-linux/3.3.1/specs
> Configured with: ../src/configure -v 
> --enable-languages=c,c++,java,f77,pascal,objc,ada,treelang --prefix=/usr 
> --mandir=/usr/share/man --infodir=/usr/share/info 
> --with-gxx-include-dir=/usr/include/c++/3.3 --enable-shared 
> --with-system-zlib --enable-nls --without-included-gettext 
> --enable-__cxa_atexit --enable-clocale=gnu --enable-debug 
> --enable-java-gc=boehm --enable-java-awt=xlib --enable-objc-gc alpha-linux
> Thread model: posix
> gcc version 3.3.1 20030626 (Debian prerelease)

Use gcc-3.2 to compile kernel 2.4.21 instead; that works. It warns
about multi-line string literals in xor.h, which are deprecated in
gcc-3.2 (hence only a warning) and no longer supported in gcc-3.3
(so compilation stops). (Is this a bug in the kernel source that
should be reported?)
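To illustrate the construct described above: judging from Ryan's log (a missing terminating " at xor.h line 35, then `.text` parsed as a member access on line 36), the 2.4 Alpha asm/xor.h embeds assembly in a string literal that spans several source lines. The sketch below shows the usual portable rewrite: one literal per line, each newline written as \n, relying on the compiler to concatenate adjacent literals. The directives and the label `example_xor` are illustrative assumptions, not the header's real contents.

```c
#include <assert.h>
#include <string.h>

/*
 * Old style (roughly what the errors point at): a raw newline inside the
 * literal. gcc 3.2 warns "multi-line string literals are deprecated";
 * gcc 3.3 stops with "missing terminating \" character".
 *
 *     asm("
 *         .text
 *         .align 3
 *     ");
 *
 * Portable style: one literal per source line, newlines spelled "\n".
 * Adjacent string literals are joined by the compiler, so asm_text below
 * is a single NUL-terminated string. (Label `example_xor` is hypothetical.)
 */
static const char asm_text[] =
    "\t.text\n"
    "\t.align 3\n"
    "\t.globl example_xor\n";

/* Compile-time check (C11): the pieces form exactly one combined literal. */
static_assert(sizeof asm_text ==
                  sizeof "\t.text\n\t.align 3\n\t.globl example_xor\n",
              "adjacent string literals are concatenated");
```

In the header itself the concatenated literal would be passed straight to asm(...); the sketch stores it in an array instead so that it compiles on any architecture.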

I had the same idea as you and tried software RAID with different
kernel versions. But neither 2.4.18, 2.4.20 nor 2.4.21 fixes the problem.
At least I get the impression that it works best on 2.4.21: the
machine was able to complete a sync operation if just one md device
was syncing. When I did something else in parallel, e.g. under X, the
computer sometimes blocked for a few seconds and then continued
normally (it is a 7305 with 4 processors and 4GB of memory, so the hardware
might not be the reason). I can imagine that if more md devices are syncing
simultaneously, a timing problem can occur during such "pauses". (Sorry,
it's just my speculation; I do not understand Linux very well!)

> On the axp-list, someone else posted with a similar question and was
> tersely told to get a new version of the HP/Compaq/DEC ftp server. Of
> course.. no link was included. If you know of a patched version of the
> kernel source, or a patch that I can apply to fix the compile errors,
> please let me know.

"Someone" was me... and like you, I couldn't find the mentioned kernel source.


Greetings, Adrian.




Re: Raid-5 Problem

2003-07-23 Thread Falk Hueffner
Adrian Zaugg <[EMAIL PROTECTED]> writes:

> The same happens with RAID1 and maybe RAID0 Software RAIDs (unable to
> handle kernel paging request) when using or syncing them.

Apparently, this is triggered by a gcc bug (see
http://gcc.gnu.org/PR11087). It will be fixed in the
soon-to-be-released gcc 3.3.1; I'm not sure whether it is already
fixed in 3.3.

> Use gcc-3.2 to compile kernel 2.4.21 instead, which works. It tells
> something of a multi literal string (or something like that) in
> xor.h, which is deprecated in gcc-3.2 (thus a warning) and no more
> supported in gcc-3.3 (leads to stop compiling). (Is this a bug of
> the kernel-source to be reported?)

Yes.

-- 
Falk




Re: Raid-5 Problem

2003-07-29 Thread Ryan Schaefer
On Tue, Jul 22, 2003 at 10:06:01PM -0500, Ryan Schaefer wrote:
> 
> I've got a Digital Ultimate Personal Workstation (AS1200) with dual
> 533 procs running Debian unstable. The machine has a five disk
> software Raid-5 array using Linux Software Raid.
> 
> When I installed it, under whatever the default kernel is for Debian
> Stable, everything worked great, the raid array started up sync'd
> fine. I've since upgraded to Debian Unstable and the 2.4.21 kernel. I
> need to re-sync the array now from a power failure and the following
> happens..


Epilogue:

I pulled the drives out of my AS1200 and threw 'em back into my AS1000a
(5/500) running 2.4.21-2-generic, and the machine rebuilt the array
fine and has been running for about 12 hours now without problems.

The big difference I notice, besides the type of machine, is that the
AS1000a has an Adaptec 2940U2W controller on the built-in array.

Maybe there's a bug in the ISP SCSI driver? Maybe this is just a bug
that pops up on the Rawhide architecture? I dunno. I'm going to try
putting the Adaptec controller in my AS1200 to see whether that
changes its behavior.

Another thing I noticed along the way was that 2.4.22-pre8 was able to
rebuild the array as long as no other activity was happening on the
machine. Once rebuilt, a moderate amount of disk activity would cause
the machine to die (let me know if you want a printout of this
crash).

So.. I just thought I'd let everyone know my experiences. 

--
Ryan Schaefer
[EMAIL PROTECTED]