> This topic has come up a few times. Can you post Gadi's comments
> to linux-raid

Ok, here we go:
===============================================

>From [EMAIL PROTECTED] Fri Feb 27 16:51:19 1998
Date: Fri, 27 Feb 1998 16:51:17 +0300 (IST)
From: Gadi Oxman <[EMAIL PROTECTED]>
X-Sender: gadio@localhost
To: Richard Jones <[EMAIL PROTECTED]>
cc: [EMAIL PROTECTED]
Subject: Re: 2.1.88: rpc.nfsd hanging in __wait_on_page, the saga continues
In-Reply-To: <[EMAIL PROTECTED]>
Message-ID: <Pine.LNX.3.91.980227161918.210A-100000@localhost>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: RO
X-Status:

Hi,

> Hi all:
>
> I still have this bug. I've got a few questions and
> observations that I'd like to pose to cleverer kernel
> hackers than me on this list. Any suggestions would
> be a great help.
>
> - Every time I've seen the bug, just before rpc.nfsd hangs,
>   I see the following message printed by the kernel:
>
>   raid5: bug: stripe->bh_new[5], sector XXX exists
>
>   The dd_idx argument is always 5 in the cases observed
>   so far, but the sector number XXX is different.
>
> - This message seems to be provoked by the function
>   add_stripe_bh() in raid5.c. What does this bug message
>   mean? If I see this message, is it possible that it
>   could be triggering the hang in __wait_on_page?

This "bug: stripe->bh_new[5], sector XXX exists" is an old bug and has
been reported on several RAID-5 configurations, from the very first RAID-5
release.

Unfortunately, we don't have a theory which might explain this.

Whenever a bh is requested from the MD device by ll_rw_block(), the bh is
locked and registered in a stripe head (sh) structure. Each sh manages a
single row in the RAID array and contains bh's which share the same sector
number, one for each device, such that XOR-ing all the bh's together yields 0.
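
As a side note, that parity invariant can be illustrated with a small
self-contained userspace program; the disk count, block size and function
names below are made up for illustration and are not the driver's real data
structures:

/* Minimal userspace sketch of the parity invariant described above:
 * XOR-ing the data blocks of a row gives the parity block, so XOR-ing
 * every buffer in the row (data + parity) yields zero.  The disk count
 * and block size are illustrative only. */
#include <assert.h>
#include <string.h>

#define NDISKS    4        /* 3 data buffers + 1 parity buffer */
#define BLOCKSIZE 4096

static void compute_parity(unsigned char row[NDISKS][BLOCKSIZE])
{
        memset(row[NDISKS - 1], 0, BLOCKSIZE);
        for (int d = 0; d < NDISKS - 1; d++)
                for (int i = 0; i < BLOCKSIZE; i++)
                        row[NDISKS - 1][i] ^= row[d][i];
}

int main(void)
{
        static unsigned char row[NDISKS][BLOCKSIZE];

        memcpy(row[0], "some data", 9);
        memcpy(row[1], "more data", 9);
        compute_parity(row);

        /* the invariant: XOR across all buffers in the row is zero */
        for (int i = 0; i < BLOCKSIZE; i++) {
                unsigned char x = 0;
                for (int d = 0; d < NDISKS; d++)
                        x ^= row[d][i];
                assert(x == 0);
        }
        return 0;
}

The real driver maintains this invariant per stripe row, across the member
devices of the array.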

The above error message indicates that a buffer which is now being
requested from the RAID device was already requested in the recent past,
is currently locked, registered in an sh, and in the process of being
serviced, yet it is being requested again by the higher-level layers.

The code displays the "bug" error message, but otherwise ignores the
condition and overwrites the corresponding position in the sh with the
new buffer. If the new buffer is not actually the old buffer, and a
process is sleeping on the old buffer, this process will be locked in
uninterruptible sleep.
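
In other words, the old code path looked roughly like this (a paraphrase of
the behavior described above, not the literal raid5.c source; as you noted,
the message comes from add_stripe_bh()):

	/* Paraphrase of the pre-fix behavior: the warning is printed,
	 * but the slot is overwritten anyway. */
	if (sh->bh_new[dd_idx])
		printk("raid5: bug: stripe->bh_new[%d], sector %lu exists\n",
		       dd_idx, sh->sector);
	sh->bh_new[dd_idx] = bh;	/* the old bh is forgotten; a process
					 * sleeping on it stays in 'D' state */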

It is a rare bug, and can't be easily reproduced. The real cause might
not even be in the RAID-5 code (just detected by the RAID-5 code),
which makes it even harder to understand.

> - Sometimes I see this message, and rpc.nfsd doesn't hang
>   (unless it hangs *much* later ...)
>
> - In __wait_on_stripe in the same file, should sh->count++;
>   and sh->count-- be replaced by atomic_inc(&sh->count) and
>   atomic_dec(&sh->count)? If not, why not? Seems like there's
>   a possible race condition here.

No, since sh->count is never modified from an interrupt context, only
from a process context.

However, the code was written with UP in mind, and I'm afraid that it
might not be SMP safe; there might be additional issues which are only
visible on SMP.

SMP is probably not the reason for this bug, however, as it has also been
reported with the 2.0.x kernels, where the kernel is protected on SMP
behind a global kernel lock.
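
For what it's worth, the generic lost-update race that makes plain
increments unsafe on SMP can be demonstrated with a small userspace C
program (C11 atomics and pthreads; this illustrates the general problem,
not the driver code):

/* Two threads doing a plain "count++" can lose updates when run on two
 * CPUs at once, while an atomic increment cannot. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define ITERS 1000000

static int plain_count;                 /* like a plain sh->count++ */
static atomic_int atomic_count;         /* like atomic_inc(&sh->count) */

static void *worker(void *arg)
{
        (void)arg;
        for (int i = 0; i < ITERS; i++) {
                plain_count++;                      /* can lose updates */
                atomic_fetch_add(&atomic_count, 1); /* never loses updates */
        }
        return NULL;
}

int main(void)
{
        pthread_t t1, t2;

        pthread_create(&t1, NULL, worker, NULL);
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        /* on SMP, plain_count usually ends up below 2*ITERS */
        printf("plain %d, atomic %d\n", plain_count,
               atomic_load(&atomic_count));
        return 0;
}

On a UP kernel, with sh->count only touched from process context, no such
interleaving can occur, which is why the plain increment is sufficient there.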

Gadi



>From [EMAIL PROTECTED] Tue Apr 21 22:13:53 1998
Date: Tue, 21 Apr 1998 22:13:52 +0400 (IDT)
From: Gadi Oxman <[EMAIL PROTECTED]>
To: Richard Jones <[EMAIL PROTECTED]>,
    Linus Torvalds <[EMAIL PROTECTED]>,
    MOLNAR Ingo <[EMAIL PROTECTED]>,
    Miguel de Icaza <[EMAIL PROTECTED]>,
    "Theodore Y. Ts'o" <[EMAIL PROTECTED]>
Subject: Re: RAID5 (or ext2_truncate?) bug
In-Reply-To: <[EMAIL PROTECTED]>
Message-ID: <[EMAIL PROTECTED]>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: RO
X-Status:

Hi,

> Gadi:
>
> We just got a message from your patch ...
>
> raid5: bug: stripe->bh_new[1], sector 13749952 exists
> raid5: bh cfb37720, bh_new c9119620
>
> The NFS server kept on running this time. Thanks :-)
>
> However, of course we have no stack trace, so we don't
> know if it hit the (possible) bug in ext2_truncate, or
> if it was caused by something else. Nevertheless, the timing
> and symptoms were very similar to what happened previously.
>
> I can now fairly reliably reproduce the report by:
>
>       . having lots of NFS traffic
>       . running htmerge (the final stage of the htdig
>         search program)
>
> At some point during the htmerge, htmerge spawns off
> a sort process, and somewhere around this time we hit
> the bug.
>
> Of course, HTDIG takes about 6 hours to run, so it's not
> a great test ...
>
> Rich.

Thanks for the report, Rich; it looks like we are finally about to resolve
this long-standing cause of crashes in the RAID-5 driver :-)

The fact that bh != bh_new seems to confirm our assumption that at least
some cases of this problem are not caused directly by the RAID-5 driver --
we are receiving two I/O requests for the same disk buffer, pointed to by
two different memory buffers, simultaneously.

The old RAID-5 behavior simply issued a warning, forgot about the old
buffer, and serviced only the new one, which probably explains the
processes that were stuck in the 'D' state, waiting on the old buffer.

Linus, I appended the RAID-5 patch below; the new behavior waits for
I/O on the first buffer to complete, and then services the new buffer,
which is more or less the behavior of a standard non-RAID device.

However, receiving two different buffer heads which point to the same
disk buffer simultaneously seems to indicate a potential problem in the
higher levels of the kernel.

Last time it happened, Rich was running with Ingo's patch which forced
a stack trace at this point, and it always happened in combination
with truncate():

Using `/System.map-2.1.88' to map addresses to symbols.

>>EIP: d0815230 cannot be resolved
Trace: d0816f18
Trace: d0816ee5
Trace: d0815f23
Trace: c0164f0b <md_make_request+93/b0>
Trace: c01663e1 <ll_rw_block+169/1e0>
Trace: c0138888 <ext2_bread+3c/7c>
Trace: c013ca0e <ext2_truncate+e6/16c>   <-***---
Trace: c0121b22 <do_truncate+56/70>
Trace: c0121c3f <sys_truncate+103/130>
Trace: c0109926 <system_call+3a/40>
Code:
Code:  c7 05 00 00 00   movl   $0x0,0x0
Code:  00 00 00 00 00
Code:  83 c4 10         addl   $0x10,%esp
Code:  b8 02 00 00 00   movl   $0x2,%eax
Code:  8b 55 00         movl   0x0(%ebp),%edx
Code:  90               nop
Code:  90               nop
Code:  90               nop

Here is the RAID-5 patch:

Thanks,

Gadi

--- linux/drivers/block/raid5.c.old     Tue Mar 31 17:39:19 1998
+++ linux/drivers/block/raid5.c Tue Mar 31 17:41:10 1998
@@ -1209,6 +1209,14 @@
        sh->pd_idx = pd_idx;
        if (sh->phase != PHASE_COMPLETE && sh->phase != PHASE_BEGIN)
                PRINTK(("stripe %lu catching the bus!\n", sh->sector));
+       if (sh->bh_new[dd_idx]) {
+               printk("raid5: bug: stripe->bh_new[%d], sector %lu exists\n", dd_idx, sh->sector);
+               printk("raid5: bh %p, bh_new %p\n", bh, sh->bh_new[dd_idx]);
+               lock_stripe(sh);
+               md_wakeup_thread(raid_conf->thread);
+               wait_on_stripe(sh);
+               goto repeat;
+       }
        add_stripe_bh(sh, bh, dd_idx, rw);

        md_wakeup_thread(raid_conf->thread);


>From [EMAIL PROTECTED] Mon Feb  1 20:38:23 1999
Date: Mon, 1 Feb 1999 20:38:23 +0200 (IST)
From: Gadi Oxman <[EMAIL PROTECTED]>
X-Sender: gadio@localhost
To: Stephen Costaras <[EMAIL PROTECTED]>
cc: [EMAIL PROTECTED]
Subject: Re: raid5: bugs?
In-Reply-To: <001801be4d8e$02b89aa0$[EMAIL PROTECTED]>
Message-ID: <Pine.LNX.3.96.990201203524.1074B-100000@localhost>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Status: RO
X-Status:

On Sun, 31 Jan 1999, Stephen Costaras wrote:

>
> Now that I'm back on the list I wanted to bring this up.  For
> the past while, running Linux 2.2.0pre6 and the 2.2.0 release, I've
> seen the following on one of my volumes:
>
> --------
> Jan 21 19:49:09 news kernel: raid5: bug: stripe->bh_new[6], sector 2643680 exists
> Jan 21 19:49:10 news kernel: raid5: bh c27d7080, bh_new c36d5a40
> Jan 23 05:46:18 news kernel: raid5: bug: stripe->bh_new[5], sector 3216552 exists
> Jan 23 05:46:18 news kernel: raid5: bh c922ec20, bh_new cb11d9c0
> Jan 23 08:18:00 news kernel: raid5: bug: stripe->bh_new[4], sector 3217776 exists
> Jan 23 08:18:00 news kernel: raid5: bh c922ece0, bh_new ca9735a0
> Jan 25 17:46:20 news kernel: raid5: bug: stripe->bh_new[8], sector 6467032 exists
> Jan 25 17:46:20 news kernel: raid5: bh c2f88a20, bh_new c123c480
> Jan 26 05:15:14 news kernel: raid5: bug: stripe->bh_new[4], sector 4462040 exists
> Jan 26 05:15:14 news kernel: raid5: bh c0a3c300, bh_new c32b93e0
> Jan 26 17:46:01 news kernel: raid5: bug: stripe->bh_new[4], sector 3975968 exists
> Jan 26 17:46:02 news kernel: raid5: bh c4289da0, bh_new c2d5d200
> Jan 26 20:45:18 news kernel: raid5: bug: stripe->bh_new[7], sector 5493120 exists
> Jan 26 20:45:18 news kernel: raid5: bh c922e2c0, bh_new c8b370e0
> Jan 29 20:45:24 news kernel: raid5: bug: stripe->bh_new[1], sector 5573072 exists
> Jan 29 20:45:24 news kernel: raid5: bh c2bfb7a0, bh_new ca244c60
> Jan 30 19:45:43 news kernel: raid5: bug: stripe->bh_new[1], sector 5599704 exists
> Jan 30 19:45:43 news kernel: raid5: bh ccf60860, bh_new c9bdecc0
> Jan 30 22:45:37 news kernel: raid5: bug: stripe->bh_new[6], sector 4089216 exists
> Jan 30 22:45:38 news kernel: raid5: bh c4da2e00, bh_new c4da2f20
> Jan 31 20:15:25 news kernel: raid5: bug: stripe->bh_new[7], sector 3607576 exists
> Jan 31 20:15:25 news kernel: raid5: bh c96149c0, bh_new c3086420
> --------
>
> This is my raidtab file for this array:
>
> raiddev                 /dev/md0
> raid-level              5
> nr-raid-disks           9
> chunk-size              512
> parity-algorithm        left-symmetric
> nr-spare-disks          0
> device                  /dev/sdb1
> raid-disk               0
> device                  /dev/sdc1
> raid-disk               1
> device                  /dev/sdd1
> raid-disk               2
> device                  /dev/sde1
> raid-disk               3
> device                  /dev/sdf1
> raid-disk               4
> device                  /dev/sdg1
> raid-disk               5
> device                  /dev/sdh1
> raid-disk               6
> device                  /dev/sdi1
> raid-disk               7
> device                  /dev/sdj1
> raid-disk               8
> --------
>
> Can anyone here shed any light on this?  Is this something that I need to
> worry about?  Am I corrupting data?
>
> Steve

Probably not. That message means that we detected a problem which would
cause a crash on older kernels, but worked around it (two requests
for the same disk sector were issued simultaneously from two different
buffers).

We still print the message, as it might point to a bug in the higher
levels of the kernel rather than the RAID-5 driver (the work-around
services both requests, whereas in older versions we serviced just one
of them).

Gadi


>From [EMAIL PROTECTED] Thu Dec  2 08:05:12 1999
Date: Thu, 2 Dec 1999 08:04:59 +0200 (IST)
From: Gadi Oxman <[EMAIL PROTECTED]>
X-Sender: gadio@localhost
To: Jakob Østergaard <[EMAIL PROTECTED]>
cc: linux-raid <[EMAIL PROTECTED]>,
    Deanne O'Connell <[EMAIL PROTECTED]>
Subject: Re: strange messages from raid5
In-Reply-To: <[EMAIL PROTECTED]>
Message-ID: <Pine.LNX.3.96.991202075634.999E-100000@localhost>
MIME-Version: 1.0
Content-Type: TEXT/PLAIN; charset=US-ASCII
Content-Transfer-Encoding: QUOTED-PRINTABLE
Status: RO
X-Status:

Hi,

The first lines (bug: sector %d exists, bh_new...) probably do mean that
there is a kernel bug lurking somewhere. They tell us that the RAID-5
layer received two requests for the same physical sector on the RAID
array pointed to by two different buffer head structures at the same
time (that is, the second request arrived before the first was completed).

It might be OK, but more likely it's a race somewhere (perhaps harmless;
we don't know where it is). Early versions of the RAID-5 code used to crash
on this, as they continued with the "new" request and forgot the "old" one
completely (locking up kernel code which waited on the old buffers).

In current versions, we wait for the old request to complete and then
issue the "new" one for the same sector, which is more or less the
behavior of a standard non-RAID disk drive.

The condition is likely to happen on a standard disk as well; we are just
not able to detect it there, since we don't need to keep a physical
sector --> buffer_head mapping for a non-RAID drive, and always follow the
mapping in the opposite direction, buffer_head --> physical sector.
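
To make that concrete, here is a toy userspace sketch of such a physical
sector --> buffer mapping; the hash table and names are made up for
illustration and are much simpler than the real stripe cache:

/* Toy sketch: keep a physical sector --> in-flight buffer mapping, as
 * the RAID-5 layer effectively does, and report when a second, different
 * buffer shows up for a sector that is already being serviced. */
#include <stdio.h>
#include <stdlib.h>

#define TABLE_SIZE 256

struct inflight {
        unsigned long sector;
        void *bh;                       /* stand-in for struct buffer_head * */
        struct inflight *next;
};

static struct inflight *table[TABLE_SIZE];

static void add_request(unsigned long sector, void *bh)
{
        unsigned int h = sector % TABLE_SIZE;

        for (struct inflight *p = table[h]; p; p = p->next) {
                if (p->sector == sector && p->bh != bh) {
                        /* the "bug: ... exists" case in the driver */
                        printf("sector %lu already in flight via %p, new %p\n",
                               sector, p->bh, bh);
                        return;
                }
        }

        struct inflight *n = malloc(sizeof(*n));
        if (!n)
                return;
        n->sector = sector;
        n->bh = bh;
        n->next = table[h];
        table[h] = n;
}

int main(void)
{
        int a, b;                       /* two distinct "buffers" */

        add_request(13749952, &a);
        add_request(13749952, &b);      /* same sector, different buffer */
        return 0;
}

A plain disk driver keeps no such table, so when a second buffer head
arrives for a sector that is already in flight it is simply queued and the
duplication goes unnoticed.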

Gadi

On Thu, 2 Dec 1999, Jakob Østergaard wrote:

> On Thu, Dec 02, 1999 at 12:11:33PM +0800, Deanne O'Connell wrote:
> > Hi all,
> >
> > I am running redhat 6.0, with kernel 2.2.13 all patched up to run raid, on a
> > dual celeron 400 system.
> > I have 3x 18.2 gig atlas 4 lvd drives all on one cable, configured as
> > follows;
> [snip]
>
> Ok, RAID usually works, so start by looking elsewhere:
>  *) Which compiler do you use ?  (gcc -v)
>  *) What kernel exactly ?  2.2.13+raid_patches or 2.2.13acX or ?
>
> Of course your system isn't overclocked,  right ?   :)
> What about the memory ?  You don't receive sig-11 messages from gcc
> when compiling the kernel or something ?
>
> RAID _should_ work; these errors are almost always caused by hardware or
> not-directly-RAID-related software.
>
> --
> ................................................................
> : [EMAIL PROTECTED]  : And I see the elder races,         :
> :.........................: putrid forms of man                :
> :   Jakob Østergaard        : See him rise and claim the earth,  :
> :        OZ9ABN           : his downfall is at hand.           :
> :.........................:............{Konkhra}...............:
>
>
