On Sat, 27 May 2006, Neil Brown wrote:

> On Friday May 26, [EMAIL PROTECTED] wrote:
> > On Tue, 23 May 2006, Neil Brown wrote:
> > 
> > i applied them against 2.6.16.18 and two days later i got my first hang... 
> > below is the stripe_cache foo.
> > 
> > thanks
> > -dean
> > 
> > neemlark:~# cd /sys/block/md4/md/
> > neemlark:/sys/block/md4/md# cat stripe_cache_active 
> > 255
> > 0 preread
> > bitlist=0 delaylist=255
> > neemlark:/sys/block/md4/md# cat stripe_cache_active 
> > 255
> > 0 preread
> > bitlist=0 delaylist=255
> > neemlark:/sys/block/md4/md# cat stripe_cache_active 
> > 255
> > 0 preread
> > bitlist=0 delaylist=255
> 
> Thanks.  This narrows it down quite a bit... too much in fact:  I can
> now say for sure that this cannot possibly happen :-)

heheh.  fwiw the box has traditionally been rock solid.. it's ancient 
though... dual p3 750 w/440bx chipset and pc100 ecc memory... 3ware 7508 
w/seagate 400GB disks... i really don't suspect the hardware all that much 
because the freeze seems to be rather consistent as to time of day 
(overnight while i've got 3x rdiff-backup, plus bittorrent, plus updatedb 
going).  unfortunately it doesn't happen every time... but every time i've 
unstuck the box i've noticed those processes going.

other tidbits... md4 is an lvm2 PV ... there are two LVs, one with ext3
and one with xfs.


> Two things that might be helpful:
>   1/ Do you have any other patches on 2.6.16.18 other than the 3 I
>     sent you?  If you do I'd like to see them, just in case.

it was just 2.6.16.18 plus the 3 you sent... i attached the .config
(it's rather full -- based off debian kernel .config).

maybe there's a compiler bug:

gcc version 4.0.4 20060507 (prerelease) (Debian 4.0.3-3)


>   2/ The message.gz you sent earlier with the
>           echo t > /proc/sysrq-trigger
>      trace in it didn't contain information about md4_raid5 - the 
>      controlling thread for that array.  It must have missed out
>      due to a buffer overflowing.  Next time it happens, could you
>      get this trace again and see if you can find out what
>      md4_raid5 is doing.  Maybe do the 'echo t' several times.
>      I think that you need a kernel recompile to make the dmesg
>      buffer larger.

ok i'll set CONFIG_LOG_BUF_SHIFT=18 and rebuild ...
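
for reference, a minimal sketch of the arithmetic and the check involved.
CONFIG_LOG_BUF_SHIFT is the base-2 log of the kernel log buffer size, so
18 means 2^18 bytes (256 KiB).  the helper names log_buf_bytes and
trace_has_thread, and the /tmp/trace.txt path, are made up for
illustration; they are not from any patch in this thread.

```shell
#!/bin/sh
# Sketch only: both helper names are hypothetical.

# Print the kernel log buffer size in bytes for a given
# CONFIG_LOG_BUF_SHIFT value (the size is 2^shift bytes).
log_buf_bytes() {
    echo $((1 << $1))
}

# Succeed if the saved trace mentions the named kernel thread.
#   $1 = file holding saved dmesg output
#   $2 = thread name, e.g. md4_raid5
trace_has_thread() {
    grep -q -- "$2" "$1"
}

# Possible usage next time the array hangs (as root):
#   echo t > /proc/sysrq-trigger
#   dmesg > /tmp/trace.txt
#   trace_has_thread /tmp/trace.txt md4_raid5 || echo "md4_raid5 missing"
```

if the check fails, the trace for md4_raid5 most likely fell out of the
buffer again, and repeating the 'echo t' or raising the shift further
would be the next thing to try.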

note that i'm going to include two more patches in this next kernel:

http://lkml.org/lkml/2006/5/23/42
http://arctic.org/~dean/patches/linux-2.6.16.5-no-treason.patch

the first was the Jens Axboe patch you mentioned here recently (for
accounting with i/o barriers)... and the second gets rid of the tcp
treason uncloaked messages.


> Thanks for your patience - this must be very frustrating for you.

fortunately i'm the primary user of this box... and the bug doesn't
corrupt anything... and i can unstick it easily :)  so it's not all that
frustrating actually.

-dean

Attachment: config.gz
Description: Binary data
