Segfaults in mkdir under high load. Software or hardware?

2005-08-03 Thread Jules Colding
Hi,

I am experiencing segfaults in mkdir, and mkdir alone, under high load.
I have tried to rule out the RAM by letting memtest86plus run overnight
(no errors were found). dmesg also reports other filesystem-related
segfaults, plus some messages I am less sure how to interpret.

I am not really qualified to guess whether this is software or hardware,
so please treat this mail as information for those more qualified. Info
below.

Regards,
  jules


# Hardware #
Dual AMD Opteron 252.
8x1GB ECC RAM (CMX1024RE-3200-Black).
LSI Logic MegaRAID SCSI 320-4X.
4 Seagate Cheetah 73.4 GB 15KRPM Ultra 320 68pin SCSI in raid5 (ST373453LW).
Tyan Thunder K8W S2885.
Kernel gentoo-sources 2.6.12-r6.
Most filesystems are mounted as reiserfs with noatime and notail.
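
For reference, that corresponds to a mount invocation like the following
(device and mountpoint are examples only, not my actual layout):

  mount -t reiserfs -o noatime,notail /dev/sda3 /home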

# Output of script #
./memtest.sh: line 107: 19536 Segmentation fault  mkdir $j
./memtest.sh: line 107: 19553 Segmentation fault  mkdir $j
Inconsistency detected by ld.so: dynamic-link.h: 151: elf_get_dynamic_info: Assertion `info[20]->d_un.d_val == 7' failed!
[the same ld.so assertion failure was printed seven times]


# /var/log/messages #
Aug  2 10:43:33 omc-2 [393507.742750] mkdir[19536]: segfault at  rip 0040184d rsp 7fe2c4b0 error 4
Aug  2 10:43:33 omc-2 [393507.818660] mkdir[19553]: segfault at  rip 0040184d rsp 7ffdb2e0 error 4
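
As far as I understand the x86-64 fault report, the trailing "error 4" is
the page-fault error code: bit 0 = protection violation (0 means page not
present), bit 1 = write (0 means read), bit 2 = user mode. So "error 4"
should mean a user-mode read of an unmapped address. A throwaway decode,
assuming that bit layout:

  err=4
  [ $((err & 1)) -ne 0 ] && echo "protection violation" || echo "page not present"
  [ $((err & 2)) -ne 0 ] && echo "write access" || echo "read access"
  [ $((err & 4)) -ne 0 ] && echo "user mode" || echo "kernel mode"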

# The script #
#!/bin/bash
#
# memtest.sh
#
# Shell script to help isolate memory failures under linux
#
# Author: Doug Ledford  + contributors
#
# (C) Copyright 2000-2002 Doug Ledford; Red Hat, Inc.
# This shell script is released under the terms of the GNU General
# Public License Version 2, June 1991.  If you do not have a copy
# of the GNU General Public License Version 2, then one may be
# retrieved from http://people.redhat.com/dledford/GPL.html
#
# Note, this needs bash2 for the wait command support.

# This is the directory where we will run the tests
TEST_DIR=/home/colding/tmp

# The location of the linux kernel source file we will be using
if [ -z "$SOURCE_FILE" ]; then
  SOURCE_FILE=$TEST_DIR/linux.tar.gz
fi

if [ ! -f "$SOURCE_FILE" ]; then
  echo "Missing source file $SOURCE_FILE"
  exit 1
fi

# How many passes to run of this test, higher numbers are better
if [ -z "$NR_PASSES" ]; then
  NR_PASSES=20
fi

# Guess how many megs the unpacked archive is
if [ -z "$MEG_PER_COPY" ]; then
  MEG_PER_COPY=$(ls -l $SOURCE_FILE | awk '{print int($5/1024/1024) * 4}')
fi

# How many trees do we have to unpack in order to make our trees be larger
# than physical RAM?  If we don't unpack more data than memory can hold
# before we start to run the diff program on the trees then we won't
# actually flush the data to disk and force the system to reread the data
# from disk.  Instead, the system will do everything in RAM.  That doesn't
# work (as far as the memory test is concerned).  It's the simultaneous
# unpacking of data in memory and the read/writes to hard disk via DMA that
# breaks the memory subsystem in most cases.  Doing everything in RAM without
# causing disk I/O will pass bad memory far more often than when you add
# in the disk I/O.
if [ -z "$NR_SIMULTANEOUS" ]; then
  NR_SIMULTANEOUS=$(free | awk -v meg_per_copy=$MEG_PER_COPY 'NR == 2 {print 
int($2*1.5/1024/meg_per_copy + (($2/1024)%meg_per_copy >= (meg_per_copy/2)) + 
(($2/1024/32) < 1))}')
fi
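
# Worked example of the sizing above (my numbers, assuming this 8GB box
# and a ~40MB linux.tar.gz): free reports total memory in KB, so $2 is
# 8388608; MEG_PER_COPY = int(40) * 4 = 160, and the leading term works
# out to int(8388608 * 1.5 / 1024 / 160) = int(76.8) = 76 simultaneous
# trees, i.e. roughly 1.5x physical RAM worth of unpacked source.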

# Should we unpack/diff the $NR_SIMULTANEOUS trees in series or in parallel?
if [ ! -z "$PARALLEL" ]; then
  PARALLEL="yes"
else
  PARALLEL="no"
fi
# Parallel mode is forced on here, which makes the check above moot
PARALLEL="yes"

if [ ! -z "$JUST_INFO" ]; then
  echo "TEST_DIR:   $TEST_DIR"
  echo "SOURCE_FILE:$SOURCE_FILE"
  echo "NR_PASSES:  $NR_PASSES"
  echo "MEG_PER_COPY:   $MEG_PER_COPY"
  echo "NR_SIMULTANEOUS:$NR_SIMULTANEOUS"
  echo "PARALLEL:   $PARALLEL"
  echo
  exit
fi

cd $TEST_DIR

# Remove any possible left over directories from a cancelled previous run
rm -fr linux linux.orig linux.pass.*

# Unpack the one copy of the source tree that we will be comparing against
tar -xzf $SOURCE_FILE
mv linux linux.orig

i=0
while [ "$i" -lt "$NR_PASSES" ]; do
  j=0
  while [ "$j" -lt "$NR_SIMULTANEOUS" ]; do
    if [ $PARALLEL = "yes" ]; then
      (mkdir $j; tar -xzf $SOURCE_FILE -C $j; mv $j/linux linux.pass.$j; rmdir $j) &
    else
      mkdir $j
      tar -xzf $SOURCE_FILE -C $j
      mv $j/linux linux.pass.$j
      rmdir $j
    fi
    j=$((j + 1))
  done
  # Wait for all background unpacks to finish before comparing; by now the
  # trees should exceed physical RAM and have been flushed to disk.
  wait
  j=0
  while [ "$j" -lt "$NR_SIMULTANEOUS" ]; do
    if [ $PARALLEL = "yes" ]; then
      # Any diff output indicates corruption between unpack and reread
      (diff -rN linux.orig linux.pass.$j; rm -fr linux.pass.$j) &
    else
      diff -rN linux.orig linux.pass.$j
      rm -fr linux.pass.$j
    fi
    j=$((j + 1))
  done
  wait
  i=$((i + 1))
done
# Clean up the reference tree once all passes are done
rm -fr linux.orig
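For anyone who wants to reproduce this, the script is driven entirely by
environment variables, e.g. (the path is the script's own default):

  # dry run: just print the parameters the script computed
  JUST_INFO=1 ./memtest.sh

  # real run: 20 passes, unpacking the trees in parallel
  SOURCE_FILE=/home/colding/tmp/linux.tar.gz NR_PASSES=20 PARALLEL=yes ./memtest.sh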

LSI MegaRAID problems

2007-05-30 Thread Jules Colding
Hi,

I have a "LSI Logic MegaRAID SCSI 320-4x" adapter with an external raid5
array of 5 Seagate ST336754LW and XFS as fs on it. The device in
question is /dev/sdb and the box is a dual Opteron 252.

I've recently started to see this in the log almost whenever I touch the
filesystem:

May 30 12:22:56 omc-2 [ 1120.991356] megaraid: aborting-109150 cmd=28 
May 30 12:22:56 omc-2 [ 1120.991366] megaraid abort: 109150:68[255:129], fw owner
May 30 12:22:56 omc-2 [ 1120.991371] megaraid: aborting-109151 cmd=28 
May 30 12:22:56 omc-2 [ 1120.991374] megaraid abort: 109151:64[255:129], fw owner
May 30 12:22:56 omc-2 [ 1120.991379] megaraid: 2 outstanding commands. Max wait 300 sec
May 30 12:22:56 omc-2 [ 1120.991382] megaraid mbox: Wait for 2 commands to complete:300
May 30 12:23:01 omc-2 [ 1126.006002] megaraid mbox: Wait for 2 commands to complete:295
May 30 12:23:06 omc-2 [ 1131.020774] megaraid mbox: Wait for 2 commands to complete:290
May 30 12:23:11 omc-2 [ 1136.035548] megaraid mbox: Wait for 2 commands to complete:285
May 30 12:23:16 omc-2 [ 1141.050325] megaraid mbox: Wait for 2 commands to complete:280
May 30 12:23:21 omc-2 [ 1146.065098] megaraid mbox: Wait for 2 commands to complete:275
May 30 12:23:26 omc-2 [ 1151.083870] megaraid mbox: Wait for 0 commands to complete:270
May 30 12:23:26 omc-2 [ 1151.083874] megaraid mbox: reset sequence completed sucessfully
May 30 12:23:26 omc-2 [ 1151.083979] sd 0:4:1:0: SCSI error: return code = 0x00040001
May 30 12:23:26 omc-2 [ 1151.083983] end_request: I/O error, dev sdb, sector 95601663
May 30 12:23:26 omc-2 [ 1151.084124] sd 0:4:1:0: SCSI error: return code = 0x00040001
May 30 12:23:26 omc-2 [ 1151.084128] end_request: I/O error, dev sdb, sector 95601535
May 30 12:23:26 omc-2 [ 1151.084332] sd 0:4:1:0: SCSI error: return code = 0x00040001
May 30 12:23:26 omc-2 [ 1151.084334] end_request: I/O error, dev sdb, sector 95601535
May 30 12:23:27 omc-2 [ 1152.725763] sd 0:4:1:0: SCSI error: return code = 0x00040001
May 30 12:23:27 omc-2 [ 1152.725768] end_request: I/O error, dev sdb, sector 71411967
May 30 12:23:27 omc-2 [ 1152.725816] sd 0:4:1:0: SCSI error: return code = 0x00040001
May 30 12:23:27 omc-2 [ 1152.725818] end_request: I/O error, dev sdb, sector 71411967
May 30 12:23:31 omc-2 [ 1156.578149] sd 0:4:1:0: SCSI error: return code = 0x00040001
May 30 12:23:31 omc-2 [ 1156.578156] end_request: I/O error, dev sdb, sector 143351464
May 30 12:23:31 omc-2 [ 1156.578173] I/O error in filesystem ("sdb1") meta-data dev sdb1 block 0x88b5e69 ("xlog_iodone") error 5 buf count 10752
May 30 12:23:31 omc-2 [ 1156.578178] xfs_force_shutdown(sdb1,0x2) called from line 960 of file fs/xfs/xfs_log.c.  Return address = 0x80398b56
May 30 12:23:31 omc-2 [ 1156.578204] Filesystem "sdb1": Log I/O Error Detected.  Shutting down filesystem: sdb1
May 30 12:23:31 omc-2 [ 1156.578207] Please umount the filesystem, and rectify the problem(s)
May 30 12:23:31 omc-2 [ 1156.578251] sd 0:4:1:0: SCSI error: return code = 0x00040001
May 30 12:23:31 omc-2 [ 1156.578253] end_request: I/O error, dev sdb, sector 63
May 30 12:24:13 omc-2 [ 1198.747915] xfs_force_shutdown(sdb1,0x1) called from line 424 of file fs/xfs/xfs_rw.c.  Return address = 0x803afc2a
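
For what it's worth, this is the sequence I follow after such a forced
shutdown (standard XFS recovery as I understand it, and the mountpoint is
just mine):

  umount /dev/sdb1
  xfs_repair -n /dev/sdb1   # no-modify mode first, to see what it would touch
  xfs_repair /dev/sdb1      # needs a clean log; a mount+umount replays it
  mount /dev/sdb1 /mnt/raid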


One of the drives in the array has been taken offline after reporting
media errors. I'm waiting for a replacement, but the recurring errors
worry me...
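
In case the drive-level view helps: the disks behind the adapter can be
asked for their SMART status directly, if your smartmontools build
supports the megaraid passthrough (the device numbers below are guesses
for my 5-drive array):

  for n in 0 1 2 3 4; do
    smartctl -H -d megaraid,$n /dev/sdb
  done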

Any help/advice would be greatly appreciated.

Thanks a lot in advance,
  jules


PS: I'm running a distribution kernel, but having seen zero responses on
the gentoo list I dared to write here. The kernel is gentoo-sources
2.6.20-r8.
 
