Segfaults in mkdir under high load. Software or hardware?
Hi,

I am experiencing segfaults in mkdir, and mkdir alone, under high load. I have tried to rule out the RAM by letting memtest86plus run overnight (no errors found). dmesg does report other filesystem-related segfaults and some others that I am not so sure of. I am not really qualified to guess whether this is software or hardware, so please see this mail as information for those more qualified. Info below.

Regards,
jules

# Hardware #

Dual AMD Opteron 252.
8x1GB ECC RAM (CMX1024RE-3200-Black).
LSI Logic MegaRAID SCSI 320-4X.
4 Seagate Cheetah 73.4 GB 15KRPM Ultra 320 68pin SCSI in raid5 (ST373453LW).
Tyan Thunder K8W S2885.
Kernel gentoo-sources 2.6.12-r6.
Most filesystems mounted as reiserfs with noatime and notail.

# Output of script #

./memtest.sh: line 107: 19536 Segmentation fault      mkdir $j
./memtest.sh: line 107: 19553 Segmentation fault      mkdir $j
Inconsistency detected by ld.so: dynamic-link.h: 151: elf_get_dynamic_info: Assertion `info[20]->d_un.d_val == 7' failed!
(the ld.so line above repeated seven times in total)

# /var/log/messages #

Aug  2 10:43:33 omc-2 [393507.742750] mkdir[19536]: segfault at rip 0040184d rsp 7fe2c4b0 error 4
Aug  2 10:43:33 omc-2 [393507.818660] mkdir[19553]: segfault at rip 0040184d rsp 7ffdb2e0 error 4
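For what it's worth, d_tag 20 is DT_PLTREL and value 7 is DT_RELA, so the ld.so assertion above suggests the dynamic section ld.so read from the mapped binary was inconsistent at that moment. A quick sanity check of the on-disk files (just a sketch; the paths are from this box and may differ):

readelf -d /bin/mkdir | grep PLTREL             # expect RELA on x86-64
md5sum /bin/mkdir /lib64/ld-linux-x86-64.so.2   # rerun under load; sums should stay identical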
# The script #

#!/bin/bash
#
# memtest.sh
#
# Shell script to help isolate memory failures under linux
#
# Author: Doug Ledford + contributors
#
# (C) Copyright 2000-2002 Doug Ledford; Red Hat, Inc.
# This shell script is released under the terms of the GNU General
# Public License Version 2, June 1991.  If you do not have a copy
# of the GNU General Public License Version 2, then one may be
# retrieved from http://people.redhat.com/dledford/GPL.html
#
# Note, this needs bash2 for the wait command support.

# This is where we will run the tests at
TEST_DIR=/home/colding/tmp

# The location of the linux kernel source file we will be using
if [ -z "$SOURCE_FILE" ]; then
    SOURCE_FILE=$TEST_DIR/linux.tar.gz
fi

if [ ! -f "$SOURCE_FILE" ]; then
    echo "Missing source file $SOURCE_FILE"
    exit 1
fi

# How many passes to run of this test, higher numbers are better
if [ -z "$NR_PASSES" ]; then
    NR_PASSES=20
fi

# Guess how many megs the unpacked archive is
if [ -z "$MEG_PER_COPY" ]; then
    MEG_PER_COPY=$(ls -l $SOURCE_FILE | awk '{print int($5/1024/1024) * 4}')
fi

# How many trees do we have to unpack in order to make our trees be larger
# than physical RAM?  If we don't unpack more data than memory can hold
# before we start to run the diff program on the trees then we won't
# actually flush the data to disk and force the system to reread the data
# from disk.  Instead, the system will do everything in RAM.  That doesn't
# work (as far as the memory test is concerned).  It's the simultaneous
# unpacking of data in memory and the read/writes to hard disk via DMA that
# breaks the memory subsystem in most cases.  Doing everything in RAM without
# causing disk I/O will pass bad memory far more often than when you add
# in the disk I/O.
if [ -z "$NR_SIMULTANEOUS" ]; then
    NR_SIMULTANEOUS=$(free | awk -v meg_per_copy=$MEG_PER_COPY 'NR == 2 {print int($2*1.5/1024/meg_per_copy + (($2/1024)%meg_per_copy >= (meg_per_copy/2)) + (($2/1024/32) < 1))}')
fi

# Should we unpack/diff the $NR_SIMULTANEOUS trees in series or in parallel?
if [ ! -z "$PARALLEL" ]; then
    PARALLEL="yes"
else
    PARALLEL="no"
fi
PARALLEL="yes"   # forced on for this run

if [ ! -z "$JUST_INFO" ]; then
    echo "TEST_DIR:        $TEST_DIR"
    echo "SOURCE_FILE:     $SOURCE_FILE"
    echo "NR_PASSES:       $NR_PASSES"
    echo "MEG_PER_COPY:    $MEG_PER_COPY"
    echo "NR_SIMULTANEOUS: $NR_SIMULTANEOUS"
    echo "PARALLEL:        $PARALLEL"
    echo
    exit
fi

cd $TEST_DIR

# Remove any possible left over directories from a cancelled previous run
rm -fr linux linux.orig linux.pass.*

# Unpack the one copy of the source tree that we will be comparing against
tar -xzf $SOURCE_FILE
mv linux linux.orig

i=0
while [ "$i" -lt "$NR_PASSES" ]; do
    j=0
    while [ "$j" -lt "$NR_SIMULTANEOUS" ]; do
        if [ $PARALLEL = "yes" ]; then
            (mkdir $j; tar -xzf $SOURCE_FILE -C $j; mv $j/linux linux.pass.$j; rmdir $j) &
        else
            # NOTE: the post was truncated at this point; everything from here
            # on is a reconstruction of the script's remaining unpack/diff
            # cycle, not the verbatim original.
            mkdir $j
            tar -xzf $SOURCE_FILE -C $j
            mv $j/linux linux.pass.$j
            rmdir $j
        fi
        j=$((j + 1))
    done
    # Let any backgrounded unpacks finish before comparing
    wait
    # Compare every tree against the pristine copy; any diff output means the
    # data was corrupted somewhere between RAM and disk
    j=0
    while [ "$j" -lt "$NR_SIMULTANEOUS" ]; do
        if [ $PARALLEL = "yes" ]; then
            (diff -rN linux.orig linux.pass.$j; rm -fr linux.pass.$j) &
        else
            diff -rN linux.orig linux.pass.$j
            rm -fr linux.pass.$j
        fi
        j=$((j + 1))
    done
    wait
    i=$((i + 1))
done
rm -fr linux.orig
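The script takes its parameters from the environment, so a run can be shaped like this (illustrative; the values shown are just the defaults from above):

JUST_INFO=1 ./memtest.sh        # print the derived parameters and exit
SOURCE_FILE=/home/colding/tmp/linux.tar.gz NR_PASSES=20 ./memtest.sh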
LSI MegaRAID problems
Hi, I have a "LSI Logic MegaRAID SCSI 320-4x" adapter with an external raid5 array of 5 Seagate ST336754LW and XFS as fs on it. The device in question is /dev/sdb and the box is a dual Opteron 252. I've recently started to see this in the log almost whenever I touch the filesystem: May 30 12:22:56 omc-2 [ 1120.991356] megaraid: aborting-109150 cmd=28 May 30 12:22:56 omc-2 [ 1120.991366] megaraid abort: 109150:68[255:129], fw owner May 30 12:22:56 omc-2 [ 1120.991371] megaraid: aborting-109151 cmd=28 May 30 12:22:56 omc-2 [ 1120.991374] megaraid abort: 109151:64[255:129], fw owner May 30 12:22:56 omc-2 [ 1120.991379] megaraid: 2 outstanding commands. Max wait 300 sec May 30 12:22:56 omc-2 [ 1120.991382] megaraid mbox: Wait for 2 commands to complete:300 May 30 12:23:01 omc-2 [ 1126.006002] megaraid mbox: Wait for 2 commands to complete:295 May 30 12:23:06 omc-2 [ 1131.020774] megaraid mbox: Wait for 2 commands to complete:290 May 30 12:23:11 omc-2 [ 1136.035548] megaraid mbox: Wait for 2 commands to complete:285 May 30 12:23:16 omc-2 [ 1141.050325] megaraid mbox: Wait for 2 commands to complete:280 May 30 12:23:21 omc-2 [ 1146.065098] megaraid mbox: Wait for 2 commands to complete:275 May 30 12:23:26 omc-2 [ 1151.083870] megaraid mbox: Wait for 0 commands to complete:270 May 30 12:23:26 omc-2 [ 1151.083874] megaraid mbox: reset sequence completed sucessfully May 30 12:23:26 omc-2 [ 1151.083979] sd 0:4:1:0: SCSI error: return code = 0x00040001 May 30 12:23:26 omc-2 [ 1151.083983] end_request: I/O error, dev sdb, sector 95601663 May 30 12:23:26 omc-2 [ 1151.084124] sd 0:4:1:0: SCSI error: return code = 0x00040001 May 30 12:23:26 omc-2 [ 1151.084128] end_request: I/O error, dev sdb, sector 95601535 May 30 12:23:26 omc-2 [ 1151.084332] sd 0:4:1:0: SCSI error: return code = 0x00040001 May 30 12:23:26 omc-2 [ 1151.084334] end_request: I/O error, dev sdb, sector 95601535 May 30 12:23:27 omc-2 [ 1152.725763] sd 0:4:1:0: SCSI error: return code = 0x00040001 May 30 12:23:27 omc-2 [ 1152.725768] end_request: I/O error, dev sdb, sector 71411967 May 30 12:23:27 omc-2 [ 1152.725816] sd 0:4:1:0: SCSI error: return code = 0x00040001 May 30 12:23:27 omc-2 [ 1152.725818] end_request: I/O error, dev sdb, sector 71411967 May 30 12:23:31 omc-2 [ 1156.578149] sd 0:4:1:0: SCSI error: return code = 0x00040001 May 30 12:23:31 omc-2 [ 1156.578156] end_request: I/O error, dev sdb, sector 143351464 May 30 12:23:31 omc-2 [ 1156.578173] I/O error in filesystem ("sdb1") meta-data dev sdb1 block 0x88b5e69 ("xlog_iodone") error 5 buf count 10752 May 30 12:23:31 omc-2 [ 1156.578178] xfs_force_shutdown(sdb1,0x2) called from line 960 of file fs/xfs/xfs_log.c. Return address = 0x80398b56 May 30 12:23:31 omc-2 [ 1156.578204] Filesystem "sdb1": Log I/O Error Detected. Shutting down filesystem: sdb1 May 30 12:23:31 omc-2 [ 1156.578207] Please umount the filesystem, and rectify the problem(s) May 30 12:23:31 omc-2 [ 1156.578251] sd 0:4:1:0: SCSI error: return code = 0x00040001 May 30 12:23:31 omc-2 [ 1156.578253] end_request: I/O error, dev sdb, sector 63 May 30 12:24:13 omc-2 [ 1198.747915] xfs_force_shutdown(sdb1,0x1) called from line 424 of file fs/xfs/xfs_rw.c. Return address = 0x803afc2a One of the drives in the array has been put offline after having seen media errors. I'm waiting for a replacement but the recurring errors worry me... Any help/advises would be greatly appreciated. Thanks a lot in advance, jules PS: I'm running a distribution kernel, but having seen zero responses on the gentoo list I dared to write here. 
The kernel is gentoo-sources 2.6.20-r8.
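In case it helps, here is my attempt at decoding the return code above, based on the driver/host/msg/status byte packing of the SCSI midlayer's result word (a sketch only; the DID_* reading is my guess from include/scsi/scsi.h):

r=$((0x00040001))
printf 'driver=0x%02x host=0x%02x msg=0x%02x status=0x%02x\n' \
    $(( (r >> 24) & 0xff )) $(( (r >> 16) & 0xff )) \
    $(( (r >>  8) & 0xff )) $((  r        & 0xff ))
# prints: driver=0x00 host=0x04 msg=0x00 status=0x01
# If I read scsi.h right, host byte 0x04 is DID_BAD_TARGET, which would at
# least be consistent with the controller having taken a drive offline.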