Re: Bonnie++ with 1024k stripe SW/RAID5 causes kernel to goto D-state
Justin Piszcz wrote:
> Kernel: 2.6.23-rc8 (older kernels do this as well)
>
> When running the following command:
>
> /usr/bin/time /usr/sbin/bonnie++ -d /x/test -s 16384 -m p34 -n 16:10:16:64
>
> It hangs unless I increase various md/raid parameters such as
> stripe_cache_size, etc.
>
> # ps auxww | grep D
> USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> root       276  0.0  0.0      0     0 ?        D    12:14   0:00 [pdflush]
> root       277  0.0  0.0      0     0 ?        D    12:14   0:00 [pdflush]
> root      1639  0.0  0.0      0     0 ?        D<   12:14   0:00 [xfsbufd]
> root      1767  0.0  0.0   8100   420 ?        Ds   12:14   0:00
> root      2895  0.0  0.0   5916   632 ?        Ds   12:15   0:00 /sbin/syslogd -r
>
> See the bottom for more details.
>
> Is this normal? Does md only work without tuning up to a certain stripe
> size? I use a RAID 5 with a 1024k stripe, which works fine with many
> optimizations, but if I just boot the system and run bonnie++ on it
> without applying the optimizations, it hangs in D-state. When I run the
> optimizations, it exits D-state. Pretty weird?

Not at all. 1024k stripes are way outside the norm. If you do something
way outside the norm and don't tune for it in advance, don't be terribly
surprised when something like bonnie++ brings your box to its knees.

That's not to say we couldn't make md auto-tune itself more intelligently,
but this isn't really a bug. With a sufficiently huge amount of RAM you
would be able to dynamically allocate the buffers that you're not
pre-allocating with stripe_cache_size, but bonnie++ is eating that up in
this case.

--
Chris
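For a sense of scale, here is a back-of-the-envelope sketch of what the
stripe_cache_size used in this thread costs in memory, assuming the common
rule of thumb that the raid5 stripe cache preallocates roughly one 4 KiB
page per member device per cache entry (the 10-disk array is the one
described below):

# Approximate stripe cache footprint: entries * disks * page size.
echo "$(( 16384 * 10 * 4096 / 1024 / 1024 )) MiB"   # tuned value  -> 640 MiB
echo "$((   256 * 10 * 4096 / 1024 / 1024 )) MiB"   # default 256  -> 10 MiB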
Bonnie++ with 1024k stripe SW/RAID5 causes kernel to goto D-state
Kernel: 2.6.23-rc8 (older kernels do this as well)

When running the following command:

/usr/bin/time /usr/sbin/bonnie++ -d /x/test -s 16384 -m p34 -n 16:10:16:64

It hangs unless I increase various md/raid parameters such as
stripe_cache_size, etc.

# ps auxww | grep D
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       276  0.0  0.0      0     0 ?        D    12:14   0:00 [pdflush]
root       277  0.0  0.0      0     0 ?        D    12:14   0:00 [pdflush]
root      1639  0.0  0.0      0     0 ?        D<   12:14   0:00 [xfsbufd]
root      1767  0.0  0.0   8100   420 ?        Ds   12:14   0:00
root      2895  0.0  0.0   5916   632 ?        Ds   12:15   0:00 /sbin/syslogd -r

See the bottom for more details.

Is this normal? Does md only work without tuning up to a certain stripe
size? I use a RAID 5 with a 1024k stripe, which works fine with many
optimizations, but if I just boot the system and run bonnie++ on it
without applying the optimizations, it hangs in D-state. When I run the
optimizations, it exits D-state. Pretty weird?

Optimization script (again, without this, bonnie++ will hang in D-state
until it is run):

#!/bin/bash

# Source profile.
. /etc/profile

# Tell the user what's going on.
echo "Optimizing RAID Arrays..."

# Define DISKS.
cd /sys/block
DISKS=$(/bin/ls -1d sd[a-z])

# This step must come first.
# See: http://www.3ware.com/KB/article.aspx?id=11050
echo "Setting max_sectors_kb to 128 KiB"
for i in $DISKS
do
  echo "Setting /dev/$i to 128 KiB..."
  echo 128 > /sys/block/"$i"/queue/max_sectors_kb
done

# This step comes next.
echo "Setting nr_requests to 512"
for i in $DISKS
do
  echo "Setting /dev/$i to 512 requests..."
  echo 512 > /sys/block/"$i"/queue/nr_requests
done

# Set read-ahead (65536 x 512-byte sectors = 32 MiB).
echo "Setting read-ahead to 32 MiB for /dev/md3"
blockdev --setra 65536 /dev/md3

# Set stripe_cache_size for RAID5.
echo "Setting stripe_cache_size to 16384 for /dev/md3"
echo 16384 > /sys/block/md3/md/stripe_cache_size

# Set minimum and maximum raid resync speed to 30 MB/s.
echo "Setting minimum and maximum resync speed to 30 MB/s..."
echo 30000 > /sys/block/md0/md/sync_speed_min
echo 30000 > /sys/block/md0/md/sync_speed_max
echo 30000 > /sys/block/md1/md/sync_speed_min
echo 30000 > /sys/block/md1/md/sync_speed_max
echo 30000 > /sys/block/md2/md/sync_speed_min
echo 30000 > /sys/block/md2/md/sync_speed_max
echo 30000 > /sys/block/md3/md/sync_speed_min
echo 30000 > /sys/block/md3/md/sync_speed_max

# Disable NCQ on all disks (a queue_depth of 1 turns NCQ off).
echo "Disabling NCQ on all disks..."
for i in $DISKS
do
  echo "Disabling NCQ on $i"
  echo 1 > /sys/block/"$i"/device/queue_depth
done

--

Once this runs, everything works fine again.
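One way to sanity-check that the script above actually took effect is to
read the same sysfs attributes back; a minimal sketch, assuming the same
sd[a-z] member disks and /dev/md3 array used in this thread:

for i in /sys/block/sd[a-z]
do
  # Report the three per-disk settings the script writes.
  echo "$i: max_sectors_kb=$(cat "$i"/queue/max_sectors_kb)" \
       "nr_requests=$(cat "$i"/queue/nr_requests)" \
       "queue_depth=$(cat "$i"/device/queue_depth)"
done
# Array-level settings: stripe cache entries and read-ahead (in 512-byte sectors).
cat /sys/block/md3/md/stripe_cache_size
blockdev --getra /dev/md3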
--

# mdadm -D /dev/md3
/dev/md3:
        Version : 00.90.03
  Creation Time : Wed Aug 22 10:38:53 2007
     Raid Level : raid5
     Array Size : 1318680576 (1257.59 GiB 1350.33 GB)
  Used Dev Size : 146520064 (139.73 GiB 150.04 GB)
   Raid Devices : 10
  Total Devices : 10
Preferred Minor : 3
    Persistence : Superblock is persistent

    Update Time : Sat Sep 29 13:05:15 2007
          State : active, resyncing
 Active Devices : 10
Working Devices : 10
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 1024K

 Rebuild Status : 8% complete

           UUID : e37a12d1:1b0b989a:083fb634:68e9eb49 (local to host p34.internal.lan)
         Events : 0.4211

    Number   Major   Minor   RaidDevice State
       0       8       33        0      active sync   /dev/sdc1
       1       8       49        1      active sync   /dev/sdd1
       2       8       65        2      active sync   /dev/sde1
       3       8       81        3      active sync   /dev/sdf1
       4       8       97        4      active sync   /dev/sdg1
       5       8      113        5      active sync   /dev/sdh1
       6       8      129        6      active sync   /dev/sdi1
       7       8      145        7      active sync   /dev/sdj1
       8       8      161        8      active sync   /dev/sdk1
       9       8      177        9      active sync   /dev/sdl1

--

NOTE: This bug is reproducible every time. Example:

$ /usr/bin/time /usr/sbin/bonnie++ -d /x/test -s 16384 -m p34 -n 16:10:16:64
Writing with putc()...

It writes for 4-5 minutes and then.. SILENCE + D-STATE. I was too late
this time :(

$ ps auxww | grep D
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root       276  1.2  0.0      0     0 ?        D    12:50   0:03 [pdflush]
root      2901  0.0  0.0   5916   632 ?        Ds   12:50   0:00 /sbin/syslogd -r
user      4571 48.0  0.0  11644  1084 pts/1    D+   12:51   1:55 /usr/sbin/bonnie++ -d /x/test -s 16384 -m p34 -n 16:10:16:64
root      4612  1.0  0.0      0     0 ?        D    12:52   0:01 [pdflush]
root      4624  5.0  0.0  40964  7436 ?        D    12:55   0:00 /usr/bin/perl -w /app/rrd-cputemp/bin/rrd_cputemp.pl
root      4684  0.0  0.0  31968  1416 ?