Thanks Andreas. I'll reconfigure the RAID and give it another shot today. Would it be reasonable to attribute the stalled writes to this I/O mismatch?
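For reference, here is how the full-stripe arithmetic from your formula works out for my current array and for the layouts you suggested (just a back-of-the-envelope sketch):

  # Full RAID stripe = data_disks * chunk_size; target is the 1MB Lustre RPC size.
  # current:  21+2 RAID 6, 128k chunks -> 21 * 128k = 2688k (> 1MB, forces read-modify-write)
  # option A: 2x (8+2) RAID 6, 128k chunks ->  8 * 128k = 1024k = 1MB
  # option B: (16+2) RAID 6, 64k chunks   -> 16 * 64k  = 1024k = 1MB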
Dan

On Thu, 2008-01-31 at 01:40 -0700, Andreas Dilger wrote:
> On Jan 30, 2008 18:32 -0800, Dan wrote:
> > I was a little uncertain of the stripe size calculation so here we go...
> > My chunk size is 128k and there are 23 disks in RAID 6 (one hot spare
> > leaves 23). That means 21 data disks? Judging by your formula I take
> > 23 * 128k, which is 2944k. Is this even close to what you intended?
> > This stripe size hangs at mount...
>
> Hmm, I don't think the mballoc code can efficiently deal with a stripe
> size larger than the RPC size (which is 1MB), because this will always
> result in a read-modify-write of the RAID stripe, as not enough data can
> be collected to fill a stripe.
>
> > I've tried to test with the lustre-iokit but the tests (writes) fail
> > on most OSTs. That is the problem I'm having after all... frustrating.
> >
> > Would it make sense to reconfigure the RAID controllers to have
> > separate groups of disks in RAID 6? For performance, is there a
> > recommended max size or number of disks for each OST? Lastly, is it
> > worthwhile to consider putting the ext3 journal on another device
> > exported from the RAID controller?
>
> Having 21 disks in the RAID set is probably too large to be practical
> because of the high overhead of doing IO of such a large size.
> Good configurations for such a system might be 2x 8+2 + spare = 21 disks
> with 128kB chunk size, or 16+2 + spare = 19 disks with 64kB chunk size.
> Both result in 1MB full stripe size, which is what mballoc and Lustre
> are optimized to by default.
>
> > > On Jan 18, 2008 16:45 -0800, Dan wrote:
> > >> I'm looking for some advice on improving disk performance and
> > >> understanding what Lustre is doing with it. Right now I have a ~28 TB
> > >> OSS with 4 OSTs on it. There are 4 clients using Lustre native - no
> > >> NFS. If I write to the Lustre volume from the clients I get odd
> > >> behavior. Typically the writes have a long pause before any data
> > >> starts hitting the disks. Then 2 or 3 of the clients will write
> > >> happily but one or two will not. Eventually Lustre will pump out a
> > >> number of I/O-related errors such as "slow i_mutex 165 seconds, slow
> > >> direct_io 32 seconds" and so on. Next, the clients that couldn't
> > >> write will catch up and pass the clients that could write. At some
> > >> point (5 minutes or so) the jobs start failing without any errors.
> > >> New jobs can be started after these fail and the pattern repeats.
> > >> Write speeds are low, around 22 MB/sec per client; the disks
> > >> shouldn't have any problem handling 4 writes at this speed!! This
> > >> did work using NFS.
> > >>
> > >> When these disks were formatted with XFS, I/O was fast. No problems
> > >> at all writing 475 MB/sec sustained per RAID controller (locally,
> > >> not via NFS). No delays. After configuring for Lustre the peak
> > >> sustained write (locally) is 230 MB/sec. It will write for about 2
> > >> minutes before logging about slow I/O. This is without any clients
> > >> connected.
> > >>
> > >> So far I've done the following:
> > >>
> > >> 1. Recompiled the SCSI driver for the RAID controller to use 1 MB
> > >>    blocks (from 256k).
> > >> 2. Adjusted MDS and OST thread counts.
> > >> 3. Tried all I/O schedulers.
> > >> 4. Tried all possible settings on the RAID controllers for caching
> > >>    and read-ahead.
> > >> 5. Some minor stuff I forgot about!
> > >>
> > >> Nothing makes a difference - same results under each configuration
> > >> except for schedulers.
> > >> When running the deadline scheduler the writes fail faster and have
> > >> delays around 30 seconds. With all others the delays range from 100
> > >> to 500 seconds.
> > >>
> > >> The system has 4 cores and 4 GB of memory with 4 x 7 TB OSTs. The
> > >> disks are in RAID 6 split between two controllers with 2 GB cache
> > >> each. One controller has the MGS/MDT on it. When running top it
> > >> indicates 2/3 to 3/4 of memory utilized and 25% CPU utilization
> > >> normally.
> > >
> > > Are you using Lustre 1.4 or 1.6? Are you mounting your OSTs with
> > > "-o extents,mballoc"? We've had Lustre OSS nodes running in excess
> > > of 2GB/s with h/w RAID controllers.
> > >
> > > Are you using partitions on your RAID device? You shouldn't - that
> > > causes unaligned IO to the device and needless read-modify-write for
> > > each RAID stripe.
> > >
> > > Is your RAID geometry efficient with 1MB IOs (e.g. 4+1 or 8+1)? If
> > > not, then you should consider mounting your OSTs with
> > > "-o stripe={raid_stripe}", where raid_stripe = N * raid_chunksize and
> > > N is the number of data disks for RAID 5 (N+1) or RAID 6 (N+2).
> > >
> > > You should download the lustre-iokit and use sgpdd-survey,
> > > obdfilter-survey, and PIOS to determine what is causing the
> > > performance bottleneck.
> > >
> > > Cheers, Andreas
> > > --
> > > Andreas Dilger
> > > Sr. Staff Engineer, Lustre Group
> > > Sun Microsystems of Canada, Inc.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Sr. Staff Engineer, Lustre Group
> Sun Microsystems of Canada, Inc.
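P.S. To make sure I've understood the mount options, here is what I plan to try after reconfiguring (a sketch only: /dev/sdc and /mnt/ost0 are placeholder names, the direct ldiskfs mount assumes a 1.4-style setup, and I'm assuming stripe= is counted in 4kB filesystem blocks, i.e. 1MB / 4kB = 256 - please correct me if any of that is wrong):

  # Sketch: mount an OST backing filesystem with extents+mballoc and a
  # 1MB stripe hint. /dev/sdc and /mnt/ost0 are hypothetical names.
  # stripe=256 assumes 4kB blocks, so 256 blocks = 1MB full stripe.
  mount -t ldiskfs -o extents,mballoc,stripe=256 /dev/sdc /mnt/ost0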
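And the baseline test I plan to run first from the lustre-iokit, to measure the raw devices below Lustre (again a sketch: the /dev/sg* names are placeholders for my two controllers, and I'll double-check the exact parameter names against the README shipped with the kit):

  # sgpdd-survey measures raw block-device throughput below Lustre/ldiskfs.
  # Parameters are passed as environment variables; /dev/sg0 and /dev/sg1
  # are placeholder sg devices for the two RAID controllers.
  size=8192 crglo=1 crghi=8 thrlo=1 thrhi=16 \
    scsidevs="/dev/sg0 /dev/sg1" ./sgpdd-survey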
_______________________________________________ Lustre-discuss mailing list Lustre-discuss@lists.lustre.org http://lists.lustre.org/mailman/listinfo/lustre-discuss