On Tue, 2009-09-01 at 11:34 -0700, Don Thorp wrote:
>
> New hardware that will support the workload is on the way, but are
> there some changes I can make now to 1.6.6 that would increase
> reliability, even at the expense of performance?

With what you have given us to work with, my first suggestion would be to increase your obd_timeout. You should not need to go higher than about 300 seconds, but try to choose a value only just high enough to stop the callback timeouts, since higher obd_timeout values mean longer recoveries.

Additionally, you might look into tuning the number of OST threads on your OSSes if you are driving your disks too hard. The OST thread count, like obd_timeout, should be just high enough to reach maximum throughput, and no higher. If you have not baselined your hardware with the iokit, you can simply start dropping the OST thread count until you find that you are impacting throughput. It's a bit more trial and error than using the iokit, but if you are in production already, it's probably the best you can do.

b.
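
As a rough illustration of the trial-and-error approach to OST thread counts described above, here is a minimal Python sketch. It assumes you have already noted the aggregate throughput observed at each thread count you tried. The function name, the 5% tolerance, and the sample figures are all hypothetical and are not taken from any Lustre tool or from a real system.

```python
def pick_thread_count(measurements, tolerance=0.05):
    """Return the smallest thread count whose throughput is within
    `tolerance` of the best throughput observed.

    measurements: dict mapping OST thread count -> throughput (MB/s).
    tolerance: fraction of peak throughput you are willing to give up
               (hypothetical default, pick your own).
    """
    if not measurements:
        raise ValueError("no measurements supplied")
    best = max(measurements.values())
    threshold = best * (1.0 - tolerance)
    # Keep only the thread counts that still reach the threshold,
    # then take the lowest of them.
    return min(t for t, mbps in measurements.items() if mbps >= threshold)


# Hypothetical figures, for illustration only.
samples = {512: 610.0, 256: 640.0, 128: 655.0, 64: 650.0, 32: 560.0}
print(pick_thread_count(samples))
```

With these made-up numbers, the drop from 64 to 32 threads is where throughput starts to suffer, so 64 is the value the sketch settles on.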
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss