On Dec 15, 2006 15:23 -0400, Peter Bojanic wrote:
> In a call today with Cray, we discussed briefly about 4MB IOs on the
> DDN 9500. I've heard and seen numbers from one customer (I'm not sure
> I should name them, so I will not), that demonstrate that they get
> better performance with our patch that coalesces 4MB IOs for the
> Lustre VFS client. Nic Henke mentioned a discussion with eeb that
> debated the advantages of 4MB IOs wrt LNET.
>
> Can you advise:
> - is the 4MB IO scenario specific to the Elan configuration of our
>   first customer who used DDN 9500s
No, I don't think the 4MB IO performance relates to the network at all,
except that a low-bandwidth network like GigE wouldn't show anything,
because there the network is the bottleneck rather than the disk IO.

> - what is the drawback of 4MB IOs wrt LNET

Generating large IOVs is a problem for some LNDs because they can't
handle more than 256 pages at a time in the scatter-gather list (a 4MB
bulk is 1024 4KB pages, well over that limit). This is LND-specific.

> - is there a clear verdict for Lustre VFS clients?

We'll have to wait until Jody can test Lustre clients against the DDN
9500; I don't recall offhand whether the previous results were
end-to-end or with sgp-dd.

> - does this even matter for liblustre, for which we're unable to
>   aggregate IOs?

Yes, possibly even more so, because if liblustre can't do asynchronous
IOs then IO performance is all the more important for applications (on
Linux the application can resume computation while Lustre flushes the
cache in the background). That said, liblustre would only be able to
take advantage of larger RPC IO sizes if the application is itself
doing such large IOs, while Linux clients can aggregate multiple
smaller IOs into a single larger one on the wire.

I actually thought of a very interesting solution to this that allows
large disk IO sizes without even changing the wire protocol. It would
also allow per-client bandwidth throttling, in which some customers
have expressed interest.

Eric has said repeatedly that a 1MB IO size is plenty large enough to
saturate the network, and that the reason 4MB IO is faster is purely a
function of the disk IOs. Jody's recent sgp-dd testing has shown that
raw disk performance does increase dramatically for 4MB IOs compared
to 1MB IOs.

The changes needed would be as follows:
- clients would submit IOs as normal (1MB) to the OSTs
- at the OST side the requests would immediately be added to that
  client's export instead of waiting in the incoming request queue for
  an IO thread to handle them (likely a single thread would decode
  enough of each request to figure out which export to attach it to),
  adding the exports to a list of exports with pending requests
- the OST service threads would walk the pending export list and
  process some number of requests from each export. If a thread
  processed 4 bulk IO requests together, that would give us 4MB IOs to
  disk with no change to the wire protocol. It could even be smart and
  pick among the pending requests so they are submitted in file offset
  order.
- the threads could process more or fewer requests per export to
  provide more or less throughput on a per-export basis

Rough sketches of this batching idea and of the LND page-count
arithmetic are appended below my signature.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.
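
To make the batching idea above concrete, here is a minimal, hedged
sketch. It is a toy user-space program, not Lustre code: the names
(ost_request, ost_export, ost_queue_request, ost_service_one,
BATCH_RPCS) are invented for illustration, and the real obd_export /
ptlrpc structures, locking, and error handling are omitted.

/* Toy user-space sketch of the per-export batching idea above.  Names
 * (ost_request, ost_export, BATCH_RPCS) are invented; real Lustre
 * structures, locking and error handling are omitted. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/queue.h>

#define RPC_SIZE   (1 << 20)    /* clients still send 1MB bulk RPCs    */
#define BATCH_RPCS 4            /* 4 x 1MB merged into one 4MB disk IO */

struct ost_request {
	long long                or_offset;  /* file offset of this 1MB RPC */
	TAILQ_ENTRY(ost_request) or_link;
};

struct ost_export {
	const char               *oe_client;
	TAILQ_HEAD(, ost_request) oe_pending; /* this client's queued RPCs  */
	TAILQ_ENTRY(ost_export)   oe_link;    /* on the global pending list */
	int                       oe_listed;
};

/* exports that currently have requests waiting for an IO thread */
static TAILQ_HEAD(, ost_export) pending_exports =
	TAILQ_HEAD_INITIALIZER(pending_exports);

/* decode just enough of an incoming RPC to find its export, then queue it */
static void ost_queue_request(struct ost_export *exp, long long offset)
{
	struct ost_request *req = malloc(sizeof(*req));

	req->or_offset = offset;
	TAILQ_INSERT_TAIL(&exp->oe_pending, req, or_link);
	if (!exp->oe_listed) {
		TAILQ_INSERT_TAIL(&pending_exports, exp, oe_link);
		exp->oe_listed = 1;
	}
}

static int cmp_offset(const void *a, const void *b)
{
	const struct ost_request *ra = *(const struct ost_request **)a;
	const struct ost_request *rb = *(const struct ost_request **)b;
	return (ra->or_offset > rb->or_offset) - (ra->or_offset < rb->or_offset);
}

/* one pass of a service thread: take up to BATCH_RPCS requests from the
 * next pending export and submit them as one disk IO in offset order */
static void ost_service_one(void)
{
	struct ost_export  *exp = TAILQ_FIRST(&pending_exports);
	struct ost_request *batch[BATCH_RPCS];
	int n = 0, i;

	if (exp == NULL)
		return;

	while (n < BATCH_RPCS && !TAILQ_EMPTY(&exp->oe_pending)) {
		batch[n] = TAILQ_FIRST(&exp->oe_pending);
		TAILQ_REMOVE(&exp->oe_pending, batch[n], or_link);
		n++;
	}
	qsort(batch, n, sizeof(batch[0]), cmp_offset);

	printf("%s: one %dMB disk IO from %d x 1MB RPCs, offsets (MB):",
	       exp->oe_client, (int)(n * RPC_SIZE >> 20), n);
	for (i = 0; i < n; i++) {
		printf(" %lld", batch[i]->or_offset >> 20);
		free(batch[i]);
	}
	printf("\n");

	/* requeue the export at the tail if it still has work; this is also
	 * where per-export throttling (more or fewer RPCs per pass) fits */
	TAILQ_REMOVE(&pending_exports, exp, oe_link);
	exp->oe_listed = 0;
	if (!TAILQ_EMPTY(&exp->oe_pending)) {
		TAILQ_INSERT_TAIL(&pending_exports, exp, oe_link);
		exp->oe_listed = 1;
	}
}

int main(void)
{
	struct ost_export client = { .oe_client = "client-1" };
	long long off;

	TAILQ_INIT(&client.oe_pending);

	/* the client sends six 1MB RPCs, slightly out of order */
	for (off = 5; off >= 0; off--)
		ost_queue_request(&client, off * RPC_SIZE);

	ost_service_one();	/* 4MB IO: offsets 2,3,4,5 */
	ost_service_one();	/* 2MB IO: offsets 0,1     */
	return 0;
}

The point of rotating exports on the pending list is round-robin
fairness across clients, and the number of RPCs taken per pass is the
obvious knob for the per-export throttling mentioned above.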

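And a back-of-the-envelope check of the scatter-gather limit mentioned
above, assuming 4KB pages and the 256-page per-descriptor limit from
the discussion; PAGE_SIZE_BYTES and LND_MAX_SG_PAGES are illustrative
names, not real LNET constants.

/* Why a 4MB bulk is awkward for some LNDs: a 1MB bulk fits in one
 * 256-page scatter-gather descriptor, a 4MB bulk would need four. */
#include <stdio.h>

#define PAGE_SIZE_BYTES  4096UL
#define LND_MAX_SG_PAGES  256UL   /* per-descriptor limit for some LNDs */

int main(void)
{
	unsigned long sizes[] = { 1UL << 20, 4UL << 20 };   /* 1MB and 4MB */

	for (int i = 0; i < 2; i++) {
		unsigned long pages = sizes[i] / PAGE_SIZE_BYTES;
		unsigned long frags = (pages + LND_MAX_SG_PAGES - 1) /
				      LND_MAX_SG_PAGES;

		printf("%luMB bulk = %4lu pages -> %lu descriptor(s) of "
		       "up to %lu pages\n",
		       sizes[i] >> 20, pages, frags, LND_MAX_SG_PAGES);
	}
	return 0;
}

Which lines up with Eric's point: 1MB already maps cleanly onto one
network descriptor, so the 4MB win has to come from the disk side.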