PFC wrote:



My argument is that a sufficiently smart kernel scheduler *should*
yield performance results that are reasonably close to what you can
get with that feature.  Perhaps not quite as good, but reasonably
close.  It shouldn't be an orders-of-magnitude type difference.


And a controller card (or drive) has a lot less RAM to use as a cache / queue for reordering requests than the OS has. The OS can potentially use most of the available RAM, which can be gigabytes on a big server, whereas the drive has at most a few tens of megabytes.

However, all this is looking at the problem from the wrong end. The OS should provide a multi-read call that lets applications pass a list of the blocks they'll need. It could then reorder those requests and read them in the fastest possible way, clustering them with similar requests from other threads.

Right now, when a thread/process issues a read(), it blocks until the requested data is delivered. The OS does not know whether this thread will then need the next block (which can be had very cheaply if you know ahead of time that you'll need it). Thus it must make guesses, read ahead (sometimes), etc.

All true. Which is why high performance computing folks use aio_read()/aio_write() and load up the kernel with all the requests they expect to make.
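
As a rough illustration of "loading up the kernel", here is a minimal sketch using the POSIX AIO calls. The function name read_blocks_async and the MAX_REQS cap are hypothetical, made up for this example; on older glibc you may also need to link with -lrt. It queues every read before waiting on any of them, so the kernel (or the AIO implementation) sees the whole batch at once and is free to service it in whatever order it likes.

```c
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

#define MAX_REQS 16   /* hypothetical cap for this sketch */

/* Queue n block reads up front, then wait for all of them.
 * Block i is read from offsets[i] into out + i * blksz.
 * Returns 0 on success, -1 if any request failed or came up short. */
int read_blocks_async(int fd, const off_t *offsets, size_t n,
                      size_t blksz, char *out)
{
    struct aiocb cbs[MAX_REQS];
    size_t queued = 0;
    int rc = 0;

    assert(n > 0 && n <= MAX_REQS);

    /* Submit everything first; nothing blocks here. */
    for (size_t i = 0; i < n; i++) {
        memset(&cbs[i], 0, sizeof cbs[i]);
        cbs[i].aio_fildes = fd;
        cbs[i].aio_offset = offsets[i];
        cbs[i].aio_buf    = out + i * blksz;
        cbs[i].aio_nbytes = blksz;
        if (aio_read(&cbs[i]) != 0) {
            rc = -1;
            break;
        }
        queued++;
    }

    /* Now collect the results of every request that was queued. */
    for (size_t i = 0; i < queued; i++) {
        const struct aiocb *one[1] = { &cbs[i] };
        while (aio_error(&cbs[i]) == EINPROGRESS)
            aio_suspend(one, 1, NULL);
        if (aio_return(&cbs[i]) != (ssize_t)blksz)
            rc = -1;
    }
    return rc;
}
```

A caller would pass the offsets it expects to need, in any order, and only touch the buffer after the function returns.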


The kernels that I'm familiar with will do read-ahead on files based on some heuristics: when you read the first byte of a file, the OS will typically load several pages of the file (depending on file size, etc.). If you continue doing read() calls without an lseek() on the file descriptor, the kernel will take the hint that you're doing a sequential read and keep caching pages ahead of time, usually reusing the pages you just read to hold the new data so that memory isn't bloated with data that won't be needed again. Throw in an lseek() and the amount of read-ahead caching may be reduced.


One point that is being missed in all this discussion is that the file system also imposes some constraints on how I/Os can be done. For example, simply doing a write(fd, buf, 100000000) doesn't emit a stream of sequential blocks to the drives. Some file systems (UFS was one) would force portions of large files into other cylinder groups so that small files could be located near the inode data, thus avoiding seeks or reducing their length. Similarly, extents need to be allocated, and the bitmaps recording those allocations usually need synchronous updates, which will require some seeks. Not to mention the need to update inode data. Anyway, my point is that the allocation policies of the file system can confuse the situation.


Also, the seek times one sees reported are an average. One really needs to look at the track-to-track seek time and also the "full stroke" seek time. It takes a *long* time to move the heads across the whole platter. I've seen people partition drives to use only small regions of the disk, both to avoid long seeks and to take advantage of the larger number of bits passing under the head per rotation on those tracks. A 15K drive doesn't necessarily seek faster than a 10K drive just because it spins faster: rotational speed reduces the wait for the sector to come around, not the time to move the heads. The average seek time might be lower simply because 15K drives are smaller and have fewer cylinders.
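
To put numbers on the rotational part: average rotational latency is the time for half a revolution, which depends only on RPM. A small sketch (the function names are made up for illustration, and the seek figure you feed in is whatever the vendor quotes):

```c
/* Average rotational latency in milliseconds: half a revolution.
 * One revolution takes 60000 / rpm milliseconds. */
double avg_rotational_latency_ms(double rpm)
{
    return 0.5 * 60000.0 / rpm;
}

/* Rough average access time: quoted average seek plus average
 * rotational latency. Ignores transfer time and controller overhead. */
double avg_access_ms(double avg_seek_ms, double rpm)
{
    return avg_seek_ms + avg_rotational_latency_ms(rpm);
}
```

That gives 2 ms for a 15K drive and 3 ms for a 10K drive, independent of how fast the heads move.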

-- Alan
