Re: [Fwd: [linux-audio-dev] info point on linux hdr]
>> 2) Why am I not having any of these problems ? Unlike Benno's code, I
>> have a working application that runs just fine. I get smooth
>> throughput from the disk subsystem too.
>
>What do you mean exactly with "unlike Benno's code" ?
>My code just tries to simulate the operation of a busy harddisk recorder
>using the sorting algorithm to support variable speed.
>I read in 256kb chunks too, therefore I don't see a big difference between
>my code and yours from a disk IO subsystem POV.

In your own words, your code "simulates the operations of a busy
harddisk recorder". Mine *is* a busy harddisk recorder. There's a lot
of stuff in my code that isn't in yours because I have a bunch of real
world stuff like a tape transport mechanism, extra intra-thread
communication, MTC delivery, an audio thread event/play list, and more.

Yet, despite all this, I have never run into the problems you
describe. As Stephen has noted, this may very well be because of my
use of SCSI h/w.

>Therefore it would be useful if you could run my benchmark on your
>disk to see if your (or my) approach gets better performance out of the
>disk, and with how much buffer utilization.

When I get a minute or 30, I will.

>>In both cases, without O_SYNC, or anything else but preallocation
>>and careful design, I seem to be able to get smooth disk throughput
>>at significantly above the rate I need (9MB/sec; I get up to
>>17MB/sec from the UltraStar)
>
>17MB/sec using hdparm or linear large reads/writes (large cat / cp etc) or
>17MB/sec within your harddisk recording app where
>num_tracks * datarate_of_each_track = 17MB/sec
>(if it's the latter then I doubt it because seek kills some of the
>throughput, that's almost unavoidable, at least on my EIDE UDMA disks)

No, I do mean the latter: 17MB/sec from within ardour. You can doubt
it all you want, but I get it regularly.

Actually, the real numbers look more like (from memory, each line is
one iteration of disk i/o across all tracks, so 24*256kB of data):

    15MB/sec
    450MB/sec
    10MB/sec
    14MB/sec
    567MB/sec
    16MB/sec
    19MB/sec
    8MB/sec
    378MB/sec

The super-high numbers, I assume, are because of the read-ahead being
done by the kernel, which helps us out every so often. Remember, these
files are as contiguous as I can make 'em with ext2. And keep in mind
that my disks have a maximum transfer rate of 35MB/sec (nothing to do
with U2W - just that they are just about the latest disks).

I have a very small, standalone single-threaded test app that gets
similar rates, even though it does random sized seeks across the whole
disk. I've posted that program before on LAD, and so I trust the
numbers.

--p
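For anyone who wants to reproduce this sort of measurement, a rough
single-threaded sketch in C follows. It is a hypothetical illustration,
not the test program referred to above: it seeks to random
chunk-aligned offsets and times 256kB reads, printing MB/sec per
iteration.

    /* throughput_test.c - hypothetical sketch of a single-threaded disk
     * throughput test: random seeks followed by 256kB reads, timed per
     * iteration.  Not the program referred to in the mail.
     * Build with: gcc -O2 throughput_test.c -o throughput_test */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/time.h>

    #define CHUNK (256 * 1024)

    int main(int argc, char **argv)
    {
        if (argc < 2) {
            fprintf(stderr, "usage: %s <file-or-device>\n", argv[0]);
            return 1;
        }

        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        off_t size = lseek(fd, 0, SEEK_END);   /* size of file/device */
        if (size < CHUNK) { fprintf(stderr, "target too small\n"); return 1; }

        char *buf = malloc(CHUNK);
        struct timeval t0, t1;

        for (int i = 0; i < 100; i++) {
            /* seek to a random, chunk-aligned offset within the target */
            off_t where = ((off_t)rand() % (size / CHUNK)) * CHUNK;

            gettimeofday(&t0, NULL);
            if (pread(fd, buf, CHUNK, where) != CHUNK) { perror("pread"); break; }
            gettimeofday(&t1, NULL);

            double secs = (t1.tv_sec - t0.tv_sec) +
                          (t1.tv_usec - t0.tv_usec) / 1e6;
            printf("%3d: %.1f MB/sec\n", i, (CHUNK / 1e6) / secs);
        }

        free(buf);
        close(fd);
        return 0;
    }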
Re: O_DIRECT architecture (was Re: info point on linux hdr)
Hi,

On Tue, Apr 18, 2000 at 01:17:52PM -0500, Steve Lord wrote:

> So I guess the question here is how do you plan on keeping track of the
> origin of the pages?

You don't have to.

> Which ones were originally part of the kernel cache
> and thus need copying up to user space?

If the caller requested O_ALIAS, then the IO routine is allowed to
free any page in the kiobuf and alias the existing page cache page
into the kiobuf. It doesn't actually matter whether the original page
in the page cache was there when we started, or whether it was a
kiobuf page which we mapped into the page cache for a direct IO. The
whole point is that as long as the IO is in progress, it is a
perfectly legal page in the page cache.

There is one nastiness. This use of the kiobuf effectively results in
a temporary mmap() of the file while the IO is in progress. If another
thread happens to write to the page undergoing an O_DIRECT read, then
we end up modifying the page cache. That's bad.

So, we need to make sure that while the IO is in progress for a true
raw IO, we keep the page locked during the IO and mark the page not-
uptodate after the IO completes. That way, even if another process
looks up the page while the IO is in progress, the data that was
undergoing IO will be kept private, and we get a second chance once
the IO has completed to see that the page is now shared and we have to
do a copy.

That is slightly messy, but it nicely hides the transient mmap while
still preserving zero-copy for all of the important cases. It's about
the cleanest solution I can see which preserves complete cache
coherency at all times, because it guarantees that the IO is always
done inside the page cache itself.

> It does not seem hard, just wondering
> what you had in mind. Also, I presume, if the page was already present
> and up to date then on a read you would not refill it from disk - since it
> may be more recent than the on disk data, existing buffer heads would
> give you this information.

It's not a physical buffer_head lookup, it's a logical page cache
lookup that we would do, but yes, we'd read from the page cache in
this case and just do a copy.

> > One consequence is that O_DIRECT I/O from a file which is already cached
> > will always result in copies, but I don't mind that too much.
>
> So maybe an O_CLEANCACHE (or something similar) could be used to indicate
> that anything which is found cached should be moved out of the way (flushed
> to disk or tossed depending on what is happening).

That's an orthogonal issue: posix_fadvise() should be the mechanism
for that if the application truly wants to do explicit cache eviction.

--Stephen
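For reference, here is a minimal user-space sketch of the explicit
cache eviction Stephen alludes to, assuming the posix_fadvise()
interface as it was later standardised (it was not available in Linux
at the time of this thread); it is an illustration, not anything
proposed in the mail itself.

    /* Hypothetical sketch: ask the kernel to drop cached pages for a
     * file after writing it, using posix_fadvise() as later
     * standardised.  This interface did not exist in Linux when the
     * mail above was written. */
    #define _XOPEN_SOURCE 600
    #include <fcntl.h>
    #include <unistd.h>

    int drop_file_cache(int fd)
    {
        /* Get dirty pages onto disk first; DONTNEED only discards
         * clean pages. */
        if (fdatasync(fd) != 0)
            return -1;

        /* Advise that the whole file (offset 0, len 0 == to EOF) will
         * not be needed again, so its cached pages can be evicted. */
        return posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    }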
Re: O_DIRECT architecture (was Re: info point on linux hdr)
> Hi,
>
> On Tue, Apr 18, 2000 at 07:56:04AM -0500, Steve Lord wrote:
>
> > I said basic implementation because it is currently paying no attention
> > to cached data. The Irix approach to this was to flush or toss cached
> > data which overlapped a direct I/O, I am leaning towards keeping them
> > as part of the I/O.
>
> The big advantage of the scheme where I map the kiobuf pages into the
> real page cache before the I/O, and unmap after, is that cache
> coherency at the beginning of the I/O and all the way through it is
> guaranteed. The cost is that the direct I/O may end up doing copies
> if there is other I/O going on at the same time to the same page, but
> I don't see that as a problem!

I was thinking along these lines. So I guess the question here is how
do you plan on keeping track of the origin of the pages? Which ones
were originally part of the kernel cache and thus need copying up to
user space? It does not seem hard, just wondering what you had in
mind. Also, I presume, if the page was already present and up to date
then on a read you would not refill it from disk - since it may be
more recent than the on disk data, existing buffer heads would give
you this information.

> Ultimately we are going to have to review the whole device driver
> interface. We need that both to do things like >2TB block devices, and
> also to achieve better efficiency than we can attain right now with a
> separate buffer_head for every single block in the I/O. It's just using
> too much CPU; being able to pass kiobufs directly to ll_rw_block along
> with a block address list would be much more efficient.

Agreed, XFS was getting killed by this (and the fixed block size
requirement of the interface); we have 512 byte I/O requests we need
to do for some meta-data, and having to impose this on all I/O and
create 8 buffer heads for each 4K page was just nasty.

> > So if O_ALIAS allows user pages to be put in the cache (provided you use
> > O_UNCACHE with it), you can do this.
>
> Yes.
>
> > However, O_DIRECT would be a bit more
> > than this - since if there already was cached data for part of the I/O
> > you still need to copy those pages up into the user pages which did not
> > get into cache.
>
> That's the intention --- O_ALIAS _allows_ the user page to be mapped
> into the cache, but if existing cached data or alignment constraints
> prevent that, it will fall back to doing a copy.
>
> One consequence is that O_DIRECT I/O from a file which is already cached
> will always result in copies, but I don't mind that too much.

So maybe an O_CLEANCACHE (or something similar) could be used to
indicate that anything which is found cached should be moved out of
the way (flushed to disk or tossed depending on what is happening).
Some other sort of API, such as an fsync variant or that fadvise call
which was mentioned recently, could be used to clean cache for a file.
This would let those apps which really want direct disk <-> user
memory I/O get what they wanted.

> The pagebuf stuff sounds like it is fairly specialised for now. As
> long as all of the components that we are talking about can pass kiobufs
> between themselves, we should be able to make them interoperate pretty
> easily.
>
> Is the pagebuf code intended to be core VFS functionality or do you
> see it being an XFS library component for the forseeable future?

We had talked about trying to use it on some other filesystem to see
what happened, but we don't really have the bandwidth to do that.
We don't see it as being just there for XFS - although, for existing
Linux filesystems, there may not be benefits to switching over to it.

> --Stephen

Steve
Re: O_DIRECT architecture (was Re: info point on linux hdr)
Hi,

On Tue, Apr 18, 2000 at 07:56:04AM -0500, Steve Lord wrote:

> XFS is using the pagebuf code we wrote (or I should say are writing - it
> needs a lot of work yet). This uses kiobufs to represent data in a set of
> pages. So, we have the infrastructure to take a kiobuf and read or write
> it from disk (OK, it uses buffer heads under the covers).

That's fine, and in fact is exactly what kiobufs were designed for: to
abstract out the storage of the buffer from whatever construction you
happen to use to do the IO. (Raw IO also uses buffer_heads internally
but passes data around in kiobufs.)

> I said basic implementation because it is currently paying no attention
> to cached data. The Irix approach to this was to flush or toss cached
> data which overlapped a direct I/O, I am leaning towards keeping them
> as part of the I/O.

The big advantage of the scheme where I map the kiobuf pages into the
real page cache before the I/O, and unmap after, is that cache
coherency at the beginning of the I/O and all the way through it is
guaranteed. The cost is that the direct I/O may end up doing copies if
there is other I/O going on at the same time to the same page, but I
don't see that as a problem!

> o using caching to remove the alignment restrictions on direct I/O by
>   doing unaligned head and tail processing via buffered I/O.

I'm just planning on doing a copy for any unaligned I/O. Raw character
devices simply reject unaligned I/O for now, but O_DIRECT will be a
bit more forgiving.

> > It's something I've been thinking about in the general case. Basically
> > what I want to do is this:
> >
> > Augment the inode operations with a new operation, "rw_kiovec" which
> > performs reads and writes on vectors of kiobufs.
>
> You should probably take a look at what we have been doing to the ops,
> although our extensions are really biased towards extent based filesystems;
> rather than using getblock to identify individual blocks of file data we
> added a bmap interface to return a larger range - this requires different
> locking semantics than getblock, since the mapping we return covers multiple
> pages. I suspect that any approach which assembles multiple pages in advance
> is going to have similar issues.

OK. These are probably orthogonal for now, but doing extent bmaps is
an important optimisation.

Ultimately we are going to have to review the whole device driver
interface. We need that both to do things like >2TB block devices, and
also to achieve better efficiency than we can attain right now with a
separate buffer_head for every single block in the I/O. It's just
using too much CPU; being able to pass kiobufs directly to ll_rw_block
along with a block address list would be much more efficient.

> So if O_ALIAS allows user pages to be put in the cache (provided you use
> O_UNCACHE with it), you can do this.

Yes.

> However, O_DIRECT would be a bit more
> than this - since if there already was cached data for part of the I/O
> you still need to copy those pages up into the user pages which did not
> get into cache.

That's the intention --- O_ALIAS _allows_ the user page to be mapped
into the cache, but if existing cached data or alignment constraints
prevent that, it will fall back to doing a copy.

One consequence is that O_DIRECT I/O from a file which is already
cached will always result in copies, but I don't mind that too much.

> We (SGI) really need to get better hooked in on stuff like this - I really
> don't want to see us going off in one direction (pagebuf) and all the other
> filesystems going off in a different direction.

The pagebuf stuff sounds like it is fairly specialised for now. As
long as all of the components that we are talking about can pass
kiobufs between themselves, we should be able to make them
interoperate pretty easily.

Is the pagebuf code intended to be core VFS functionality or do you
see it being an XFS library component for the foreseeable future?

> p.s. did you know we also cache meta data in pages directly?

That was one of the intentions in the new page cache structure, and we
may actually end up moving ext2's metadata caching to use the page
cache too in the future.

--Stephen
Re: [Fwd: [linux-audio-dev] info point on linux hdr]
Hi,

On Tue, Apr 18, 2000 at 10:57:25AM -0400, Paul Barton-Davis wrote:

> >> 1) pre-allocation takes a *long* time. Allocating 24 203MB files on a
> >>    clean ext2 partition of 18GB takes many, many minutes, for example.
> >>    Presumably, the same overhead is being incurred when block
> >>    allocation happens "on the fly".
> >
> > It is not the allocation which is taking ages, it's the actual
> > writing of the data.
>
> Except that for preallocation, I only write one byte in every block,
> so for a 203MB file, I only write 52K approximately (ext2 4K blocks).

Umm, how is that going to make _any_ difference at all? The filesystem
works in blocks, not bytes. You still end up with 203MB of disk IO.

--Stephen
Re: [Fwd: [linux-audio-dev] info point on linux hdr]
>> 1) pre-allocation takes a *long* time. Allocating 24 203MB files on a
>>    clean ext2 partition of 18GB takes many, many minutes, for example.
>>    Presumably, the same overhead is being incurred when block
>>    allocation happens "on the fly".
>
>It is not the allocation which is taking ages, it's the actual
>writing of the data.

Except that for preallocation, I only write one byte in every block,
so for a 203MB file, I only write 52K approximately (ext2 4K blocks).

--p
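For illustration, here is a rough sketch of the one-byte-per-block
preallocation being described, assuming 4K ext2 blocks; this is a
hypothetical example, not the actual ardour code. As Stephen points
out elsewhere in the thread, every touched byte still dirties a whole
block, so the total amount of disk IO is not reduced.

    /* Hypothetical sketch of one-byte-per-block preallocation: touch
     * one byte in every filesystem block so that ext2 allocates all
     * blocks up front.  Each dirtied byte still forces a full block
     * writeback, so roughly the whole file size still hits the disk. */
    #include <unistd.h>
    #include <fcntl.h>

    #define BLOCK_SIZE 4096          /* ext2 block size assumed here */

    int preallocate(int fd, off_t length)
    {
        char zero = 0;

        for (off_t pos = 0; pos < length; pos += BLOCK_SIZE) {
            if (lseek(fd, pos, SEEK_SET) < 0)
                return -1;
            if (write(fd, &zero, 1) != 1)   /* one byte per block */
                return -1;
        }
        return 0;
    }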
Re: O_DIRECT architecture (was Re: info point on linux hdr)
> Hi,
>
> On Mon, Apr 17, 2000 at 05:58:48PM -0500, Steve Lord wrote:
>
> > O_DIRECT on Linux XFS is still a work in progress, we only have
> > direct reads so far. A very basic implementation was made available
> > this weekend.
>
> Care to elaborate on how you are doing O_DIRECT?

XFS is using the pagebuf code we wrote (or I should say are writing -
it needs a lot of work yet). This uses kiobufs to represent data in a
set of pages. So, we have the infrastructure to take a kiobuf and read
or write it from disk (OK, it uses buffer heads under the covers). I
glued this together with the map_user_kiobuf() and unmap_kiobuf()
calls from your raw I/O driver and that was about it.

We only build these kiobufs for data which is sequential on disk, not
for the whole user request; the sequence we do things in is a bit
different, basically:

    while data left to copy
        obtain bmap from filesystem representing location of next
            chunk of data (sequential on disk)
        for buffered I/O
            go find pages covering this range - create if they do not exist.
            issue blocking read for pages which are not uptodate
            copy out to user space
        for direct I/O
            map user pages into a kiobuf
            issue blocking read for pages
            unmap pages

I said basic implementation because it is currently paying no
attention to cached data. The Irix approach to this was to flush or
toss cached data which overlapped a direct I/O, I am leaning towards
keeping them as part of the I/O.

Other future possibilities I see are:

 o using caching to remove the alignment restrictions on direct I/O by
   doing unaligned head and tail processing via buffered I/O.

 o Automatically switching to direct I/O under conditions where the
   I/O would flush too much cache.

> It's something I've been thinking about in the general case. Basically
> what I want to do is this:
>
> Augment the inode operations with a new operation, "rw_kiovec" which
> performs reads and writes on vectors of kiobufs.

You should probably take a look at what we have been doing to the ops,
although our extensions are really biased towards extent based
filesystems; rather than using getblock to identify individual blocks
of file data we added a bmap interface to return a larger range - this
requires different locking semantics than getblock, since the mapping
we return covers multiple pages. I suspect that any approach which
assembles multiple pages in advance is going to have similar issues.

> Provide a generic_rw_kiovec() function which uses the existing page-
> oriented IO vectors to set up page mappings much as generic_file_{read,
> write} do, but honouring the following flags in the file descriptor:
>
>  * O_ALIAS
>
>    Allows the write function to install the page in the kiobuf
>    into the page cache if the data is correctly aligned and there is
>    not already a page in the page cache.
>
>    For read, the meaning is different: it allows existing pages in
>    the page cache to be installed into the kiobuf.
>
>  * O_UNCACHE
>
>    If the IO created a new page in the page cache, then attempt to
>    unlink the page after the IO completes.
>
>  * O_SYNC
>
>    Usual meaning: wait for synchronous write IO completion.
>
> O_DIRECT becomes no more than a combination of these options.

So if O_ALIAS allows user pages to be put in the cache (provided you
use O_UNCACHE with it), you can do this. However, O_DIRECT would be a
bit more than this - since if there already was cached data for part
of the I/O you still need to copy those pages up into the user pages
which did not get into cache.

> Furthermore, by implementing this mechanism with kiobufs, we can go
> one step further and perform things like Larry's splice operations by
> performing reads and writes in kiobufs. Using O_ALIAS kiobuf reads and
> writes gives us copies between regular files entirely in kernel space
> with the minimum possible memory copies. sendfile() between regular
> files can be optimised to use this mechanism. The data never has to
> hit user space.
>
> As an example of the flexibility of the interface, you can perform
> an O_ALIAS, O_UNCACHE sendfile to copy one file to another, with full
> readahead still being performed on the input file but with no memory
> copies at all. You can also choose not to have O_UNCACHE and O_SYNC
> on the writes, in which case you have both readahead and writebehind
> with zero copy.
>
> This is all fairly easy to implement (at least for ext2), and gives
> us much more than just O_DIRECT for no extra work.
>
> --Stephen

We (SGI) really need to get better hooked in on stuff like this - I
really don't want to see us going off in one direction (pagebuf) and
all the other filesystems going off in a different direction.

p.s. did you know we also cache meta data in pages directly?
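For orientation, here is a schematic C sketch of the direct-I/O leg of
the loop described above, written against the 2.3/2.4-era kiobuf calls
named in the mail (map_user_kiobuf, unmap_kiobuf) together with
alloc_kiovec/free_kiovec and brw_kiovec from the raw I/O driver. The
extent_map structure and the direct_read_chunk function are inventions
for illustration only, not the XFS/pagebuf code.

    /* Schematic sketch (2.3/2.4-era kernel, not the actual XFS pagebuf
     * code): map the user pages into a kiobuf, issue a blocking read
     * for one on-disk extent returned by the filesystem's bmap, then
     * unmap.  extent_map and direct_read_chunk are assumptions made
     * here for illustration. */
    #include <linux/fs.h>
    #include <linux/iobuf.h>
    #include <linux/slab.h>
    #include <linux/errno.h>

    struct extent_map {            /* hypothetical bmap result */
        kdev_t        dev;
        unsigned long first_block; /* in units of 'blocksize' */
        int           nblocks;
        int           blocksize;
    };

    static int direct_read_chunk(unsigned long user_addr, size_t len,
                                 struct extent_map *map)
    {
        struct kiobuf *iobuf;
        unsigned long *blocks;
        int i, err;

        err = alloc_kiovec(1, &iobuf);
        if (err)
            return err;

        blocks = kmalloc(map->nblocks * sizeof(*blocks), GFP_KERNEL);
        if (!blocks) {
            err = -ENOMEM;
            goto out_kiovec;
        }

        /* Pin the user pages and describe them in the kiobuf. */
        err = map_user_kiobuf(READ, iobuf, user_addr, len);
        if (err)
            goto out_blocks;

        /* Block list for this extent, which is sequential on disk. */
        for (i = 0; i < map->nblocks; i++)
            blocks[i] = map->first_block + i;

        /* Blocking read of the extent straight into the user pages. */
        err = brw_kiovec(READ, 1, &iobuf, map->dev, blocks, map->blocksize);

        unmap_kiobuf(iobuf);
    out_blocks:
        kfree(blocks);
    out_kiovec:
        free_kiovec(1, &iobuf);
        return err < 0 ? err : 0;
    }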
Re: [Fwd: [linux-audio-dev] info point on linux hdr]
Hi,

On Mon, Apr 17, 2000 at 07:10:43PM +0200, Martin Schenk wrote:

> If you are interested in a more efficient fsync (and a real fdatasync),
> I have some patches that provide better performance for very large
> files (where fsync is mostly busy scanning the page cache for changes),
> and a fdatasync that eliminates writing the inode if not necessary.
> (at the moment these patches are only for 2.3.4?, and I don't have
> the time to keep them up to date - especially as nobody was interested
> the last time I posted them)

Please post them, I will have a look at them.

--Stephen
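As a user-space aside, this is why a hard-disk recorder cares about a
real fdatasync(): flushing audio data with fdatasync() rather than
fsync() can avoid a metadata (inode) write per capture block. A small
hypothetical example, unrelated to the kernel patches themselves:

    /* Hypothetical sketch: flush each captured audio block without
     * forcing an inode write every time. */
    #include <unistd.h>

    int flush_capture_block(int fd, const void *buf, size_t len)
    {
        if (write(fd, buf, len) != (ssize_t)len)
            return -1;

        /* fdatasync() waits for the file data to reach disk but,
         * unlike fsync(), may skip writing the inode when only
         * timestamps have changed. */
        return fdatasync(fd);
    }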
Re: [Fwd: [linux-audio-dev] info point on linux hdr]
Hi,

On Mon, Apr 17, 2000 at 01:05:12PM -0400, Paul Barton-Davis wrote:

> Acknowledging your much greater wisdom in this area than me, I don't
> understand the above given that, in my experience:
>
> 1) pre-allocation takes a *long* time. Allocating 24 203MB files on a
>    clean ext2 partition of 18GB takes many, many minutes, for example.
>    Presumably, the same overhead is being incurred when block
>    allocation happens "on the fly".

It is not the allocation which is taking ages, it's the actual writing
of the data. Even if you preallocate, you still have the data writes
to worry about! Preallocation just can't gain that much. (It can gain
some, but not as much as you would like.)

> 2) Why am I not having any of these problems ? Unlike Benno's code, I
>    have a working application that runs just fine.

Umm, you are running a different application on different hardware,
and you are wondering why you don't see the same effects... not
surprising!

In particular, it's really the use of multiple streams which causes
big problems. The difference between SCSI and IDE is also significant.
(That difference is hopefully improved by the elevator patches.)

--Stephen
O_DIRECT architecture (was Re: info point on linux hdr)
Hi,

On Mon, Apr 17, 2000 at 05:58:48PM -0500, Steve Lord wrote:

> O_DIRECT on Linux XFS is still a work in progress, we only have
> direct reads so far. A very basic implementation was made available
> this weekend.

Care to elaborate on how you are doing O_DIRECT?

It's something I've been thinking about in the general case. Basically
what I want to do is this:

Augment the inode operations with a new operation, "rw_kiovec" which
performs reads and writes on vectors of kiobufs.

Provide a generic_rw_kiovec() function which uses the existing page-
oriented IO vectors to set up page mappings much as generic_file_{read,
write} do, but honouring the following flags in the file descriptor:

 * O_ALIAS

   Allows the write function to install the page in the kiobuf
   into the page cache if the data is correctly aligned and there is
   not already a page in the page cache.

   For read, the meaning is different: it allows existing pages in
   the page cache to be installed into the kiobuf.

 * O_UNCACHE

   If the IO created a new page in the page cache, then attempt to
   unlink the page after the IO completes.

 * O_SYNC

   Usual meaning: wait for synchronous write IO completion.

O_DIRECT becomes no more than a combination of these options.

Furthermore, by implementing this mechanism with kiobufs, we can go
one step further and perform things like Larry's splice operations by
performing reads and writes in kiobufs. Using O_ALIAS kiobuf reads and
writes gives us copies between regular files entirely in kernel space
with the minimum possible memory copies. sendfile() between regular
files can be optimised to use this mechanism. The data never has to
hit user space.

As an example of the flexibility of the interface, you can perform an
O_ALIAS, O_UNCACHE sendfile to copy one file to another, with full
readahead still being performed on the input file but with no memory
copies at all. You can also choose not to have O_UNCACHE and O_SYNC on
the writes, in which case you have both readahead and writebehind with
zero copy.

This is all fairly easy to implement (at least for ext2), and gives us
much more than just O_DIRECT for no extra work.

--Stephen
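To make the proposal concrete, here is one illustrative guess at how
the proposed interface could be declared. None of this existed in the
kernel at the time of the mail; rw_kiovec, generic_rw_kiovec, O_ALIAS
and O_UNCACHE are all part of the proposal above, and the flag values
and prototypes shown are placeholders, not Stephen's code.

    /* Illustrative sketch only - declarations guessed from the
     * proposal above, not an existing kernel interface. */
    #include <linux/fs.h>
    #include <linux/iobuf.h>

    /* Proposed per-fd flags (placeholder values). */
    #define O_ALIAS   0x1000000  /* allow aliasing pages between kiobuf
                                    and page cache                     */
    #define O_UNCACHE 0x2000000  /* try to drop pages the IO added to
                                    the page cache when it completes   */

    /* Proposed addition to the inode operations: read or write a
     * vector of kiobufs at a given file offset, honouring O_ALIAS,
     * O_UNCACHE and O_SYNC from the file descriptor. */
    struct inode_operations_ext {
        int (*rw_kiovec)(struct file *file, int rw,
                         int nr, struct kiobuf *iovec[],
                         unsigned long flags, loff_t offset);
    };

    /* Generic implementation that filesystems could share: sets up
     * page-cache mappings much as generic_file_read/write do, aliasing
     * or copying pages as the flags permit. */
    int generic_rw_kiovec(struct file *file, int rw,
                          int nr, struct kiobuf *iovec[],
                          unsigned long flags, loff_t offset);

    /* Under the proposal, O_DIRECT is simply expressed as a
     * combination of these flags plus O_SYNC for writes. */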