Re: [linux-audio-dev] Performance and SCSI
I was trying to understand why the performance you get is so much better than what Benno and I get, and I'm now fairly convinced it's the hardware. [ ... ]

Yes, but a WARNING: anyone thinking of moving to 2.3.99pre5 or above: there appears to be a serious bug that will kill ardour's performance (as well as Benno's test programs). I've posted a followup on linux-kernel to the person who first noticed it. Basically, writing a few hundred MB causes the box to slow to a crawl as memory utilization goes through the roof. This was not the case in 2.3.51. Right now, I cannot run Benno's hdtest program (it will never complete), and ardour cannot record more than a minute or so of a handful of tracks without a dropout. Again, this appears to be a new bug introduced somewhere in 2.3.99.

--p
Re: [linux-audio-dev] more preallocation vs no prealloc / async vs sync tests.
./hdtest 500 async trunc
    SINGLE THREADED: 5.856 MByte/sec
    MULTI-THREADED:  6.096 MByte/sec
./hdtest 500 async notrunc (rewrite to preallocated files)
    SINGLE THREADED: 4.040 MByte/sec
    MULTI-THREADED:  4.766 MByte/sec
./hdtest 150 sync trunc
    SINGLE THREADED: 1.442 MByte/sec
    MULTI-THREADED:  0.121 MByte/sec (floppy-like performance :-) )
./hdtest 150 sync notrunc
    SINGLE THREADED: 4.788 MByte/sec
    MULTI-THREADED:  1.984 MByte/sec

PS: Paul, run it on your 10k rpm SCSI disk so that we can do some comparison.

I hope you are ready for some *very* different numbers.

/tmp/hdtest 500 async trunc
    SINGLE THREADED: 12.788 MByte/sec
    MULTI-THREADED:  12.788 MByte/sec
/tmp/hdtest 500 async notrunc
    SINGLE THREADED: 6.096 MByte/sec
    MULTI-THREADED:  6.168 MByte/sec
/tmp/hdtest 150 sync trunc
    SINGLE THREADED: 11.292 MByte/sec
    MULTI-THREADED:  12.233 MByte/sec
/tmp/hdtest 150 sync notrunc
    SINGLE THREADED: 5.437 MByte/sec
    MULTI-THREADED:  6.383 MByte/sec

A few notes. In the source you sent, you are not doing 256kB writes, but 1MB writes, since you defined MYSIZE as (262144*4). This is puzzling. However, changing it to 256kB doesn't change the results in any significant way, as far as I can tell.

It troubles me that the ongoing rate display is always significantly higher than the eventual effective speed. I understand the reason for the initially very high rate, but I typically see final rates from the ongoing display that are very much higher than those in your effective rate display (e.g. 13MB/sec versus 5.5MB/sec, 20MB/sec versus 12MB/sec). I don't have the time to stare at the source and figure out why this is.

It's very interesting that writing to pre-allocated files is 50% slower for me, even though your pre-allocation strategy causes block-interleaving of the files. I suspect, but at this time cannot prove, that this is due (in my case at least) to fs fragmentation. I will try the benchmark on a clean 18GB disk the next time I'm over at the studio.
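For readers following along, here is a minimal sketch of the kind of chunked-write loop under discussion (illustrative only - this is not Benno's actual hdtest source, and the function name is made up). It writes `total` bytes in 256kB write()s and returns an effective rate in MByte/sec, with the final fsync() inside the timed region - which is one reason an "ongoing" rate display, measured before any flush, can read much higher than the effective rate.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>
#include <unistd.h>

#define CHUNK (256 * 1024)        /* 256kB - note (262144*4) would be 1MB */

/* Write `total` bytes to `path` in CHUNK-sized write()s and return the
 * effective rate in MByte/sec (or -1.0 on error).  The fsync() is
 * deliberately counted in the elapsed time, so buffered writes cannot
 * inflate the result. */
double effective_write_rate(const char *path, size_t total)
{
    static char buf[CHUNK];       /* contents don't matter for timing */
    struct timeval t0, t1;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1.0;

    gettimeofday(&t0, NULL);
    for (size_t done = 0; done < total; ) {
        size_t n = (total - done < CHUNK) ? total - done : CHUNK;
        if (write(fd, buf, n) != (ssize_t)n) { close(fd); return -1.0; }
        done += n;
    }
    fsync(fd);                    /* include the flush in the timing */
    gettimeofday(&t1, NULL);
    close(fd);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    return (total / secs) / (1024.0 * 1024.0);
}
```

Something like `effective_write_rate("/tmp/hdtest.dat", 500u * 1024 * 1024)` would then correspond to one single-threaded "500 async trunc" run.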
Stephen Tweedie or someone else would know the answer to my last question: I am wondering if contiguous allocation of fs blocks to a file reduces the amount of metadata updating? Does metadata belong to a fixed-sized unit, or an inode, or a variable-sized unit, or some combination? I ask this because I see some visual indication of the disk stalls you have talked about when running your hdtest program (it may just be paging issues, however - hard to tell), and I still have not seen them in ardour. Assuming for a second that these are real stalls, one obvious difference is that your preallocation strategy does not produce contiguous files. --p
Re: [linux-audio-dev] Re: More results and thoughts on disk writing performance
Incidentally, on the systems I tested it appears that preallocation *slows down* data writing. Paul, have you compared your system with and without preallocation? What speed difference do you see?

EXACTLY! I am experiencing the same! After Paul praised the preallocation so much, I decided to test it, and I get about a 20% performance slowdown compared to running without preallocation. . . . I can't explain why we experience the slowdown with preallocation.

No big surprise here. Suppose you write the files in 256kB chunks, and re-read them in the same way. If ext2 behaves the way I would expect (*) it to, you end up with somewhat-to-totally block-interleaved files that are read with little or no seeking (because the read pattern will exactly match the write pattern). The problem with not preallocating occurs only on the first write, and to be honest, my preallocation scheme should be changed to mirror the actual actions of a true "first write" by block-interleaving the files instead of aiming for complete per-file contiguity.

The one difficulty with this is that if you change the size of the i/o requests, you may get *worse* performance. At one time, I imagined this size to be rather fluid, but it now appears likely to assume a fairly constant value across all disks on all systems (and certainly on any particular system). This removes my only real objection to block-interleaving the files. I will change the way the files are pre-allocated and see if it speeds things up even more.

Again, just in case anyone missed it: I have never encountered the problems Benno has had (or at least, not from the same underlying causes - I used to have disk i/o performance problems), probably due to my use of SCSI h/w, and ardour is a working multichannel hard-disk recorder.

No, you are wrong here: the audio thread requires higher priority because it needs finer granularity (we want low-latency response from our HD recorder).
The audio thread releases the CPU during blocking write()s to the audio device, giving the disk thread all the time it needs to perform large disk I/Os, which block the disk thread almost all the time. Therefore you gain NOTHING (= zero disk performance increase) by giving the disk thread higher priority than the audio thread, except that the audio will drop out at some point.

Fully agreed. --p

(*) ext2 filesystems have a pre-allocation distance which someone mentioned. I am hoping that allocating 256kB at a time makes this figure irrelevant, but I am not at all sure that this is true.
Re: [Fwd: [linux-audio-dev] info point on linux hdr]
2) Why am I not having any of these problems? Unlike Benno's code, I have a working application that runs just fine. I get smooth throughput from the disk subsystem too.

What do you mean exactly by "unlike Benno's code"? My code just tries to simulate the operation of a busy harddisk recorder, using the sorting algorithm to support variable speed. I read in 256kB chunks too, therefore I don't see a big difference between my code and yours from a disk IO subsystem POV.

In your own words, your code "simulates the operations of a busy harddisk recorder". Mine *is* a busy harddisk recorder. There's a lot of stuff in my code that isn't in yours, because I have a bunch of real-world stuff like a tape transport mechanism, extra inter-thread communication, MTC delivery, an audio thread event/play list, and more. Yet, despite all this, I have never run into the problems you describe. As Stephen has noted, this may very well be because of my use of SCSI h/w.

Therefore it would be useful if you could run my benchmark on your disk, to see whether your approach (or mine) gets better performance out of the disk, and with how much buffer utilization.

When I get a minute or 30, I will.

In both cases, without O_SYNC, or anything else but preallocation and careful design, I seem to be able to get smooth disk throughput at significantly above the rate I need (9MB/sec; I get up to 17MB/sec from the UltraStar).

17MB/sec using hdparm or linear large reads/writes (large cat / cp etc.), or 17MB/sec within your harddisk recording app, where num_tracks * datarate_of_each_track = 17MB/sec? (If it's the latter then I doubt it, because seeking kills some of the throughput; that's almost unavoidable, at least on my EIDE UDMA disks.)

No, I do mean the latter: 17MB/sec from within ardour. You can doubt it all you want, but I get it regularly.
Actually, the real numbers look more like this (from memory; each line is one iteration of disk i/o across all tracks, so 24*256kB of data):

    15MB/sec
    450MB/sec
    10MB/sec
    14MB/sec
    567MB/sec
    16MB/sec
    19MB/sec
    8MB/sec
    378MB/sec

The super-high numbers, I assume, are because of the read-ahead being done by the kernel, which helps us out every so often. Remember, these files are as contiguous as I can make 'em with ext2. And keep in mind that my disks have a maximum transfer rate of 35MB/sec (nothing to do with U2W - just that they are about the latest disks). I have a very small, standalone single-threaded test app that gets similar rates, even though it does random-sized seeks across the whole disk. I've posted that program before on LAD, and so I trust the numbers. --p
Re: [Fwd: [linux-audio-dev] info point on linux hdr]
From: Paul Barton-Davis [EMAIL PROTECTED]

I mentioned in some remarks to Benno how important I thought it was to preallocate the files used for hard disk recording under Linux.

Preallocation will make little difference. The real issue is that the buffer cache is doing write-behind, i.e. it is batching up the writes into big chunks which get blasted to disk once every five seconds or so, causing large IO request queues to accumulate when that happens. That's great for normal use, because it means that trickles of write activity don't tie up the spindles the whole time, but it's not ideal for audio recording.

Acknowledging your much greater wisdom in this area than mine, I don't understand the above given that, in my experience:

1) Pre-allocation takes a *long* time. Allocating 24 x 203MB files on a clean ext2 partition of 18GB takes many, many minutes, for example. Presumably, the same overhead is being incurred when block allocation happens "on the fly". If so, then even if pre-allocation doesn't solve the buffer cache write batching problem, it certainly gets rid of what appears to be an onerous task.

2) Why am I not having any of these problems? Unlike Benno's code, I have a working application that runs just fine. I get smooth throughput from the disk subsystem too.

My configuration (there are two):

    Kernel 2.3.52                       Kernel 2.3.52
    Dual PII-450                        Dual PII-450
    on-board Adaptec 7890               on-board Adaptec 7890
    Seagate 4.5GB Cheetah U2W 10K rpm   IBM 9GB UltraStar U2W 10K rpm
    Quantum 4.5GB Viking U2W 7.5K rpm   3 x IBM 18GB UltraStar

In both cases, without O_SYNC, or anything else but preallocation and careful design, I seem to be able to get smooth disk throughput at significantly above the rate I need (9MB/sec; I get up to 17MB/sec from the UltraStar). In every case, I am doing disk I/O from a dedicated thread to a single disk.

I'm confused. Is it just that I'm running on a genuine SMP system?

--p
Re: [Fwd: [linux-audio-dev] info point on linux hdr]
2) Why am I not having any of these problems? Unlike Benno's code, I [ ... ]

    Seagate 4.5GB Cheetah U2W 10K rpm   IBM 9GB UltraStar U2W 10K rpm
    Quantum 4.5GB Viking U2W 7.5K rpm   3 x IBM 18GB UltraStar

Ahh --- SCSI. The request queuing for SCSI is very different to that for non-SCSI devices.

Different enough that you feel it's likely to explain the significant difference between my experience and that of both Benno and Juhana when trying to record to disk? Is it different enough to explain why the buffer cache write-behind batching doesn't seem to show up as a problem for me?

Stephen - thanks for paying attention and giving us time on this. I know you have a lot to work on, and that HDR is not what most people consider to be a hot Linux application; our little minority is pretty fanatical :)

The application I have ("ardour") stands in relation to existing HDR systems in the same way that Linux or *BSD-based routers stand to Cisco h/w, with the difference being that very few people own the dedicated h/w yet, and so we have a real chance to provide a genuine, new service for people by getting this to work. --p
Re: [Fwd: [linux-audio-dev] info point on linux hdr]
It depends. Using more threads can lead to more conflicting IO seeks unless you can schedule enough IOs at once to give the scheduler a good chance to sort the IOs into decent-sized blocks. The objective should probably be to make sure that you have a few hundred KB of outstanding IO requests on each stream at any one time. That can be done either with lots of threads submitting small IOs, or a few threads submitting large IOs. Just adding a few more threads but still performing small IOs will definitely not help.

Right now, I'm using 1 thread with 256kB IO requests, since 256kB seems to give me optimal throughput. Obviously, measuring it is the way to go, but do you have any sense of whether it would be worth submitting 24 of these at more or less the same time, or 12, or just 1? --p
Re: [Fwd: [linux-audio-dev] info point on linux hdr]
It would be interesting to compare filesystem latencies in the HD recording case. As I said, it's amazing how long the disk thread can get blocked during a large buffer flush/metadata update; on a PII I watched the disk thread block for several seconds (up to 8 secs in the worst case). That means at a datarate of 200kB/sec per audio track, sometimes you would need up to 2MB of ringbuffer per track. Multiply this value by 50 tracks, and you get 100MB of precious RAM wasted on nothing but buffering. I am convinced that we can do the job with 0.5MB per track when using RAW IO. (Windoze hdrecording apps seem to work well with these amounts of buffering.)

Well, first of all, as I've mentioned already on LAD, by preallocating the files, this problem appears to vanish. Second, it's unclear whether the Windows/MacOS apps do preallocation or not (though it is clear that they are very fragmentation-sensitive). Thirdly, the amount of user-space buffering that's needed is not just a function of jitter in the apparent disk throughput rate.

Benno and I have been through this before with respect to audio h/w buffer usage, and we concluded that it's very advantageous to use 3 fragments there, precisely to protect against jitter. I think that the same applies here - you want the user-space buffer divided into at least 3 "fragments". When fragment N is in use by the audio thread, the "previous" fragment is being handled by the butler thread (for read-ahead and flush to disk). In theory, 2 would be enough, but if for any reason the butler is ever slowed down for one iteration, so that it fails to finish processing its buffer before the audio thread needs it again, having 3 provides a way to avoid problems.

If that's correct, then we next have to look at the appropriate size of those fragments. Since they are intended to correspond to single disk i/o requests, they need to be sized so that we get maximal disk throughput. My experiments have suggested that a 256kB disk i/o request is the smallest size that gives maximal throughput. If so, then the lower bound on the amount of buffering is not really "optional", but is 3 x 256kB. You can substitute other numbers here for the number of fragments and the size of the disk i/o requests, but the principle will remain: the amount of buffering is not really a function of time (seconds), but of the way you subdivide the buffer for disk i/o and the optimal disk i/o request size. --p
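The triple-buffering arithmetic described above can be sketched as follows (names are mine, not ardour's, and this omits the locking a real implementation needs): three 256kB fragments per track, where the audio thread owns one fragment, the butler works on the one the audio thread most recently finished, and the third is slack that absorbs one late butler iteration.

```c
#define FRAG_SIZE (256 * 1024)
#define NFRAGS    3          /* lower bound on buffering: 3 x 256kB */

struct track_buf {
    char data[NFRAGS][FRAG_SIZE];
    int  audio_frag;         /* fragment in use by the audio thread */
};

/* Fragment the butler should be flushing (or refilling): the one the
 * audio thread most recently finished with. */
int butler_frag(const struct track_buf *b)
{
    return (b->audio_frag + NFRAGS - 1) % NFRAGS;
}

/* Called when the audio thread finishes its fragment.  The fragment
 * it moves into is the spare: if the butler is one iteration late,
 * the audio thread still has somewhere to go. */
void audio_advance(struct track_buf *b)
{
    b->audio_frag = (b->audio_frag + 1) % NFRAGS;
}
```

With only 2 fragments, audio_advance() would move the audio thread straight into the fragment the butler might still be working on; the third fragment is what buys one iteration of slack.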
Re: [Fwd: [linux-audio-dev] info point on linux hdr]
Unfortunately, efficient preallocation is rather hard with the current ext2. To do it efficiently, you want to allocate the blocks in the bitmaps without writing into the actual allocated blocks (otherwise it would be as slow as the manual write-every-block-from-userspace trick).

Yes, it's slow, but it's not hard to design an application so that it rarely needs to be done. That said, when I was creating a 24-track, 40-minute "tape" for ardour the other day, I managed to reheat lunch, eat it, and play with my friend's son for a while during the creation process :) --p