Re: [Haifux] Real-time write on *ANY* filesystem
On Wed, 22 Jun 2005, Eli Billauer wrote:
> Insights, anybody?

Reliable high-speed continuous writes on a multitasking system like Linux are not possible. The next time there is network load, or a daemon wakes up, your task will be scheduled out. You can do it with RTLinux, VMware or QNX. The closest you can get on Linux is to use a SCHED_RR scheduled process with high priority as root, and poll or select on the relevant file descriptors, all set to non-blocking mode (O_NDELAY), with the buffer and the program locked in memory. If you have to maintain the speed, you must buy an AV disk, which is specifically rated not to recalibrate from time to time.

Look at the source code of cdrecord, for example, for clues. It uses all the methods I have enumerated, plus mmap. Using these methods it is possible to raise the priority of a single task so high that the system will be unusable for anything else (it appears frozen, except for the high-priority task). You must find your tradeoff. TiVo boxes and the like do this (they don't do much besides recording and playing video, so it's OK there).

Peter

#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

char junkdata[32768];

int main(int argc, char **argv)
{
	FILE *f;
	struct timeval now, sleeptime;
	struct timezone junk;
	long prev_sec, prev_usec, deltat;
	int i;

	if (argc != 2) {
		fprintf(stderr, "Usage: %s output-filename\n", argv[0]);
		return 1;
	}

	f = fopen(argv[1], "wb");
	if (f == NULL) {
		fprintf(stderr, "Failed to open output file %s\n", argv[1]);
		return 1;
	}

	if (gettimeofday(&now, &junk) != 0) {
		fprintf(stderr, "gettimeofday() failed!\n");
		return 1;
	}

	prev_sec = now.tv_sec;
	prev_usec = now.tv_usec;

	for (i = 0; i < 2047*32; i++) { /* Almost 2 GB */
		/* Time to sleep between writes. Check this program's output
		   values to see what you actually got as typical values.
		   They may be significantly longer than requested due to
		   context switch overhead when the desired time is very
		   short. */
		sleeptime.tv_sec = 0;
		sleeptime.tv_usec = 100; /* Yeah, right. The real value
					    will be much longer */

		/* Next line should be commented out for full-speed */
		/* select(0, NULL, NULL, NULL, &sleeptime); */

		if (fwrite(junkdata, 1, sizeof(junkdata), f) != sizeof(junkdata)) {
			fprintf(stderr, "Data write failed!\n");
			return 1;
		}

		if (gettimeofday(&now, &junk) != 0) {
			fprintf(stderr, "gettimeofday() failed!\n");
			return 1;
		}

		deltat = now.tv_sec - prev_sec;
		deltat = deltat * 1000000; /* seconds to microseconds */
		deltat += now.tv_usec - prev_usec;
		prev_sec = now.tv_sec;
		prev_usec = now.tv_usec;

		printf("%ld\n", deltat);
	}

	fclose(f);
	return 0;
}
Re: [Haifux] Real-time write on *ANY* filesystem
On Wed, Jun 22, 2005 at 02:41:39AM +0200, Eli Billauer wrote:
> And finally: does a RAM FIFO help? Surprisingly, the answer is no. I did the following:
>
> mknod mypipe p
> mbuffer -i mypipe -o /fatfs/output-file &
> ./writefat mypipe > listfile

Until 2.6.mumble, pipes only used a single page in memory. Since 2.6.mumble we're using up to 16 pages, flipping between consumer and producer, which should give much better pipe utilization for large writes.

> Insights, anybody?

Yeah, how about you cut out the various middlemen from the code? At least it's not in Java...

- use write(), not fwrite()!!!
- use O_DIRECT to bypass kernel caching
- use the appropriate I/O elevator
- verify that your disk drivers are tuned for whatever you want to do (is DMA on?)
- what else is the system doing? Is it idle? Busy? Is anything else interfering with the scheduling of your program?

Linux is a general-purpose OS, which means it's good for a lot of things and optimal for none. If you want it to be optimal for your specific usage, you should spend some time optimizing and tuning it for that usage. And that's true regardless of what your usage happens to be.

Cheers, Muli
-- 
Muli Ben-Yehuda
http://www.mulix.org | http://mulix.livejournal.com/
-- 
Haifa Linux Club Mailing List (http://www.haifux.org)
To unsub send an empty message to [EMAIL PROTECTED]
Re: [Haifux] Real-time write on *ANY* filesystem
First, thanks for an interesting test plan. Just a quick note (I currently don't have time for retesting):

On Wednesday 22 June 2005 03:41, Eli Billauer wrote:
> And finally: does a RAM FIFO help? Surprisingly, the answer is no.

Since you use stdio (fwrite), which by default does full buffering in user space (cf. setvbuf(3)), this does not surprise me. Repeating the test with open/write/close etc. would give more significant results (although I suspect they would only be worse :-( ).

Tzahi mentioned XFS. While I'm not sure XFS would help with small write chunks (Reiserfs seems like a better candidate for these), I'd like to mention a related feature the original XFS had on Irix (I think this feature wasn't ported to Linux): you could assign a special sub-volume in the filesystem as real-time. The kernel would then give absolute priority to I/O requests related to that sub-volume. In marketing speech, this would give you guaranteed I/O response time (although I don't remember seeing any specific constraints on this [that's why I call it marketing]).

Any other ideas, anybody?

-- 
Oron Peled                 Voice/Fax: +972-4-8228492
[EMAIL PROTECTED]
http://www.actcom.co.il/~oron
ICQ UIN: 16527398

"If I have been able to see farther, it was only because I stood
on the shoulders of giants." -- Sir Isaac Newton
Re: [Haifux] Real-time write on *ANY* filesystem
Muli Ben-Yehuda wrote:
>> mknod mypipe p
>> mbuffer -i mypipe -o /fatfs/output-file &
>> ./writefat mypipe > listfile
>
> Until 2.6.mumble, pipes only used a single page in memory. Since 2.6.mumble we're using up to 16 pages, flipping between consumer and producer, which should give much better pipe utilization for large writes.

Note that mbuffer is the RAM FIFO, and it was empty all the time (as one could expect). Since mbuffer never blocked, I don't think it matters how good the pipe between them is. This is why I found it weird that I got delays at all when using a RAM FIFO.

> - use write(), not fwrite()!!!
> - use O_DIRECT to bypass kernel caching
> - use the appropriate I/O elevator

Are these general guidelines for writing fast I/O, or are there good reasons to suspect that one of these causes occasional long blocks? Keep in mind that 3 MB/s isn't fast at all. It's not like I care about a long average delay; it's the peaks.

Besides, it's all nice when I write the application myself. But usually what we do is use some prewritten software. In my case I could hack it (as I've already done for other reasons), but this still looks like a kernel problem to me.

> - verify that your disk drivers are tuned for whatever you want to do (is DMA on?)
> - what else is the system doing? is it idle? busy? is anything else interfering with the scheduling of your program?

Yes, DMA is on for both computers. And at least on the laptop, there shouldn't be anything else running (not even X).

> Linux is a general purpose OS, which means it's good for a lot of things and optimal for none.

Well, as it turns out, it's not so good for a rather mainstream multimedia recording task. At least not on my computers. I don't need optimal; I need reasonable.

It would be nice if some of you tried the program I sent, so we can see whether you get the same results. Note that the real action begins when the partition you write to is getting full.

Regards, Eli
-- 
Web: http://www.billauer.co.il
Re: [Haifux] Real-time write on *ANY* filesystem
On Wed, 22 Jun 2005, Eli Billauer wrote:
> It turns out that it's not a FAT issue, but that the same problem occurs on ext3 systems as well. I've written a small program to test the delays between writes, and the results are not very encouraging. Especially when the disk gets full (it always does, doesn't it?).

i don't think it's that the disk gets full. i think it's that the page-cache gets full. try this: get a partition that is already quite full, and run the test on it. you will not see this problem. or: get a very, very large partition (e.g. 30 GB of free space) and run it there - you'll notice the problem when the amount of data you wrote _lately_ gets large.

the page cache of the linux system is, by default, tuned for overall throughput, not for worst-case latency per I/O. i don't think that changing the elevator algorithm will help.

things i would try:

1. what muli said: using O_DIRECT would help, but it turns all the I/O synchronous, which might give you too slow a throughput (can't tell without trying). to overcome this, you will need a combination of application-based buffering and direct/raw I/O - e.g. taking the source of mbuffer and making sure it works with O_DIRECT, or even with a 'raw' device (available on some linux distributions, not all).

yes, this is a problem with the linux system (as a whole).

--guy

> In my opinion, this should concern anyone who wants to use a Linux box for storing a data stream (audio, video, whatever). I've attached the source of the program I used. Basically, it loops on writing 32 kB chunks of data to a file, creating a list of numbers telling how much time (in microseconds) elapsed since the last loop (to stdout). There are two modes of testing: one is to let it write as fast as possible, and the second is to put delays between writes, which simulates waiting for incoming data. If there is enough room, the program will write slightly less than 2 GB (guess why?). Since Linux is a multitasking system, the results are not exactly repeatable.
>
> But the general impression is that writing to FAT or to ext3, on my laptop or on my desktop, they all behave more or less the same.
>
> The first test regards full-speed writes. Data was simply written as fast as possible. For a non-full partition, the write operation typically dwelled 5.5 ms, with occasional bursts of 0.7-0.9 *seconds* of delay on the write operation. When the partition gets full, things get even nastier. Blocking of several seconds was observed: delays of 5 seconds, and up to 14 seconds, typically appeared a few times during a 2 GB writing session.
>
> Then I added a short sleep in the loop, in order to simulate data written at ~3 MB/s (which is reasonable for video capturing). This is far below the disk's physical capacity. The disk LED showed occasional flushes. Results: for a non-full partition, occasional peaks of up to 60 ms were observed, which is something one can probably live with. At 3 MB/s this means 180 kB stuck in the buffer. But when the partition started to get full, peaks of 0.2-0.3 seconds started to appear. The latter means 900 kB waiting to go out, and this maybe explains why I originally had problems.
>
> If you want to see how your system behaves, just compile the attached code and go:
>
> ./writefat output-junk-data-file > listfile
>
> The list of loop timings will be in listfile. Use your favourite number cruncher to view graphs. (The program's name is due to historic reasons...) If you want to test the slower writing speeds, check the typical delay in the listfile, or see how fast the output file grows. The sleep period defined in the program itself is not reliable, since the operating system may not be able to sleep for very short periods.
>
> And finally: does a RAM FIFO help? Surprisingly, the answer is no. I did the following:
>
> mknod mypipe p
> mbuffer -i mypipe -o /fatfs/output-file &
> ./writefat mypipe > listfile
>
> and was quite surprised to find delays of 0.2 sec. BTW, mbuffer seems to force the data to be flushed to disk much more often. The disk LED showed that writes occurred all the time, unlike the direct write to a file, in which flushes occurred only occasionally. And mypipe and listfile are on ext3, while the output file is on FAT.
>
> Insights, anybody?
>
> Eli
> -- 
> Web: http://www.billauer.co.il

-- 
guy

"For world domination - press 1, or dial 0, and please hold, for the creator." -- nob o. dy
Re: [Haifux] Real-time write on *ANY* filesystem
guy keren wrote:
> i don't think it's that the disk gets full. i think it's that the page-cache gets full. try this: get a partition that is already quite full, and run the test on it. you will not see this problem.

Well, you may get other results if you test it, but what I saw was that if the partition was about to get full, I got one behaviour. I ran the same test after deleting some gigabytes of data from the partition, and got something much better. Back and forth. This is how I reached the conclusion.

The question I find appealing in this context is when the filesystem looks for free blocks. If it does so only on demand, this would explain what happens. IMHO, it would make sense to fire off some tasklet (?) whenever the pool of free blocks starts to get empty, but I have no idea how it really works.

Eli
-- 
Web: http://www.billauer.co.il
Re: [Haifux] Real-time write on *ANY* filesystem
Muli Ben-Yehuda wrote:
> Where can I find the source for mbuffer?

http://www.rcs.ei.tum.de/~maierkom/privat/software/mbuffer/

I downloaded 20011008 (the latest version didn't compile).

> Which kernel are you using?

I'm on 2.4.22 and 2.4.21 (yeah, yeah, retro).

As for the results you posted: it's the peaks I'm after, not the tail. The peaks appear anywhere in the list, so the best thing is to draw a graph of these numbers.

Thanks, Eli
-- 
Web: http://www.billauer.co.il
Re: [Haifux] Real-time write on *ANY* filesystem
On Wed, Jun 22, 2005 at 03:08:45PM +0200, Eli Billauer wrote:
> As for the results you posted: it's the peaks I'm after, not the tail. The peaks appear anywhere in the list, so the best thing is to draw a graph of these numbers.

This is the tail of the distribution - i.e., the peaks (generated via "sort -n $file | tail -15").

Cheers, Muli
-- 
Muli Ben-Yehuda
http://www.mulix.org | http://mulix.livejournal.com/
Re: [Haifux] Real-time write on *ANY* filesystem
On Wed, 22 Jun 2005, Eli Billauer wrote:
> Well, you may get other results if you test it, but what I saw was that if the partition was about to get full, I got one behaviour. I ran the same test after deleting some gigabytes of data from the partition, and got something much better. Back and forth. This is how I reached the conclusion.

so it _could_ be that, due to fragmentation, instead of writing a large set of data consecutively, the system wrote this large set of data in several write commands to different parts of the hard drive.

> The question I find appealing in this context is when the filesystem looks for free blocks. If it does so only on demand, this would explain what happens.

the file system contains a list of all free blocks. it looks for a free block _from this list_ when there is a need for a new free block. furthermore, it usually does not allocate a single block - rather, it tries to pre-allocate several consecutive blocks, assuming they'll soon be needed. it does this in order to avoid spreading the file all over the disk.

-- 
guy
RE: [Haifux] Real-time write on *ANY* filesystem
General insight: try XFS. It tries to avoid disk operations as much as possible, and it should outperform ext3 on large files.

Regards,
tzahi.

-----Original Message-----
From: Haifux - Haifa Linux Club [mailto:[EMAIL PROTECTED]] On Behalf Of Eli Billauer
Sent: Wednesday, June 22, 2005 2:42 AM
To: Haifa Linux Club Mailing list
Subject: [Haifux] Real-time write on *ANY* filesystem

Hello again,

It turns out that it's not a FAT issue, but that the same problem occurs on ext3 systems as well. I've written a small program to test the delays between writes, and the results are not very encouraging. Especially when the disk gets full (it always does, doesn't it?).

[...]