Re: More on 2.2.18pre2aa2
On 2000-09-11 09:22:23 -0400, Chris Mason wrote:
> Thanks Andrea, Andi, new patch is attached, with the warning messages
> removed. The first patch got munged somewhere between test machine and
> mailer, please don't use it.

I've been hammering this all day installing the relevant tools and
building win32 mozilla under vmware (the drives being 2Gb files on a
reiserfs partition). This partition also doubles as /home. Very stable
so far, and having Andrea's VM patches in (I usually didn't put them in)
has made a noticeable difference - xmms has rarely skipped and things
start faster and run smoother. Hopefully I'll see the same (or better)
results from Rik's new VM in 2.4 when that's released.

# free
             total       used       free     shared    buffers     cached
Mem:        392792     390232       2560          0      25948     261484
-/+ buffers/cache:     102800     289992
Swap:       196548          0     196548

# vmstat 2
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache  si  so    bi    bo   in    cs  us  sy  id
 1  0  0      0   2412  26084 261496   0   0     8    17  454   922  17  40  44
 1  0  0      0   2292  26204 261496   0   0     2    13  443   516   6  50  45
 2  0  0      0   2244  26252 261496   0   0     0     0  439   367  13  38  49
 1  0  0      0   2784  26016 261192   0   0    59     0  453   815   8  50  42
 1  0  0      0   2420  26152 261420   0   0    30    23  460   675   4  59  37
 1  0  0      0   2252  26208 261532   0   0    15     0  445   465   5  55  40
 3  0  0      0   2084  26252 261656   0   0    18    19  450   446   4  56  40
 3  0  0      0   2056  26280 261656   0   0     1     0  442   366   3  54  43
 4  0  0      0   3012  25868 261108   0   0    18     0  443   423   9  45  46
 2  0  0      0   2348  25932 261708   0   0    75    19  463   703   6  51  44

FWIW the reiserfs partition is on a 30Gb IBM 75GXP attached to a Promise
ATA100 PCI controller. I have Andre's 2904 ide patch in. Nice drive if
you have to use IDE, and I have no complaints about the controller
either.

--
* Matthew Hawkins <[EMAIL PROTECTED]>  :(){ :|:&};:
** Information Specialist, tSA Group Pty. Ltd.  Ph: +61 2 6257 7111
*** 1 Hall Street, Lyneham ACT 2602 Australia.
Fx: +61 2 6257 7311
-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to [EMAIL PROTECTED]
Please read the FAQ at http://www.tux.org/lkml/
Re: More on 2.2.18pre2aa2
On Tue, 12 Sep 2000, Matthew Hawkins wrote:
> Very stable so far, and having Andrea's VM patches in (I usually
> didn't put them in) has made a noticeable difference - xmms has rarely
> skipped and things start faster and run smoother. Hopefully I'll see
> the same (or better) results from Rik's new VM in 2.4 when that's
> released.

I'm working on it. ;)

I've just uploaded a new snapshot of my new VM for 2.4 to my home page.
This version contains a wakeup_kswapd() function (copied from
wakeup_bdflush) and should balance memory a bit better. Also, streaming
IO seems to be almost back on track.

The large IO delays I'm seeing in certain tests have been traced back
to the /elevator/ code. I think I'll be playing with the "close the
door on timeout" idea Jeff Merkey proposed some time ago...

(I talked about this with Jens Axboe on irc ... a patch to solve the
elevator problems and maybe make a self-tuning elevator (with maximum
latency specified in /proc?) should be ready later today)

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
  -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/  http://www.surriel.com/
Re: More on 2.2.18pre2aa2
On Tue, 12 Sep 2000, Rik van Riel wrote:
> The large IO delays I'm seeing in certain tests have been traced back
> to the /elevator/ code. I think I'll

Actually the elevator works as in 2.2.15 (before any fix). The latency
settings are too high. They should be around 250 for reads and 500 for
writes.

If you want better latency than that, you should reinsert the stuff
that we had until test1 that accounts for the position where the
request is put on the queue. We backed it out because it was making a
performance difference.

Even if you disable the elevator completely (so if you use
elevator_noop), with the long queues we have these days you'll still
get bad latency. Please try to decrease the size of the queue if even
"enabling" the latency control doesn't make a difference.

BTW, 2.2.18pre2aa2 put the elevator in sync with 2.4.x using the
settings 250 for reads and 500 for writes. Note that 250 for reads -
without the logic that takes into account the position the request is
put in the queue - is much more than 250 with the previous logic
included.

Andrea
Re: More on 2.2.18pre2aa2
On Tue, 12 Sep 2000, Andrea Arcangeli wrote:
> On Tue, 12 Sep 2000, Rik van Riel wrote:
> > The large IO delays I'm seeing in certain tests have been traced
> > back to the /elevator/ code. I think I'll
>
> Actually the elevator works as in 2.2.15 (before any fix). The latency
> settings are too high. They should be around 250 for reads and 500 for
> writes.

There's a much, much simpler solution.

We simply keep track of how old the oldest request in the queue is, and
when that request is getting too old (say 1/2 second), we /stop/ all
the others from entering their request into the queue (except if it can
be merged with another request ???).

And when either the queue gets emptied, OR the oldest request is below
some other threshold (say, 1/10th of a second), then we wake up the
other tasks and let them put their requests on the queue.

This is a very simple idea that gets the /number/ of requests batched
automagically right for the configuration of disks on the machine it's
running on. No need for a magic number of requests to be bypassed or
not, the user can specify the wanted latency and the system
automatically gets right what the user specifies...

(courtesy of Jeff Merkey who did this for Netware?)

regards,

Rik
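A minimal user-space model of this "close the door on timeout" idea might look as follows. All names and the jiffies-style thresholds are invented for illustration (HZ = 100 assumed, so 50 ticks = 1/2 s); this is a sketch of the scheme Rik describes, not kernel code:

```c
#include <assert.h>

/* Hypothetical thresholds in "jiffies" (HZ = 100): close the door when
 * the oldest request has waited 1/2 s, reopen below 1/10 s. */
#define DOOR_CLOSE_AGE 50
#define DOOR_OPEN_AGE  10

struct req_queue {
    long oldest_stamp;  /* arrival time of oldest queued request; -1 if empty */
    int  door_closed;   /* set once the oldest request gets too old */
};

/* Decide whether a new request may enter the queue at time `now`. */
static int may_queue(struct req_queue *q, long now)
{
    if (q->oldest_stamp < 0) {      /* empty queue: door always open */
        q->door_closed = 0;
        return 1;
    }
    long age = now - q->oldest_stamp;
    if (age >= DOOR_CLOSE_AGE)      /* oldest waited too long: close */
        q->door_closed = 1;
    else if (age <= DOOR_OPEN_AGE)  /* oldest is young again: reopen */
        q->door_closed = 0;
    return !q->door_closed;
}
```

Merging with an existing request would bypass this check, as Rik notes; that case is left out of the sketch.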
Re: More on 2.2.18pre2aa2
On Tue, 12 Sep 2000, Rik van Riel wrote:
> We simply keep track of how old the oldest request in the queue is,
> and when that request is getting too old (say 1/2 second), we /stop/
> all the others

Going as a function of time is obviously wrong. A blockdevice can write
1 request every two seconds or 1 request every msec. You can't assume
anything as a function of time _unless_ you have per-harddisk timing
information in the kernel.

Andrea
Re: More on 2.2.18pre2aa2
On Tue, 12 Sep 2000, Andrea Arcangeli wrote:
> On Tue, 12 Sep 2000, Rik van Riel wrote:
> > We simply keep track of how old the oldest request in the queue is,
> > and when that request is getting too old (say 1/2 second), we
> > /stop/ all the others
>
> Going in function of time is obviously wrong. A blockdevice can write
> 1 request every two seconds or 1 request every msecond. You can't
> assume anything in function of time _unless_ you have per harddisk
> timing informations into the kernel.

Uhmmm, isn't the elevator about request /latency/ ?

And if so, in what unit do you want to measure latency if it isn't
time?

regards,

Rik
Re: More on 2.2.18pre2aa2
On Tue, 12 Sep 2000, Rik van Riel wrote:
> Uhmmm, isn't the elevator about request /latency/ ?

Yes, but definitely not absolute "time" latency. How do you get a 1msec
latency for a read request out of a blockdevice that writes 1 request
in 2 seconds? See?

That was one of the first issues I was thinking about when I started
playing with the elevator. (and yes, some of my early patches were
setting a per-request timestamp using jiffies)

Note: I understand you can do as a function of time something that
works ok for a normal 10/20Mbyte/sec harddisk, but since the elevator
is used every time you write to any blockdevice out there, you also
have to take into account things like ZIP drives and whatever other
slow device that does 1Mbyte/sec I/O or even slower (as well as faster
devices). A zip drive is slow writing, but that doesn't mean it isn't
even slower while seeking.

Andrea
Re: More on 2.2.18pre2aa2
> Going in function of time is obviously wrong. A blockdevice can write
> 1 request every two seconds or 1 request every msecond. You can't
> assume anything in function of time _unless_ you have per harddisk
> timing informations into the kernel.

Andrea - latency is time, measured and perceived. Doing it time based
seems to make reasonable sense. I grant you might want to play with the
weighting per device, but actually keeping device-based dirty list
length limits based on average throughput is probably a lot more
productive.
Re: More on 2.2.18pre2aa2
On Tue, 12 Sep 2000, Alan Cox wrote:
> Andrea - latency is time measured and perceived. Doing it time based
> seems to make reasonable sense. I grant you might want to play with
> the weighting per

When you have a device that writes a request every two seconds, you
still want it not to seek all the time, because that would make it even
slower. No?

The point is very simple: if you want good latency, buy faster hardware
(and with faster hardware our current elevator can become even more
aggressive than the 1/2 second thing). You can't work around the
slowness of a slow device by putting the elevator in function of time;
that will only make the global system even slower.

BTW, about the rest of the proposal, we're just doing that. It's just
that we don't do it as a function of time, but as a function of how
many requests are passing a certain request. When too many requests
have passed a certain request, we stop further requests from passing it
again. Simple. You control the "how many requests can pass a request"
factor via elvtune. Reads and writes have different factors.

Andrea
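As a rough illustration of the pass-count scheme Andrea describes: each queued request carries a credit of how many later requests may still jump ahead of it, and once the credit is exhausted nothing may pass it. This is a simplified user-space model, not the 2.4 elevator code; the struct fields and the one-line "better seek position" test are invented:

```c
#include <assert.h>

/* Defaults mirroring the thread: 250 passes for reads, 500 for writes. */
#define READ_CREDIT  250
#define WRITE_CREDIT 500

struct request {
    long sector;
    int  credit;   /* how many later requests may still pass this one */
};

/* May a new request for `new_sector` be inserted ahead of `rq`?
 * Only if seek order favours it AND rq still has credit left. */
static int may_pass(const struct request *rq, long new_sector)
{
    if (rq->credit <= 0)
        return 0;                    /* rq starved enough: nothing passes */
    return new_sector < rq->sector;  /* simplistic seek-position test */
}

/* Charge every request that a newly inserted request jumped over. */
static void charge_passed(struct request **queue, int n_passed)
{
    for (int i = 0; i < n_passed; i++)
        queue[i]->credit--;
}
```

Because the unit is "requests passed" rather than seconds, the same READ_CREDIT/WRITE_CREDIT constants work for a fast RAID array and a slow ZIP drive alike, which is exactly the property Andrea is defending.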
Re: More on 2.2.18pre2aa2
On Tue, 12 Sep 2000, Andrea Arcangeli wrote:
> On Tue, 12 Sep 2000, Rik van Riel wrote:
> > Uhmmm, isn't the elevator about request /latency/ ?
>
> Yes, but definitely not absolute "time" latency. How do you get a
> 1msec latency for a read request out of a blockdevice that writes 1
> request in 2 seconds? See?

Of course, if you set a ridiculous latency figure you'll get
ridiculously bad performance. However, that doesn't say /anything/
about whether the idea is a good one or not...

Along the same lines, I could ridicule the current setup by running one
of the following lines and then pointing out how bad performance would
be.

# elvtune -r 1000 -w 1000 /dev/hda
# elvtune -r 1 -w 1 /dev/hda

> That was one of the first issues I was thinking about when I started
> playing with the elevator. (and yes, some of my early patches were
> setting a per-request timestamp using jiffies)
>
> Note: I understand you can do in function of time something that
> works ok for a normal 10/20Mbyte/sec harddisk, but since the elevator
> is used every time you write to any blockdevice out there, you also
> have to take into account things like ZIP drives and whatever other
> slow device that does less than 1Mbyte/sec I/O or even slower (as
> well as faster devices).

We can already set different figures for different drives. Would it
really be more than 30 minutes of work to put in a different request #
limit for each drive that automatically satisfies the latency specified
by the user?

regards,

Rik
Re: More on 2.2.18pre2aa2
On Tue, 12 Sep 2000, Rik van Riel wrote:
> We can already set different figures for different drives.

Right.

> Would it really be more than 30 minutes of work to put in a different
> request # limit for each drive that automatically satisfies the
> latency specified by the user?

Note that if you know the transfer rate of each of your harddisks, you
can almost emulate the "in function of time" behaviour by converting
from "throughput" and "latency in function of time" to "latency in
function of bh-sized requests". (one userspace script could do that
automatically, for example using hdparm to first benchmark the
throughput)

The reason I prefer working with "latency in function of bh-sized
requests" in kernel space is that I can just use one constant for most
of the blockdevices. Using the same latency setting with the "in
function of time" approach would be wrong instead (and you would have
to gather some hardware info from the device before you are able to
decide a good constant).

Andrea
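The userspace conversion Andrea sketches could look roughly like this. A hedged illustration only: the 1KB bh size and the idea of feeding in a throughput figure measured with `hdparm -t` are assumptions, not code from elvtune:

```c
#include <assert.h>

/*
 * Convert a measured sustained throughput (KB/s, e.g. from `hdparm -t`)
 * plus a desired worst-case latency (ms) into an elevator limit
 * expressed in 1KB bh-sized requests.
 */
static long latency_ms_to_requests(long kb_per_sec, long latency_ms)
{
    return kb_per_sec * latency_ms / 1000;
}
```

The resulting number would then be handed to `elvtune -r`/`-w` for that device; a fast disk gets a large request limit for the same wall-clock latency, a slow one a small limit.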
Re: More on 2.2.18pre2aa2
Andrea Arcangeli wrote:
> > Andrea - latency is time measured and perceived. Doing it time based
> > seems to make reasonable sense. I grant you might want to play with
> > the weighting per [device]

Right. Perception.

> When you have a device that writes a request every two seconds you
> still want it not to seek all the time because this would mean to make
> it even slower. No?
>
> The point very is simple: if you want good latency buy a faster
> hardware. You can't workaround the slowness of a slow device by
> putting the elevator in function of time, that will only make the
> global system even slower.

Sure, the global system is slower. But the "interactive feel" is
faster.

If I type "find /" I want it to go quickly. But I still want Emacs to
start up in a reasonable time, even if that means the overall time for
both processes is slower.

-- Jamie
Re: More on 2.2.18pre2aa2
On Tue, 12 Sep 2000, Andrea Arcangeli wrote:
> On Tue, 12 Sep 2000, Rik van Riel wrote:
> > But you don't. Transfer rate is very much dependent on the kind of
> > load you're putting on the disk...
>
> Transfer rate means `hdparm -t` in single user mode. Try it and
> you'll see you'll get always the same result.

*sigh* People don't use their machine to run `hdparm -t` all day. They
use their machine for different things, and user-perceived latency
varies wildly between the different loads users put on the machine...

> > Throughput really isn't that relevant here. The problems are
>
> Thoughput is relevant. Again, how do you get a 1msec latency out of a
> blockdevice that writes 1 request every two seconds?

Why do you always come up with impossible examples? If you had any
realistic example I'd be inclined to believe your argument, but this is
just not realistic enough to be taken seriously ...

> > With equally horrible results for most machines I've seen. For a
> > while I actually thought the bug /must/ have been somewhere else
> > because I saw processes 'hanging' for about 10 minutes before
> > making progress again ...
>
> As said in my earlier email the current 2.4.x elevator scheduler is
> _disabled_. I repeat: you should change include/linux/elevator.h and
> set the read and write latency to 250 and 500 respectively. You won't
> get latency as good as in test1, but it won't hang for 10 minutes.

Not for 10 minutes, no, but even with the latency set to 100 and 100 I
sometimes get /bad stalls/ in just one or two processes (while the rest
of the system happily runs on). On the other hand, when I do something
different with my machine, the settings 250 and 500 (or even higher)
give perfectly fine latency...

By applying a /different/ IO load, the same settings with the current
elevator tuning give wildly different results. What I'd like to see is
an elevator where the settings set by the user have a direct influence
on the behaviour observed.
Doing time-based request sorting should give us that behaviour, and a
default of say 1/2 second would work fine for all the hard disks and
cdrom drives I've seen in the last 6 years. Of course you can dream up
an imaginary device where it won't work, but in that case there's
always the possibility of tuning the elevator (like we do right now)
...

regards,

Rik
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Tue, 12 Sep 2000, Rik van Riel wrote:
> On Tue, 12 Sep 2000, Andrea Arcangeli wrote:
> > Yes, but definitely not absolute "time" latency. How do you get a
> > 1msec latency for a read request out of a blockdevice that writes 1
> > request in 2 seconds? See?
>
> Of course, if you set a rediculous latency figure you'll get
> rediculously bad performance. However, that doesn't say /anything/
> about if the idea is a good one or not...

People,

Remember why the elevator algorithm was changed in the first place? It
was introduced to solve a very specific problem. That problem: the
original elevator code did not schedule I/O particularly fairly under
certain I/O usage patterns. So it got fixed.

Now, I see people trying to introduce the concept of elapsed time into
that fix, which smells strongly of hack. How will this hack be cobbled
into the elevator code so that it copes with block devices from fast
RAID arrays to slow floppies to network block devices?

So I have to agree with Andrea that the concept of time does not belong
in the elevator code. Keep it to a queue management system, and
suddenly it scales to slow or fast devices without any gross
device-type specific tuning.

Cheers
Chris
Re: (reiserfs) Re: More on 2.2.18pre2aa2
I really think Rik has it right here. In particular, an MP3 player
needs to be able to say, "I have X milliseconds of buffer, so make my
worst case latency X milliseconds." The number of requests is the wrong
metric, because the time required per request depends on disk geometry,
disk caching, etc.

Hans
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Tue, 12 Sep 2000, Hans Reiser wrote:
> I really think Rik has it right here. In particular, an MP3 player
> needs to be able to say, I have X milliseconds of buffer so make my
> worst case latency X milliseconds. The number of requests is the
> wrong metric, because the time required per request depends on disk
> geometry, disk caching, etc.

Hi,

We need to separate "what's a good idea" from "what's the best way to
do it". Sure, it's a good idea to get an mp3 player's I/O request
serviced within a certain amount of time. Is the best way to do that
hacking the concept of time into the elevator algorithm? That's
currently under debate.

In fact, the guarantee of I/O service time for a single process (mp3
player) is pretty orthogonal to the per-device elevator settings. If
you have a certain block device set to a max latency of 0.25s, and lots
of processes are hammering the disk, then something will have to give,
i.e. under heavy load this setting will be useless and not honoured.

The solution to this particular mp3 player scenario would be something
like the task scheduler policies we have. For example, maybe we could
flag a given process so that all its I/O requests go to the head of the
queue.

Cheers
Chris
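The per-process flag Chris floats did not exist in this form; as a toy user-space illustration of the idea (all names, the flag, and the fixed-size queue are invented), marked tasks jump the queue while everyone else goes through normal insertion:

```c
#include <assert.h>
#include <string.h>

#define MAX_Q 8

/* Hypothetical per-task flag, NOT a real 2.2/2.4 task_struct field. */
struct task  { int io_urgent; };
struct queue { long sectors[MAX_Q]; int len; };

static void enqueue(struct queue *q, const struct task *t, long sector)
{
    if (t->io_urgent) {
        /* Urgent task: shift everything down and insert at the head. */
        memmove(&q->sectors[1], &q->sectors[0], q->len * sizeof(long));
        q->sectors[0] = sector;
    } else {
        /* Normal task: tail insert (real code would elevator-sort). */
        q->sectors[q->len] = sector;
    }
    q->len++;
}
```

This sidesteps the time-vs-request-count debate entirely: the mp3 player's reads never wait behind the backup's writes, whatever the per-device tuning is.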
Re: (reiserfs) Re: More on 2.2.18pre2aa2
Chris Evans wrote:
> On Tue, 12 Sep 2000, Hans Reiser wrote:
> > I really think Rik has it right here. In particular, an MP3 player
> > needs to be able to say, I have X milliseconds of buffer so make my
> > worst case latency X milliseconds. The number of requests is the
> > wrong metric, because the time required per request depends on disk
> > geometry, disk caching, etc.
>
> We need to separate "what's a good idea" from "what's the best way to
> do it". Sure, it's a good idea to get an mp3 player's I/O request
> serviced within a certain amount of time. Is the best way to do that
> hacking the concept of time into the elevator algorithm? That's
> currently under debate. [snip]
>
> The solution to this particular mp3 player scenario would be
> something like the task scheduler policies we have. For example,
> maybe we could flag a given process so that all it's I/O requests go
> to the head of the queue.

First of all: in the case of the mp3 player and such, there is already
a fine, proper way to give it better chances of getting its job done
smoothly - RT kernel scheduler priorities and proper IO buffering. I
did something similar for a GDI printer driver...

Second: the concept of time can give you very, very nasty behaviour in
edge cases. Assume that a disc can only do 1 request per second, and
then imagine scheduling based on 1+epsilon seconds... basically the
disc will be run at half the speed it could manage. Those nasty integer
arithmetics can catch you easily, mostly always entirely unexpectedly.

Third: all you are trying to improve is the boundary case between an
entirely overloaded system and a system which has a huge reserve to get
the task done. I don't think you can find any "improvement" which will
not just improve some cases and hurt some only slightly different cases
badly. That's basically the same problem as with the paging strategy to
follow. (However we have some kind of "common sense" in respect of
this, despite the fact that linux does ignore it...)

Fourth: the most common solution for such boundary cases is some notion
of cost optimization, like the nice value of a process or page age for
example, or alternatively some kind of choice between entirely
different strategies (remember the term strategy routine) - all of them
are just *relative* measures, not absolute time constraints.

Fifth: I think that this kind of IO behaviour control isn't something
generic enough for the elevator - it should all be done at the device
driver level, if at all. In fact you already have bad interactions
between strategies of low level drivers and the high level code in
Linux - like for example the "get from top of queue" or "don't get it
from top of the IO queue" mess between the IDE and SCSI middle
layers... (However this got a bit better recently.)

--
- phone: +49 214 8656 283
- job: STOCK-WORLD Media AG, LEV .de (MY OPINIONS ARE MY OWN!)
- langs: de_DE.ISO8859-1, en_US, pl_PL.ISO8859-2, last resort:
  ru_RU.KOI8-R
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Tue, 12 Sep 2000, Alan Cox wrote:
> > Now, I see people trying to introduce the concept of elapsed time
> > into that fix, which smells strongly of hack. How will this hack be
> > cobbled
>
> Actually my brain says that elapsed time based scheduling is the
> right thing to do. It certainly works for networks

Interesting, I'll try and run with this. The mention of networks
reminds me that any "max service time" variable is a tunable quantity
depending on current conditions...

... and sct's block device I/O accounting patches give us the current
average request service time on a per-device basis. Multiply that up a
bit and maybe you have your threshold for moving things to the head of
the queue.

Cheers
Chris
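A hedged sketch of what Chris suggests: keep a running average of per-request service time and derive the "too old" threshold from it, so the threshold adapts to fast and slow devices alike. The exponential 1/8 weighting and the x4 multiplier are illustrative guesses, not taken from sct's actual accounting patches:

```c
#include <assert.h>

struct dev_stats {
    long avg_svc_ms;   /* exponentially weighted average service time */
};

/* Fold one completed request's service time into the average
 * (1/8 weight, integer arithmetic). */
static void account(struct dev_stats *s, long svc_ms)
{
    s->avg_svc_ms += (svc_ms - s->avg_svc_ms) / 8;
}

/* A request that has waited longer than a few average service times
 * is considered starved and moves to the head of the queue. */
static long starvation_threshold(const struct dev_stats *s)
{
    return 4 * s->avg_svc_ms;
}
```

The appeal is that no absolute time constant is hard-coded: a ZIP drive with 2-second requests gets an 8-second threshold, a fast disk a far tighter one, which answers Andrea's slow-device objection while keeping the time-based behaviour Alan and Rik want.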
(reiserfs) Re: More on 2.2.18pre2aa2
> That problem: the original elevator code did not schedule I/O
> particularly fairly under certain I/O usage patterns. So it got
> fixed.

No, it got hacked up a bit.

> Now, I see people trying to introduce the concept of elapsed time
> into that fix, which smells strongly of hack. How will this hack be
> cobbled

Actually my brain says that elapsed time based scheduling is the right
thing to do. It certainly works for networks.
Re: More on 2.2.18pre2aa2
On Tue, 12 Sep 2000, Jamie Lokier wrote:
> Sure the global system is slower. But the "interactive feel" is
> faster. If I type "find /" I want it to go quickly. But I still want
> Emacs to

You always want it to go quickly. But when you're in the blockdevice
layer you have lost all the semantics of such an I/O request. You have
no idea if somebody is running a background `cp` or if it's your `find`
that's doing the I/O.

> start up in a reasonable time, even if that means the overall time
> for both processes is slower.

And you as well want your `cp` not to run two times slower, right? Then
you have to choose. And you can choose with elvtune. (currently you
have to choose on a per-blockdevice basis, maybe in the future you'll
be able to choose on a per-'struct file' basis)

That's not a matter of the algorithm. The only difference between the
"in function of time" latency and the "in function of blocksize"
latency is that in kernel space we don't need to know the internal
timings of the hardware. If you know the timings and you want a 2
second latency (assuming your harddisk writes 1 block in less than 2
seconds), then do the calc in userspace, run the blkelvset ioctl, and
you're almost happy. You won't be completely happy because, as said,
the elevator will give a two second latency also to requests that are
at the end of the queue, but this is a matter of the algorithm, not
really of the unit of measure of the latency in kernel space.

The current unit of measure allows us to use the same latency settings
for most devices out there while still providing good throughput (and
avoiding huge stalls). That's why I prefer it. But you can convert it
easily in userspace (or even change the unit in kernel, but I don't see
the advantage; I'd rather prefer elvtune to do the conversion and to
talk in function of time instead of putting the timing stuff into the
kernel).
Andrea
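For reference, the blkelvset ioctl Andrea mentions is the same knob elvtune drives. A user-space sketch follows; the struct layout matches my reading of the 2.4-test headers but should be verified against your kernel's <linux/blkdev.h> before relying on it, and the defaults-filling helper is purely illustrative:

```c
#include <assert.h>

/* Local copy of the kernel's ioctl argument struct (believed layout
 * in 2.2.18pre/2.4-test; verify against <linux/blkdev.h>). */
struct blkelv_arg {
    int queue_ID;
    int read_latency;
    int write_latency;
    int max_bomb_segments;
};

/* Fill in the defaults discussed in this thread when a field is unset. */
static void fill_defaults(struct blkelv_arg *a)
{
    if (a->read_latency <= 0)
        a->read_latency = 250;   /* reads: 250 passed requests */
    if (a->write_latency <= 0)
        a->write_latency = 500;  /* writes: 500 passed requests */
}

/* Applying it would then be (not run here):
 *   int fd = open("/dev/hda", O_RDONLY);
 *   ioctl(fd, BLKELVSET, &arg);
 */
```

elvtune itself is just a thin wrapper around this pair of ioctls, which is why Andrea can suggest doing the time-to-requests conversion in userspace rather than in the kernel.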
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Tue, 12 Sep 2000, Chris Evans wrote:
> the elevator code. Keep it to a queue management system, and suddenly
> it scales to slow or fast devices without any gross device-type
> specific tuning.

Yep, that was the object.

Andrea
(reiserfs) Re: More on 2.2.18pre2aa2
On Tue, 12 Sep 2000, Martin Dalecki wrote:
> Second: The concept of time can give you very very nasty behaviour in
> edge cases. [integer arithmetic]

Point taken.

> Third: All you try to improve is the boundary case between an
> entirely overloaded system and a system which has a huge reserve to
> get the task done. I don't think you can find any "improvement" which
> will not just improve some cases and hurt some only slightly
> different cases badly. That's basically the same problem as with the
> paging strategy to follow. (However we have some kind of "common
> sense" in respect of this, despite the fact that linux does ignore
> it...)

Please don't ignore my VM work ;)
http://www.surriel.com/patches/

> Fourth: The most common solution for such boundary cases is some
> notion of cost optimization, like the nice value of a process or page
> age for example, or alternatively some kind of choice between
> entirely different strategies (remember the term strategy routine) -
> all of them are just *relative* measures, not absolute time
> constraints.

Indeed, we'll need to work with relative measures to make sure both
throughput and latency are OK. Some kind of (very simple) self-tuning
system is probably best here.

regards,

Rik
(reiserfs) Re: More on 2.2.18pre2aa2
On Tue, 12 Sep 2000, Andrea Arcangeli wrote:
> On Tue, 12 Sep 2000, Rik van Riel wrote:
> > Also, this possibility is /extremely/ remote, if not impossible.
>
> Well, it could happen at one point in time. It's not impossible.
> Think of when you run a backup of your home directory while you're
> listening to an mp3. Both `tar` and `xmms` will read the same file
> that is out of cache. `tar` will be the first one to read the next
> out-of-cache data page of the file. The I/O will thus be issued with
> low prio, but then, as soon as `tar` has issued the read I/O, `xmms`
> will also wait on the same page, and it will skip the next deadline
> because the I/O has been issued with low prio.

Indeed, this could be an issue...

> To make it work right is not simple. I don't know if we really have
> to care about this case.

The process queueing the IO is more than likely a good guess, and a
good guess is (IMHO) better than not guessing at all and hoping things
will be ok.

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
  -- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/		http://www.surriel.com/
Re: (reiserfs) Re: More on 2.2.18pre2aa2
> Why do you say it's not been fixed? Can you still reproduce hangs as
> long as a write(2) can write? I certainly can't.

I can't reproduce long hangs. I'm not seeing as good I/O throughput as
before, but right now I'm quite happy with the tradeoff. If someone
can make it better then I'm happier still.
(reiserfs) Re: More on 2.2.18pre2aa2
On Tue, 12 Sep 2000, Martin Dalecki wrote:
> First of all: In the case of the mp3 player and such there is already
> a fine proper way to give it better chances on getting its job done
> smoothly - RT kernel scheduler priorities and proper IO buffering. I
> did something similar to a GDI printer driver...

Take 2.2.15, set a buffer of 128mbyte (of course assume your mp3 is
larger than 128mbyte :) and then run in background `cp /dev/zero .` in
the same fs where your out-of-cache mp3 file is living. Then you'll
see why a large buffer is useless if there's no kind of fair I/O
scheduling in the elevator. Repeat the same test in 2.2.16 then. The
I/O latency Hans was talking about for the mp3 player is the time it
takes for the buffer to become empty.

> [...] device driver level, if at all. In fact you have already bad
> interactions between strategies of low level drivers and the high
> level code in Linux - like for example the "get from top of queue" or
> "don't get it from top of the IO queue" mess between IDE and SCSI
> middle layers... (However this got a bit better recently.)

That's historic cruft; it's unrelated to controlling the elevator
algorithm on a per-task/per-file basis IMHO.

Andrea
Re: More on 2.2.18pre2aa2 (summary of elevator ideas)
Hi,

Geez, a simple comment on IRC can _really_ generate lots of feedback.
(There were over 50 messages about this in my queue - did not help
that some were duplicated three times <grin>.)

I made the comment because I remember back when the discussion was
current on linux-kernel. I thought Jeff Merkey's message was to the
point. Paraphrasing from memory, it was something to the effect that
Novell had tried many elevators. All had problems with some loads.
The best they had found was the 'close the door' idea. I do not
remember if the door was based on requests or time.

Another point to remember is that the NetWare people came up with what
they considered a good solution. From Jeff's comment they arrived at
this solution by experiments and bitter experience. Maybe we can learn
something from their research?

Here is what I glean from the thread:

From all the discussion I find this suggestion from Alan to make lots
of sense. I think it can be made to work with number of requests
almost as easily as with time...

> When you do the scan to decide where to insert the entry you dont
> consider insertion before the time. Also you keep two queue heads -
> the real and the insert head. Whenever you walk from the insert head
> and find it points to a 'too old' entry you update the insert_head.

And this suggestion from Rik should counter most of Andrea's time vs
requests vs slow block devices issues. We just have to be sure to
close the door after at least n requests or m time. However, as
pointed out by Chris Evans, later we may not have to do this - there
are stats that can give a good idea of a device's latency.

> Not really. What about just using something like "half a second, but
> at least 10 requests liberty to reorder"? It's simple, should be
> good enough for the user and allows for a little bit of reordering
> even on very slow devices ...

As Andrea points out, it's easy enough to do some sort of test with
the current code.

> Well changing that is very easy, you only have to change the unit of
> measure w/o changing one bit in the algorithm that is just
> implemented indeed. How? Just assume the req->elevator_sequence to
> be calculated in jiffies and in elevator.c change the check for
> `!req->elevator_sequence' to
> `time_before(req->elevator_sequence, jiffies)'. Then of course
> change the initialization of req->elevator_sequence to be done with
> `jiffies + elevator->read_latency'. Then also elvtune will talk in
> jiffies and not in requests.

I wonder if using a wandering insert pointer, as Alan suggests, would
give lower overhead than the current implementation (and would it
really help)?

Again from Alan,

Andrea> Now, I see people trying to introduce the concept of elapsed time
Andrea> that fix, which smells strongly of hack. How will this hack be cobbled

> Actually my brain says that elapsed time based scheduling is the
> right thing to do. It certainly works for networks

And from Chris Evans,

> Interesting, I'll try and run with this. The mention of networks
> reminds me that any "max service time" variable is a tunable
> quantity depending on current conditions.. .. and sct's block device
> I/O accounting patches give us the current average request service
> time on a per-device basis. Multiply that up a bit and maybe you
> have your threshold for moving things to the head of the queue.

So we could end up using a figure that builds in both number of
requests and time, on a per-device level, without too much effort...

And Alan sums up the whole thing nicely with:

Andrea> Why do you say it's not been fixed? Can you still reproduce hangs as
Andrea> long as a write(2) can write? I certainly can't.

> I can't reproduce long hangs. I'm not seeing as good I/O throughput
> as before but right now I'm quite happy with the tradeoff. If
> someone can make it better then I'm happier still.

If the idea works, leads to simpler code and a more responsive system,
maybe with better benchmarks, then it's a winner. Only way we can be
sure is to try it.

Thanks,
Ed Tomlinson [EMAIL PROTECTED] (ontadata on IRC)
Re: More on 2.2.18pre2aa2 (summary of elevator ideas)
One important point on remirroring I did not mention in my post. In
NetWare, remirroring scans the disk BACKWARDS (n->0) to prevent
artificial starvation while remirroring is going on. This was another
optimization we learned the hard way by trying numerous approaches to
the problem.

Jeff

Ed Tomlinson wrote:
> [Ed's summary of the elevator thread, quoted in full above, snipped]
Re: (reiserfs) Re: More on 2.2.18pre2aa2
Andrea Arcangeli wrote:
> On Tue, 12 Sep 2000, Martin Dalecki wrote:
> > First of all: In the case of the mp3 player and such there is
> > already a fine proper way to give it better chances on getting its
> > job done smoothly - RT kernel scheduler priorities and proper IO
> > buffering. I did something similar to a GDI printer driver...
>
> Take 2.2.15, set a buffer of 128mbyte (of course assume your mp3 is
> larger than 128mbyte :) and then run in background `cp /dev/zero .`
> in the same fs where your out-of-cache mp3 file is living. Then
> you'll see why a large buffer is useless if there's no kind of fair
> I/O scheduling in the elevator. Repeat the same test in 2.2.16 then.
> The I/O latency Hans was talking about for the mp3 player is the
> time it takes for the buffer to become empty.

I was talking about *proper* buffering, not necessarily *big* buffers.
Re: (reiserfs) Re: More on 2.2.18pre2aa2
Alan Cox wrote:
> > Now, I see people trying to introduce the concept of elapsed time
> > into that fix, which smells strongly of hack. How will this hack
> > be cobbled
>
> Actually my brain says that elapsed time based scheduling is the
> right thing to do.

No, Andrea is right here. The argument that everyone is using ("Our
target - latency - is measured in time") is utterly bogus. Yes, it's
measured in time, but remember that there are two things measured in
time here:

A. The time for the whole queue of requests to run (this is what Rik
   is proposing using to throttle)
B. The time an average request takes to process.

If we limit on the depth of queue we're (to some level of
approximation) making our decision based on A/B. It's still a magic
constant, but at least it's scaled to take into account the speed of
the drive. And underneath, it's still based on time.

> It certainly works for networks

Well, actually just about any communications protocol worth its salt
uses some sort of windowing throttle based on the amount of data
outstanding, not the length of time it's been in the queue. Which is
why TCP works well over both GigE and 28.8. [*] Now substitute "big
fiberchannel RAID" for GigE and "360K floppy" for 28.8 and you've got
the same problem.

* -- Yes, for optimal TCP over big WAN pipes you may want to use a
larger buffer size, but that's a matter of the bandwidth delay
product, which isn't relevant for talking about storage.

If we move to a "length of queue in time" as Rik suggests then we're
going to have to MAKE the user set it manually for each device.
There's too many orders of magnitude difference between even just SCSI
disks (10 yr old drive? 16-way RAID? Solid state?) to make supplying
any sort of default with the kernel impractical. The end result might
be a bit better behaved, but only just slightly. If people absolutely
need this behavior for some reason, the current algorithm should stay
as the default.
-Mitch
Re: (reiserfs) Re: More on 2.2.18pre2aa2
> time, but remember that there are two things measured in time here:
>
> A. The time for the whole queue of requests to run (this is what Rik
>    is proposing using to throttle)
> B. The time an average request takes to process.

Your perceived latency is based entirely on A.

> If we limit on the depth of queue we're (to some level of
> approximation) making our decision based on A/B. It's still a magic
> constant, but at

I don't suggest you do queue limiting on that basis. I suggest you do
order limiting based on time slots.

> Well, actually just about any communications protocol worth its salt
> uses some sort of windowing throttle based on the amount of data

I'm talking about flow control/traffic shaping.

> If we move to a "length of queue in time" as Rik suggests then we're
> going to have to MAKE the user set it manually for each device.

No.

> There's too many orders of magnitude difference between even just
> SCSI disks (10 yr old drive? 16-way RAID? Solid state?) to make
> supplying any sort of default with the kernel impractical. The end

The same argument is equally valid for the current scheme, and I think
you'll find it equally bogus.
Re: (reiserfs) Re: More on 2.2.18pre2aa2
Hans Reiser wrote:
> I really think Rik has it right here. In particular, an MP3 player
> needs to be able to say, I have X milliseconds of buffer so make my
> worst case latency X milliseconds. The number of requests is the
> wrong metric, because the time required per request depends on disk
> geometry, disk caching, etc.

No, the problem is that an application should either:

1. Take full control of the underlying system.
2. Not care about self-tuning the OS.

Because that's what operating systems are for in the first place:
letting applications run without caring about the underlying hardware.
Linux is just mistaken by design in that there is a generic elevator
for any block device, sitting on a single queue for any kind of
attached device. Only device drivers know best how to handle queueing
and stuff like this. The upper layers should only care about the
semantic correctness of the request order, not about optimizing it.
Re: (reiserfs) Re: More on 2.2.18pre2aa2
Alan Cox wrote:
> > time, but remember that there are two things measured in time here:
> >
> > A. The time for the whole queue of requests to run (this is what
> >    Rik is proposing using to throttle)
> > B. The time an average request takes to process.
>
> Your perceived latency is based entirely on A.

Yes, but "how hard is it reasonable for the kernel to try" is based on
both items. A good first order approximation is number of requests.

> > If we limit on the depth of queue we're (to some level of
> > approximation) making our decision based on A/B. It's still a
> > magic constant, but at
>
> I dont suggest you do queue limiting on that basis. I suggest you do
> order limiting based on time slots

It's still a queue - the queue of things we're going to take on this
elevator swipe, right? And the problem is one of keeping a sane
watermark on this queue - not so many requests that we destroy
latency, but enough to let the elevator do some good.

> > Well, actually just about any communications protocol worth its
> > salt uses some sort of windowing throttle based on the amount of
> > data
>
> Im talking about flow control/traffic shaping

...where the user sets a number explicitly for what performance they
want. Again, if we're going to make the user set this latency variable
for each of their devices, then doing it based on time will work
great.

> > There's too many orders of magnitude difference between even just
> > SCSI disks (10 yr old drive? 16-way RAID? Solid state?) to make
> > supplying any sort of default with the kernel impractical. The end
>
> The same argument is equally valid for the current scheme, and I
> think you'll find equally bogus

There will always need to be tunables - and it's fine to say "if
you've got oddball hardware and/or workload and/or requirements then
you should twiddle this knob". But it seems to me that the current
scheme works well for a pretty big range of devices. If you do the
setting based on time, I think it'll be a lot more sensitive, since
there's nothing that will scale based on the speed of the device.
-Mitch
Re: More on 2.2.18pre2aa2
   Date: Tue, 12 Sep 2000 04:23:05 -0300 (BRST)
   From: Rik van Riel [EMAIL PROTECTED]

   I've just uploaded a new snapshot of my new VM for 2.4 to my home
   page, this version contains a wakeup_kswapd() function (copied from
   wakeup_bdflush) and should balance memory a bit better.

How can drop_behind() work properly? You do not recompute the hash
chain head for each decreasing 'index' in the main while loop, and
thus you potentially search the wrong hash chain each time. Thus, you
need to change:

+		page = __find_page_nolock(mapping, index, *hash);

to something more like:

+		hash = page_hash(mapping, index);
+		page = __find_page_nolock(mapping, index, *hash);

and remove the now-spurious local variable initialization of 'hash' at
the top of this function.

Also, why this?

+		page = NULL;

That's spurious, you set it on the next line, probably this is from
some older revision of this function :-)

Later,
David S. Miller
[EMAIL PROTECTED]
Re: (reiserfs) Re: More on 2.2.18pre2aa2
Considering there are a lot of people still using 2.0.x because they
find it more stable than the 2.2.x series, doesn't it make sense to
give this scalability to people who are already running SMP boxes on
2.2.x and who may decide to use ReiserFS?

----- Original Message -----
From: "Andrea Arcangeli" <[EMAIL PROTECTED]>
> On Mon, 11 Sep 2000, Andi Kleen wrote:
>
> > BTW, there is another optimization that could help reiserfs a lot
> > on SMP settings: do a unlock_kernel()/lock_kernel() around the user
> > copies. It is quite legal to do that (you have to handle sleeping
> > anyways in case of a page fault), and it allows CPUs to run in
> > parallel for long running copies.
>
> I'd prefer not to spend time to make 2.2.x scale better in SMP; 2.4.x
> just fixed that problem by dropping the big lock in the first place
> in the read/write paths :). The copy-user reschedule points were
> bugfixes instead.
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Mon, 11 Sep 2000, Andi Kleen wrote:
> BTW, there is another optimization that could help reiserfs a lot
> on SMP settings: do a unlock_kernel()/lock_kernel() around the user
> copies. It is quite legal to do that (you have to handle sleeping
> anyways in case of a page fault), and it allows CPUs to run in
> parallel for long running copies.

I'd prefer not to spend time to make 2.2.x scale better in SMP; 2.4.x
just fixed that problem by dropping the big lock in the first place in
the read/write paths :). The copy-user reschedule points were bugfixes
instead.

Andrea
Re: More on 2.2.18pre2aa2
--On 09/11/00 15:02:34 +0200 Andrea Arcangeli <[EMAIL PROTECTED]> wrote:
>
> In 2.2.18pre2aa2.bz2 there's a latency bugfix, now a:
>
> 	read(fd, , 0x7fffffff)
> 	write(fd, , 0x7fffffff)
> 	sendfile(src, dst, NULL, 0x7fffffff)
>
> doesn't hang the machine anymore for several seconds. (well, really
> they are all three still a bit buggy because they don't interrupt
> themselves when a signal arrives so we can't use sendfile for `cp` of
> files smaller than 2g yet... but at least now you can do something in
> parallel even if you are so lucky to have some giga of fs cache)

Thanks Andrea, Andi, new patch is attached, with the warning messages
removed. The first patch got munged somewhere between test machine and
mailer, please don't use it.

-chris

reiserfs-2.2.18p2aa2-2.diff.gz
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Mon, Sep 11, 2000 at 08:15:15AM -0400, Chris Mason wrote:
> LFS changes for filldir, reiserfs_readpage, and adds limit checking
> in file_write to make sure we don't go above 2GB (Andi Kleen). Also
> fixes include/linux/fs.h, which does not patch cleanly for 3.5.25
> because of usb.
>
> Note, you might see debugging messages about items moving during
> copy_from_user. These are safe, but I'm leaving them in for now as
> I'd like to find out why copy_from_user is suddenly scheduling much
> more than it used to.

That's easy to explain. Andrea's latest aa contains some low latency
patches, which add a

	if (current->need_resched)
		schedule();

to copy_*_user to avoid bad schedule latencies for big copies. The
result is that you see a lot more schedules, every time the
copy_*_user happens to hit the end of a time slice.

BTW, there is another optimization that could help reiserfs a lot on
SMP settings: do a unlock_kernel()/lock_kernel() around the user
copies. It is quite legal to do that (you have to handle sleeping
anyways in case of a page fault), and it allows CPUs to run in
parallel for long running copies.

-Andi
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Mon, 11 Sep 2000, Chris Mason wrote:
> reiserfs-3.5.25, this patch. I tested against pre3-aa2.

BTW, pre3-aa2 means 2.2.18pre2aa2.bz2 applied on top of 2.2.18pre3.

> Note, you might see debugging messages about items moving during
> copy_from_user. These are safe, but I'm leaving them in for now as
> I'd like to find out why copy_from_user is suddenly scheduling much
> more than it used to.

In 2.2.18pre2aa2.bz2 there's a latency bugfix; now a:

	read(fd, , 0x7fffffff)
	write(fd, , 0x7fffffff)
	sendfile(src, dst, NULL, 0x7fffffff)

doesn't hang the machine anymore for several seconds. (Well, really
they are all three still a bit buggy because they don't interrupt
themselves when a signal arrives, so we can't use sendfile for `cp` of
files smaller than 2g yet... but at least now you can do something in
parallel even if you are so lucky to have some giga of fs cache.)

Andrea
Re: More on 2.2.18pre2aa2
--On 09/11/00 07:45:16 -0400 Ed Tomlinson <[EMAIL PROTECTED]> wrote:
> Hi Chris,
>
>>> Something between bigmem and his big VM changes makes reiserfs
>>> uncompilable. [..]
>
>> It's due to LFS. Chris should have a reiserfs patch that compiles on
>> top of 2.2.18pre2aa2, right? (if not Chris, I can sure find it
>> because the server that was reproducing the DAC960 SMP lock
>> inversion was running 2.2.18pre2aa2+IKD on top of a huge reiserfs fs)
>
> Chris, is this patch posted anywhere? Alternately can you take a look
> at the updated patch Brian posted and comment/correct (if required).

Patch attached. The patch order should be 2.2.18-prex-aa2,
reiserfs-3.5.25, this patch. I tested against pre3-aa2.

LFS changes for filldir, reiserfs_readpage, and adds limit checking in
file_write to make sure we don't go above 2GB (Andi Kleen). Also fixes
include/linux/fs.h, which does not patch cleanly for 3.5.25 because of
usb.

Note, you might see debugging messages about items moving during
copy_from_user. These are safe, but I'm leaving them in for now as I'd
like to find out why copy_from_user is suddenly scheduling much more
than it used to.

> Has the 'atomic' delete code made it into the main reiserfs trees?

No, it is still only 90% done.

-chris

diff -urN diff/linux/fs/reiserfs/dir.c linux/fs/reiserfs/dir.c
--- diff/linux/fs/reiserfs/dir.c	Sun Sep 10 21:58:07 2000
+++ linux/fs/reiserfs/dir.c	Mon Sep 11 01:04:19 2000
@@ -177,7 +177,7 @@
 	    // user space buffer is swapped out. At that time
 	    // entry can move to somewhere else
 	    memcpy (local_buf, d_name, d_reclen);
-	    if (filldir (dirent, local_buf, d_reclen, d_off, d_ino) < 0) {
+	    if (filldir (dirent, local_buf, d_reclen, d_off, d_ino, DT_UNKNOWN) < 0) {
 		pathrelse (&path_to_entry);
 		filp->f_pos = next_pos;
 		if (local_buf != small_buf) {
diff -urN diff/linux/fs/reiserfs/file.c linux/fs/reiserfs/file.c
--- diff/linux/fs/reiserfs/file.c	Sun Sep 10 21:58:07 2000
+++ linux/fs/reiserfs/file.c	Sun Sep 10 22:20:48 2000
@@ -40,6 +40,8 @@
 int reiserfs_readpage (struct file * file, struct page * page);
 
+/* LFS: should we add an open function that gives EINVAL for O_LARGEFILE?
+   The specs are not clear here. -AK */
 static struct file_operations reiserfs_file_operations = {
     NULL,			/* lseek */
     reiserfs_file_read,		/* read */
@@ -743,40 +745,58 @@
     int windex ;
     struct reiserfs_transaction_handle th ;
     unsigned long	limit = current->rlim[RLIMIT_FSIZE].rlim_cur;
-    unsigned long pos = *p_n_pos;
-    struct buffer_head *buffer_list[REISERFS_NBUF] ;
     int buffer_count = 0 ;
     int n_blocks_flushed = 0 ;  /* tracks i/o errors during O_SYNC */
-
-/*
-    if (!p_s_inode->i_op || !p_s_inode->i_op->updatepage)
-	return -EIO;
-*/
+
     if (p_s_filp->f_error) {
 	int error = p_s_filp->f_error;
 	p_s_filp->f_error = 0;
 	return error;
     }
+    if (n_count == 0)
+	return 0;
+    if ((signed)n_count < 0)
+	return -EINVAL;
+
+/*
+    if (!p_s_inode->i_op || !p_s_inode->i_op->updatepage)
+	return -EIO;
+*/
 
     /* Calculate position in the file. */
+    /* Make sure nothing overflows before converting loff_t to unsigned long */
     if ( p_s_filp->f_flags & O_APPEND ) {
-	n_pos_in_file = p_s_inode->i_size + 1;
-	pos = p_s_inode->i_size;
-    } else
-	n_pos_in_file = *p_n_pos + 1;
-
-    if (pos >= limit)
+	/* notify_change should guard against that */
+	if (p_s_inode->i_size > (0x7fffffff - 1))
+	    return -EFBIG;
+	n_pos_in_file = ((unsigned long)p_s_inode->i_size) + 1;
+    } else {
+	if (*p_n_pos > (0x7fffffff - 1))
 	    return -EFBIG;
+	n_pos_in_file = ((unsigned long)*p_n_pos) + 1;
+    }
 
-    if (n_count > limit - pos)
-	n_count = limit - pos;
+    if (limit >= 0x7fffffff)
+	limit = 0x7fffffff - 1;
+
+    if (n_pos_in_file + n_count >= limit) {
+	/* Should send SIGXFSZ when above rlim */
+	if (n_pos_in_file - 1 < limit)
+	    n_count = limit - n_pos_in_file + 1;
+	else
+	    return -EFBIG;
+    }
+
+    if (n_count > limit - n_pos_in_file)
+	n_count = limit - n_pos_in_file;
 
     n_written = 0;
     n_tail_bytes_written = 0;
     p_s_sb = p_s_inode->i_sb;
 
     remove_suid(p_s_inode) ;
+
     journal_begin(&th, p_s_sb, jbegin_count) ;
     if (p_s_filp->f_flags & O_SYNC) {
diff -urN diff/linux/fs/reiserfs/inode.c linux/fs/reiserfs/inode.c
--- diff/linux/fs/reiserfs/inode.c	Sun Sep 10 21:58:07 2000
+++ linux/fs/reiserfs/inode.c	Mon Sep 11 00:33:07 2000
@@ -217,10 +217,12 @@
     inode = file->f_dentry->d_inode;
     increment_i_read_sync_counter(inode) ;
 
-    if (has_tail (inode) && tail_offset (inode) < page->offset + PAGE_SIZE) {
+    if (has_tail (inode) &&
+	tail_offset(inode) < pgoff2ulong(page->index) * PAGE_SIZE + PAGE_SIZE) {
 	/* there is a tail and it is in this page */
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Mon, 11 Sep 2000, Andi Kleen wrote:

> BTW, there is another optimization that could help reiserfs a lot in
> SMP settings: do an unlock_kernel()/lock_kernel() around the user
> copies. It is quite legal to do that (you have to handle sleeping
> anyways in case of a page fault), and it allows CPUs to run in
> parallel for long-running copies.

I'd prefer not to spend time making 2.2.x scale better in SMP; 2.4.x just fixed that problem by dropping the big lock in the first place in the read/write paths :). The copy-user reschedule points were bugfixes instead.

Andrea

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
Re: (reiserfs) Re: More on 2.2.18pre2aa2
Considering there are a lot of people still using 2.0.x because they find it more stable than the 2.2.x series, doesn't it make sense to give this scalability to people who are already running SMP boxes on 2.2.x and who may decide to use ReiserFS?

----- Original Message -----
From: "Andrea Arcangeli" <[EMAIL PROTECTED]>

> On Mon, 11 Sep 2000, Andi Kleen wrote:
>
> > BTW, there is another optimization that could help reiserfs a lot in
> > SMP settings: do an unlock_kernel()/lock_kernel() around the user
> > copies. It is quite legal to do that (you have to handle sleeping
> > anyways in case of a page fault), and it allows CPUs to run in
> > parallel for long-running copies.
>
> I'd prefer not to spend time making 2.2.x scale better in SMP; 2.4.x
> just fixed that problem by dropping the big lock in the first place in
> the read/write paths :). The copy-user reschedule points were bugfixes
> instead.