Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Sat, 16 Sep 2000, Giuliano Pochini wrote:
> I wrote, then got air-brushed out of the thread?!
> > That's one approach; I prefer my "weighted scoring" approach. Supposing we
> > have three devices: a solid state disk (instant "seeks"), a hard drive and
> > a tape. The SSD will benefit from merges (fewer commands to process), but
> > not hugely - set both the metrics at 1, so a 64Kb request is just under
> > twice the "cost" of a 32Kb one. The hard drive can seek fairly quickly,
> > but long reads are preferable - say, seek_cost = 16, block_cost = 1. A
> > single 64Kb request is "cheaper" than a pair of 32Kb requests, but not
> > hugely so. Then the tape: seeks can take a few minutes, so make it
> > seek_cost = 65536, block_cost = 1: we don't really care how much is being
> > read, within reason, since seeks are so "expensive".
>
> If we could read the maximum/typical/minimum seek and transfer speed
> directly from the drive with a SCSI command, things would be a lot
> simpler :))

We can almost do that with any device; if we have a nice interface for user-space tuning (a few /proc files), we can be told at boot time what the characteristics of each device are. A simple piece of code which does a couple of seeks and a big read or two on each device would achieve this easily enough. We don't need to know the exact characteristics, but it could be of some use in fine-tuning performance; my overall approach should work reasonably well for any given device, I think?

James.

- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to [EMAIL PROTECTED] Please read the FAQ at http://www.tux.org/lkml/
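James's boot-time probe could feed a cost model like the one quoted above. Here is a minimal sketch of the derivation step only, assuming the probe has already measured an average seek time and a sequential transfer rate; the function and parameter names are invented, not anything from the kernel:

```c
#include <assert.h>

/* Hypothetical helper for the boot-time probe idea: given an average
 * seek time (microseconds) and a sequential transfer rate (KB/s), both
 * of which the probe would measure with a couple of seeks and a big
 * read, derive a seek_cost weight expressed as "KB that could have been
 * transferred during one seek".  block_cost stays at 1 per KB. */
int derive_seek_cost(int seek_us, int kb_per_sec)
{
    long long kb = (long long)kb_per_sec * seek_us / 1000000;
    return kb > 0 ? (int)kb : 1;  /* SSD-like devices bottom out at 1 */
}
```

With numbers in the range of the quote, a tape (seeks of a minute, ~1 MB/s) comes out around 60000, close to the suggested 65536, while a device with near-instant seeks collapses to 1.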
Re: (reiserfs) Re: More on 2.2.18pre2aa2
> The current HUGE queue size is probably another reason for
> the very bad latencies we sometimes see...

Yes, but if the queue is short, processes are likely to remain blocked in D state for more time and the chances to merge requests are smaller. IMHO we should add a way to give priority to requests from live tasks; i.e. a "cp" usually puts stuff on the queue and then exits, so there's no need to service that I/O immediately.

> > I don't think you should allow merges. If one process is doing a big
> > IO operation on a big file it would still get _all_ capacity, right?
>
> Only if you were to allow unlimited merges ;)
>
> I guess a 'decent' maximum size for one request would
> be somewhere around 256kB, since this is the amount of
> data a drive can read in the time of one disk seek...

Today, HD transfer rates of 10+ MB/s are not uncommon; 256kB is a bit too small IMHO. SCSI disks that don't support TQ (tagged queueing; e.g. Quantum) would take a performance hit.

Bye.
Re: (reiserfs) Re: More on 2.2.18pre2aa2
> That's one approach; I prefer my "weighted scoring" approach. Supposing we
> have three devices: a solid state disk (instant "seeks"), a hard drive and
> a tape. The SSD will benefit from merges (fewer commands to process), but
> not hugely - set both the metrics at 1, so a 64Kb request is just under
> twice the "cost" of a 32Kb one. The hard drive can seek fairly quickly,
> but long reads are preferable - say, seek_cost = 16, block_cost = 1. A
> single 64Kb request is "cheaper" than a pair of 32Kb requests, but not
> hugely so. Then the tape: seeks can take a few minutes, so make it
> seek_cost = 65536, block_cost = 1: we don't really care how much is being
> read, within reason, since seeks are so "expensive".

If we could read the maximum/typical/minimum seek and transfer speed directly from the drive with a SCSI command, things would be a lot simpler :))

Bye.
Re: (reiserfs) Re: More on 2.2.18pre2aa2
[EMAIL PROTECTED] wrote:
> If we played some "zippy" music that would add to the "feel". Of
> course, we could actually use benchmarks instead.

Benchmarks, famed for their universal appeal :-)

> And, to me, if kernel compiles take longer, I don't care how fast it "feels".

A kernel compile _by itself_ should always run at full speed.

-- Jamie
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Sat, Sep 16, 2000 at 02:15:16PM +0200, Jamie Lokier wrote:
> [EMAIL PROTECTED] wrote:
> > > Sure the global system is slower. But the "interactive feel" is faster.
> >
> > Let's pop up little buttons to make it "feel" faster.
>
> If little buttons pop up quickly when I click on them, then yes that's
> better interactive feel. Sometimes the disk is involved in this.
>
> Nobody has a problem with tuning the task scheduler to work this way.

If we played some "zippy" music that would add to the "feel". Of course, we could actually use benchmarks instead.

And, to me, if kernel compiles take longer, I don't care how fast it "feels".

> -- Jamie

--
- Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
www.fsmlabs.com  www.rtlinux.com
Re: (reiserfs) Re: More on 2.2.18pre2aa2
[EMAIL PROTECTED] wrote:
> > Sure the global system is slower. But the "interactive feel" is faster.
>
> Let's pop up little buttons to make it "feel" faster.

If little buttons pop up quickly when I click on them, then yes that's better interactive feel. Sometimes the disk is involved in this.

Nobody has a problem with tuning the task scheduler to work this way.

-- Jamie
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Tue, Sep 12, 2000 at 04:30:32PM +0200, Jamie Lokier wrote:
> Sure the global system is slower. But the "interactive feel" is faster.

Let's pop up little buttons to make it "feel" faster.

--
- Victor Yodaiken
Finite State Machine Labs: The RTLinux Company.
www.fsmlabs.com  www.rtlinux.com
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Fri, Sep 15, 2000 at 09:08:25PM -0400, Giuliano Pochini wrote:
> Which is tightly dependent on the way we insert new rqs.

Sure, the implementation would differ depending on the way we order the requests during insertion, but the conceptual algorithm of the latency control could remain the same even if we slightly change the ordering algorithm. The reason the implementation would differ is that the two are implemented together, so that they traverse the queue as few times as possible.

> Rqs are added according to the IN_ORDER() condition which enforces ascending
> ordering. What does CSCAN mean ?

CSCAN means cyclic SCAN. It enforces ascending order, and when it reaches the last cylinder it causes a large seek back to the first one.

> [..] When we insert 15 the list becomes 12 13 14 1 2 3 15.

You're right. In fact the only reason we need the latency control stuff is that 14 has to become a barrier eventually (and 15 has to become a barrier eventually as well) if we want them to be served by the lowlevel driver; otherwise they could be passed for way too long a time (just the time to write some terabytes of data to disk) by requests with lower block numbers. This was the elevator latency bugfix.

Andrea
Re: (reiserfs) Re: More on 2.2.18pre2aa2
> > It's a good way to avoid stalls, but IMHO I think the elevator should
>
> Here you're talking about the latency control logic.

Which is tightly dependent on the way we insert new rqs.

> > not work this way. The main problem is that it doesn't minimize head
> > movement. For example, when a request comes for a sector at the
>
> The algorithm that controls the latency and the one that enforces the
> ordering are two separate things.
>
> For example in 2.2.15 there's only the latter (2.2.15 was missing a
> latency control mechanism).
>
> > beginning of the disk it goes to the beginning of the queue (it's
> > ordered) so it will be serviced before other rqs (it picks up rqs from
> > the head). This way the higher the block number is, the more likely that
> > rq will wait a lot.
>
> We're not using a SCAN ordering algorithm; it's instead a CSCAN. The queue
> can be for instance:
>
> 12 13 14 1 2 3
>
> Now insert 15 (assume 15 is the last sector of the disk):
>
> 12 13 14 15 1 2 3

Rqs are added according to the IN_ORDER() condition which enforces ascending ordering. What does CSCAN mean ?

Ie, if the queue initially is 1 2 3 and we insert 14, it will become 1 2 3 14. If it's 12 13 14 and we insert 2: 2 12 13 14. So your example can happen only if the list is 12 13 14 and "14" is the latency barrier. When we insert 15 the list becomes 12 13 14 1 2 3 15. Am I missing something ?

> PS. Thanks for having read and understood the internals of the 2.4.x
> elevator before posting your comments, it simplifies very much the
> communication.

I had to study it while hunting for the no-merge bug :)

Bye.
Re: (reiserfs) Re: More on 2.2.18pre2aa2
Ragnar Kjørstad wrote:
>
> On Wed, Sep 13, 2000 at 11:22:16AM -0400, Michael T. Babcock wrote:
> > If I may ask a potentially stupid question, how can request latency be
> > anything but a factor of time? Latency is how /long/ you (or the computer)
> > /waits/ for something. That defines it as a function of time.
>
> Latency is of course a factor of time, but the point is that the
> acceptable latency differs from device to device. For a slower device
> longer latency must be acceptable, and if the relationship is linear,
> then using number of requests may be a simpler and better way of doing
> it.

Latency is a function of time, but the units of time need not be seconds; they could be requests. In theory. In this particular case, measuring time in units of requests is inappropriate. We have a choice between having users (or device driver authors, for the defaults) estimate how many milliseconds is reasonable for a given device, or having them estimate how many requests can be guaranteed not to exceed the time they can wait for a particular device. I would rather take responsibility for not making my MP3 buffering unreasonably small on a floppy device that multiple processes are accessing (where the performance effects are disastrous) than take responsibility for estimating how many requests will add up to no more than X milliseconds of buffer on my hard drive.

Number of requests is irrelevant to realtime I/O scheduling; seconds are relevant to realtime I/O scheduling. There are two problems to be solved: fairness, and realtime I/O scheduling. Fairness can be handled effectively by measuring time in units of requests. Guaranteeing maximum latencies requires traditional time units.

Hans
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Wed, 13 Sep 2000, Rik van Riel wrote:
> On Thu, 14 Sep 2000, Ragnar Kjørstad wrote:
> This (IMHO) is wrong. When the drive has trouble keeping up
> with requests in FCFS order, we should do some elevator sorting
> to keep throughput high enough to deal with the requests we get.

Reversing that, you mean the drive should be given requests on an FCFS basis until it is running flat out, at which point elevator sorting should come into play? That's what I would suggest: under light loads, requests are then handled with no extra latency at all, but we still get increased throughput under heavier load.

> Jeff's idea would take care of this, together with a *limited*
> (say, 32 entries) queue size. Every time the "work" queue (the
> one the disk is working on) is empty, we switch queues.

Then process one queue until it's empty, even if there are higher priority requests in another queue? I'm not so sure about that: then my backup program or a file copy will be getting between my MP3 player and the disk.

> Processes get to put their request on the other queue, either
> until the disk takes their requests (the queues are switched)
> or until the queue is full. That way only the N highest priority
> processes can get access to the disk if we're really overloaded
> in a bad way.

How does your approach block lower priority processes? If they were to issue a request before a higher priority one, it is inserted into the "filling" queue; once we switch queues, this request will be handled - perhaps before a higher priority request, if we're unlucky with the timing? If we can keep it sorted reasonably efficiently, a single rolling queue seems better - or perhaps model it on the process scheduler? Have "realtime" requests ("it doesn't matter if everything else locks up, this request MUST be completed ASAP"), then "other" requests of various priorities?

> The current HUGE queue size is probably another reason for
> the very bad latencies we sometimes see...

If the code ever sits there sorting the queue out while the disk goes idle, the code needs adjusting: getting requests processed in a sub-optimal order is still better than not getting any requests processed at all for some period of time!

> > > (maybe with the twist that we /do/ allow merges on the queue
> > > that's being processed by the disk, but no insertions of new
> > > requests)
> >
> > I don't think you should allow merges. If one process is doing a big
> > IO operation on a big file it would still get _all_ capacity, right?
>
> Only if you were to allow unlimited merges ;)
>
> I guess a 'decent' maximum size for one request would
> be somewhere around 256kB, since this is the amount of
> data a drive can read in the time of one disk seek...

That's one approach; I prefer my "weighted scoring" approach. Supposing we have three devices: a solid state disk (instant "seeks"), a hard drive and a tape. The SSD will benefit from merges (fewer commands to process), but not hugely - set both the metrics at 1, so a 64Kb request is just under twice the "cost" of a 32Kb one. The hard drive can seek fairly quickly, but long reads are preferable - say, seek_cost = 16, block_cost = 1. A single 64Kb request is "cheaper" than a pair of 32Kb requests, but not hugely so. Then the tape: seeks can take a few minutes, so make it seek_cost = 65536, block_cost = 1: we don't really care how much is being read, within reason, since seeks are so "expensive".

In short, having a single rolling queue (probably divided into priority classes) seems preferable; we should also bypass the elevator entirely if the device is able to accept new commands at the time. (Q: how intelligent will drives which handle command queueing themselves be??)

James.
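The weighted scoring above reduces to a tiny cost function. This sketch (invented names; sizes counted in Kb, matching the quote) shows how the same two metrics make merging a marginal win on an SSD and an overwhelming one on tape:

```c
#include <assert.h>

/* The "weighted scoring" idea as a cost function: a request costs one
 * seek plus block_cost units per Kb transferred. */
long request_cost(long seek_cost, long block_cost, long kb)
{
    return seek_cost + block_cost * kb;
}

/* How much cheaper is one merged request of (a + b) Kb than two
 * separate requests?  The merged form always saves exactly one
 * seek_cost, so the saving is marginal when seeks are cheap (SSD)
 * and huge when they are expensive (tape). */
long merge_saving(long seek_cost, long block_cost, long a, long b)
{
    return request_cost(seek_cost, block_cost, a)
         + request_cost(seek_cost, block_cost, b)
         - request_cost(seek_cost, block_cost, a + b);
}
```

With the SSD metrics (1, 1), a 64Kb request costs 65 against 66 for two 32Kb ones: "just under twice the cost", exactly as described.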
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Wed, Sep 13, 2000 at 08:08:12PM -0300, Rik van Riel wrote:
> On Thu, 14 Sep 2000, Ragnar Kjørstad wrote:
> > On Wed, Sep 13, 2000 at 06:57:21PM -0300, Rik van Riel wrote:
> > > > Another potentially stupid question:
> > > > When the queue gets too long/old, new requests should be put in
> > > > a new queue to avoid starvation for the ones in the current
> > > > queue, right?
> > >
> > > Indeed. That would solve the problem...
> >
> > I think something like this: (I don't know the current code, so maybe
> > this suggestion is really stupid.)
> >
> > struct request_queue {
> >     int start_time;
> >     struct request *list;
> >     struct request_queue *next;
> > };
> >
> > struct request_queue *current_queue;
> >
> > int too_old(struct request_queue *q) {
> >     /* can easily be implemented using time, nr of requests
> >        or any other measure of amount of work */
> >     if (current_time > q->start_time + MAXTIME)
> >         return 1;
> >     else
> >         return 0;
> > }
> >
> > void insert_request(struct request *new) {
> >     struct request_queue *q = current_queue;
> >     if (new->sectornr < current_sector)
> >         q = q->next;
> >     while (too_old(q))
> >         q = q->next;
> >     list_insert(q->list, new);  /* sorted insert into this queue */
> > }
>
> This (IMHO) is wrong. When the drive has trouble keeping up
> with requests in FCFS order, we should do some elevator sorting
> to keep throughput high enough to deal with the requests we get.

The "list" in each queue should be sorted. So, even if the queue is too old, the requests will still be sorted, but only within each queue.

> Jeff's idea would take care of this, together with a *limited*
> (say, 32 entries) queue size. Every time the "work" queue (the
> one the disk is working on) is empty, we switch queues.

OK, I called the "work" queue my "current" queue in the suggestion above, but other than that it may not be so different.

> Processes get to put their request on the other queue, either
> until the disk takes their requests (the queues are switched)
> or until the queue is full. That way only the N highest priority
> processes can get access to the disk if we're really overloaded
> in a bad way.

I'm not sure I understand this. New requests should always be put in the first queue that has room (is not too old). This means that no request will be shifted too far off its original order. (In other words, it is not too unfair.)

> > I don't think you should allow merges. If one process is doing a big
> > IO operation on a big file it would still get _all_ capacity, right?
>
> Only if you were to allow unlimited merges ;)
>
> I guess a 'decent' maximum size for one request would
> be somewhere around 256kB, since this is the amount of
> data a drive can read in the time of one disk seek...

Yes, this sounds like a good idea. (Don't know about the size though, but testing will show.)

-- Ragnar Kjørstad
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Wed, 13 Sep 2000, Giuliano Pochini wrote:

> I had a look at the new elevator. I hope I'm not wrong here... Requests
> are added to the queue scanning it starting from the tail. When the
> new request is found in order with an existing one, it's inserted into

Right, so that it works in O(1) if you're doing sequential I/O.

> the queue, unless it finds an existing request out of latency units. In
> that case the new rq is added before it (so it will be serviced after).

Correct. The request out of latency units will become a barrier and nothing will be allowed to pass it.

> After it has inserted the request it walks back and decreases all
> latency counters.

Yes, it has to tell the other requests that they are being delayed.

> So, the "unit" to measure latency is "number of requests". New requests

Yes, in the sense of I/O request (not `struct request').

> are not allowed to go before an existing one more than N times.

Correct.

> It's a good way to avoid stalls, but IMHO I think the elevator should

Here you're talking about the latency control logic.

> not work this way. The main problem is that it doesn't minimize head
> movement. For example, when a request comes for a sector at the

The algorithm that controls the latency and the one that enforces the ordering are two separate things.

For example in 2.2.15 there's only the latter (2.2.15 was missing a latency control mechanism).

> beginning of the disk it goes to the beginning of the queue (it's
> ordered) so it will be serviced before other rqs (it picks up rqs from
> the head). This way the higher the block number is, the more likely that
> rq will wait a lot.

We're not using a SCAN ordering algorithm; it's instead a CSCAN. The queue can be for instance:

12 13 14 1 2 3

Now insert 15 (assume 15 is the last sector of the disk):

12 13 14 15 1 2 3

As you can see, the highest block number of the HD will wait less time than 1 2 3 (it will pass 1, 2 and 3, of course assuming they are fresh requests and they still have some latency units to spend). SCAN instead would have a fairness problem where the middle of the disk gets lower latency; SCAN also goes backwards, and it's not known how well that works on all the hardware out there. (I think that's one of the main reasons we use CSCAN instead.)

Andrea

PS. Thanks for having read and understood the internals of the 2.4.x elevator before posting your comments; it simplifies the communication very much.
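Andrea's insertion example can be reproduced with a toy array model of CSCAN ordering. This is an illustration only, not the actual 2.4 list code; the names and the sector-key trick are invented, and merging and latency barriers are ignored:

```c
#include <assert.h>
#include <string.h>

/* Map a sector to a sort key relative to the head position: sectors at
 * or beyond the head sort first (ascending), wrapped-around sectors
 * below the head sort after them (also ascending).  That is exactly the
 * cyclic order CSCAN serves requests in. */
static int cscan_key(int sec, int head)
{
    return sec >= head ? sec : sec + (1 << 30);
}

/* Insert `sec` into q[0..n-1], kept in CSCAN order relative to `head`.
 * Returns the new queue length.  The caller guarantees capacity. */
int cscan_insert(int *q, int n, int head, int sec)
{
    int i = 0, k = cscan_key(sec, head);
    while (i < n && cscan_key(q[i], head) <= k)
        i++;
    memmove(q + i + 1, q + i, (n - i) * sizeof(int));
    q[i] = sec;
    return n + 1;
}

/* Self-check reproducing the example: queue 12 13 14 1 2 3 with the
 * head somewhere between 3 and 12; insert sector 15.  Returns the new
 * length for idx < 0, otherwise the sector at position idx. */
int demo_cscan(int idx)
{
    int q[8] = { 12, 13, 14, 1, 2, 3 };
    int n = cscan_insert(q, 6, 10, 15);
    return idx < 0 ? n : q[idx];
}
```

Running the demo yields 12 13 14 15 1 2 3: the new last-sector request goes ahead of the wrapped-around requests, just as described.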
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Thu, 14 Sep 2000, Ragnar Kjørstad wrote:
> On Wed, Sep 13, 2000 at 06:57:21PM -0300, Rik van Riel wrote:
> > > Another potentially stupid question:
> > > When the queue gets too long/old, new requests should be put in
> > > a new queue to avoid starvation for the ones in the current
> > > queue, right?
> >
> > Indeed. That would solve the problem...
>
> I think something like this: (I don't know the current code, so maybe
> this suggestion is really stupid.)
>
> struct request_queue {
>     int start_time;
>     struct request *list;
>     struct request_queue *next;
> };
>
> struct request_queue *current_queue;
>
> int too_old(struct request_queue *q) {
>     /* can easily be implemented using time, nr of requests
>        or any other measure of amount of work */
>     if (current_time > q->start_time + MAXTIME)
>         return 1;
>     else
>         return 0;
> }
>
> void insert_request(struct request *new) {
>     struct request_queue *q = current_queue;
>     if (new->sectornr < current_sector)
>         q = q->next;
>     while (too_old(q))
>         q = q->next;
>     list_insert(q->list, new);  /* sorted insert into this queue */
> }

This (IMHO) is wrong. When the drive has trouble keeping up with requests in FCFS order, we should do some elevator sorting to keep throughput high enough to deal with the requests we get.

Jeff's idea would take care of this, together with a *limited* (say, 32 entries) queue size. Every time the "work" queue (the one the disk is working on) is empty, we switch queues.

Processes get to put their request on the other queue, either until the disk takes their requests (the queues are switched) or until the queue is full. That way only the N highest priority processes can get access to the disk if we're really overloaded in a bad way.

The current HUGE queue size is probably another reason for the very bad latencies we sometimes see...

> > (maybe with the twist that we /do/ allow merges on the queue
> > that's being processed by the disk, but no insertions of new
> > requests)
>
> I don't think you should allow merges. If one process is doing a big
> IO operation on a big file it would still get _all_ capacity, right?

Only if you were to allow unlimited merges ;)

I guess a 'decent' maximum size for one request would be somewhere around 256kB, since this is the amount of data a drive can read in the time of one disk seek...

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
-- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/  http://www.surriel.com/
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Wed, Sep 13, 2000 at 06:57:21PM -0300, Rik van Riel wrote:
> > Another potentially stupid question:
> > When the queue gets too long/old, new requests should be put in
> > a new queue to avoid starvation for the ones in the current
> > queue, right?
>
> Indeed. That would solve the problem...

I think something like this: (I don't know the current code, so maybe this suggestion is really stupid.)

struct request_queue {
    int start_time;
    struct request *list;
    struct request_queue *next;
};

struct request_queue *current_queue;

int too_old(struct request_queue *q) {
    /* can easily be implemented using time, nr of requests
       or any other measure of amount of work */
    if (current_time > q->start_time + MAXTIME)
        return 1;
    else
        return 0;
}

void insert_request(struct request *new) {
    struct request_queue *q = current_queue;
    if (new->sectornr < current_sector)
        q = q->next;
    while (too_old(q))
        q = q->next;
    list_insert(q->list, new);  /* sorted insert into this queue */
}

> (maybe with the twist that we /do/ allow merges on the queue
> that's being processed by the disk, but no insertions of new
> requests)

I don't think you should allow merges. If one process is doing a big IO operation on a big file it would still get _all_ capacity, right?

-- Ragnar Kjørstad
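The queue-selection part of the sketch above can be made compilable under the same assumptions (two queues in a ring, a constant MAXTIME; all names are invented, and time is just an integer passed in):

```c
#include <assert.h>
#include <stddef.h>

#define MAXTIME 100

/* Each queue remembers when it was started; one older than MAXTIME no
 * longer accepts new requests, so latecomers move on to the next queue
 * and cannot starve the requests already waiting. */
struct request_queue {
    int start_time;
    struct request_queue *next;  /* circular ring of queues */
};

static int too_old(const struct request_queue *q, int now)
{
    return now > q->start_time + MAXTIME;
}

/* Pick the queue a new request should join: skip the current queue if
 * the request falls behind the head position, then skip any queue that
 * is already closed to insertions. */
struct request_queue *pick_queue(struct request_queue *cur, int now,
                                 int sector, int head_sector)
{
    struct request_queue *q = cur;
    if (sector < head_sector)
        q = q->next;
    while (too_old(q, now))
        q = q->next;
    return q;
}

/* Self-check harness: two queues started at t=0 and t=200; returns 1 or
 * 2 according to which queue a request would land in. */
int demo_pick(int now, int sector, int head_sector)
{
    struct request_queue q1, q2;
    q1.start_time = 0;   q1.next = &q2;
    q2.start_time = 200; q2.next = &q1;
    return pick_queue(&q1, now, sector, head_sector) == &q1 ? 1 : 2;
}
```

A request behind the head, or one arriving after the first queue has aged out, lands in the second queue; everything else stays in the first.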
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Wed, Sep 13, 2000 at 08:08:51PM -0400, Giuliano Pochini wrote:
> > And if so, in what unit do you want to measure
> > latency if it isn't time?
>
> I had a look at the new elevator. I hope I'm not wrong here... Requests
> are added to the queue scanning it starting from the tail. When the
> new request is found in order with an existing one, it's inserted into
> the queue, unless it finds an existing request out of latency units. In
> that case the new rq is added before it (so it will be serviced after).

If I understand this correctly, that means once the queue is too long (latency-wise) the elevator will stop working completely, because all requests will be served in the order they come in, right?

If multiple queues are used, so that performance is OK even when the first queue is full (too old), the discussion about how to tell whether the queue is too old (by time or by requests) becomes less important. After all, there is no reason not to start a new queue pretty often, even on a fast device. The correct behavior would be to reorder the new requests individually even if the queue is already too long. In other words - multiple queues.

> not work this way. The main problem is that it doesn't minimize head
> movement. For example, when a request comes for a sector at the
> beginning of the disk it goes to the beginning of the queue (it's
> ordered) so it will be serviced before other rqs (it picks up rqs from
> the head). This way the higher the block number is, the more likely that
> rq will wait a lot.

If a new request is made for a sector before the current sector (the sector of the last request to be served), the new request should be added to the second queue even if the first one is not too old.

-- Ragnar Kjørstad
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Wed, 13 Sep 2000, Ragnar Kjørstad wrote:
> On Wed, Sep 13, 2000 at 11:22:16AM -0400, Michael T. Babcock wrote:
> > If I may ask a potentially stupid question, how can request latency be
> > anything but a factor of time? Latency is how /long/ you (or the computer)
> > /waits/ for something. That defines it as a function of time.
>
> Latency is of course a factor of time, but the point is that the
> acceptable latency differs from device to device. For a slower device
> longer latency must be acceptable, and if the relationship is linear,
> then using number of requests may be a simpler and better way of doing
> it.
>
> Another potentially stupid question:
> When the queue gets too long/old, new requests should be put in
> a new queue to avoid starvation for the ones in the current
> queue, right?

Indeed. That would solve the problem...

> So if this is done by time, how do you know when the oldest
> request gets too old? You would need to index the requests both
> by sector and time, and thus add performance overhead, right?
> If you, however, have a simple rule that max 100 requests should
> be put in each queue, it's easy to know when to start a new one.

The idea Jeff Merkey gave us is even simpler and should work OK every time. I think I'll implement it RSN and see if it works well for Linux.

(maybe with the twist that we /do/ allow merges on the queue that's being processed by the disk, but no insertions of new requests)

regards,

Rik
--
"What you're running that piece of shit Gnome?!?!"
-- Miguel de Icaza, UKUUG 2000

http://www.conectiva.com/  http://www.surriel.com/
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Wed, Sep 13, 2000 at 11:22:16AM -0400, Michael T. Babcock wrote:
> If I may ask a potentially stupid question, how can request latency be
> anything but a factor of time? Latency is how /long/ you (or the computer)
> /waits/ for something. That defines it as a function of time.

Latency is of course a factor of time, but the point is that the acceptable latency differs from device to device. For a slower device longer latency must be acceptable, and if the relationship is linear, then using number of requests may be a simpler and better way of doing it.

Another potentially stupid question: when the queue gets too long/old, new requests should be put in a new queue to avoid starvation for the ones in the current queue, right?

So if this is done by time, how do you know when the oldest request gets too old? You would need to index the requests both by sector and by time, and thus add performance overhead, right? If you, however, have a simple rule that at most 100 requests should be put in each queue, it's easy to know when to start a new one. The number 100 should be found by calculating how many requests can be served within the acceptable latency.

-- Ragnar Kjørstad
Re: (reiserfs) Re: More on 2.2.18pre2aa2 (summary of elevator ideas)
On 12 Sep 2000, at 18:08, Ed Tomlinson wrote:
> Hi,
>
> I made the comment because I remember back when the discussion was current
> on linux-kernel. I thought Jeff Merkey's message was to the point.
> Paraphrasing from memory, it was something to the effect that Novell had
> tried many elevators. All had problems with some loads. The best they
> had found was the 'close the door' idea. I do not remember if the door
> was based on requests or time. Another point to remember is that the
> NetWare people came up with what they considered a good solution.

I believe that the NetWare elevator is based on outstanding requests. This is a tunable parameter which may be increased for fast disk subsystems. I think that you could consider the number of requests to be loosely related to time; that is, the time to service 50 requests should be fairly predictable for a given disk/controller. I don't think you need to timestamp every request to get good results.

--Mailed via Pegasus 3.12 & Mercury 1.44---
[EMAIL PROTECTED]    Fax (617)373-2942
Andrew Scott         Tel (617)373-5278
Northeastern University--138 Meserve Hall
College of Arts & Sciences-Deans Office
Boston, Ma. 02115
Re: (reiserfs) Re: More on 2.2.18pre2aa2
> > Going in function of time is obviously wrong. A blockdevice can > > write 1 request every two seconds or 1 request every msecond. > > You can't assume anything in function of time _unless_ you have > > per harddisk timing informations into the kernel. > > Uhmmm, isn't the elevator about request /latency/ ? > > And if so, in what unit do you want to measure > latency if it isn't time? I had a look at the new elevator. I hope I'm not wrong here... Requests are added to the queue by scanning it from the tail. When the new request is found to be in order with an existing one, it's inserted into the queue, unless it finds an existing request that is out of latency units; in that case the new rq is added before it (so it will be serviced after it). After it has inserted the request, it walks back and decreases all the latency counters. So, the "unit" used to measure latency is "number of requests": new requests are not allowed to go before an existing one more than N times. It's a good way to avoid stalls, but IMHO the elevator should not work this way. The main problem is that it doesn't minimize head movement. For example, when a request arrives for a sector at the beginning of the disk, it goes to the beginning of the queue (the queue is ordered), so it will be serviced before other rqs (requests are picked from the head). This way, the higher the block number, the more likely that rq will wait a long time. Again... if I understood correctly how the elevator works. Bye.
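If I've read Giuliano's description the same way, the insertion rule can be modelled in a few lines. This is a toy reconstruction from the description above, not the real elevator code; the array layout and the constants are invented:

```c
/* Toy model of the insertion rule described above: a new request
 * moves toward its sector-sorted position, but may not pass a
 * request whose latency budget has reached zero, and every request
 * it does pass is charged one unit of budget. */

#define QDEPTH 32
#define MAX_LATENCY 4   /* how many times a request may be overtaken */

struct rq {
    long sector;
    int latency;
};

struct elevator {
    struct rq q[QDEPTH];
    int n;
};

static void elv_insert(struct elevator *e, long sector)
{
    int pos = e->n;   /* scan starts from the tail */

    /* walk back while we sort earlier AND the request ahead of us
     * still has latency budget left */
    while (pos > 0 && sector < e->q[pos - 1].sector
                   && e->q[pos - 1].latency > 0)
        pos--;

    /* everything we passed will now be serviced later: charge it */
    for (int i = e->n; i > pos; i--) {
        e->q[i] = e->q[i - 1];
        e->q[i].latency--;
    }
    e->q[pos].sector = sector;
    e->q[pos].latency = MAX_LATENCY;
    e->n++;
}
```

This also shows the behaviour Giuliano objects to: a low-sector request jumps to the head until the request it keeps overtaking runs out of budget, at which point ordering (and head movement) gives way to fairness.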
Re: (reiserfs) Re: More on 2.2.18pre2aa2 (summary of elevator ideas)
You're welcome. :-) Jeff Hans Reiser wrote: > > "Jeff V. Merkey" wrote: > > > > One important point on remirroring I did not mention in my post. In > > NetWare, remirroring scans the disk BACKWARDS (n0) to prevent > > artificial starvation while remirroring is going on. This was another > > optimization we learned the hard way by trying numerous approaches to > > the problem. > > > > Jeff > > Interesting. Thanks Jeff.
Re: (reiserfs) Re: More on 2.2.18pre2aa2
- Original Message - From: "Rik van Riel" <[EMAIL PROTECTED]> > On Tue, 12 Sep 2000, Andrea Arcangeli wrote: > > On Tue, 12 Sep 2000, Rik van Riel wrote: > > > > >Uhmmm, isn't the elevator about request /latency/ ? > > > > Yes, but definitely not absolute "time" latency. > > So what do you think about the idea Jeff Merkey > presented? It seems to work well for Netware, so > maybe we should try for Linux? If I may ask a potentially stupid question, how can request latency be anything but a factor of time? Latency is how /long/ you (or the computer) /waits/ for something. That defines it as a function of time. Just curious what the full comment being made is... because I'm suspicious that the terms are being confused, and that there is some discussion about reducing drive access latency (in the context of not moving the heads too much, etc.) without considering the user-space application waiting for data.
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Wed, 13 Sep 2000, Andrea Arcangeli wrote: > On Wed, 13 Sep 2000, Mitchell Blank Jr wrote: > > >The "large queue" goes against the whole point of this exercise - that > >is that if there are many items in the "queue" being sorted then > >unlucky requests can end up waiting a long time to get serviced. > > Yep. Ehmm... that large queue is a backlog of I/O requests. Since that's a function of the workload and the disk speed, you can't reduce it without either cutting the workload or speeding up the disk! I suspect we were talking at cross purposes here a little. I was NOT suggesting holding back requests to build up a big queue - I was talking about the situation where we have a large backlog of requests pending. James.
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Wed, 13 Sep 2000, Mitchell Blank Jr wrote: > James Sutherland wrote: > > In terms of latency, I'd suggest we aim to keep the device in use all the > > time we have outstanding requests: every time the device is ready to > > accept a request, we feed it the "next" one in the queue; until it is free > > again, requests pile up in the queue, being sorted, merged etc. > > > > This should perform very well under light loads - requests just get passed > > to the device as fast as they can be handled - and under very heavy loads > > - we have a large queue in play, with plenty of scope for reordering, > > merges etc. > > The "large queue" goes against the whole point of this exercise - that > is that if there are many items in the "queue" being sorted then > unlucky requests can end up waiting a long time to get serviced. Yes, you need some sort of priority boost to avoid starving unlucky processes. Tracking a sort of I/O goodness value may help here: if I fire off a copy of a large file, and a "find /", the former will get much higher throughput due to large consecutive reads. As it does so, though, it reduces its own I/O priority; 'eventually' it will be below find, and find gets some I/O bandwidth. > You want to have some reasonable break-off point where you just > decide to make an elevator swipe in order to avoid starving those > requests. Hrmm... not quite the system I had in mind. I specifically want to avoid delaying any requests: when possible, we just pass requests straight to the drive. The queue of outstanding requests only arises in the first place when the drive is busy. > Keep in mind that I'm using the word "queue" in a confusing manner > here - we're talking about how much we'll try to elevator-sort in > a go, not the actual queue of all requests. I've been calling this > a queue because from a performance standpoint that's what it is, > so I think it helps to think of the problem that way. Ahh... I see. 
If we have a large enough backlog of requests that we can't even elevator-sort them fast enough, there's something very odd going on. Still, I was planning to impose a maximum limit on the backlog - the question is, what to do when that limit is reached? > Oh yeah, one other thing about huge elevator queues to think about (although it's not too pertinent these days). A friend of mine had a job 15 years or so ago working on a commercial UNIX flavor. They were having some severe I/O performance problems under high loads. The problem was that they were trying to sort and merge all the pending requests. When I/O was heavy this ended up chewing all the CPU time and the kernel wouldn't actually be able to keep up sending requests to the disk :-) However, keep in mind that this was in the bad-old-days when "hard drive" meant a rack-mount Eagle and the elevator was expected to consider things like head positioning and rotational speed when doing a sort... LOL! My approach does avoid that, though. As a worst case, the elevator just doesn't get a chance to sort anything: every request issued hits the disk directly. If the disk is handling requests so fast that the CPU can't keep up, we don't really want an elevator anyway... So, in various scenarios:
* Light I/O - the disk handles requests as they are issued. The disk may be idle at times.
* Medium I/O - a small backlog is elevator-sorted as needed. The disk is active continuously, but all requests are handled OK.
* Heavy I/O - a large backlog builds up. The disk is continuously active, and some processes may be starved or throttled, depending on priorities.
In the third case, sometimes we WILL want processes starved of I/O. If I have three processes running - an MP3 player (high priority), a 'find' and a low-priority backup job - I DO want the backup job starved in favour of the other two. James.
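The decaying "I/O goodness" scheme James sketches in the message above might look something like this. Everything here (io_task, pick_next, the scoring numbers) is hypothetical, not an existing interface:

```c
/* Rough sketch of a decaying I/O priority: each task has a score it
 * pays down per block served, and the next request is taken from the
 * pending task with the highest remaining score, so a streaming copy
 * gradually yields to lighter I/O users such as find. */

struct io_task {
    int score;     /* remaining I/O priority */
    int pending;   /* queued requests */
};

/* index of the task to service next: highest score among those with
 * requests pending, or -1 if everything is idle */
static int pick_next(const struct io_task *t, int n)
{
    int best = -1;

    for (int i = 0; i < n; i++)
        if (t[i].pending > 0 && (best < 0 || t[i].score > t[best].score))
            best = i;
    return best;
}

static void service(struct io_task *t, int blocks)
{
    t->pending--;
    t->score -= blocks;   /* heavy throughput lowers future priority */
}
```

Under heavy load this is exactly the controlled starvation James wants: the low-priority bulk job pays for its own throughput and drops below the interactive readers.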
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Wed, 13 Sep 2000, Martin Dalecki wrote: >Andrea Arcangeli wrote: >> On Tue, 12 Sep 2000, Martin Dalecki wrote: >> >First of all: In the case of the mp3 player and such there is already a fine, proper way to give it a better chance of getting its job done smoothly - RT kernel scheduler priorities and proper IO buffering. I did something similar for a GDI printer driver... >> Take 2.2.15, set a buffer of 128mbyte (of course assume your mp3 is larger than 128mbyte :) and then run in background `cp /dev/zero .` in the same fs where your mp3 file out of cache is living. Then you'll see why a large buffer is useless if there's no kind of fair I/O scheduling in the elevator. Repeat the same test in 2.2.16 then. >> The I/O latency Hans was talking about for the mp3 player is the time it takes for the buffer to become empty. >I was talking about *proper* buffering, not necessarily *big* buffers. Define "*proper* buffering". If your buffer is shorter than the file (and as it's a buffer it's intended to be shorter, since your mp3 file can be several gigabytes long), then at some point you'll have to page in the new data from disk to refill the buffer. What will your proper buffering do when Linux doesn't complete the read I/O to refill the buffer in time? (in time == before the buffer becomes empty) Andrea
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Wed, 13 Sep 2000, Mitchell Blank Jr wrote: >The "large queue" goes against the whole point of this exercise - that >is that if there are many items in the "queue" being sorted then >unlucky requests can end up waiting a long time to get serviced. Yep. Andrea
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Wed, 13 Sep 2000, Alan Cox wrote: >> Yes, but "how hard is it reasonable for the kernel to try" is based on >> both items. A good first order approximation is number of requests. > >I must strongly disagree with that claim. A request could be 512 bytes or >128K. The current elevator accounts for the number of bhs that enter the bh layer (we don't distinguish between 1k and 4k soft blocksizes, though). I was being misleading (sorry) when I used the term "request": I didn't mean a `struct request' but a new I/O request (1k/4k sized) pushed into the blkdev layer. Andrea
Re: (reiserfs) Re: More on 2.2.18pre2aa2
James Sutherland wrote: > In terms of latency, I'd suggest we aim to keep the device in use all the > time we have outstanding requests: every time the device is ready to > accept a request, we feed it the "next" one in the queue; until it is free > again, requests pile up in the queue, being sorted, merged etc. > > This should perform very well under light loads - requests just get passed > to the device as fast as they can be handled - and under very heavy loads > - we have a large queue in play, with plenty of scope for reordering, > merges etc. The "large queue" goes against the whole point of this exercise - that is that if there are many items in the "queue" being sorted then unlucky requests can end up waiting a long time to get serviced. You want to have some reasonable break-off point where you just decide to make an elevator swipe in order to avoid starving those requests. Keep in mind that I'm using the word "queue" in a confusing manner here - we're talking about how much we'll try to elevator-sort in one go, not the actual queue of all requests. I've been calling this a queue because from a performance standpoint that's what it is, so I think it helps to think of the problem that way. Oh yeah, one other thing about huge elevator queues to think about (although it's not too pertinent these days). A friend of mine had a job 15 years or so ago working on a commercial UNIX flavor. They were having some severe I/O performance problems under high loads. The problem was that they were trying to sort and merge all the pending requests. When I/O was heavy this ended up chewing all the CPU time and the kernel wouldn't actually be able to keep up sending requests to the disk :-) However, keep in mind that this was in the bad-old-days when "hard drive" meant a rack-mount Eagle and the elevator was expected to consider things like head positioning and rotational speed when doing a sort...
-Mitch
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Wed, 13 Sep 2000, Mitchell Blank Jr wrote: > Alan Cox wrote: > > > Yes, but "how hard is it reasonable for the kernel to try" is based on > > > both items. A good first order approximation is number of requests. > > > > I must strongly disagree with that claim. A request could be 512 bytes or > > 128K. > > Yeah, as sct pointed out this gets thorny. For a modern harddrive this > probably doesn't matter (since sequential I/O is SO fast compared to > seek times) but for other devices it's an issue. How about a simple "cost metric" for each request? (Each individual request gets X "points", plus Y per block. Make Y almost zero for HDDs and tape drives etc., make X and Y equal for solid state storage, say.) In terms of latency, I'd suggest we aim to keep the device in use all the time we have outstanding requests: every time the device is ready to accept a request, we feed it the "next" one in the queue; until it is free again, requests pile up in the queue, being sorted, merged etc. This should perform very well under light loads - requests just get passed to the device as fast as they can be handled - and under very heavy loads - we have a large queue in play, with plenty of scope for reordering, merges etc. > > > ...where the user sets a number explicitly for what performance they > > > want. Again, if we're going to make the user set this latency > > > > No they do not. The parameters are defined by the bandwidth and measured > > behaviour. > > Hmmm... well if someone wants to propose an algorithm to self-tune the > "queue depth in milliseconds" number then maybe we'll get somewhere. > You'd need to do some sort of moving averages of both requests/sec and > sectors/sec that come out of the elevator and use those as feedback to > adjust the queue-depth-value. I'm not sure if this is advisable, but > at least it sounds feasible.
With the metrics above, you should be able to calculate appropriate values for X and Y to make the "cost" figures roughly correspond to actual times, I think? The big question, of course, is what do you do when the queue reaches the maximum - block the next process to make a request? Better, block all but the highest "I/O priority" process? Then, I can go copying, moving and deleting files left, right and centre without my MP3 player ever skipping, which is nice :-) James.
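James's X-plus-Y-per-block metric is simple enough to write out. The SSD/HDD/tape numbers below echo the weighted-scoring example from earlier in the thread and are hypothetical per-device tunables, not real kernel parameters:

```c
/* A request costs X "points" plus Y per block, so whether a merge
 * pays off falls straight out of the arithmetic. */

struct dev_cost {
    long seek_cost;    /* X: flat cost per request */
    long block_cost;   /* Y: cost per block */
};

static long req_cost(const struct dev_cost *d, long blocks)
{
    return d->seek_cost + d->block_cost * blocks;
}

/* a merge is worthwhile when one big request is cheaper than two */
static int merge_wins(const struct dev_cost *d, long b1, long b2)
{
    return req_cost(d, b1 + b2) < req_cost(d, b1) + req_cost(d, b2);
}
```

The same comparison doubles as the queue-limit test: whichever process's next request would push the summed cost past the device's budget is the one to block.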
Re: (reiserfs) Re: More on 2.2.18pre2aa2
Alan Cox wrote: > > Yes, but "how hard is it reasonable for the kernel to try" is based on > > both items. A good first order approximation is number of requests. > > I must strongly disagree with that claim. A request could be 512 bytes or > 128K. Yeah, as sct pointed out this gets thorny. For a modern harddrive this probably doesn't matter (since sequential I/O is SO fast compared to seek times) but for other devices it's an issue. > > ...where the user sets a number explicitly for what performance they > > want. Again, if we're going to make the user set this latency > > No they do not. The parameters are defined by the bandwidth and measured > behaviour. Hmmm... well if someone wants to propose an algorithm to self-tune the "queue depth in milliseconds" number then maybe we'll get somewhere. You'd need to do some sort of moving averages of both requests/sec and sectors/sec that come out of the elevator and use those as feedback to adjust the queue-depth-value. I'm not sure if this is advisable, but at least it sounds feasible. -Mitch
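One plausible shape for the self-tuning Mitchell asks about: track an exponential moving average of the completion rate in kernel-style fixed point and derive the queue depth from a latency target. The weight, the scaling and all the names below are guesswork, not a patch that existed:

```c
/* EWMA-based queue depth tuner, integer-only (no FP in the kernel). */

#define FP    1024   /* fixed-point scale */
#define ALPHA  128   /* EWMA weight: 128/1024 = 1/8 per sample */

struct tuner {
    long rate_fp;    /* completions per second, scaled by FP */
};

static void tuner_update(struct tuner *t, long completed, long elapsed_ms)
{
    long sample_fp = completed * 1000 * FP / elapsed_ms;

    t->rate_fp += ALPHA * (sample_fp - t->rate_fp) / FP;
}

/* depth that should let the queue drain within target_ms */
static long queue_depth(const struct tuner *t, long target_ms)
{
    long d = t->rate_fp * target_ms / (1000 * FP);

    return d > 1 ? d : 1;
}
```

The depth then scales itself with the device: a fast array converges to a deep queue, a floppy to a depth of one, with the same latency target for both.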
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Wed, 13 Sep 2000, Matthew Kirkwood wrote: > [0] Insert obligatory "why oh why" on Linus' not > taking SCT's sar/iostat patches. Indeed. It's a shame because 2.4 is gaining a relatively complete "Enterprise checklist". It's a further shame because it's a non-intrusive patch. Yes, detailed I/O usage figures _are_ an Enterprise requirement. And no, we don't have good enough statistics yet. A decent DBA will _require_ figures like % utilisation, average request service latency, average queue depth, etc. Cheers Chris
Re: (reiserfs) Re: More on 2.2.18pre2aa2
> Yes, but "how hard is it reasonable for the kernel to try" is based on > both items. A good first order approximation is number of requests. I must strongly disagree with that claim. A request could be 512 bytes or 128K. > It's still a queue - the queue of things we're going to take on this > elevator swipe, right? And the problem is one of keeping a sane > watermark on this queue - not too many requests to destroy latency > but enough to let the elevator do some good. Yes. OK, I agree there. If you want an efficiency bound then you need to consider that. > > I'm talking about flow control/traffic shaping > > ...where the user sets a number explicitly for what performance they > want. Again, if we're going to make the user set this latency No they do not. The parameters are defined by the bandwidth and measured behaviour.
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Tue, 12 Sep 2000, Rik van Riel wrote: [ outrageous cc: list trimmed ] > > >We simply keep track of how old the oldest request > > >in the queue is, and when that request is getting > > >too old (say 1/2 second), we /stop/ all the others > > > > Going in function of time is obviously wrong. A blockdevice can > > write 1 request every two seconds or 1 request every msecond. > > You can't assume anything in function of time _unless_ you have > > per harddisk timing informations into the kernel. > > Uhmmm, isn't the elevator about request /latency/ ? > > And if so, in what unit do you want to measure > latency if it isn't time? Number of requests which get serviced between the insertion and completion of the request. Maybe scaled by the average i/o queue length[0]. Matthew. [0] Insert obligatory "why oh why" on Linus' not taking SCT's sar/iostat patches.
Re: (reiserfs) Re: More on 2.2.18pre2aa2
Alan Cox wrote: > > time, but remember that there are two things measured in time here: > > A. The time for the whole queue of requests to run (this is what Rik is > > proposing using to throttle) > > B. The time an average request takes to process. > > Your perceived latency is based entirely on A. Yes, but "how hard is it reasonable for the kernel to try" is based on both items. A good first order approximation is number of requests. > > If we limit on the depth of queue we're (to some level of approximation) > > making our decision based on A/B. It's still a magic constant, but at > > I don't suggest you do queue limiting on that basis. I suggest you do order > limiting based on time slots It's still a queue - the queue of things we're going to take on this elevator swipe, right? And the problem is one of keeping a sane watermark on this queue - not too many requests to destroy latency but enough to let the elevator do some good. > > Well, actually just about any communications protocol worth its salt > > uses some sort of windowing throttle based on the amount of data > > I'm talking about flow control/traffic shaping ...where the user sets a number explicitly for what performance they want. Again, if we're going to make the user set this latency variable for each of their devices, then doing it based on time will work great. > > There's too many orders of magnitude difference between even just SCSI > > disks (10 yr old drive? 16-way RAID? Solid state?) to make > > supplying any sort of default with the kernel impractical. The end > > The same argument is equally valid for the current scheme, and I think you'll > find equally bogus There will always need to be tunables - and it's fine to say "if you've got oddball hardware and/or workload and/or requirements then you should twiddle this knob". But it seems to me that the current scheme works well for a pretty big range of devices.
If you do the setting based on time, I think it'll be a lot more sensitive since there's nothing that will scale based on the speed of the device. -Mitch
Re: (reiserfs) Re: More on 2.2.18pre2aa2
Hans Reiser wrote: > I really think Rik has it right here. In particular, an MP3 player needs to be able to say: I have X milliseconds of buffer, so make my worst-case latency X milliseconds. The number of requests is the wrong metric, because the time required per request depends on disk geometry, disk caching, etc. No, the problem is that an application should either: 1. Take full control of the underlying system, or 2. Not care about self-tuning the OS. Because that's what operating systems are for in the first place: letting applications run without caring about the underlying hardware. Linux is simply mistaken by design in having a generic elevator for any block device, sitting on a single queue for any kind of attached device. Only device drivers know best how to handle queueing and things like this. The upper layers should only care about the semantic correctness of the request order, not about optimizing it.
Re: (reiserfs) Re: More on 2.2.18pre2aa2
> time, but remember that there are two things measured in time here: > A. The time for the whole queue of requests to run (this is what Rik is > proposing using to throttle) > B. The time an average request takes to process. Your perceived latency is based entirely on A. > If we limit on the depth of queue we're (to some level of approximation) > making our decision based on A/B. It's still a magic constant, but at I don't suggest you do queue limiting on that basis. I suggest you do order limiting based on time slots > Well, actually just about any communications protocol worth its salt > uses some sort of windowing throttle based on the amount of data I'm talking about flow control/traffic shaping > If we move to a "length of queue in time" as Rik suggests then we're > going to have to MAKE the user set it manually for each device. No > There's too many orders of magnitude difference between even just SCSI > disks (10 yr old drive? 16-way RAID? Solid state?) to make > supplying any sort of default with the kernel impractical. The end The same argument is equally valid for the current scheme, and I think you'll find equally bogus
Re: (reiserfs) Re: More on 2.2.18pre2aa2
Alan Cox wrote: > > Now, I see people trying to introduce the concept of elapsed time into > > that fix, which smells strongly of hack. How will this hack be cobbled > > Actually my brain says that elapsed time based scheduling is the right thing > to do. No, Andrea is right here. The argument that everyone is using ("Our target - latency - is measured in time") is utterly bogus. Yes, it's measured in time, but remember that there are two things measured in time here: A. The time for the whole queue of requests to run (this is what Rik is proposing using to throttle). B. The time an average request takes to process. If we limit on the depth of the queue we're (to some level of approximation) making our decision based on A/B. It's still a magic constant, but at least it's scaled to take into account the speed of the drive. And underneath, it's still based on time. > It certainly works for networks Well, actually just about any communications protocol worth its salt uses some sort of windowing throttle based on the amount of data outstanding, not the length of time it's been in the queue. Which is why TCP works well over both GigE and 28.8. [*] Now substitute "big fiberchannel RAID" for GigE and "360K floppy" for 28.8 and you've got the same problem. * -- Yes, for optimal TCP over big WAN pipes you may want to use a larger buffer size, but that's a matter of the bandwidth-delay product, which isn't relevant when talking about storage. If we move to a "length of queue in time" as Rik suggests then we're going to have to MAKE the user set it manually for each device. There's too many orders of magnitude of difference between even just SCSI disks (10 yr old drive? 16-way RAID? Solid state?) to make supplying any sort of default with the kernel impractical. The end result might be a bit better behaved, but only just slightly. If people absolutely need this behavior for some reason, the current algorithm should stay as the default.
-Mitch
Re: (reiserfs) Re: More on 2.2.18pre2aa2
Andrea Arcangeli wrote: > > On Tue, 12 Sep 2000, Martin Dalecki wrote: > > >First of all: In the case of the mp3 player and such there is already a fine, proper > >way to give it a better chance of getting its job done smoothly - > >RT kernel scheduler priorities and proper IO buffering. I did something similar > >for a GDI printer driver... > > Take 2.2.15, set a buffer of 128mbyte (of course assume your mp3 is larger > than 128mbyte :) and then run in background `cp /dev/zero .` in the same > fs where your mp3 file out of cache is living. Then you'll see why a large > buffer is useless if there's no kind of fair I/O scheduling in the > elevator. Repeat the same test in 2.2.16 then. > > The I/O latency Hans was talking about for the mp3 player is the time it > takes for the buffer to become empty. I was talking about *proper* buffering, not necessarily *big* buffers.
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Tue, 12 Sep 2000, Andrea Arcangeli wrote: > On Tue, 12 Sep 2000, Rik van Riel wrote: > > >Uhmmm, isn't the elevator about request /latency/ ? > > Yes, but definitely not absolute "time" latency. So what do you think about the idea Jeff Merkey presented? It seems to work well for Netware, so maybe we should try for Linux? regards, Rik -- "What you're running that piece of shit Gnome?!?!" -- Miguel de Icaza, UKUUG 2000 http://www.conectiva.com/ http://www.surriel.com/
Re: (reiserfs) Re: More on 2.2.18pre2aa2
> Why do you say it's not been fixed? Can you still reproduce hangs as long as > a write(2) can write? I certainly can't. I can't reproduce long hangs. I'm not seeing as good I/O throughput as before, but right now I'm quite happy with the tradeoff. If someone can make it better then I'm happier still.
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Tue, 12 Sep 2000, Chris Evans wrote: >the elevator code. Keep it to a queue management system, and suddenly it >scales to slow or fast devices without any gross device-type specific >tuning. Yep, that was the objective. Andrea
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Tue, 12 Sep 2000, Andrea Arcangeli wrote: > On Tue, 12 Sep 2000, Chris Evans wrote: > > >like the task scheduler policies we have. For example, maybe we could flag > >a given process so that all its I/O requests go to the head of the queue. > > Yes. We can't do that at the moment because at the elevator > layer we don't know who is waiting for I/O completion. As an approximation, you could take the task that queues the request. Only kswapd, kflushd and kupdate would be the "false alarms" here ... and we don't care all that much about them since they're only writing anyway. regards, Rik -- "What you're running that piece of shit Gnome?!?!" -- Miguel de Icaza, UKUUG 2000 http://www.conectiva.com/ http://www.surriel.com/
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Tue, 12 Sep 2000, Alan Cox wrote: > > Now, I see people trying to introduce the concept of elapsed time into > > that fix, which smells strongly of hack. How will this hack be cobbled > > Actually my brain says that elapsed time based scheduling is the right > thing to do. It certainly works for networks Interesting, I'll try and run with this. The mention of networks reminds me that any "max service time" variable is a tunable quantity depending on current conditions... and sct's block device I/O accounting patches give us the current average request service time on a per-device basis. Multiply that up a bit and maybe you have your threshold for moving things to the head of the queue. Cheers Chris
Re: (reiserfs) Re: More on 2.2.18pre2aa2
Chris Evans wrote: > > On Tue, 12 Sep 2000, Hans Reiser wrote: > > > I really think Rik has it right here. In particular, an MP3 player > > needs to be able to say, I have X milliseconds of buffer so make my > > worst case latency X milliseconds. The number of requests is the > > wrong metric, because the time required per request depends on disk > > geometry, disk caching, etc. > > Hi, > > We need to separate "what's a good idea" from "what's the best way to do > it". > > Sure, it's a good idea to get an mp3 player's I/O request serviced within > a certain amount of time. Is the best way to do that hacking the concept > of time into the elevator algorithm? That's currently under debate. > > In fact, the guarantee of I/O service time for a single process (mp3 > player) is pretty orthogonal to the per-device elevator settings. If you > have a certain block device set to a max latency of 0.25s, and lots of > processes are hammering the disk, then something will have to give, > i.e. under heavy load this setting will be useless and not honoured. > > The solution to this particular mp3 player scenario would be something > like the task scheduler policies we have. For example, maybe we could flag > a given process so that all its I/O requests go to the head of the queue. > > Cheers > Chris First of all: In the case of the mp3 player and such, there is already a fine, proper way to give it a better chance of getting its job done smoothly - RT kernel scheduler priorities and proper IO buffering. I did something similar for a GDI printer driver... Second: The concept of time can give you very nasty behaviour in edge cases. Assume that a disc can only do 1 request per second, and then imagine scheduling based on a 1+epsilon second window... basically the disc will run at half the speed it could. Nasty integer arithmetic like this can catch you easily, and almost always entirely unexpectedly.
Third: all you are trying to improve is the boundary case between an
entirely overloaded system and a system which has huge reserves to get the
task done. I don't think you can find any "improvement" which will not
just improve some cases and hurt some only slightly different cases badly.
That's basically the same problem as with the choice of paging strategy.
(However, we have some kind of "common sense" in respect of this, despite
the fact that Linux ignores it...)

Fourth: the most common solution for such boundary cases is some notion of
cost optimization - like the nice value of a process, or page age - or
alternatively some kind of choice between entirely different strategies
(remember the term "strategy routine"). All of these are *relative*
measures, not absolute time constraints.

Fifth: I think this kind of I/O behaviour control isn't something generic
enough for the elevator - it should all be done at the device driver
level, if at all. In fact you already have bad interactions between the
strategies of low-level drivers and the high-level code in Linux - for
example, the "get it from the top of the queue" vs. "don't get it from the
top of the I/O queue" mess between the IDE and SCSI midlayers...
(However, this got a bit better recently.)

--
- phone: +49 214 8656 283
- job: STOCK-WORLD Media AG, LEV .de (MY OPINIONS ARE MY OWN!)
- langs: de_DE.ISO8859-1, en_US, pl_PL.ISO8859-2, last resort: ru_RU.KOI8-R
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Tue, 12 Sep 2000, Hans Reiser wrote:

> I really think Rik has it right here. In particular, an MP3 player
> needs to be able to say, I have X milliseconds of buffer so make my
> worst case latency X milliseconds. The number of requests is the
> wrong metric, because the time required per request depends on disk
> geometry, disk caching, etc.

Hi,

We need to separate "what's a good idea" from "what's the best way to do
it".

Sure, it's a good idea to get an mp3 player's I/O request serviced within
a certain amount of time. Is the best way to do that hacking the concept
of time into the elevator algorithm? That's currently under debate.

In fact, the guarantee of I/O service time for a single process (mp3
player) is pretty orthogonal to the per-device elevator settings. If you
have a certain block device set to a max latency of 0.25s, and lots of
processes are hammering the disk, then something will have to give,
i.e. under heavy load this setting will be useless and not honoured.

The solution to this particular mp3 player scenario would be something
like the task scheduler policies we have. For example, maybe we could flag
a given process so that all its I/O requests go to the head of the queue.

Cheers
Chris
Re: (reiserfs) Re: More on 2.2.18pre2aa2
I really think Rik has it right here. In particular, an MP3 player needs
to be able to say, I have X milliseconds of buffer so make my worst case
latency X milliseconds. The number of requests is the wrong metric,
because the time required per request depends on disk geometry, disk
caching, etc.

Hans
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Tue, 12 Sep 2000, Rik van Riel wrote:

> On Tue, 12 Sep 2000, Andrea Arcangeli wrote:
> > On Tue, 12 Sep 2000, Rik van Riel wrote:
> >
> > > Uhmmm, isn't the elevator about request /latency/ ?
> >
> > Yes, but definitely not absolute "time" latency.
> >
> > How do you get a 1msec latency for a read request out of a
> > blockdevice that writes 1 request in 2 seconds? See?
>
> Of course, if you set a ridiculous latency figure you'll
> get ridiculously bad performance. However, that doesn't
> say /anything/ about whether the idea is a good one or not...

People,

Remember why the elevator algorithm was changed in the first place? It was
introduced to solve a very specific problem: the original elevator code
did not schedule I/O particularly fairly under certain I/O usage patterns.
So it got fixed.

Now, I see people trying to introduce the concept of elapsed time into
that fix, which smells strongly of hack. How will this hack be cobbled
into the elevator code so that it copes with block devices from fast RAID
arrays to slow floppies to network block devices?

So I have to agree with Andrea that the concept of time does not belong in
the elevator code. Keep it to a queue management system, and suddenly it
scales to slow or fast devices without any gross device-type specific
tuning.

Cheers
Chris
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Mon, 11 Sep 2000, Andi Kleen wrote:

>Given, but adding the unlock_kernel() does not really need much effort,
>it is a very cheap (programmer time wise) optimization.

Well, since there seems to be interest in this (and it's indeed very cheap
programmer time wise :) I will try to add such an SMP optimization for the
next 2.2.xaa release.

BTW, unlock_kernel() alone isn't enough. I think the right implementation
is:

	int old_lock_depth = current->lock_depth;

	/* drop the big kernel lock if we are holding it */
	if (old_lock_depth >= 0) {
		current->lock_depth = -1;
		spin_unlock(&kernel_flag);
	}

	/* copy user */

	/* reacquire the big kernel lock and restore the recursion depth */
	if (old_lock_depth >= 0) {
		if (current->lock_depth != -1)
			BUG();
		current->lock_depth = old_lock_depth;
		spin_lock(&kernel_flag);
	}

Andrea
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Mon, Sep 11, 2000 at 04:44:21PM +0200, Andrea Arcangeli wrote:
> On Mon, 11 Sep 2000, Michael T. Babcock wrote:
>
> >Considering there are a lot of people still using 2.0.x because they find it
> >more stable than the 2.2.x series, doesn't it make sense to give this
> >scalability to people who are already running SMP boxes on 2.2.x and who may
> >decide to use ReiserFS?
>
> Note that the thing isn't specific to reiserfs. All fs and networking (not
> firewalling) would benefit from that.

Networking has been doing it for a few kernel releases now, and it was a
big improvement in some benchmarks (though probably not that much on real
world loads).

> My point is that it's more productive to solve the last 2.4.x troubles
> (that still corrupt d'Itri's fs in test8) than to try to achieve better
> scalability performance with 2.2.x 8).

Given, but adding the unlock_kernel() does not really need much effort;
it is a very cheap (programmer time wise) optimization.

-Andi
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Mon, 11 Sep 2000, Michael T. Babcock wrote:

>Considering there are a lot of people still using 2.0.x because they find it
>more stable than the 2.2.x series, doesn't it make sense to give this
>scalability to people who are already running SMP boxes on 2.2.x and who may
>decide to use ReiserFS?

Note that the thing isn't specific to reiserfs. All filesystems and
networking (not firewalling) would benefit from it.

My point is that it's more productive to solve the last 2.4.x troubles
(it still corrupts d'Itri's fs in test8) than to try to achieve better
scalability performance with 2.2.x 8).

Andrea
Re: (reiserfs) Re: More on 2.2.18pre2aa2
Considering there are a lot of people still using 2.0.x because they find
it more stable than the 2.2.x series, doesn't it make sense to give this
scalability to people who are already running SMP boxes on 2.2.x and who
may decide to use ReiserFS?

----- Original Message -----
From: "Andrea Arcangeli" <[EMAIL PROTECTED]>

> On Mon, 11 Sep 2000, Andi Kleen wrote:
>
> >BTW, there is another optimization that could help reiserfs a lot
> >on SMP settings: do an unlock_kernel()/lock_kernel() around the user
> >copies. It is quite legal to do that (you have to handle sleeping
> >anyway in case of a page fault), and it allows CPUs to run in parallel
> >for long running copies.
>
> I'd prefer not to spend time making 2.2.x scale better in SMP; 2.4.x
> just fixed that problem by dropping the big lock in the first place in
> the read/write paths :). The copy-user reschedule points were bugfixes
> instead.
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Mon, 11 Sep 2000, Andi Kleen wrote:

>BTW, there is another optimization that could help reiserfs a lot
>on SMP settings: do an unlock_kernel()/lock_kernel() around the user
>copies. It is quite legal to do that (you have to handle sleeping
>anyway in case of a page fault), and it allows CPUs to run in parallel
>for long running copies.

I'd prefer not to spend time making 2.2.x scale better in SMP; 2.4.x
just fixed that problem by dropping the big lock in the first place in
the read/write paths :). The copy-user reschedule points were bugfixes
instead.

Andrea
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Mon, Sep 11, 2000 at 08:15:15AM -0400, Chris Mason wrote:
> LFS changes for filldir, reiserfs_readpage, and adds limit checking in
> file_write to make sure we don't go above 2GB (Andi Kleen). Also fixes
> include/linux/fs.h, which does not patch cleanly for 3.5.25 because of usb.
>
> Note, you might see debugging messages about items moving during
> copy_from_user. These are safe, but I'm leaving them in for now as I'd
> like to find out why copy_from_user is suddenly scheduling much more than
> it used to.

That's easy to explain. Andrea's latest aa contains some low latency
patches, which add a

	if (current->need_resched)
		schedule();

to copy*user to avoid bad scheduling latencies for big copies. The result
is that you see a lot more schedules, every time the copy*user happens to
hit the end of a time slice.

BTW, there is another optimization that could help reiserfs a lot on SMP
settings: do an unlock_kernel()/lock_kernel() around the user copies. It
is quite legal to do that (you have to handle sleeping anyway in case of a
page fault), and it allows CPUs to run in parallel for long running
copies.

-Andi
Re: (reiserfs) Re: More on 2.2.18pre2aa2
On Mon, 11 Sep 2000, Chris Mason wrote:

>reiserfs-3.5.25, this patch. I tested against pre3-aa2.

BTW, pre3-aa2 means 2.2.18pre2aa2.bz2 applied on top of 2.2.18pre3.

>Note, you might see debugging messages about items moving during
>copy_from_user. These are safe, but I'm leaving them in for now as I'd
>like to find out why copy_from_user is suddenly scheduling much more than
>it used to.

In 2.2.18pre2aa2.bz2 there's a latency bugfix; now a:

	read(fd, &buf, 0x7fff)
	write(fd, &buf, 0x7fff)
	sendfile(src, dst, NULL, 0x7fff)

doesn't hang the machine anymore for several seconds. (Well, really all
three are still a bit buggy, because they don't interrupt themselves when
a signal arrives, so we can't use sendfile for `cp` of files smaller than
2G yet... but at least now you can do something in parallel, even if you
are lucky enough to have some gigabytes of fs cache.)

Andrea