Re: Pluggable Disk Scheduler Project
> I wouldn't focus on just anticipation, but also other types of
> schedulers (I/O scheduling influenced by nice value?)

One note on this subject: it will be nice for things like bg-fsck, but it also brings up a point that has been overlooked so far: priority propagation.

As far as I am aware, there are plenty of locks that a process can hold while waiting for I/O to complete (directories, snapshots, probably some others I missed). Now, normally when a high-priority thread needs a lock owned by another process, it'll bump up that thread's priority and yield, hopefully freeing the lock up quickly. This is obviously not quite so easy for I/O. I haven't quite understood the code involved yet; as far as I can tell, turnstiles panic when the lock-owning thread is sleeping.

What we'll probably need to do is make priority propagation wake up a waiting-for-I/O process, which then needs to dig up its I/O request (which may be anywhere in GEOM, but potentially held back by the scheduler) and make sure it is [re]queued with higher priority.

If we don't do this, we'll get funny effects such as a bg-fsck blocking some high-priority process indefinitely on an I/O-busy system, because the fsck happens to be waiting for I/O while holding the snapshot lock. If we do this, we may get into significant fun with cutting into GEOM to allow requeuing, or waste some CPU with polling from the queuing GEOM.

This point may not be immediately obvious to people coming from the I/O/filesystem field, but it is something we should keep in mind.

So far,
Karsten Behrmann
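A rough sketch of the wake-up-and-requeue path described above, assuming a hypothetical per-request priority field (bio_prio) and a scheduler-private queue; none of these names exist in the tree, they only illustrate bumping and re-sorting a request that a priority-propagating thread is blocked behind:

    /*
     * Hypothetical sketch only: bio_prio, struct gs_softc and
     * gs_boost_bio() are invented names; struct bio has no priority
     * field today.  The idea is just "find the held-back request and
     * move it to the front with a higher priority".
     */
    #include <sys/param.h>
    #include <sys/bio.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>
    #include <sys/queue.h>

    struct gs_softc {
            struct mtx        gs_mtx;       /* protects gs_queue */
            TAILQ_HEAD(, bio) gs_queue;     /* requests held back by the scheduler */
    };

    /* Called when priority propagation finds the lock owner asleep on I/O. */
    static void
    gs_boost_bio(struct gs_softc *sc, struct bio *bp, int new_prio)
    {
            mtx_lock(&sc->gs_mtx);
            if (bp->bio_prio < new_prio) {          /* hypothetical field */
                    bp->bio_prio = new_prio;
                    /* Move the request to the head so it is dispatched next. */
                    TAILQ_REMOVE(&sc->gs_queue, bp, bio_queue);
                    TAILQ_INSERT_HEAD(&sc->gs_queue, bp, bio_queue);
            }
            mtx_unlock(&sc->gs_mtx);
    }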
Re: Pluggable Disk Scheduler Project
> From: Ulf Lilleengen <[EMAIL PROTECTED]>
> Date: Wed, Oct 17, 2007 03:09:35PM +0200
>
> On ons, okt 17, 2007 at 02:19:07 +0200, Fabio Checconi wrote:
> > Maybe I've not enough experience/knowledge of the driver subsystem, [...]
>
> If you look at it, Hybrid is just a generalization of the existing bioq_* API already defined. And this API is used by GEOM classes _before_ device drivers get the requests AFAIK.

I looked at the Hybrid code, but I don't think that the bioq_* family of calls can be the right place to start, for the problems experienced during the Hybrid development with locking/anticipation and because you can have the same request passing through multiple bioqs during its path to the device (e.g., two stacked geoms using two different bioqs and then a device driver using bioq_* to organize its queue, or geoms using more than one bioq, like raid3; I think the complexity can become unmanageable.) One could even think of configuring each single bioq in the system, but things can get very complex in this way.

> For a simple example on a driver, the md-driver might be a good place to look. Note that I have little experience and knowledge of the driver subsystem myself.

I'll take a look, thanks.

> Also note (from the Hybrid page):
> * we could not provide support for non work-conserving schedulers, due to a [...]
>
> This certainly argues for having this in the GEOM layer, but perhaps it's possible to change the assumptions made in some drivers? The locking issue should perhaps be better planned though, and an audit of the driver disksort code is necessary.

I need some more time to think about that :)

> Also:
> * as said, the ATA driver in 6.x/7.x moves the disksort one layer below the one we are working at, so this particular work won't help on ATA-based 6.x machines.
> We should figure out how to address this, because the work done at that layer is mostly a replica of the bioq_*() API.
>
> So, I see this can get a bit messy considering that the ATA driver does disksort on its own, but perhaps it would be possible to fix this by changing the general ATA driver to use its own pluggable scheduler.
>
> Anyway, I shouldn't demand that you do this, especially since I don't have any code or anything to show, and because you decide what you want to do.

I still cannot say if a GEOM scheduler is better than a scheduler put at a lower level, or if the bioq_* interface is better than any other alternative, so your suggestions are welcome. Moreover I'd really like to discuss/work together, or at least do things with some agreement on them. If I have the time to experiment with more than one solution I'll be happy to do that.

> However, I'd hate to see the Hybrid effort go to waste :) I was hoping some of the authors of the project would reply with their thoughts, so I CC'ed them.

Well, the work done on Hybrid had also interesting aspects from the algorithm side... but that's another story...
Re: Pluggable Disk Scheduler Project
On Wed, Oct 17, 2007 at 03:09:35PM +0200, Ulf Lilleengen wrote:

... discussion on Hybrid vs. GEOM as a suitable location for
... pluggable disk schedulers

> However, I'd hate to see the Hybrid effort go to waste :) I was hoping some of the authors of the project would reply with their thoughts, so I CC'ed them.

We are in good contact with Fabio and I am monitoring the discussion, don't worry.

cheers
luigi
Re: Pluggable Disk Scheduler Project
On ons, okt 17, 2007 at 02:19:07 +0200, Fabio Checconi wrote:
> > From: Ulf Lilleengen <[EMAIL PROTECTED]>
> > Date: Wed, Oct 17, 2007 01:07:15PM +0200
> >
> > On tir, okt 16, 2007 at 04:10:37 +0200, Karsten Behrmann wrote:
> > Over to a more general view of its architecture:
> >
> > When I looked at this project for the first time, I was under the impression that this would be best done in a GEOM class.
> >
> > However, I think the approach that was taken in the Hybrid project is better
>
> Ok. I think that such a solution requires a lot more effort on the design and coding sides, as it requires the modification of the drivers and can bring us problems with locking and with the queueing assumptions that may vary on a per-driver basis.

I completely agree with the issue of converting device drivers, but at least it will be an _optional_ possibility (having different scheduler plugins could make this possible). One does not necessarily need to convert the drivers.

> Maybe I've not enough experience/knowledge of the driver subsystem, but I would not remove the queueing that is done now by the drivers (think of ata freezepoints,) but instead I'd like to try to grab the requests before they get to the driver (e.g., in/before their d_strategy call) and have some sort of pull mechanism when requests complete (still don't have any (serious) idea on that, I fear that the right place to do that, for locking issues and so on, can be driver dependent.) Any ideas on that? Which drivers can be good starting points to try to write down some code?

If you look at it, Hybrid is just a generalization of the existing bioq_* API already defined. And this API is used by GEOM classes _before_ device drivers get the requests AFAIK.

For a simple example on a driver, the md-driver might be a good place to look. Note that I have little experience and knowledge of the driver subsystem myself.

Also note (from the Hybrid page):

* we could not provide support for non work-conserving schedulers, due to a couple of reasons:
  1. the assumption, in some drivers, that bioq_disksort() will make requests immediately available (so a subsequent bioq_first() will not return NULL).
  2. the fact that there is no bioq_lock()/bioq_unlock(), so the scheduler does not have a safe way to generate requests for a given queue.

This certainly argues for having this in the GEOM layer, but perhaps it's possible to change the assumptions made in some drivers? The locking issue should perhaps be better planned though, and an audit of the driver disksort code is necessary.

Also:

* as said, the ATA driver in 6.x/7.x moves the disksort one layer below the one we are working at, so this particular work won't help on ATA-based 6.x machines. We should figure out how to address this, because the work done at that layer is mostly a replica of the bioq_*() API.

So, I see this can get a bit messy considering that the ATA driver does disksort on its own, but perhaps it would be possible to fix this by changing the general ATA driver to use its own pluggable scheduler.

Anyway, I shouldn't demand that you do this, especially since I don't have any code or anything to show, and because you decide what you want to do.

However, I'd hate to see the Hybrid effort go to waste :) I was hoping some of the authors of the project would reply with their thoughts, so I CC'ed them.
--
Ulf Lilleengen
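The first limitation quoted above — that bioq_disksort() is assumed to make a request immediately visible to a following bioq_first() — comes from the usual driver strategy pattern. A minimal sketch of that pattern (the driver name, softc layout and foo_start() are invented; the bioq calls are the existing API):

    #include <sys/param.h>
    #include <sys/bio.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>
    #include <geom/geom_disk.h>

    struct foo_softc {
            struct mtx              sc_mtx;
            struct bio_queue_head   sc_queue;
    };

    static void foo_start(struct foo_softc *sc);    /* pushes sc_queue to hardware */

    /*
     * Typical strategy routine: it relies on bioq_first() seeing the
     * request that bioq_disksort() just inserted.  A pluggable,
     * non work-conserving scheduler that held the request back would
     * leave the queue empty at this point and stall the device.
     */
    static void
    foo_strategy(struct bio *bp)
    {
            struct foo_softc *sc = bp->bio_disk->d_drv1;

            mtx_lock(&sc->sc_mtx);
            bioq_disksort(&sc->sc_queue, bp);       /* enqueue, sorted */
            mtx_unlock(&sc->sc_mtx);
            foo_start(sc);                          /* expects bioq_first() != NULL */
    }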
Re: Pluggable Disk Scheduler Project
> From: Ulf Lilleengen <[EMAIL PROTECTED]>
> Date: Wed, Oct 17, 2007 01:07:15PM +0200
>
> On tir, okt 16, 2007 at 04:10:37 +0200, Karsten Behrmann wrote:
> Over to a more general view of its architecture:
>
> When I looked at this project for the first time, I was under the impression that this would be best done in a GEOM class.
>
> However, I think the approach that was taken in the Hybrid project is better

Ok. I think that such a solution requires a lot more effort on the design and coding sides, as it requires the modification of the drivers and can bring us problems with locking and with the queueing assumptions that may vary on a per-driver basis.

Maybe I've not enough experience/knowledge of the driver subsystem, but I would not remove the queueing that is done now by the drivers (think of ata freezepoints,) but instead I'd like to try to grab the requests before they get to the driver (e.g., in/before their d_strategy call) and have some sort of pull mechanism when requests complete (still don't have any (serious) idea on that, I fear that the right place to do that, for locking issues and so on, can be driver dependent.) Any ideas on that? Which drivers can be good starting points to try to write down some code?

> Also, I got my test-box up again today, and will be trying your patch as soon as I've upgraded it to CURRENT, Fabio.

Thank you very much! Please consider that my primary concern with the patch was its interface; the algorithm is just an example (it should give an idea of the performance loss due to the mechanism overhead with async requests, and some improvement on greedy sync loads.)
Re: Pluggable Disk Scheduler Project
On tir, okt 16, 2007 at 04:10:37 +0200, Karsten Behrmann wrote:
> > Hi, is anybody working on the `Pluggable Disk Scheduler Project' from the ideas page?
>
> I've been kicking the idea around in my head, but I'm probably newer to everything involved than you are, so feel free to pick it up. If you want, we can toss some ideas and code to each other, though I don't really have anything on the latter.
>
> [...]
> > After reading [1], [2] and its follow-ups the main problems that need to be addressed seem to be:
> >
> > o is working on disk scheduling worth it at all?
>
> Probably, one of the main applications would be to make the background fsck a little more well-behaved.

I agree, and as I said before, the ability to give I/O priorities is probably one of the most important things.

> > o Where is the right place (in GEOM) for a disk scheduler?
[...]
> > o How can anticipation be introduced into the GEOM framework?
>
> I wouldn't focus on just anticipation, but also other types of schedulers (I/O scheduling influenced by nice value?)
>
> > o What can be an interface for disk schedulers?
>
> Good question, but geom seems a good start ;)
>
> > o How to deal with devices that handle multiple requests at a time?
>
> Bad news first: this is most disks out there, in a way ;) SCSI has tagged queuing, ATA has native command queuing or whatever the ata people came up with over their morning coffee today. I'll mention a bit more about this further down.
>
> > o How to deal with metadata requests and other VFS issues?
>
> Like any other disk request, though for priority-respecting schedulers this may get rather interesting.
>
> [...]
> > The main idea is to allow the scheduler to enqueue the requests having only one (other small fixed numbers can be better on some hardware) outstanding request and to pass new requests to its provider only after the service of the previous one ended.
[...]
> - servers where anticipatory performs better than elevator
> - realtime environments that need a scheduler fitting their needs
> - the background fsck, if someone implements a "priority" scheduler

Apache is actually a good candidate according to the old anticipatory design document (not sure of its relevance today, but...)

Over to a more general view of its architecture:

When I looked at this project for the first time, I was under the impression that this would be best done in a GEOM class.

However, I think the approach that was taken in the Hybrid project is better because of the following reasons:

- It makes it possible for it to be used by _both_ GEOM classes and device drivers (which might use some other scheduler type?).
- It does not remove any configurability, since changing it etc. can be done by the user with sysctl.
- It could make it possible for a GEOM class to decide for itself which scheduler it wants to use (most GEOM classes use the standard bioq_disksort interface in subr_disk.c).
- The ability to stack a GEOM class with a scheduler could easily be "emulated" by creating a GEOM class that utilizes the disksched framework.

All in all, I just think this approach gives more flexibility than putting it in a GEOM class that has to be added manually by a user. Just my thoughts on this.

Also, I got my test-box up again today, and will be trying your patch as soon as I've upgraded it to CURRENT, Fabio.

--
Ulf Lilleengen
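On the sysctl-based configurability mentioned in the list above, a minimal sketch of what such a knob could look like; the node and variable names (kern.geom.sched.scheduler) are invented for illustration and nothing like this exists in the tree today:

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/sysctl.h>

    /*
     * Hypothetical knob: the name of the scheduler that new queues
     * should use.  A real implementation would look the name up in
     * the list of registered schedulers when a device is attached.
     */
    static char gs_default_name[32] = "as";         /* e.g. anticipatory */

    SYSCTL_DECL(_kern_geom);
    SYSCTL_NODE(_kern_geom, OID_AUTO, sched, CTLFLAG_RW, 0,
        "Pluggable disk scheduler");
    SYSCTL_STRING(_kern_geom_sched, OID_AUTO, scheduler, CTLFLAG_RW,
        gs_default_name, sizeof(gs_default_name),
        "Default disk scheduler for new devices");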
Re: Pluggable Disk Scheduler Project
> From: Karsten Behrmann <[EMAIL PROTECTED]>
> Date: Tue, Oct 16, 2007 04:10:37PM +0200
>
> > Hi, is anybody working on the `Pluggable Disk Scheduler Project' from the ideas page?
>
> I've been kicking the idea around in my head, but I'm probably newer to everything involved than you are, so feel free to pick it up. If you want, we can toss some ideas and code to each other, though I don't really have anything on the latter.

Thank you for your answer, I'd really like to work/discuss with you and anyone else interested in this project.

> > o Where is the right place (in GEOM) for a disk scheduler?
>
> I have spent some time at eurobsdcon talking to Kirk and phk about this, and the result was that I now know strong proponents both for putting it into the disk drivers and for putting it into geom ;-)
>
> Personally, I would put it into geom. I'll go into more detail on this later, but basically, geom seems a better fit for "high-level" code than a device driver, and if done properly performance penalties should be negligible.

I'm also pretty interested in the arguments for the opposite solution; I've started from GEOM because it seemed to be a) what was proposed/requested on the ideas page, and b) cleaner, at least for a prototype. I wanted to start with some code also to evaluate the performance penalties of that approach.

I am a little bit scared by the prospect of changing the queueing mechanisms that drivers use, as this kind of modification can be difficult to write, test and maintain, but I'd really like to know what people with experience in those kernel areas think about the possibility of doing more complex I/O scheduling, with some sort of unified interface, at this level.

As a side note, so far I've not posted any performance numbers because at the moment I only have access to old ATA drives that would not give significant results.

> > o How can anticipation be introduced into the GEOM framework?
>
> I wouldn't focus on just anticipation, but also other types of schedulers (I/O scheduling influenced by nice value?)

That would be interesting, especially for the background fsck case. I think that some kind of fair sharing approach should be used; as you say below, a priority-driven scheduler can have relations with the VFS that are difficult to track. (This problem was pointed out also in one of the follow-ups to [1].)

> > o How to deal with metadata requests and other VFS issues?
>
> Like any other disk request, though for priority-respecting schedulers this may get rather interesting.
>
> [...]
> > The main idea is to allow the scheduler to enqueue the requests having only one (other small fixed numbers can be better on some hardware) outstanding request and to pass new requests to its provider only after the service of the previous one ended.
>
> You'll want to queue at least two requests at once. The reason for this is performance:
> Currently, drivers queue their own I/O. This means that as soon as a request completes (on devices that don't have in-device queues), they can fairly quickly grab a new request from their internal queue and push it back to the device from the interrupt handler or some other fast method.

Wouldn't that require, to be sustainable (unless you want a fast dispatch every two requests,) that the driver queue is always of length two or more?

In this way, to refill the driver queue, you ask the upper scheduler to dispatch a new request every time a request is taken from the driver queue and sent to the disk, or at any other moment before the request under service is completed. In this way you cannot have an anticipation mechanism, because the next request you'll want to dispatch from the upper scheduler has not yet been issued (it will be issued only after the one being served is completed and after the userspace application restarts.)

> Having the device idle while the response percolates up the geom stack and a new request down will likely be rather wasteful.

I completely agree on that. I've only done it this way because it was the least intrusive option I could find. What could be more efficient alternatives? (Obviously without subverting any of the existing interfaces, and allowing the anticipation of requests.)

> For disks with queuing, I'd recommend trying to keep the queue reasonably full (unless the queuing strategy says otherwise), for disks without queuing I'd say we want to push at least one more request down. Personally, I think the sanest design would be to have device drivers return a "temporary I/O error" along the lines of EAGAIN, meaning their queue is full.
> This can
Re: Pluggable Disk Scheduler Project
> Hi,
> is anybody working on the `Pluggable Disk Scheduler Project' from the ideas page?

I've been kicking the idea around in my head, but I'm probably newer to everything involved than you are, so feel free to pick it up. If you want, we can toss some ideas and code to each other, though I don't really have anything on the latter.

[...]
> After reading [1], [2] and its follow-ups the main problems that need to be addressed seem to be:
>
> o is working on disk scheduling worth it at all?

Probably, one of the main applications would be to make the background fsck a little more well-behaved.

> o Where is the right place (in GEOM) for a disk scheduler?

I have spent some time at eurobsdcon talking to Kirk and phk about this, and the result was that I now know strong proponents both for putting it into the disk drivers and for putting it into geom ;-)

Personally, I would put it into geom. I'll go into more detail on this later, but basically, geom seems a better fit for "high-level" code than a device driver, and if done properly performance penalties should be negligible.

> o How can anticipation be introduced into the GEOM framework?

I wouldn't focus on just anticipation, but also other types of schedulers (I/O scheduling influenced by nice value?)

> o What can be an interface for disk schedulers?

Good question, but geom seems a good start ;)

> o How to deal with devices that handle multiple requests at a time?

Bad news first: this is most disks out there, in a way ;) SCSI has tagged queuing, ATA has native command queuing or whatever the ata people came up with over their morning coffee today. I'll mention a bit more about this further down.

> o How to deal with metadata requests and other VFS issues?

Like any other disk request, though for priority-respecting schedulers this may get rather interesting.

[...]
> The main idea is to allow the scheduler to enqueue the requests having only one (other small fixed numbers can be better on some hardware) outstanding request and to pass new requests to its provider only after the service of the previous one ended.

You'll want to queue at least two requests at once. The reason for this is performance: currently, drivers queue their own I/O. This means that as soon as a request completes (on devices that don't have in-device queues), they can fairly quickly grab a new request from their internal queue and push it back to the device from the interrupt handler or some other fast method. Having the device idle while the response percolates up the geom stack and a new request down will likely be rather wasteful.

For disks with queuing, I'd recommend trying to keep the queue reasonably full (unless the queuing strategy says otherwise), for disks without queuing I'd say we want to push at least one more request down. Personally, I think the sanest design would be to have device drivers return a "temporary I/O error" along the lines of EAGAIN, meaning their queue is full.

> The example scheduler in the draft takes the following approach:
>
> o a scheduling GEOM class is introduced. It can be stacked on top of disk geoms, and schedules all the requests coming from its consumers. I'm not absolutely sure that a new class is really needed but I think that it can simplify testing and experimenting with various solutions on the scheduler placement.

Probably, though we'll want to make sure that they stack on top of (or are inside of?) the geoms talking to the disks, because it rarely makes sense to put a queuing geom on top of, say, a disklabel geom.

The advantage of making it a full geom is configurability. You would be able to swap out a scheduler at runtime, select different schedulers for different disks, and potentially even test new schedulers without rebooting (though you wouldn't want to do that for benchmarks).

> o Requests coming from consumers are passed down immediately if there is no other request under service, otherwise they are queued in a bioq.

This is specific to the anticipatory scheduler. I would say in more general terms:

- A queuing geom is to push all requests that it wants serviced down towards the disk, until the disk reports a queue full. A queuing geom is allowed to hold back requests even when the driver queue is not full yet, if it does not want the disk to attempt such I/O yet (such as the anticipatory scheduler waiting for another disk request near the last one, or the process-priority scheduler holding back a low-priority request that would potentially cause a long seek, until I/O has been idle).

This dispels phk's anti-geom argument of "it will be inefficient because it will take longer for a new request to get to the driver" - if the queuing strategy had wanted the r
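A minimal sketch of the dispatch loop this suggests, assuming a hypothetical gs_push_down() that hands a request to the layer below and returns EAGAIN when its queue is full (today g_io_request() returns nothing, so the error path here is purely illustrative), plus invented gs_next_bio()/gs_requeue_front() policy helpers:

    #include <sys/param.h>
    #include <sys/bio.h>
    #include <sys/errno.h>

    struct gs_softc;                                            /* scheduler-private state */
    static struct bio *gs_next_bio(struct gs_softc *);          /* policy decision */
    static void gs_requeue_front(struct gs_softc *, struct bio *);
    static int gs_push_down(struct gs_softc *, struct bio *);   /* EAGAIN if queue full */

    /*
     * Keep handing requests to the lower layer until either the policy
     * decides to hold back (anticipation, priorities, ...) or the lower
     * queue reports that it is full.
     */
    static void
    gs_dispatch_loop(struct gs_softc *sc)
    {
            struct bio *bp;

            for (;;) {
                    bp = gs_next_bio(sc);
                    if (bp == NULL)
                            break;          /* nothing we want to issue now */
                    if (gs_push_down(sc, bp) == EAGAIN) {
                            /* Lower queue full: keep the request for later. */
                            gs_requeue_front(sc, bp);
                            break;
                    }
            }
    }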
Re: Pluggable Disk Scheduler Project
Quoting Fabio Checconi <[EMAIL PROTECTED]> (from Thu, 11 Oct 2007 13:48:28 +0200):

> From: Ulf Lilleengen <[EMAIL PROTECTED]>
> Date: Thu, Oct 11, 2007 10:07:34AM +0200
>
> On tor, okt 11, 2007 at 04:20:01 +0200, Fabio Checconi wrote:
> > o What can be an interface for disk schedulers?
> I think the interface developed in [1] is a pretty good one actually. I think the disksort-routines looked like a good place to do this. Even there it might not know enough about the hardware.
> > o How to deal with devices that handle multiple requests at a time?
> This is an example of the problems you get doing this in GEOM. You don't have very good knowledge of the hardware.

Can't one pass this info from the lower layers up into GEOM (maybe by adding some attribute querying interface in GEOM if it doesn't exist)?

Bye,
Alexander.

--
The naked truth of it is, I have no shirt.
		-- William Shakespeare, "Love's Labour's Lost"

http://www.Leidinger.net  Alexander @ Leidinger.net: PGP ID = B0063FE7
http://www.FreeBSD.org    netchild @ FreeBSD.org    : PGP ID = 72077137
Re: Pluggable Disk Scheduler Project
> From: Alexander Leidinger <[EMAIL PROTECTED]>
> Date: Fri, Oct 12, 2007 08:18:35AM +0200
>
> Quoting Fabio Checconi <[EMAIL PROTECTED]> (from Thu, 11 Oct 2007 13:48:28 +0200):
>
> >> From: Ulf Lilleengen <[EMAIL PROTECTED]>
> >> Date: Thu, Oct 11, 2007 10:07:34AM +0200
> >>
> >> On tor, okt 11, 2007 at 04:20:01 +0200, Fabio Checconi wrote:
> >>> o How to deal with devices that handle multiple requests at a time?
> >> This is an example of the problems you get doing this in GEOM. You don't have very good knowledge of the hardware.
>
> Can't one pass this info from the lower layers up into GEOM (maybe by adding some attribute querying interface in GEOM if it doesn't exist)?

I think the g_getattr() call is there/can be used for things like that. The scheduler should only need to know how many outstanding requests it can allow; otherwise it should be rather independent from the lower layers.

Anyway, hardware queueing also brings a different kind of problem: it can't easily be mixed with anticipation, because if you have synchronous requests and you dispatch more than one of them you are serving more than one process by definition, thus you can break anticipation, unless this thing is done very carefully (e.g., when switching from one process to another, or mixing synchronous and asynchronous requests.)
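For reference, a minimal sketch of how a scheduling geom could ask the layer below for a queue-depth hint through g_getattr(); the attribute name "GEOM::qdepth" is invented, so a real implementation would have to define it (and fall back to a single outstanding request when nobody answers):

    #include <sys/param.h>
    #include <geom/geom.h>

    /*
     * Sketch only: "GEOM::qdepth" is a hypothetical attribute; the
     * consumer's provider (or the driver behind it) would have to
     * answer it for this to return anything other than the default.
     */
    static int
    gs_queue_depth(struct g_consumer *cp)
    {
            int depth;

            if (g_getattr("GEOM::qdepth", cp, &depth) != 0 || depth < 1)
                    depth = 1;      /* conservative default: one outstanding bio */
            return (depth);
    }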
Re: Pluggable Disk Scheduler Project
Hi,

> From: Ulf Lilleengen <[EMAIL PROTECTED]>
> Date: Thu, Oct 11, 2007 10:07:34AM +0200
>
> On tor, okt 11, 2007 at 04:20:01 +0200, Fabio Checconi wrote:
> > o is working on disk scheduling worth it at all?
>
> It is hard to say, but I'd like to run some benchmarks with this to see. Also, as noted in [2], newer hardware does more magic on its own, as well as solid state drives coming along.

This is why I wanted to start with some kind of prototype, hoping that its simplicity does not limit too much the results we can obtain.

> > o Where is the right place (in GEOM) for a disk scheduler?
>
> As discussed in [2], some suggested that disk scheduling should be done in a lower part of the kernel due to knowledge of hardware capabilities.
>
> As discussed in [1], ATA for instance does its own scheduling, so this might ruin performance (even the hardware might do some magic of its own). I think I tried disabling it though, so it shouldn't be a big deal for testing.

I don't know if disabling the lower level queueing is needed, because if you have only one outstanding request (or just a few, for hardware that supports that, and that can be a parameter for the scheduler) the lower level queueing will not reorder the higher level schedule.

> > o How can anticipation be introduced into the GEOM framework?
>
> This is actually perhaps one of the most interesting points, since the anticipation principle in itself fits here, but some other scheduling features might not be useful.

Ok. Decoupling the anticipation from other scheduling details may not be easy, but this thing is all about trying :)

> > o What can be an interface for disk schedulers?
>
> I think the interface developed in [1] is a pretty good one actually. I think the disksort-routines looked like a good place to do this. Even there it might not know enough about the hardware.
>
> > o How to deal with devices that handle multiple requests at a time?
>
> This is an example of the problems you get doing this in GEOM. You don't have very good knowledge of the hardware.
>
> > So, as I've said, I'd like to know what you think about the subject, if I'm missing something, if there is some kind of interest in this and if/how this work can proceed.
>
> Also, what would be interesting is implementing I/O priorities for processes, to be able to give I/O time more fairly (or at least being able to set it according to preference) to processes. This was done in the Hybrid project, but this is something that definitely could be done in GEOM. (I see you have some fairness in the g_as_dispatch routine though.)

I totally agree. My primary concern with this email was to know what others have done/think about the problem, and to try to identify some kind of interface and positioning for the scheduler. The actual scheduler has to be something _much_ more complex than this little thing. Hybrid ideas can be mapped to a CFQ-like scheduler (one C-LOOK queue per process, fair sharing among queues, anticipation on a per-queue basis,) and I'm working on that with Paolo Valente (in CC,) but I think the infrastructure behind the scheduler is more important now, as it defines what the scheduler can do.

> However, I'll try testing the work you've got. I'll see if I can get some numbers with this when I get some disks up.
>
> Btw, I did run some benchmark when I tried changing bioq_disksort into a FIFO queue, which didn't seem to lower performance (on SCSI and UMASS, but need to test again with ATA). It was a long time ago, so it should be tried again though.

I think this can depend on the access patterns used for testing (on disks; of course on flash devices disk sorting is not needed at all.) If you have processes that do only synchronous requests there is almost no difference between a .*LOOK elevator and FIFO queueing, since in the queue there will always be only one request per process, and you switch between processes every time you serve a new request (of course the actual order will change, but the number of seeks is the factor that really limits the throughput in this situation. At least this is my understanding of the problem :) )

The test patterns we are using with Paolo try to pessimize the disk throughput by reading in parallel (simply with a dd, which generates a typical example of greedy synchronous sequential read patterns) from two or more files put on partitions at the opposite ends (at least considering their logical addresses) of the disk. This kind of access should generate something near the worst case behavior for a work-conserving .*LOOK scheduler. Of course the behavior for asynchronous requests also has to be tested.

Thank you very much for your feedback, I hope we can get some numbers to substantiate this topic, remembering also that a good interface is a requirement for a good scheduler.

> > [1] http://wiki.freebsd.org/Hybrid
> > [2] http://lists.freebsd.org/pipermail/free
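A rough sketch of the per-process state a CFQ-like scheduler of the kind described above might keep (one C-LOOK-sorted queue per issuing process, queues served in turn, an anticipation window per queue); all of the names are invented and none of this is in the posted patch:

    #include <sys/param.h>
    #include <sys/bio.h>
    #include <sys/queue.h>

    /*
     * Hypothetical data structures only: one sorted queue per issuing
     * process, scanned C-LOOK style; active queues take turns getting
     * a service budget, and each queue has its own anticipation window.
     */
    struct gs_proc_queue {
            pid_t                      pq_pid;        /* issuing process */
            struct bio_queue_head      pq_bioq;       /* C-LOOK sorted requests */
            int                        pq_budget;     /* service left this round */
            int                        pq_wait_ticks; /* anticipation window */
            TAILQ_ENTRY(gs_proc_queue) pq_link;       /* round-robin list */
    };

    struct gs_cfq_softc {
            TAILQ_HEAD(, gs_proc_queue) sc_active;    /* queues with pending I/O */
            struct gs_proc_queue       *sc_current;   /* queue currently served */
    };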
Re: Pluggable Disk Scheduler Project
On tor, okt 11, 2007 at 04:20:01 +0200, Fabio Checconi wrote:
> Hi,
> is anybody working on the `Pluggable Disk Scheduler Project' from the ideas page?

Hi, I did look into it :) But then other projects came.

> To better understand how GEOM works, and how a (non work-conserving) disk scheduler can fit into it, I've written a very simple, yet working, prototype:
>
> http://feanor.sssup.it/~fabio/freebsd/g_sched/geom-sched-class.patch
>
> I'd like to take a better look at the problem and work on it, and I'd like to know what you think about it.
>
> After reading [1], [2] and its follow-ups the main problems that need to be addressed seem to be:
>
> o is working on disk scheduling worth it at all?

It is hard to say, but I'd like to run some benchmarks with this to see. Also, as noted in [2], newer hardware does more magic on its own, as well as solid state drives coming along.

> o Where is the right place (in GEOM) for a disk scheduler?

As discussed in [2], some suggested that disk scheduling should be done in a lower part of the kernel due to knowledge of hardware capabilities.

As discussed in [1], ATA for instance does its own scheduling, so this might ruin performance (even the hardware might do some magic of its own). I think I tried disabling it though, so it shouldn't be a big deal for testing.

> o How can anticipation be introduced into the GEOM framework?

This is actually perhaps one of the most interesting points, since the anticipation principle in itself fits here, but some other scheduling features might not be useful.

> o What can be an interface for disk schedulers?

I think the interface developed in [1] is a pretty good one actually. I think the disksort-routines looked like a good place to do this. Even there it might not know enough about the hardware.

> o How to deal with devices that handle multiple requests at a time?

This is an example of the problems you get doing this in GEOM. You don't have very good knowledge of the hardware.

> o How to deal with metadata requests and other VFS issues?
>
> I think that some answers need a little bit of experimenting with real code and real hardware, so here is this attempt. The interface used in this toy prototype for the scheduler is something like this:
>
> typedef void *gs_init_t (struct g_geom *geom);
> typedef void gs_fini_t (void *data);
> typedef void gs_start_t (void *data, struct bio *bio);
> typedef void gs_done_t (void *data, struct bio *bio);
>
> struct g_gsched {
>         const char *gs_name;            /* Scheduler name. */
>         int gs_refs;                    /* Refcount, internal use. */
>
>         gs_init_t *gs_init;             /* Called on geom creation. */
>         gs_fini_t *gs_fini;             /* Called on geom destruction. */
>         gs_start_t *gs_start;           /* Called on geom start. */
>         gs_done_t *gs_done;             /* Called on geom done. */
>
>         LIST_ENTRY(g_gsched) glist;     /* List of schedulers, internal use. */
> };
>
> The main idea is to allow the scheduler to enqueue the requests having only one (other small fixed numbers can be better on some hardware) outstanding request and to pass new requests to its provider only after the service of the previous one ended.
>
> The example scheduler in the draft takes the following approach:
>
> o a scheduling GEOM class is introduced. It can be stacked on top of disk geoms, and schedules all the requests coming from its consumers. I'm not absolutely sure that a new class is really needed but I think that it can simplify testing and experimenting with various solutions on the scheduler placement.
>
> o Requests coming from consumers are passed down immediately if there is no other request under service, otherwise they are queued in a bioq.
>
> o When a request is served the scheduler is notified, then it can pass down a new request, or, as in this toy anticipatory[3] scheduler, wait for a new request from the same process, or for a timeout to expire, and only after one of those events make the next scheduling decision.
>
> So, as I've said, I'd like to know what you think about the subject, if I'm missing something, if there is some kind of interest in this and if/how this work can proceed.

Also, what would be interesting is implementing I/O priorities for processes, to be able to give I/O time more fairly (or at least being able to set it according to preference) to processes. This was done in the Hybrid project, but this is something that definitely could be done in GEOM. (I see you have some fairness in the g_
Pluggable Disk Scheduler Project
Hi,
is anybody working on the `Pluggable Disk Scheduler Project' from the ideas page?

To better understand how GEOM works, and how a (non work-conserving) disk scheduler can fit into it, I've written a very simple, yet working, prototype:

http://feanor.sssup.it/~fabio/freebsd/g_sched/geom-sched-class.patch

I'd like to take a better look at the problem and work on it, and I'd like to know what you think about it.

After reading [1], [2] and its follow-ups the main problems that need to be addressed seem to be:

    o is working on disk scheduling worth it at all?
    o Where is the right place (in GEOM) for a disk scheduler?
    o How can anticipation be introduced into the GEOM framework?
    o What can be an interface for disk schedulers?
    o How to deal with devices that handle multiple requests at a time?
    o How to deal with metadata requests and other VFS issues?

I think that some answers need a little bit of experimenting with real code and real hardware, so here is this attempt. The interface used in this toy prototype for the scheduler is something like this:

    typedef void *gs_init_t (struct g_geom *geom);
    typedef void gs_fini_t (void *data);
    typedef void gs_start_t (void *data, struct bio *bio);
    typedef void gs_done_t (void *data, struct bio *bio);

    struct g_gsched {
            const char *gs_name;            /* Scheduler name. */
            int gs_refs;                    /* Refcount, internal use. */

            gs_init_t *gs_init;             /* Called on geom creation. */
            gs_fini_t *gs_fini;             /* Called on geom destruction. */
            gs_start_t *gs_start;           /* Called on geom start. */
            gs_done_t *gs_done;             /* Called on geom done. */

            LIST_ENTRY(g_gsched) glist;     /* List of schedulers, internal use. */
    };

The main idea is to allow the scheduler to enqueue the requests having only one (other small fixed numbers can be better on some hardware) outstanding request and to pass new requests to its provider only after the service of the previous one ended.

The example scheduler in the draft takes the following approach:

    o a scheduling GEOM class is introduced. It can be stacked on top of disk geoms, and schedules all the requests coming from its consumers. I'm not absolutely sure that a new class is really needed but I think that it can simplify testing and experimenting with various solutions on the scheduler placement.

    o Requests coming from consumers are passed down immediately if there is no other request under service, otherwise they are queued in a bioq.

    o When a request is served the scheduler is notified, then it can pass down a new request, or, as in this toy anticipatory[3] scheduler, wait for a new request from the same process, or for a timeout to expire, and only after one of those events make the next scheduling decision.

So, as I've said, I'd like to know what you think about the subject, if I'm missing something, if there is some kind of interest in this and if/how this work can proceed.

Thanks in advance,
fabio

[1] http://wiki.freebsd.org/Hybrid

[2] http://lists.freebsd.org/pipermail/freebsd-geom/2007-January/001854.html

[3] The details of the anticipation are really not interesting as it is extremely simplified on purpose.

[4] http://feanor.sssup.it/~fabio/freebsd/g_sched/ also contains a userspace client to experiment with the GEOM class.
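To make the interface above concrete, here is a minimal sketch of what a trivial FIFO policy could look like on top of it. It is not taken from the posted patch: the g_gsched declaration is assumed to come from the patch, gs_dispatch() stands in for whatever mechanism the class uses to pass a request down to its provider, and the malloc type and helper names are invented.

    /*
     * Sketch only: a trivial FIFO policy for the g_gsched interface
     * quoted above.  gs_dispatch() is a placeholder for "hand this bio
     * to the provider below"; it is not part of the interface shown.
     */
    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/bio.h>
    #include <sys/malloc.h>
    #include <geom/geom.h>

    static MALLOC_DEFINE(M_GSFIFO, "gsfifo", "FIFO gsched data");

    static void gs_dispatch(struct bio *bio);       /* placeholder, not shown */

    struct gs_fifo_data {
            struct bio_queue_head   fd_queue;       /* pending requests, FIFO order */
            int                     fd_busy;        /* one request outstanding? */
    };

    static void *
    gs_fifo_init(struct g_geom *geom)
    {
            struct gs_fifo_data *fp;

            fp = malloc(sizeof(*fp), M_GSFIFO, M_WAITOK | M_ZERO);
            bioq_init(&fp->fd_queue);
            return (fp);
    }

    static void
    gs_fifo_fini(void *data)
    {
            free(data, M_GSFIFO);
    }

    static void
    gs_fifo_start(void *data, struct bio *bio)
    {
            struct gs_fifo_data *fp = data;

            if (fp->fd_busy) {
                    bioq_insert_tail(&fp->fd_queue, bio);   /* hold it back */
            } else {
                    fp->fd_busy = 1;
                    gs_dispatch(bio);       /* pass down immediately */
            }
    }

    static void
    gs_fifo_done(void *data, struct bio *bio)
    {
            struct gs_fifo_data *fp = data;
            struct bio *next;

            next = bioq_first(&fp->fd_queue);
            if (next != NULL) {
                    bioq_remove(&fp->fd_queue, next);
                    gs_dispatch(next);      /* keep one request outstanding */
            } else
                    fp->fd_busy = 0;
    }

    static struct g_gsched gs_fifo = {
            .gs_name  = "fifo",
            .gs_init  = gs_fifo_init,
            .gs_fini  = gs_fifo_fini,
            .gs_start = gs_fifo_start,
            .gs_done  = gs_fifo_done,
    };

Under the stated one-outstanding-request assumption, gs_start either forwards the request or queues it, and gs_done refills the slot from the head of the queue; an anticipatory policy would differ only in choosing to wait instead of dispatching immediately in gs_done.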