Re: Pluggable Disk Scheduler Project

2007-10-19 Thread Karsten Behrmann
> I wouldn't focus on just anticipation, but also other types of
> schedulers (I/O scheduling influenced by nice value?)

One note on this subject: it will be nice for things like bg-fsck,
but it also brings up a point that has been overlooked so far:
priority propagation.

As far as I am aware, there are plenty of locks that a process can
hold while waiting for I/O to complete (directories, snapshots,
probably some others I missed).
Now, normally when a high-priority thread needs a lock owned by
another thread, it'll bump up that thread's priority and yield,
hopefully freeing the lock up quickly.
This is obviously not quite so easy for I/O. I haven't quite
understood the code involved yet, but as far as I can tell turnstiles
panic when the lock-owning thread is sleeping.

What we'll probably need to do is make priority propagation wake
up a waiting-for-I/O process, which then needs to dig up its I/O
request (which may be anywhere in GEOM, potentially held back
by the scheduler) and make sure it's [re]queued with higher
priority; see the sketch below.
If we don't do this, we'll get funny effects on an I/O-busy system,
with a bg-fsck blocking some high-priority process indefinitely
because it happens to be waiting for I/O while holding the snapshot
lock.  If we do this, we may get into significant fun with cutting
into GEOM to allow requeueing, or waste some CPU with polling from
the queuing geom.
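
To make that concrete, here is a minimal sketch of what such a hook
could look like.  All the names here (g_sched_softc, the convention of
tagging a bio with the issuing thread in bio_caller1, and the hook
itself) are made up for illustration; nothing like this exists today:

/*
 * Hypothetical hook, called from the priority-propagation path when
 * a boosted thread turns out to be sleeping on I/O.  It digs up the
 * thread's bio, which the scheduling geom is still holding back, and
 * moves it to the head of the queue.  (Manipulating the tailq
 * directly bypasses the bioq bookkeeping; fine for a sketch.)
 */
struct g_sched_softc {                  /* hypothetical per-geom state */
        struct mtx              sc_mtx;
        struct bio_queue_head   sc_queue;
};

static void
g_sched_priority_boost(struct g_sched_softc *sc, struct thread *td)
{
        struct bio *bp;

        mtx_lock(&sc->sc_mtx);
        TAILQ_FOREACH(bp, &sc->sc_queue.queue, bio_queue) {
                if (bp->bio_caller1 == td) {    /* tagged at enqueue time */
                        TAILQ_REMOVE(&sc->sc_queue.queue, bp, bio_queue);
                        TAILQ_INSERT_HEAD(&sc->sc_queue.queue, bp, bio_queue);
                        break;
                }
        }
        mtx_unlock(&sc->sc_mtx);
}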

This point may not be immediately obvious to people coming from the
IO/filesystem field, but it is something we should keep in mind.

So Far,
  Karsten Behrmann




Re: Pluggable Disk Scheduler Project

2007-10-17 Thread Fabio Checconi
> From: Ulf Lilleengen <[EMAIL PROTECTED]>
> Date: Wed, Oct 17, 2007 03:09:35PM +0200
>
> On ons, okt 17, 2007 at 02:19:07 +0200, Fabio Checconi wrote:
> > Maybe I've not enough experience/knowledge of the driver subsystem,
[...]
> If you look at it, Hybrid is just a generalization of the existing
> bioq_* API already defined. And this API is used by GEOM classes _before_
> device drivers get the requests AFAIK. 
> 

I looked at the Hybrid code, but I don't think that the bioq_*
family of calls can be the right place to start, both because of the
problems experienced during the Hybrid development with
locking/anticipation and because the same request can pass through
multiple bioqs on its path to the device (e.g., two stacked geoms
using two different bioqs and then a device driver using bioq_* to
organize its queue, or geoms using more than one bioq, like raid3;
I think the complexity can become unmanageable).  One could even
think of configuring each single bioq in the system, but things can
get very complex that way.


> For a simple example on a driver, the md-driver might be a good place to
> look. Note that I have little experience and knowledge of the driver
> subsystem myself.
> 

I'll take a look, thanks.


> Also note (from the Hybrid page):
> * we could not provide support for non work-conserving schedulers, due to a
[...]
> 
> This certainly argues for having this in the GEOM layer, but perhaps it's
> possible to change the assumptions made in some drivers? The locking issue
> should perhaps be better planned though, and an audit of the driver disksort
> code is necessary.
> 

I need some more time to think about that :)


> Also:
> * as said, the ATA driver in 6.x/7.x moves the disksort one layer below the
>   one we are working at, so this particular work won't help on ATA-based 6.x
>   machines.
>   We should figure out how to address this, because the work done at that
>   layer is mostly a replica of the bioq_*() API. 
> 
> So, I see this can get a bit messy given that the ATA driver does
> disksort on its own, but perhaps it would be possible to fix this by
> changing the general ATA driver to use its own pluggable scheduler.
> 
> Anyway, I shouldn't demand that you do this, especially since I don't have
> any code or anything to show, and because you decide what you want to do.

I still cannot say if a GEOM scheduler is better than a scheduler
put at a lower level, or if the bioq_* interface is better than any
other alternative, so your suggestions are welcome.  Moreover I'd
really like to discuss/work together, or at least do things with
some agreement on them.  If I have the time to experiment with
more than one solution I'll be happy to do that.


> However, I'd hate to see the Hybrid effort go to waste :) I was hoping some
> of the authors of the project would reply with their thoughts, so I CC'ed
> them. 

Well, the work done on Hybrid had also interesting aspects from
the algorithm side... but that's another story...



Re: Pluggable Disk Scheduler Project

2007-10-17 Thread Luigi Rizzo
On Wed, Oct 17, 2007 at 03:09:35PM +0200, Ulf Lilleengen wrote:

... discussion on Hybrid vs. GEOM as a suitable location for
... pluggable disk schedulers

> However, I'd hate to see the Hybrid effort go to waste :) I was hoping some
> of the authors of the project would reply with their thoughts, so I CC'ed
> them. 

we are in good contact with Fabio and I am monitoring the discussion,
don't worry.

cheers
luigi


Re: Pluggable Disk Scheduler Project

2007-10-17 Thread Ulf Lilleengen
On ons, okt 17, 2007 at 02:19:07 +0200, Fabio Checconi wrote:
> > From: Ulf Lilleengen <[EMAIL PROTECTED]>
> > Date: Wed, Oct 17, 2007 01:07:15PM +0200
> >
> > On tir, okt 16, 2007 at 04:10:37 +0200, Karsten Behrmann wrote:
> > Over to a more general view of its architecture:
> > 
> > When I looked at this project for the first time, I was under the impression
> > that this would be best done in a GEOM class.
> > 
> > However, I think the approach that was taken in the Hybrid project is 
> > better 
> 
> Ok.  I think that such a solution requires a lot more effort on the
> design and coding sides, as it requires the modification of the
> drivers and can bring us problems with locking and with the queueing
> assumptions that may vary on a per-driver basis.
> 
I completely agree with the issue of converting device drivers, but at least
it will be an _optional_ possibility (having different scheduler plugins
could make this possible). One does not necessarily need to convert
the drivers.
> Maybe I've not enough experience/knowledge of the driver subsystem,
> but I would not remove the queueing that is done now by the drivers
> (think of ATA freezepoints); instead I'd like to try to grab
> the requests before they get to the driver (e.g., in/before their
> d_strategy call) and have some sort of pull mechanism when requests
> complete (I still don't have any (serious) idea on that; I fear that
> the right place to do it, for locking issues and so on, can be
> driver-dependent).  Any ideas on that?  Which drivers could be good
> starting points for writing some code?
> 
If you look at it, Hybrid is just a generalization of the existing
bioq_* API already defined. And this API is used by GEOM classes _before_
device drivers get the requests AFAIK. 

For a simple example on a driver, the md-driver might be a good place to
look. Note that I have little experience and knowledge of the driver
subsystem myself.

Also note (from the Hybrid page):
* we could not provide support for non work-conserving schedulers, due to a
  couple of reasons:
 1. the assumption, in some drivers, that bioq_disksort() will make
 requests immediately available (so a subsequent bioq_first() will
 not return NULL).
 2. the fact that there is no bioq_lock()/bioq_unlock(), so the
 scheduler does not have a safe way to generate requests for a given
 queue. 

This certainly argues for having this in the GEOM layer, but perhaps it's
possible to change the assumptions made in some drivers? The locking issue
should perhaps be better planned though, and an audit of the driver disksort
code is necessary.
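
For reference, the driver pattern behind assumption 1 looks roughly
like this (a generic sketch; the xx_* names and softc layout are made
up, not taken from any particular driver):

/*
 * Common disksort pattern in drivers: after bioq_disksort() the code
 * assumes bioq_first() returns a request to start.  A scheduler that
 * held the bio back (non work-conserving) would make bioq_first()
 * return NULL here and break the driver.
 */
static void
xx_strategy(struct bio *bp)
{
        struct xx_softc *sc = bp->bio_disk->d_drv1;

        mtx_lock(&sc->xx_mtx);
        bioq_disksort(&sc->xx_bioq, bp);
        bp = bioq_first(&sc->xx_bioq);  /* assumed to be non-NULL */
        mtx_unlock(&sc->xx_mtx);
        xx_start(sc, bp);               /* hand the request to the hardware */
}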

Also:
* as said, the ATA driver in 6.x/7.x moves the disksort one layer below the
  one we are working at, so this particular work won't help on ATA-based 6.x
  machines.
  We should figure out how to address this, because the work done at that
  layer is mostly a replica of the bioq_*() API. 

So, I see this can get a bit messy given that the ATA driver does
disksort on its own, but perhaps it would be possible to fix this by
changing the general ATA driver to use its own pluggable scheduler.

Anyway, I shouldn't demand that you do this, especially since I don't have
any code or anything to show, and because you decide what you want to do.
However, I'd hate to see the Hybrid effort go to waste :) I was hoping some
of the authors of the project would reply with their thoughts, so I CC'ed
them. 

-- 
Ulf Lilleengen


Re: Pluggable Disk Scheduler Project

2007-10-17 Thread Fabio Checconi
> From: Ulf Lilleengen <[EMAIL PROTECTED]>
> Date: Wed, Oct 17, 2007 01:07:15PM +0200
>
> On tir, okt 16, 2007 at 04:10:37 +0200, Karsten Behrmann wrote:
> Over to a more general view of its architecture:
> 
> When I looked at this project for the first time, I was under the impression
> that this would be best done in a GEOM class.
> 
> However, I think the approach that was taken in the Hybrid project is better 

Ok.  I think that such a solution requires a lot more effort on the
design and coding sides, as it requires the modification of the
drivers and can bring us problems with locking and with the queueing
assumptions that may vary on a per-driver basis.

Maybe I've not enough experience/knowledge of the driver subsystem,
but I would not remove the queueing that is done now by the drivers
(think of ATA freezepoints); instead I'd like to try to grab
the requests before they get to the driver (e.g., in/before their
d_strategy call) and have some sort of pull mechanism when requests
complete (I still don't have any (serious) idea on that; I fear that
the right place to do it, for locking issues and so on, can be
driver-dependent).  Any ideas on that?  Which drivers could be good
starting points for writing some code?
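
Just to sketch the idea (all the gs_* names are invented, and the
details are certainly wrong for some drivers):

/*
 * Wrap a disk's d_strategy so requests pass through a scheduler
 * first, and use the completion path as the "pull" mechanism.
 * gs_enqueue()/gs_dequeue() stand in for the scheduling policy.
 */
static disk_strategy_t *gs_orig_strategy;       /* saved driver entry point */

static void
gs_strategy(struct bio *bp)
{
        if (gs_enqueue(bp))             /* scheduler decided to hold it */
                return;
        gs_orig_strategy(bp);           /* device idle: pass it down now */
}

static void
gs_biodone(struct bio *bp)              /* called when a request completes */
{
        struct bio *next;

        /* The pull step: dispatch the next request, if any. */
        if ((next = gs_dequeue()) != NULL)
                gs_orig_strategy(next);
}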


> Also, I got my test-box up again today, and will be trying your patch as soon
> as I've upgraded it to CURRENT, Fabio.

Thank you very much!  Please consider that my primary concern with
the patch was its interface; the algorithm is just an example (it
should give an idea of the performance loss due to the mechanism
overhead with async requests, and of some improvement on greedy sync
loads).



Re: Pluggable Disk Scheduler Project

2007-10-17 Thread Ulf Lilleengen
On tir, okt 16, 2007 at 04:10:37 +0200, Karsten Behrmann wrote:
> > Hi,
> > is anybody working on the `Pluggable Disk Scheduler Project' from
> > the ideas page?
> I've been kicking the idea around in my head, but I'm probably newer to
> everything involved than you are, so feel free to pick it up. If you want,
> we can toss some ideas and code to each other, though I don't really
> have anything on the latter.
> 
> [...]
> > After reading [1], [2] and its follow-ups the main problems that
> > need to be addressed seem to be:
> > 
> > o is working on disk scheduling worth it at all?
> Probably, one of the main applications would be to make the background
> fsck a little more well-behaved.
I agree; as I said before, the ability to give I/O priorities is probably one
of the most important things.
> 
> > o Where is the right place (in GEOM) for a disk scheduler?
[...]
> 
> > o How can anticipation be introduced into the GEOM framework?
> I wouldn't focus on just anticipation, but also other types of
> schedulers (I/O scheduling influenced by nice value?)
> 
> > o What can be an interface for disk schedulers?
> good question, but geom seems a good start ;)
> 
> > o How to deal with devices that handle multiple requests at a time?
> Bad news first: this is most disks out there, in a way ;)
> SCSI has tagged queuing, ATA has native command queuing or
> whatever the ATA people came up with over their morning coffee today.
> I'll mention a bit more about this further down.
> 
> > o How to deal with metadata requests and other VFS issues?
> Like any other disk request, though for priority-respecting
> schedulers this may get rather interesting.
> 
> [...]
> > The main idea is to allow the scheduler to enqueue the requests
> > having only one (other small fixed numbers can be better on some
> > hardware) outstanding request and to pass new requests to its
> > provider only after the service of the previous one ended.
[...]
>  - servers where anticipatory performs better than elevator
>  - realtime environments that need a scheduler fitting their needs
>  - the background fsck, if someone implements a "priority" scheduler

Apache is actually a good candidate according to the old anticipatory design
document (not sure of its relevance today, but...)

Over to a more general view of its architecture:

When I looked at this project for the first time, I was under the impression
that this would be best done in a GEOM class.

However, I think the approach that was taken in the Hybrid project is better
for the following reasons:

- It makes it possible to use the framework from _both_ GEOM classes and
  device drivers (which might use some other scheduler type?).
- It does not remove any configurability, since changing schedulers etc. can
  be done by the user with sysctl.
- It could make it possible for a GEOM class to decide for itself which
  scheduler it wants to use (most GEOM classes use the standard bioq_disksort
  interface in disk_subr.c).
- The ability to stack a GEOM class with a scheduler could easily be "emulated"
  by creating a GEOM class to utilize the disksched framework.

All in all, I just think this approach gives more flexibility than putting it
in a GEOM class that has to be added manually by a user.

Just my thought on this.

Also, I got my test-box up again today, and will be trying your patch as soon
as I've upgraded it to CURRENT, Fabio.
-- 
Ulf Lilleengen


Re: Pluggable Disk Scheduler Project

2007-10-17 Thread Fabio Checconi
> From: Karsten Behrmann <[EMAIL PROTECTED]>
> Date: Tue, Oct 16, 2007 04:10:37PM +0200
>
> > Hi,
> > is anybody working on the `Pluggable Disk Scheduler Project' from
> > the ideas page?
> I've been kicking the idea around in my head, but I'm probably newer to
> everything involved than you are, so feel free to pick it up. If you want,
> we can toss some ideas and code to each other, though I don't really
> have anything on the latter.

Thank you for your answer, I'd really like to work/discuss with you
and anyone else interested in this project.


> > o Where is the right place (in GEOM) for a disk scheduler?
> I have spent some time at EuroBSDCon talking to Kirk and phk about
> this, and the result was that I now know strong proponents both for
> putting it into the disk drivers and for putting it into geom ;-)
> 
> Personally, I would put it into geom. I'll go into more detail on
> this later, but basically, geom seems a better fit for "high-level"
> code than a device driver, and if done properly performance penalties
> should be negligible.
> 

I'm pretty interested in the arguments for the opposite solution too;
I've started from GEOM because it seemed to be a) what was proposed/
requested on the ideas page, and b) cleaner, at least for a prototype.
I also wanted to start with some code to evaluate the performance
penalties of that approach.

I am a little bit scared by the prospect of changing the
queueing mechanisms that drivers use, as this kind of modification
can be difficult to write, test and maintain, but I'd really like
to know what people with experience in those kernel areas think
about the possibility of doing more complex I/O scheduling, with
some sort of unified interface, at this level.

As a side note, I haven't posted any performance numbers yet because
at the moment I only have access to old ATA drives that would not
give significant results.


> > o How can anticipation be introduced into the GEOM framework?
> I wouldn't focus on just anticipation, but also other types of
> schedulers (I/O scheduling influenced by nice value?)
> 

That would be interesting, especially for the background fsck case.
I think that some kind of fair sharing approach should be used; as
you say below, a priority-driven scheduler can have interactions with
the VFS that are difficult to track.  (This problem was also pointed
out in one of the follow-ups to [1].)


> > o How to deal with metadata requests and other VFS issues?
> Like any other disk request, though for priority-respecting
> schedulers this may get rather interesting.
> 
> [...]
> > The main idea is to allow the scheduler to enqueue the requests
> > having only one (other small fixed numbers can be better on some
> > hardware) outstanding request and to pass new requests to its
> > provider only after the service of the previous one ended.
> You'll want to queue at least two requests at once. The reason for
> this is performance:
> Currently, drivers queue their own I/O. This means that as soon
> as a request completes (on devices that don't have in-device
> queues), they can fairly quickly grab a new request from their
> internal queue and push it back to the device from the interrupt
> handler or some other fast method.

Wouldn't that require, to be sustainable (unless you want a fast
dispatch every two requests), that the driver queue always be of
length two or more?  To refill the driver queue, you then ask the
upper scheduler to dispatch a new request every time a request is
taken from the driver queue and sent to the disk, or at any other
moment before the request under service is completed.  But then you
cannot have an anticipation mechanism, because the next request you'd
want to dispatch from the upper scheduler has not been issued yet (it
will be issued only after the one being served completes and the
userspace application restarts).


> Having the device idle while the response percolates up the geom
> stack and a new request down will likely be rather wasteful.

I completely agree on that.  I've only done it this way because it
was the least intrusive option I could find.  What could other, more
efficient alternatives be?  (Obviously without subverting any of the
existing interfaces, and still allowing the anticipation of requests.)


> For disks with queuing, I'd recommend trying to keep the queue
> reasonably full (unless the queuing strategy says otherwise),
> for disks without queuing I'd say we want to push at least one
> more request down. Personally, I think the sanest design would
> be to have device drivers return a "temporary I/O error" along
> the lines of EAGAIN, meaning their queue is full.
> 

This can

Re: Pluggable Disk Scheduler Project

2007-10-16 Thread Karsten Behrmann
> Hi,
> is anybody working on the `Pluggable Disk Scheduler Project' from
> the ideas page?
I've been kicking the idea around in my head, but I'm probably newer to
everything involved than you are, so feel free to pick it up. If you want,
we can toss some ideas and code to each other, though I don't really
have anything on the latter.

[...]
> After reading [1], [2] and its follow-ups the main problems that
> need to be addressed seem to be:
> 
> o is working on disk scheduling worth it at all?
Probably, one of the main applications would be to make the background
fsck a little more well-behaved.

> o Where is the right place (in GEOM) for a disk scheduler?
I have spent some time at EuroBSDCon talking to Kirk and phk about
this, and the result was that I now know strong proponents both for
putting it into the disk drivers and for putting it into geom ;-)

Personally, I would put it into geom. I'll go into more detail on
this later, but basically, geom seems a better fit for "high-level"
code than a device driver, and if done properly performance penalties
should be negligible.

> o How can anticipation be introduced into the GEOM framework?
I wouldn't focus on just anticipation, but also other types of
schedulers (I/O scheduling influenced by nice value?)

> o What can be an interface for disk schedulers?
good question, but geom seems a good start ;)

> o How to deal with devices that handle multiple requests at a time?
Bad news first: this is most disks out there, in a way ;)
SCSI has tagged queuing, ATA has native command queuing or
whatever the ATA people came up with over their morning coffee today.
I'll mention a bit more about this further down.

> o How to deal with metadata requests and other VFS issues?
Like any other disk request, though for priority-respecting
schedulers this may get rather interesting.

[...]
> The main idea is to allow the scheduler to enqueue the requests
> having only one (other small fixed numbers can be better on some
> hardware) outstanding request and to pass new requests to its
> provider only after the service of the previous one ended.
You'll want to queue at least two requests at once. The reason for
this is performance:
Currently, drivers queue their own I/O. This means that as soon
as a request completes (on devices that don't have in-device
queues), they can fairly quickly grab a new request from their
internal queue and push it back to the device from the interrupt
handler or some other fast method.
Having the device idle while the response percolates up the geom
stack and a new request down will likely be rather wasteful.
For disks with queuing, I'd recommend trying to keep the queue
reasonably full (unless the queuing strategy says otherwise),
for disks without queuing I'd say we want to push at least one
more request down. Personally, I think the sanest design would
be to have device drivers return a "temporary I/O error" along
the lines of EAGAIN, meaning their queue is full.
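
As a sketch of what that could look like on the driver side (purely
hypothetical; no driver has such an entry point today, and the xx_*
names are made up):

/*
 * Hypothetical driver entry point that can refuse work instead of
 * queueing unconditionally.  On EAGAIN the caller (the scheduler)
 * keeps the bio and retries when a request completes.
 */
static int
xx_try_start(struct xx_softc *sc, struct bio *bp)
{
        mtx_lock(&sc->xx_mtx);
        if (sc->xx_inflight >= sc->xx_maxtags) {
                mtx_unlock(&sc->xx_mtx);
                return (EAGAIN);        /* "temporary I/O error": queue full */
        }
        sc->xx_inflight++;
        mtx_unlock(&sc->xx_mtx);
        xx_start(sc, bp);               /* hand the request to the hardware */
        return (0);
}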

> The example scheduler in the draft takes the following approach:
> 
> o a scheduling GEOM class is introduced.  It can be stacked on
>   top of disk geoms, and schedules all the requests coming
>   from its consumers.  I'm not absolutely sure that a new class
>   is really needed but I think that it can simplify testing and
>   experimenting with various solutions on the scheduler placement.
Probably, though we'll want to make sure that they stack on top of
(or are inside of?) the geoms talking to the disks, because it rarely
makes sense to put a queuing geom on top of, say, a disklabel geom.

The advantage of making it a full geom is configurability. You would
be able to swap out a scheduler at runtime, select different
schedulers for different disks, and potentially even test new schedulers
without rebooting (though you wouldn't want to do that for benchmarks).

> o  Requests coming from consumers are passed down immediately
>   if there is no other request under service, otherwise they
>   are queued in a bioq.
This is specific to the anticipatory scheduler. I would say in more
general terms (see the sketch below):
- A queuing geom is to push all requests that it wants serviced down
towards the disk, until the disk reports a queue full. A queuing geom
is allowed to hold back requests even when the driver queue is not
full yet, if it does not want the disk to attempt such I/O yet (such
as the anticipatory scheduler waiting for another disk request near
the last one, or the process-priority scheduler holding back a low-
priority request that would potentially cause a long seek, until I/O
has been idle).
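
In code, the dispatch side of that rule might look something like
this (a sketch assuming the hypothetical EAGAIN-style driver
interface above; gs_next()/gs_push_down()/gs_requeue() stand in for
the policy and plumbing):

/*
 * Dispatch loop of a queuing geom: push everything the policy
 * currently wants serviced, until the driver reports a full queue.
 * The policy returns NULL when it prefers to hold requests back.
 */
static void
gs_kick(struct gs_softc *sc)
{
        struct bio *bp;

        while ((bp = gs_next(sc)) != NULL) {    /* policy picks a request */
                if (gs_push_down(sc, bp) == EAGAIN) {
                        gs_requeue(sc, bp);     /* driver queue is full */
                        break;
                }
        }
}
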
This dispels phk's anti-geom argument of "it will be inefficient
because it will take longer for a new request to get to the driver" -
if the queuing strategy had wanted the r

Re: Pluggable Disk Scheduler Project

2007-10-12 Thread Alexander Leidinger
Quoting Fabio Checconi <[EMAIL PROTECTED]> (from Thu, 11 Oct 2007 13:48:28 +0200):

>> From: Ulf Lilleengen <[EMAIL PROTECTED]>
>> Date: Thu, Oct 11, 2007 10:07:34AM +0200
>>
>> On tor, okt 11, 2007 at 04:20:01 +0200, Fabio Checconi wrote:
>>> o What can be an interface for disk schedulers?
>> I think the interface developed in [1] is a pretty good one actually. I think
>> the disksort routines looked like a good place to do this. Even there it might
>> not know enough about the hardware.
>>
>>> o How to deal with devices that handle multiple requests at a time?
>> This is an example of the problems you get doing this in GEOM. You don't have
>> very good knowledge of the hardware.

One can't pass this info from the lower layers up into GEOM (maybe by
adding some attribute querying interface in GEOM if it doesn't exist)?


Bye,
Alexander.

--
The naked truth of it is, I have no shirt.
-- William Shakespeare, "Love's Labour's Lost"

http://www.Leidinger.netAlexander @ Leidinger.net: PGP ID = B0063FE7
http://www.FreeBSD.org   netchild @ FreeBSD.org  : PGP ID = 72077137


Re: Pluggable Disk Scheduler Project

2007-10-12 Thread Fabio Checconi
> From: Alexander Leidinger <[EMAIL PROTECTED]>
> Date: Fri, Oct 12, 2007 08:18:35AM +0200
>
> Quoting Fabio Checconi <[EMAIL PROTECTED]> (from Thu, 11 Oct 2007 13:48:28 +0200):
> 
> >> From: Ulf Lilleengen <[EMAIL PROTECTED]>
> >> Date: Thu, Oct 11, 2007 10:07:34AM +0200
> >>
> >> On tor, okt 11, 2007 at 04:20:01 +0200, Fabio Checconi wrote:
> >>> o How to deal with devices that handle multiple requests at a time?
> >> This is an example of the problems you get doing this in GEOM. You don't have
> >> very good knowledge of the hardware.
> 
> One can't pass this info from the lower layers up into GEOM (maybe by  
> adding some attribute querying interface in GEOM if it doesn't exist)?
> 

I think the g_getattr() call is there/can be used for things like
that.  The scheduler should only need to know how many outstanding
requests it can allow; otherwise it should be rather independent
from the lower layers.
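
Something along these lines, for instance (g_getattr() exists, but
the "GEOM::qdepth" attribute is made up; it would have to be provided
by the disk layer):

/*
 * Sketch: ask the layers below how many outstanding requests they
 * can take, falling back to one at a time when nobody answers.
 */
static int
gs_max_outstanding(struct g_consumer *cp)
{
        int qdepth;

        if (g_getattr("GEOM::qdepth", cp, &qdepth) != 0)
                qdepth = 1;     /* unknown hardware: one request at a time */
        return (qdepth);
}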

Anyway, hardware queueing also brings a different kind of problem:
it can't easily be mixed with anticipation, because if you have
synchronous requests and you dispatch more than one of them, you are
by definition serving more than one process, and thus you can break
anticipation, unless this is done very carefully (e.g., when
switching from one process to another, or when mixing synchronous
and asynchronous requests).



Re: Pluggable Disk Scheduler Project

2007-10-11 Thread Fabio Checconi
Hi,

> From: Ulf Lilleengen <[EMAIL PROTECTED]>
> Date: Thu, Oct 11, 2007 10:07:34AM +0200
>
> On tor, okt 11, 2007 at 04:20:01 +0200, Fabio Checconi wrote:
> > o is working on disk scheduling worth it at all?
> It is hard to say, but I'd like to run some benchmarks with this to see.
> Also, as noted in [2], newer hardware does more magic on its own, and
> solid state drives are coming along.
> 

this is why I wanted to start with some kind of prototype, hoping
that its simplicity does not limit too much the results we can obtain.


> > o Where is the right place (in GEOM) for a disk scheduler?
> As discussed in [2], some suggested that disk scheduling should be done in a
> lower part of the kernel due to knowledge of hardware capabilities.
> 
> As discussed in [1], ATA for instance does its own scheduling, so this might
> ruin performance (even the hardware might do some magic of its own). I
> think I tried disabling it though, so it shouldn't be a big deal for testing.
> 

I don't know if disabling the lower-level queueing is needed, because
if you have only one outstanding request (or just a few, for
hardware that supports that, and that can be a parameter for the
scheduler) the lower-level queueing will not reorder the higher-level
schedule.


> > o How can anticipation be introduced into the GEOM framework?
> This is actually perhaps one of the most interesting points, since the
> anticipation principle in itself fits here, but some other scheduling
> features might not be useful.
> 

Ok.  Decoupling the anticipation from other scheduling details may
not be easy, but this thing is all about trying :)


> > o What can be an interface for disk schedulers?
> I think the interface developed in [1] is a pretty good one actually. I think
> the disksort routines looked like a good place to do this. Even there it might
> not know enough about the hardware.
> 
> > o How to deal with devices that handle multiple requests at a time?
> This is an example of the problems you get doing this in GEOM. You don't have
> very good knowledge of the hardware.
> 
> > So, as I've said, I'd like to know what you think about the subject,
> > if I'm missing something, if there is some kind of interest on this
> > and if/how can this work proceed.
> 
> Also, what would be interesting is implementing I/O priorities for processes,
> to be able to distribute I/O time more fairly (or at least according to
> preference). This was done in the Hybrid project, but this is
> something that definitely could be done in GEOM. (I see you have some
> fairness in the g_as_dispatch routine though).
> 

I totally agree.  My primary concern with this email was to know
what others have done/think about the problem, and to try to identify
some kind of interface and positioning for the scheduler.  The
actual scheduler has to be something _much_ more complex than this
little thing.  Hybrid ideas can be mapped to a CFQ-like scheduler
(one C-LOOK queue per process, fair sharing among queues, anticipation
on a per-queue basis), and I'm working on that with Paolo Valente (in CC),
but I think the infrastructure behind the scheduler is more important
now, as it defines what the scheduler can do.
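
Just to give an idea, the per-process state of such a scheduler could
look roughly like this (only a sketch, the names are illustrative):

/*
 * Per-process state for a CFQ-like scheduler: one C-LOOK queue per
 * process, queues served round-robin with a service budget for fair
 * sharing, anticipation handled on a per-queue basis.
 */
struct gs_cfq_queue {
        pid_t                     q_pid;        /* owning process */
        struct bio_queue_head     q_bioq;       /* per-process C-LOOK queue */
        int                       q_budget;     /* remaining fair share */
        int                       q_waiting;    /* anticipating this process? */
        TAILQ_ENTRY(gs_cfq_queue) q_link;       /* round-robin service list */
};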


> However, I'll try testing the work you've got. I'll see if I can get some
> numbers with this when I get some disks up.
> 
> Btw, I did run some benchmarks when I tried changing bioq_disksort into a FIFO
> queue, which didn't seem to lower performance (on SCSI and UMASS, but I need to
> test again with ATA). It was a long time ago, so it should be tried again
> though.

I think this can depend on the access patterns used for testing (on
disks; of course on flash devices disk sorting is not needed at
all.)  If you have processes that do only synchronous requests there
is almost no difference between a .*LOOK elevator and FIFO queueing,
since in the queue there will always be only one request per process,
and you switch between processes every time you serve a new request
(of course the actual order will change, but the number of seeks
is the factor that really limits the throughput in this situation.
At least this is my understanding of the problem :) )

The test patterns we are using with Paolo try to pessimize the disk
throughput by reading in parallel (simply with a dd, which generates
a typical example of a greedy synchronous sequential read pattern)
from two or more files put on partitions at the opposite ends (at
least considering their logical addresses) of the disk.  This
kind of access should generate something near the worst-case
behavior for a work-conserving .*LOOK scheduler.  Of course the
behavior for asynchronous requests has to be tested as well.

Thank you very much for your feedback; I hope we can get some
numbers to substantiate this topic, remembering also that a
good interface is a requirement for a good scheduler.


> > 
> > [1]  http://wiki.freebsd.org/Hybrid
> > 
> > [2]  http://lists.freebsd.org/pipermail/freebsd-geom/2007-January/001854.html

Re: Pluggable Disk Scheduler Project

2007-10-11 Thread Ulf Lilleengen
On tor, okt 11, 2007 at 04:20:01 +0200, Fabio Checconi wrote:
> Hi,
> is anybody working on the `Pluggable Disk Scheduler Project' from
> the ideas page?
Hi,
I did look into it :) But then other projects came along.
> 
> To better understand how GEOM works, and how a (non work conserving)
> disk scheduler can fit into it, I've written a very simple, yet
> working, prototype:
> 
> http://feanor.sssup.it/~fabio/freebsd/g_sched/geom-sched-class.patch
> 
> 
> I'd like to take a better look at the problem and work on it, and
> I'd like to know what you think about it.
> 
> After reading [1], [2] and its follow-ups the main problems that
> need to be addressed seem to be:
> 
> o is working on disk scheduling worth it at all?
It is hard to say, but I'd like to run some benchmarks with this to see.
Also, as noted in [2], newer hardware does more magic on its own, and
solid state drives are coming along.

> o Where is the right place (in GEOM) for a disk scheduler?
As discussed in [2], some suggested that disk scheduling should be done in a
lower part of the kernel due to knowledge of hardware capabilities.

As discussed in [1], ATA for instance does its own scheduling, so this might
ruin performance (even the hardware might do some magic of its own). I
think I tried disabling it though, so it shouldn't be a big deal for testing.

> o How can anticipation be introduced into the GEOM framework?
This is actually perhaps one of the most interesting points, since the
anticipation principle in itself fits here, but some other scheduling
features might not be useful.

> o What can be an interface for disk schedulers?
I think the interface developed in [1] is a pretty good one actually. I think
the disksort routines looked like a good place to do this. Even there it might
not know enough about the hardware.

> o How to deal with devices that handle multiple requests at a time?
This is an example of the problems you get doing this in GEOM. You don't have
very good knowledge of the hardware.

> o How to deal with metadata requests and other VFS issues?
> 
> I think that some answers need a little bit of experimenting with
> real code and real hardware, so here is this attempt.  The
> interface used in this toy prototype for the scheduler is something
> like this:
> 
> typedef void *gs_init_t (struct g_geom *geom);
> typedef void gs_fini_t (void *data);
> typedef void gs_start_t (void *data, struct bio *bio);
> typedef void gs_done_t (void *data, struct bio *bio);
> 
> struct g_gsched {
>   const char  *gs_name;   /* Scheduler name. */
>   int gs_refs;/* Refcount, internal use. */
> 
>   gs_init_t   *gs_init;   /* Called on geom creation. */
>   gs_fini_t   *gs_fini;   /* Called on geom destruction. */
>   gs_start_t  *gs_start;  /* Called on geom start. */
>   gs_done_t   *gs_done;   /* Called on geom done. */
> 
>   LIST_ENTRY(g_gsched) glist; /* List of schedulers, internal use. */
> };
> 
> The main idea is to allow the scheduler to enqueue the requests having only
> one (other small fixed numbers can be better on some hardware) outstanding
> request and to pass new requests to its provider only after the service of
> the previous one ended.
> 
> The example scheduler in the draft takes the following approach:
> 
> o a scheduling GEOM class is introduced.  It can be stacked on
>   top of disk geoms, and schedules all the requests coming
>   from its consumers.  I'm not absolutely sure that a new class
>   is really needed but I think that it can simplify testing and
>   experimenting with various solutions on the scheduler placement.
> o  Requests coming from consumers are passed down immediately
>   if there is no other request under service, otherwise they
>   are queued in a bioq.
> o  When a request is served the scheduler is notified, then it
>   can pass down a new request, or, as in this toy anticipatory[3]
>   scheduler, wait for a new request from the same process, or
>   for a timeout to expire, and only after one of those events
>   make the next scheduling decision.
> 
> So, as I've said, I'd like to know what you think about the subject,
> if I'm missing something, if there is some kind of interest on this
> and if/how can this work proceed.

Also, what would be interesting is implementing I/O priorities for processes,
to be able to distribute I/O time more fairly (or at least according to
preference). This was done in the Hybrid project, but this is
something that definitely could be done in GEOM. (I see you have some
fairness in the g_as_dispatch routine though).

Pluggable Disk Scheduler Project

2007-10-10 Thread Fabio Checconi
Hi,
is anybody working on the `Pluggable Disk Scheduler Project' from
the ideas page?

To better understand how GEOM works, and how a (non work conserving)
disk scheduler can fit into it, I've written a very simple, yet
working, prototype:

http://feanor.sssup.it/~fabio/freebsd/g_sched/geom-sched-class.patch


I'd like to take a better look at the problem and work on it, and
I'd like to know what you think about it.

After reading [1], [2] and its follow-ups the main problems that
need to be addressed seem to be:

o is working on disk scheduling worth it at all?
o Where is the right place (in GEOM) for a disk scheduler?
o How can anticipation be introduced into the GEOM framework?
o What can be an interface for disk schedulers?
o How to deal with devices that handle multiple requests at a time?
o How to deal with metadata requests and other VFS issues?

I think that some answers need a little bit of experimenting with
real code and real hardware, so here is this attempt.  The
interface used in this toy prototype for the scheduler is something
like this:

typedef void *gs_init_t (struct g_geom *geom);
typedef void gs_fini_t (void *data);
typedef void gs_start_t (void *data, struct bio *bio);
typedef void gs_done_t (void *data, struct bio *bio);

struct g_gsched {
        const char  *gs_name;           /* Scheduler name. */
        int         gs_refs;            /* Refcount, internal use. */

        gs_init_t   *gs_init;           /* Called on geom creation. */
        gs_fini_t   *gs_fini;           /* Called on geom destruction. */
        gs_start_t  *gs_start;          /* Called on geom start. */
        gs_done_t   *gs_done;           /* Called on geom done. */

        LIST_ENTRY(g_gsched) glist;     /* List of schedulers, internal use. */
};

The main idea is to allow the scheduler to enqueue the requests having only
one (other small fixed numbers can be better on some hardware) outstanding
request and to pass new requests to its provider only after the service of
the previous one ended.

The example scheduler in the draft takes the following approach:

o a scheduling GEOM class is introduced.  It can be stacked on
  top of disk geoms, and schedules all the requests coming
  from its consumers.  I'm not absolutely sure that a new class
  is really needed but I think that it can simplify testing and
  experimenting with various solutions on the scheduler placement.
o  Requests coming from consumers are passed down immediately
  if there is no other request under service, otherwise they
  are queued in a bioq.
o  When a request is served the scheduler is notified, then it
  can pass down a new request, or, as in this toy anticipatory[3]
  scheduler, wait for a new request from the same process, or
  for a timeout to expire, and only after one of those events
  make the next scheduling decision.
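
To illustrate, the two interesting hooks end up looking more or less
like this (a simplified sketch of the approach above; locking and the
anticipation timer are left out, and the g_as_softc layout is made
up, see the patch [4] for the real code):

struct g_as_softc {                     /* hypothetical, for the sketch */
        struct g_consumer       *sc_consumer;
        struct bio_queue_head    sc_queue;
        int                      sc_busy;
};

static void
g_as_start(void *data, struct bio *bio)
{
        struct g_as_softc *sc = data;

        if (sc->sc_busy) {
                /* A request is already under service: hold this one. */
                bioq_insert_tail(&sc->sc_queue, bio);
        } else {
                sc->sc_busy = 1;
                g_io_request(bio, sc->sc_consumer);     /* pass it down */
        }
}

static void
g_as_done(void *data, struct bio *bio)
{
        struct g_as_softc *sc = data;
        struct bio *next;

        /*
         * The anticipatory version would wait here for a new request
         * from the same process, or for the timeout, before choosing.
         */
        if ((next = bioq_first(&sc->sc_queue)) != NULL) {
                bioq_remove(&sc->sc_queue, next);
                g_io_request(next, sc->sc_consumer);
        } else
                sc->sc_busy = 0;
}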

So, as I've said, I'd like to know what you think about the subject,
if I'm missing something, if there is some kind of interest on this
and if/how can this work proceed.

Thanks in advance,
fabio


[1]  http://wiki.freebsd.org/Hybrid

[2]  http://lists.freebsd.org/pipermail/freebsd-geom/2007-January/001854.html

[3]  The details of the anticipation are really not interesting as it
is extremely simplified on purpose.

[4]  http://feanor.sssup.it/~fabio/freebsd/g_sched/ also contains a userspace
client to experiment with the GEOM class.
