sx locks and memory barriers

2009-09-24 Thread Fabio Checconi
Hi all,
  looking at sys/sx.h I have some trouble understanding this comment:

 * A note about memory barriers.  Exclusive locks need to use the same
 * memory barriers as mutexes: _acq when acquiring an exclusive lock
 * and _rel when releasing an exclusive lock.  On the other side,
 * shared lock needs to use an _acq barrier when acquiring the lock
 * but, since they don't update any locked data, no memory barrier is
 * needed when releasing a shared lock.

In particular, I'm not understanding what prevents the following sequence
from happening:

CPU A                                   CPU B

sx_slock(data->lock);

sx_sunlock(data->lock);

/* reordered after the unlock
   by the cpu */
if (data->buffer)
                                        sx_xlock(data->lock);
                                        free(data->buffer);
                                        data->buffer = NULL;
                                        sx_xunlock(data->lock);

        a = *data->buffer;

IOW, even if readers do not modify the data protected by the lock,
without a release barrier a memory access may leak past the unlock
(the cpu sees no dependency between the unlock and the fetch, so it
feels free to reorder them), thus potentially racing with an exclusive
writer accessing the data.

On architectures where atomic ops serialize memory accesses this would
never happen, otherwise the sequence above seems possible; am I missing
something?
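
To make the question concrete, this is how I read the difference
between a plain atomic op and its _rel variant on a simplified reader
count (all names made up, this is not the sx(9) code):

#include <sys/param.h>
#include <machine/atomic.h>

struct rdata {
	volatile u_int	readers;	/* active shared holders (made up) */
};

static void
rd_slock(struct rdata *d)
{
	/* _acq: loads in the critical section cannot move above this. */
	atomic_add_acq_int(&d->readers, 1);
}

static void
rd_sunlock_no_rel(struct rdata *d)
{
	/*
	 * Plain atomic op, no release barrier: nothing orders the reads
	 * of the protected data before the drop of the count, which is
	 * exactly the reordering in the sequence above.
	 */
	atomic_subtract_int(&d->readers, 1);
}

static void
rd_sunlock_rel(struct rdata *d)
{
	/*
	 * _rel: earlier loads and stores complete before the drop, so
	 * the fetch cannot leak past the unlock.
	 */
	atomic_subtract_rel_int(&d->readers, 1);
}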


SoC2009: Geom-based Disk Schedulers

2009-05-01 Thread Fabio Checconi
Hi all,
  I'm a PhD student; this summer I'll work on a SoC project on disk
scheduling.  I will extend the work we started with luigi, which we already
presented in [1, 2].  Two of the main areas that still need improvement,
and that will be considered during the project, are proper request
classification and performance tuning for the specific scheduling
algorithms.

More info available on the wiki page of the project:

  http://wiki.freebsd.org/SOC2009FabioChecconi


[1] http://lists.freebsd.org/pipermail/freebsd-stable/2009-January/047597.html
[2] http://lists.freebsd.org/pipermail/freebsd-stable/2009-March/048704.html


Re: Pluggable Disk Scheduler Project

2007-10-17 Thread Fabio Checconi
 From: Karsten Behrmann [EMAIL PROTECTED]
 Date: Tue, Oct 16, 2007 04:10:37PM +0200

  Hi,
  is anybody working on the `Pluggable Disk Scheduler Project' from
  the ideas page?
 I've been kicking the idea around in my head, but I'm probably newer to
 everything involved than you are, so feel free to pick it up. If you want,
 we can toss some ideas and code to each other, though I don't really
 have anything on the latter.

Thank you for your answer, I'd really like to work/discuss with you
and anyone else interested in this project.


  o Where is the right place (in GEOM) for a disk scheduler?
 I have spent some time at eurobsdcon talking to Kirk and phk about
 this, and the result was that I now know strong proponents both for
 putting it into the disk drivers and for putting it into geom ;-)
 
 Personally, I would put it into geom. I'll go into more detail on
 this later, but basically, geom seems a better fit for high-level
 code than a device driver, and if done properly performance penalties
 should be negligible.
 

I'm pretty interested in the arguments for the opposite solution too;
I started from GEOM because it seemed to be a) what was proposed/
requested on the ideas page, and b) cleaner, at least for a prototype.
I also wanted to start with some code to evaluate the performance
penalties of that approach.

I am a little bit wary of changing the queueing mechanisms that
drivers use, as this kind of modification can be difficult to write,
test and maintain, but I'd really like to know what people with
experience in those kernel areas think about the possibility of doing
more complex I/O scheduling, with some sort of unified interface, at
this level.

As a side note, I've not posted any performance numbers so far because
at the moment I only have access to old ATA drives that would not
give significant results.


  o How can anticipation be introduced into the GEOM framework?
 I wouldn't focus on just anticipation, but also other types of
 schedulers (I/O scheduling influenced by nice value?)
 

That would be interesting, especially for the background fsck case.
I think that some kind of fair-sharing approach should be used; as
you say below, a priority-driven scheduler can have interactions with
the VFS that are difficult to track.  (This problem was also pointed
out in one of the follow-ups to [1].)


  o How to deal with metadata requests and other VFS issues?
 Like any other disk request, though for priority-respecting
 schedulers this may get rather interesting.
 
 [...]
  The main idea is to allow the scheduler to enqueue the requests
  having only one (other small fixed numbers can be better on some
  hardware) outstanding request and to pass new requests to its
  provider only after the service of the previous one ended.
 You'll want to queue at least two requests at once. The reason for
 this is performance:
 Currently, drivers queue their own I/O. This means that as soon
 as a request completes (on devices that don't have in-device
 queues), they can fairly quickly grab a new request from their
 internal queue and push it back to the device from the interrupt
 handler or some other fast method.

Wouldn't that require, to be sustainable (unless you want a fast
dispatch only every other request), that the driver queue always
holds two or more requests?  To keep the driver queue refilled, you
would then ask the upper scheduler to dispatch a new request every
time one is taken from the driver queue and sent to the disk, or in
any case before the request under service completes.
But then you cannot have an anticipation mechanism, because the next
request you'd want to dispatch from the upper scheduler has not been
issued yet (it will be only after the one being served completes and
after the userspace application restarts.)


 Having the device idle while the response percolates up the geom
 stack and a new request down will likely be rather wasteful.

I completely agree on that.  I've only done it this way because it
was the least intrusive option I could find.  What could other, more
efficient alternatives be?  (Obviously without subverting any of the
existing interfaces, and still allowing the anticipation of requests.)


 For disks with queuing, I'd recommend trying to keep the queue
 reasonably full (unless the queuing strategy says otherwise),
 for disks without queuing I'd say we want to push at least one
 more request down. Personally, I think the sanest design would
 be to have device drivers return a temporary I/O error along
 the lines of EAGAIN, meaning their queue is full.
 

This can be easily done for asynchronous requests.  The trouble
arises when dealing with synchronous requests: if you dispatch
more than one synchronous request you are serving more than one
process, and you get long seek times between the requests you have
dispatched.  I think we should run some tests on this (I'll do that
as soon as I have access to a realistic test system).

Re: Pluggable Disk Scheduler Project

2007-10-17 Thread Fabio Checconi
 From: Ulf Lilleengen [EMAIL PROTECTED]
 Date: Wed, Oct 17, 2007 01:07:15PM +0200

 On tir, okt 16, 2007 at 04:10:37 +0200, Karsten Behrmann wrote:
 Over to a more general view of it's architecture:
 
 When I looked at this project for the first time, I was under the impression
 that this would be best done in a GEOM class.
 
 However, I think the approach that was taken in the Hybrid project is better 

Ok.  I think that such a solution requires a lot more effort on the
design and coding sides, as it requires modifying the drivers and
can bring problems with locking and with queueing assumptions that
may vary on a per-driver basis.

Maybe I don't have enough experience/knowledge of the driver subsystem,
but I would not remove the queueing that the drivers do now
(think of ATA freezepoints); instead I'd like to try to grab
the requests before they get to the driver (e.g., in/before their
d_strategy call) and have some sort of pull mechanism when requests
complete (I still don't have any (serious) idea on that; I fear that
the right place to do it, because of locking issues and so on, can be
driver dependent.)  Any ideas on that?  Which drivers could be good
starting points for writing some code?


 Also, I got my test-box up again today, and will be trying your patch as soon
 as I've upgraded it to CURRENT Fabio.

Thank you very much!  Please consider that my primary concern with
the patch was its interface; the algorithm is just an example (it
should give an idea of the performance loss due to the mechanism
overhead with async requests, and of some improvement on greedy sync
loads.)



Re: Pluggable Disk Scheduler Project

2007-10-17 Thread Fabio Checconi
 From: Ulf Lilleengen [EMAIL PROTECTED]
 Date: Wed, Oct 17, 2007 03:09:35PM +0200

 On ons, okt 17, 2007 at 02:19:07 +0200, Fabio Checconi wrote:
  Maybe I've not enough experience/knowledge of the driver subsystem,
[...]
 If you look at it, Hybrid is just a generalization of the existing
 bioq_* API already defined. And this API is used by GEOM classes _before_
 device drivers get the requests AFAIK. 
 

I looked at the Hybrid code, but I don't think the bioq_* family
of calls is the right place to start, because of the problems
experienced during the Hybrid development with locking/anticipation,
and because the same request can pass through multiple bioqs on its
path to the device (e.g., two stacked geoms using two different bioqs
and then a device driver using bioq_* to organize its queue, or geoms
using more than one bioq, like raid3; I think the complexity can
become unmanageable.)  One could even think of configuring each
single bioq in the system, but things can get very complex that way.


 For a simple example on a driver, the md-driver might be a good place to
 look. Note that I have little experience and knowledge of the driver
 subsystem myself.
 

I'll take a look, thanks.


 Also note (from the Hybrid page):
 * we could not provide support for non work-conserving schedulers, due to a
[...]
 
 This certainly argues for having this in the GEOM layer, but perhaps it's
 possible to change the assumtions done in some drivers? The locking issue
 should perhaps be better planned though, and an audit of the driver disksort
 code is necessary.
 

I need some more time to think about that :)


 Also:
 * as said, the ATA driver in 6.x/7.x moves the disksort one layer below the
   one we are working at, so this particular work won't help on ATA-based 6.x
   machines.
   We should figure out how to address this, because the work done at that
   layer is mostly a replica of the bioq_*() API. 
 
 So, I see this can get a bit messy thinking of that the ATA drivers does
 disksorts on its own, but perhaps it would be possible to fix this by letting
 changing the general ATA driver to use it's own pluggable scheduler.
 
 Anyway, I shouldn't demand that you do this, especially since I don't have
 any code or anything to show to, and because you decide what you want to do.

I still cannot say if a GEOM scheduler is better than a scheduler
put at a lower level, or if the bioq_* interface is better than any
other alternative, so your suggestions are welcome.  Moreover I'd
really like to discuss/work together, or at least do things with
some agreement on them.  If I have the time to experiment with
more than one solution I'll be happy to do that.


 However, I'd hate to see the Hybrid effort go to waste :) I was hoping some
 of the authors of the project would reply with their thoughts, so I CC'ed
 them. 

Well, the work done on Hybrid also had interesting aspects on
the algorithm side... but that's another story...



Re: Pluggable Disk Scheduler Project

2007-10-12 Thread Fabio Checconi
 From: Alexander Leidinger [EMAIL PROTECTED]
 Date: Fri, Oct 12, 2007 08:18:35AM +0200

 Quoting Fabio Checconi [EMAIL PROTECTED] (from Thu, 11 Oct 2007  
 13:48:28 +0200):
 
 From: Ulf Lilleengen [EMAIL PROTECTED]
 Date: Thu, Oct 11, 2007 10:07:34AM +0200
 
 On tor, okt 11, 2007 at 04:20:01 +0200, Fabio Checconi wrote:
  o How to deal with devices that handle multiple request per time?
 This is an example of the problems you get doing this in GEOM. You   
 don't have
 very good knowledge of the hardware.
 
 One can't pass this info from the lower layers up into GEOM (maybe by  
 adding some attribute querying interface in GEOM if it doesn't exist)?
 

I think the g_getattr() call is there for / can be used for things like
that.  The scheduler should only need to know how many outstanding
requests it can allow; otherwise it should be rather independent
of the lower layers.
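
Just to make it concrete, something like the sketch below is roughly
what I have in mind; the attribute name is made up, as far as I know
nothing exports it today:

#include <sys/param.h>
#include <geom/geom.h>

/*
 * Sketch only: ask the provider for a queue-depth hint, falling back
 * to a single outstanding request.  "GEOM::sched::maxqueue" is a
 * made-up attribute name, nothing exports it today.
 */
static int
gs_max_outstanding(struct g_consumer *cp)
{
	int depth;

	if (g_getattr("GEOM::sched::maxqueue", cp, &depth) == 0 && depth > 0)
		return (depth);
	return (1);		/* conservative default */
}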

Anyway, hardware queueing also brings a different kind of problem:
it can't easily be mixed with anticipation, because if you have
synchronous requests and you dispatch more than one of them you
are by definition serving more than one process, and thus you can
break anticipation, unless this is done very carefully (e.g., when
switching from one process to another, or when mixing synchronous and
asynchronous requests.)



Re: Pluggable Disk Scheduler Project

2007-10-11 Thread Fabio Checconi
Hi,

 From: Ulf Lilleengen [EMAIL PROTECTED]
 Date: Thu, Oct 11, 2007 10:07:34AM +0200

 On tor, okt 11, 2007 at 04:20:01 +0200, Fabio Checconi wrote:
  o is working on disk scheduling worth at all?
 It is hard to say, but I'd like to run some benchmarks with this to see.
 Also, noted in [2], newer hardware does more magic on their own, as well as
 solid state drives coming along.
 

this is why I wanted to start with some kind of prototype, hoping
that its simplicity does not limit too much the results we can obtain.


  o Where is the right place (in GEOM) for a disk scheduler?
 As discussed in [2], some suggested that disk scheduling should be done on a
 lower part of a kernel due to knowledge of hardware capabilities.
 
 As discussed in [1], ata for instance do it's own scheduling, so this might
 ruin performance (Even the hardware might do some magic of it's own). I
 think I tried disabling it though, so shouldn't be a big deal for testing.
 

I don't know if disabling the lower-level queueing is needed, because
if you have only one outstanding request (or just a few, for hardware
that supports that; it can be a parameter for the scheduler) the
lower-level queueing will not reorder the higher-level schedule.


  o How can anticipation be introduced into the GEOM framework?
 This is actually perhaps one of the most interesting points, since the
 anticipation principle in itself fits here, but some other scheduling
 features might not be useful.
 

Ok.  Decoupling the anticipation from other scheduling details may
not be easy, but this thing is all about trying :)


  o What can be an interface for disk schedulers?
 I think the interface developed in [1] is a pretty good one actually. I think
 the disksort-routines looked as a good place to do this. Even there it might
 not know enough about the hardware.
 
  o How to deal with devices that handle multiple request per time?
 This is an example of the problems you get doing this in GEOM. You don't have
 very good knowledge of the hardware.
 
  So, as I've said, I'd like to know what you think about the subject,
  if I'm missing something, if there is some kind of interest on this
  and if/how can this work proceed.
 
 Also, what would be interesting is implementing I/O priorities for processes
 to be able to give I/O time more fairly(or at least being able to set after
 preference) to processes. This was done in the Hybrid project, but this is
 something that definately could be done in GEOM. (I see you have some
 fairness in the g_as_dispatch routine though). 
 

I totally agree.  My primary concern with this email was to learn
what others have done/think about the problem, and to try to identify
some kind of interface and positioning for the scheduler.  The
actual scheduler has to be something _much_ more complex than this
little thing.  Hybrid ideas can be mapped to a CFQ-like scheduler
(one C-LOOK queue per process, fair sharing among queues, anticipation
on a per-queue basis,) and I'm working on that with Paolo Valente (in CC,)
but I think the infrastructure behind the scheduler is more important
now, as it defines what the scheduler can do.
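
Just to give an idea of the shape it could take, a data-layout sketch
(all names made up, this is not the code we are working on):

#include <sys/param.h>
#include <sys/queue.h>
#include <sys/bio.h>

/*
 * Data-layout sketch only: one C-LOOK queue per process, kept on a
 * round-robin list for fair sharing, with anticipation state per queue.
 */
struct gs_proc_queue {
	pid_t			   pq_pid;	/* owning process */
	struct bio_queue_head	   pq_bioq;	/* its C-LOOK queue */
	int			   pq_budget;	/* service budget for fairness */
	int			   pq_anticipating; /* waiting for its next request? */
	TAILQ_ENTRY(gs_proc_queue) pq_link;	/* round-robin of active queues */
};

TAILQ_HEAD(gs_rr_list, gs_proc_queue);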


 However, I'll try testing the work you've got. I'll see if I can get some
 numbers with this when I get some disks up.
 
 Btw, I did run some benchmark when I tried chaning bioq_disksort into a FIFO
 queue which didn's seem to lower performance (on SCSI and UMASS, but need to
 test again with ATA). It was a long time ago, so it should be tried again
 though.

I think this can depend on the access patterns used for testing (on
disks; of course on flash devices disk sorting is not needed at
all.)  If you have processes that issue only synchronous requests there
is almost no difference between a .*LOOK elevator and FIFO queueing,
since the queue will always hold only one request per process,
and you switch between processes every time you serve a new request
(of course the actual order will change, but the number of seeks
is the factor that really limits the throughput in this situation.
At least this is my understanding of the problem :) )

The test patterns we are using with Paolo try to pessimize the disk
throughput by reading in parallel (simply with dd, which generates a
typical example of a greedy synchronous sequential read pattern)
from two or more files placed on partitions at opposite ends of the
disk (at least in terms of their logical addresses).  This kind of
access should generate something close to the worst-case behavior
for a work-conserving .*LOOK scheduler.  Of course the behavior with
asynchronous requests also has to be tested.

Thank you very much for your feedback.  I hope we can get some
numbers to substantiate this topic, remembering also that a
good interface is a prerequisite for a good scheduler.


  
  [1]  http://wiki.freebsd.org/Hybrid
  
  [2]  
  http://lists.freebsd.org/pipermail/freebsd-geom/2007-January/001854.html
  
  [3]  The details of the anticipation are really not interesting as it
  is extremely simplified on purpose.

Pluggable Disk Scheduler Project

2007-10-10 Thread Fabio Checconi
Hi,
is anybody working on the `Pluggable Disk Scheduler Project' from
the ideas page?

To better understand how GEOM works, and how a (non work conserving)
disk scheduler can fit into it, I've written a very simple, yet
working, prototype:

http://feanor.sssup.it/~fabio/freebsd/g_sched/geom-sched-class.patch


I'd like to take a better look at the problem and work on it, and
I'd like to know what you think about it.

After reading [1], [2] and its follow-ups the main problems that
need to be addressed seem to be:

o is working on disk scheduling worth it at all?
o Where is the right place (in GEOM) for a disk scheduler?
o How can anticipation be introduced into the GEOM framework?
o What can be an interface for disk schedulers?
o How to deal with devices that handle multiple requests at a time?
o How to deal with metadata requests and other VFS issues?

I think that some answers need a little bit of experimenting with
real code and real hardware, so here is this attempt.  The
interface used in this toy prototype for the scheduler is something
like this:

typedef void *gs_init_t (struct g_geom *geom);
typedef void gs_fini_t (void *data);
typedef void gs_start_t (void *data, struct bio *bio);
typedef void gs_done_t (void *data, struct bio *bio);

struct g_gsched {
const char  *gs_name;   /* Scheduler name. */
int gs_refs;/* Refcount, internal use. */

gs_init_t   *gs_init;   /* Called on geom creation. */
gs_fini_t   *gs_fini;   /* Called on geom destruction. */
gs_start_t  *gs_start;  /* Called on geom start. */
gs_done_t   *gs_done;   /* Called on geom done. */

LIST_ENTRY(g_gsched) glist; /* List of schedulers, internal use. */
};

The main idea is to let the scheduler enqueue the requests, keeping only
one outstanding request (other small fixed numbers can be better on some
hardware), and to pass new requests down to its provider only after the
service of the previous one has ended.

The example scheduler in the draft takes the following approach:

o  a scheduling GEOM class is introduced.  It can be stacked on
  top of disk geoms, and schedules all the requests coming
  from its consumers.  I'm not absolutely sure that a new class
  is really needed, but I think it can simplify testing and
  experimenting with various solutions for the scheduler placement.
o  Requests coming from consumers are passed down immediately
  if there is no other request under service, otherwise they
  are queued in a bioq.
o  When a request completes the scheduler is notified; it can then
  pass down a new request or, as in this toy anticipatory[3]
  scheduler, wait for a new request from the same process, or
  for a timeout to expire, and only after one of those events
  make the next scheduling decision.
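
To make the above more concrete, this is roughly what the two main
hooks do in the toy scheduler.  Take it as an illustrative sketch
only: the softc layout, the dispatch path and the pid-based
anticipation heuristic are simplifications, and locking and error
handling are omitted for brevity.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/kernel.h>
#include <sys/bio.h>
#include <sys/callout.h>
#include <sys/proc.h>
#include <geom/geom.h>

struct g_as_softc {
	struct g_consumer	*sc_cp;		/* where requests are sent down */
	struct bio_queue_head	 sc_queue;	/* requests waiting for service */
	struct bio		*sc_current;	/* the single outstanding request */
	struct callout		 sc_wait;	/* anticipation timeout */
	pid_t			 sc_last_pid;	/* process we are anticipating */
	int			 sc_waiting;	/* anticipation in progress? */
};

static void
g_as_dispatch(struct g_as_softc *sc, struct bio *bp)
{

	sc->sc_current = bp;
	g_io_request(bp, sc->sc_cp);
}

/* gs_start: requests arrive here in the issuing thread's context. */
static void
g_as_start(void *data, struct bio *bp)
{
	struct g_as_softc *sc = data;
	pid_t pid = curproc->p_pid;	/* crude ownership guess; a real
					   scheduler would tag each bio */

	if (sc->sc_waiting && pid == sc->sc_last_pid) {
		/* The anticipated request arrived: serve it right away. */
		callout_stop(&sc->sc_wait);
		sc->sc_waiting = 0;
		sc->sc_last_pid = pid;
		g_as_dispatch(sc, bp);
	} else if (sc->sc_current == NULL && !sc->sc_waiting) {
		sc->sc_last_pid = pid;
		g_as_dispatch(sc, bp);
	} else
		bioq_disksort(&sc->sc_queue, bp);
}

static void
g_as_wait_timeout(void *arg)
{
	struct g_as_softc *sc = arg;
	struct bio *bp;

	/* Anticipation failed: fall back to the queued requests. */
	sc->sc_waiting = 0;
	if ((bp = bioq_first(&sc->sc_queue)) != NULL) {
		bioq_remove(&sc->sc_queue, bp);
		g_as_dispatch(sc, bp);
	}
}

/* gs_done: either anticipate the same process or serve the next request. */
static void
g_as_done(void *data, struct bio *bp)
{
	struct g_as_softc *sc = data;
	struct bio *next;

	sc->sc_current = NULL;
	if (bp->bio_cmd == BIO_READ) {
		/* Wait a little for another request from the same process. */
		sc->sc_waiting = 1;
		callout_reset(&sc->sc_wait, hz / 100, g_as_wait_timeout, sc);
	} else if ((next = bioq_first(&sc->sc_queue)) != NULL) {
		bioq_remove(&sc->sc_queue, next);
		g_as_dispatch(sc, next);
	}
}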

So, as I've said, I'd like to know what you think about the subject,
if I'm missing something, if there is some kind of interest in this,
and if/how this work can proceed.

Thanks in advance,
fabio


[1]  http://wiki.freebsd.org/Hybrid

[2]  http://lists.freebsd.org/pipermail/freebsd-geom/2007-January/001854.html

[3]  The details of the anticipation are really not interesting as it
is extremely simplified on purpose.

[4]  http://feanor.sssup.it/~fabio/freebsd/g_sched/ also contains a userspace
client to experiment with the GEOM class.
