Re: [kvm-devel] regression breaks lowmem reserved RAM

2008-04-01 Thread Jens Axboe
On Tue, Apr 01 2008, Andrea Arcangeli wrote:
 Looking a bit closer into this regression the reason this can't be
 right is that dma_addr common default is BLK_BOUNCE_HIGH and most
 machines have less than 4G. So if you do:
 
 if (b_pfn >= (min_t(u64, 0xffffffff, BLK_BOUNCE_HIGH) >> PAGE_SHIFT))
   dma = 1
 
 that will translate to:
 
  if (BLK_BOUNCE_HIGH >= BLK_BOUNCE_HIGH)
   dma = 1
 
 So for 99% of hardware this will trigger unnecessary GFP_DMA
 allocations and isa pooling operations.
 
 Also note how the 32bit code still does b_pfn < blk_max_low_pfn.
 
 I guess this is what you were looking for. I didn't verify but as
 far as I can tell, this will stop the regression with isa dma
 operations at boot for 99% of blkdev/memory combinations out there and
 I guess this fixes the setups with 4G of ram and 32bit pci cards as
 well (this also retains symmetry with the 32bit code).
 
 Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]

Thanks Andrea, this looks much saner!

-- 
Jens Axboe


-
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 4/6] virtio block driver

2007-09-21 Thread Jens Axboe
On Fri, Sep 21 2007, Rusty Russell wrote:
 On Thu, 2007-09-20 at 15:05 +0200, Jens Axboe wrote:
  On Thu, Sep 20 2007, Rusty Russell wrote:
   +static void end_dequeued_request(struct request *req,
   +  struct request_queue *q, int uptodate)
   +{
   + /* And so the insanity of the block layer infects us here. */
   + int nsectors = req->hard_nr_sectors;
   +
   + if (blk_pc_request(req)) {
   + nsectors = (req->data_len + 511) >> 9;
   + if (!nsectors)
   + nsectors = 1;
   + }
   + if (end_that_request_first(req, uptodate, nsectors))
   + BUG();
   + add_disk_randomness(req->rq_disk);
   + end_that_request_last(req, uptodate);
   +}
  
  We have end_queued_request(), lets add end_dequeued_request(). Below.
 
 OK, thanks, I'll throw that in the mix and test...

I've queued it for 2.6.24 as well, so should be in mainline in that time
frame.

   + vblk->sg[0].page = virt_to_page(vbr->out_hdr);
   + vblk->sg[0].offset = offset_in_page(vbr->out_hdr);
   + vblk->sg[0].length = sizeof(vbr->out_hdr);
   + num = blk_rq_map_sg(q, vbr->req, vblk->sg + 1);
  
  This won't work for chained sglists, but I gather (I'm so funny) that it
  won't be an issue for you. How large are your sglists?
 
 Hmm, potentially extremely large.  What do I need to do for chained
 sglists?

Not a lot, actually. You snipped the problematic part in your reply,
though:

vblk->sg[num+1].page = virt_to_page(vbr->in_hdr);

which assumes that sg is a contiguous piece of memory; for chained sg
lists that isn't true. sg chaining will be in 2.6.24, so if you really
do need large sglists, then that's the way to go.

blk_rq_map_sg() maps correctly for you, no changes needed there. But you
want to use sg_last() for adding to the end of the sglist. And then use
sg_next() to retrieve the next sg segment instead of sg + 1, and
for_each_sg() to loop over all segments.

Just something to keep in mind, if you plan on merging this post 2.6.23.

   + if (!do_req(q, vblk, req)) {
   + /* Queue full?  Wait. */
   + blk_stop_queue(q);
   + break;
   + }
  
  Personally I think this bool stuff is foul. You return false/true, but
  still use ! to test. That is just more confusing than the canonical 0/1
  good/bad return imho.
 
 Except 0/1 is not canonical in the kernel.  Arguably, -errno/0 is
 canonical.  OTOH, bool is clear.

-errno/0, then. 1 is typically used for 'failure without specific error
number' when -Exx doesn't apply. Like the above :-)

But let's just agree to disagree on the bool.

 if do_req() fails, we assume the queue is full.  I shall change the
 comment to that effect.

Thanks!

-- 
Jens Axboe




Re: [kvm-devel] [PATCH 4/6] virtio block driver

2007-09-21 Thread Jens Axboe
On Fri, Sep 21 2007, Rusty Russell wrote:
 On Thu, 2007-09-20 at 14:27 +0200, Jens Axboe wrote:
  On Thu, Sep 20 2007, Rusty Russell wrote:
   The block driver uses scatter-gather lists with sg[0] being the
   request information (struct virtio_blk_outhdr) with the type, sector
   and inbuf id.  The next N sg entries are the bio itself, then the last
   sg is the status byte.  Whether the N entries are in or out depends on
   whether it's a read or a write.
   
   We accept the normal (SCSI) ioctls: they get handed through to the other
   side which can then handle it or reply that it's unsupported.  It's
   not clear that this actually works in general, since I don't know
   if blk_pc_request() requests have an accurate rq_data_dir().
  
  They should, if they imply a data transfer.
 
 OK, good.  We rely on that to mark input vs output sg elements.  I was
 wondering if there was some weird requests which did both input and
 output.

There very soon will be bidirectional requests with both in and out
elements, but they will be clearly marked as such (and the driver needs
to be marked capable). So nothing to worry about for now.

   Although we try to reply -ENOTTY on unsupported commands, the block
   layer in its infinite wisdom suppressed the error so ioctl(fd,
   CDROMEJECT) returns success to userspace.
  
  How about ever submitting a patch for that, instead of just repeatedly
  complaining about it?
 
 To be fair, I've left the complaint in that same patch, you're just
 reading it repeatedly 8)

That may be the case :-)

 I shall look through the code and see if I can figure out how to fix it.
 I'm assuming from your response that there's not some strange reason to
 preserve current behaviour.

It surely sounds like a bug, if you issue ioctl(fd, CDROMEJECT), the
driver sees it and returns -ENOTTY, but userland sees a 0 retval. So if
you have time, please do poke at it a bit.

-- 
Jens Axboe




Re: [kvm-devel] [PATCH 4/6] virtio block driver

2007-09-20 Thread Jens Axboe
On Thu, Sep 20 2007, Rusty Russell wrote:
 +static void end_dequeued_request(struct request *req,
 +  struct request_queue *q, int uptodate)
 +{
 + /* And so the insanity of the block layer infects us here. */
 + int nsectors = req->hard_nr_sectors;
 +
 + if (blk_pc_request(req)) {
 + nsectors = (req->data_len + 511) >> 9;
 + if (!nsectors)
 + nsectors = 1;
 + }
 + if (end_that_request_first(req, uptodate, nsectors))
 + BUG();
 + add_disk_randomness(req->rq_disk);
 + end_that_request_last(req, uptodate);
 +}

We have end_queued_request(), lets add end_dequeued_request(). Below.

 + vblk->sg[0].page = virt_to_page(vbr->out_hdr);
 + vblk->sg[0].offset = offset_in_page(vbr->out_hdr);
 + vblk->sg[0].length = sizeof(vbr->out_hdr);
 + num = blk_rq_map_sg(q, vbr->req, vblk->sg + 1);

This won't work for chained sglists, but I gather (I'm so funny) that it
won't be an issue for you. How large are your sglists?

 + if (!do_req(q, vblk, req)) {
 + /* Queue full?  Wait. */
 + blk_stop_queue(q);
 + break;
 + }

Personally I think this bool stuff is foul. You return false/true, but
still use ! to test. That is just more confusing than the canonical 0/1
good/bad return imho.


diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index f6d3994..c1dc23a 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -3719,15 +3719,24 @@ void end_that_request_last(struct request *req, int 
uptodate)
 EXPORT_SYMBOL(end_that_request_last);
 
 static inline void __end_request(struct request *rq, int uptodate,
-unsigned int nr_bytes)
+unsigned int nr_bytes, int dequeue)
 {
if (!end_that_request_chunk(rq, uptodate, nr_bytes)) {
-   blkdev_dequeue_request(rq);
+   if (dequeue)
+   blkdev_dequeue_request(rq);
add_disk_randomness(rq->rq_disk);
end_that_request_last(rq, uptodate);
}
 }
 
+static unsigned int rq_byte_size(struct request *rq)
+{
+   if (blk_fs_request(rq))
+   return rq->hard_nr_sectors << 9;
+
+   return rq->data_len;
+}
+
 /**
  * end_queued_request - end all I/O on a queued request
  * @rq:the request being processed
@@ -3741,18 +3750,29 @@ static inline void __end_request(struct request *rq, 
int uptodate,
  **/
 void end_queued_request(struct request *rq, int uptodate)
 {
-   unsigned int nr_bytes;
-
-   if (blk_fs_request(rq))
-   nr_bytes = rq->hard_nr_sectors << 9;
-   else
-   nr_bytes = rq->data_len;
-
-   __end_request(rq, uptodate, nr_bytes);
+   __end_request(rq, uptodate, rq_byte_size(rq), 1);
 }
 EXPORT_SYMBOL(end_queued_request);
 
 /**
+ * end_dequeued_request - end all I/O on a dequeued request
+ * @rq:the request being processed
+ * @uptodate:  error value or 0/1 uptodate flag
+ *
+ * Description:
+ * Ends all I/O on a request. The request must already have been
+ * dequeued using blkdev_dequeue_request(), as is normally the case
+ * for most drivers.
+ *
+ **/
+void end_dequeued_request(struct request *rq, int uptodate)
+{
+   __end_request(rq, uptodate, rq_byte_size(rq), 0);
+}
+EXPORT_SYMBOL(end_dequeued_request);
+
+
+/**
  * end_request - end I/O on the current segment of the request
  * @rq:the request being processed
  * @uptodate:  error value or 0/1 uptodate flag
@@ -3773,7 +3793,7 @@ EXPORT_SYMBOL(end_queued_request);
  **/
 void end_request(struct request *req, int uptodate)
 {
-   __end_request(req, uptodate, req->hard_cur_sectors << 9);
+   __end_request(req, uptodate, req->hard_cur_sectors << 9, 1);
 }
 EXPORT_SYMBOL(end_request);
 
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 12ab4d4..4b24265 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -732,6 +732,7 @@ extern int end_that_request_chunk(struct request *, int, 
int);
 extern void end_that_request_last(struct request *, int);
 extern void end_request(struct request *, int);
 extern void end_queued_request(struct request *, int);
+extern void end_dequeued_request(struct request *, int);
 extern void blk_complete_request(struct request *);
 
 /*

-- 
Jens Axboe




Re: [kvm-devel] More virtio users

2007-06-12 Thread Jens Axboe
On Mon, Jun 11 2007, Rusty Russell wrote:
 On Mon, 2007-06-11 at 08:41 +0200, Jens Axboe wrote:
  On Sun, Jun 10 2007, Rusty Russell wrote:
   On Sun, 2007-06-10 at 11:16 +0300, Avi Kivity wrote:
Rusty Russell wrote:
 Lguest doesn't have a framebuffer, so maybe this is a good thing for 
 me
 to hack on, but I promised myself I'd finish NAPI for the net device,
 and tag for block device first.
   

If you're touching the block device, passing a request's io priority to 
the host can be useful.
   
   OK, here's the interdiff.  I still don't handle non-fs requests, but I
   haven't seen any yet.  I should probably BUG_ON() there and wait for
   Jens to scream...
  
  Ehm no, that would certainly cause screaming :-)
  
  Checking for blk_fs_request() and terminating the request if you don't
  know how to handle it is the correct thing to do, a BUG() would
  definitely not be.
 
 So the problem is that I'd like to handle all of them, but I'm not clear
 what requests my device can get.  I can't see a source of any other type
 of request.

The other main request type is blk_pc_request(). In the data setup it's
identical to blk_fs_request(): there's a bio chain off ->bio. It's a
byte granularity entity though, so you should check ->data_len for the
size of it. ->cmd[] holds a SCSI cdb, which is the command you are
supposed to handle.

   + blk_queue_prep_rq(vblk->disk->queue, blk_queue_start_tag);
   +
  
  is quite a novel way, I actually had to look that code up to check
  whether it was correct. I'd much prefer a little wrapper around that,
 
 OK, but I got it from the blk_queue_start_tag documentation:
 
  *  Description:
  *This can either be used as a stand-alone helper, or possibly be
  *assigned as the queue prep_rq_fn (in which case struct request
  *automagically gets a tag assigned). Note that this function
  *assumes that any type of request can be queued! 

OK, bad documentation, I'll make a note to fix that! Sorry about that.

 I'm unsure on the whole ordered tag idea, though.  It seems like there
 can never be multiple requests in the queue with the same tag, so it
 effectively forces my (userspace) virtio server to serialize and
 fdatasync on every write request:
 
   if (outhdr->type && outhdr->tag != vblk->last_tag) {
   while (vblk->in_progress)
   handle_io_finish(fd, dev);
   fdatasync(fd);
   vblk->last_tag = outhdr->tag;
   }
 
 
 In fact, AFAICT any implementation of ordered tags will be forced into
 serial order.  Am I better off telling the block layer to drain and
 flush instead of ordered tags?  I can then push that through to the
 virtio server.

Perhaps you are misunderstanding what the tag is? The tag is a unique
identifier for a pending request, so you will by definition never have
requests sharing a tag value. But the tag is not to be considered as
ordered, unless it has the barrier flag set as well. So you only need to
serialize on the device side when blk_barrier_rq() is true.

-- 
Jens Axboe




Re: [kvm-devel] More virtio users

2007-06-12 Thread Jens Axboe
On Tue, Jun 12 2007, Rusty Russell wrote:
 On Mon, 2007-06-11 at 09:33 +0200, Jens Axboe wrote:
  On Mon, Jun 11 2007, Rusty Russell wrote:
   So the problem is that I'd like to handle all of them, but I'm not clear
   what requests my device can get.  I can't see a source of any other type
   of request.
  
  The other main request type is blk_pc_request(). In the data setup it's
  indentical to blk_fs_request(), there's a bio chain off -bio. It's a
  byte granularity entity though, so you should check -data_len for the
  size of it. -cmd[] holds a SCSI cdb, which is the command you are
  supposed to handle.
 
 SCSI?  I'm even more lost now.
 
 Q: So what *are* the commands?

They are SCSI commands!

 Q: Who puts them in my queue?

If you want to support SG_IO for instance, you'd have to deal with SCSI
commands.

 I have *never* seen anything but an fs request come through to my
 driver.

You probably only did basic IO to it.

  Perhaps you are misunderstanding what the tag is? The tag is a unique
  identifier for a pending request, so you will by definition never have
  requests sharing a tag value.
 
 Yes, I started to suspect that.
 
  But the tag is not to be considered as
  ordered, unless it has the barrier flag set as well. So you only need to
  serialize on the device side when blk_barrier_rq() is true.
 
 blk_barrier_rq(req) is never true.  I put a BUG_ON(blk_barrier_rq(req))
 in my code and booted.  This is using the device as a root ext3
 filesystem.

 I reverted all the tag changes, still no barriers.  I added
 blk_queue_ordered(vblk->disk->queue, QUEUE_ORDERED_DRAIN, NULL);,
 still no barriers.

-o barrier=1 for ext3, it doesn't use barriers by default. Or
barrier=flush for reiserfs. XFS defaults to barriers on.
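For reference, the mount options mentioned (device and mountpoint names below are placeholders):

```shell
# ext3 does not use barriers by default; ask for them explicitly
mount -t ext3 -o barrier=1 /dev/sda1 /mnt
# reiserfs spells the option differently
mount -t reiserfs -o barrier=flush /dev/sda2 /mnt
# XFS defaults to barriers on, so no extra option is needed
```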

-- 
Jens Axboe




Re: [kvm-devel] [PATCH RFC 3/3] virtio infrastructure: example block driver

2007-06-04 Thread Jens Axboe
On Mon, Jun 04 2007, Carsten Otte wrote:
 Jens Axboe wrote:
 On Fri, Jun 01 2007, Carsten Otte wrote:
 With regard to compute power needed, almost none. The penalty is 
 latency, not overhead: A small request may sit on the request queue to 
 wait for other work to arrive until the queue gets unplugged. This 
 penalty is compensated by the benefit of a good chance that more 
 requests will be merged during this time period.
 If we have this method both in host and guest, we have twice the 
 penalty with no added benefit.
 
 I don't buy that argument. We can easily expose the unplug delay, so you
 can kill it at what ever level you want. Or you could just do it in the
 driver right now, but that is a bit hackish.
 That would be preferable if the device driver could choose the unplug 
 delay, or even better it could be a (guest) sysfs tuneable.

Right. We probably should make it sysfs configurable in any case, right
now it's a (somewhat) policy decision in the kernel with the delay and
unplug depth.

-- 
Jens Axboe




Re: [kvm-devel] [PATCH RFC 3/3] virtio infrastructure: example block driver

2007-06-04 Thread Jens Axboe
On Mon, Jun 04 2007, Rusty Russell wrote:
 On Mon, 2007-06-04 at 15:43 +0200, Jens Axboe wrote:
  On Mon, Jun 04 2007, Carsten Otte wrote:
   Jens Axboe wrote:
   On Fri, Jun 01 2007, Carsten Otte wrote:
   With regard to compute power needed, almost none. The penalty is 
   latency, not overhead: A small request may sit on the request queue to 
   wait for other work to arrive until the queue gets unplugged. This 
   penalty is compensated by the benefit of a good chance that more 
   requests will be merged during this time period.
   If we have this method both in host and guest, we have twice the 
   penalty with no added benefit.
   
   I don't buy that argument. We can easily expose the unplug delay, so you
   can kill it at what ever level you want. Or you could just do it in the
   driver right now, but that is a bit hackish.
   That would be preferable if the device driver could choose the unplug 
   delay, or even better it could be a (guest) sysfs tuneable.
  
  Right. We probably should make it sysfs configurable in any case, right
  now it's a (somewhat) policy decision in the kernel with the delay and
  unplug depth.
 
 The danger is that it's just another knob no one knows how to use.
 Perhaps simply setting it to 0 for the noop scheduler will cover all
 known cases?

Most people should not fiddle with it, the defaults are there for good
reason. I can provide a blk_queue_unplug_thresholds(q, depth, delay)
helper that you could use for the virtualized drivers, perhaps that
would be better for that use?

-- 
Jens Axboe




Re: [kvm-devel] [PATCH RFC 3/3] virtio infrastructure: example block driver

2007-06-04 Thread Jens Axboe
On Mon, Jun 04 2007, Carsten Otte wrote:
 Jens Axboe wrote:
 Most people should not fiddle with it, the defaults are there for good
 reason. I can provide a blk_queue_unplug_thresholds(q, depth, delay)
 helper that you could use for the virtualized drivers, perhaps that
 would be better for that use?
 Yea, we shouldn't change the defaults without a good reason. That 
 would change things for all device drivers.
 This interface provides all functionality we need. I think we need a 
 knob in /sys/block/mydevice/queue/ in addition to that.

Something like this, totally untested (but trivial, so it should work
:-)

diff --git a/block/elevator.c b/block/elevator.c
index ce866eb..81e2a2d 100644
--- a/block/elevator.c
+++ b/block/elevator.c
@@ -638,7 +638,7 @@ void elv_insert(request_queue_t *q, struct request *rq, int 
where)
int nrq = q->rq.count[READ] + q->rq.count[WRITE]
- q->in_flight;
 
-   if (nrq >= q->unplug_thresh)
+   if (nrq >= q->unplug_thresh || !q->unplug_delay)
__generic_unplug_device(q);
}
 }
diff --git a/block/ll_rw_blk.c b/block/ll_rw_blk.c
index 6b5173a..aaefb32 100644
--- a/block/ll_rw_blk.c
+++ b/block/ll_rw_blk.c
@@ -785,6 +785,30 @@ void blk_queue_dma_alignment(request_queue_t *q, int mask)
 EXPORT_SYMBOL(blk_queue_dma_alignment);
 
 /**
+ * blk_queue_unplug_threshold - set automatic unplug thresholds for the queue
+ * @q: the request queue for the device
+ * @depth: the queue depth at which to do unplug
+ * @delay: maximum unplug timer delay
+ *
+ * Description:
+ *Set the desired unplug depth/threshold and delay for a given queue.
+ *The block layer has a set of good defaults for this, so this function
+ *should ONLY be used by drivers for virtualized environments, where
+ *you could potentially have several layers of queues that each do their
+ *own delay.
+ *
+ *If in doubt, don't use this function! The settings can also be
+ *tweaked from sysfs.
+ *
+ **/
+void blk_queue_unplug_threshold(request_queue_t *q, unsigned int depth,
+   unsigned long delay)
+{
+   q->unplug_thresh = depth;
+   q->unplug_delay = delay;
+}
+
+/**
  * blk_queue_find_tag - find a request by its tag and queue
  * @q:  The request queue for the device
  * @tag: The tag of the request
@@ -1550,7 +1574,8 @@ void blk_plug_device(request_queue_t *q)
return;
 
if (!test_and_set_bit(QUEUE_FLAG_PLUGGED, &q->queue_flags)) {
-   mod_timer(&q->unplug_timer, jiffies + q->unplug_delay);
+   if (q->unplug_delay)
+   mod_timer(&q->unplug_timer, jiffies + q->unplug_delay);
blk_add_trace_generic(q, NULL, 0, BLK_TA_PLUG);
}
 }
@@ -3975,6 +4000,54 @@ static ssize_t queue_max_hw_sectors_show(struct 
request_queue *q, char *page)
return queue_var_show(max_hw_sectors_kb, (page));
 }
 
+static ssize_t queue_unplug_delay_show(struct request_queue *q, char *page)
+{
+   return queue_var_show(q->unplug_delay, page);
+}
+
+/*
+ * We don't bother rearming a running timer. It's just not worth it, the
+ * next unplug will get it right.
+ */
+static ssize_t queue_unplug_delay_store(struct request_queue *q,
+   const char *page, size_t count)
+{
+   unsigned long delay;
+   int ret;
+
+   ret = queue_var_store(delay, page, count);
+
+   spin_lock_irq(q->queue_lock);
+   q->unplug_delay = msecs_to_jiffies(delay);
+   spin_unlock_irq(q->queue_lock);
+
+   return ret;
+}
+
+static ssize_t queue_unplug_depth_show(struct request_queue *q, char *page)
+{
+   return queue_var_show(q->unplug_thresh, page);
+}
+
+/*
+ * We don't bother unplugging if the depth was reduced and we just happened
+ * to have a current queue depth of somewhere in between the old and new
+ * value.
+ */
+static ssize_t queue_unplug_depth_store(struct request_queue *q,
+   const char *page, size_t count)
+{
+   unsigned long depth;
+   int ret;
+
+   ret = queue_var_store(depth, page, count);
+
+   spin_lock_irq(q->queue_lock);
+   q->unplug_thresh = depth;
+   spin_unlock_irq(q->queue_lock);
+
+   return ret;
+}
 
 static struct queue_sysfs_entry queue_requests_entry = {
.attr = {.name = "nr_requests", .mode = S_IRUGO | S_IWUSR },
@@ -4005,12 +4078,26 @@ static struct queue_sysfs_entry queue_iosched_entry = {
.store = elv_iosched_store,
 };
 
+static struct queue_sysfs_entry queue_unplug_depth_entry = {
+   .attr = {.name = "unplug_depth", .mode = S_IRUGO | S_IWUSR },
+   .show = queue_unplug_depth_show,
+   .store = queue_unplug_depth_store,
+};
+
+static struct queue_sysfs_entry queue_unplug_delay_entry = {
+   .attr = {.name = "unplug_delay_ms", .mode = S_IRUGO | S_IWUSR },
+   .show = queue_unplug_delay_show,
+   .store = queue_unplug_delay_store,
+};
+
 static struct

Re: [kvm-devel] [PATCH RFC 3/3] virtio infrastructure: example block driver

2007-06-01 Thread Jens Axboe
On Fri, Jun 01 2007, Carsten Otte wrote:
 Rusty Russell wrote:
 Now my lack of block-layer knowledge is showing.  I would have thought
 that if we want to do things like ionice(1) to work, we have to do some
 guest scheduling or pass that information down to the host.
 Yea that would only work on the host: one can use ionice to set the io 
 niceness of the entire guest. Individual processes inside the guest 
 are opaque to the host, and thus are opaque to its io scheduler.
 
 It seems preferable to do that in the host, especially when requests 
 of multiple guests end up on the same physical media (shared access, 
 or partitioned).
 
 What's the overhead in doing both?
 With regard to compute power needed, almost none. The penalty is 
 latency, not overhead: A small request may sit on the request queue to 
 wait for other work to arrive until the queue gets unplugged. This 
 penalty is compensated by the benefit of a good chance that more 
 requests will be merged during this time period.
 If we have this method both in host and guest, we have twice the 
 penalty with no added benefit.

I don't buy that argument. We can easily expose the unplug delay, so you
can kill it at what ever level you want. Or you could just do it in the
driver right now, but that is a bit hackish.

 On the other hand, if we choose to hook into q->make_request_fn, we do 
 end up doing far more hypercalls: bios do not get merged on the guest 
 side.  We must do a hypercall per bio in this scenario, or we'll end 
 up adding latency again. In contrast, we can submit the entire content 
 of the queue with a single hypercall when calling at do_request().
 
 A third way out of that situation is to do queueing between guest and 
 host: on the first bio, guest does a hypercall. When the next bio 
 arrives, guest sees that the host has not finished processing the 
 queue yet and pushes another buffer without doing a notification. 
 We've also implemented this, with the result that our host stack was 
 quick enough to practically always process the bio before the guest 
 had the chance to submit another one. Performance was a nightmare, so 
 we discontinued pursuing that idea.

I'd greatly prefer maintaining a request_fn interface for this. The
make_request_fn/request_fn call ratio is at least 1, and typically a lot
larger (4kb blocks with 128kb requests not uncommon, giving 32).

-- 
Jens Axboe

