Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)

2012-07-30 Thread Paolo Bonzini
On 30/07/2012 01:50, Rusty Russell wrote:
 Also, being the first user of chained scatterlist doesn't exactly give
 me warm fuzzies.
 
 We're far from the first user: they've been in the kernel for well over
 7 years.  They were introduced for the block layer, but they tended to
 ignore private uses of scatterlists like this one.

Yeah, but sg_chain has no users in drivers, only a private one in
lib/scatterlist.c.  The internal API could be changed to something else
and leave virtio-scsi screwed...

 Yes, we should do this.  But note that this means an iteration, so we
 might as well combine the loops :)

I'm really bad at posting pseudo-code, but you can count the number of
physically-contiguous entries at the beginning of the list only.  So if
everything is contiguous, you use a single non-indirect buffer and save
a kmalloc.  If you use indirect buffers, I suspect it's much less
effective to collapse physically-contiguous entries.  More elaborate
heuristics do need a loop, though.
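
A minimal sketch of that prefix-count idea (the helper and its name are
made up, not from any posted patch): stop at the first physical
discontinuity, so the all-contiguous case is recognized cheaply.

#include <linux/scatterlist.h>

/* Hypothetical: count how many leading entries form one physical run. */
static unsigned int sg_contig_prefix(struct scatterlist *sgl, unsigned int nents)
{
	struct scatterlist *sg;
	unsigned int i, count = 0;
	dma_addr_t next = 0;

	for_each_sg(sgl, sg, nents, i) {
		if (count && sg_phys(sg) != next)
			break;		/* first discontinuity: stop counting */
		next = sg_phys(sg) + sg->length;
		count++;
	}
	return count;
}

If sg_contig_prefix() covers the whole list, everything is physically
contiguous and a single non-indirect buffer is enough, saving the kmalloc
for the indirect table as described above.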

Paolo


Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)

2012-07-30 Thread Boaz Harrosh
On 07/30/2012 10:12 AM, Paolo Bonzini wrote:

 On 30/07/2012 01:50, Rusty Russell wrote:
 Also, being the first user of chained scatterlist doesn't exactly give
 me warm fuzzies.

 We're far from the first user: they've been in the kernel for well over
 7 years.  They were introduced for the block layer, but they tended to
 ignore private uses of scatterlists like this one.
 
 Yeah, but sg_chain has no users in drivers, only a private one in
 lib/scatterlist.c.  The internal API could be changed to something else
 and leave virtio-scsi screwed...
 
 Yes, we should do this.  But note that this means an iteration, so we
 might as well combine the loops :)
 
 I'm really bad at posting pseudo-code, but you can count the number of
 physically-contiguous entries at the beginning of the list only.  So if
 everything is contiguous, you use a single non-indirect buffer and save
 a kmalloc.  If you use indirect buffers, I suspect it's much less
 effective to collapse physically-contiguous entries.  More elaborate
 heuristics do need a loop, though.
 


[All the below with a grain of salt, from my senile memory]

You must not forget some facts about the scatterlist received here at the LLD.
It has already been DMA-mapped and locked by the generic layer,
which means that the DMA engine has already collapsed physically-contiguous
entries.  Those you get here are already physically unique.
(There were bugs in the past where this was not true; please complain
 if you find them again.)

A scatterlist is two different lists occupying the same space, but with two
different lengths.
- One list is the PAGE pointers plus offset & length, which is bigger than or
  equal to the 2nd list.  The end marker corresponds to this list.

  This list is the input into the DMA engine.

- The second list is the physical DMA addresses list, with their physical
  lengths.  Offset is not needed because it is incorporated in the DMA address.

  This list is the output from the DMA engine.

  The reason the 2nd list is shorter is that the DMA engine tries to minimize
  the physical scatter-list entries, which are usually a limited HW resource.

  This list might follow chains, but its end is determined by the received
  sg_count from the DMA engine, not by the end marker.

At the time my opinion, and I think Rusty agreed, was that the scatterlist
should be split in two.  The input page-ptr list is just the BIO, and the
output of the DMA engine should just be the physical part of the sg_list,
as a separate parameter.  But all this was buried under too many APIs and
the noise was too strong for any single brave soul.

So I'd just blindly trust the returned sg_count from the DMA engine; it is
already optimized.  I THINK
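
Purely as an illustration of "trust the returned sg_count" (not from any
patch in this thread; the device and direction are placeholders):

#include <linux/dma-mapping.h>
#include <linux/scatterlist.h>

static void walk_mapped_sgl(struct device *dev, struct scatterlist *sgl,
			    int nents, enum dma_data_direction dir)
{
	struct scatterlist *sg;
	int mapped, i;

	/* mapped may be smaller than nents: the DMA layer coalesced entries */
	mapped = dma_map_sg(dev, sgl, nents, dir);
	if (!mapped)
		return;

	/* Walk the second, physical list: bounded by the returned count,
	 * not by the page-list end marker. */
	for_each_sg(sgl, sg, mapped, i) {
		dma_addr_t addr = sg_dma_address(sg);
		unsigned int len = sg_dma_len(sg);

		/* program addr/len into the HW descriptor here */
		(void)addr;
		(void)len;
	}

	/* unmap with the original nents, per the DMA API rules */
	dma_unmap_sg(dev, sgl, nents, dir);
}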

 Paolo


Boaz


Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)

2012-07-29 Thread Rusty Russell
On Fri, 27 Jul 2012 10:11:26 +0200, Paolo Bonzini pbonz...@redhat.com wrote:
 On 27/07/2012 08:27, Rusty Russell wrote:
   +int virtqueue_add_buf_sg(struct virtqueue *_vq,
   + struct scatterlist *sg_out,
   + unsigned int out,
   + struct scatterlist *sg_in,
   + unsigned int in,
   + void *data,
   + gfp_t gfp)
  The point of chained scatterlists is they're self-terminated, so the
  in & out counts should be calculated.
  
  Counting them is not *that* bad, since we're about to read them all
  anyway.
  
  (Yes, the chained scatterlist stuff is complete crack, but I lost that
  debate years ago.)
  
  Here's my variant.  Networking, console and block seem OK, at least
  (ie. it booted!).
 
 I hate the for loops, even though we are indeed about to read all the
 scatterlists anyway... all they do is lengthen critical sections.

You're preaching to the choir: I agree.  But even more, I hate the
passing of a number and a scatterlist: it makes it doubly ambiguous
whether we should use the number or the terminator.  And I really hate
bad APIs, even more than a bit of icache loss.

 Also, being the first user of chained scatterlist doesn't exactly give
 me warm fuzzies.

We're far from the first user: they've been in the kernel for well over
7 years.  They were introduced for the block layer, but they tended to
ignore private uses of scatterlists like this one.

 I think it's simpler if we provide an API to add individual buffers to
 the virtqueue, so that you can do multiple virtqueue_add_buf_more
 (whatever) before kicking the virtqueue.  The idea is that I can still
 use indirect buffers for the scatterlists that come from the block layer
 or from an skb, but I will use direct buffers for the request/response
 descriptors.  The direct buffers are always a small number (usually 2),
 so you can balance the effect by making the virtqueue bigger.  And for
 small reads and writes, you save a kmalloc on a very hot path.

This is why I hate chained scatterlists: there's no sane way to tell the
difference between passing a simple array and passing a chain.  We're
mugging the type system.

I think the right way of doing this is a flag.  We could abuse gfp_t,
but that's super-ugly.  Perhaps we should create our own
VQ_ATOMIC/VQ_KERNEL/VQ_DIRECT enum?
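
Something along these lines, strictly as a sketch of the shape (none of
this exists in the virtio API today):

#include <linux/scatterlist.h>
#include <linux/virtio.h>

/* Hypothetical: let the caller say how the buffers should be added,
 * instead of abusing gfp_t. */
enum vq_add_mode {
	VQ_KERNEL,	/* may sleep, indirect table allowed      */
	VQ_ATOMIC,	/* atomic context, indirect table allowed */
	VQ_DIRECT,	/* never allocate an indirect table       */
};

int virtqueue_add_buf_mode(struct virtqueue *vq,
			   struct scatterlist *sg_out,	/* chained, self-terminated */
			   struct scatterlist *sg_in,	/* chained, self-terminated */
			   void *data,
			   enum vq_add_mode mode);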

Or we could do what we previously suggested, and try to do some
intelligent size heuristic.  I've posted a patch from 3 years ago below.
 
 (BTW, scatterlists will have separate entries for each page; we do not
 need this in virtio buffers.  Collapsing physically-adjacent entries
 will speed up QEMU and will also help avoiding indirect buffers).

Yes, we should do this.  But note that this means an iteration, so we
might as well combine the loops :)

Cheers,
Rusty.

FIXME: remove printk
virtio: use indirect buffers based on demand (heuristic)

virtio_ring uses a ring buffer of descriptors: indirect support allows
a single descriptor to refer to a table of descriptors.  This saves
space in the ring, but requires a kmalloc/kfree.

Rather than try to figure out the right threshold at which to use
indirect buffers, we drop the threshold dynamically when the ring is
under stress.

Note: to stress this, I reduced the ring size to 32 in lguest, and a
1G send reduced the threshold to 9.

Note2: I moved the BUG_ON()s above the indirect test, where they belong
(indirect falls thru on OOM, so the constraints still apply).
---
 drivers/virtio/virtio_ring.c |   61 ---
 1 file changed, 52 insertions(+), 9 deletions(-)

diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -63,6 +63,8 @@ struct vring_virtqueue
 
/* Host supports indirect buffers */
bool indirect;
+   /* Threshold before we go indirect. */
+   unsigned int indirect_threshold;
 
/* Number of free buffers */
unsigned int num_free;
@@ -139,6 +141,32 @@ static int vring_add_indirect(struct vri
return head;
 }
 
+static void adjust_threshold(struct vring_virtqueue *vq,
+			     unsigned int out, unsigned int in)
+{
+	/* There are really two species of virtqueue, and it matters here.
+	 * If there are no output parts, it's a normally full receive queue,
+	 * otherwise it's a normally empty send queue. */
+	if (out) {
+		/* Leave threshold unless we're full. */
+		if (out + in < vq->num_free)
+			return;
+	} else {
+		/* Leave threshold unless we're empty. */
+		if (vq->num_free != vq->vring.num)
+			return;
+	}
+
+	/* Never drop threshold below 1 */
+	vq->indirect_threshold /= 2;
+	vq->indirect_threshold |= 1;
+
+	printk("%s %s: indirect 

Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)

2012-07-27 Thread Rusty Russell
On Thu, 26 Jul 2012 15:05:39 +0200, Paolo Bonzini pbonz...@redhat.com wrote:
 On 26/07/2012 09:58, Paolo Bonzini wrote:
  
   Please CC me on the convert to sg copy-less patches, It looks 
   interesting
  Sure.
 
 Well, here is the gist of it (note it won't apply on any public tree,
 hence no SoB yet).  It should be split in multiple changesets and you
 can make more simplifications on top of it, because
 virtio_scsi_target_state is no longer variable-sized, but that's
 secondary.

ISTR starting on such a patch years ago, but the primitives to
manipulate a chained sg_list were nonexistent, so I dropped it,
waiting for it to be fully-baked or replaced.  That hasn't happened:

 + /* Remove scatterlist terminator, we will tack more items soon.  */
 + vblk->sg[num + out - 1].page_link &= ~0x2;

I hate this interface:

 +int virtqueue_add_buf_sg(struct virtqueue *_vq,
 +  struct scatterlist *sg_out,
 +  unsigned int out,
 +  struct scatterlist *sg_in,
 +  unsigned int in,
 +  void *data,
 +  gfp_t gfp)

The point of chained scatterlists is they're self-terminated, so the
in & out counts should be calculated.

Counting them is not *that* bad, since we're about to read them all
anyway.

(Yes, the chained scatterlist stuff is complete crack, but I lost that
debate years ago.)

Here's my variant.  Networking, console and block seem OK, at least
(ie. it booted!).

From: Rusty Russell ru...@rustcorp.com.au
Subject: virtio: use chained scatterlists.

Rather than handing a scatterlist[] and out and in numbers to
virtqueue_add_buf(), hand two separate ones which can be chained.

I shall refrain from ranting about what a disgusting hack chained
scatterlists are.  I'll just note that this doesn't make things
simpler (see diff).

The scatterlists we use can be too large for the stack, so we put them
in our device struct and reuse them.  But in many cases we don't want
to pay the cost of sg_init_table() as we don't know how many elements
we'll have and we'd have to initialize the entire table.

This means we have two choices: carefully reset the end markers after
we call virtqueue_add_buf(), which we do in virtio_net for the xmit
path where it's easy and we want to be optimal.  Elsewhere we
implement a helper to unset the end markers after we've filled the
array.

Signed-off-by: Rusty Russell ru...@rustcorp.com.au
---
 drivers/block/virtio_blk.c  |   37 +-
 drivers/char/hw_random/virtio-rng.c |2 -
 drivers/char/virtio_console.c   |6 +--
 drivers/net/virtio_net.c|   67 ++---
 drivers/virtio/virtio_balloon.c |6 +--
 drivers/virtio/virtio_ring.c|   71 ++--
 include/linux/virtio.h  |5 +-
 net/9p/trans_virtio.c   |   46 +--
 8 files changed, 159 insertions(+), 81 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -100,6 +100,14 @@ static void blk_done(struct virtqueue *v
spin_unlock_irqrestore(vblk-disk-queue-queue_lock, flags);
 }
 
+static void sg_unset_end_markers(struct scatterlist *sg, unsigned int num)
+{
+	unsigned int i;
+
+	for (i = 0; i < num; i++)
+		sg[i].page_link &= ~0x02;
+}
+
 static bool do_req(struct request_queue *q, struct virtio_blk *vblk,
   struct request *req)
 {
@@ -140,6 +148,7 @@ static bool do_req(struct request_queue 
}
}
 
+	/* We lay out the scatterlist in a single array, out then in. */
 	sg_set_buf(&vblk->sg[out++], &vbr->out_hdr, sizeof(vbr->out_hdr));
 
 	/*
@@ -151,17 +160,8 @@ static bool do_req(struct request_queue 
 	if (vbr->req->cmd_type == REQ_TYPE_BLOCK_PC)
 		sg_set_buf(&vblk->sg[out++], vbr->req->cmd, vbr->req->cmd_len);
 
+	/* This marks the end of the sg list at vblk->sg[out]. */
 	num = blk_rq_map_sg(q, vbr->req, vblk->sg + out);
-
-	if (vbr->req->cmd_type == REQ_TYPE_BLOCK_PC) {
-		sg_set_buf(&vblk->sg[num + out + in++], vbr->req->sense,
-			   SCSI_SENSE_BUFFERSIZE);
-		sg_set_buf(&vblk->sg[num + out + in++], &vbr->in_hdr,
-			   sizeof(vbr->in_hdr));
-	}
-
-	sg_set_buf(&vblk->sg[num + out + in++], &vbr->status,
-		   sizeof(vbr->status));
-
 	if (num) {
 		if (rq_data_dir(vbr->req) == WRITE) {
 			vbr->out_hdr.type |= VIRTIO_BLK_T_OUT;
@@ -172,7 +172,22 @@ static bool do_req(struct request_queue 
}
}
 
-	if (virtqueue_add_buf(vblk->vq, vblk->sg, out, in, vbr, GFP_ATOMIC) < 0) {
+	if (vbr->req->cmd_type == REQ_TYPE_BLOCK_PC) {
+		sg_set_buf(&vblk->sg[out + in++], vbr->req->sense,
+			   SCSI_SENSE_BUFFERSIZE);
+

Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)

2012-07-27 Thread Paolo Bonzini
On 27/07/2012 08:27, Rusty Russell wrote:
  +int virtqueue_add_buf_sg(struct virtqueue *_vq,
  +   struct scatterlist *sg_out,
  +   unsigned int out,
  +   struct scatterlist *sg_in,
  +   unsigned int in,
  +   void *data,
  +   gfp_t gfp)
 The point of chained scatterlists is they're self-terminated, so the
 in & out counts should be calculated.
 
 Counting them is not *that* bad, since we're about to read them all
 anyway.
 
 (Yes, the chained scatterlist stuff is complete crack, but I lost that
 debate years ago.)
 
 Here's my variant.  Networking, console and block seem OK, at least
 (ie. it booted!).

I hate the for loops, even though we are indeed about to read all the
scatterlists anyway... all they do is lengthen critical sections.  Also,
being the first user of chained scatterlist doesn't exactly give me warm
fuzzies.

I think it's simpler if we provide an API to add individual buffers to
the virtqueue, so that you can do multiple virtqueue_add_buf_more
(whatever) before kicking the virtqueue.  The idea is that I can still
use indirect buffers for the scatterlists that come from the block layer
or from an skb, but I will use direct buffers for the request/response
descriptors.  The direct buffers are always a small number (usually 2),
so you can balance the effect by making the virtqueue bigger.  And for
small reads and writes, you save a kmalloc on a very hot path.
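
Roughly like this, as a sketch only (virtqueue_add_buf_more does not
exist; its signature is invented here to match the description, and the
request/response buffers are passed as opaque pointers):

#include <linux/gfp.h>
#include <linux/scatterlist.h>
#include <linux/virtio.h>
#include <scsi/scsi_cmnd.h>

/* Hypothetical API being proposed; declared only so the sketch is complete. */
int virtqueue_add_buf_more(struct virtqueue *vq, struct scatterlist *sg,
			   unsigned int out, unsigned int in,
			   void *data, gfp_t gfp);

static int queue_cmd_sketch(struct virtqueue *vq, struct scsi_cmnd *sc,
			    void *req, size_t req_size,
			    void *resp, size_t resp_size, void *token)
{
	struct scatterlist hdr, status;

	sg_init_one(&hdr, req, req_size);
	sg_init_one(&status, resp, resp_size);

	/* Small request/response descriptors as direct buffers... */
	virtqueue_add_buf_more(vq, &hdr, 1, 0, token, GFP_ATOMIC);

	/* ...while the block-layer data, if any, may go indirect. */
	if (scsi_out(sc)->table.nents)
		virtqueue_add_buf_more(vq, scsi_out(sc)->table.sgl,
				       scsi_out(sc)->table.nents, 0,
				       token, GFP_ATOMIC);

	virtqueue_add_buf_more(vq, &status, 0, 1, token, GFP_ATOMIC);

	/* One kick once everything for the request has been queued. */
	virtqueue_kick(vq);
	return 0;
}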

(BTW, scatterlists will have separate entries for each page; we do not
need this in virtio buffers.  Collapsing physically-adjacent entries
will speed up QEMU and will also help avoiding indirect buffers).
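
For illustration, collapsing is a single pass over the list (hypothetical
helper and output type, not part of any patch here):

#include <linux/scatterlist.h>

struct addr_len {			/* stand-in for one virtio descriptor */
	dma_addr_t addr;
	unsigned int len;
};

/* Flatten an sg list into address/length pairs, merging physically-adjacent
 * entries so the host sees fewer descriptors. */
static unsigned int flatten_merged(struct scatterlist *sgl, unsigned int nents,
				   struct addr_len *out)
{
	struct scatterlist *sg;
	unsigned int i, n = 0;

	for_each_sg(sgl, sg, nents, i) {
		if (n && out[n - 1].addr + out[n - 1].len == sg_phys(sg)) {
			out[n - 1].len += sg->length;	/* extend the previous run */
		} else {
			out[n].addr = sg_phys(sg);
			out[n].len = sg->length;
			n++;
		}
	}
	return n;
}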

Paolo





Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)

2012-07-26 Thread Paolo Bonzini
On 25/07/2012 23:04, Boaz Harrosh wrote:
 That not all architectures have ARCH_HAS_SG_CHAIN (though all those I
 care about do).  So I need to go through all architectures and make sure
 they use for_each_sg, or at least to change ARCH_HAS_SG_CHAIN to a
 Kconfig define so that dependencies can be expressed properly.
 
 What is actually preventing ARCH_HAS_SG_CHAIN on all these ARCHes?
 Is it that the DMA drivers are not using for_each_sg()?  Sounds easy
 to fix.
 
 But yes, a deep change would convert ARCH_HAS_SG_CHAIN to a Kconfig.
 
 If you want to be lazy, like me, you might just put a BUILD_BUG_ON
 in code, requesting the user to disable the driver for this ARCH.
 
 I bet there are more things to do at the ARCH level to enable virtualization
 than just supporting ARCH_HAS_SG_CHAIN.  Be it just another requirement.

Actually, many arches run just fine with QEMU's binary code translation
(alpha, mips, coldfire).  And since it's already pretty slow, using
virtio to improve performance is nice even in that scenario (which does
not require any kernel change).  But it's just icing on the cake,
indeed.  I can live with a BUILD_BUG_ON, or better a "depends on
HAVE_SG_CHAIN", for the time being.

 And BTW you won't need that new __sg_set_page API anymore.

 Kind of.

sg_init_table(sg, 2);
sg_set_buf(sg[0], req, sizeof(req));
sg_chain(sg[1], scsi_out(sc));

 is still a little bit worse than

__sg_set_buf(sg[0], req, sizeof(req));
__sg_chain(sg[1], scsi_out(sc));
 
 
 I believe they are the same, especially for the
 on-the-stack 2-element array.  Actually I think
 in both cases you need to at least call sg_init_table()
 once after allocation, no?

Because the scatterlist here would be allocated on the stack, I would
need to call it every time if I used sg_set_buf/sg_chain.

But sg_init_table is really just memset + set SG_MAGIC + sg_mark_end.
If you use the full-init APIs as above, you don't need any of those
(sg_chain undoes the flag changes in sg_mark_end and vice versa).

 But please, for my sake, do not call it __sg_chain.  Call it
 something like sg_chain_not_end().  I hate those __ which
 for god's sake mean what?
 (A good name is when I don't have to read the code; an __
  means fuck you, go read the code)

Right, better call them sg_init_buf/sg_init_page/sg_init_chain.
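
A sketch of what those helpers might look like (they do not exist; this
just spells out the "full init, no sg_init_table" idea, assuming
ARCH_HAS_SG_CHAIN):

#include <linux/string.h>
#include <linux/scatterlist.h>

/* Fully initialize one entry as a data buffer.  No end marker is set;
 * termination comes from whatever the final entry chains to. */
static inline void sg_init_buf(struct scatterlist *sg,
			       const void *buf, unsigned int buflen)
{
	memset(sg, 0, sizeof(*sg));
#ifdef CONFIG_DEBUG_SG
	sg->sg_magic = SG_MAGIC;
#endif
	sg_set_buf(sg, buf, buflen);
}

/* Fully initialize one entry as a chain link to another list. */
static inline void sg_init_chain(struct scatterlist *sg,
				 struct scatterlist *chain_to)
{
	memset(sg, 0, sizeof(*sg));
#ifdef CONFIG_DEBUG_SG
	sg->sg_magic = SG_MAGIC;
#endif
	sg_chain(sg, 1, chain_to);	/* sets the chain bit, clears the end bit */
}

Used as in the two-element example above:

	struct scatterlist sg[2];

	sg_init_buf(&sg[0], &req, sizeof(req));
	sg_init_chain(&sg[1], scsi_out(sc)->table.sgl);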

In the meanwhile, we still have a bug to fix, and we need to choose
between Sen Wang's v1 (sg_set_page) or v2 (value assignment).  I'm still
leaning more towards v2, if only because I already tested that one myself.

Paolo


Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)

2012-07-26 Thread Paolo Bonzini
On 26/07/2012 09:56, Boaz Harrosh wrote:
  In the meanwhile, we still have a bug to fix, and we need to choose
  between Sen Wang's v1 (sg_set_page) or v2 (value assignment).  I'm still
  leaning more towards v2, if only because I already tested that one myself.
 
 It's your call, you know what I think ;-)
 
 I agree none of them are bugs in the current code; they will both work
 just the same.

Cool.  Then, Sen, please fix the commit message in v2 and repost.

 Please CC me on the convert to sg copy-less patches, It looks interesting

Sure.

Paolo


Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)

2012-07-26 Thread Paolo Bonzini
On 26/07/2012 09:58, Paolo Bonzini wrote:
 
  Please CC me on the convert to sg copy-less patches, It looks interesting
 Sure.

Well, here is the gist of it (note it won't apply on any public tree,
hence no SoB yet).  It should be split in multiple changesets and you
can make more simplifications on top of it, because
virtio_scsi_target_state is no longer variable-sized, but that's
secondary.

The patch includes the conversion of virtio_ring.c to use sg_next.
It is a bit ugly because of backwards compatibility; it can be fixed
by going through all drivers; not that there are many.

I'm still a bit worried about the future of this feature, though.  I would
be the first and (because of the non-portability) presumably sole user
for some time.  Someone axing chained sg-lists in the future does not
seem too unlikely.  So in practice I would rather just add to virtio a
function that takes two struct scatterlist ** (or an array of struct
scatterlist * + size pairs).  Converting to sg_chain once it takes off
would be trivial.
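
For concreteness, that alternative interface could look something like
this (names and signature are hypothetical; nothing like it is in any
tree):

#include <linux/scatterlist.h>
#include <linux/virtio.h>

/* One piece of a request: a flat (unchained) scatterlist plus its size. */
struct virtqueue_buf {
	struct scatterlist *sgl;
	unsigned int nents;
};

/* virtio would walk each piece in turn, so callers never need
 * ARCH_HAS_SG_CHAIN and never flatten or copy their scatterlists. */
int virtqueue_add_buf_multi(struct virtqueue *vq,
			    struct virtqueue_buf *out, unsigned int out_bufs,
			    struct virtqueue_buf *in, unsigned int in_bufs,
			    void *data, gfp_t gfp);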

Paolo
From 57f8d4a20cbe9b3b25e10cd0595d7ac102fc8f73 Mon Sep 17 00:00:00 2001
From: Paolo Bonzini pbonz...@redhat.com
Date: Thu, 26 Jul 2012 09:58:14 +0200
Subject: [PATCH] virtio-scsi: use chained sg_lists

Using chained sg_lists simplifies everything a lot.
The scatterlists we pass to virtio are always of bounded size,
and can be allocated on the stack.  This means we do not need to
take tgt_lock, and struct virtio_scsi_target_state no longer has
a flexible array at the end, so we can avoid a pointer
access.
---
 drivers/block/virtio_blk.c   |3 +
 drivers/scsi/virtio_scsi.c   |   93 ---
 drivers/virtio/virtio_ring.c |   93 +++
 include/linux/virtio.h   |8 +++
 4 files changed, 114 insertions(+), 83 deletions(-)

diff --git a/drivers/scsi/virtio_scsi.c b/drivers/scsi/virtio_scsi.c
index 13f7ccb..ef65ea1 100644
--- a/drivers/scsi/virtio_scsi.c
+++ b/drivers/scsi/virtio_scsi.c
@@ -66,9 +66,6 @@ struct virtio_scsi_target_state {
 	struct virtio_scsi_vq *req_vq;
 
 	atomic_t reqs;
-
-	/* For sglist construction when adding commands to the virtqueue.  */
-	struct scatterlist sg[];
 };
 
 /* Driver instance state */
@@ -362,20 +359,6 @@ static void virtscsi_event_done(struct virtqueue *vq)
 	spin_unlock_irqrestore(&vscsi->event_vq.vq_lock, flags);
 };
 
-static void virtscsi_map_sgl(struct scatterlist *sg, unsigned int *p_idx,
-			 struct scsi_data_buffer *sdb)
-{
-	struct sg_table *table = &sdb->table;
-	struct scatterlist *sg_elem;
-	unsigned int idx = *p_idx;
-	int i;
-
-	for_each_sg(table->sgl, sg_elem, table->nents, i)
-		sg_set_buf(&sg[idx++], sg_virt(sg_elem), sg_elem->length);
-
-	*p_idx = idx;
-}
-
 /**
  * virtscsi_map_cmd - map a scsi_cmd to a virtqueue scatterlist
  * @vscsi	: virtio_scsi state
@@ -384,52 +367,57 @@ static void virtscsi_map_sgl(struct scatterlist *sg, unsigned int *p_idx,
  * @in_num	: number of write-only elements
  * @req_size	: size of the request buffer
  * @resp_size	: size of the response buffer
- *
- * Called with tgt_lock held.
  */
-static void virtscsi_map_cmd(struct virtio_scsi_target_state *tgt,
-			 struct virtio_scsi_cmd *cmd,
-			 unsigned *out_num, unsigned *in_num,
+static void virtscsi_map_cmd(struct virtio_scsi_cmd *cmd,
+			 struct scatterlist *sg_out,
+			 unsigned *out_num,
+			 struct scatterlist *sg_in,
+			 unsigned *in_num,
 			 size_t req_size, size_t resp_size)
 {
 	struct scsi_cmnd *sc = cmd->sc;
-	struct scatterlist *sg = tgt->sg;
-	unsigned int idx = 0;
+
+	sg_init_table(sg_out, 2);
+	sg_init_table(sg_in, 2);
 
 	/* Request header.  */
-	sg_set_buf(&sg[idx++], &cmd->req, req_size);
+	sg_set_buf(&sg_out[0], &cmd->req, req_size);
+	*out_num = 1;
 
 	/* Data-out buffer.  */
-	if (sc && sc->sc_data_direction != DMA_FROM_DEVICE)
-		virtscsi_map_sgl(sg, &idx, scsi_out(sc));
-
-	*out_num = idx;
+	if (sc && sc->sc_data_direction != DMA_FROM_DEVICE) {
+		struct sg_table *table = &scsi_out(sc)->table;
+		sg_chain(sg_out, 2, table->sgl);
+		*out_num += table->nents;
+	}
 
 	/* Response header.  */
-	sg_set_buf(&sg[idx++], &cmd->resp, resp_size);
+	sg_set_buf(&sg_in[0], &cmd->resp, resp_size);
+	*in_num = 1;
 
 	/* Data-in buffer */
-	if (sc && sc->sc_data_direction != DMA_TO_DEVICE)
-		virtscsi_map_sgl(sg, &idx, scsi_in(sc));
-
-	*in_num = idx - *out_num;
+	if (sc && sc->sc_data_direction != DMA_TO_DEVICE) {
+		struct sg_table *table = &scsi_in(sc)->table;
+		sg_chain(sg_in, 2, table->sgl);
+		*in_num += table->nents;
+	}
 }
 
-static int virtscsi_kick_cmd(struct virtio_scsi_target_state *tgt,
-			 struct virtio_scsi_vq *vq,
+static int virtscsi_kick_cmd(struct virtio_scsi_vq *vq,
 			 struct virtio_scsi_cmd *cmd,
 			 size_t req_size, size_t resp_size, gfp_t gfp)
 {
+	struct scatterlist sg_out[2], sg_in[2];
 	unsigned int out_num, in_num;
 	unsigned long flags;
 	int ret;
 
-	spin_lock_irqsave(&tgt->tgt_lock, flags);
-	virtscsi_map_cmd(tgt, cmd, 

virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)

2012-07-25 Thread Paolo Bonzini
On 25/07/2012 15:26, Boaz Harrosh wrote:
 In SCSI land most LLDs should support chaining just by virtue of using the
 for_each_sg macro.  That's all it takes.  Your code above does support it.

 Yes, it supports it but still has to undo them before passing to virtio.

 What my LLD does is add a request descriptor in front of the scatterlist
 that the LLD receives.  I would like to do this with a 2-item
 scatterlist: one for the request descriptor, and one which is a chain to
 the original scatterlist.  
 
 I hate that plan.  Why override the scatterlist element yet again with a third
 union of a request descriptor?

I'm not overriding (or did you mean overloading?) anything, and I think
you're reading too much into my words.

What I am saying is (for a WRITE command):

1) what I get is a scsi_cmnd which contains an N-element scatterlist.

2) virtio-scsi has to build the packet that is passed to the hardware
(it does not matter that the hardware is virtual).  This packet (per
virtio-scsi spec) has an N+1-element scatterlist, where the first
element is a request descriptor (struct virtio_scsi_cmd_req), and the
others describe the written data.

3) virtio takes care of converting the packet from a scatterlist
(which currently must be a flat one) to the hardware representation.
Here a walk is inevitable, so we don't care about this walk.

4) What I'm doing now: copying (and flattening) the N-element
scatterlist onto the last elements of an N+1 array that I pass to virtio.

        _ _ _ _ _ _
       |_|_|_|_|_|_|    scsi_cmnd scatterlist

        vvv  COPY  vvv
      _ _ _ _ _ _ _
     |_|_|_|_|_|_|_|    scatterlist passed to virtio
      |
      virtio_scsi_cmd_req

Then I hand off the scatterlist to virtio.  virtio walks it and converts
it to hardware format.

5) What I want to do: create a 2-element scatterlist, the first being
the request descriptor and the second chaining to scsi_cmnd's N-element
scatterlist.

            _ _ _ _ _ _
           |_|_|_|_|_|_|    scsi_cmnd scatterlist
      _ _ /
     |_|C|      scatterlist passed to virtio
      |
      virtio_scsi_cmd_req

Then I hand off the scatterlist to virtio.  virtio will still walk the
scatterlist chain, and convert it to N+1 elements for the hardware to
consume.  Still, removing one walk largely reduces the length of my
critical sections.  I also save some pointer-chasing because the
2-element scatterlist are short-lived and can reside on the stack.
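
Concretely, step 5 boils down to something like this (a sketch, assuming
ARCH_HAS_SG_CHAIN; the request descriptor is passed as an opaque buffer,
i.e. the struct virtio_scsi_cmd_req mentioned above):

#include <linux/scatterlist.h>
#include <scsi/scsi_cmnd.h>

static void build_out_sgl(struct scsi_cmnd *sc,
			  void *req, size_t req_size,
			  struct scatterlist sg[2])
{
	sg_init_table(sg, 2);

	/* Element 0: the request descriptor the device reads first. */
	sg_set_buf(&sg[0], req, req_size);

	/* Element 1: chain straight into the midlayer's N-element data
	 * scatterlist; no copying, no flattening. */
	sg_chain(sg, 2, scsi_out(sc)->table.sgl);
}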


Other details (you can probably skip these):

There is also a response descriptor.  In the case of writes this is the
only element that the hardware will write to, so for writes the
scatterlist written by the hardware has only one element and does not
need chaining.

Reads are entirely symmetric.  The hardware will read the request
descriptor from a 1-element scatterlist, and will write response+data
into an N+1-element scatterlist (the response descriptor precedes the
data that was read).  It can be treated in exactly the same way.  The
N+1-element scatterlist could also become a 2-element scatterlist with
chaining.

 Except that if I call sg_chain and my
 architecture does not define ARCH_HAS_SG_CHAIN, it will BUG out.  So I
 need to keep all the scatterlist allocation and copying crap that I have
 now within #ifdef, and it will bitrot.
 
 except that with the correct design you don't call sg_chain you just do:
   request_descriptor->sg_list = sg;

By the above it should be clear, that request_descriptor is not a
driver-private extension of the scsi_cmnd.  It is something passed to
the hardware.

 Well that can be done as well, (If done carefully) It should be easy to add
 chained fragments to both the end of a given chain just as at beginning.
 It only means that the last element of the appended-to chain moves to
 the next fragment and its place is replaced by a link.

 But you cannot do that in constant time, can you?  And that means you do
 not enjoy any benefit in terms of cache misses etc.
 
 I did not understand "constant time"; it is O(1), if that's what you meant.

In the worst case it is a linked list, no?  So in the worst case
_finding_ the last element of the appended-to chain is O(n).

Actually, appending to the end is not a problem for virtio-scsi.  But
for example the virtio-blk spec places the response descriptor _after_
the input data.  I think this was a mistake, and I didn't repeat it for
virtio-scsi, but I cited it because in the past Rusty complained that
the long sglist implementation was SCSI-centric.

 Also, this assumes that I can modify the appended-to chain.  I'm not
 sure I can do this?
 
 Each case it's own. If the appended-to chain is const, yes you'll have
 to reallocate it and copy. Is that your case?

It will be virtio-blk's case, but we can ignore it.

Paolo


Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)

2012-07-25 Thread Paolo Bonzini
On 25/07/2012 17:28, Boaz Harrosh wrote:
 1) what I get is a scsi_cmnd which contains an N-element scatterlist.

 2) virtio-scsi has to build the packet that is passed to the hardware
 (it does not matter that the hardware is virtual).  This packet (per
 virtio-scsi spec) has an N+1-element scatterlist, where the first
 element is a request descriptor (struct virtio_scsi_cmd_req), and the
 others describe the written data.
 
 Then the virtio-scsi spec is crap.  It overloads the meaning of the
 struct scatterlist of the first element in an array to be a
 struct virtio_scsi_cmd_req.

What the holy fuck?  The first element simply _points_ to the struct
virtio_scsi_cmd_req, just like subsequent elements point to the data.

And the protocol of the device is _not_ a struct scatterlist[].  The
virtio _API_ takes that array and converts to a series of physical
address + offset pairs.

 Since you need to change the standard to support chaining then
 it is a good time to fix this.

Perhaps it is a good time for you to read the virtio spec.  You are
badly confusing the LLD-virtio interface with the virtio-hardware
interface.  I'm talking only about the former.

 3) virtio takes care of converting the packet from a scatterlist
 (which currently must be a flat one) to the hardware representation.
 Here a walk is inevitable, so we don't care about this walk.
 
 "hardware representation" you mean aio or biovec, whatever the
 IO submission path uses at the host end?

No, I mean the way the virtio spec encodes the physical address + offset
pairs.

I stopped reading here.

Paolo


Re: virtio(-scsi) vs. chained sg_lists (was Re: [PATCH] scsi: virtio-scsi: Fix address translation failure of HighMem pages used by sg list)

2012-07-25 Thread Boaz Harrosh
On 07/25/2012 11:06 PM, Paolo Bonzini wrote:

 On 25/07/2012 21:16, Boaz Harrosh wrote:
 The picture confused me.  It looked like the first element is the
 virtio_scsi_cmd_req, not an sglist element that points to the
 struct's buffer.

 In that case then yes, your plan of making a two-element fragment that
 points to the original scsi sglist is perfect.  All you have to do is
 that, and all you have to do at virtio is use the for_each_sg macro
 and you are done.

 You don't need any sglist allocation or reshaping.  And you can easily
 support chaining.  Looks like an order of magnitude simpler than what
 you do now.
 
 It is.
 
 So what is the problem?
 
 That not all architectures have ARCH_HAS_SG_CHAIN (though all those I
 care about do).  So I need to go through all architectures and make sure
 they use for_each_sg, or at least to change ARCH_HAS_SG_CHAIN to a
 Kconfig define so that dependencies can be expressed properly.
 


What is actually preventing ARCH_HAS_SG_CHAIN on all these ARCHes?
Is it that the DMA drivers are not using for_each_sg()?  Sounds easy
to fix.

But yes, a deep change would convert ARCH_HAS_SG_CHAIN to a Kconfig.

If you want to be lazy, like me, you might just put a BUILD_BUG_ON
in code, requesting the user to disable the driver for this ARCH.

I bet there are more things to do at the ARCH level to enable virtualization
than just supporting ARCH_HAS_SG_CHAIN.  Be it just another requirement.

If you document it and make sure current ARCHs are fine, it should
not ever trigger.
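
For example, something along these lines (placement and function name are
hypothetical; the point is only that the build fails where chained
sg_lists are unsupported):

#include <linux/module.h>
#include <linux/bug.h>
#include <linux/scatterlist.h>	/* ARCH_HAS_SG_CHAIN comes in via the asm headers */

static int __init virtscsi_chain_check_init(void)
{
#ifndef ARCH_HAS_SG_CHAIN
	/* No sg chaining on this ARCH: break the build and ask the user
	 * to disable the driver instead. */
	BUILD_BUG_ON(1);
#endif
	return 0;
}
module_init(virtscsi_chain_check_init);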

 And BTW you won't need that new __sg_set_page API anymore.
 
 Kind of.
 
sg_init_table(sg, 2);
sg_set_buf(sg[0], req, sizeof(req));
sg_chain(sg[1], scsi_out(sc));
 
 is still a little bit worse than
 
__sg_set_buf(sg[0], req, sizeof(req));
__sg_chain(sg[1], scsi_out(sc));
 


I believe they are the same, especially for the
on-the-stack 2-element array.  Actually I think
in both cases you need to at least call sg_init_table()
once after allocation, no?

Your old code, with the big array copy and re-shaping, was
a better example of the need for your new API, which I agree with.

But please, for my sake, do not call it __sg_chain.  Call it
something like sg_chain_not_end().  I hate those __ which
for god's sake mean what?
(A good name is when I don't have to read the code; an __
 means fuck you, go read the code)

 Paolo


Thanks
Boaz