Re: IB: increase DMA max_segment_size on Mellanox hardware

2011-01-24 Thread Or Gerlitz
David Dillow wrote:
 funny, if I understand this correctly: assuming each scsi_host has a dma device
 associated with it, sg_dma_map/unmap could be taken out of the SCSI LLD code
 and be done in higher levels; someone would just have to scan every queuecommand
 and command-response code flow to clean that out, anyway
 
 Perhaps, except for the need for different devices to call different map
 commands, such as ib_dma_*() and sbus_dma_*() etc. There may be other
 reasons as well.

yes, I tend to think that there could be other reasons; else, if this (calling 
dma mapping from the SCSI LLD queuecommand code) is just something which was 
needed in the 2.4 days but not any more, the day will come when someone will 
pick up the glove and make this cleanup. 

BTW - I consider ib_dma_map_xxx and friends a bug and not a feature. It was 
pushed by the ipath/qib guys, and I held the minority opinion that we should 
stick to dma_map_xxx and have these drivers implement their own software IOMMU 
emulation. The reason my opinion wasn't accepted is that they're supported only 
on 64-bit platforms, on which (that was the claim back then) each page has a 
kernel virtual address associated with it. Oh well.

 I'm still trying to understand the bigger picture with your patch set and 
 what
 role the mlx4/mthca patch has in it

 This patch is orthogonal to the SRP mapping changes and was not used for
 the performance numbers listed -- the 50% improvement was achieved with
 _no_ coalescing in the SG lists. Each 4 KB page was its own SG entry.

I see, thanks for clarifying this loud and clear. So maybe you'll first get 
the seven srp patches reviewed and merged, and only then see if/what benefit 
this patch brings. I'm a little bit worried about changing something under 
everyone's (srp, iser, rds, p9, nfs-rdma, lustre, etc) feet which doesn't have 
any notable benefit; I would be happy to hear others' opinions

Or.


Re: [RFC 0/8] Reliably generate large request from SRP

2011-01-24 Thread Bart Van Assche
On Mon, Jan 24, 2011 at 4:32 PM, Or Gerlitz <ogerl...@voltaire.com> wrote:
 David Dillow wrote:
 if we look at the 50% for SAS/1M IOs that you're presenting, can you tell
 what made the difference? srp went from an sg_tablesize of 255 to 256, so the
 upper layers were able to provide 1M as one IO

 This win is from sg_tablesize going from 255 to 256 in this case; the HW
 really likes that better than getting two requests -- one for 1020 KB
 and one for 4 KB.

 It's always nice to find the simplest explanation for the greatest 
 improvement... moving on to the 2nd largest gains

 SAS   2M      520 MB/s        861 MB/s
 SAS   4M      529 MB/s        921 MB/s
 SAS   8M      600 MB/s        951 MB/s

 I wonder what made the difference here? It can't be only the 255 -> 256 
 sg_tablesize change. For the 2M case, the change to use 512-page FMRs could 
 let you use one rkey/fmr for the whole IO, but not for 4M/8M

I think it would be interesting to have performance measurements with
a RAM disk as target too, because for someone not familiar with the
internals of the target used in this test it is hard to tell which
performance gain is due to the initiator changes and which is due to
the target behavior.

Bart.


RE: ibv_post_send/recv kernel path optimizations

2011-01-24 Thread Walukiewicz, Miroslaw
Sean,

The assumption here is that the user-space library prepares the vendor-specific 
data in user space using a shared page allocated by the vendor driver. 
Information about posted buffers is passed not through ib_wr but through the 
shared page. That is why the pointers indicating ib_wr in post_send are not 
set; they are not passed to the kernel at all, to avoid copying them in the kernel.

As there is no ib_wr structure in the kernel, there is no reference to bad_wr or 
to the buffer that failed in this context, so the only reasonable information 
about operation state that can be passed through bad_wr is binary: operation 
successful (bad_wr == 0) or not (bad_wr != 0).

Instead of using a specific case for RAW_QP, it is possible to pass some 
information about the buffer-posting method through

enum ib_qp_create_flags {
	IB_QP_CREATE_IPOIB_UD_LSO		= 1 << 0,
	IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK	= 1 << 1
};

extending it with IB_QP_CREATE_USE_SHARED_PAGE = 1 << 2.

In that case the new method could be used for any type of QP and would be 
backward compatible.
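
For illustration only, here is a rough sketch of how the uverbs post_send path 
could key off such a creation flag instead of the QP type. Note that 
IB_QP_CREATE_USE_SHARED_PAGE is just the flag proposed above, and the sketch 
assumes the creation flags are recorded on the QP object, which the current 
code does not do:

	qp = idr_read_qp(cmd.qp_handle, file->ucontext);
	if (!qp)
		goto out_raw_qp;

	/* assumes create_flags were saved on the ib_qp at creation time */
	if (qp->create_flags & IB_QP_CREATE_USE_SHARED_PAGE) {
		/*
		 * The work requests were already written into the shared
		 * page by the user-space library, so nothing is copied from
		 * user space here and bad_wr can only report overall
		 * success or failure.
		 */
		resp.bad_wr = 0;
		ret = qp->device->post_send(qp, NULL, NULL);
		if (ret)
			resp.bad_wr = cmd.wr_count;
	} else {
		/* normal copy-from-user ib_wr path */
	}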

Regards,

Mirek
-----Original Message-----
From: Hefty, Sean 
Sent: Friday, January 21, 2011 4:50 PM
To: Walukiewicz, Miroslaw; Roland Dreier
Cc: Or Gerlitz; Jason Gunthorpe; linux-rdma@vger.kernel.org
Subject: RE: ibv_post_send/recv kernel path optimizations

 + qp = idr_read_qp(cmd.qp_handle, file->ucontext);
 + if (!qp)
 + goto out_raw_qp;
 +
 + if (qp->qp_type == IB_QPT_RAW_ETH) {
 + resp.bad_wr = 0;
 + ret = qp->device->post_send(qp, NULL, NULL);

This looks odd to me and can definitely confuse someone reading the code.  It 
adds assumptions to uverbs about the underlying driver implementation and ties 
that to the QP type.  I don't know if it makes more sense to key off something 
in the cmd or define some other property of the QP, but the NULL parameters 
into post_send are non-intuitive.
 
 + if (ret)
 + resp.bad_wr = cmd.wr_count;

Is this always the case?

- Sean


Re: [RFC 0/8] Reliably generate large request from SRP

2011-01-24 Thread David Dillow
On Mon, 2011-01-24 at 11:14 -0500, Bart Van Assche wrote:
 On Mon, Jan 24, 2011 at 4:32 PM, Or Gerlitz <ogerl...@voltaire.com> wrote:
  David Dillow wrote:
  if we look at the 50% for SAS/1M IOs that you're presenting, can you tell
  what made the difference? srp went from an sg_tablesize of 255 to 256, so the
  upper layers were able to provide 1M as one IO
 
  This win is from sg_tablesize going from 255 to 256 in this case; the HW
  really likes that better than getting two requests -- one for 1020 KB
  and one for 4 KB.
 
  It's always nice to find the simplest explanation for the greatest
 improvement... moving on to the 2nd largest gains
 
  SAS   2M      520 MB/s        861 MB/s
  SAS   4M      529 MB/s        921 MB/s
  SAS   8M      600 MB/s        951 MB/s
 
  I wonder what made the difference here? It can't be only the 255 -> 256
  sg_tablesize change. For the 2M case, the change to use 512-page FMRs could
  let you use one rkey/fmr for the whole IO, but not for 4M/8M
 
 I think it would be interesting to have performance measurements with
 a RAM disk as target too, because for someone not familiar with the
 internals of the target used in this test it is hard to tell which
 performance gain is due to the initiator changes and which is due to
 the target behavior.

I think it is pretty obvious that the gain is due to the initiator
changes allowing us to drive the target the way it likes to be driven,
but perhaps I haven't given you enough information. The HW is backed by
a RAID6 (really RAID3 + two parity drives). Each 4 KB block is broken
into stripes across 8 512-byte sectors, and there is no write combining
when the write cache is disabled. So, when we're splitting 1 MB into a
1020 KB and a 4 KB request, that translates into a 127.5 KB and a
512-byte request to each backend storage device. With the patches, that
remains a single 128 KB request, or 256 KB for 2M, etc. The low-level
drives can optimize that much better.
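
To spell the arithmetic out (assuming the 1 MB write is striped evenly over the
8 data drives described above):

	old limit (255 entries): 1 MB -> 1020 KB + 4 KB requests
	                         per drive: 1020 KB / 8 = 127.5 KB, plus 4 KB / 8 = 512 bytes
	new limit (256 entries): 1 MB -> a single 1024 KB request
	                         per drive: 1024 KB / 8 = 128 KB, one nicely aligned request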

I did runs against my IOP test harness, and it showed better performance
there as well, though that was unexpected -- I figured we'd see a slight
decline in IOPS. I have not yet investigated further, but you have the
code and are welcome to run tests and report results.
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office




[PATCH] opensm: accept looonnng partition.conf lines

2011-01-24 Thread Arthur Kepner

Sometimes we use partition.conf files with so many entries that 
the 1K buffer currently used by osm_prtn_config_parse_file() is 
too small. Use getline() to avoid that.

Signed-off-by: akep...@sgi.com
---
 opensm/opensm/osm_prtn_config.c |8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

---

diff --git a/opensm/opensm/osm_prtn_config.c b/opensm/opensm/osm_prtn_config.c
index 0d02597..34676ef 100644
--- a/opensm/opensm/osm_prtn_config.c
+++ b/opensm/opensm/osm_prtn_config.c
@@ -400,7 +400,9 @@ skip_header:
 int osm_prtn_config_parse_file(osm_log_t * p_log, osm_subn_t * p_subn,
   const char *file_name)
 {
-   char line[1024];
+   char *line = NULL;
+   ssize_t llen;
+   size_t n;
struct part_conf *conf = NULL;
FILE *file;
int lineno;
@@ -415,7 +417,7 @@ int osm_prtn_config_parse_file(osm_log_t * p_log, osm_subn_t * p_subn,
 
lineno = 0;
 
-   while (fgets(line, sizeof(line) - 1, file) != NULL) {
+   while ((llen = getline(&line, &n, file)) != -1) {
char *q, *p = line;
 
lineno++;
@@ -463,6 +465,8 @@ int osm_prtn_config_parse_file(osm_log_t * p_log, osm_subn_t * p_subn,
} while (q);
}
 
+   free(line);
+
fclose(file);
 
return 0;
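
For reference, here is a minimal standalone sketch of the getline() pattern the
patch switches to (the file name and the error handling are purely
illustrative):

	#define _GNU_SOURCE	/* for getline() on older glibc */
	#include <stdio.h>
	#include <stdlib.h>

	int main(void)
	{
		FILE *file = fopen("partition.conf", "r");	/* illustrative path */
		char *line = NULL;	/* getline() allocates and grows this buffer */
		size_t n = 0;		/* current allocation size, updated by getline() */
		ssize_t llen;

		if (!file)
			return 1;

		while ((llen = getline(&line, &n, file)) != -1) {
			/* 'line' holds the whole line, however long; llen is its length */
			fputs(line, stdout);
		}

		free(line);	/* freed once, after the loop, even on EOF or error */
		fclose(file);
		return 0;
	}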


Re: IB: increase DMA max_segment_size on Mellanox hardware

2011-01-24 Thread David Dillow
On Mon, 2011-01-24 at 10:18 -0500, Or Gerlitz wrote:
 David Dillow wrote:
  Perhaps, except for the need for different devices to call different map
  commands, such as ib_dma_*() and sbus_dma_*() etc. There may be other
  reasons as well.

 BTW - I consider ib_dma_map_xxx and friends a bug and not a
 feature,

I'm not a fan of them either, but I haven't gone looking to see if I
could find a better solution.

  I'm still trying to understand the bigger picture with your patch set and 
  what
  role the mlx4/mthca patch has in it
 
  This patch is orthogonal to the SRP mapping changes and was not used for
  the performance numbers listed -- the 50% improvement was achieved with
  _no_ coalescing in the SG lists. Each 4 KB page was its own SG entry.
 
 I see, thanks for clarifying this loud and clear. So maybe you'll
 first get the seven srp patches reviewed and merged, and only then see
 if/what benefit this patch brings. I'm a little bit worried about
 changing something under everyone's (srp, iser, rds, p9, nfs-rdma,
 lustre, etc) feet which doesn't have any notable benefit; I would be
 happy to hear others' opinions

Well, I believe it only affects the block merging -- I have to catch a
flight, so I cannot recheck right this moment -- that limits the effects
to SRP and iSER. The others don't use it at all.

We know SRP is fine, and a quick glance suggests that iSER is as well --
you limit the maximum IO size, so you don't get overly large request
lists anyway. If you were to expand that, the SCSI midlayer provides
hooks that could be used to clamp down the DMA size for devices hanging
off iSER. SRP cannot do that: while it is safe to go to a smaller
value than the hardware supports, it is not safe to try to increase it
beyond the default.
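
To make this concrete, a minimal sketch of the kind of hook meant here (the
callback and the 64 KB value are made up for the example; this is not code
from any of the drivers discussed):

	#include <linux/blkdev.h>
	#include <scsi/scsi_device.h>
	#include <scsi/scsi_host.h>

	/*
	 * Clamp the per-segment DMA size for devices behind this host to
	 * something smaller than what the HCA advertises.
	 */
	static int example_slave_configure(struct scsi_device *sdev)
	{
		blk_queue_max_segment_size(sdev->request_queue, 64 * 1024);
		return 0;
	}

	static struct scsi_host_template example_sht = {
		/* ... the LLD's other fields ... */
		.slave_configure	= example_slave_configure,
	};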

I'll try to recheck once I get a few moments someplace quiet.
-- 
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office




Re: [PATCH] infiniband: hw: amso1100: Fixed compile warning.

2011-01-24 Thread Bart Van Assche
On Sun, Jan 23, 2011 at 5:30 PM, Ralf Thielow
<ralf.thie...@googlemail.com> wrote:

 Fixed a compile warning by using unsigned long instead of u64
 when assigning NULL to c2_vq_req->reply_msg.

 Signed-off-by: Ralf Thielow <ralf.thie...@googlemail.com>
 ---
  drivers/infiniband/hw/amso1100/c2_vq.c |    6 +++---
  1 files changed, 3 insertions(+), 3 deletions(-)

 diff --git a/drivers/infiniband/hw/amso1100/c2_vq.c b/drivers/infiniband/hw/amso1100/c2_vq.c
 index 9ce7819..807b730 100644
 --- a/drivers/infiniband/hw/amso1100/c2_vq.c
 +++ b/drivers/infiniband/hw/amso1100/c2_vq.c
 @@ -107,7 +107,7 @@ struct c2_vq_req *vq_req_alloc(struct c2_dev *c2dev)
        r = kmalloc(sizeof(struct c2_vq_req), GFP_KERNEL);
        if (r) {
                init_waitqueue_head(&r->wait_object);
 -               r->reply_msg = (u64) NULL;
 +               r->reply_msg = (unsigned long) NULL;
                r->event = 0;
                r->cm_id = NULL;
                r->qp = NULL;
 @@ -123,7 +123,7 @@ struct c2_vq_req *vq_req_alloc(struct c2_dev *c2dev)
  */
  void vq_req_free(struct c2_dev *c2dev, struct c2_vq_req *r)
  {
 -       r->reply_msg = (u64) NULL;
 +       r->reply_msg = (unsigned long) NULL;
        if (atomic_dec_and_test(&r->refcnt)) {
                kfree(r);
        }
 @@ -151,7 +151,7 @@ void vq_req_get(struct c2_dev *c2dev, struct c2_vq_req *r)
  void vq_req_put(struct c2_dev *c2dev, struct c2_vq_req *r)
  {
        if (atomic_dec_and_test(&r->refcnt)) {
 -               if (r->reply_msg != (u64) NULL)
 +               if (r->reply_msg != (unsigned long) NULL)
                        vq_repbuf_free(c2dev,
                                       (void *) (unsigned long) r->reply_msg);
                kfree(r);

Hello Ralf,

Have you considered using 0 instead of (u64) NULL or
(unsigned long) NULL?
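
A sketch of what that would look like (assuming reply_msg stays a u64):

	r->reply_msg = 0;	/* no cast needed, no warning on 32-bit builds */
	/* ... */
	if (r->reply_msg)	/* instead of comparing against a casted NULL */
		vq_repbuf_free(c2dev, (void *) (unsigned long) r->reply_msg);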

Bart.