Re: IB: increase DMA max_segment_size on Mellanox hardware
David Dillow wrote:
>> funny, if I understand this correctly, assuming each scsi_host has a dma
>> device associated with it, sg_dma_map/unmap can be taken out from the
>> SCSI LLD code and be done in higher levels; just have someone scan every
>> queuecommand and command-response code flow to clean that out, anyway
> Perhaps, except for the need for different devices to call different map
> commands, such as ib_dma_*() and sbus_dma_*() etc. There may be other
> reasons as well.

yes, I tend to think that there could be other reasons. Else, if this (calling dma mapping from the SCSI LLD queuecommand code) is just something which was needed in the 2.4 days but not any more, the day will come when someone will pick up the glove and make this cleanup.

BTW - I consider ib_dma_map_xxx and friends a bug and not a feature. It was pushed by the ipath/qib guys, and I held a minority opinion that we should stick to dma_map_xxx and have these drivers implement their own software IOMMU emulation. The reason my opinion wasn't accepted is that they're supported only on 64-bit platforms, on which (that was the claim back then) each page has a kernel virtual address associated with it. Oh well.

>> I'm still trying to understand the bigger picture with your patch set
>> and what role the mlx4/mthca patch has in it
> This patch is orthogonal to the SRP mapping changes and was not used for
> the performance numbers listed -- the 50% improvement was achieved with
> _no_ coalescing in the SG lists. Each 4 KB page was its own SG entry.

I see, thanks for clarifying this loud and clear. So maybe you'll first get the seven SRP patches reviewed and merged, and only then see if/what benefit this patch brings. I'm a little bit worried about changing something under everyone's (srp, iser, rds, p9, nfs-rdma, lustre, etc) legs which doesn't have any notable benefit; I would be happy to hear others' opinions.

Or.
-- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC 0/8] Reliably generate large request from SRP
On Mon, Jan 24, 2011 at 4:32 PM, Or Gerlitz ogerl...@voltaire.com wrote:
> David Dillow wrote:
>>> if we look on the 50% for SAS/1M IOs that you're presenting, can you
>>> tell what made the difference, srp went from sg_tablesize of 255 to 256
>>> so the upper layers were able to provide 1M as one IO
>> This win is from sg_tablesize going from 255 to 256 in this case; the HW
>> really likes that better than getting two requests -- one for 1020 KB
>> and one for 4 KB.
> It's always nice to find the simplest explanation for the greatest
> improvement... going to the 2nd largest gains:
>
>   SAS 2M   520 MB/s   861 MB/s
>   SAS 4M   529 MB/s   921 MB/s
>   SAS 8M   600 MB/s   951 MB/s
>
> I wonder what made the difference here? It can't be only the 255 -> 256
> sg_tablesize change; for the 2M case the change to use 512-page FMRs
> could let you use one rkey/fmr for the whole IO, but not for 4M/8M.

I think it would be interesting to have performance measurements with a RAM disk as target too, because it is hard to tell for someone not familiar with the internals of the target used in this test which performance gain is due to the initiator changes and which is due to the target behavior.

Bart.
RE: ibv_post_send/recv kernel path optimizations
Sean,

The assumption here is that the user-space library prepares the vendor-specific data in user space using a shared page allocated by the vendor driver. Information about posted buffers is passed not through ib_wr but using the shared page. That is why the pointers indicating ib_wr in post_send are not set: they are not passed to the kernel at all, to avoid copying them in the kernel. As there is no ib_wr structure in the kernel, there is no reference to bad_wr or to the buffer that failed, so the only reasonable information about operation state that can be passed through bad_wr is binary: operation successful (bad_wr == 0) or not (bad_wr != 0).

Instead of using a specific case for RAW_QP, it is possible to pass some information about the buffer-posting method through

enum ib_qp_create_flags {
	IB_QP_CREATE_IPOIB_UD_LSO		= 1 << 0,
	IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK	= 1 << 1
};

extending it with

	IB_QP_CREATE_USE_SHARED_PAGE		= 1 << 2,

In that case the new method could be used for any type of QP and would be backward compatible.

Regards,
Mirek

-----Original Message-----
From: Hefty, Sean
Sent: Friday, January 21, 2011 4:50 PM
To: Walukiewicz, Miroslaw; Roland Dreier
Cc: Or Gerlitz; Jason Gunthorpe; linux-rdma@vger.kernel.org
Subject: RE: ibv_post_send/recv kernel path optimizations

+	qp = idr_read_qp(cmd.qp_handle, file->ucontext);
+	if (!qp)
+		goto out_raw_qp;
+
+	if (qp->qp_type == IB_QPT_RAW_ETH) {
+		resp.bad_wr = 0;
+		ret = qp->device->post_send(qp, NULL, NULL);

This looks odd to me and can definitely confuse someone reading the code. It adds assumptions to uverbs about the underlying driver implementation and ties that to the QP type. I don't know if it makes more sense to key off something in the cmd or define some other property of the QP, but the NULL parameters into post_send are non-intuitive.

+		if (ret)
+			resp.bad_wr = cmd.wr_count;

Is this always the case?
- Sean
Re: [RFC 0/8] Reliably generate large request from SRP
On Mon, 2011-01-24 at 11:14 -0500, Bart Van Assche wrote:
> On Mon, Jan 24, 2011 at 4:32 PM, Or Gerlitz ogerl...@voltaire.com wrote:
>> David Dillow wrote:
>>>> if we look on the 50% for SAS/1M IOs that you're presenting, can you
>>>> tell what made the difference, srp went from sg_tablesize of 255 to
>>>> 256 so the upper layers were able to provide 1M as one IO
>>> This win is from sg_tablesize going from 255 to 256 in this case; the
>>> HW really likes that better than getting two requests -- one for
>>> 1020 KB and one for 4 KB.
>> It's always nice to find the simplest explanation for the greatest
>> improvement... going to the 2nd largest gains:
>>
>>   SAS 2M   520 MB/s   861 MB/s
>>   SAS 4M   529 MB/s   921 MB/s
>>   SAS 8M   600 MB/s   951 MB/s
>>
>> I wonder what made the difference here? It can't be only the 255 -> 256
>> sg_tablesize change; for the 2M case the change to use 512-page FMRs
>> could let you use one rkey/fmr for the whole IO, but not for 4M/8M.
> I think it would be interesting to have performance measurements with a
> RAM disk as target too, because it is hard to tell for someone not
> familiar with the internals of the target used in this test which
> performance gain is due to the initiator changes and which is due to the
> target behavior.

I think it is pretty obvious that the gain is due to the initiator changes allowing us to drive the target the way it likes to be driven, but perhaps I haven't given you enough information. The HW is backed by a RAID6 (really RAID3 + two parity drives). Each 4 KB block is striped across eight 512-byte sectors, and there is no write combining when the write cache is disabled. So, when we're splitting 1 MB into a 1020 KB and a 4 KB request, that translates into a 127.5 KB and a 512-byte request to each backend storage device. With the patches, that remains a single 128 KB request, or 256 KB for 2M, etc. The low-level drives can optimize that much better.
I did runs against my IOP test harness, and it showed better performance there as well, though that was unexpected -- I figured we'd see a slight decline in IOPS. I have not yet investigated further, but you have the code and are welcome to run tests and report results.
--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office
[PATCH] opensm: accept looonnng partition.conf lines
Sometimes we use partition.conf files with so many entries that the 1K buffer currently used by osm_prtn_config_parse_file() is too small. Use getline() to avoid that.

Signed-off-by: akep...@sgi.com
---
 opensm/opensm/osm_prtn_config.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/opensm/opensm/osm_prtn_config.c b/opensm/opensm/osm_prtn_config.c
index 0d02597..34676ef 100644
--- a/opensm/opensm/osm_prtn_config.c
+++ b/opensm/opensm/osm_prtn_config.c
@@ -400,7 +400,9 @@ skip_header:
 int osm_prtn_config_parse_file(osm_log_t * p_log, osm_subn_t * p_subn,
			       const char *file_name)
 {
-	char line[1024];
+	char *line = NULL;
+	ssize_t llen;
+	size_t n;
	struct part_conf *conf = NULL;
	FILE *file;
	int lineno;
@@ -415,7 +417,7 @@ int osm_prtn_config_parse_file(osm_log_t * p_log, osm_subn_t * p_subn,
	lineno = 0;

-	while (fgets(line, sizeof(line) - 1, file) != NULL) {
+	while ((llen = getline(&line, &n, file)) != -1) {
		char *q, *p = line;

		lineno++;
@@ -463,6 +465,8 @@ int osm_prtn_config_parse_file(osm_log_t * p_log, osm_subn_t * p_subn,
		} while (q);
	}

+	free(line);
+
	fclose(file);

	return 0;
Re: IB: increase DMA max_segment_size on Mellanox hardware
On Mon, 2011-01-24 at 10:18 -0500, Or Gerlitz wrote:
> David Dillow wrote:
>> Perhaps, except for the need for different devices to call different map
>> commands, such as ib_dma_*() and sbus_dma_*() etc. There may be other
>> reasons as well.
> BTW - ib_dma_map_xxx and friends are considered by me as a bug and not a
> feature,

I'm not a fan of them either, but I've not gone looking to see if I could find a better solution.

>>> I'm still trying to understand the bigger picture with your patch set
>>> and what role the mlx4/mthca patch has in it
>> This patch is orthogonal to the SRP mapping changes and was not used for
>> the performance numbers listed -- the 50% improvement was achieved with
>> _no_ coalescing in the SG lists. Each 4 KB page was its own SG entry.
> I see, thanks for clarifying this loud and clear. So maybe you'll first
> get the seven SRP patches reviewed and merged, and only then see if/what
> benefit this patch brings. I'm a little bit worried about changing
> something under everyone's (srp, iser, rds, p9, nfs-rdma, lustre, etc)
> legs which doesn't have any notable benefit; I would be happy to hear
> others' opinions.

Well, I believe it only affects the block merging -- I have to catch a flight, so I cannot recheck right this moment -- and that limits the effects to SRP and iSER. The others don't use it at all. We know SRP is fine, and a quick glance suggests that iSER is as well -- you limit the maximum IO size, so you don't get too large a request list anyway. If you were to expand that, the SCSI midlayer provides hooks that could be used to clamp down the DMA size for devices hanging off iSER. SRP cannot do that: while it is safe to go to a smaller value than the hardware supports, it is not safe to try to increase it from the default. I'll try to recheck once I get a few moments someplace quiet.
--
Dave Dillow
National Center for Computational Science
Oak Ridge National Laboratory
(865) 241-6602 office
Re: [PATCH] infiniband: hw: amso1100: Fixed compile warning.
On Sun, Jan 23, 2011 at 5:30 PM, Ralf Thielow ralf.thie...@googlemail.com wrote:
> Fixed a compile warning by using unsigned long instead of u64 when
> assigning NULL to c2_vq_req->reply_msg.
>
> Signed-off-by: Ralf Thielow ralf.thie...@googlemail.com
> ---
>  drivers/infiniband/hw/amso1100/c2_vq.c | 6 +++---
>  1 files changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/infiniband/hw/amso1100/c2_vq.c b/drivers/infiniband/hw/amso1100/c2_vq.c
> index 9ce7819..807b730 100644
> --- a/drivers/infiniband/hw/amso1100/c2_vq.c
> +++ b/drivers/infiniband/hw/amso1100/c2_vq.c
> @@ -107,7 +107,7 @@ struct c2_vq_req *vq_req_alloc(struct c2_dev *c2dev)
> 	r = kmalloc(sizeof(struct c2_vq_req), GFP_KERNEL);
> 	if (r) {
> 		init_waitqueue_head(&r->wait_object);
> -		r->reply_msg = (u64) NULL;
> +		r->reply_msg = (unsigned long) NULL;
> 		r->event = 0;
> 		r->cm_id = NULL;
> 		r->qp = NULL;
> @@ -123,7 +123,7 @@ struct c2_vq_req *vq_req_alloc(struct c2_dev *c2dev)
>  */
> void vq_req_free(struct c2_dev *c2dev, struct c2_vq_req *r)
> {
> -	r->reply_msg = (u64) NULL;
> +	r->reply_msg = (unsigned long) NULL;
> 	if (atomic_dec_and_test(&r->refcnt)) {
> 		kfree(r);
> 	}
> @@ -151,7 +151,7 @@ void vq_req_get(struct c2_dev *c2dev, struct c2_vq_req *r)
> void vq_req_put(struct c2_dev *c2dev, struct c2_vq_req *r)
> {
> 	if (atomic_dec_and_test(&r->refcnt)) {
> -		if (r->reply_msg != (u64) NULL)
> +		if (r->reply_msg != (unsigned long) NULL)
> 			vq_repbuf_free(c2dev, (void *) (unsigned long) r->reply_msg);
> 		kfree(r);

Hello Ralf,

Have you considered using 0 instead of (u64) NULL or (unsigned long) NULL?

Bart.