Re: [PATCH V2 1/2] IB/core: Add support for enhanced atomic operations
> Still in development, but it should be ready very soon. We will be the
> first in-kernel user of atomics, as well as of masked atomics. This tree
> does not support masked atomics yet because it is based on OFED 1.5.1.
> When 1.5.2 is officially released, I'll rebase, add mask support, and
> plan on pushing to mainline and OFED 1.6 as soon as it opens.

Maybe I missed something, but how do you guys intend to reflect this new
functionality in the device capabilities? A new atomic_enhanced_cap? The
reason for asking is that the ordinary IB atomic repertoire fits nicely
with that of PCIe Gen 3, so one could plausibly assume new HCAs supporting
PCIe Gen 3 to possess the ATOMIC_GLOB capability for atomic_cap. But I do
not see that happening for the proposed enhanced IB atomics.

Thanks, Håkon
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: SRP initiator and iSER initiator performance
Bart Van Assche, on 02/27/2010 10:27 PM wrote:
> On Mon, Jan 11, 2010 at 7:44 PM, Vladislav Bolkhovitin <v...@vlnb.net> wrote:
> > [ ... ] The SRP initiator seems to be not too well optimized for the
> > best performance. The iSER initiator is noticeably better in this area.
>
> (replying to an e-mail of one month ago)
>
> I'm not sure the above statement makes sense. Below you can find the
> performance results for 512-byte reads with a varying number of threads
> and a NULLIO target. With a sufficiently high number of threads this
> test saturated the two CPU cores of the initiator system, but not the
> CPU core of the target system. One can conclude from the numbers below
> that, for the initiator and target software combinations used in this
> test, both the latency and the CPU usage of the SRP traffic are slightly
> lower than those of the iSER traffic, although the differences are
> small. These numbers are quite impressive: for both protocols, one I/O
> operation is completed by the initiator system in about 17 microseconds,
> or 44,000 clock cycles.
>
> iSER:
>
>  1 read : io=128MB,   bw=13,755KB/s, iops=27,510, runt=  9529msec
>  2 read : io=256MB,   bw=26,118KB/s, iops=52,235, runt= 10037msec
>  4 read : io=512MB,   bw=48,985KB/s, iops=97,970, runt= 10703msec
>  8 read : io=1,024MB, bw=57,519KB/s, iops=115K,   runt= 18230msec
> 16 read : io=2,048MB, bw=57,880KB/s, iops=116K,   runt= 36233msec
> 32 read : io=4,096MB, bw=57,990KB/s, iops=116K,   runt= 72328msec
> 64 read : io=8,192MB, bw=58,066KB/s, iops=116K,   runt=144468msec
>
> CPU load for 64 threads (according to vmstat 2): 20% us + 80% sy on the
> initiator and 40% us + 20% sy + 40% id on the target.
> SRP:
>
>  1 read : io=128MB,   bw=14,211KB/s, iops=28,422, runt=  9223msec
>  2 read : io=256MB,   bw=26,275KB/s, iops=52,549, runt=  9977msec
>  4 read : io=512MB,   bw=49,257KB/s, iops=98,513, runt= 10644msec
>  8 read : io=1,024MB, bw=60,322KB/s, iops=121K,   runt= 17383msec
> 16 read : io=2,048MB, bw=61,272KB/s, iops=123K,   runt= 34227msec
> 32 read : io=4,096MB, bw=61,176KB/s, iops=122K,   runt= 68561msec
> 64 read : io=8,192MB, bw=60,963KB/s, iops=122K,   runt=137602msec
>
> CPU load for 64 threads (according to vmstat 2): 20% us + 80% sy on the
> initiator and 0% us + 50% sy + 50% id on the target.
>
> Setup details:
>
> * The above output was generated with the following command:
>
>   for i in 1 2 4 8 16 32 64; do
>       printf "%2d " $i
>       io-load 512 $i ${initiator_device} | grep runt
>   done
>
> * The io-load script is as follows:
>
>   #!/bin/sh
>   blocksize=${1:-512}
>   threads=${2:-1}
>   dev=${3:-sdj}
>   fio --bs=${blocksize} --buffered=0 --size=128M --ioengine=sg \
>       --rw=read --invalidate=1 --end_fsync=1 --thread \
>       --numjobs=${threads} --loops=1 --group_reporting \
>       --name=nullio --filename=/dev/${dev}
>
> * SRP target software: SCST r1522 compiled in release mode.
> * iSER target software: tgt 1.0.2.
> * InfiniBand hardware: QDR PCIe 2.0 HCAs.
> * Initiator system: 2.6.33-rc7 kernel (for-next branch of Roland's
>   InfiniBand repository, without the recently posted iSER and SRP
>   performance improvement patches). The SRP initiator was loaded with
>   the parameter srp_sg_tablesize=128. Frequency scaling was disabled.
>   Runlevel: 3. CPU: E6750 @ 2.66GHz.
> * Target system: 2.6.30.7 kernel + SCST patches. Frequency scaling was
>   disabled. Runlevel: 3. CPU: E8400 @ 3.00GHz, booted with maxcpus=1.

It's good if my impression was wrong, but you've got suspiciously low IOPS
numbers; on your hardware you should get much more. It seems you hit a
bottleneck on the initiator somewhere above the driver level (fio? the sg
engine? IRQ or context switch counts?), so your results might not really
be related to the topic. Oprofile and lockstat output could shed more
light on this.
Vlad
[PATCH v2] IB/ipoib: fix dangling pointer references to ipoib_neigh and ipoib_path
When using connected mode, ipoib_cm_create_tx() kmallocs a struct
ipoib_cm_tx which contains pointers to ipoib_neigh and ipoib_path. If the
paths are flushed or the struct neighbour is destroyed, the pointers held
by struct ipoib_cm_tx can reference freed memory.

Signed-off-by: Ralph Campbell <ralph.campb...@qlogic.com>
---

I was able to remove the kref_t added to struct ipoib_path and
ipoib_neigh. The key is to delete references to ipoib_neigh and
ipoib_path when ipoib_neigh_cleanup() is called.

 drivers/infiniband/ulp/ipoib/ipoib.h           |   24 ++-
 drivers/infiniband/ulp/ipoib/ipoib_cm.c        |  105 +-
 drivers/infiniband/ulp/ipoib/ipoib_main.c      |  267 +++-
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |   95 +++--
 4 files changed, 222 insertions(+), 269 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 753a983..84bb561 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -415,9 +415,7 @@ static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh)
 			INFINIBAND_ALEN, sizeof(void *));
 }
 
-struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh,
-				      struct net_device *dev);
-void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh);
+void ipoib_neigh_flush(struct ipoib_neigh *neigh);
 
 extern struct workqueue_struct *ipoib_workqueue;
 
@@ -464,7 +462,8 @@ void ipoib_dev_cleanup(struct net_device *dev);
 
 void ipoib_mcast_join_task(struct work_struct *work);
 void ipoib_mcast_carrier_on_task(struct work_struct *work);
-void ipoib_mcast_send(struct net_device *dev, void *mgid, struct sk_buff *skb);
+void ipoib_mcast_send(struct net_device *dev, void *mgid, struct sk_buff *skb,
+		      struct ipoib_neigh *neigh);
 
 void ipoib_mcast_restart_task(struct work_struct *work);
 int ipoib_mcast_start_thread(struct net_device *dev);
@@ -567,9 +566,10 @@ void ipoib_cm_dev_stop(struct net_device *dev);
 int ipoib_cm_dev_init(struct net_device *dev);
 int ipoib_cm_add_mode_attr(struct net_device *dev);
 void ipoib_cm_dev_cleanup(struct net_device *dev);
-struct ipoib_cm_tx *ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path *path,
-				       struct ipoib_neigh *neigh);
+void ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path *path,
+			struct ipoib_neigh *neigh);
 void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx);
+void ipoib_cm_flush_path(struct net_device *dev, struct ipoib_path *path);
 void ipoib_cm_skb_too_long(struct net_device *dev, struct sk_buff *skb,
 			   unsigned int mtu);
 void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc);
@@ -646,10 +646,10 @@ void ipoib_cm_dev_cleanup(struct net_device *dev)
 }
 
 static inline
-struct ipoib_cm_tx *ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path *path,
-				       struct ipoib_neigh *neigh)
+void ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path *path,
+			struct ipoib_neigh *neigh)
 {
-	return NULL;
+	return;
 }
 
 static inline
@@ -659,6 +659,12 @@ void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx)
 }
 
 static inline
+void ipoib_cm_flush_path(struct net_device *dev, struct ipoib_path *path)
+{
+	return;
+}
+
+static inline
 int ipoib_cm_add_mode_attr(struct net_device *dev)
 {
 	return 0;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 30bdf42..4ec4ebc 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -794,31 +794,14 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 
 	if (wc->status != IB_WC_SUCCESS &&
 	    wc->status != IB_WC_WR_FLUSH_ERR) {
-		struct ipoib_neigh *neigh;
-
 		ipoib_dbg(priv, "failed cm send event "
 			  "(status=%d, wrid=%d vend_err %x)\n",
 			  wc->status, wr_id, wc->vendor_err);
 
 		spin_lock_irqsave(&priv->lock, flags);
-		neigh = tx->neigh;
-
-		if (neigh) {
-			neigh->cm = NULL;
-			list_del(&neigh->list);
-			if (neigh->ah)
-				ipoib_put_ah(neigh->ah);
-			ipoib_neigh_free(dev, neigh);
-
-			tx->neigh = NULL;
-		}
-
-		if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
-			list_move(&tx->list, &priv->cm.reap_list);
-			queue_work(ipoib_workqueue, &priv->cm.reap_task);
-		}
 		clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags);
+		ipoib_cm_destroy_tx(tx);
 		spin_unlock_irqrestore(&priv->lock, flags);
 	}
@@
Re: [ewg] nfsrdma fails to write big file,
Roland:

I'll put together a patch based on 5, with a comment that indicates why I
think 5 is the right number. Since Vu has verified this behaviorally as
well, I'm comfortable that our understanding of the code is sound. I'm on
the road right now, so it won't be until tomorrow though.

Thanks,
Tom

Vu Pham wrote:
> > -----Original Message-----
> > From: Tom Tucker [mailto:t...@opengridcomputing.com]
> > Sent: Saturday, February 27, 2010 8:23 PM
> > To: Vu Pham
> > Cc: Roland Dreier; linux-rdma@vger.kernel.org; Mahesh Siddheshwar;
> > e...@lists.openfabrics.org
> > Subject: Re: [ewg] nfsrdma fails to write big file,
> >
> > Roland Dreier wrote:
> > > > +	/*
> > > > +	 * Add room for frmr register and invalidate WRs.
> > > > +	 * Requests sometimes have two chunks, and each chunk
> > > > +	 * requires a different frmr. The safest number of WRs
> > > > +	 * required is max_send_wr * 6; however, if we get send
> > > > +	 * completions and poll fast enough, it is pretty safe
> > > > +	 * to have max_send_wr * 4.
> > > > +	 */
> > > > +	ep->rep_attr.cap.max_send_wr *= 4;
> > >
> > > Seems like a bad design if there is a possibility of work queue
> > > overflow; if you're counting on events occurring in a particular
> > > order or completions being handled fast enough, then your design is
> > > going to fail in some high-load situations, which I don't think you
> > > want.
> >
> > Vu,
> >
> > Would you please try the following:
> >
> > - Set the multiplier to 5
> > - Set the number of buffer credits small, as follows:
> >       echo 4 > /proc/sys/sunrpc/rdma_slot_table_entries
> > - Rerun your test and see if you can reproduce the problem?
> >
> > I did the above and was unable to reproduce, but I would like to see
> > if you can, to convince ourselves that 5 is the right number.
> >
> > Tom
>
> Tom,
>
> I did the above and can not reproduce either. I think 5 is the right
> number; however, we should optimize it later.
>
> -vu
Re: rnfs: rq_respages pointer is bad
Hi David:

That looks like a bug to me, and it looks like what you propose is the
correct fix. My only reservation is that if you are correct, then how did
this work at all without data corruption for large writes on x86_64? I'm
on the road right now, so I can't dig too deep until Wednesday, but at
this point your analysis looks correct to me.

Tom

David J. Wilder wrote:
> Tom
>
> I have been chasing an rnfs-related oops in svc_process(). I have found
> the source of the oops, but I am not sure of my fix. I am seeing the
> problem on ppc64, kernel 2.6.32; I have not tried other arches yet.
>
> The source of the problem is in rdma_read_complete(): I am finding that
> rqstp->rq_respages is set to point past the end of the rqstp->rq_pages
> page list. This results in a NULL reference in svc_process() when
> passing rq_respages[0] to page_address(). In rdma_read_complete() we
> are using rqstp->rq_arg.pages as the base of the page list and then
> indexing by page_no; however, rq_arg.pages is not pointing to the start
> of the list, so rq_respages ends up pointing to:
>
>     rqstp->rq_pages[(head->count + 1) + head->hdr_count]
>
> In my case, it ends up pointing one past the end of the list.
>
> Here is the change I made:
>
> static int rdma_read_complete(struct svc_rqst *rqstp,
> 			      struct svc_rdma_op_ctxt *head)
> {
> 	int page_no;
> 	int ret;
>
> 	BUG_ON(!head);
>
> 	/* Copy RPC pages */
> 	for (page_no = 0; page_no < head->count; page_no++) {
> 		put_page(rqstp->rq_pages[page_no]);
> 		rqstp->rq_pages[page_no] = head->pages[page_no];
> 	}
>
> 	/* Point rq_arg.pages past header */
> 	rqstp->rq_arg.pages = &rqstp->rq_pages[head->hdr_count];
> 	rqstp->rq_arg.page_len = head->arg.page_len;
> 	rqstp->rq_arg.page_base = head->arg.page_base;
>
> 	/* rq_respages starts after the last arg page */
> -	rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
> +	rqstp->rq_respages = &rqstp->rq_pages[page_no];
> 	.
> 	.
> 	.
>
> The change works for me, but I am not sure it is safe to assume that
> rqstp->rq_pages[head->count] will always point to the last arg page.
>
> Dave.