Re: [PATCH V2 1/2] IB/core: Add support for enhanced atomic operations

2010-03-01 Thread Håkon Bugge
 Still in development, but it should be ready very soon. We will be the
 first in-kernel user of atomics, as well as masked atomics. This tree
 does not support masked atomics yet because it is based on ofed 1.5.1.
 When 1.5.2 is officially released, I'll rebase, add mask support, and
 plan on pushing to mainline and ofed 1.6 as soon as it opens.



Maybe I missed something, but how do you guys intend to expose this new
functionality via the device capabilities? A new atomic_enhanced_cap?

The reason for asking is that the ordinary IB atomic repertoire fits nicely
with that of PCIe Gen 3, so one could plausibly assume that new HCAs supporting
PCIe Gen 3 will possess the ATOMIC_GLOB capability for atomic_cap. But I do not
see that happening for the proposed Enhanced IB Atomics.


Thanks, Håkon

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: SRP initiator and iSER initiator performance

2010-03-01 Thread Vladislav Bolkhovitin

Bart Van Assche, on 02/27/2010 10:27 PM wrote:

On Mon, Jan 11, 2010 at 7:44 PM, Vladislav Bolkhovitin v...@vlnb.net wrote:

[ ... ]

The SRP initiator seems not particularly well optimized for performance. The
iSER initiator is noticeably better in this area.


(replying to an e-mail of one month ago)

I'm not sure the above statement holds. Below you can find the
performance results for 512-byte reads with a varying number of
threads and a NULLIO target. With a sufficiently high number of
threads this test saturated the two CPU cores of the initiator system
but not the CPU core of the target system. For the initiator and
target software combination used in this test, one can conclude from
the numbers below that, although the difference is small, both the
latency and the CPU usage of the SRP traffic are slightly lower than
those of the iSER traffic. These numbers are quite impressive: for
both protocols, one I/O operation is completed by the initiator
system in about 17 microseconds, or roughly 44,000 clock cycles.

iSER:
 1   read : io=128MB, bw=13,755KB/s, iops=27,510, runt=  9529msec
 2   read : io=256MB, bw=26,118KB/s, iops=52,235, runt= 10037msec
 4   read : io=512MB, bw=48,985KB/s, iops=97,970, runt= 10703msec
 8   read : io=1,024MB, bw=57,519KB/s, iops=115K, runt= 18230msec
16   read : io=2,048MB, bw=57,880KB/s, iops=116K, runt= 36233msec
32   read : io=4,096MB, bw=57,990KB/s, iops=116K, runt= 72328msec
64   read : io=8,192MB, bw=58,066KB/s, iops=116K, runt=144468msec
CPU load for 64 threads (according to vmstat 2): 20% us + 80% sy on
the initiator and 40% us + 20% sy + 40% id on the target.

SRP:
 1   read : io=128MB, bw=14,211KB/s, iops=28,422, runt=  9223msec
 2   read : io=256MB, bw=26,275KB/s, iops=52,549, runt=  9977msec
 4   read : io=512MB, bw=49,257KB/s, iops=98,513, runt= 10644msec
 8   read : io=1,024MB, bw=60,322KB/s, iops=121K, runt= 17383msec
16   read : io=2,048MB, bw=61,272KB/s, iops=123K, runt= 34227msec
32   read : io=4,096MB, bw=61,176KB/s, iops=122K, runt= 68561msec
64   read : io=8,192MB, bw=60,963KB/s, iops=122K, runt=137602msec
CPU load for 64 threads (according to vmstat 2): 20% us + 80% sy on
the initiator and 0% us + 50% sy + 50% id on the target.

Setup details:
* The above output was generated with the following command:
for i in 1 2 4 8 16 32 64; do printf "%2d " $i; \
    io-load 512 $i ${initiator_device} | grep runt; done
* The io-load script is as follows:
#!/bin/sh
blocksize=${1:-512}
threads=${2:-1}
dev=${3:-sdj}
fio --bs=${blocksize} --buffered=0 --size=128M --ioengine=sg \
    --rw=read --invalidate=1 --end_fsync=1 --thread --numjobs=${threads} \
    --loops=1 --group_reporting --name=nullio --filename=/dev/${dev}

* SRP target software: SCST r1522 compiled in release mode.
* iSER target software: tgt 1.0.2.

* InfiniBand hardware: QDR PCIe 2.0 HCAs.

* Initiator system:
2.6.33-rc7 kernel (the for-next branch of Roland's InfiniBand repository,
without the recently posted iSER and SRP performance improvement
patches).
SRP initiator was loaded with parameter srp_sg_tablesize=128
Frequency scaling was disabled.
Runlevel: 3.
CPU: E6750 @ 2.66GHz.

* Target system:
2.6.30.7 kernel + SCST patches.
Frequency scaling was disabled.
Runlevel: 3.
CPU: E8400 @ 3.00GHz booted with maxcpus=1.


It's good if my impression was wrong, but your IOPS numbers are
suspiciously low; on that hardware you should see much more. It seems
you hit a bottleneck on the initiator somewhere above the driver level
(fio? the sg engine? IRQ or context-switch counts?), so your results
may not really be related to the topic. Oprofile and lockstat output
could shed more light on this.


Vlad


[PATCH v2] IB/ipoib: fix dangling pointer references to ipoib_neigh and ipoib_path

2010-03-01 Thread Ralph Campbell
When using connected mode, ipoib_cm_create_tx() kmallocs a
struct ipoib_cm_tx which contains pointers to ipoib_neigh and
ipoib_path. If the paths are flushed or the struct neighbour is
destroyed, the pointers held by struct ipoib_cm_tx can reference
freed memory.

Signed-off-by: Ralph Campbell ralph.campb...@qlogic.com
---

I was able to remove the kref_t added to struct ipoib_path and ipoib_neigh.
The key is to delete references to ipoib_neigh and ipoib_path
when ipoib_neigh_cleanup() is called.

 drivers/infiniband/ulp/ipoib/ipoib.h   |   24 ++-
 drivers/infiniband/ulp/ipoib/ipoib_cm.c|  105 +-
 drivers/infiniband/ulp/ipoib/ipoib_main.c  |  267 +++-
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c |   95 +++--
 4 files changed, 222 insertions(+), 269 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 753a983..84bb561 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -415,9 +415,7 @@ static inline struct ipoib_neigh **to_ipoib_neigh(struct neighbour *neigh)
 INFINIBAND_ALEN, sizeof(void *));
 }
 
-struct ipoib_neigh *ipoib_neigh_alloc(struct neighbour *neigh,
- struct net_device *dev);
-void ipoib_neigh_free(struct net_device *dev, struct ipoib_neigh *neigh);
+void ipoib_neigh_flush(struct ipoib_neigh *neigh);
 
 extern struct workqueue_struct *ipoib_workqueue;
 
@@ -464,7 +462,8 @@ void ipoib_dev_cleanup(struct net_device *dev);
 
 void ipoib_mcast_join_task(struct work_struct *work);
 void ipoib_mcast_carrier_on_task(struct work_struct *work);
-void ipoib_mcast_send(struct net_device *dev, void *mgid, struct sk_buff *skb);
+void ipoib_mcast_send(struct net_device *dev, void *mgid, struct sk_buff *skb,
+		      struct ipoib_neigh *neigh);
 
 void ipoib_mcast_restart_task(struct work_struct *work);
 int ipoib_mcast_start_thread(struct net_device *dev);
@@ -567,9 +566,10 @@ void ipoib_cm_dev_stop(struct net_device *dev);
 int ipoib_cm_dev_init(struct net_device *dev);
 int ipoib_cm_add_mode_attr(struct net_device *dev);
 void ipoib_cm_dev_cleanup(struct net_device *dev);
-struct ipoib_cm_tx *ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path *path,
-   struct ipoib_neigh *neigh);
+void ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path *path,
+   struct ipoib_neigh *neigh);
 void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx);
+void ipoib_cm_flush_path(struct net_device *dev, struct ipoib_path *path);
 void ipoib_cm_skb_too_long(struct net_device *dev, struct sk_buff *skb,
   unsigned int mtu);
 void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc);
@@ -646,10 +646,10 @@ void ipoib_cm_dev_cleanup(struct net_device *dev)
 }
 
 static inline
-struct ipoib_cm_tx *ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path *path,
-   struct ipoib_neigh *neigh)
+void ipoib_cm_create_tx(struct net_device *dev, struct ipoib_path *path,
+   struct ipoib_neigh *neigh)
 {
-   return NULL;
+   return;
 }
 
 static inline
@@ -659,6 +659,12 @@ void ipoib_cm_destroy_tx(struct ipoib_cm_tx *tx)
 }
 
 static inline
+void ipoib_cm_flush_path(struct net_device *dev, struct ipoib_path *path)
+{
+   return;
+}
+
+static inline
 int ipoib_cm_add_mode_attr(struct net_device *dev)
 {
return 0;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 30bdf42..4ec4ebc 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -794,31 +794,14 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc)
 
 	if (wc->status != IB_WC_SUCCESS &&
 	    wc->status != IB_WC_WR_FLUSH_ERR) {
-		struct ipoib_neigh *neigh;
-
 		ipoib_dbg(priv, "failed cm send event "
 			  "(status=%d, wrid=%d vend_err %x)\n",
 			  wc->status, wr_id, wc->vendor_err);
 
 		spin_lock_irqsave(&priv->lock, flags);
-		neigh = tx->neigh;
-
-		if (neigh) {
-			neigh->cm = NULL;
-			list_del(&neigh->list);
-			if (neigh->ah)
-				ipoib_put_ah(neigh->ah);
-			ipoib_neigh_free(dev, neigh);
-
-			tx->neigh = NULL;
-		}
-
-		if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {
-			list_move(&tx->list, &priv->cm.reap_list);
-			queue_work(ipoib_workqueue, &priv->cm.reap_task);
-		}
 
 		clear_bit(IPOIB_FLAG_OPER_UP, &tx->flags);
+		ipoib_cm_destroy_tx(tx);
 
 		spin_unlock_irqrestore(&priv->lock, flags);
 	}
@@ 

Re: [ewg] nfsrdma fails to write big file,

2010-03-01 Thread Tom Tucker

Roland:

I'll put together a patch based on a multiplier of 5, with a comment
explaining why I think 5 is the right number. Since Vu has verified this
behaviorally as well, I'm comfortable that our understanding of the code
is sound. I'm on the road right now, so it won't be until tomorrow, though.


Thanks,
Tom


Vu Pham wrote:
  

-Original Message-
From: Tom Tucker [mailto:t...@opengridcomputing.com]
Sent: Saturday, February 27, 2010 8:23 PM
To: Vu Pham
Cc: Roland Dreier; linux-rdma@vger.kernel.org; Mahesh Siddheshwar;
e...@lists.openfabrics.org
Subject: Re: [ewg] nfsrdma fails to write big file,

Roland Dreier wrote:


  +	/*
  +	 * Add room for FRMR register and invalidate WRs.
  +	 * Requests sometimes have two chunks, and each chunk
  +	 * requires a different FRMR. The safe upper bound is
  +	 * max_send_wr * 6; however, since we get send
  +	 * completions and poll fast enough, it is reasonably
  +	 * safe to use max_send_wr * 4.
  +	 */
  +	ep->rep_attr.cap.max_send_wr *= 4;

Seems like a bad design if there is a possibility of work queue
overflow; if you're counting on events occurring in a particular
order or completions being handled fast enough, then your design is
going to fail in some high load situations, which I don't think you
want.


  

Vu,

Would you please try the following:

- Set the multiplier to 5
- Set the number of buffer credits small, as follows:
  echo 4 > /proc/sys/sunrpc/rdma_slot_table_entries
- Rerun your test and see if you can reproduce the problem?

I did the above and was unable to reproduce, but I would like to see
if you can, to convince ourselves that 5 is the right number.





Tom,

I did the above and cannot reproduce it either.

I think 5 is the right number; however, we should optimize it later.

-vu
  




Re: rnfs: rq_respages pointer is bad

2010-03-01 Thread Tom Tucker

Hi David:

That looks like a bug to me, and what you propose looks like the
correct fix. My only reservation: if you are correct, then how did
this work at all without data corruption for large writes on x86_64?


I'm on the road right now, so I can't dig too deep until Wednesday, but 
at this point your analysis looks correct to me.


Tom


David J. Wilder wrote:

Tom

I have been chasing an rnfs-related Oops in svc_process(). I have found
the source of the Oops, but I am not sure of my fix. I am seeing the
problem on ppc64, kernel 2.6.32; I have not tried other architectures yet.

The source of the problem is in rdma_read_complete(): I am finding that
rqstp->rq_respages is set to point past the end of the rqstp->rq_pages
page list. This results in a NULL pointer dereference in svc_process()
when passing rq_respages[0] to page_address().

In rdma_read_complete() we are using rqstp->rq_arg.pages as the base of
the page list and then indexing by page_no; however, rq_arg.pages does
not point to the start of the list, so rq_respages ends up pointing to:

rqstp->rq_pages[(head->count+1) + head->hdr_count]

In my case, it ends up pointing one past the end of the list.

Here is the change I made.

static int rdma_read_complete(struct svc_rqst *rqstp,
			      struct svc_rdma_op_ctxt *head)
{
	int page_no;
	int ret;

	BUG_ON(!head);

	/* Copy RPC pages */
	for (page_no = 0; page_no < head->count; page_no++) {
		put_page(rqstp->rq_pages[page_no]);
		rqstp->rq_pages[page_no] = head->pages[page_no];
	}
	/* Point rq_arg.pages past header */
	rqstp->rq_arg.pages = &rqstp->rq_pages[head->hdr_count];
	rqstp->rq_arg.page_len = head->arg.page_len;
	rqstp->rq_arg.page_base = head->arg.page_base;

	/* rq_respages starts after the last arg page */
-	rqstp->rq_respages = &rqstp->rq_arg.pages[page_no];
+	rqstp->rq_respages = &rqstp->rq_pages[page_no];
.
.
.

The change works for me, but I am not sure it is safe to assume that
rqstp->rq_pages[head->count] will always point to the last arg page.

Dave.
  

