[PATCH] ib/ehca: fix in_wc handling in process_mad()

2010-02-16 Thread Alexander Schmidt
If the caller does not pass a valid in_wc to process_mad(),
return MAD failure as it is not possible to generate a valid
MAD redirect response.

Signed-off-by: Alexander Schmidt al...@linux.vnet.ibm.com
---

Hi Roland,

this is another patch we would like to get in your next tree for
2.6.34.

 drivers/infiniband/hw/ehca/ehca_sqp.c |2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-2.6.orig/drivers/infiniband/hw/ehca/ehca_sqp.c
+++ linux-2.6/drivers/infiniband/hw/ehca/ehca_sqp.c
@@ -222,7 +222,7 @@ int ehca_process_mad(struct ib_device *i
 {
 	int ret;
 
-	if (!port_num || port_num > ibdev->phys_port_cnt)
+	if (!port_num || port_num > ibdev->phys_port_cnt || !in_wc)
 		return IB_MAD_RESULT_FAILURE;
 
 	/* accept only pma request */


Re: [ewg] MLX4 Strangeness

2010-02-16 Thread Tom Tucker

Tziporet Koren wrote:

On 2/15/2010 10:24 PM, Tom Tucker wrote:
  

Hello,

I am seeing some very strange behavior on my MLX4 adapters running 2.7
firmware and the latest OFED 1.5.1. Two systems are involved and each
has a dual-ported MTHCA DDR adapter and MLX4 adapters.

The scenario starts with NFSRDMA stress testing between the two systems
running bonnie++ and iozone concurrently. The test completes and there
is no issue. Then 6 minutes pass and the server times out the
connection and shuts down the RC connection to the client.

  From this point on, using the RDMA CM, a new RC QP can be brought up
and moved to RTS, however, the first RDMA_SEND to the NFS SERVER system
fails with IB_WC_RETRY_EXC_ERR. I have confirmed:

- that arp completed successfully and the neighbor entries are
populated on both the client and server
- that the QP are in the RTS state on both the client and server
- that there are RECV WR posted to the RQ on the server and they did not
error out
- that no RECV WR completed successfully or in error on the server
- that there are SEND WR posted to the QP on the client
- the client side SEND_WR fails with error 12 as mentioned above

I have also confirmed the following with a different application (i.e.
rping):

server# rping -s
client# rping -c -a 192.168.80.129

fails with the exact same error, i.e.
client# rping -c -a 192.168.80.129
cq completion failed status 12
wait for RDMA_WRITE_ADV state 10
client DISCONNECT EVENT...

However, if I run rping the other way, it works fine, that is,

client# rping -s
server# rping -c -a 192.168.80.135

It runs without error until I stop it.

Does anyone have any ideas on how I might debug this?


   


Tom
What is the vendor syndrome error when you get a completion with error?

  

Hang on... compiling
Does the issue occur only on the ConnectX cards (mlx4) or also on the
InfiniHost cards (mthca)?


  


Only the MLX4 cards.


Tziporet





[PATCH 1/3] opensm: Use local variables when searching for torus-2QoS master spanning tree root.

2010-02-16 Thread Jim Schutt
Otherwise 1) the presence of the wrong switches is checked; and 2) the y-loop
in good_xy_ring() can segfault on an out-of-bounds switch array x index,
because the preceding x-loop exits with x == t->x_sz.

Signed-off-by: Jim Schutt jasc...@sandia.gov
---
 opensm/opensm/osm_ucast_torus.c |   13 +++--
 1 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/opensm/opensm/osm_ucast_torus.c b/opensm/opensm/osm_ucast_torus.c
index e2eb324..728e56c 100644
--- a/opensm/opensm/osm_ucast_torus.c
+++ b/opensm/opensm/osm_ucast_torus.c
@@ -8751,22 +8751,23 @@ ib_api_status_t torus_mcast_stree(void *context, osm_mgrp_box_t *mgb)
 }
 
 static
-bool good_xy_ring(struct torus *t, int x, int y, int z)
+bool good_xy_ring(struct torus *t, const int x, const int y, const int z)
 {
 	struct t_switch ****sw = t->sw;
 	bool good_ring = true;
+	int x_tst, y_tst;
 
-	for (x = 0; x < t->x_sz && good_ring; x++)
-		good_ring = sw[x][y][z];
+	for (x_tst = 0; x_tst < t->x_sz && good_ring; x_tst++)
+		good_ring = sw[x_tst][y][z];
 
-	for (y = 0; y < t->y_sz && good_ring; y++)
-		good_ring = sw[x][y][z];
+	for (y_tst = 0; y_tst < t->y_sz && good_ring; y_tst++)
+		good_ring = sw[x][y_tst][z];
 
 	return good_ring;
 }
 
 static
-struct t_switch *find_plane_mid(struct torus *t, int z)
+struct t_switch *find_plane_mid(struct torus *t, const int z)
 {
 	int x, dx, xm = t->x_sz / 2;
 	int y, dy, ym = t->y_sz / 2;
-- 
1.5.6.GIT




[PATCH 0/3] opensm: Bug fixes for torus-2QoS patchset

2010-02-16 Thread Jim Schutt
These patches fix bugs discovered during further testing of the
torus-2QoS routing module for OpenSM.

(See http://www.spinics.net/lists/linux-rdma/msg01438.html
and http://www.spinics.net/lists/linux-rdma/msg01938.html)


Jim Schutt (3):
  opensm: Use local variables when searching for torus-2QoS master
spanning tree root.
  opensm: Fix handling of torus-2QoS topology discovery for radix 4
torus dimensions.
  opensm: Avoid havoc in dump_ucast_routes() caused by torus-2QoS
persistent use of osm_port_t:priv.

 opensm/include/opensm/osm_switch.h |   12 +
 opensm/opensm/osm_dump.c   |2 +-
 opensm/opensm/osm_switch.c |7 +-
 opensm/opensm/osm_ucast_mgr.c  |1 +
 opensm/opensm/osm_ucast_torus.c|  418 +++-
 5 files changed, 193 insertions(+), 247 deletions(-)




[PATCH 2/3] opensm: Fix handling of torus-2QoS topology discovery for radix 4 torus dimensions.

2010-02-16 Thread Jim Schutt
Torus-2QoS finds the torus topology in a fabric using an algorithm that
looks for 8 adjacent switches which form the corners of a cube, by looking
for 4 adjacent switches which form the corners of a face on that cube.

When a torus dimension has radix 4 (e.g. the y dimension in a 5x4x8 torus),
1-D rings which span that dimension cannot be distinguished topologically
from the faces the algorithm is trying to construct.
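
To make the ambiguity concrete, here is a small sketch (illustration only; a
stand-in for the canonicalize() helper used in osm_ucast_torus.c) of the
wrap-around arithmetic involved:

/* Stand-in for the torus wrap-around helper: reduce a possibly
 * negative coordinate modulo the ring radix. */
static int canonicalize(int v, int radix)
{
	return ((v % radix) + radix) % radix;
}

/*
 * For a radix-4 ring, canonicalize(j + 2, 4) == canonicalize(j - 2, 4)
 * for any j, i.e. the switch two hops ahead is the same switch as the
 * one two hops behind.  The four switches of such a ring therefore
 * satisfy the same neighbour checks as the four corners of a face,
 * which is the ambiguity described above.
 */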

Code that prevents that situation from arising should only be applied in
cases where a torus dimension has radix 4, but due to a missing test, it
could be applied inappropriately.

This commit fixes the bug by adding the missing test.  It also restructures
the code in question to remove code duplication by adding helper functions.

Signed-off-by: Jim Schutt jasc...@sandia.gov
---
 opensm/opensm/osm_ucast_torus.c |  405 ---
 1 files changed, 168 insertions(+), 237 deletions(-)

diff --git a/opensm/opensm/osm_ucast_torus.c b/opensm/opensm/osm_ucast_torus.c
index 728e56c..ab0e6a6 100644
--- a/opensm/opensm/osm_ucast_torus.c
+++ b/opensm/opensm/osm_ucast_torus.c
@@ -1956,38 +1956,16 @@ struct f_switch *tfind_2d_perpendicular(struct t_switch *tsw0,
 	return ffind_2d_perpendicular(tsw0->tmp, tsw1->tmp, tsw2->tmp);
 }
 
-/*
- * These functions return true when it safe to call
- * tfind_3d_perpendicular()/ffind_3d_perpendicular().
- */
 static
-bool safe_x_perpendicular(struct torus *t, int i, int j, int k)
+bool safe_x_ring(struct torus *t, int i, int j, int k)
 {
-	int jm1, jp1, jp2, km1, kp1, kp2;
-
-	/*
-	 * If the dimensions perpendicular to the search direction are
-	 * not radix 4 torus dimensions, it is always safe to search for
-	 * a perpendicular.
-	 */
-	if ((t->y_sz != 4 && t->z_sz != 4) ||
-	    (t->flags & Y_MESH && t->flags & Z_MESH) ||
-	    (t->y_sz != 4 && (t->flags & Z_MESH)) ||
-	    (t->z_sz != 4 && (t->flags & Y_MESH)))
-		return true;
-
-	jm1 = canonicalize(j - 1, t->y_sz);
-	jp1 = canonicalize(j + 1, t->y_sz);
-	jp2 = canonicalize(j + 2, t->y_sz);
-
-	km1 = canonicalize(k - 1, t->z_sz);
-	kp1 = canonicalize(k + 1, t->z_sz);
-	kp2 = canonicalize(k + 2, t->z_sz);
+	int im1, ip1, ip2;
+	bool success = true;
 
 	/*
-	 * Here we are checking for enough appropriate links having been
-	 * installed into the torus to prevent an incorrect link from being
-	 * considered as a perpendicular candidate.
+	 * If this x-direction radix-4 ring has at least two links
+	 * already installed into the torus,  then this ring does not
+	 * prevent us from looking for y or z direction perpendiculars.
 	 *
 	 * It is easier to check for the appropriate switches being installed
 	 * into the torus than it is to check for the links, so force the
@@ -1995,93 +1973,111 @@ bool safe_x_perpendicular(struct torus *t, int i, int j, int k)
 	 *
 	 * Recall that canonicalize(n - 2, 4) == canonicalize(n + 2, 4).
 	 */
-	if (((!!t->sw[i][jm1][k] +
-	      !!t->sw[i][jp1][k] + !!t->sw[i][jp2][k] >= 2) &&
-	     (!!t->sw[i][j][km1] +
-	      !!t->sw[i][j][kp1] + !!t->sw[i][j][kp2] >= 2))) {
-
-		bool success = true;
-
-		if (t->sw[i][jp2][k] && t->sw[i][jm1][k])
-			success = link_tswitches(t, 1,
-						 t->sw[i][jp2][k],
-						 t->sw[i][jm1][k])
-				&& success;
-
-		if (t->sw[i][jm1][k] && t->sw[i][j][k])
-			success = link_tswitches(t, 1,
-						 t->sw[i][jm1][k],
-						 t->sw[i][j][k])
-				&& success;
-
-		if (t->sw[i][j][k] && t->sw[i][jp1][k])
-			success = link_tswitches(t, 1,
-						 t->sw[i][j][k],
-						 t->sw[i][jp1][k])
-				&& success;
-
-		if (t->sw[i][jp1][k] && t->sw[i][jp2][k])
-			success = link_tswitches(t, 1,
-						 t->sw[i][jp1][k],
-						 t->sw[i][jp2][k])
-				&& success;
-
-		if (t->sw[i][j][kp2] && t->sw[i][j][km1])
-			success = link_tswitches(t, 2,
-						 t->sw[i][j][kp2],
-						 t->sw[i][j][km1])
-				&& success;
-
-		if (t->sw[i][j][km1] && t->sw[i][j][k])
-			success = link_tswitches(t, 2,
-						 t->sw[i][j][km1],
-						 t->sw[i][j][k])
-

opensm: Status of torus-2QoS patchset?

2010-02-16 Thread Jim Schutt
Hi Sasha,

Do you have any feedback regarding my patches to add
a new routing module specialized for 2D/3D torus topologies?
I was hoping there was some chance this work might make it
into the OFED 1.6 release.

Thanks -- Jim







Re: [ewg] MLX4 Strangeness

2010-02-16 Thread Tom Tucker

Tziporet Koren wrote:

On 2/15/2010 10:24 PM, Tom Tucker wrote:
  

Hello,

I am seeing some very strange behavior on my MLX4 adapters running 2.7
firmware and the latest OFED 1.5.1. Two systems are involved and each
has a dual-ported MTHCA DDR adapter and MLX4 adapters.

The scenario starts with NFSRDMA stress testing between the two systems
running bonnie++ and iozone concurrently. The test completes and there
is no issue. Then 6 minutes pass and the server times out the
connection and shuts down the RC connection to the client.

  From this point on, using the RDMA CM, a new RC QP can be brought up
and moved to RTS, however, the first RDMA_SEND to the NFS SERVER system
fails with IB_WC_RETRY_EXC_ERR. I have confirmed:

- that arp completed successfully and the neighbor entries are
populated on both the client and server
- that the QP are in the RTS state on both the client and server
- that there are RECV WR posted to the RQ on the server and they did not
error out
- that no RECV WR completed successfully or in error on the server
- that there are SEND WR posted to the QP on the client
- the client side SEND_WR fails with error 12 as mentioned above

I have also confirmed the following with a different application (i.e.
rping):

server# rping -s
client# rping -c -a 192.168.80.129

fails with the exact same error, i.e.
client# rping -c -a 192.168.80.129
cq completion failed status 12
wait for RDMA_WRITE_ADV state 10
client DISCONNECT EVENT...

However, if I run rping the other way, it works fine, that is,

client# rping -s
server# rping -c -a 192.168.80.135

It runs without error until I stop it.

Does anyone have any ideas on how I might debug this?


   


Tom
What is the vendor syndrome error when you get a completion with error?

  
Feb 16 15:08:29 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 closed (-103)
Feb 16 15:51:27 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
Feb 16 15:52:01 vic10 kernel: rpcrdma_event_process:160 wr_id 
81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 
81003c9e3200 ex  src_qp  wc_flags, 0 pkey_index
Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 closed (-103)
Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
Feb 16 15:52:40 vic10 kernel: rpcrdma_event_process:160 wr_id 
81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 
81002f2d8400 ex  src_qp  wc_flags, 0 pkey_index


Repeat forever

So the vendor err is 244.

Does the issue occur only on the ConnectX cards (mlx4) or also on the
InfiniHost cards (mthca)?


Tziporet





is it possible to avoid syncing after an rdma write?

2010-02-16 Thread Andy Grover
Right now, RDS follows each RDMA write op with a Send op, which 1)
causes an interrupt and 2) includes the info we need to call
ib_dma_sync_sg_for_cpu() for the target of the rdma write.

We want to omit the Send. If we don't do the sync on the machine that is
the target of the RDMA write, the result is... what exactly? I assume
the write to memory is snooped by CPUs, so their cachelines will be
properly invalidated. However, Linux DMA-API docs seem pretty clear in
insisting on the sync.
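
For concreteness, the receive-side pattern in question looks roughly like the
sketch below (illustration only, not the actual RDS code path; the
single-buffer sync variant is used for brevity):

#include <rdma/ib_verbs.h>

/* Illustrative sketch: once the trailing Send completes, the target of
 * the RDMA write syncs the buffer back to the CPU before reading it,
 * as the DMA-API documentation asks for dma-mapped memory. */
static void rdma_write_target_ready(struct ib_device *dev,
				    u64 dma_addr, size_t len)
{
	ib_dma_sync_single_for_cpu(dev, dma_addr, len, DMA_FROM_DEVICE);
	/* ... the CPU may now read the data written by the peer ... */
}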

Is the issue IOMMUs? Or for compatibility with bounce buffering?

Thanks in advance -- Regards -- Andy


Re: [ewg] MLX4 Strangeness

2010-02-16 Thread Tom Tucker

Tom Tucker wrote:

Tziporet Koren wrote:

On 2/15/2010 10:24 PM, Tom Tucker wrote:
 

Hello,

I am seeing some very strange behavior on my MLX4 adapters running 2.7
firmware and the latest OFED 1.5.1. Two systems are involved and each
has a dual-ported MTHCA DDR adapter and MLX4 adapters.

The scenario starts with NFSRDMA stress testing between the two systems
running bonnie++ and iozone concurrently. The test completes and there
is no issue. Then 6 minutes pass and the server times out the
connection and shuts down the RC connection to the client.

  From this point on, using the RDMA CM, a new RC QP can be brought up
and moved to RTS, however, the first RDMA_SEND to the NFS SERVER system
fails with IB_WC_RETRY_EXC_ERR. I have confirmed:

- that arp completed successfully and the neighbor entries are
populated on both the client and server
- that the QP are in the RTS state on both the client and server
- that there are RECV WR posted to the RQ on the server and they did 
not

error out
- that no RECV WR completed successfully or in error on the server
- that there are SEND WR posted to the QP on the client
- the client side SEND_WR fails with error 12 as mentioned above

I have also confirmed the following with a different application (i.e.
rping):

server# rping -s
client# rping -c -a 192.168.80.129

fails with the exact same error, i.e.
client# rping -c -a 192.168.80.129
cq completion failed status 12
wait for RDMA_WRITE_ADV state 10
client DISCONNECT EVENT...

However, if I run rping the other way, it works fine, that is,

client# rping -s
server# rping -c -a 192.168.80.135

It runs without error until I stop it.

Does anyone have any ideas on how I might debug this?


   

Tom
What is the vendor syndrome error when you get a completion with error?

  
Feb 16 15:08:29 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 closed (-103)
Feb 16 15:51:27 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
Feb 16 15:52:01 vic10 kernel: rpcrdma_event_process:160 wr_id 
81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 
81003c9e3200 ex  src_qp  wc_flags, 0 pkey_index
Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 closed (-103)
Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
Feb 16 15:52:40 vic10 kernel: rpcrdma_event_process:160 wr_id 
81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 
81002f2d8400 ex  src_qp  wc_flags, 0 pkey_index


Repeat forever

So the vendor err is 244.



Please ignore this. This log skips the failing WR (:-\). I need to do 
another trace.




Does the issue occur only on the ConnectX cards (mlx4) or also on
the InfiniHost cards (mthca)?


Tziporet








Re: [ewg] MLX4 Strangeness

2010-02-16 Thread Tom Tucker


More info...

Rebooting the client and trying to reconnect to a server that has not been
rebooted fails in the same way.


It must be an issue with the server. I see no completions on the server 
or any indication that an RDMA_SEND was incoming. Is there some way to 
dump adapter state or otherwise see if there was traffic on the wire?


Tom


Tom Tucker wrote:

Tom Tucker wrote:

Tziporet Koren wrote:

On 2/15/2010 10:24 PM, Tom Tucker wrote:
 

Hello,

I am seeing some very strange behavior on my MLX4 adapters running 2.7
firmware and the latest OFED 1.5.1. Two systems are involved and each
has a dual-ported MTHCA DDR adapter and MLX4 adapters.

The scenario starts with NFSRDMA stress testing between the two 
systems

running bonnie++ and iozone concurrently. The test completes and there
is no issue. Then 6 minutes pass and the server times out the
connection and shuts down the RC connection to the client.

  From this point on, using the RDMA CM, a new RC QP can be brought up
and moved to RTS, however, the first RDMA_SEND to the NFS SERVER 
system

fails with IB_WC_RETRY_EXC_ERR. I have confirmed:

- that arp completed successfully and the neighbor entries are
populated on both the client and server
- that the QP are in the RTS state on both the client and server
- that there are RECV WR posted to the RQ on the server and they 
did not

error out
- that no RECV WR completed successfully or in error on the server
- that there are SEND WR posted to the QP on the client
- the client side SEND_WR fails with error 12 as mentioned above

I have also confirmed the following with a different application (i.e.
rping):

server# rping -s
client# rping -c -a 192.168.80.129

fails with the exact same error, i.e.
client# rping -c -a 192.168.80.129
cq completion failed status 12
wait for RDMA_WRITE_ADV state 10
client DISCONNECT EVENT...

However, if I run rping the other way, it works fine, that is,

client# rping -s
server# rping -c -a 192.168.80.135

It runs without error until I stop it.

Does anyone have any ideas on how I might debug this?


   

Tom
What is the vendor syndrome error when you get a completion with error?

  
Feb 16 15:08:29 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 closed (-103)
Feb 16 15:51:27 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
Feb 16 15:52:01 vic10 kernel: rpcrdma_event_process:160 wr_id 
81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 
81003c9e3200 ex  src_qp  wc_flags, 0 pkey_index
Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 closed (-103)
Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 
192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
Feb 16 15:52:40 vic10 kernel: rpcrdma_event_process:160 wr_id 
81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 
81002f2d8400 ex  src_qp  wc_flags, 0 pkey_index


Repeat forever

So the vendor err is 244.



Please ignore this. This log skips the failing WR (:-\). I need to do 
another trace.




Does the issue occur only on the ConnectX cards (mlx4) or also on
the InfiniHost cards (mthca)?


Tziporet











Re: is it possible to avoid syncing after an rdma write?

2010-02-16 Thread Jason Gunthorpe
On Tue, Feb 16, 2010 at 03:29:48PM -0800, Andy Grover wrote:
 Right now, RDS follows each RDMA write op with a Send op, which 1)
 causes an interrupt and 2) includes the info we need to call
 ib_dma_sync_sg_for_cpu() for the target of the rdma write.
 
 We want to omit the Send. If we don't do the sync on the machine that is
 the target of the RDMA write, the result is... what exactly? I assume
 the write to memory is snooped by CPUs, so their cachelines will be
 properly invalidated. However, Linux DMA-API docs seem pretty clear in
 insisting on the sync.

I'm curious about this too, but I will point out that at least the
user RDMA interface has no match for the kernel DMA calls, so in
practice RDMA does not work on systems that require them. That means
bounce buffering is not used and IO/CPU caches are coherent.

Though, I guess, the kernel could use weaker memory ordering types in
kernel mode that do require the DMA api calls.

 Is the issue IOMMUs? Or for compatibility with bounce buffering?

As long as the memory is registered the IOMMU should remain
configured.

What do you intend to replace the SEND with? spin on last byte? There
are other issues to consider like ordering within the PCI-E fabric..

Jason


RE: is it possible to avoid syncing after an rdma write?

2010-02-16 Thread Paul Grun
Why not use an RDMA write w/ immed?  That forces the consumption of a
receive WQE and can be used to create a completion event.  Since the
immediate data is carried in the last packet of a multi-packet RDMA write,
you are guaranteed that all data has been placed in the receive buffer, in
order.
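
A minimal user-space sketch of posting such a write with libibverbs
(illustration only; the QP, the registered local buffer/lkey, and the peer's
remote_addr/rkey are assumed to have been set up and exchanged beforehand):

#include <arpa/inet.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Sketch: post a single RDMA WRITE with immediate data.  The immediate
 * travels in the last packet of the write, consumes a receive WQE on
 * the target, and generates a completion there only after all of the
 * payload has been placed. */
static int post_rdma_write_with_imm(struct ibv_qp *qp,
				    void *buf, uint32_t len, uint32_t lkey,
				    uint64_t remote_addr, uint32_t rkey,
				    uint32_t imm)
{
	struct ibv_sge sge = {
		.addr	= (uintptr_t)buf,
		.length	= len,
		.lkey	= lkey,
	};
	struct ibv_send_wr wr = {
		.wr_id		= 1,
		.sg_list	= &sge,
		.num_sge	= 1,
		.opcode		= IBV_WR_RDMA_WRITE_WITH_IMM,
		.send_flags	= IBV_SEND_SIGNALED,
		.imm_data	= htonl(imm),
	};
	struct ibv_send_wr *bad_wr;

	wr.wr.rdma.remote_addr = remote_addr;
	wr.wr.rdma.rkey = rkey;

	return ibv_post_send(qp, &wr, &bad_wr);
}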

I'm a hardware guy, so this may be completely off-the-wall w.r.t. this
particular discussion.
-Paul

-Original Message-
From: linux-rdma-ow...@vger.kernel.org
[mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Jason Gunthorpe
Sent: Tuesday, February 16, 2010 4:58 PM
To: Andy Grover
Cc: linux-rdma@vger.kernel.org
Subject: Re: is it possible to avoid syncing after an rdma write?

On Tue, Feb 16, 2010 at 03:29:48PM -0800, Andy Grover wrote:
 Right now, RDS follows each RDMA write op with a Send op, which 1)
 causes an interrupt and 2) includes the info we need to call
 ib_dma_sync_sg_for_cpu() for the target of the rdma write.
 
 We want to omit the Send. If we don't do the sync on the machine that is
 the target of the RDMA write, the result is... what exactly? I assume
 the write to memory is snooped by CPUs, so their cachelines will be
 properly invalidated. However, Linux DMA-API docs seem pretty clear in
 insisting on the sync.

I'm curious about this too, but I will point out that at least the
user RDMA interface has no match for the kernel DMA calls, so in
practice RDMA does not work on systems that require them. That means
bounce buffering is not used and IO/CPU caches are coherent.

Though, I guess, the kernel could use weaker memory ordering types in
kernel mode that do require the DMA api calls.

 Is the issue IOMMUs? Or for compatibility with bounce buffering?

As long as the memory is registered the IOMMU should remain
configured.

What do you intend to replace the SEND with? spin on last byte? There
are other issues to consider like ordering within the PCI-E fabric..

Jason




RE: is it possible to avoid syncing after an rdma write?

2010-02-16 Thread Paul Grun
Two advantages come to mind vs. an RDMA Write followed by a SEND: using a
SEND will consume a second WQE on the send side, and the synchronizing SEND
will cause an entire new transaction, which will consume a(n infinitesimally)
small amount of additional wire bandwidth, as well as incurring a(n
infinitesimally) small likelihood of a dropped or lost packet.

Nits?  Yes, probably infinitesimally small ones.   (hardware guys tend to
worry about the small ones.)

To answer Andy's original question, the behavior on the receive side is not
guaranteed until control of the receive buffer has been formally returned to
the receiver.  I expect that most HCAs are pretty well behaved here, as are
most CPU/memory/root complexes...but you never know.  Can anybody guarantee
that the inbound packet gets written to the memory in order?  

If something odd did happen, it seems like one of those places that would
require an incredible stroke of luck to debug.

OTOH, I know that many applications simply poll the receive buffer looking
for a flag every day and get away with it.
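
For reference, that idiom is roughly the sketch below (illustration only; it
leans on exactly the in-order placement assumption questioned above):

#include <stdint.h>

/* Sketch of the "poll the receive buffer for a flag" idiom: the sender
 * arranges for a flag byte to land after (or as the last byte of) the
 * payload; the receiver spins on it instead of waiting for a
 * completion.  Correctness rests on the data being placed in order. */
static void wait_for_flag(volatile uint8_t *flag, uint8_t expected)
{
	while (*flag != expected)
		;	/* busy-wait; real code would bound, pause, or yield */
}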

-Original Message-
From: Jason Gunthorpe [mailto:jguntho...@obsidianresearch.com] 
Sent: Tuesday, February 16, 2010 5:12 PM
To: Paul Grun
Cc: 'Andy Grover'; linux-rdma@vger.kernel.org
Subject: Re: is it possible to avoid syncing after an rdma write?

On Tue, Feb 16, 2010 at 05:05:21PM -0800, Paul Grun wrote:
 Why not use an RDMA write w/ immed?  That forces the consumption of a
 receive WQE and can be used to create a completion event.  Since the
 immediate data is carried in the last packet of a multi-packet RDMA write,
 you are guaranteed that all data has been placed in the receive buffer, in
 order.

Yes, RDMA WRITE w/ immediate data is perfectly fine. I've even
implemented some protocols that use it to good effect.

Not sure what the performance trade off is like though. The immediate
data pretty much behaves exactly like a SEND WC on the receive side,
but there may be some performance and latency advantages, particularly
on the send side.

Jason

