[PATCH] ib/ehca: fix in_wc handling in process_mad()
If the caller does not pass a valid in_wc to process_mad(), return MAD failure, as it is not possible to generate a valid MAD redirect response.

Signed-off-by: Alexander Schmidt <al...@linux.vnet.ibm.com>
---
Hi Roland,

this is another patch we would like to get in your next tree for 2.6.34.

 drivers/infiniband/hw/ehca/ehca_sqp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- linux-2.6.orig/drivers/infiniband/hw/ehca/ehca_sqp.c
+++ linux-2.6/drivers/infiniband/hw/ehca/ehca_sqp.c
@@ -222,7 +222,7 @@ int ehca_process_mad(struct ib_device *i
 {
 	int ret;
 
-	if (!port_num || port_num > ibdev->phys_port_cnt)
+	if (!port_num || port_num > ibdev->phys_port_cnt || !in_wc)
 		return IB_MAD_RESULT_FAILURE;
 
 	/* accept only pma request */
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [ewg] MLX4 Strangeness
Tziporet Koren wrote:
> On 2/15/2010 10:24 PM, Tom Tucker wrote:
>> Hello,
>>
>> I am seeing some very strange behavior on my MLX4 adapters running 2.7 firmware and the latest OFED 1.5.1. Two systems are involved, and each has a dual-ported MTHCA DDR adapter and an MLX4 adapter.
>>
>> The scenario starts with NFSRDMA stress testing between the two systems running bonnie++ and iozone concurrently. The test completes and there is no issue. Then 6 minutes pass and the server times out the connection and shuts down the RC connection to the client. From this point on, using the RDMA CM, a new RC QP can be brought up and moved to RTS; however, the first RDMA_SEND to the NFS server system fails with IB_WC_RETRY_EXC_ERR.
>>
>> I have confirmed:
>> - that ARP completed successfully and the neighbor entries are populated on both the client and server
>> - that the QPs are in the RTS state on both the client and server
>> - that there are RECV WRs posted to the RQ on the server and they did not error out
>> - that no RECV WR completed successfully or in error on the server
>> - that there are SEND WRs posted to the QP on the client
>> - the client-side SEND WR fails with error 12 as mentioned above
>>
>> I have also confirmed the following with a different application (i.e. rping):
>>
>> server# rping -s
>> client# rping -c -a 192.168.80.129
>>
>> fails with the exact same error, i.e.
>>
>> client# rping -c -a 192.168.80.129
>> cq completion failed status 12
>> wait for RDMA_WRITE_ADV state 10
>> client DISCONNECT EVENT...
>>
>> However, if I run rping the other way, it works fine, that is,
>>
>> client# rping -s
>> server# rping -c -a 192.168.80.135
>>
>> It runs without error until I stop it.
>>
>> Does anyone have any ideas on how I might debug this?
>>
>> Tom
>
> What is the vendor syndrome error when you get a completion with error?

Hang on... compiling

> Does the issue occur only on the ConnectX cards (mlx4) or also on the InfiniHost cards (mthca)?

Only the MLX4 cards.
> Tziporet
_______________________________________________
ewg mailing list
e...@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
[PATCH 1/3] opensm: Use local variables when searching for torus-2QoS master spanning tree root.
Otherwise 1) presence of the wrong switches is checked; and 2) the y-loop in good_xy_ring() can segfault on an out-of-bounds switch array x index.

Signed-off-by: Jim Schutt <jasc...@sandia.gov>
---
 opensm/opensm/osm_ucast_torus.c | 13 +++++++------
 1 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/opensm/opensm/osm_ucast_torus.c b/opensm/opensm/osm_ucast_torus.c
index e2eb324..728e56c 100644
--- a/opensm/opensm/osm_ucast_torus.c
+++ b/opensm/opensm/osm_ucast_torus.c
@@ -8751,22 +8751,23 @@ ib_api_status_t torus_mcast_stree(void *context, osm_mgrp_box_t *mgb)
 }
 
 static
-bool good_xy_ring(struct torus *t, int x, int y, int z)
+bool good_xy_ring(struct torus *t, const int x, const int y, const int z)
 {
 	struct t_switch ****sw = t->sw;
 	bool good_ring = true;
+	int x_tst, y_tst;
 
-	for (x = 0; x < t->x_sz && good_ring; x++)
-		good_ring = sw[x][y][z];
+	for (x_tst = 0; x_tst < t->x_sz && good_ring; x_tst++)
+		good_ring = sw[x_tst][y][z];
 
-	for (y = 0; y < t->y_sz && good_ring; y++)
-		good_ring = sw[x][y][z];
+	for (y_tst = 0; y_tst < t->y_sz && good_ring; y_tst++)
+		good_ring = sw[x][y_tst][z];
 
 	return good_ring;
 }
 
 static
-struct t_switch *find_plane_mid(struct torus *t, int z)
+struct t_switch *find_plane_mid(struct torus *t, const int z)
 {
 	int x, dx, xm = t->x_sz / 2;
 	int y, dy, ym = t->y_sz / 2;
-- 
1.5.6.GIT
[PATCH 0/3] opensm: Bug fixes for torus-2QoS patchset
These patches fix bugs discovered during further testing of the torus-2QoS routing module for OpenSM. (See http://www.spinics.net/lists/linux-rdma/msg01438.html and http://www.spinics.net/lists/linux-rdma/msg01938.html)

Jim Schutt (3):
  opensm: Use local variables when searching for torus-2QoS master spanning tree root.
  opensm: Fix handling of torus-2QoS topology discovery for radix 4 torus dimensions.
  opensm: Avoid havoc in dump_ucast_routes() caused by torus-2QoS persistent use of osm_port_t:priv.

 opensm/include/opensm/osm_switch.h |  12 +
 opensm/opensm/osm_dump.c           |   2 +-
 opensm/opensm/osm_switch.c         |   7 +-
 opensm/opensm/osm_ucast_mgr.c      |   1 +
 opensm/opensm/osm_ucast_torus.c    | 418 +++-
 5 files changed, 193 insertions(+), 247 deletions(-)
[PATCH 2/3] opensm: Fix handling of torus-2QoS topology discovery for radix 4 torus dimensions.
Torus-2QoS finds the torus topology in a fabric using an algorithm that looks for 8 adjacent switches which form the corners of a cube, by looking for 4 adjacent switches which form the corners of a face on that cube.

When a torus dimension has radix 4 (e.g. the y dimension in a 5x4x8 torus), 1-D rings which span that dimension cannot be distinguished topologically from the faces the algorithm is trying to construct. Code that prevents that situation from arising should only be applied in cases where a torus dimension has radix 4, but due to a missing test, it could be applied inappropriately.

This commit fixes the bug by adding the missing test. It also restructures the code in question to remove code duplication by adding helper functions.

Signed-off-by: Jim Schutt <jasc...@sandia.gov>
---
 opensm/opensm/osm_ucast_torus.c | 405 ++++++++++++++-----------------
 1 files changed, 168 insertions(+), 237 deletions(-)

diff --git a/opensm/opensm/osm_ucast_torus.c b/opensm/opensm/osm_ucast_torus.c
index 728e56c..ab0e6a6 100644
--- a/opensm/opensm/osm_ucast_torus.c
+++ b/opensm/opensm/osm_ucast_torus.c
@@ -1956,38 +1956,16 @@ struct f_switch *tfind_2d_perpendicular(struct t_switch *tsw0,
 	return ffind_2d_perpendicular(tsw0->tmp, tsw1->tmp, tsw2->tmp);
 }
 
-/*
- * These functions return true when it safe to call
- * tfind_3d_perpendicular()/ffind_3d_perpendicular().
- */
 static
-bool safe_x_perpendicular(struct torus *t, int i, int j, int k)
+bool safe_x_ring(struct torus *t, int i, int j, int k)
 {
-	int jm1, jp1, jp2, km1, kp1, kp2;
-
-	/*
-	 * If the dimensions perpendicular to the search direction are
-	 * not radix 4 torus dimensions, it is always safe to search for
-	 * a perpendicular.
-	 */
-	if ((t->y_sz != 4 && t->z_sz != 4) ||
-	    (t->flags & Y_MESH && t->flags & Z_MESH) ||
-	    (t->y_sz != 4 && (t->flags & Z_MESH)) ||
-	    (t->z_sz != 4 && (t->flags & Y_MESH)))
-		return true;
-
-	jm1 = canonicalize(j - 1, t->y_sz);
-	jp1 = canonicalize(j + 1, t->y_sz);
-	jp2 = canonicalize(j + 2, t->y_sz);
-
-	km1 = canonicalize(k - 1, t->z_sz);
-	kp1 = canonicalize(k + 1, t->z_sz);
-	kp2 = canonicalize(k + 2, t->z_sz);
+	int im1, ip1, ip2;
+	bool success = true;
 
 	/*
-	 * Here we are checking for enough appropriate links having been
-	 * installed into the torus to prevent an incorrect link from being
-	 * considered as a perpendicular candidate.
+	 * If this x-direction radix-4 ring has at least two links
+	 * already installed into the torus, then this ring does not
+	 * prevent us from looking for y or z direction perpendiculars.
 	 *
 	 * It is easier to check for the appropriate switches being installed
 	 * into the torus than it is to check for the links, so force the
@@ -1995,93 +1973,111 @@ bool safe_x_perpendicular(struct torus *t, int i, int j, int k)
 	 *
 	 * Recall that canonicalize(n - 2, 4) == canonicalize(n + 2, 4).
 	 */
-	if (((!!t->sw[i][jm1][k] +
-	      !!t->sw[i][jp1][k] + !!t->sw[i][jp2][k] >= 2) &&
-	     (!!t->sw[i][j][km1] +
-	      !!t->sw[i][j][kp1] + !!t->sw[i][j][kp2] >= 2))) {
-
-		bool success = true;
-
-		if (t->sw[i][jp2][k] && t->sw[i][jm1][k])
-			success = link_tswitches(t, 1,
-						 t->sw[i][jp2][k],
-						 t->sw[i][jm1][k])
-				&& success;
-
-		if (t->sw[i][jm1][k] && t->sw[i][j][k])
-			success = link_tswitches(t, 1,
-						 t->sw[i][jm1][k],
-						 t->sw[i][j][k])
-				&& success;
-
-		if (t->sw[i][j][k] && t->sw[i][jp1][k])
-			success = link_tswitches(t, 1,
-						 t->sw[i][j][k],
-						 t->sw[i][jp1][k])
-				&& success;
-
-		if (t->sw[i][jp1][k] && t->sw[i][jp2][k])
-			success = link_tswitches(t, 1,
-						 t->sw[i][jp1][k],
-						 t->sw[i][jp2][k])
-				&& success;
-
-		if (t->sw[i][j][kp2] && t->sw[i][j][km1])
-			success = link_tswitches(t, 2,
-						 t->sw[i][j][kp2],
-						 t->sw[i][j][km1])
-				&& success;
-
-		if (t->sw[i][j][km1] && t->sw[i][j][k])
-			success = link_tswitches(t, 2,
-						 t->sw[i][j][km1],
-						 t->sw[i][j][k])
-
opensm: Status of torus-2QoS patchset?
Hi Sasha,

Do you have any feedback regarding my patches to add a new routing module specialized for 2D/3D torus topologies? I was hoping there was some chance this work might make it into the OFED 1.6 release.

Thanks -- Jim
Re: [ewg] MLX4 Strangeness
Tziporet Koren wrote:
> What is the vendor syndrome error when you get a completion with error?
Feb 16 15:08:29 vic10 kernel: rpcrdma: connection to 192.168.80.129:20049 closed (-103)
Feb 16 15:51:27 vic10 kernel: rpcrdma: connection to 192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
Feb 16 15:52:01 vic10 kernel: rpcrdma_event_process:160 wr_id 81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 81003c9e3200 ex src_qp wc_flags, 0 pkey_index
Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 192.168.80.129:20049 closed (-103)
Feb 16 15:52:06 vic10 kernel: rpcrdma: connection to 192.168.80.129:20049 on mlx4_0, memreg 5 slots 32 ird 16
Feb 16 15:52:40 vic10 kernel: rpcrdma_event_process:160 wr_id 81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 81002f2d8400 ex src_qp wc_flags, 0 pkey_index

Repeat forever. So the vendor err is 244.

> Does the issue occur only on the ConnectX cards (mlx4) or also on the InfiniHost cards (mthca)?
>
> Tziporet
is it possible to avoid syncing after an rdma write?
Right now, RDS follows each RDMA write op with a Send op, which 1) causes an interrupt and 2) includes the info we need to call ib_dma_sync_sg_for_cpu() for the target of the RDMA write. We want to omit the Send.

If we don't do the sync on the machine that is the target of the RDMA write, what exactly is the result? I assume the write to memory is snooped by the CPUs, so their cachelines will be properly invalidated. However, the Linux DMA-API docs seem pretty clear in insisting on the sync. Is the issue IOMMUs? Or compatibility with bounce buffering?

Thanks in advance -- Regards -- Andy
Re: [ewg] MLX4 Strangeness
Tom Tucker wrote:
> Tziporet Koren wrote:
>> What is the vendor syndrome error when you get a completion with error?
> Feb 16 15:52:40 vic10 kernel: rpcrdma_event_process:160 wr_id 81002879a000 status 5 opcode 0 vendor_err 244 byte_len 0 qp 81002f2d8400 ex src_qp wc_flags, 0 pkey_index
>
> Repeat forever. So the vendor err is 244.

Please ignore this. This log skips the failing WR (:-\). I need to do another trace.

>> Does the issue occur only on the ConnectX cards (mlx4) or also on the InfiniHost cards (mthca)?
>>
>> Tziporet
Re: [ewg] MLX4 Strangeness
More info...

Rebooting the client and trying to reconnect to a server that has not been rebooted fails in the same way, so it must be an issue with the server. I see no completions on the server or any indication that an RDMA_SEND was incoming. Is there some way to dump adapter state, or otherwise see if there was traffic on the wire?

Tom

Tom Tucker wrote:
> Please ignore this. This log skips the failing WR (:-\). I need to do another trace.
Re: is it possible to avoid syncing after an rdma write?
On Tue, Feb 16, 2010 at 03:29:48PM -0800, Andy Grover wrote:
> Right now, RDS follows each RDMA write op with a Send op, which 1) causes an interrupt and 2) includes the info we need to call ib_dma_sync_sg_for_cpu() for the target of the RDMA write. We want to omit the Send.
>
> If we don't do the sync on the machine that is the target of the RDMA write, the result is... what exactly? I assume the write to memory is snooped by CPUs, so their cachelines will be properly invalidated. However, Linux DMA-API docs seem pretty clear in insisting on the sync.

I'm curious about this too, but I will point out that at least the user RDMA interface has no match for the kernel DMA calls, so in practice RDMA does not work on systems that require them. That means bounce buffering is not used and IO/CPU caches are coherent. Though, I guess, the kernel could use weaker memory ordering types in kernel mode that do require the DMA API calls.

> Is the issue IOMMUs? Or for compatibility with bounce buffering?

As long as the memory is registered, the IOMMU should remain configured.

What do you intend to replace the SEND with? Spin on the last byte? There are other issues to consider, like ordering within the PCI-E fabric...

Jason
RE: is it possible to avoid syncing after an rdma write?
Why not use an RDMA write w/ immed? That forces the consumption of a receive WQE and can be used to create a completion event. Since the immediate data is carried in the last packet of a multi-packet RDMA write, you are guaranteed that all data has been placed in the receive buffer, in order.

I'm a hardware guy, so this may be completely off-the-wall w.r.t. this particular discussion.

-Paul

-----Original Message-----
From: linux-rdma-ow...@vger.kernel.org [mailto:linux-rdma-ow...@vger.kernel.org] On Behalf Of Jason Gunthorpe
Sent: Tuesday, February 16, 2010 4:58 PM
To: Andy Grover
Cc: linux-rdma@vger.kernel.org
Subject: Re: is it possible to avoid syncing after an rdma write?

What do you intend to replace the SEND with? Spin on the last byte? There are other issues to consider, like ordering within the PCI-E fabric...
Jason
RE: is it possible to avoid syncing after an rdma write?
Two advantages come to mind vs. an RDMA write followed by a SEND: using a SEND will consume a second WQE on the send side, and the synchronizing SEND will cause an entire new transaction, which will consume a(n infinitesimally) small amount of additional wire bandwidth, as well as incurring a(n infinitesimally) small likelihood of a dropped or lost packet. Nits? Yes, probably infinitesimally small ones. (Hardware guys tend to worry about the small ones.)

To answer Andy's original question, the behavior on the receive side is not guaranteed until control of the receive buffer has been formally returned to the receiver. I expect that most HCAs are pretty well behaved here, as are most CPU/memory/root complexes... but you never know. Can anybody guarantee that the inbound packet gets written to memory in order? If something odd did happen, it seems like one of those places that would require an incredible stroke of luck to debug. OTOH, I know that many applications simply poll the receive buffer looking for a flag every day and get away with it.

-----Original Message-----
From: Jason Gunthorpe [mailto:jguntho...@obsidianresearch.com]
Sent: Tuesday, February 16, 2010 5:12 PM
To: Paul Grun
Cc: 'Andy Grover'; linux-rdma@vger.kernel.org
Subject: Re: is it possible to avoid syncing after an rdma write?

On Tue, Feb 16, 2010 at 05:05:21PM -0800, Paul Grun wrote:
> Why not use an RDMA write w/ immed? That forces the consumption of a receive WQE and can be used to create a completion event. Since the immediate data is carried in the last packet of a multi-packet RDMA write, you are guaranteed that all data has been placed in the receive buffer, in order.

Yes, RDMA WRITE w/ immediate data is perfectly fine. I've even implemented some protocols that use it to good effect. Not sure what the performance trade-off is like, though.
The immediate data pretty much behaves exactly like a SEND WC on the receive side, but there may be some performance and latency advantages, particularly on the send side.

Jason