[openib-general] [PATCH] Opensm - osm_mcast_mgr.c add type casting
Hi Hal,

The following patch adds a missing type cast to the return value of the function osm_mcast_mgr_compute_max_hops.

Thanks,
Yael

Signed-off-by: Yael Kalka [EMAIL PROTECTED]

Index: osm_mcast_mgr.c
===
--- osm_mcast_mgr.c (revision 5307)
+++ osm_mcast_mgr.c (working copy)
@@ -269,7 +269,7 @@ osm_mcast_mgr_compute_max_hops(
     }
     OSM_LOG_EXIT( p_mgr->p_log );
-    return( max_hops );
+    return(float)(max_hops);
 }
 /**

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[openib-general] [PATCH] Opensm - type changing in st.h/c files
Hi Hal,

Some of the types defined in st.h caused problems when compiling on 64-bit Windows machines. The following patch adds support for these machines as well.

Thanks,
Yael

Signed-off-by: Yael Kalka [EMAIL PROTECTED]

Index: include/opensm/st.h
===
--- include/opensm/st.h (revision 5307)
+++ include/opensm/st.h (working copy)
@@ -50,14 +50,21 @@
 BEGIN_C_DECLS
-typedef unsigned long st_data_t;
+#if (__WORDSIZE == 64) || defined (_WIN64)
+#define st_ptr_t unsigned long long
+#else
+#define st_ptr_t unsigned long
+#endif
+
+typedef st_ptr_t st_data_t;
+
 #define ST_DATA_T_DEFINED
 typedef struct st_table st_table;
 struct st_hash_type {
     int (*compare)(void *, void *);
-    int (*hash)(void *);
+    st_ptr_t (*hash)(void *);
 };
 struct st_table {
Index: opensm/st.c
===
--- opensm/st.c (revision 5307)
+++ opensm/st.c (working copy)
@@ -41,7 +41,6 @@
 # include <config.h>
 #endif /* HAVE_CONFIG_H */
-#include <config.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
@@ -73,7 +72,7 @@ struct st_table_entry {
  *
  */
 static int numcmp(void *, void *);
-static int numhash(void *);
+static st_ptr_t numhash(void *);
 static struct st_hash_type type_numhash = {
     numcmp,
     numhash,
@@ -83,7 +82,7 @@ static struct st_hash_type type_numhash
 /* extern int strcmp(const char *, const char *); */
 static int strhash(const char *);
-static inline int st_strhash(void *key)
+static inline st_ptr_t st_strhash(void *key)
 {
     return strhash((const char *)key);
 }
@@ -619,12 +618,12 @@ static int
 numcmp(x, y)
     void *x, *y;
 {
-    return (long)x != (long)y;
+    return (st_ptr_t)x != (st_ptr_t)y;
 }
-static int
+static st_ptr_t
 numhash(n)
     void *n;
 {
-    return (long)n;
+    return (st_ptr_t)n;
 }
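A note on why the st.h/st.c change above matters: 64-bit Windows keeps `long` at 32 bits while pointers are 64 bits wide, so a cast such as `(long)x` can drop the upper half of a pointer. The following is only a sketch of the same idea using the C99 type `uintptr_t` instead of the `st_ptr_t` macro from the patch. The names `ptr_int_t`, `num_cmp`, `num_hash` and `round_trip` are made up for this illustration and do not appear in OpenSM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for st_ptr_t: an unsigned integer wide enough
 * to hold a pointer on both LP64 (Linux) and LLP64 (64-bit Windows). */
typedef uintptr_t ptr_int_t;

/* Compile-time check: the array size becomes -1, and compilation fails,
 * if ptr_int_t is narrower than a pointer. */
typedef char ptr_int_fits_pointer[sizeof(ptr_int_t) >= sizeof(void *) ? 1 : -1];

/* Same contract as numcmp() in the patch: 0 when the keys are equal. */
static int num_cmp(void *x, void *y)
{
	return (ptr_int_t)x != (ptr_int_t)y;
}

/* Same idea as numhash(): the key's integer value is used as its hash. */
static ptr_int_t num_hash(void *n)
{
	return (ptr_int_t)n;
}

/* Round-trip a pointer through the integer type without losing bits. */
static void *round_trip(void *p)
{
	ptr_int_t v = (ptr_int_t)p;

	assert(sizeof v >= sizeof p);
	return (void *)v;
}
```

Whether `uintptr_t` is available depends on the compiler; the patch's `#if`/`#define` approach avoids that dependency, which may be why it was chosen.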
[openib-general] Re: openib and mellanox hca problem
I wonder whether we managed to locate the bridge. It would be interesting to build mthca with debug enabled.

Quoting r. Michael Di Domenico [EMAIL PROTECTED]:
> What specifically would you like to know?
>
> On 2/7/06, Roland Dreier [EMAIL PROTECTED] wrote:
> > > Feb 7 16:59:48 linux14-ts kernel: ib_mthca :07:00.0: PCI device did not come back after reset, aborting.
> >
> > Can you give more details on the system where you saw this?
> >
> > - R.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies
[openib-general] Re: openib and mellanox hca problem
If you really suspect timing issues, you can always increase timeouts: look for msleep in mthca_reset.c and try bumping up the numbers.

Anyway - could you please enable mthca debug in menuconfig? This would give us some more information on what's going on.

Quoting r. Ranjit Pandit [EMAIL PROTECTED]:
> Subject: Re: openib and mellanox hca problem
>
> Michael,
>
> I have seen this problem before. See the following mail thread:
> http://www.mail-archive.com/openib-general@openib.org/msg13861.html
>
> Commenting out the call to mthca_reset() in mthca_main.c worked around the problem on my system and, as far as I can tell, did not have any negative impact.
>
> It would be good if someone reviewed the reset path in mthca.
>
> Ranjit
>
> On 2/7/06, Michael Di Domenico [EMAIL PROTECTED] wrote:
> > I'm trying to build a system using the openib drivers with a mellanox hca card. I don't have much information about the card itself, it's in a server right now...
> >
> > But I downloaded openib today from the svn source and installed it onto a fresh copy of Fedora Core 4 with kernel version 2.6.15.3...
> >
> > Everything seemed to compile fine and install okay. I've been following the instructions from the wiki page thus far without a problem.
> >
> > I get up to this step
> >
> > modprobe ib_mthca
> >
> > and get the below error in /var/log/messages. Strangely enough all the modules load, and I do a udevstart, but I never get a /dev/infiniband directory and the /sys/class/infiniband directory is empty.
> >
> > Does anyone know how I might fix this, or point me to some better documentation than what is on the wiki?
> >
> > Thanks
> > - Michael
> >
> > Feb 7 16:59:37 linux14-ts kernel: ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005)
> > Feb 7 16:59:37 linux14-ts kernel: ib_mthca: Initializing :07:00.0
> > Feb 7 16:59:37 linux14-ts kernel: ACPI: PCI Interrupt :07:00.0[?] - GSI 26 (level, low) - IRQ 217
> > Feb 7 16:59:48 linux14-ts kernel: ib_mthca :07:00.0: PCI device did not come back after reset, aborting.
> > Feb 7 16:59:48 linux14-ts kernel: ib_mthca :07:00.0: Failed to reset HCA, aborting.
> > Feb 7 16:59:48 linux14-ts kernel: ACPI: PCI interrupt for device :07:00.0 disabled
> >
> > --- lspci output
> >
> > 06:03.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev ff) (prog-if ff)
> >         !!! Unknown header type 7f
> > 07:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev ff) (prog-if ff)
> >         !!! Unknown header type 7f

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies
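For readers following the "did not come back after reset" discussion above: a PCI device that does not answer configuration reads typically returns all ones, which is consistent with the "rev ff" and "Unknown header type 7f" lines in the lspci output. Below is a rough user-space sketch of a poll-with-timeout loop of the kind the msleep suggestion refers to. It is not the mthca_reset.c code; `read_vendor_fn`, `wait_for_device` and `fake_read_vendor` are hypothetical names invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback type: returns the 16-bit vendor ID read from
 * the device's PCI configuration space. Not an mthca or kernel API. */
typedef uint16_t (*read_vendor_fn)(void *ctx);

/* A config read from a device that is not responding returns all ones. */
#define VENDOR_ID_ABSENT 0xffffu

/*
 * Poll until the device answers config reads again, or give up.
 * Returns the number of polls used (>= 1) on success, -1 on timeout.
 * A real driver would sleep between polls; raising that sleep or the
 * poll count is what "bumping up the numbers" would amount to here.
 */
static int wait_for_device(read_vendor_fn read_vendor, void *ctx, int max_polls)
{
	int i;

	assert(read_vendor != NULL);
	for (i = 1; i <= max_polls; i++) {
		if (read_vendor(ctx) != VENDOR_ID_ABSENT)
			return i;
	}
	return -1;
}

/* Fake device for illustration: absent for the first *ctx reads. */
static uint16_t fake_read_vendor(void *ctx)
{
	int *reads_until_back = ctx;

	if (*reads_until_back > 0) {
		(*reads_until_back)--;
		return VENDOR_ID_ABSENT;
	}
	return 0x15b3; /* Mellanox's PCI vendor ID */
}
```

If the device never comes back, as in the log above, a longer timeout only delays the failure, so this kind of loop helps only when the problem really is timing.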
[openib-general] iser: cleanups changeset
A fairly large cleanup, as part of the preparations for the RFC.

 iscsi_iser.h     | 166
 iser_initiator.c |  69
 iser_memory.c    | 138
 iser_verbs.c     |  52
 4 files changed, 156 insertions(+), 269 deletions(-)

r5336 | ogerlitz | 2006-02-08 13:13:17 +0200 (Wed, 08 Feb 2006) | 4 lines

cleanups

Signed-off-by: Or Gerlitz [EMAIL PROTECTED]
[openib-general] trying to run cmpost example
Hi all,

One more newbie question. Here is my ib modules installation (2.6.15 kernel from ftp.kernel.org):

lsmod | grep ib
ib_umad     26472  0
ib_ucm      31992  0
ib_cm       50648  1 ib_ucm
ib_mthca   156244  0
ib_uverbs   57968  0
ib_ipoib    61736  0
ib_sa       24568  1 ib_ipoib
ib_mad      56548  4 ib_umad,ib_cm,ib_mthca,ib_sa
ib_core     71344  8 ib_umad,ib_ucm,ib_cm,ib_mthca,ib_uverbs,ib_ipoib,ib_sa,ib_mad

I run cmpost from the libibcm/example directory as root.

ls -la /dev/infiniband/ucm0 gives:
crw-r--r-- 1 root root 231, 255 2006-02-08 13:28 /dev/infiniband/ucm0

Prompt LD_LIBRARY_PATH=/usr/local/lib ./cmpost
libibcm: error -1:6 opening device /dev/infiniband/ucm0
starting server
listen request failed
test complete

Does somebody have an idea of what is missing? All my lib code comes from the svn repository; do I need to modify the 2.6.15 infiniband directory?

Thanks in advance,
xavier
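A possible way to narrow down the cmpost failure above: if the "-1:6" in the libibcm message is a return value followed by errno (an assumption, the message format is not documented in this thread), then 6 is ENXIO on Linux, "No such device or address". For a character device node this commonly means that no loaded driver has registered the node's major/minor numbers. The sketch below just opens a node and reports errno; `probe_device` is a hypothetical helper, not part of libibcm.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Try to open a device node read/write.
 * Returns 0 on success, otherwise the errno value from open().
 */
static int probe_device(const char *path)
{
	int fd, err;

	assert(path != NULL);
	fd = open(path, O_RDWR);
	if (fd < 0) {
		err = errno;
		fprintf(stderr, "open(%s) failed: %d (%s)\n",
			path, err, strerror(err));
		return err;
	}
	close(fd);
	return 0;
}
```

Calling `probe_device("/dev/infiniband/ucm0")` and comparing the result against ENXIO, ENOENT and EACCES would distinguish a stale device node from a missing one or a permissions problem.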
[openib-general] Re: ipoib_mcast_send.patch
Quoting r. Roland Dreier [EMAIL PROTECTED]:
> Subject: Re: ipoib_mcast_send.patch
>
>     Michael> I agree. Do you want to fix it or should I?
>
> If you get a chance that would be great. I'm at the OpenIB workshop now so I probably can't seriously look at it until tomorrow at the earliest.

Here you are. The following is in ipoib_broadcast_gid.patch in svn.

---

The way priv->broadcast is initialized in ipoib_mcast_join_task() is somewhat unsafe, since there's no lock and conceivably a send-only join could complete before priv->broadcast is fully set up.

Signed-off-by: Michael S. Tsirkin [EMAIL PROTECTED]

Index: openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===
--- openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c (revision 5336)
+++ openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c (working copy)
@@ -533,8 +533,9 @@ void ipoib_mcast_join_task(void *dev_ptr
     }
     if (!priv->broadcast) {
-        priv->broadcast = ipoib_mcast_alloc(dev, 1);
-        if (!priv->broadcast) {
+        struct ipoib_mcast *broadcast;
+        broadcast = ipoib_mcast_alloc(dev, 1);
+        if (!broadcast) {
             ipoib_warn(priv, "failed to allocate broadcast group\n");
             mutex_lock(&mcast_mutex);
             if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
@@ -544,10 +545,11 @@ void ipoib_mcast_join_task(void *dev_ptr
             return;
         }
-        memcpy(priv->broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
+        spin_lock_irq(&priv->lock);
+        priv->broadcast = broadcast;
+        memcpy(broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
                sizeof (union ib_gid));
-        spin_lock_irq(&priv->lock);
         __ipoib_mcast_add(dev, priv->broadcast);
         spin_unlock_irq(&priv->lock);
     }

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies
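The pattern in the patch above (allocate into a local pointer, then fill in and publish the shared pointer while holding the lock) can be shown in a small user-space sketch. This is an illustration only: it uses a pthread mutex in place of the kernel spinlock, and `struct group`, `struct dev_priv`, `setup_broadcast` and `read_broadcast_gid` are simplified, hypothetical stand-ins for the IPoIB structures and functions.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

#define GID_LEN 16

/* Hypothetical, simplified stand-ins for the IPoIB structures. */
struct group {
	unsigned char mgid[GID_LEN];
};

struct dev_priv {
	pthread_mutex_t lock;
	struct group *broadcast; /* readers look at this under the lock */
};

/*
 * Allocate into a local pointer, then fill in and publish the group
 * while holding the lock, so a reader never sees a half-initialized one.
 * Returns 0 on success, -1 if the allocation fails.
 */
static int setup_broadcast(struct dev_priv *priv, const unsigned char *gid)
{
	struct group *g;

	assert(priv != NULL && gid != NULL);
	g = calloc(1, sizeof *g);
	if (!g)
		return -1; /* priv->broadcast stays NULL on failure */

	pthread_mutex_lock(&priv->lock);
	memcpy(g->mgid, gid, GID_LEN);
	priv->broadcast = g;
	pthread_mutex_unlock(&priv->lock);
	return 0;
}

/* Reader side: copy the GID out if the group has been published. */
static int read_broadcast_gid(struct dev_priv *priv, unsigned char *out)
{
	int found = 0;

	pthread_mutex_lock(&priv->lock);
	if (priv->broadcast) {
		memcpy(out, priv->broadcast->mgid, GID_LEN);
		found = 1;
	}
	pthread_mutex_unlock(&priv->lock);
	return found;
}
```

The local pointer also means an allocation failure never leaves a stale or partial value in the shared field, which is the second thing the patch changes.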
[openib-general] Re: openib and mellanox hca problem
Quoting Michael Di Domenico [EMAIL PROTECTED]:
> Feb 7 16:59:48 linux14-ts kernel: ib_mthca :07:00.0: PCI device did not come back after reset, aborting.
> Feb 7 16:59:48 linux14-ts kernel: ib_mthca :07:00.0: Failed to reset HCA, aborting.
> Feb 7 16:59:48 linux14-ts kernel: ACPI: PCI interrupt for device :07:00.0 disabled
>
> --- lspci output
>
> 06:03.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev ff) (prog-if ff)
>         !!! Unknown header type 7f
> 07:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev ff) (prog-if ff)
>         !!! Unknown header type 7f

This could be a hardware problem. Please contact your mellanox FAE representative.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies
[openib-general] Re: openib and mellanox hca problem
On 2/8/06, Michael S. Tsirkin [EMAIL PROTECTED] wrote:
> Quoting Michael Di Domenico [EMAIL PROTECTED]:
> > --- lspci output
> >
> > 06:03.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev ff) (prog-if ff)
> >         !!! Unknown header type 7f
> > 07:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev ff) (prog-if ff)
> >         !!! Unknown header type 7f
>
> This could be a hardware problem. Please contact your mellanox FAE representative.

It shouldn't be. These machines were working fine with a copy of REL3 using a 2.4 kernel and the silverstorm hca stack. This has only cropped up since I switched to Fedora Core 4 with a 2.6 kernel and the openib stack.
[openib-general] Re: openib and mellanox hca problem
On 2/8/06, Michael S. Tsirkin [EMAIL PROTECTED] wrote:
> If you really suspect timing issues, you can always increase timeouts: look for msleep in mthca_reset.c and try bumping up the numbers.
>
> Anyway - could you please enable mthca debug in menuconfig? This would give us some more information on what's going on.

I enabled debug in the module config, recompiled, and tried to reload using modprobe ib_mthca, and got the same results. Am I missing a debug parameter somewhere? Or should it just spit out more information automatically?
RE: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediate data proposal
One more issue to discuss.

Does completion of a Recv that matches an RDMA Write with Immediate Data automatically sync local memory, or does the Consumer still need to do lmr_sync_rdma_write prior to accessing the RDMAed data?

Arkady Kanevsky                 email: [EMAIL PROTECTED]
Network Appliance Inc.          phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.    Fax: 781-895-1195
Waltham, MA 02451               central phone: 781-768-5300

-----Original Message-----
From: Caitlin Bestler [mailto:[EMAIL PROTECTED]
Sent: Tuesday, February 07, 2006 7:40 PM
To: [EMAIL PROTECTED]; Larsen, Roy K; Arlin Davis; Hefty, Sean
Cc: openib-general@openib.org
Subject: RE: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediate data proposal

[EMAIL PROTECTED] wrote:
> We have a problem no matter which option we choose.
>
> The current Transport Level Requirements state:
>
> There is a one-to-one correspondence between send operations on one Endpoint of the Connection and recv operations on the other Endpoint of the Connection. There is no correspondence between RDMA operations on one Endpoint of the Connection and recv or send data transfer operations on the other Endpoint of the Connection.
>
> Receive operations on a Connection must be completed in the order of posting of their corresponding sends.
>
> The Immediate data and Atomic ops violate these requirements, including the ordering rules.
>
> I had started updating these rules when I generated the first draft of the requirements. They are included in the enclosed pdf file. But they do not cover Atomic ops, which also impact the transport requirements.
>
> This chapter of the spec has not been changed since DAPL 1.0 and I am very concerned about any changes to it.
>
> Arkady

If RDMA Write with Immediate is viewed as being the equivalent of doing an RDMA Write and then an RDMA Send, the correspondence rule is maintained. But *only* if the RDMA Write with Immediate has all of the semantics of a Send.

Atomics do not violate the rules if you view them as being a variation on an RDMA Read. They are an RDMA Read with modify.

The real question is whether it makes sense to put it in the RDMA device. It is also not subject to emulation at a higher layer.

With send with invalidate we know how InfiniBand *will* support it, because of the IB 1.2 verbs. We do not know that for atomics over iWARP. We do not know whether it will be added; more importantly, we do not know *how* it would be added if it were added. That makes coming up with a transport neutral definition very premature.

In particular, if atomics were added to iWARP there is a distinct design option where it would *not* be the same work queue as RDMA Reads (adding atomics through Queue ID 3 would make layering on top of a current implementation much easier). But it would mean that atomic credits would be distinct from read credits.

This is a very strong reason to defer attempting to define RDMA Atomics in a transport neutral fashion.
[openib-general] Re: openib and mellanox hca problem
Quoting r. Michael Di Domenico [EMAIL PROTECTED]:
> Subject: Re: openib and mellanox hca problem
>
> On 2/8/06, Michael S. Tsirkin [EMAIL PROTECTED] wrote:
> > If you really suspect timing issues, you can always increase timeouts: look for msleep in mthca_reset.c and try bumping up the numbers.
> >
> > Anyway - could you please enable mthca debug in menuconfig? This would give us some more information on what's going on.
>
> I enabled debug in the module config, recompiled, and tried to reload using modprobe ib_mthca, and got the same results. Am I missing a debug parameter somewhere? Or should it just spit out more information automatically?

Yes, it should spit out things like "Found bridge". Are you sure you installed it properly?

To check, you can try to stick

    mthca_dbg(mdev, "Here\n");

at the beginning of mthca_reset and see that it gets printed.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies
RE: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediate data proposal
[EMAIL PROTECTED] wrote:
> One more issue to discuss.
>
> Does completion of a Recv that matches an RDMA Write with Immediate Data automatically sync local memory, or does the Consumer still need to do lmr_sync_rdma_write prior to accessing the RDMAed data?

Why would it be any different than for a plain receive? The intent is the same: to indicate that prior Writes have completed.
[openib-general] Re: openib and mellanox hca problem
On 2/8/06, Michael S. Tsirkin [EMAIL PROTECTED] wrote:
> Quoting r. Michael Di Domenico [EMAIL PROTECTED]:
> > Subject: Re: openib and mellanox hca problem
> >
> > On 2/8/06, Michael S. Tsirkin [EMAIL PROTECTED] wrote:
> > > If you really suspect timing issues, you can always increase timeouts: look for msleep in mthca_reset.c and try bumping up the numbers.
> > >
> > > Anyway - could you please enable mthca debug in menuconfig? This would give us some more information on what's going on.
> >
> > I enabled debug in the module config, recompiled, and tried to reload using modprobe ib_mthca, and got the same results. Am I missing a debug parameter somewhere? Or should it just spit out more information automatically?
>
> Yes, it should spit out things like "Found bridge". Are you sure you installed it properly?
>
> To check, you can try to stick mthca_dbg(mdev, "Here\n"); at the beginning of mthca_reset and see that it gets printed.

Definitely working...

Feb 8 10:01:23 linux14-ts kernel: ib_mthca: Mellanox InfiniBand HCA driver v0.06 (June 23, 2005)
Feb 8 10:01:23 linux14-ts kernel: ib_mthca: Initializing :07:00.0
Feb 8 10:01:23 linux14-ts kernel: ACPI: PCI Interrupt :07:00.0[?] - GSI 26 (level, low) - IRQ 217
Feb 8 10:01:23 linux14-ts kernel: ib_mthca :07:00.0: Here
Feb 8 10:01:23 linux14-ts kernel: ib_mthca :07:00.0: Found bridge: :06:03.0
Feb 8 10:01:34 linux14-ts kernel: ib_mthca :07:00.0: PCI device did not come back after reset, aborting.
Feb 8 10:01:34 linux14-ts kernel: ib_mthca :07:00.0: Failed to reset HCA, aborting.
Feb 8 10:01:34 linux14-ts kernel: ACPI: PCI interrupt for device :07:00.0 disabled
[openib-general] Re: openib and mellanox hca problem
Quoting r. Michael Di Domenico [EMAIL PROTECTED]:
> Subject: Re: openib and mellanox hca problem
>
> Roland, I've attached the dmesg and lspci outputs...

You really want lspci *before* mthca got loaded. This one just shows the card's incommunicado.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies
[openib-general] Re: openib and mellanox hca problem
Quoting Michael Di Domenico [EMAIL PROTECTED]:
> Feb 8 10:01:23 linux14-ts kernel: ib_mthca :07:00.0: Found bridge: :06:03.0

Hmm, looks like the bridge lookup worked fine.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies
[openib-general] FW: [PATCH 1 of 3] mad: large RMPP support
Sorry for breaking the thread (Outlook is problematic).

Jack

-----Original Message-----
From: Jack Morgenstein
Sent: Wednesday, February 08, 2006 6:23 PM
To: 'Sean Hefty'
Cc: Michael S. Tsirkin; '[EMAIL PROTECTED]'
Subject: RE: [PATCH 1 of 3] mad: large RMPP support

Sorry for not echoing to openib -- I'm having problems with mutt and our server (replying to this from Outlook will not place the reply in the thread).

I would much rather use the linked list. We may need to allocate a rather large contiguous array (the ib_mad_segments segment array) for queries involving a large cluster, and such an allocation has a larger probability of failure.

For example, a 1000 host cluster with 2 ports per HCA will have at least 4000 records in a SubnAdmGetTableResp for all PortInfo records on the network (2000 for HCAs, and at least 2000 for the switch ports). Such a query response will generate an RMPP of size 256K -- 1000 segments, or a 4K buffer on an X86 machine just for the array (assuming one allocation per RMPP segment -- N=1).

b. Regarding using buffers which contain N RMPP segments, this becomes a management nightmare: if we choose N too large, we may fail to allocate segments in a large RMPP, so that the entire RMPP fails (where it could succeed if N=1). Having N=1 guarantees that if we can succeed in our allocation, we will. I do not consider variable-size N within a single RMPP, since this would be very complicated and error-prone. We could re-allocate everything if some N does not work -- also very complex.

Regarding the order N-squared algorithm for finding the next RMPP segment to send, MST and I agree that this is not acceptable. We are considering an algorithm which stores the current segment pointer in struct ib_mad_send_wr_private so that when getting the next segment we simply go to the next link. We're still ironing out proper handling of the last-acknowledged processing (maintaining a pointer to the last-acked segment, upgrading the last-acked pointer when a new ack arrives -- this might still involve linear searches).

Regarding the payload pointer, I agree. It is also trivial to move it to the ib_mad_send_wr_private structure, hiding it from the user.

Regarding the 64-byte boundary, why is this important?

Jack

-----Original Message-----
From: Sean Hefty [mailto:[EMAIL PROTECTED]
Sent: Wednesday, February 08, 2006 3:01 AM
To: Jack Morgenstein
Cc: openib-general@openib.org
Subject: RE: [PATCH 1 of 3] mad: large RMPP support

Based on what you've done, I'd like to suggest changing the interface to something similar to that shown below. I believe that this could be done with minor changes to the current patches. Detailed comments that led to suggesting this change are inline in my responses.

struct ib_mad_segments {
    u32   num_segments;
    u32   segment_size;
    void  *segment[0];
};

struct ib_mad_send_buf {
    ...
    void  *mad;                         /* First MAD segment */
    struct ib_mad_segments *segments;   /* RMPP segments > 1 */
    ...
};

This will avoid walking through a list to find segments, and allows for efficient allocation of the segment data buffers. Multiple segments could be allocated through a single kzalloc. (For example, every n-th segment would start a new allocation, making deallocation easy as well.)

> +struct ib_mad_multipacket_seg {
> +    struct list_head list;
> +    u32 size;
> +    u8 data[0];
> +};

Should we ensure that the data alignment is on a 64-byte boundary?

>  struct ib_mad_send_buf {
>      struct ib_mad_send_buf *next;
> -    void *mad;
> +    void *mad;          /* RMPP: first segment,
> +                           including the MAD header */
> +    void *mad_payload;  /* RMPP: changed per segment */

Mad_payload doesn't appear to be directly accessible by the user. It should be hidden.

- Sean
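To make the "order N-squared" point in Jack's reply concrete: if every send looks up its segment by walking the list from the head, sending n segments costs on the order of n*n link traversals, whereas keeping a cursor to the next segment costs one traversal per send. The sketch below is a simplified illustration with hypothetical structures (`struct seg`, `struct send_state`); it is not the layout from the patch, which uses struct list_head and struct ib_mad_send_wr_private.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified segment list; not the patch's structures. */
struct seg {
	struct seg *next;
	unsigned int num; /* 1-based RMPP segment number */
};

struct send_state {
	struct seg *head;
	struct seg *cur; /* next segment to send, NULL when exhausted */
};

/* Naive lookup: walk from the head every time, O(n) per call. */
static struct seg *seg_by_walk(const struct send_state *s, unsigned int num)
{
	struct seg *p;

	assert(s != NULL);
	for (p = s->head; p; p = p->next)
		if (p->num == num)
			return p;
	return NULL;
}

/* Cursor: hand out the current segment and follow one link, O(1). */
static struct seg *seg_next(struct send_state *s)
{
	struct seg *p;

	assert(s != NULL);
	p = s->cur;
	if (p)
		s->cur = p->next;
	return p;
}
```

Rewinding the cursor after a lost acknowledgement would still need a walk (or a second saved pointer), which matches the caveat about last-acked handling in the mail.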
[openib-general] [PATCH] [RFC] - example user mode rdma ping/pong program using CMA
All,

Attached is a user-mode program, called rping, that uses librdmacm and libibverbs to implement a ping-pong program over an RC connection. The program utilizes SEND, RECV, RDMA READ, and WRITE ops, as well as cq channels to get cq events, and rdma_get_event() to detect CMA events. It is multi-threaded.

I've built it as an example program in librdmacm/examples and tested it with mthca. It is useful to test CMA as well as all the major rdma operations in a transport-neutral way.

If you all find it has utility, please pull it into librdmacm/examples.

Signed-off-by: Steve Wise [EMAIL PROTECTED]

Index: Makefile.am
===
--- Makefile.am (revision 5330)
+++ Makefile.am (working copy)
@@ -18,9 +18,11 @@ src_librdmacm_la_SOURCES = src/cma.c
 src_librdmacm_la_LDFLAGS = -avoid-version $(rdmacm_version_script)
-bin_PROGRAMS = examples/ucmatose
+bin_PROGRAMS = examples/ucmatose examples/rping
 examples_ucmatose_SOURCES = examples/cmatose.c
 examples_ucmatose_LDADD = $(top_builddir)/src/librdmacm.la
+examples_rping_SOURCES = examples/rping.c
+examples_rping_LDADD = $(top_builddir)/src/librdmacm.la
 librdmacmincludedir = $(includedir)/rdma
Index: examples/rping.c
===
--- examples/rping.c (revision 0)
+++ examples/rping.c (revision 0)
@@ -0,0 +1,1175 @@
+/*
+ * Copyright (c) 2005 Ammasso, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses. You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ * - Redistributions of source code must retain the above
+ *   copyright notice, this list of conditions and the following
+ *   disclaimer.
+ *
+ * - Redistributions in binary form must reproduce the above
+ *   copyright notice, this list of conditions and the following
+ *   disclaimer in the documentation and/or other materials
+ *   provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <getopt.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stdio.h>
+#include <errno.h>
+#include <sys/types.h>
+#include <netinet/in.h>
+#include <sys/socket.h>
+#include <netdb.h>
+#include <byteswap.h>
+#include <semaphore.h>
+#include <arpa/inet.h>
+#include <pthread.h>
+
+#include <rdma/rdma_cma.h>
+
+static int debug = 0;
+#define DEBUG_LOG if (debug) printf
+
+/*
+ * rping ping/pong loop:
+ *    client sends source rkey/addr/len
+ *    server receives source rkey/add/len
+ *    server rdma reads ping data from source
+ *    server sends go ahead on rdma read completion
+ *    client sends sink rkey/addr/len
+ *    server receives sink rkey/addr/len
+ *    server rdma writes pong data to sink
+ *    server sends go ahead on rdma write completion
+ *    repeat loop
+ */
+
+/*
+ * These states are used to signal events between the completion handler
+ * and the main client or server thread.
+ *
+ * Once CONNECTED, they cycle through RDMA_READ_ADV, RDMA_WRITE_ADV,
+ * and RDMA_WRITE_COMPLETE for each ping.
+ */
+typedef enum {
+    IDLE = 1,
+    CONNECT_REQUEST,
+    CONNECTED,
+    RDMA_READ_ADV,
+    RDMA_READ_COMPLETE,
+    RDMA_WRITE_ADV,
+    RDMA_WRITE_COMPLETE,
+    ERROR
+} state_t;
+
+/*
+ * Default max buffer size for IO...
+ */
+#define RPING_BUFSIZE 64*1024
+#define RPING_SQ_DEPTH 16
+
+/*
+ * Control block struct.
+ */
+struct rping_cb {
+    int server;                     /* 0 iff client */
+    pthread_t cqthread;
+    struct ibv_comp_channel *channel;
+    struct ibv_cq *cq;
+    struct ibv_pd *pd;
+    struct ibv_qp *qp;
+
+    struct ibv_recv_wr rq_wr;       /* recv work request record */
+    struct ibv_sge recv_sgl;        /* recv single SGE */
+    char *recv_buf;                 /* malloc'd buffer */
+    struct ibv_mr *recv_mr;         /* MR associated with this buffer */
+
+    struct ibv_send_wr sq_wr;       /* send work requrest record */
+    struct ibv_sge send_sgl;
+    char *send_buf;                 /* single send buf */
+
[openib-general] problem with user-verb WC's
While working on the openIB port for PVFS2, I've stumbled across some problems in posting rdma requests via the user-verbs interface with ib_mthca drivers.

According to a 'TODO' buried in the gen2 src/linux-kernel/infiniband/hw/ :

MW support: ib_mthca does not support memory windows

The opcodes that I receive for non-rdma requests are all correct; however, when posting rdma requests, I'm consistently getting work completions with opcodes of:

IBV_WC_BIND_MW

I'm not making any (known) calls or requests to bind to a memory window, or for that matter to create a memory window. So how does a completion event get generated with an opcode indicating that a currently unimplemented feature has just finished? And are there other reasons why I should/would be getting this type of completion?

Thanks,
Kyle

--
Kyle Schochenmaier
[EMAIL PROTECTED]
Research Assistant, Dr. Brett Bode
AmesLab - US Dept.Energy
Scalable Computing Laboratory
RE: [openib-general] [PATCH] [RFC] - example user mode rdma ping/pong program using CMA
> Attached is a user-mode program, called rping, that uses librdmacm and libibverbs to implement a ping-pong program over an RC connection. The program utilizes SEND, RECV, RDMA READ, and WRITE ops, as well as cq channels to get cq events, and rdma_get_event() to detect CMA events. It is multi-threaded.
>
> I've built it as an example program in librdmacm/examples and tested it with mthca. It is useful to test CMA as well as all the major rdma operations in a transport-neutral way.
>
> If you all find it has utility, please pull it into librdmacm/examples.

Thanks. I may not get a chance to test this for a couple of days, but some additional tests for librdmacm would definitely be useful.

- Sean
[openib-general] Re: [PATCH] [RFC] - example user mode rdma ping/pong program using CMA
On Wed, 2006-02-08 at 18:45 +0200, Michael S. Tsirkin wrote:
> Quoting r. Steve Wise [EMAIL PROTECTED]:
> > Subject: [PATCH] [RFC] - example user mode rdma ping/pong program using CMA
> >
> > All,
> >
> > Attached is a user-mode program, called rping, that uses librdmacm and libibverbs to implement a ping-pong program over an RC connection. The program utilizes SEND, RECV, RDMA READ, and WRITE ops, as well as cq channels to get cq events, and rdma_get_event() to detect CMA events. It is multi-threaded.
> >
> > I've built it as an example program in librdmacm/examples and tested it with mthca. It is useful to test CMA as well as all the major rdma operations in a transport-neutral way.
> >
> > If you all find it has utility, please pull it into librdmacm/examples.
> >
> > Signed-off-by: Steve Wise [EMAIL PROTECTED]
>
> Steve, looks like you have at most a single receive work request posted at the receive workqueue at all times. If true, this is *really* not a good idea, performance-wise, even if you actually have at most 1 packet in flight.

Hey Michael,

There is at most only one SEND in flight. This is a test program, not a performance program. Its goal is to utilize SEND, RECV, RDMA READ, and RDMA WRITE, as well as CMA to set up the connection...

Thanks,

Steve.
Re: [openib-general] Re: [PATCH] [RFC] - example user mode rdma ping/pong program using CMA
> Hey Michael,
>
> There is at most only one SEND in flight. This is a test program, not a performance program. Its goal is to utilize SEND, RECV, RDMA READ, and RDMA WRITE as well as CMA to set up the connection...
>
> Thanks,
>
> Steve.

By the way, in case it's not clear: The SEND/RECV exchanges are done just to advertise source and sink memory regions, and to indicate completion of rdma read and write operations to the peer. The ping/pong data is transferred with rdma read and write operations.

Thanks for the feedback!
Re: [openib-general] Ifdown/ifup pick up the wrong ib interface configuration file
Check your ifcfg-ib0/ifcfg-ib1 script to see whether the interface name matches.

Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
[openib-general] Re: openib and mellanox hca problem
On 2/8/06, Michael S. Tsirkin [EMAIL PROTECTED] wrote:
> Quoting r. Michael Di Domenico [EMAIL PROTECTED]:
> > Subject: Re: openib and mellanox hca problem
> >
> > Roland, I've attached the dmesg and lspci outputs...
>
> You really want lspci *before* mthca got loaded. This one just shows the card's incommunicado.

I'm going to try to roll back to RedHat EL4 IA32 and see if I can get the machines up and using the SilverStorm host stack, and make sure everything works fine. Unfortunately we don't have a stack for Fedora Core 4 on ia32 on ia64. Afterwards I'll load up the openib stack and see what happens...

Thanks for the help
Re: [openib-general] Re: [PATCH] [RFC] - example user mode rdma ping/pong program using CMA
Quoting r. Sean Hefty [EMAIL PROTECTED]:
> Subject: RE: [openib-general] Re: [PATCH] [RFC] - example user mode rdma ping/pong program using CMA
>
> > Steve, looks like you have at most a single receive work request posted at the receive workqueue at all times. If true, this is *really* not a good idea, performance-wise, even if you actually have at most 1 packet in flight.
>
> Can you provide some more details on this?

See 9.7.7.2 end-to-end (message level) flow control

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies
RE: [openib-general] FW: [PATCH 1 of 3] mad: large RMPP support
> For example, a 1000 host cluster, with 2 ports per HCA will have at least 4000 records in a SubnAdmGetTableResp for all PortInfo records on the network (2000 for HCAs, and at least 2000 for the switch ports). Such a query response will generate an RMPP of size 256K -- 1000 segments, or a 4K buffer on an X86 machine just for the array (assuming one allocation per RMPP segment -- N=1).

I think that this is a good reason to use an array. Walking a 1000 entry list 1000 times is a substantial performance hit. Lost MADs and retries will make this worse. A 4K buffer for the array is less than the 8K total needed for the 1000 list items.

We're already talking about allocating over 256K of memory just for the data payload. An additional contiguous 4K buffer seems like a minor issue. I'm not convinced that there's a real issue here. To support ridiculously large transfers from userspace, we may need to push the RMPP handling up into userspace.

- Sean
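The array-versus-list numbers above can be checked with a rough sketch. The helper names are made up for illustration, and the list figure assumes each list item carries a next and a prev pointer, which is what the quoted 8K for 1000 items with 4-byte pointers suggests; none of this is taken from the MAD code itself.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes of bookkeeping for an array holding one pointer per segment. */
static size_t seg_array_bytes(size_t segs, size_t ptr_size)
{
	return segs * ptr_size;
}

/*
 * Bytes of bookkeeping for a doubly linked list, assuming each item
 * carries a next and a prev pointer (an assumption, see above).
 */
static size_t seg_list_bytes(size_t segs, size_t ptr_size)
{
	return segs * 2 * ptr_size;
}

/*
 * Link hops needed to look up segments 1..segs one at a time when
 * every lookup restarts from the list head: 1 + 2 + ... + segs.
 */
static unsigned long list_lookup_hops(unsigned long segs)
{
	/* keeps segs * (segs + 1) within a 32-bit unsigned long */
	assert(segs < 65536UL);
	return segs * (segs + 1) / 2;
}
```

With 1000 segments and 4-byte pointers this gives 4000 bytes for the array against 8000 for the list, and about half a million link hops for the list where the array needs 1000 direct index operations.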
[openib-general] error when using libsdp
Hi,

I have compiled and configured libsdp and when I start my application I get this message:

default libsdp configuration is used
Error 97 calling socket for SDP socket

errno 97 gives

#define EAFNOSUPPORT 97 /* Address family not supported by protocol */

How can I enable the SDP support?

Thanks in advance,
xavier
[openib-general] Re: openib and mellanox hca problem
    Michael> I wonder whether we manage to locate the bridge. It
    Michael> would be interesting to build mthca with debug enabled.

Certainly building with CONFIG_INFINIBAND_MTHCA_DEBUG=y would be a good idea. But even without debug, if we don't find a bridge, we should see the warning from the code:

	if (!bridge) {
		/*
		 * Didn't find a bridge for a Tavor device --
		 * assume we're in no-bridge mode and hope for
		 * the best.
		 */
		mthca_warn(mdev, "No bridge found for %s\n",
			   pci_name(mdev->pdev));
	}

 - R.
[openib-general] Re: Re: [PATCH] [RFC] - example user mode rdma ping/pong program using CMA
Quoting r. Steve Wise [EMAIL PROTECTED]:
> Subject: Re: Re: [PATCH] [RFC] - example user mode rdma ping/pong program using CMA
>
> > Hey Michael,
> >
> > There is at most only one SEND in flight. This is a test program, not a performance program. Its goal is to utilize SEND, RECV, RDMA READ, and RDMA WRITE as well as CMA to set up the connection...
> >
> > Thanks,
> >
> > Steve.
>
> By the way, in case it's not clear: The SEND/RECV exchanges are done just to advertise source and sink memory regions, and to indicate completion of rdma read and write operations to the peer. The ping/pong data is transferred with rdma read and write operations.
>
> Thanks for the feedback!

Code tends to get copied around ... it's easy to imagine someone copying this and measuring the send latency. Just posting many WRs in the initialization sequence, with no other code changes, will fix this problem.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies
[openib-general] Re: error when using libsdp
Quoting r. Xavier Grave [EMAIL PROTECTED]:
> Subject: error when using libsdp
>
> Hi,
>
> I have compiled and configured libsdp and when I start my application I get this message:
>
> default libsdp configuration is used
> Error 97 calling socket for SDP socket
>
> errno 97 gives
>
> #define EAFNOSUPPORT 97 /* Address family not supported by protocol */
>
> How can I enable the SDP support?
>
> Thanks in advance,
> xavier

Did you load the ib_sdp module?

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies
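For reference, a small sketch of how such an error can be probed from a test program. It assumes the SDP address family number is 27 (the value I believe libsdp uses for AF_INET_SDP; treat it as an assumption and check your headers), and the function names are hypothetical helpers, not part of libsdp.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Assumption: address family number used for SDP sockets. */
#define TOY_AF_INET_SDP 27

/* Try to create an SDP stream socket; return 0 on success, else errno. */
static int probe_sdp_socket(void)
{
	int fd = socket(TOY_AF_INET_SDP, SOCK_STREAM, 0);

	if (fd < 0)
		return errno;
	close(fd);
	return 0;
}

/* Map the probe result to a hint for the user. */
static const char *sdp_probe_hint(int err)
{
	assert(err >= 0);	/* errno values are not negative */
	switch (err) {
	case 0:
		return "SDP socket created";
	case EAFNOSUPPORT:
		return "address family not supported: is the ib_sdp module loaded?";
	default:
		return strerror(err);
	}
}
```

Calling sdp_probe_hint(probe_sdp_socket()) on a host without the module loaded would be expected to produce the EAFNOSUPPORT hint, which is the same condition libsdp reports as error 97.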
[openib-general] Re: openib and mellanox hca problem
Quoting Roland Dreier [EMAIL PROTECTED]:
> Certainly building with CONFIG_INFINIBAND_MTHCA_DEBUG=y would be a good idea. But even without debug, if we don't find a bridge, we should see the warning from the code:

Right, I wanted to check we got the right bus/device number, and it seems we did.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies
Re: [openib-general] Re: [PATCH] [RFC] - example user mode rdma ping/pong program using CMA
On Wed, 2006-02-08 at 19:10 +0200, Michael S. Tsirkin wrote:
> Quoting r. Sean Hefty [EMAIL PROTECTED]:
> > Subject: RE: [openib-general] Re: [PATCH] [RFC] - example user mode rdma ping/pong program using CMA
> >
> > > Steve, looks like you have at most a single receive work request posted at the receive workqueue at all times. If true, this is *really* not a good idea, performance-wise, even if you actually have at most 1 packet in flight.
> >
> > Can you provide some more details on this?
>
> See 9.7.7.2 end-to-end (message level) flow control

I just read this section in the 1.2 version of the spec, and I still don't understand what the issue really is. 9.7.7.2 talks about IBA doing flow control based on the RECV WQEs posted. rping always ensures that there is a RECV posted before the peer can send. This is ensured by the rping protocol itself (see the comment at the front of rping.c describing the ping loop).

I'm only ever sending one outstanding message via SEND/RECV. I would rather post exactly what is needed, than post some number of RECVs just to be safe.

Sorry if I'm being dense. What am I missing here?

Steve.
[openib-general] Re: Re: [PATCH] [RFC] - example user mode rdma ping/pong program using CMA
> > By the way, in case it's not clear: The SEND/RECV exchanges are done just to advertise source and sink memory regions, and to indicate completion of rdma read and write operations to the peer. The ping/pong data is transferred with rdma read and write operations.
> >
> > Thanks for the feedback!
>
> Code tends to get copied around ... it's easy to imagine someone copying this and measuring the send latency. Just posting many WRs in the initialization sequence, with no other code changes, will fix this problem.

Each ping/pong iteration with rping is composed of 2 sends on the client side, 2 sends on the server side, plus 1 rdma read and 1 rdma write on the server side. Again, latency performance (or any performance) isn't a goal of this program. Testing CMA, CQ and CMA event notifications, and send/recv/rr/rw are the goals.

Snippet from the patch:

+/*
+ * rping ping/pong loop:
+ * 	client sends source rkey/addr/len
+ *	server receives source rkey/addr/len
+ *	server rdma reads ping data from source
+ * 	server sends go ahead on rdma read completion
+ *	client sends sink rkey/addr/len
+ * 	server receives sink rkey/addr/len
+ * 	server rdma writes pong data to sink
+ * 	server sends go ahead on rdma write completion
+ * 	repeat loop
+ */

Can you be more specific on what you think I should change? Are you suggesting I post more RECVs?

Steve.
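To make the per-iteration operation counts concrete, here is a toy walk through the loop from the comment above. It only counts operations; it does no RDMA, and the enum, struct and function names are invented for this sketch rather than taken from rping.c.

```c
#include <assert.h>

/* Operations issued by each side (names are made up for this sketch). */
struct toy_rping_counts {
	unsigned client_sends;
	unsigned server_sends;
	unsigned server_rdma_reads;
	unsigned server_rdma_writes;
};

/* Active steps of one iteration, in the order given by the comment. */
enum toy_rping_step {
	CLIENT_SEND_SOURCE,	/* client sends source rkey/addr/len */
	SERVER_RDMA_READ,	/* server rdma reads ping data from source */
	SERVER_SEND_READ_DONE,	/* server sends go ahead on read completion */
	CLIENT_SEND_SINK,	/* client sends sink rkey/addr/len */
	SERVER_RDMA_WRITE,	/* server rdma writes pong data to sink */
	SERVER_SEND_WRITE_DONE,	/* server sends go ahead on write completion */
	TOY_RPING_STEPS
};

/* Account for one step of the loop. */
static void toy_rping_apply(struct toy_rping_counts *c, enum toy_rping_step s)
{
	switch (s) {
	case CLIENT_SEND_SOURCE:
	case CLIENT_SEND_SINK:
		c->client_sends++;
		break;
	case SERVER_SEND_READ_DONE:
	case SERVER_SEND_WRITE_DONE:
		c->server_sends++;
		break;
	case SERVER_RDMA_READ:
		c->server_rdma_reads++;
		break;
	case SERVER_RDMA_WRITE:
		c->server_rdma_writes++;
		break;
	default:
		assert(0 && "unknown step");
	}
}

/* Run the loop for the given number of ping/pong iterations. */
static struct toy_rping_counts toy_rping_run(unsigned iterations)
{
	struct toy_rping_counts c = { 0, 0, 0, 0 };
	unsigned i;
	int s;

	for (i = 0; i < iterations; i++)
		for (s = 0; s < TOY_RPING_STEPS; s++)
			toy_rping_apply(&c, (enum toy_rping_step)s);
	return c;
}
```

Each iteration contributes 2 client sends, 2 server sends, 1 rdma read and 1 rdma write, matching the description in the message.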
Re: [openib-general] ibstat problem
Yes, we discovered this yesterday. You built the libraries and did not build the diag. tools. Once you do this, things work. I do still have a few problems on sending messages out multicast though.

Sean

Steve Wise wrote:
> Anyone see this before?
>
> vic17:~ # ibstat
> ibstat: relocation error: ibstat: symbol argv0, version IBCOMMON_1.0 not defined in file libibcommon.so.1 with link time reference
> vic17:~ # uname -a
> Linux vic17 2.6.15.2-kdb #4 SMP PREEMPT Mon Feb 6 17:24:41 CST 2006 i686 i686 i386 GNU/Linux
> vic17:~ #
>
> [EMAIL PROTECTED] src]$ svn info
> Path: .
> URL: https://openib.org/svn/gen2/trunk/src
> Repository UUID: 21a7a0b7-18d7-0310-8e21-e8b31bdbf5cd
> Revision: 5330
> Node Kind: directory
> Schedule: normal
> Last Changed Author: ogerlitz
> Last Changed Rev: 5330
> Last Changed Date: 2006-02-07 07:23:38 -0600 (Tue, 07 Feb 2006)
> [EMAIL PROTECTED] src]$
[openib-general] Re: openib and mellanox hca problem
On 2/8/06, Michael S. Tsirkin [EMAIL PROTECTED] wrote:
> Quoting Roland Dreier [EMAIL PROTECTED]:
> > Certainly building with CONFIG_INFINIBAND_MTHCA_DEBUG=y would be a good idea. But even without debug, if we don't find a bridge, we should see the warning from the code:
>
> Right, I wanted to check we got the right bus/device number, and it seems we did.

FYI... Changed over to RHEL4 IA32 w/ SilverStorm Host Stack v3.2.0.0.21 and now I get the below info and a working infiniband setup... Since I have two servers, I'm going to leave this one working and try openib on the second machine...

# uname -a
Linux linux14.silverstorm.com 2.6.9-5.ELsmp #1 SMP Wed Jan 5 19:30:39 EST 2005 i686 i686 i386 GNU/Linux

# lspci -vvv
06:03.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev a1) (prog-if 00 [Normal decode])
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
	Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium TAbort- TAbort- MAbort- SERR- PERR-
	Latency: 64, Cache Line Size 10
	Bus: primary=06, secondary=07, subordinate=07, sec-latency=64
	I/O behind bridge: f000-0fff
	Memory behind bridge: fe50-fe7f
	Prefetchable memory behind bridge: eac0-fbc0
	Secondary status: 66Mhz+ FastB2B- ParErr- DEVSEL=medium TAbort- TAbort- MAbort- SERR- PERR-
	BridgeCtl: Parity+ SERR+ NoISA- VGA- MAbort- Reset- FastB2B-
	Capabilities: [70] PCI-X bridge device.
		Secondary Status: 64bit+, 133MHz+, SCD-, USC-, SCO-, SRD- Freq=3
		Status: Bus=6 Dev=3 Func=0 64bit+ 133MHz+ SCD- USC-, SCO-, SRD-
		: Upstream: Capacity=512, Commitment Limit=512
		: Downstream: Capacity=128, Commitment Limit=128

07:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1)
	Subsystem: Mellanox Technologies MT23108 InfiniHost
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop- ParErr- Stepping- SERR+ FastB2B-
	Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium TAbort- TAbort- MAbort- SERR- PERR-
	Latency: 64, Cache Line Size 10
	Interrupt: pin A routed to IRQ 217
	Region 0: Memory at fe70 (64-bit, non-prefetchable) [size=1M]
	Region 2: Memory at fb00 (64-bit, prefetchable) [size=8M]
	Region 4: Memory at f000 (64-bit, prefetchable) [size=128M]
	Capabilities: [40] MSI-X: Enable- Mask- TabSize=32
		Vector table: BAR=0 offset=00082000
		PBA: BAR=0 offset=00082200
	Capabilities: [50] Vital Product Data
	Capabilities: [60] Message Signalled Interrupts: 64bit+ Queue=0/5 Enable-
		Address: Data:
	Capabilities: [70] PCI-X non-bridge device.
		Command: DPERE- ERO- RBC=3 OST=1
		Status: Bus=7 Dev=0 Func=0 64bit+ 133MHz+ SCD- USC-, DC=simple, DMMRBC=3, DMOST=1, DMCRS=0, RSCEM-
RE: [openib-general] Re: Re: [PATCH] [RFC] - example user mode rdma ping/pong program using CMA
On Wed, 2006-02-08 at 09:51 -0800, Caitlin Bestler wrote:
> [EMAIL PROTECTED] wrote:
> > > > By the way, in case it's not clear: The SEND/RECV exchanges are done just to advertise source and sink memory regions, and to indicate completion of rdma read and write operations to the peer. The ping/pong data is transferred with rdma read and write operations.
> > > >
> > > > Thanks for the feedback!
> > >
> > > Code tends to get copied around ... it's easy to imagine someone copying this and measuring the send latency. Just posting many WRs in the initialization sequence, with no other code changes, will fix this problem.
> >
> > Each ping/pong iteration with rping is composed of 2 sends on the client side, 2 sends on the server side, plus 1 rdma read and 1 rdma write on the server side. Again, latency performance (or any performance) isn't a goal of this program. Testing CMA, CQ and CMA event notifications, and send/recv/rr/rw are the goals.
> >
> > Snippet from the patch:
> >
> > +/*
> > + * rping ping/pong loop:
> > + * 	client sends source rkey/addr/len
> > + *	server receives source rkey/addr/len
> > + *	server rdma reads ping data from source
> > + * 	server sends go ahead on rdma read completion
> > + *	client sends sink rkey/addr/len
> > + * 	server receives sink rkey/addr/len
> > + * 	server rdma writes pong data to sink
> > + * 	server sends go ahead on rdma write completion
> > + * 	repeat loop
> > + */
>
> Why does the server send go ahead after rdma write completion?

No particular reason.

> It should be able to just post the send after posting the rdma write without waiting. When the rdma write completes has no device/transport independent meaning.

You're correct. It does not need to wait for the rdma write completion...
Re: [openib-general] Re: [PATCH] [RFC] - example user mode rdma ping/pong program using CMA
Quoting r. Steve Wise [EMAIL PROTECTED]:
> Subject: Re: [openib-general] Re: [PATCH] [RFC] - example user mode rdma ping/pong program using CMA
>
> On Wed, 2006-02-08 at 19:10 +0200, Michael S. Tsirkin wrote:
> > Quoting r. Sean Hefty [EMAIL PROTECTED]:
> > > Subject: RE: [openib-general] Re: [PATCH] [RFC] - example user mode rdma ping/pong program using CMA
> > >
> > > > Steve, looks like you have at most a single receive work request posted at the receive workqueue at all times. If true, this is *really* not a good idea, performance-wise, even if you actually have at most 1 packet in flight.
> > >
> > > Can you provide some more details on this?
> >
> > See 9.7.7.2 end-to-end (message level) flow control
>
> I just read this section in the 1.2 version of the spec, and I still don't understand what the issue really is. 9.7.7.2 talks about IBA doing flow control based on the RECV WQEs posted. rping always ensures that there is a RECV posted before the peer can send. This is ensured by the rping protocol itself (see the comment at the front of rping.c describing the ping loop).
>
> I'm only ever sending one outstanding message via SEND/RECV. I would rather post exactly what is needed, than post some number of RECVs just to be safe.
>
> Sorry if I'm being dense. What am I missing here?
>
> Steve.

As far as I know, the credits are only updated by the ACK messages. If there is a single work request outstanding on the RQ, the ACK of the SEND message will have the credit field value 0 (since exactly one receive WR was outstanding, and that is now consumed).

As a result the remote side will think that there are no receive WQEs and will slow down (what the spec refers to as limited WQE).

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies
[openib-general] Re: Re: [PATCH] [RFC] - example user mode rdma ping/pong program using CMA
Quoting Steve Wise [EMAIL PROTECTED]:
> Can you be more specific on what you think I should change? Are you suggesting I post more RECVs?

During the initialization stage, post the same receive WR multiple times (according to the RQ size). Nothing needs to be touched in the loop: when you get a CQE, post just one receive WR.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies
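A toy model may help show why the initial depth matters. It follows the simplified description given in this thread (the ACK advertises the receive WRs still posted after the SEND consumed one); it is not the credit encoding from the spec and it is not rping or libibverbs code. All toy_* names are invented for the sketch.

```c
#include <assert.h>

/* Toy receive queue: only tracks how many receive WRs are posted. */
struct toy_rq {
	unsigned posted;
};

/* Post n receive WRs (at init time, or one per completion in the loop). */
static void toy_post_recv(struct toy_rq *rq, unsigned n)
{
	rq->posted += n;
}

/*
 * Deliver one SEND: consume a receive WR and return the credit count
 * the ACK would advertise in this model, or -1 if nothing was posted.
 */
static int toy_deliver_send(struct toy_rq *rq)
{
	if (rq->posted == 0)
		return -1;
	rq->posted--;
	return (int)rq->posted;
}

/* One exchange: a SEND arrives, then the loop reposts a single WR. */
static int toy_exchange(struct toy_rq *rq)
{
	int credits = toy_deliver_send(rq);

	if (credits >= 0)
		toy_post_recv(rq, 1);
	assert(credits < 0 || rq->posted == (unsigned)credits + 1);
	return credits;
}
```

With one WR posted at init, every exchange advertises 0 credits, which is the limited WQE case described above. With 16 posted at init and the loop left unchanged, every exchange advertises 15.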
[openib-general] Re: [PATCH] [RFC] - example user mode rdma ping/pong program using CMA
I suggest this in rping_setup_buffers:

	while (!(rc = ibv_post_recv(cbp->qp, &cbp->rq_wr, &bad_wr)));

This way you will never have 0 end-to-end credits.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies
[openib-general] Re: [PATCH] [RFC] - example user mode rdma ping/pong program using CMA
On Wed, 2006-02-08 at 21:11 +0200, Michael S. Tsirkin wrote:
> I suggest this in rping_setup_buffers:
>
> 	while (!(rc = ibv_post_recv(cbp->qp, &cbp->rq_wr, &bad_wr)));
>
> This way you will never have 0 end-to-end credits.

I can do this easily, but it bothers me to post the same buffer multiple times, knowing the application doesn't need it (and would fail if more than one RECV is consumed at a time), just to make the transport more efficient.

Is this common practice for IB applications?

Thanks,

Steve.
Re: [openib-general] Re: [PATCH] [RFC] - example user mode rdma ping/pong program using CMA
    Steve> Is this common practice for IB applications?

No, I think it's more of a cute trick that works in your particular case.
[openib-general] Re: ipoib_mcast_send.patch
I think we might want to be even more paranoid and wait until the broadcast join succeeds before allowing send-only joins. Otherwise we could create a send-only MCG with the wrong Q_Key, SL, etc.

Something like this maybe?

--- infiniband/ulp/ipoib/ipoib_multicast.c	(revision 5337)
+++ infiniband/ulp/ipoib/ipoib_multicast.c	(working copy)
@@ -222,6 +222,13 @@ static int ipoib_mcast_join_finish(struc
 		    sizeof (union ib_gid))) {
 		priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey);
 		priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
+
+		/*
+		 * Make sure that all the attributes are visible
+		 * before we set the attached bit, so that send-only
+		 * joins don't get started with incorrect attributes.
+		 */
+		smp_wmb();
 	}
 
 	if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
@@ -533,8 +540,10 @@ void ipoib_mcast_join_task(void *dev_ptr
 	}
 
 	if (!priv->broadcast) {
-		priv->broadcast = ipoib_mcast_alloc(dev, 1);
-		if (!priv->broadcast) {
+		struct ipoib_mcast *broadcast;
+
+		broadcast = ipoib_mcast_alloc(dev, 1);
+		if (!broadcast) {
 			ipoib_warn(priv, "failed to allocate broadcast group\n");
 			mutex_lock(&mcast_mutex);
 			if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
@@ -544,10 +553,11 @@ void ipoib_mcast_join_task(void *dev_ptr
 			return;
 		}
 
-		memcpy(priv->broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
+		spin_lock_irq(&priv->lock);
+		memcpy(broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
 		       sizeof (union ib_gid));
+		priv->broadcast = broadcast;
 
-		spin_lock_irq(&priv->lock);
 		__ipoib_mcast_add(dev, priv->broadcast);
 		spin_unlock_irq(&priv->lock);
 	}
[openib-general] Re: Re: [PATCH] [RFC] - example user mode rdma ping/pong program using CMA
Quoting r. Roland Dreier [EMAIL PROTECTED]:
> Subject: Re: Re: [PATCH] [RFC] - example user mode rdma ping/pong program using CMA
>
>     Steve> Is this common practice for IB applications?
>
> No, I think it's more of a cute trick that works in your particular case.

Correct. Real apps are unlikely to get by with a single outstanding WR.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies
Re: [openib-general] Re: ipoib_mcast_send.patch
    Michael> Right, but I thought atomic test_and_set_bit implied
    Michael> smp_wmb already?

So did I but then I looked in the kernel source and now I think that set_bit operations are only ordered against other bitops that touch the same word. For example ia64 just uses cmpxchg to implement the bitops, and powerpc just uses locked loads and stores.

 - R.
[openib-general] IPoIB and lid change
Hi, Roland!

One issue we have with IPoIB is that IPoIB may cache a remote node path for a long time. Remote LID may get changed e.g. if the SM is changed, and IPoIB might lose connectivity.

One simple way to address this would be to have a list of all address handles per net device and kill them on an SM change event.

What do you think?

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies
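A rough user-space sketch of the proposal, to show the shape of the idea only. The structures, the event enum and every toy_* name are hypothetical; real code would live in the ipoib driver, take its locks, and react to the actual IB asynchronous event.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical cached address handle: just the remote LID. */
struct toy_ah {
	unsigned short dlid;
	struct toy_ah *next;
};

/* Hypothetical per-net-device state: a list of cached handles. */
struct toy_dev {
	struct toy_ah *ah_list;
	unsigned ah_count;
};

enum toy_event { TOY_EVENT_SM_CHANGE, TOY_EVENT_OTHER };

/* Cache a handle for a remote LID; returns 0 on success, -1 on failure. */
static int toy_ah_add(struct toy_dev *dev, unsigned short dlid)
{
	struct toy_ah *ah = malloc(sizeof(*ah));

	if (!ah)
		return -1;
	ah->dlid = dlid;
	ah->next = dev->ah_list;
	dev->ah_list = ah;
	dev->ah_count++;
	return 0;
}

/* Drop every cached handle; returns how many were dropped. */
static unsigned toy_ah_flush(struct toy_dev *dev)
{
	unsigned dropped = 0;

	while (dev->ah_list) {
		struct toy_ah *ah = dev->ah_list;

		dev->ah_list = ah->next;
		free(ah);
		dropped++;
	}
	assert(dropped == dev->ah_count);
	dev->ah_count = 0;
	return dropped;
}

/* On an SM change, flush so paths get resolved again on next use. */
static unsigned toy_handle_event(struct toy_dev *dev, enum toy_event ev)
{
	return ev == TOY_EVENT_SM_CHANGE ? toy_ah_flush(dev) : 0;
}
```

After the flush, the next send to a given destination would have to resolve the path again, picking up whatever LID the new SM assigned.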
[openib-general] Re: Re: ipoib_mcast_send.patch
Quoting r. Roland Dreier [EMAIL PROTECTED]:
> Subject: Re: Re: ipoib_mcast_send.patch
>
>     Michael> Right, but I thought atomic test_and_set_bit implied
>     Michael> smp_wmb already?
>
> So did I but then I looked in the kernel source and now I think that set_bit operations are only ordered against other bitops that touch the same word. For example ia64 just uses cmpxchg to implement the bitops, and powerpc just uses locked loads and stores.
>
>  - R.

Hmm. Roland, which kernel version is that?

On 2.6.15 I see in include/asm-powerpc/bitops.h

static __inline__ int test_and_set_bit(unsigned long nr,
				       volatile unsigned long *addr)
{
	unsigned long old, t;
	unsigned long mask = BITOP_MASK(nr);
	unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr);

	__asm__ __volatile__(
	EIEIO_ON_SMP
"1:"	PPC_LLARX "%0,0,%3		# test_and_set_bit\n"
	"or	%1,%0,%2 \n"
	PPC405_ERR77(0,%3)
	PPC_STLCX "%1,0,%3 \n"
	"bne-	1b"
	ISYNC_ON_SMP
	: "=&r" (old), "=&r" (t)
	: "r" (mask), "r" (p)
	: "cc", "memory");

	return (old & mask) != 0;
}

EIEIO_ON_SMP is a write barrier on smp, isn't it?

I see this in 2.6.11: include/asm-ppc64/bitops.h

static __inline__ int test_and_set_bit(unsigned long nr,
				       volatile unsigned long *addr)
{
	unsigned long old, t;
	unsigned long mask = 1UL << (nr & 0x3f);
	unsigned long *p = ((unsigned long *)addr) + (nr >> 6);

	__asm__ __volatile__(
	EIEIO_ON_SMP
"1:	ldarx	%0,0,%3		# test_and_set_bit\n\
	or	%1,%0,%2 \n\
	stdcx.	%1,0,%3 \n\
	bne-	1b"
	ISYNC_ON_SMP
	: "=&r" (old), "=&r" (t)
	: "r" (mask), "r" (p)
	: "cc", "memory");

	return (old & mask) != 0;
}

EIEIO_ON_SMP is exactly what is needed, no?

/*
 * The test_and_*_bit operations are taken to imply a memory barrier
 * on SMP systems.
 */

...

/*
 * test_and_*_bit do imply a memory barrier (?)
 */
static __inline__ int test_and_set_bit(int nr, volatile unsigned long *addr)
{
	unsigned int old, t;
	unsigned int mask = 1 << (nr & 0x1f);
	volatile unsigned int *p = ((volatile unsigned int *)addr) + (nr >> 5);

	__asm__ __volatile__(SMP_WMB "\n\
1:	lwarx	%0,0,%4 \n\
	or	%1,%0,%3 \n"
	PPC405_ERR77(0,%4)
"	stwcx.	%1,0,%4 \n\
	bne	1b"
	SMP_MB
	: "=&r" (old), "=&r" (t), "=m" (*p)
	: "r" (mask), "r" (p), "m" (*p)
	: "cc", "memory");

	return (old & mask) != 0;
}

Ahem. It does look to me like atomics imply smp_wmb.

--
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies
Re: [openib-general] Re: [PATCH] [RFC] - example user mode rdma ping/pong program using CMA
At 11:04 AM 2/8/2006, Michael S. Tsirkin wrote:
> Quoting r. Steve Wise [EMAIL PROTECTED]:
> > Subject: Re: [openib-general] Re: [PATCH] [RFC] - example user mode rdma ping/pong program using CMA
> >
> > On Wed, 2006-02-08 at 19:10 +0200, Michael S. Tsirkin wrote:
> > > Quoting r. Sean Hefty [EMAIL PROTECTED]:
> > > > Subject: RE: [openib-general] Re: [PATCH] [RFC] - example user mode rdma ping/pong program using CMA
> > > >
> > > > > Steve, looks like you have at most a single receive work request posted at the receive workqueue at all times. If true, this is *really* not a good idea, performance-wise, even if you actually have at most 1 packet in flight.
> > > >
> > > > Can you provide some more details on this?
> > >
> > > See 9.7.7.2 end-to-end (message level) flow control
> >
> > I just read this section in the 1.2 version of the spec, and I still don't understand what the issue really is. 9.7.7.2 talks about IBA doing flow control based on the RECV WQEs posted. rping always ensures that there is a RECV posted before the peer can send. This is ensured by the rping protocol itself (see the comment at the front of rping.c describing the ping loop).
> >
> > I'm only ever sending one outstanding message via SEND/RECV. I would rather post exactly what is needed, than post some number of RECVs just to be safe.
> >
> > Sorry if I'm being dense. What am I missing here?
> >
> > Steve.
>
> As far as I know, the credits are only updated by the ACK messages. If there is a single work request outstanding on the RQ, the ACK of the SEND message will have the credit field value 0 (since exactly one receive WR was outstanding, and that is now consumed).
>
> As a result the remote side will think that there are no receive WQEs and will slow down (what the spec refers to as limited WQE).

Correct. The ACK / NAK protocol used by IB is used to return credits. To pipeline and so improve performance, you must post multiple receive work requests to account for the expected round trip time of the fabric and the associated CA processing.

Mike
Re: [openib-general] Re: [PATCH] [RFC] - example user mode rdma ping/pong program using CMA
At 11:35 AM 2/8/2006, Steve Wise wrote:
> > > I just read this section in the 1.2 version of the spec, and I still don't understand what the issue really is. 9.7.7.2 talks about IBA doing flow control based on the RECV WQEs posted. rping always ensures that there is a RECV posted before the peer can send. This is ensured by the rping protocol itself (see the comment at the front of rping.c describing the ping loop).
> > >
> > > I'm only ever sending one outstanding message via SEND/RECV. I would rather post exactly what is needed, than post some number of RECVs just to be safe.
> > >
> > > Sorry if I'm being dense. What am I missing here?
> > >
> > > Steve.
> >
> > As far as I know, the credits are only updated by the ACK messages. If there is a single work request outstanding on the RQ, the ACK of the SEND message will have the credit field value 0 (since exactly one receive WR was outstanding, and that is now consumed).
> >
> > As a result the remote side will think that there are no receive WQEs and will slow down (what the spec refers to as limited WQE).
>
> Oh. I understand now. This is an issue with only 1 RQ WQE posted and how IB tries to inform the peer transport of the WQE count. For iWARP, none of this transport-level flow control happens (and I'm more familiar with iWARP than IB).

For iWARP, we decided not to implement application receiver based flow control for two reasons. First, TCP provides transport-level flow control (IB does not provide the equivalent per se). Second, when we examined the majority of the ULPs, we found that they exchange and track the number of receive buffers allowed to be processed, so there is no need to replicate this in iWARP. There are also some subtleties between a message-based transport and a byte stream such as TCP that go into the equation, but these are not that important for most application writers to deal with.

Mike
[openib-general] Re: ipoib_mcast_send.patch
So something like this should be good enough:

--- infiniband/ulp/ipoib/ipoib_multicast.c	(revision 5337)
+++ infiniband/ulp/ipoib/ipoib_multicast.c	(working copy)
@@ -533,8 +533,10 @@ void ipoib_mcast_join_task(void *dev_ptr
 	}
 
 	if (!priv->broadcast) {
-		priv->broadcast = ipoib_mcast_alloc(dev, 1);
-		if (!priv->broadcast) {
+		struct ipoib_mcast *broadcast;
+
+		broadcast = ipoib_mcast_alloc(dev, 1);
+		if (!broadcast) {
 			ipoib_warn(priv, "failed to allocate broadcast group\n");
 			mutex_lock(&mcast_mutex);
 			if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
@@ -544,10 +546,11 @@ void ipoib_mcast_join_task(void *dev_ptr
 			return;
 		}
 
-		memcpy(priv->broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
+		spin_lock_irq(&priv->lock);
+		memcpy(broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
 		       sizeof (union ib_gid));
+		priv->broadcast = broadcast;
 
-		spin_lock_irq(&priv->lock);
 		__ipoib_mcast_add(dev, priv->broadcast);
 		spin_unlock_irq(&priv->lock);
 	}
@@ -701,7 +704,9 @@ void ipoib_mcast_send(struct net_device
 	 */
 	spin_lock(&priv->lock);
 
-	if (!test_bit(IPOIB_MCAST_STARTED, &priv->flags) || !priv->broadcast) {
+	if (!test_bit(IPOIB_MCAST_STARTED, &priv->flags)	||
+	    !priv->broadcast					||
+	    !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
 		++priv->stats.tx_dropped;
 		dev_kfree_skb_any(skb);
 		goto unlock;
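The send-path condition in the last hunk can be modelled outside the kernel as follows. This is a sketch under stated assumptions: the `toy_*` types and the `MCAST_*` bits are hypothetical stand-ins for the driver's structures and for `IPOIB_MCAST_STARTED` / `IPOIB_MCAST_FLAG_ATTACHED`, and locking is left out entirely. It only shows why the NULL check has to come before the attached-flag test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the driver's flag bits. */
enum { MCAST_STARTED = 1u << 0 };    /* models IPOIB_MCAST_STARTED */
enum { MCAST_ATTACHED = 1u << 0 };   /* models IPOIB_MCAST_FLAG_ATTACHED */

struct toy_mcast {
    unsigned int flags;
};

struct toy_priv {
    unsigned int flags;
    struct toy_mcast *broadcast;   /* NULL until the group is allocated */
    unsigned long tx_dropped;
};

/*
 * Returns true if a multicast packet may proceed. Otherwise the packet
 * is counted as dropped. The || chain short-circuits, so
 * priv->broadcast is only dereferenced once it is known to be non-NULL.
 */
static bool toy_mcast_may_send(struct toy_priv *priv)
{
    assert(priv != NULL);
    if (!(priv->flags & MCAST_STARTED) ||
        priv->broadcast == NULL ||
        !(priv->broadcast->flags & MCAST_ATTACHED)) {
        priv->tx_dropped++;
        return false;
    }
    return true;
}
```

With this ordering, a device whose broadcast group has been allocated but not yet attached drops the packet instead of sending on a group that has not been joined.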
RE: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediate data proposal
At 09:16 PM 2/6/2006, Sean Hefty wrote:
> The requirement is to provide an API that supports RDMA writes with immediate data. A send that follows an RDMA write is not immediate data, and the API should not be constructed around trying to make it so.
>
> To be clear, I believe that write with immediate should be part of the normal APIs, rather than an extension, but should be designed around those devices that provide it natively.

One thing to keep in mind is that the IBTA workgroup responsible for the transport wanted to eliminate immediate data support entirely, but it was retained solely to enable VIA application migration (even though the application base was quite small). If that requirement could have been eliminated, then it would have been gone in a heartbeat. Given that an RDMA-WRITE followed by a SEND provides the same application semantics based on the use models, iWARP chose not to support immediate data.

So, here we have a long discussion on attempting to perpetuate a concept that is not universal across transports and was deemed to have minimal value that most wanted to see removed from the architecture. One has to question the value of trying to develop any API / software to support immediate data instead of just enabling the preferred method, which is RDMA WRITE - SEND. I agree with those who have contended that this is difficult to do in a general purpose fashion. When all of this is taken into account, it seems the only good engineering answer is to eliminate immediate data support in the software and focus on the method that works across all interconnects.

Mike
[openib-general] Re: IPoIB and lid change
    Michael> One simple way to address this would be to have a list of
    Michael> all address handles per net device and kill them on an SM
    Michael> change event.

Seems reasonable. It seems a little painful to implement at first glance, but I might be looking at it wrong.

 - R.
Re: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediate data proposal
    Michael> So, here we have a long discussion on attempting to
    Michael> perpetuate a concept that is not universal across
    Michael> transports and was deemed to have minimal value that most
    Michael> wanted to see removed from the architecture.

But this discussion is being driven by an application developer who does see value in immediate data.

Arlin, can you quantify the benefit you see from RDMA write with immediate vs. RDMA write followed by a send?

 - R.
[openib-general] Re: ipoib_mcast_send.patch
Quoting r. Roland Dreier [EMAIL PROTECTED]:
> Subject: Re: ipoib_mcast_send.patch
>
> So something like this should be good enough:
>
> --- infiniband/ulp/ipoib/ipoib_multicast.c	(revision 5337)
> +++ infiniband/ulp/ipoib/ipoib_multicast.c	(working copy)
> @@ -533,8 +533,10 @@ void ipoib_mcast_join_task(void *dev_ptr
>  	}
>  
>  	if (!priv->broadcast) {
> -		priv->broadcast = ipoib_mcast_alloc(dev, 1);
> -		if (!priv->broadcast) {
> +		struct ipoib_mcast *broadcast;
> +
> +		broadcast = ipoib_mcast_alloc(dev, 1);
> +		if (!broadcast) {
>  			ipoib_warn(priv, "failed to allocate broadcast group\n");
>  			mutex_lock(&mcast_mutex);
>  			if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
> @@ -544,10 +546,11 @@ void ipoib_mcast_join_task(void *dev_ptr
>  			return;
>  		}
>  
> -		memcpy(priv->broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
> +		spin_lock_irq(&priv->lock);
> +		memcpy(broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
>  		       sizeof (union ib_gid));
> +		priv->broadcast = broadcast;
>  
> -		spin_lock_irq(&priv->lock);
>  		__ipoib_mcast_add(dev, priv->broadcast);
>  		spin_unlock_irq(&priv->lock);
>  	}

That's identical to what I posted up to this point, right?

> @@ -701,7 +704,9 @@ void ipoib_mcast_send(struct net_device
>  	 */
>  	spin_lock(&priv->lock);
>  
> -	if (!test_bit(IPOIB_MCAST_STARTED, &priv->flags) || !priv->broadcast) {
> +	if (!test_bit(IPOIB_MCAST_STARTED, &priv->flags)	||
> +	    !priv->broadcast					||
> +	    !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
>  		++priv->stats.tx_dropped;
>  		dev_kfree_skb_any(skb);
>  		goto unlock;

I thought it's important for performance to queue packets under mcast->pkt_queue? If not, why do we do it?

Maybe we shouldn't call netif_carrier_on if we drop all packets?

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies
[openib-general] Re: ipoib_mcast_send.patch
    Michael> That's identical to what I posted up to this point, right?

I think I added one blank line, but other than that, yes.

    Michael> I thought it's important for performance to queue packets
    Michael> under mcast->pkt_queue? If not, why do we do it? Maybe we
    Michael> shouldn't call netif_carrier_on if we drop all packets?

The queueing is there so that we aren't guaranteed to drop the first multicast packet sent to a given group. I'm not sure that it really is important, but it does seem like it would be bad to lose that packet every time.

From reading the code, we can't call netif_carrier_on until after priv->broadcast has the attached flag set. In ipoib_mcast_join_task(), we have

	if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
		ipoib_mcast_join(dev, priv->broadcast, 0);
		return;
	}

and then at the very bottom

	netif_carrier_on(dev);

 - R.
[openib-general] Re: IPoIB and lid change
Quoting r. Roland Dreier [EMAIL PROTECTED]:
> Subject: Re: IPoIB and lid change
>
>     Michael> One simple way to address this would be to have a list of
>     Michael> all address handles per net device and kill them on an SM
>     Michael> change event.
>
> Seems reasonable. It seems a little painful to implement at first glance, but I might be looking at it wrong.

It will be very easy once you merge ipoib_all_neigh_issues_2.patch, since that gets us a list of neigh to walk on an SM event.

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies
Re: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediate data proposal
Roland Dreier wrote:
>     Michael> So, here we have a long discussion on attempting to
>     Michael> perpetuate a concept that is not universal across
>     Michael> transports and was deemed to have minimal value that most
>     Michael> wanted to see removed from the architecture.
>
> But this discussion is being driven by an application developer who does see value in immediate data. Arlin, can you quantify the benefit you see from RDMA write with immediate vs. RDMA write followed by a send?

We need speed and simplicity.

Consider a very latency-sensitive application that requires immediate notification of RDMA write completion on the remote node without ANY latency penalties associated with combining operations, HCA priority rules across QPs, wire congestion, etc., and that has no requirement for messaging outside of remote RDMA write completion notifications. The application would not have to register and manage additional message buffers on either side; we can just size the queues accordingly and post zero-byte messages.

We need something that would be equivalent to sitting there polling on the last byte of inbound data. But since data ordering within an operation is not guaranteed, that is not an option.

So RDMA write with immediate data is the simplest and most efficient method for indicating RDMA write completion that we have available today. In fact, I would like to see it increased in size to make it even more useful.

-arlin
[openib-general] Could not retrieve handle to the HCA InfiniHost0 (VAPI_EINVAL_HCA_ID)
Hello All,

When I run vstat I get the following error. What does this mean?

    vstat
    1 HCA found:
    hca_id=InfiniHost0
    Error: Could not retrieve handle to the HCA InfiniHost0 (VAPI_EINVAL_HCA_ID)

/var/log/messages has this:

    [KERNEL_IB][_tslbTavorPnPEventHandler][/var/tmp/IBGD//tmp/openib/infiniband/ib_verbs/hw/provider/tavor_main.c:352]_tslbTavorPnPEventHandler: could not add HCA InfiniHost0 (-19)

What are the possible things that might have gone wrong? Does anyone know?

Rangam
Re: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediate data proposal
    Arlin> A very latency sensitive application that requires
    Arlin> immediate notification of RDMA write completion on the
    Arlin> remote node without ANY latency penalties associated with
    Arlin> combining operations, HCA priority rules across QPs, wire
    Arlin> congestion, etc. An application that has no requirement for
    Arlin> messaging outside of remote rdma write completion
    Arlin> notifications. The application would not have to register
    Arlin> and manage additional message buffers on either side, we
    Arlin> can just size the queues accordingly and post zero byte
    Arlin> messages. We need something that would be equivalent to
    Arlin> sitting there polling on the last byte of inbound
    Arlin> data. But, since data ordering within an operation is not
    Arlin> guaranteed that is not an option. So, rdma with immediate
    Arlin> data is the most optimal and simplistic method for
    Arlin> indication of RDMA-write completion that we have available
    Arlin> today. In fact, I would like to see it increased in size to
    Arlin> make it even more useful.

Hmm. Can you put a number on how much better RDMA write with immediate is on current HCA hardware? How does using the underlying OpenIB verbs ability to post a list of work requests compare (i.e. posting an RDMA write followed by a send in one verbs call)?

Maybe post multiple is a better direction for DAT.

 - R.
RE: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediate data proposal
> One thing to keep in mind is that the IBTA workgroup responsible for the transport wanted to eliminate immediate data support entirely but it was retained solely to enable VIA application migration (even though the application base was quite small). If that requirement could have been eliminated, then it would have been gone in a heartbeat. Given that an RDMA-WRITE followed by a SEND provides the same application semantics based on the use models, iWARP chose not to support immediate data.

Mike,

I was not part of the original IBTA discussions and I won't argue whether this facility should or shouldn't have been included. Nevertheless, it is part of the specification, there are HCA vendors that implement it, and we have applications that make use of it.

I would, however, disagree with your assertion that a write followed by a send is semantically equivalent to write immediate. Ordering may be semantically the same, but the service is not. Receive work completions are explicitly indicated as being associated with immediate data, and therefore with an associated write completion. A write followed by a send does not provide the same indication semantic.

Roy
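The distinction drawn here can be sketched as a toy receive-completion handler. Everything below is hypothetical and for illustration only: `toy_rx_kind`, `toy_rx_completion` and `toy_write_has_landed` are invented names, not DAT or verbs definitions. The assumption is simply that a completion generated by a write with immediate is tagged as such, whereas a completion for a plain send carries no such tag and the application must encode the meaning in the message itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical completion kinds seen by the data sink. */
enum toy_rx_kind {
    TOY_RX_SEND,             /* plain SEND consumed a receive buffer */
    TOY_RX_WRITE_WITH_IMM    /* RDMA write with immediate consumed it */
};

struct toy_rx_completion {
    enum toy_rx_kind kind;
    uint32_t imm_data;       /* only meaningful for TOY_RX_WRITE_WITH_IMM */
};

/*
 * Returns true only when the completion itself says that an RDMA write
 * has been placed. For a plain SEND the transport gives no such
 * indication, so this model reports false and leaves interpretation of
 * the message payload to the application protocol.
 */
static bool toy_write_has_landed(const struct toy_rx_completion *wc)
{
    assert(wc != NULL);
    return wc->kind == TOY_RX_WRITE_WITH_IMM;
}
```

In the write-then-send scheme the sink would have to inspect the message body to learn which write finished, which is the extra convention the message above is pointing at.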
[openib-general] Antigen found FILE FILTER= *.* file
Antigen for Exchange found 21_price.zip-bqqvauygc.exe matching FILE FILTER= *.* file filter. The file is currently Removed. The message, [openib-general] price, was sent from [EMAIL PROTECTED] and was discovered in SMTP Messages\Inbound located at Quadrics/First Administrative Group/EXCH01.
RE: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediate data proposal
[EMAIL PROTECTED] wrote:
>     Arlin> A very latency sensitive application that requires
>     Arlin> immediate notification of RDMA write completion on the
>     Arlin> remote node without ANY latency penalties associated with
>     Arlin> combining operations, HCA priority rules across QPs, wire
>     Arlin> congestion, etc. An application that has no requirement for
>     Arlin> messaging outside of remote rdma write completion
>     Arlin> notifications. The application would not have to register
>     Arlin> and manage additional message buffers on either side, we
>     Arlin> can just size the queues accordingly and post zero byte
>     Arlin> messages. We need something that would be equivalent to
>     Arlin> sitting there polling on the last byte of inbound
>     Arlin> data. But, since data ordering within an operation is not
>     Arlin> guaranteed that is not an option. So, rdma with immediate
>     Arlin> data is the most optimal and simplistic method for
>     Arlin> indication of RDMA-write completion that we have available
>     Arlin> today. In fact, I would like to see it increased in size to
>     Arlin> make it even more useful.
>
> Hmm. Can you put a number on how much better RDMA write with immediate is on current HCA hardware? How does using the underlying OpenIB verbs ability to post a list of work requests compare (i.e. posting an RDMA write followed by a send in one verbs call)?
>
> Maybe post multiple is a better direction for DAT.

The distinction between Write and Send versus post multiple is that it maintains a very simple one-to-one correspondence with the post_recv at the data sink.

I also do not see how the *application* keeping the write and send semantics can have a negative performance implication if we allow InfiniBand Providers to encode it as an RDMA Write with Immediate.

If the Data Source needs to communicate to the Data Sink that a specific RDMA Write transfer is done, then it is sending a message. Information transfer and synchronization are occurring.

I fail to see the value, let alone the optimization, of layering on an extra bit of information disguised as an opcode and using a specific transport's encoding methods as the model for a transport neutral API. (This applies particularly at the DAT layer. At the verb layer it is a different issue, because there we do not want to hide any hardware capabilities even while encouraging safe harbor transport neutral practices.)

If distinguishing between 32-bit messages and 32-bit immediates that can arrive in indeterminate order is really that important to your application, then maybe you really needed a 33-bit message to begin with. Encoding application layer information via your choice of carrier pigeon is not a very robust strategy.
RE: [openib-general] Re: [PATCH] [RFC] - example user mode rdma ping/pong program using CMA
[EMAIL PROTECTED] wrote:
> At 11:35 AM 2/8/2006, Steve Wise wrote:
>
> I just read this section in the 1.2 version of the spec, and I still don't understand what the issue really is. 9.7.7.2 talks about IBA doing flow control based on the RECV WQEs posted. rping always ensures that there is a RECV posted before the peer can send. This is ensured by the rping protocol itself (see the comment at the front of rping.c describing the ping loop). I'm only ever sending one outstanding message via SEND/RECV. I would rather post exactly what is needed than post some number of RECVs just to be safe. Sorry if I'm being dense. What am I missing here?
>
> Steve.
>
> As far as I know, the credits are only updated by the ACK messages. If there is a single work request outstanding on the RQ, the ACK of the SEND message will have the credit field value 0 (since exactly one receive WR was outstanding, and that is now consumed). As a result the remote side will think that there are no receive WQEs and will slow down (what the spec refers to as limited WQE).
>
> Oh. I understand now. This is an issue with only 1 RQ WQE posted and how IB tries to inform the peer transport of the WQE count. For iWARP, none of this transport-level flow control happens (and I'm more familiar with iWARP than IB).
>
> For iWARP, we decided not to implement application receiver based flow control for two reasons: TCP provides transport-level flow control (IB does not provide the equivalent per se), and upon examination, the majority of the ULPs exchange and track the number of receive buffers allowed to be processed, so there is no need to replicate this in iWARP. There are some subtleties as well between a message-based transport and a byte stream such as TCP that go into the equation, but these are not that important for most application writers to deal with.
>
> Mike

But in terms of compiling the safe harbor transport neutral recommended programming practices, I think this is a valid point.

Having one spare buffer is a good safety mechanism at the application layer in general, *and* it may prevent snarls in the transport layer flow control.

Suggesting that consumers avoid letting the RQ hit empty strikes me as a valid transport neutral recommendation. And we'll improve the public education by following those recommendations in sample and test programs.
RE: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediate data proposal
> Hmm. Can you put a number on how much better RDMA write with immediate is on current HCA hardware? How does using the underlying OpenIB verbs ability to post a list of work requests compare (i.e. posting an RDMA write followed by a send in one verbs call)?
>
> Maybe post multiple is a better direction for DAT.

A post multiple call as a general API makes sense, but I think that's a separate issue. Given that IB provides true immediate data with RDMA writes, a way should be available to make use of it.

I don't know what the performance numbers are for a write with immediate versus a write followed by a send, but I don't think that anyone could argue that the write with immediate wouldn't perform better.

To me, the question is whether write with immediate is supported as a transport specific extension, which was Arlin's original patch, or through some standard API. The attempt to make the API standard, so that iWarp could emulate it (poorly in my view), is what appears to be driving the disagreements.

It also appears to me that the decisions are coming down to one of the following. If iWarp can emulate write with immediate, then a generic API should be used. If iWarp cannot properly emulate write with immediate, then the API should be transport specific. It's curious to me that in both cases, iWarp is driving the API decision and design for something that is an IB specific feature.

- Sean
RE: [openib-general] FW: [PATCH 1 of 3] mad: large RMPP support
My point was not the total storage used for the array (the linked list ends up using more, as you noted). I'm concerned that an allocation of a 4K buffer may fail in a situation where lots of small allocations of around 256 bytes would succeed. Is your point that if we fail to allocate a 4K buffer, we're in deep trouble already? Note that I've only considered a 1000 host cluster.

What about scalability (e.g., 10,000 nodes -- we then need a 40K buffer)? The linked list has no scalability problem (no need to push RMPP handling to user space).

Regarding the list-walk, if we track the last-sent segment in the list, there is no need to do the list walk (we simply get the next segment in the list). We'll only have a short list walk when the ack pointer gets updated (need to walk forward only current-RMPP-ack-window-size items in the linked list from the previously ack'ed item).

--

What is the reason you are thinking about 64-byte boundary support?

Jack

-----Original Message-----
From: Sean Hefty [mailto:[EMAIL PROTECTED]
Sent: Wednesday, February 08, 2006 7:13 PM
To: Jack Morgenstein; openib-general@openib.org
Subject: RE: [openib-general] FW: [PATCH 1 of 3] mad: large RMPP support

> For example, a 1000 host cluster, with 2 ports per HCA will have at least 4000 records in a SubnAdmGetTableResp for all PortInfo records on the network (2000 for HCAs, and at least 2000 for the switch ports). Such a query response will generate an RMPP of size 256K -- 1000 segments, or a 4K buffer on an X86 machine just for the array (assuming one allocation per RMPP segment -- N=1).

I think that this is a good reason to use an array. Walking a 1000 entry list 1000 times is a substantial performance hit. Lost MADs and retries will make this worse. A 4K buffer for the array is less than the 8K total needed for the 1000 list items.

We're already talking about allocating over 256K of memory just for the data payload. An additional contiguous 4K buffer seems like a minor issue.

I'm not convinced that there's a real issue here. To support ridiculously large transfers from userspace, we may need to push the RMPP handling up into userspace.

- Sean
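The figures traded in this exchange can be reproduced with a few lines of arithmetic. The sketch below is not taken from the MAD code; `rmpp_size` and `struct rmpp_sizing` are hypothetical names, and the 64 bytes per record is an assumption inferred from "4000 records" producing roughly "256K", not a value stated in the thread. Pointer size is a parameter so the 32-bit x86 case (4 bytes) from the message can be checked.

```c
#include <assert.h>
#include <stddef.h>

#define SEGMENT_BYTES 256u   /* per-segment size used in the thread */
#define RECORD_BYTES   64u   /* assumption: inferred from 4000 records ~ 256K */

struct rmpp_sizing {
    size_t records;         /* PortInfo records in the response */
    size_t payload_bytes;   /* total data payload */
    size_t segments;        /* RMPP segments, one allocation each */
    size_t array_bytes;     /* contiguous array: one pointer per segment */
    size_t list_bytes;      /* linked list: two pointers per segment */
};

/*
 * Assumes, as in the message, one record per HCA port plus at least as
 * many switch-port records, so records = 2 * hosts * ports_per_hca.
 */
static struct rmpp_sizing rmpp_size(size_t hosts, size_t ports_per_hca,
                                    size_t ptr_bytes)
{
    struct rmpp_sizing s;

    assert(ptr_bytes > 0);
    s.records = 2 * hosts * ports_per_hca;
    s.payload_bytes = s.records * RECORD_BYTES;
    s.segments = (s.payload_bytes + SEGMENT_BYTES - 1) / SEGMENT_BYTES;
    s.array_bytes = s.segments * ptr_bytes;
    s.list_bytes = s.segments * 2 * ptr_bytes;
    return s;
}
```

Calling `rmpp_size(1000, 2, 4)` gives 4000 records, 1000 segments, a 4000-byte array and 8000 bytes of list links, matching the "4K" and "8K" quoted above; `rmpp_size(10000, 2, 4)` gives the 40K array mentioned for a 10,000 node cluster.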
RE: [openib-general] FW: [PATCH 1 of 3] mad: large RMPP support
> I'm concerned that an allocation of a 4K buffer may fail in a situation where lots of small allocations of around 256 bytes would succeed. Is your point that if we fail to allocate a 4K buffer, we're in deep trouble already? Note that I've only considered a 1000 host cluster.

Yes - if we can't allocate a 4k buffer, it seems highly unlikely that we'd be able to allocate 1000 256-byte buffers.

> What about scalability (e.g., 10,000 nodes -- we then need a 40K buffer) -- the linked list has no scalability problem (no need to push RMPP handling to user space).

I did consider this, and I don't know when we'll start hitting issues allocating a single data buffer. But we're going to ask for 10,000 256-byte buffers - over 2.5 MB of kernel memory in order to perform this single data transfer. Is it likely that we can allocate that much memory, but not the 40k buffer? I really don't know. If the answer is yes, then I agree that using a linked list would be better.

> Regarding the list-walk, if we track the last-sent segment in the list, there is no need to do the list walk (we simply get the next segment in the list). We'll only have a short list walk when the ack pointer gets updated (need to walk forward only current-RMPP-ack-window-size items in the linked list from the previously ack'ed item).

I thought of this as well. For efficiency, you need to track the last sent and last acked, meaning that the list will be walked at most twice. You may be able to jump the ack pointer to last sent if that is a common case.

> What is the reason you are thinking about 64-byte boundary support?

I was concerned about 64-byte values in the MADs aligned on a 32-byte boundary. But then I think that some of the MADs have this issue anyway by architectural design.

- Sean
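The "track last sent and last acked" idea can be sketched with a plain singly linked list. This is an illustration only, not the RMPP implementation: `struct seg`, `struct seg_cursor`, `next_to_send` and `ack_up_to` are hypothetical names, segment numbers are assumed to start at 1 and increase along the list, and retransmission is ignored. The `steps` counter exists only to make the "walked at most twice" claim checkable.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical segment node; a real one would also carry the payload. */
struct seg {
    unsigned int num;      /* segment number, ascending along the list */
    struct seg *next;
};

struct seg_cursor {
    struct seg *last_sent;    /* NULL until the first segment is sent */
    struct seg *last_acked;   /* NULL until the first ACK arrives */
    unsigned long steps;      /* nodes visited, for illustration only */
};

/* Fetch the next segment to send without walking from the head. */
static struct seg *next_to_send(struct seg_cursor *c, struct seg *head)
{
    struct seg *s = c->last_sent ? c->last_sent->next : head;

    if (s) {
        c->last_sent = s;
        c->steps++;
    }
    return s;
}

/* Move the ACK cursor forward from its previous position only. */
static void ack_up_to(struct seg_cursor *c, struct seg *head,
                      unsigned int acked)
{
    struct seg *s = c->last_acked ? c->last_acked : head;

    while (s && s->num < acked) {
        s = s->next;
        c->steps++;
    }
    if (s && s->num == acked)
        c->last_acked = s;
}
```

Because both cursors only ever move forward, sending and acknowledging every segment of an n-entry list visits at most about 2n nodes in total, rather than restarting from the head for each segment.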