[openib-general] [PATCH] Opensm - osm_mcast_mgr.c add type casting

2006-02-08 Thread Yael Kalka

Hi Hal,

The following patch adds a missing type cast to the return value of
the function osm_mcast_mgr_compute_max_hops.

Thanks,
Yael

Signed-off-by:  Yael Kalka [EMAIL PROTECTED]

Index: osm_mcast_mgr.c
===
--- osm_mcast_mgr.c (revision 5307)
+++ osm_mcast_mgr.c (working copy)
@@ -269,7 +269,7 @@ osm_mcast_mgr_compute_max_hops(
   }
 
   OSM_LOG_EXIT( p_mgr->p_log );
-  return( max_hops );
+  return(float)(max_hops);
 }
 
 /**

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


[openib-general] [PATCH] Opensm - type changing in st.h/c files

2006-02-08 Thread Yael Kalka

Hi Hal,

There was a problem with some of the defined types when compiling on
64-bit Windows machines. The following patch adds support for those
machines as well.

Thanks,
Yael

Signed-off-by:  Yael Kalka [EMAIL PROTECTED]

Index: include/opensm/st.h
===
--- include/opensm/st.h (revision 5307)
+++ include/opensm/st.h (working copy)
@@ -50,14 +50,21 @@
 
 BEGIN_C_DECLS
 
-typedef unsigned long st_data_t;
+#if (__WORDSIZE == 64) || defined (_WIN64)
+#define st_ptr_t unsigned long long
+#else
+#define st_ptr_t unsigned long
+#endif
+
+typedef st_ptr_t st_data_t;
+
 #define ST_DATA_T_DEFINED
 
 typedef struct st_table st_table;
 
 struct st_hash_type {
   int (*compare)(void *, void *);
-  int (*hash)(void *);
+  st_ptr_t (*hash)(void *);
 };
 
 struct st_table {
Index: opensm/st.c
===
--- opensm/st.c (revision 5307)
+++ opensm/st.c (working copy)
@@ -41,7 +41,6 @@
 #  include <config.h>
 #endif /* HAVE_CONFIG_H */
 
-#include <config.h>
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
@@ -73,7 +72,7 @@ struct st_table_entry {
  *
  */
 static int numcmp(void *, void *);
-static int numhash(void *);
+static st_ptr_t numhash(void *);
 static struct st_hash_type type_numhash = {
   numcmp,
   numhash,
@@ -83,7 +82,7 @@ static struct st_hash_type type_numhash 
 /* extern int strcmp(const char *, const char *); */ 
 static int strhash(const char *);
 
-static inline int st_strhash(void *key)
+static inline st_ptr_t st_strhash(void *key)
 {
   return strhash((const char *)key);
 }
@@ -619,12 +618,12 @@ static int
 numcmp(x, y)
  void *x, *y;
 {
-  return (long)x != (long)y;
+  return (st_ptr_t)x != (st_ptr_t)y;
 }
 
-static int
+static st_ptr_t
 numhash(n)
  void *n;
 {
-  return (long)n;
+  return (st_ptr_t)n;
 }
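
The portability issue behind this patch is that `long` stays 32 bits wide on 64-bit Windows, so casting a pointer to `long` truncates it. A minimal sketch of the same idea follows, assuming a C99 toolchain that provides `uintptr_t` in `<stdint.h>`. The names `demo_ptr_t`, `demo_numcmp`, `demo_numhash` and `demo_roundtrip` are hypothetical stand-ins for illustration and are not part of st.c; the patch itself selects `st_ptr_t` with a preprocessor test instead.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical alias for illustration; the patch itself uses st_ptr_t. */
typedef uintptr_t demo_ptr_t;

/* Returns 0 when both keys hold the same pointer value, non-zero
 * otherwise, mirroring numcmp() in the patch. */
static int demo_numcmp(void *x, void *y)
{
    return (demo_ptr_t)x != (demo_ptr_t)y;
}

/* Uses the full pointer value as the hash, mirroring numhash(). */
static demo_ptr_t demo_numhash(void *n)
{
    return (demo_ptr_t)n;
}

/* Round-trips a pointer through the integer type. This is lossless only
 * when the integer type is at least as wide as a pointer. */
static void *demo_roundtrip(void *p)
{
    return (void *)demo_numhash(p);
}
```

`uintptr_t` is formally optional in C99, so the preprocessor approach in the patch may still be the safer choice for older compilers.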



[openib-general] Re: openib and mellanox hca problem

2006-02-08 Thread Michael S. Tsirkin
I wonder whether we managed to locate the bridge.
It would be interesting to build mthca with debug enabled.

Quoting r. Michael Di Domenico [EMAIL PROTECTED]:
 
 What specifically would you like to know?
 
 On 2/7/06, Roland Dreier [EMAIL PROTECTED] wrote:
   Feb  7 16:59:48 linux14-ts kernel: ib_mthca :07:00.0: PCI device did 
   not come back after reset, aborting.
 
  Can you give more details on the system where you saw this?
 
   - R.

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies


[openib-general] Re: openib and mellanox hca problem

2006-02-08 Thread Michael S. Tsirkin
If you really suspect timing issues, you can always
increase timeouts: look for msleep in mthca_reset.c and try bumping up
the numbers.

Anyway - could you please enable mthca debug in menuconfig?
This would give us some more information on what's going on.
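
To illustrate what "bumping up the numbers" changes, here is a sketch of the general poll-until-responsive pattern with a bounded number of tries. It is not the code in mthca_reset.c: `demo_dev`, `demo_read_config` and `demo_wait_for_device` are hypothetical names, and the fake device simply reports all-ones until a chosen number of polls has passed.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a device that does not answer config-space
 * reads until `ready_after` polls have happened. */
struct demo_dev {
    int polls;
    int ready_after;
};

/* Returns all-ones while the device is unresponsive, and an arbitrary
 * other value once it has come back. */
static uint32_t demo_read_config(struct demo_dev *d)
{
    return (d->polls++ >= d->ready_after) ? 0x12345678u : 0xffffffffu;
}

/* Polls up to max_tries times. Returns the number of polls used on
 * success, or -1 if the device never responded within the budget. */
static int demo_wait_for_device(struct demo_dev *d, int max_tries)
{
    int c;

    for (c = 0; c < max_tries; ++c) {
        if (demo_read_config(d) != 0xffffffffu)
            return c + 1;
        /* A real driver would sleep here between polls. */
    }
    return -1;
}
```

With this model, a device that needs 150 polls fails with a budget of 100 tries and succeeds with a budget of 200, which is the effect a longer timeout would have if the problem were purely one of timing.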


Quoting r. Ranjit Pandit [EMAIL PROTECTED]:
 Subject: Re: openib and mellanox hca problem
 
 Michael,
 
 I have seen this problem before..
 See following mail thread
 
 http://www.mail-archive.com/openib-general@openib.org/msg13861.html
 
 Commenting out call to mthca_reset() in mthca_main.c worked around the
 problem on my system, and as far as I can tell, did not have any
 negative impact.
 
 It will be good if someone reviews the reset path in mthca.
 
 Ranjit
 
 
 On 2/7/06, Michael Di Domenico [EMAIL PROTECTED] wrote:
  I'm trying to build a system using the openib drivers with a mellanox
  hca card.  I don't have much information about the card itself, it's
  in a server right now...
 
  But I downloaded openib today from the svn source, installed it onto a
  fresh copy of Fedora Core 4 with Kernel version 2.6.15.3...
  Everything seemed to compile fine and install okay.  I've been
  following the instructions from the wiki page thus far without a
  problem.  I get upto this step
 
  modprobe ib_mthca
 
  and get the below error in /var/log/messages.  Strangely enough all
  the modules load, and i do a udevstart, but i never get a
  /dev/infiniband directory and /sys/class/infiniband directory is
  empty.
 
  Does anyone know how i might fix this, or point me to some better
  documentation then what is on the wiki?
 
  Thanks
  - Michael
 
 
  Feb  7 16:59:37 linux14-ts kernel: ib_mthca: Mellanox InfiniBand HCA
  driver v0.06 (June 23, 2005)
  Feb  7 16:59:37 linux14-ts kernel: ib_mthca: Initializing :07:00.0
  Feb  7 16:59:37 linux14-ts kernel: ACPI: PCI Interrupt :07:00.0[?]
  - GSI 26 (level, low) - IRQ 217
  Feb  7 16:59:48 linux14-ts kernel: ib_mthca :07:00.0: PCI device
  did not come back after reset, aborting.
  Feb  7 16:59:48 linux14-ts kernel: ib_mthca :07:00.0: Failed to
  reset HCA, aborting.
  Feb  7 16:59:48 linux14-ts kernel: ACPI: PCI interrupt for device
  :07:00.0 disabled
 
 
  --- lspci output
  06:03.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev ff)
  (prog-if ff)
  !!! Unknown header type 7f
 
  07:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev ff)
  (prog-if ff)
  !!! Unknown header type 7f

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies


[openib-general] iser: cleanups changeset

2006-02-08 Thread Or Gerlitz
kind of huge cleanup as part of the preparations for the RFC

 iscsi_iser.h |  166 +--
 iser_initiator.c |   69 --
 iser_memory.c|  138 +
 iser_verbs.c |   52 -
 4 files changed, 156 insertions(+), 269 deletions(-)


r5336 | ogerlitz | 2006-02-08 13:13:17 +0200 (Wed, 08 Feb 2006) | 4 lines

cleanups

Signed-off-by: Or Gerlitz [EMAIL PROTECTED]




[openib-general] trying to run cmpost example

2006-02-08 Thread Xavier Grave
Hi all,

one more newbie question.
Here is my ib modules installation (2.6.15 kernel from ftp.kernel.org)
lsmod | grep ib
ib_umad                26472  0
ib_ucm                 31992  0
ib_cm                  50648  1 ib_ucm
ib_mthca              156244  0
ib_uverbs              57968  0
ib_ipoib               61736  0
ib_sa                  24568  1 ib_ipoib
ib_mad                 56548  4 ib_umad,ib_cm,ib_mthca,ib_sa
ib_core                71344  8 ib_umad,ib_ucm,ib_cm,ib_mthca,ib_uverbs,ib_ipoib,ib_sa,ib_mad

I run cmpost from libibcm/example directory as root
ls -la /dev/infiniband/ucm0 gives : crw-r--r-- 1 root root 231, 255
2006-02-08 13:28 /dev/infiniband/ucm0
Prompt LD_LIBRARY_PATH=/usr/local/lib ./cmpost
libibcm: error -1:6 opening device /dev/infiniband/ucm0
starting server
listen request failed
test complete

Does somebody have an idea of what is missing ?
All my lib code comes from the svn repository, do I need to modify the
2.6.15 infiniband directory ?

Thanks in advance, xavier



[openib-general] Re: ipoib_mcast_send.patch

2006-02-08 Thread Michael S. Tsirkin
Quoting r. Roland Dreier [EMAIL PROTECTED]:
 Subject: Re: ipoib_mcast_send.patch
 
 Michael I agree. Do you want to fix it or should I?
 
 If you get a chance that would be great.  I'm at the OpenIB workshop
 now so I probably can't seriously look at it until tomorrow at the
 earliest.

Here you are. The following is in ipoib_broadcast_gid.patch in svn.

---

The way priv->broadcast is initialized in ipoib_mcast_join_task() is somewhat
unsafe, since there's no lock and conceivably a send-only join could complete
before priv->broadcast is fully set up.

Signed-off-by: Michael S. Tsirkin [EMAIL PROTECTED]

Index: openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
===
--- openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c   (revision 5336)
+++ openib/drivers/infiniband/ulp/ipoib/ipoib_multicast.c   (working copy)
@@ -533,8 +533,9 @@ void ipoib_mcast_join_task(void *dev_ptr
 	}
 
 	if (!priv->broadcast) {
-		priv->broadcast = ipoib_mcast_alloc(dev, 1);
-		if (!priv->broadcast) {
+		struct ipoib_mcast *broadcast;
+		broadcast = ipoib_mcast_alloc(dev, 1);
+		if (!broadcast) {
 			ipoib_warn(priv, "failed to allocate broadcast group\n");
 			mutex_lock(&mcast_mutex);
 			if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
@@ -544,10 +545,11 @@ void ipoib_mcast_join_task(void *dev_ptr
 			return;
 		}
 
-		memcpy(priv->broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
+		spin_lock_irq(&priv->lock);
+		priv->broadcast = broadcast;
+		memcpy(broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
 		       sizeof (union ib_gid));
 
-		spin_lock_irq(&priv->lock);
 		__ipoib_mcast_add(dev, priv->broadcast);
 		spin_unlock_irq(&priv->lock);
 	}
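
The pattern in the patch (allocate into a local, then fill in and publish the pointer while holding the lock) can be sketched in user space as follows. This is an illustration under assumptions, not IPoIB code: it uses a pthread mutex in place of the kernel spinlock, and `demo_group`, `demo_priv`, `demo_setup_broadcast` and `demo_read_gid_byte` are hypothetical names.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

struct demo_group {
    unsigned char gid[16];
};

struct demo_priv {
    pthread_mutex_t lock;
    struct demo_group *broadcast;   /* readers look at this under lock */
};

/* Allocates into a local first, then publishes and initializes the
 * group while holding the lock, so a reader that takes the lock never
 * sees a half-initialized group. Returns 0 on success, -1 on failure. */
static int demo_setup_broadcast(struct demo_priv *priv,
                                const unsigned char *gid)
{
    struct demo_group *g = calloc(1, sizeof(*g));

    if (!g)
        return -1;

    pthread_mutex_lock(&priv->lock);
    priv->broadcast = g;
    memcpy(g->gid, gid, sizeof(g->gid));
    pthread_mutex_unlock(&priv->lock);
    return 0;
}

/* Reader side: returns the requested GID byte, or -1 if no group has
 * been published yet. */
static int demo_read_gid_byte(struct demo_priv *priv, int idx)
{
    int v = -1;

    pthread_mutex_lock(&priv->lock);
    if (priv->broadcast)
        v = priv->broadcast->gid[idx];
    pthread_mutex_unlock(&priv->lock);
    return v;
}
```

The protection only holds for readers that take the same lock; a reader that inspects the pointer without it could still observe a partially filled group.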

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies


[openib-general] Re: openib and mellanox hca problem

2006-02-08 Thread Michael S. Tsirkin
Quoting Michael Di Domenico [EMAIL PROTECTED]:
 Feb  7 16:59:48 linux14-ts kernel: ib_mthca :07:00.0: PCI device
 did not come back after reset, aborting.
 Feb  7 16:59:48 linux14-ts kernel: ib_mthca :07:00.0: Failed to
 reset HCA, aborting.
 Feb  7 16:59:48 linux14-ts kernel: ACPI: PCI interrupt for device
 :07:00.0 disabled
 
 
 --- lspci output
 06:03.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev ff)
 (prog-if ff)
 !!! Unknown header type 7f
 
 07:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev ff)
 (prog-if ff)
 !!! Unknown header type 7f

This could be a hardware problem. Please contact your mellanox FAE
representative.

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies


[openib-general] Re: openib and mellanox hca problem

2006-02-08 Thread Michael Di Domenico
On 2/8/06, Michael S. Tsirkin [EMAIL PROTECTED] wrote:
 Quoting Michael Di Domenico [EMAIL PROTECTED]:
 
  --- lspci output
  06:03.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev ff)
  (prog-if ff)
  !!! Unknown header type 7f
 
  07:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev ff)
  (prog-if ff)
  !!! Unknown header type 7f

 This could be a hardware problem. Please contact your mellanox FAE
 representative.


It shouldn't be.  These machines were working fine with a copy of REL3
using a 2.4 kernel and the silverstorm hca stack.  This has only
cropped up since I switched to Fedora Core 4 with a 2.6 kernel and the
openib stack.


[openib-general] Re: openib and mellanox hca problem

2006-02-08 Thread Michael Di Domenico
On 2/8/06, Michael S. Tsirkin [EMAIL PROTECTED] wrote:
 If you really suspect timing issues, you can always
 increase timeouts: look for msleep in mthca_reset.c and try bumping up
 the numbers.

 Anyway - could you please enable mthca debug in menuconfig?
 This would give us some more information on whats going on.

I enabled debug in the module config, recompiled, and tried to reload
using modprobe ib_mthca, and got the same results.  Am I missing a
debug parameter somewhere?  Or should it just spit out more
information automatically?


RE: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal

2006-02-08 Thread Kanevsky, Arkady
One more issue to discuss.
Does completion of a Recv that matches an RDMA Write with Immediate Data
automatically sync local memory, or does the Consumer still need to call
lmr_sync_rdma_write prior to accessing the RDMAed data?

Arkady Kanevsky   email: [EMAIL PROTECTED]
Network Appliance Inc.   phone: 781-768-5395
1601 Trapelo Rd. - Suite 16.Fax: 781-895-1195
Waltham, MA 02451   central phone: 781-768-5300
 

 -Original Message-
 From: Caitlin Bestler [mailto:[EMAIL PROTECTED] 
 Sent: Tuesday, February 07, 2006 7:40 PM
 To: [EMAIL PROTECTED]; Larsen, Roy K; Arlin 
 Davis; Hefty, Sean
 Cc: openib-general@openib.org
 Subject: RE: [dat-discussions] [openib-general] [RFC] 
 DAT2.0immediatedataproposal
 
 [EMAIL PROTECTED] wrote:
  We have problem no matter which option we choose.
  The current Transport Level Requirement state:
  
  There is a one-to-one correspondence between send operation on one 
  Endpoint of the Connection and recv operations on the other 
 Endpoint 
  of the Connection.
  There is no correspondence between RDMA operations on one 
 Endpoint of 
  the Connection and recv or send data transfer operation on 
 the other 
  Endpoint of the Connection.
  Receive operations on a Connection must be completed in the 
 order of 
  posting of their corresponding sends.
  
  The Immediate data and Atomic ops violate these 
 requirements including 
  ordering rules.
  
  I had started updating these rules when I generated the 
 first draft of 
  the requirements. They are included in the enclosed pdf file.
  But they do not cover Atomic ops that also impact transport 
  requirements. This chapter of the spec have not been changed since 
  DAPL 1.0 and I am very concern with any changes to it.
  
  Arkady
  
 
 If RDMA Write with Immediate is viewed as being the 
 equivalent of doing RDMA Write and then an RDMA Send the 
 correspondence rule is maintained. But *only* if the rdma 
 write with immediate
 has all of the semantics of a Send.
 
 Atomics do not violate the rules if you view them as being a 
 variation on an RDMA Read. They are an RDMA Read with modify.
 The real question is whether it makes sense to put it in the 
 RDMA device. It is also not subject to emulation at a highe layer. 
 
 With send with invalidate we know how InfiniBand *will* 
 support it, because of the IB 1.2 verbs. We do not know that 
 for atomics over iWARP. We do not know whether it will be 
 added, more importantly we do not know *how* it would be 
 added if it were added. That makes coming up with a transport 
 neutral definition very premature.
 In particular, if atomics were added to iWARP there is a 
 distinct design option where it would *not* be the same work 
 queue as RDMA Reads (adding atomics through Queue ID 3 would 
 make layering on top of a current implementation much easier. 
 But it would mean that atomic credits would be distinct from 
 read credits. This is a very strong reason to defer 
 attempting to define RDMA Atomics in a transport neutral fashion.
 
  
 
 
 
 
  
  
 
 


[openib-general] Re: openib and mellanox hca problem

2006-02-08 Thread Michael S. Tsirkin
Quoting r. Michael Di Domenico [EMAIL PROTECTED]:
 Subject: Re: openib and mellanox hca problem
 
 On 2/8/06, Michael S. Tsirkin [EMAIL PROTECTED] wrote:
  If you really suspect timing issues, you can always
  increase timeouts: look for msleep in mthca_reset.c and try bumping up
  the numbers.
 
  Anyway - could you please enable mthca debug in menuconfig?
  This would give us some more information on whats going on.
 
 I enabled debug in the module config recompiled and tried to reload
 using modprobe ib_mthca and got the same results?  Am i missing a
 debug parameter somewhere?  Or should it just spit out more
 information automatically?

Yes, it should spit out things like Found bridge.
Are you sure you installed it properly?

To check, you can try to stick mthca_dbg(mdev, "Here\n"); at the beginning of
mthca_reset and see that it gets printed.

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies


RE: [dat-discussions] [openib-general] [RFC] DAT2.0immediatedataproposal

2006-02-08 Thread Caitlin Bestler
[EMAIL PROTECTED] wrote:
 One more issue to discuss.
 Does Completion of Recv that matches RDMA Write with
 Immediate Data automatically sync local memory or Consumer
 still need to do lmr_sync_rdma_write prior to accessing RDMAed data.
 

Why would it be any different than for a plain receive?
The intent is the same, to indicate that prior Writes have completed.



[openib-general] Re: openib and mellanox hca problem

2006-02-08 Thread Michael Di Domenico
On 2/8/06, Michael S. Tsirkin [EMAIL PROTECTED] wrote:
 Quoting r. Michael Di Domenico [EMAIL PROTECTED]:
  Subject: Re: openib and mellanox hca problem
 
  On 2/8/06, Michael S. Tsirkin [EMAIL PROTECTED] wrote:
   If you really suspect timing issues, you can always
   increase timeouts: look for msleep in mthca_reset.c and try bumping up
   the numbers.
  
   Anyway - could you please enable mthca debug in menuconfig?
   This would give us some more information on whats going on.
 
  I enabled debug in the module config recompiled and tried to reload
  using modprobe ib_mthca and got the same results?  Am i missing a
  debug parameter somewhere?  Or should it just spit out more
  information automatically?

 Yes, it should spit out things like Found bridge.
 Are you sure you installed it properly?

 To check, you can try to stick mthca_dbg(mdev, "Here\n"); at the beginning of
 mthca_reset and see that it gets printed.

Definitely working...

Feb  8 10:01:23 linux14-ts kernel: ib_mthca: Mellanox InfiniBand HCA
driver v0.06 (June 23, 2005)
Feb  8 10:01:23 linux14-ts kernel: ib_mthca: Initializing :07:00.0
Feb  8 10:01:23 linux14-ts kernel: ACPI: PCI Interrupt :07:00.0[?]
- GSI 26 (level, low) - IRQ 217
Feb  8 10:01:23 linux14-ts kernel: ib_mthca :07:00.0: Here
Feb  8 10:01:23 linux14-ts kernel: ib_mthca :07:00.0: Found
bridge: :06:03.0
Feb  8 10:01:34 linux14-ts kernel: ib_mthca :07:00.0: PCI device
did not come back after reset, aborting.
Feb  8 10:01:34 linux14-ts kernel: ib_mthca :07:00.0: Failed to
reset HCA, aborting.
Feb  8 10:01:34 linux14-ts kernel: ACPI: PCI interrupt for device
:07:00.0 disabled



[openib-general] Re: openib and mellanox hca problem

2006-02-08 Thread Michael S. Tsirkin
Quoting r. Michael Di Domenico [EMAIL PROTECTED]:
 Subject: Re: openib and mellanox hca problem
 
 Roland,
 
 I've attached the dmesg and lspci outputs...

You really want lspci *before* mthca got loaded.
This one just shows the card's incommunicado.

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies


[openib-general] Re: openib and mellanox hca problem

2006-02-08 Thread Michael S. Tsirkin
Quoting Michael Di Domenico [EMAIL PROTECTED]:
 Feb  8 10:01:23 linux14-ts kernel: ib_mthca :07:00.0: Found
 bridge: :06:03.0

Hmm, looks like the bridge lookup worked fine.

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies


[openib-general] FW: [PATCH 1 of 3] mad: large RMPP support

2006-02-08 Thread Jack Morgenstein
Sorry for breaking the thread (Outlook is problematic).
Jack

-Original Message-
From: Jack Morgenstein 
Sent: Wednesday, February 08, 2006 6:23 PM
To: 'Sean Hefty'
Cc: Michael S. Tsirkin; '[EMAIL PROTECTED]'
Subject: RE: [PATCH 1 of 3] mad: large RMPP support

Sorry for not echoing to openib -- I'm having problems with mutt and our
server (replying to this from Outlook will not place the reply in the
thread).

I would much rather use the linked list.
We may need to allocate a rather large contiguous array (ib_mad_segments
segment array) for queries involving a large cluster, and such an
allocation has a larger probability of failure.

For example, a 1000 host cluster, with 2 ports per HCA will have at
least 4000 records in a SubnAdmGetTableResp for all PortInfo records on
the network (2000 for HCAs, and at least 2000 for the switch ports).
Such a query response will generate an RMPP of size 256K -- 1000
segments, or a 4K buffer on an X86 machine just for the array (assuming
one allocation per RMPP segment -- N=1).
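
The arithmetic behind those round numbers can be reproduced with a short sketch. The figures used here are assumptions chosen to match the message: 256 bytes carried per segment and 4-byte pointers on 32-bit x86. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Number of segments needed to carry total_bytes when each segment
 * holds seg_bytes, rounded up. */
static size_t demo_num_segments(size_t total_bytes, size_t seg_bytes)
{
    return (total_bytes + seg_bytes - 1) / seg_bytes;
}

/* Size of a contiguous array holding one pointer per segment. */
static size_t demo_ptr_array_bytes(size_t nsegs, size_t ptr_size)
{
    return nsegs * ptr_size;
}
```

Under these assumptions a 256K response needs 1024 segments, and the pointer array alone takes 4096 bytes. If the usable payload per segment is smaller than 256 bytes, the segment count and the array grow accordingly.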

b. Regarding using buffers which contain N RMPP segments, this becomes a
management nightmare:
If we choose N too large, we may fail to allocate segments in a
large RMPP, so that the entire RMPP fails (where it could succeed if
N=1).   Having N=1 guarantees that if we can succeed in our allocation,
we will.  I do not consider variable-size N within a single RMPP, since
this would be very complicated and error-prone.

We could re-allocate everything if some N does not work -- also very
complex.

Regarding the order N-squared algorithm for finding the next RMPP
segment to send, MST and I agree that this is not acceptable.  We are
considering an algorithm which stores the current segment pointer in
struct ib_mad_send_wr_private so that when getting the next segment we
simply go to the next link.  We're still ironing out proper handling
of the last acknowledged processing (maintaining a pointer to the
last-acked segment, upgrading the last-acked pointer when a new ack
arrives -- this might still involve linear searches).
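
A minimal sketch of the cursor idea, assuming a singly linked list of segments: keeping a pointer to the segment handed out last makes "get next segment" a constant-time step instead of a walk from the head. The names `demo_seg`, `demo_send` and `demo_next_seg` are hypothetical; the real structures would be ib_mad_send_wr_private and the kernel list API.

```c
#include <assert.h>
#include <stddef.h>

struct demo_seg {
    struct demo_seg *next;
    int num;                /* 1-based segment number */
};

struct demo_send {
    struct demo_seg *head;
    struct demo_seg *cur;   /* segment handed out last, NULL before first */
};

/* Advances the cursor by one link and returns the new current segment.
 * Returns NULL at the end of the list and leaves the cursor on the last
 * segment, so each call costs O(1) rather than O(N). */
static struct demo_seg *demo_next_seg(struct demo_send *s)
{
    struct demo_seg *n = s->cur ? s->cur->next : s->head;

    if (n)
        s->cur = n;
    return n;
}
```

Sending N segments this way takes N constant-time steps in total. Rewinding to the last acknowledged segment for a retransmit is not covered here and, as noted above, may still need a search.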

Regarding the payload pointer, I agree. It is also trivial to move it to
the ib_mad_send_wr_private structure, hiding it from the user.

Regarding the 64-byte boundary, why is this important?

Jack


-Original Message-
From: Sean Hefty [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, February 08, 2006 3:01 AM
To: Jack Morgenstein
Cc: openib-general@openib.org
Subject: RE: [PATCH 1 of 3] mad: large RMPP support

Based on what you've done, I'd like to suggest changing the interface to
something similar to that shown below.  I believe that this could be done
with minor changes to the current patches.  Detailed comments that led to
suggesting this change are inline in my responses.

struct ib_mad_segments {
	u32	num_segments;
	u32	segment_size;
	void	*segment[0];
};

struct ib_mad_send_buf {
	...
	void			*mad;		/* First MAD segment */
	struct ib_mad_segments	*segments;	/* RMPP segments > 1 */
	...
};

This will avoid walking through a list to find segments, and allows for
efficient allocation of the segment data buffers.  Multiple segments could
be allocated through a single kzalloc.  (For example, every n-th segment
would start a new allocation, making deallocation easy as well.)
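
A user-space sketch of the pointer-array layout described above, under stated assumptions: it uses calloc in place of kzalloc, a C99 flexible array member in place of `segment[0]`, and the simplest case of one allocation per segment rather than one per n segments. `demo_segments`, `demo_alloc_segments` and `demo_free_segments` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct demo_segments {
    uint32_t num_segments;
    uint32_t segment_size;
    void *segment[];        /* one pointer per segment */
};

/* Frees every allocated segment buffer, then the header itself. */
static void demo_free_segments(struct demo_segments *s)
{
    uint32_t i;

    if (!s)
        return;
    for (i = 0; i < s->num_segments; i++)
        free(s->segment[i]);
    free(s);
}

/* Allocates the header plus pointer array in one block, then one zeroed
 * buffer per segment. Returns NULL if any allocation fails. */
static struct demo_segments *demo_alloc_segments(uint32_t n, uint32_t size)
{
    struct demo_segments *s;
    uint32_t i;

    s = calloc(1, sizeof(*s) + (size_t)n * sizeof(s->segment[0]));
    if (!s)
        return NULL;
    s->segment_size = size;
    for (i = 0; i < n; i++) {
        s->segment[i] = calloc(1, size);
        if (!s->segment[i])
            break;
        s->num_segments++;  /* counts only buffers that exist */
    }
    if (s->num_segments != n) {
        demo_free_segments(s);
        return NULL;
    }
    return s;
}
```

Segment i is then reachable as `s->segment[i]` without any list walk. The cost is the contiguous pointer array, which grows with the number of segments.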


+struct ib_mad_multipacket_seg {
+  struct list_head list;
+  u32 size;
+  u8 data[0];
+};

Should we ensure that the data alignment is on a 64-byte boundary?

 struct ib_mad_send_buf {
 	struct ib_mad_send_buf	*next;
-	void			*mad;
+	void			*mad; /* RMPP: first segment,
+					 including the MAD header */
+	void			*mad_payload; /* RMPP: changed per segment */

mad_payload doesn't appear to be directly accessible by the user.  It
should be hidden.

- Sean


[openib-general] [PATCH] [RFC] - example user mode rdma ping/pong program using CMA

2006-02-08 Thread Steve Wise
All,

Attached is a user-mode program, called rping, that uses librdmacm and
libibverbs to implement a ping-pong program over an RC connection.  The
program utilizes SEND, RECV, RDMA READ, and WRITE ops, as well as cq
channels to get cq events, and rdma_get_event() to detect CMA events.
It is multi-threaded.  

I've built it as an example program in librdmacm/examples and tested it
with mthca.  It is useful to test CMA as well as all the major rdma
operations in a transport-neutral way.

If you all find it has utility, please pull it into librdmacm/examples.


Signed-off-by: Steve Wise [EMAIL PROTECTED]



Index: Makefile.am
===
--- Makefile.am (revision 5330)
+++ Makefile.am (working copy)
@@ -18,9 +18,11 @@
 src_librdmacm_la_SOURCES = src/cma.c
 src_librdmacm_la_LDFLAGS = -avoid-version $(rdmacm_version_script)
 
-bin_PROGRAMS = examples/ucmatose
+bin_PROGRAMS = examples/ucmatose examples/rping
 examples_ucmatose_SOURCES = examples/cmatose.c
 examples_ucmatose_LDADD = $(top_builddir)/src/librdmacm.la
+examples_rping_SOURCES = examples/rping.c
+examples_rping_LDADD = $(top_builddir)/src/librdmacm.la
 
 librdmacmincludedir = $(includedir)/rdma
 
Index: examples/rping.c
===
--- examples/rping.c(revision 0)
+++ examples/rping.c(revision 0)
@@ -0,0 +1,1175 @@
+/*
+ * Copyright (c) 2005 Ammasso, Inc. All rights reserved.
+ * Copyright (c) 2006 Open Grid Computing, Inc. All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ * Redistribution and use in source and binary forms, with or
+ * without modification, are permitted provided that the following
+ * conditions are met:
+ *
+ *  - Redistributions of source code must retain the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer.
+ *
+ *  - Redistributions in binary form must reproduce the above
+ *copyright notice, this list of conditions and the following
+ *disclaimer in the documentation and/or other materials
+ *provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+
+#include <getopt.h>
+#include <stdlib.h>
+#include <string.h>
+#include <stdio.h>
+#include <errno.h>
+#include <sys/types.h>
+#include <netinet/in.h>
+#include <sys/socket.h>
+#include <netdb.h>
+#include <byteswap.h>
+#include <semaphore.h>
+#include <arpa/inet.h>
+#include <pthread.h>
+
+#include <rdma/rdma_cma.h>
+
+static int debug = 0;
+#define DEBUG_LOG if (debug) printf
+
+/*
+ * rping ping/pong loop:
+ * client sends source rkey/addr/len
+ * server receives source rkey/addr/len
+ * server rdma reads ping data from source
+ * server sends go ahead on rdma read completion
+ * client sends sink rkey/addr/len
+ * server receives sink rkey/addr/len
+ * server rdma writes pong data to sink
+ * server sends go ahead on rdma write completion
+ * repeat loop
+ */
+
+/*
+ * These states are used to signal events between the completion handler
+ * and the main client or server thread.
+ *
+ * Once CONNECTED, they cycle through RDMA_READ_ADV, RDMA_WRITE_ADV, 
+ * and RDMA_WRITE_COMPLETE for each ping.
+ */
+typedef enum {
+   IDLE = 1,
+   CONNECT_REQUEST,
+   CONNECTED,
+   RDMA_READ_ADV,
+   RDMA_READ_COMPLETE,
+   RDMA_WRITE_ADV,
+   RDMA_WRITE_COMPLETE,
+   ERROR
+} state_t;
+
+/*
+ * Default max buffer size for IO...
+ */
+#define RPING_BUFSIZE 64*1024
+#define RPING_SQ_DEPTH 16
+
+/*
+ * Control block struct.
+ */
+struct rping_cb {
+   int server; /* 0 iff client */
+   pthread_t cqthread;
+   struct ibv_comp_channel *channel;
+   struct ibv_cq *cq;
+   struct ibv_pd *pd;
+   struct ibv_qp *qp;
+
+   struct ibv_recv_wr rq_wr;   /* recv work request record */
+   struct ibv_sge recv_sgl;/* recv single SGE */
+   char *recv_buf; /* malloc'd buffer */
+   struct ibv_mr *recv_mr; /* MR associated with this buffer */
+
+   struct ibv_send_wr sq_wr;   /* send work request record */
+   struct ibv_sge send_sgl;
+   char *send_buf; /* single send buf */
+   

[openib-general] problem with user-verb WC's

2006-02-08 Thread Kyle Schochenmaier
While working on the openIB port for PVFS2, I've stumbled across some 
problems in posting rdma requests via the user-verbs interface with 
ib_mthca drivers.


According to a 'TODO' buried in the gen2 src/linux-kernel/infiniband/hw/  :
MW support:   ib_mthca does not support memory windows

The opcodes that I receive for non-rdma requests are all correct,
however, when posting rdma requests, I'm consistently getting work 
completions with opcodes of:

IBV_WC_BIND_MW

I'm not making any (known) calls or requests to bind to a memory window, 
or for that matter to create a memory window.
So how does a completion event get generated with an opcode indicating a 
currently unimplemented feature has just finished?
And are there other reasons why I should/would be getting this type of 
completion?


Thanks,
   Kyle


--
Kyle Schochenmaier
[EMAIL PROTECTED]
Research Assistant, Dr. Brett Bode
AmesLab - US Dept.Energy
Scalable Computing Laboratory 




RE: [openib-general] [PATCH] [RFC] - example user mode rdma ping/pongprogram using CMA

2006-02-08 Thread Sean Hefty
Attached is a user-mode program, called rping, that uses librdmacm and
libibverbs to implement a ping-pong program over an RC connection.  The
program utilizes SEND, RECV, RDMA READ, and WRITE ops, as well as cq
channels to get cq events, and rdma_get_event() to detect CMA events.
It is multi-threaded.

I've built it as an example program in librdmacm/examples and tested it
with mthca.  It is useful to test CMA as well as all the major rdma
operations in a transport-neutral way.

If you all find it has utility, please pull it into librdmacm/examples.

Thanks.  I may not get a chance to test this for a couple of days, but some
additional tests for librdmacm would definitely be useful.

- Sean



[openib-general] Re: [PATCH] [RFC] - example user mode rdma ping/pongprogram using CMA

2006-02-08 Thread Steve Wise
On Wed, 2006-02-08 at 18:45 +0200, Michael S. Tsirkin wrote:
 Quoting r. Steve Wise [EMAIL PROTECTED]:
  Subject: [PATCH] [RFC] - example user mode rdma ping/pongprogram using CMA
  
  All,
  
  Attached is a user-mode program, called rping, that uses librdmacm and
  libibverbs to implement a ping-pong program over an RC connection.  The
  program utilizes SEND, RECV, RDMA READ, and WRITE ops, as well as cq
  channels to get cq events, and rdma_get_event() to detect CMA events.
  It is multi-threaded.  
  
  I've built it as an example program in librdmacm/examples and tested it
  with mthca.  It is useful to test CMA as well as all the major rdma
  operations in a transport-neutral way.
  
  If you all find it has utility, please pull it into librdmacm/examples.
  
  
  Signed-off-by: Steve Wise [EMAIL PROTECTED]
 
 Steve, looks like you have at most a single receive work request posted at the
 receive workqueue at all times.
 If true, this is *really* not a good idea, performance-wise, even if you
 actually have at most 1 packet in flight.

Hey Michael,

There is at most only one SEND in flight.  This is a test program, not a
performance program.  Its goal is to utilize SEND, RECV, RDMA READ, and
RDMA WRITE as well as CMA to setup the connection...

Thanks,

Steve.





Re: [openib-general] Re: [PATCH] [RFC] - example user mode rdma ping/pongprogram using CMA

2006-02-08 Thread Steve Wise
 Hey Michael,
 
 There is at most only one SEND in flight.  This is a test program, not a
 performance program.  Its goal is to utilize SEND, RECV, RDMA READ, and
 RDMA WRITE as well as CMA to setup the connection...
 
 Thanks,
 
 Steve.

By the way, in case it's not clear:  The SEND/RECV exchanges are done
just to advertise source and sink memory regions, and to indicate
completion of rdma read and write operations to the peer.  The
ping/pong data is transferred with rdma read and write operations.

Thanks for the feedback!




Re: [openib-general] Ifdown/ifup pick up the wrong ib interface configuration file

2006-02-08 Thread Shirley Ma

Check your ifcfg-ib0/ifcfg-ib1
script to see whether the interface name matches.

Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638

[openib-general] Re: openib and mellanox hca problem

2006-02-08 Thread Michael Di Domenico
On 2/8/06, Michael S. Tsirkin [EMAIL PROTECTED] wrote:
 Quoting r. Michael Di Domenico [EMAIL PROTECTED]:
  Subject: Re: openib and mellanox hca problem
 
  Roland,
 
  I've attached the dmesg and lspci outputs...

 You really want lspci *before* mthca got loaded.
 This one just shows the card's incommunicado.

I'm going to try rolling back to RedHat EL4 IA32 and see if I can get
the machines up using the SilverStorm host stack and make sure
everything works fine.  Unfortunately we don't have a stack for Fedora
Core 4 on ia32 or ia64.

afterwards i'll load up the openib stack and see what happens...

thanks for the help


Re: [openib-general] Re: [PATCH] [RFC] - example user mode rdmaping/pongprogram using CMA

2006-02-08 Thread Michael S. Tsirkin
Quoting r. Sean Hefty [EMAIL PROTECTED]:
 Subject: RE: [openib-general] Re: [PATCH] [RFC] - example user mode 
 rdmaping/pongprogram using CMA
 
 Steve, looks like you have at most a single receive work request posted at 
 the
 receive workqueue at all times.
 If true, this is *really* not a good idea, performance-wise, even if you
 actually have at most 1 packet in flight.
 
 Can you provide some more details on this?

See 9.7.7.2 end-to-end (message level) flow control

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies


RE: [openib-general] FW: [PATCH 1 of 3] mad: large RMPP support

2006-02-08 Thread Sean Hefty
For example, a 1000 host cluster, with 2 ports per HCA will have at
least 4000 records in a SubnAdmGetTableResp for all PortInfo records on
the network (2000 for HCAs, and at least 2000 for the switch ports).
Such a query response will generate an RMPP of size 256K -- 1000
segments, or a 4K buffer on an X86 machine just for the array (assuming
one allocation per RMPP segment -- N=1).

I think that this is a good reason to use an array.  Walking a 1000 entry list
1000 times is a substantial performance hit.  Lost MADs and retries will make
this worse.

A 4K buffer for the array is less than the 8K total needed for the 1000 list
items.  We're already talking about allocating over 256K of memory just for the
data payload.  An additional contiguous 4k buffer seems like a minor issue.  I'm
not convinced that there's a real issue here.
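
The arithmetic behind this comparison can be sketched as follows. This is only an illustration under stated assumptions: the 256-byte segment size and 4-byte pointer size are taken from the figures quoted in this thread, MAD headers are ignored, and the function names are hypothetical, not part of the MAD layer.

```c
#include <stddef.h>

/* Assumed sizes, taken from the figures quoted in the thread:
 * 256-byte RMPP segments and 4-byte pointers (32-bit x86). */
enum { SEG_SIZE = 256, PTR_SIZE = 4 };

/* Number of segments needed to carry `payload` bytes (headers ignored). */
size_t rmpp_segments(size_t payload)
{
	return (payload + SEG_SIZE - 1) / SEG_SIZE;
}

/* Bytes for a flat array holding one pointer per segment. */
size_t array_bytes(size_t nseg)
{
	return nseg * PTR_SIZE;
}

/* Bytes of list linkage: one next/prev pointer pair per segment. */
size_t list_bytes(size_t nseg)
{
	return nseg * 2 * PTR_SIZE;
}

/* Node visits if segment i is found by walking from the list head
 * every time: 1 + 2 + ... + nseg.  An array needs nseg lookups. */
size_t list_walk_visits(size_t nseg)
{
	return nseg * (nseg + 1) / 2;
}
```

With a 256K payload this gives 1024 segments, a 4K pointer array, and 8K of list linkage, which matches the rough figures above; walking a 1000-entry list from the head for each segment costs about half a million node visits.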

To support ridiculously large transfers from userspace, we may need to push the
RMPP handling up into userspace.

- Sean



[openib-general] error when using libsdp

2006-02-08 Thread Xavier Grave
Hi,

I have compiled and configured libsdp and when I start my application I
get this message :
default libsdp configuration is used
Error 97 calling socket for SDP socket
errno 97 gives 
#define EAFNOSUPPORT    97      /* Address family not supported by protocol */
How can I enable the SDP support ?

Thanks in advance, xavier



[openib-general] Re: openib and mellanox hca problem

2006-02-08 Thread Roland Dreier
    Michael> I wonder whether we managed to locate the bridge.  It
    Michael> would be interesting to build mthca with debug enabled.

Certainly building with CONFIG_INFINIBAND_MTHCA_DEBUG=y would be a
good idea.  But even without debug, if we don't find a bridge, we
should see the warning from the code:

	if (!bridge) {
		/*
		 * Didn't find a bridge for a Tavor device --
		 * assume we're in no-bridge mode and hope for
		 * the best.
		 */
		mthca_warn(mdev, "No bridge found for %s\n",
			   pci_name(mdev->pdev));
	}

 - R.


[openib-general] Re: Re: [PATCH] [RFC] - example user mode rdmaping/pongprogram using CMA

2006-02-08 Thread Michael S. Tsirkin
Quoting r. Steve Wise [EMAIL PROTECTED]:
 Subject: Re: Re: [PATCH] [RFC] - example user mode rdmaping/pongprogram using 
 CMA
 
  Hey Michael,
  
  There is at most only one SEND in flight.  This is a test program, not a
  performance program.  Its goal is to utilize SEND, RECV, RDMA READ, and
  RDMA WRITE as well as CMA to setup the connection...
  
  Thanks,
  
  Steve.
 
 By the way, in case its not clear:  The SEND/RECV exchanges are done
 just to advertise source and sink memory regions, and to indicate
 completion of rdma read and write operations to the peer.  The
 ping/pong data is transferred with rdma read and write operations.
 
 Thanks for the feedback!
 

Code tends to get copied around ... it's easy to imagine someone
copying this and measuring the send latency. Just posting many WRs
in the initialization sequence, with no other code changes,
will fix this problem.

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies


[openib-general] Re: error when using libsdp

2006-02-08 Thread Michael S. Tsirkin
Quoting r. Xavier Grave [EMAIL PROTECTED]:
 Subject: error when using libsdp
 
 Hi,
 
 I have compiled and configured libsdp and when I start my application I
 get this message :
 default libsdp configuration is used
 Error 97 calling socket for SDP socket
 errno 97 gives 
 #define EAFNOSUPPORT    97      /* Address family not supported by protocol */
 How can I enable the SDP support ?
 
 Thanks in advance, xavier
 

Did you load the ib_sdp module?
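
One way to check this from user space is to ask the kernel for an SDP socket directly, without libsdp in the path. The following is a minimal sketch, assuming the SDP address family number is 27 (the value libsdp is understood to use; treat it as an assumption and check your headers). The helper name probe_family is hypothetical.

```c
#include <errno.h>
#include <sys/socket.h>
#include <unistd.h>

/* Assumption: address family number used for SDP sockets. */
#define AF_INET_SDP 27

/*
 * Try to create a stream socket in the given address family.
 * Returns 0 if the kernel accepts the family, otherwise the errno
 * value reported by socket().
 */
int probe_family(int family)
{
	int fd = socket(family, SOCK_STREAM, 0);

	if (fd < 0)
		return errno;
	close(fd);
	return 0;
}
```

If probe_family(AF_INET_SDP) returns EAFNOSUPPORT (97 on Linux), no protocol is registered for that family, which is consistent with the ib_sdp module not being loaded.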

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies


[openib-general] Re: openib and mellanox hca problem

2006-02-08 Thread Michael S. Tsirkin
Quoting Roland Dreier [EMAIL PROTECTED]:
 Certainly building with CONFIG_INFINIBAND_MTHCA_DEBUG=y would be a
 good idea.  But even without debug, if we don't find a bridge, we
 should see the warning from the code:

Right, I wanted to check we got the right bus/device number, and it seems
we did.

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies


Re: [openib-general] Re: [PATCH] [RFC] - example user mode rdmaping/pongprogram using CMA

2006-02-08 Thread Steve Wise
On Wed, 2006-02-08 at 19:10 +0200, Michael S. Tsirkin wrote:
 Quoting r. Sean Hefty [EMAIL PROTECTED]:
  Subject: RE: [openib-general] Re: [PATCH] [RFC] - example user mode 
  rdmaping/pongprogram using CMA
  
  Steve, looks like you have at most a single receive work request posted at 
  the
  receive workqueue at all times.
  If true, this is *really* not a good idea, performance-wise, even if you
  actually have at most 1 packet in flight.
  
  Can you provide some more details on this?
 
 See 9.7.7.2 end-to-end (message level) flow control
 

I just read this section in the 1.2 version of the spec, and I still
don't understand what the issue really is?  9.7.7.2 talks about IBA
doing flow control based on the RECV WQEs posted. rping always ensures
that there is a RECV posted before the peer can send.  This is ensured
by the rping protocol itself (see the comment at the front of rping.c
describing the ping loop).

I'm only ever sending one outstanding message via SEND/RECV.  I would
rather post exactly what is needed, than post some number of RECVs just
to be safe.  Sorry if I'm being dense.  What am I missing here?

Steve.





[openib-general] Re: Re: [PATCH] [RFC] - example user mode rdmaping/pongprogram using CMA

2006-02-08 Thread Steve Wise
  By the way, in case its not clear:  The SEND/RECV exchanges are done
  just to advertise source and sink memory regions, and to indicate
  completion of rdma read and write operations to the peer.  The
  ping/pong data is transferred with rdma read and write operations.
  
  Thanks for the feedback!
  
 
 Code tends to get copied around ... its easy to imagine someone
 copying this and measuring the send latency. Just posting many WRs
 in the initialization sequence, with no other code changes,
 will fix this problem.
 

Each ping/pong iteration with rping is composed of 2 sends on the
client side, 2 sends on the server side, plus 1 rdma read and 1 rdma
write on the server side.  

Again, latency performance (or any performance) isn't a goal of this
program.  Testing CMA, CQ and CMA event notifications, and
send/recv/rr/rw are the goals.
 

Snippet from the patch:

+/*
+ * rping ping/pong loop:
+ * client sends source rkey/addr/len
+ * server receives source rkey/addr/len
+ * server rdma reads ping data from source
+ * server sends go ahead on rdma read completion
+ * client sends sink rkey/addr/len
+ * server receives sink rkey/addr/len
+ * server rdma writes pong data to sink
+ * server sends go ahead on rdma write completion
+ * repeat loop
+ */
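
The loop in this comment can be tallied with a small model to confirm the per-iteration counts given earlier in this message (2 client sends, 2 server sends, 1 rdma read, 1 rdma write). This is only a sketch: the enum and function names are hypothetical and do not appear in rping.c.

```c
/* Work requests posted in one rping iteration, in protocol order. */
enum rping_step {
	CLIENT_SEND_SOURCE,	/* client sends source rkey/addr/len */
	SERVER_RDMA_READ,	/* server reads ping data from source */
	SERVER_SEND_GO_READ,	/* server sends go-ahead after the read */
	CLIENT_SEND_SINK,	/* client sends sink rkey/addr/len */
	SERVER_RDMA_WRITE,	/* server writes pong data to sink */
	SERVER_SEND_GO_WRITE,	/* server sends go-ahead after the write */
	RPING_STEPS
};

struct rping_counts {
	int client_sends;
	int server_sends;
	int rdma_reads;
	int rdma_writes;
};

/* Count the work requests posted over `iters` iterations of the loop. */
struct rping_counts rping_tally(int iters)
{
	struct rping_counts c = { 0, 0, 0, 0 };
	int i, s;

	for (i = 0; i < iters; i++) {
		for (s = 0; s < RPING_STEPS; s++) {
			switch (s) {
			case CLIENT_SEND_SOURCE:
			case CLIENT_SEND_SINK:
				c.client_sends++;
				break;
			case SERVER_SEND_GO_READ:
			case SERVER_SEND_GO_WRITE:
				c.server_sends++;
				break;
			case SERVER_RDMA_READ:
				c.rdma_reads++;
				break;
			case SERVER_RDMA_WRITE:
				c.rdma_writes++;
				break;
			}
		}
	}
	return c;
}
```

Each SEND consumes one RECV at the peer, so under this model each side needs two RECVs per iteration, and only one SEND is outstanding in either direction at any time.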


Can you be more specific on what you think I should change?  Are you
suggesting I post more RECVs?   

Steve.



Re: [openib-general] ibstat problem

2006-02-08 Thread Sean Hubbell

Yes,

 We discovered this yesterday. You built the libraries and did not 
build the diag. tools. Once you do this, things work. I do still have a 
few problems on sending messages out multicast though.


Sean


Steve Wise wrote:


Anyone see this before?

-

vic17:~ # ibstat
ibstat: relocation error: ibstat: symbol argv0, version IBCOMMON_1.0 not
defined in file libibcommon.so.1 with link time reference
vic17:~ # uname -a
Linux vic17 2.6.15.2-kdb #4 SMP PREEMPT Mon Feb 6 17:24:41 CST 2006 i686
i686 i386 GNU/Linux
vic17:~ #


-

[EMAIL PROTECTED] src]$ svn info
Path: .
URL: https://openib.org/svn/gen2/trunk/src
Repository UUID: 21a7a0b7-18d7-0310-8e21-e8b31bdbf5cd
Revision: 5330
Node Kind: directory
Schedule: normal
Last Changed Author: ogerlitz
Last Changed Rev: 5330
Last Changed Date: 2006-02-07 07:23:38 -0600 (Tue, 07 Feb 2006)

[EMAIL PROTECTED] src]$





[openib-general] Re: openib and mellanox hca problem

2006-02-08 Thread Michael Di Domenico
On 2/8/06, Michael S. Tsirkin [EMAIL PROTECTED] wrote:
 Quoting Roland Dreier [EMAIL PROTECTED]:
  Certainly building with CONFIG_INFINIBAND_MTHCA_DEBUG=y would be a
  good idea.  But even without debug, if we don't find a bridge, we
  should see the warning from the code:

 Right, I wanted to check we got the right bus/device number, and it seems
 we did.

FYI...

Changed over to RHEL4 IA32 w/ SilverStorm Host Stack v3.2.0.0.21 and
now i get the below info and a working infiniband setup...

Since I have two servers, I'm going to leave this one working and try
openib on the second machine...

# uname -a
Linux linux14.silverstorm.com 2.6.9-5.ELsmp #1 SMP Wed Jan 5 19:30:39
EST 2005 i686 i686 i386 GNU/Linux

# lspci -vvv
06:03.0 PCI bridge: Mellanox Technologies MT23108 PCI Bridge (rev a1)
(prog-if 00 [Normal decode])
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop-
ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium
TAbort- TAbort- MAbort- SERR- PERR-
Latency: 64, Cache Line Size 10
Bus: primary=06, secondary=07, subordinate=07, sec-latency=64
I/O behind bridge: f000-0fff
Memory behind bridge: fe50-fe7f
Prefetchable memory behind bridge: eac0-fbc0
Secondary status: 66Mhz+ FastB2B- ParErr- DEVSEL=medium
TAbort- TAbort- MAbort- SERR- PERR-
BridgeCtl: Parity+ SERR+ NoISA- VGA- MAbort- Reset- FastB2B-
Capabilities: [70] PCI-X bridge device.
Secondary Status: 64bit+, 133MHz+, SCD-, USC-, SCO-, SRD- Freq=3
Status: Bus=6 Dev=3 Func=0 64bit+ 133MHz+ SCD- USC-, SCO-, SRD-
: Upstream: Capacity=512, Commitment Limit=512
: Downstream: Capacity=128, Commitment Limit=128

07:00.0 InfiniBand: Mellanox Technologies MT23108 InfiniHost (rev a1)
Subsystem: Mellanox Technologies MT23108 InfiniHost
Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV+ VGASnoop-
ParErr- Stepping- SERR+ FastB2B-
Status: Cap+ 66Mhz+ UDF- FastB2B- ParErr- DEVSEL=medium
TAbort- TAbort- MAbort- SERR- PERR-
Latency: 64, Cache Line Size 10
Interrupt: pin A routed to IRQ 217
Region 0: Memory at fe70 (64-bit, non-prefetchable) [size=1M]
Region 2: Memory at fb00 (64-bit, prefetchable) [size=8M]
Region 4: Memory at f000 (64-bit, prefetchable) [size=128M]
Capabilities: [40] MSI-X: Enable- Mask- TabSize=32
Vector table: BAR=0 offset=00082000
PBA: BAR=0 offset=00082200
Capabilities: [50] Vital Product Data
Capabilities: [60] Message Signalled Interrupts: 64bit+
Queue=0/5 Enable-
Address:   Data: 
Capabilities: [70] PCI-X non-bridge device.
Command: DPERE- ERO- RBC=3 OST=1
Status: Bus=7 Dev=0 Func=0 64bit+ 133MHz+ SCD- USC-,
DC=simple, DMMRBC=3, DMOST=1, DMCRS=0, RSCEM-


RE: [openib-general] Re: Re: [PATCH] [RFC] - example user mode rdmaping/pongprogram using CMA

2006-02-08 Thread Steve Wise
On Wed, 2006-02-08 at 09:51 -0800, Caitlin Bestler wrote:
 [EMAIL PROTECTED] wrote:
  By the way, in case its not clear:  The SEND/RECV exchanges are done
  just to advertise source and sink memory regions, and to indicate
  completion of rdma read and write operations to the peer.  The
  ping/pong data is transferred with rdma read and write operations.
  
  Thanks for the feedback!
  
  
  Code tends to get copied around ... its easy to imagine someone
  copying this and measuring the send latency. Just posting many WRs in
  the initialization sequence, with no other code changes, will fix
  this problem. 
  
  
  Each ping/pong iteration with rping is composed of 2 sends
  on the client side, 2 sends on the server side, plus 1 rdma
  read and 1 rdma write on the server side.
  
  Again, latency performance (or any performance) isn't a goal
  of this program.  Testing CMA, CQ and CMA event
  notifications, and send/recv/rr/rw are the goals.
  
  
  snipit from the patch:
  
  +/*
  + * rping ping/pong loop:
  + * client sends source rkey/addr/len
  + * server receives source rkey/add/len
  + * server rdma reads ping data from source
  + * server sends go ahead on rdma read completion
  + * client sends sink rkey/addr/len
  + * server receives sink rkey/addr/len
  + * server rdma writes pong data to sink
  + * server sends go ahead on rdma write completion
  + * repeat loop
  + */
  
 
 Why does the server send go ahead after rdma write completion?

No particular reason.

 It should be able to just post the send after posting the rdma
 write without waiting. When the rdma write completes has no
 device/transport independent meaning.

You're correct.  It does not need to wait for the rdma write
completion...





Re: [openib-general] Re: [PATCH] [RFC] - example user moderdmaping/pongprogram using CMA

2006-02-08 Thread Michael S. Tsirkin
Quoting r. Steve Wise [EMAIL PROTECTED]:
 Subject: Re: [openib-general] Re: [PATCH] [RFC] - example user 
 moderdmaping/pongprogram using CMA
 
 On Wed, 2006-02-08 at 19:10 +0200, Michael S. Tsirkin wrote:
  Quoting r. Sean Hefty [EMAIL PROTECTED]:
   Subject: RE: [openib-general] Re: [PATCH] [RFC] - example user mode 
   rdmaping/pongprogram using CMA
   
   Steve, looks like you have at most a single receive work request posted 
   at the
   receive workqueue at all times.
   If true, this is *really* not a good idea, performance-wise, even if you
   actually have at most 1 packet in flight.
   
   Can you provide some more details on this?
  
  See 9.7.7.2 end-to-end (message level) flow control
  
 
 I just read this section in the 1.2 version of the spec, and I still
 don't understand what the issue really is?  9.7.7.2 talks about IBA
 doing flow control based on the RECV WQEs posted. rping always ensures
 that there is a RECV posted before the peer can send.  This is ensured
 by the rping protocol itself (see the comment at the front of rping.c
 describing the ping loop).
 
 I'm only ever sending one outstanding message via SEND/RECV.  I would
 rather post exactly what is needed, than post some number of RECVs just
 to be safe.  Sorry if I'm being dense.  What am I missing here?
 
 Steve.
 

As far as I know, the credits are only updated by the ACK messages.
If there is a single work request outstanding on the RQ,
the ACK of the SEND message will have the credit field value 0
(since exactly one receive WR was outstanding, and that is now consumed).

As a result the remote side will think that there are no
receive WQEs and will slow down (what the spec refers to as limited WQE).
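
The effect described here can be illustrated with a toy model. It assumes, following the description above, that the credit count carried in an ACK equals the number of receive WRs still posted once the SEND has consumed one. The actual credit encoding in the IBA spec is more involved, and all names below are hypothetical.

```c
/* Toy receive queue: only tracks how many receive WRs are posted. */
struct toy_rq {
	int posted;
};

/*
 * A SEND arrives and consumes one receive WR.  Returns the credit
 * count the ACK would advertise under this model, or -1 if nothing
 * was posted (the real transport would report receiver-not-ready).
 */
int toy_rq_consume(struct toy_rq *rq)
{
	if (rq->posted == 0)
		return -1;
	rq->posted--;
	return rq->posted;
}

/* The application reposts one receive WR after handling a completion. */
void toy_rq_repost(struct toy_rq *rq)
{
	rq->posted++;
}
```

With a single posted WR, every ACK advertises 0 credits even though the application reposts right afterwards; with 8 posted, the same traffic advertises 7.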


-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies


[openib-general] Re: Re: [PATCH] [RFC] - example user mode rdmaping/pongprogramusing CMA

2006-02-08 Thread Michael S. Tsirkin
Quoting Steve Wise [EMAIL PROTECTED]:
 Can you be more specific on what you think I should change?  Are you
 suggesting I post more RECVs?   

During the initialization stage, post the same receive WR multiple times
(according to the RQ size).

Nothing needs to be touched in the loop: when you get a CQE, post just one
receive WR.

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies


[openib-general] Re: [PATCH] [RFC] - example user mode rdma ping/pongprogram using CMA

2006-02-08 Thread Michael S. Tsirkin
I suggest this in rping_setup_buffers:
while (!(rc = ibv_post_recv(cbp->qp, &cbp->rq_wr, &bad_wr)));

This way you will never have 0 end-to-end credits.
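
The shape of that loop can be tried out against a stand-in queue, with the assignment parenthesized so that it compiles. This is a sketch only: mock_post_recv is a hypothetical substitute for ibv_post_recv() that returns 0 on success and an errno value once the queue is full, and MOCK_RQ_DEPTH is an assumed queue size, not a value from rping.

```c
#include <errno.h>

#define MOCK_RQ_DEPTH 16	/* assumed receive queue size */

struct mock_qp {
	int posted;		/* receive WRs currently posted */
};

/* Stand-in for ibv_post_recv(): 0 on success, errno value when full. */
int mock_post_recv(struct mock_qp *qp)
{
	if (qp->posted >= MOCK_RQ_DEPTH)
		return ENOMEM;
	qp->posted++;
	return 0;
}

/* Post until the queue rejects the request; returns how many are posted. */
int fill_recv_queue(struct mock_qp *qp)
{
	int rc;

	while (!(rc = mock_post_recv(qp)))
		;
	return qp->posted;
}
```

The loop ends on the first failed post, so the queue is left full and the final error is expected rather than a problem to report.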

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies


[openib-general] Re: [PATCH] [RFC] - example user mode rdma ping/pongprogram using CMA

2006-02-08 Thread Steve Wise
On Wed, 2006-02-08 at 21:11 +0200, Michael S. Tsirkin wrote:
 I suggest this in rping_setup_buffers:
 while (!(rc = ibv_post_recv(cbp->qp, &cbp->rq_wr, &bad_wr)));
 
 This way you will never have 0 end-to-end credits.
 

I can do this easily, but it bothers me to post the same buffer multiple
times, knowing the application doesn't need it (and would fail if more
than one RECV is consumed at a time), just to make the transport more
efficient.  

Is this common practice for IB applications?

Thanks,

Steve.





Re: [openib-general] Re: [PATCH] [RFC] - example user mode rdma ping/pongprogram using CMA

2006-02-08 Thread Roland Dreier
    Steve> Is this common practice for IB applications?

No, I think it's more of a cute trick that works in your particular case.


[openib-general] Re: ipoib_mcast_send.patch

2006-02-08 Thread Roland Dreier
I think we might want to be even more paranoid and wait until the
broadcast join succeeds before allowing send-only joins.  Otherwise
we could create a send-only MCG with the wrong Q_Key, SL, etc.

something like this maybe?

--- infiniband/ulp/ipoib/ipoib_multicast.c  (revision 5337)
+++ infiniband/ulp/ipoib/ipoib_multicast.c  (working copy)
@@ -222,6 +222,13 @@ static int ipoib_mcast_join_finish(struc
 		    sizeof (union ib_gid))) {
 		priv->qkey = be32_to_cpu(priv->broadcast->mcmember.qkey);
 		priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
+
+		/*
+		 * Make sure that all the attributes are visible
+		 * before we set the attached bit, so that send-only
+		 * joins don't get started with incorrect attributes.
+		 */
+		smp_wmb();
 	}
 
 	if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
@@ -533,8 +540,10 @@ void ipoib_mcast_join_task(void *dev_ptr
 	}
 
 	if (!priv->broadcast) {
-		priv->broadcast = ipoib_mcast_alloc(dev, 1);
-		if (!priv->broadcast) {
+		struct ipoib_mcast *broadcast;
+
+		broadcast = ipoib_mcast_alloc(dev, 1);
+		if (!broadcast) {
 			ipoib_warn(priv, "failed to allocate broadcast group\n");
 			mutex_lock(&mcast_mutex);
 			if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
@@ -544,10 +553,11 @@ void ipoib_mcast_join_task(void *dev_ptr
 			return;
 		}
 
-		memcpy(priv->broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
+		spin_lock_irq(&priv->lock);
+		memcpy(broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
 		       sizeof (union ib_gid));
+		priv->broadcast = broadcast;
 
-		spin_lock_irq(&priv->lock);
 		__ipoib_mcast_add(dev, priv->broadcast);
 		spin_unlock_irq(&priv->lock);
 	}




[openib-general] Re: Re: [PATCH] [RFC] - example user mode rdma ping/pongprogram using CMA

2006-02-08 Thread Michael S. Tsirkin
Quoting r. Roland Dreier [EMAIL PROTECTED]:
 Subject: Re: Re: [PATCH] [RFC] - example user mode rdma ping/pongprogram 
 using CMA
 
 Steve Is this common practice for IB applications?
 
 No, I think it's more of a cute trick that works in your particular case.
 

Correct. Real apps are unlikely to get by with a single outstanding WR.

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies


Re: [openib-general] Re: ipoib_mcast_send.patch

2006-02-08 Thread Roland Dreier
    Michael> Right, but I thought atomic test_and_set_bit implied
    Michael> smp_wmb already?

So did I but then I looked in the kernel source and now I think that
set_bit operations are only ordered against other bitops that touch
the same word.  For example ia64 just uses cmpxchg to implement the
bitops, and powerpc just uses locked loads and stores.
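
The ordering requirement under discussion (make the attributes visible before the flag that announces them) can be shown with a user-space analogue. This sketch uses C11 atomics as a stand-in for the kernel's smp_wmb() and bitops; it is not kernel code, and the struct and function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

struct mcast_state {
	uint32_t qkey;		/* attribute that must be visible first */
	atomic_int attached;	/* flag that readers test */
};

/*
 * Writer: fill in the attribute, then set the flag with release
 * ordering, so the qkey store cannot be reordered after the flag store.
 */
void publish(struct mcast_state *s, uint32_t qkey)
{
	s->qkey = qkey;
	atomic_store_explicit(&s->attached, 1, memory_order_release);
}

/*
 * Reader: returns 1 and copies the qkey if the flag is set, else 0.
 * The acquire load pairs with the release store in publish().
 */
int read_if_attached(struct mcast_state *s, uint32_t *qkey)
{
	if (!atomic_load_explicit(&s->attached, memory_order_acquire))
		return 0;
	*qkey = s->qkey;
	return 1;
}
```

A reader that sees the flag set is then guaranteed to see the qkey written before it, which is the property the explicit barrier in the patch is meant to provide for send-only joins.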

 - R.


[openib-general] IPoIB and lid change

2006-02-08 Thread Michael S. Tsirkin
Hi, Roland!
One issue we have with IPoIB is that IPoIB may cache a remote node path for a
long time. Remote LID may get changed e.g. if the SM is changed, and IPoIB might
lose connectivity.

One simple way to address this would be to have a list of all
address handles per net device and kill them on an SM change event.

What do you think?
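
A rough sketch of the bookkeeping this would involve, in plain user-space C. The types and function names are hypothetical and do not correspond to IPoIB's actual structures; a real implementation would also need locking and would destroy the address handles rather than just free list nodes.

```c
#include <stdlib.h>

/* Hypothetical cache entry for one remote path. */
struct cached_ah {
	unsigned short dlid;	/* remote LID the handle was built for */
	struct cached_ah *next;
};

/* Hypothetical per-net-device state: head of the handle list. */
struct toy_netdev {
	struct cached_ah *ah_list;
};

/* Remember a handle for a remote LID.  Returns 0 on success, -1 on OOM. */
int ah_cache_add(struct toy_netdev *dev, unsigned short dlid)
{
	struct cached_ah *ah = malloc(sizeof(*ah));

	if (!ah)
		return -1;
	ah->dlid = dlid;
	ah->next = dev->ah_list;
	dev->ah_list = ah;
	return 0;
}

/* On an SM change event, drop every cached handle.  Returns the count. */
int ah_cache_flush(struct toy_netdev *dev)
{
	int n = 0;

	while (dev->ah_list) {
		struct cached_ah *ah = dev->ah_list;

		dev->ah_list = ah->next;
		free(ah);
		n++;
	}
	return n;
}
```

After a flush the next transmit to each destination would have to look the path up again, which is how stale LIDs get replaced.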

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies


[openib-general] Re: Re: ipoib_mcast_send.patch

2006-02-08 Thread Michael S. Tsirkin
Quoting r. Roland Dreier [EMAIL PROTECTED]:
 Subject: Re: Re: ipoib_mcast_send.patch
 
 Michael Right, but I thought atomic test_and_set_bit implied
 Michael smp_wmb already?
 
 So did I but then I looked in the kernel source and now I think that
 set_bit operations are only ordered against other bitops that touch
 the same word.  For example ia64 just uses cmpxchg to implement the
 bitops, and powerpc just uses locked loads and stores.
 
  - R.
 


Hmm. Roland, which kernel version is that? 

On 2.6.15 I see in include/asm-powerpc/bitops.h

static __inline__ int test_and_set_bit(unsigned long nr,
				       volatile unsigned long *addr)
{
	unsigned long old, t;
	unsigned long mask = BITOP_MASK(nr);
	unsigned long *p = ((unsigned long *)addr) + BITOP_WORD(nr);

	__asm__ __volatile__(
	EIEIO_ON_SMP
"1:"	PPC_LLARX "%0,0,%3		# test_and_set_bit\n"
	"or	%1,%0,%2 \n"
	PPC405_ERR77(0,%3)
	PPC_STLCX "%1,0,%3 \n"
	"bne-	1b"
	ISYNC_ON_SMP
	: "=&r" (old), "=&r" (t)
	: "r" (mask), "r" (p)
	: "cc", "memory");

	return (old & mask) != 0;
}

EIEIO_ON_SMP is a write barrier on SMP, isn't it?

I see this in 2.6.11: include/asm-ppc64/bitops.h

static __inline__ int test_and_set_bit(unsigned long nr,
				       volatile unsigned long *addr)
{
	unsigned long old, t;
	unsigned long mask = 1UL << (nr & 0x3f);
	unsigned long *p = ((unsigned long *)addr) + (nr >> 6);

	__asm__ __volatile__(
	EIEIO_ON_SMP
"1:	ldarx	%0,0,%3		# test_and_set_bit\n\
	or	%1,%0,%2 \n\
	stdcx.	%1,0,%3 \n\
	bne-	1b"
	ISYNC_ON_SMP
	: "=&r" (old), "=&r" (t)
	: "r" (mask), "r" (p)
	: "cc", "memory");

	return (old & mask) != 0;
}

EIEIO_ON_SMP is exactly what is needed, no?

/*
 * The test_and_*_bit operations are taken to imply a memory barrier
 * on SMP systems.
 */


...

/*
 * test_and_*_bit do imply a memory barrier (?)
 */
static __inline__ int test_and_set_bit(int nr, volatile unsigned long *addr)
{
	unsigned int old, t;
	unsigned int mask = 1 << (nr & 0x1f);
	volatile unsigned int *p = ((volatile unsigned int *)addr) + (nr >> 5);

	__asm__ __volatile__(SMP_WMB "\n\
1:	lwarx	%0,0,%4 \n\
	or	%1,%0,%3 \n"
	PPC405_ERR77(0,%4)
"	stwcx.	%1,0,%4 \n\
	bne	1b"
	SMP_MB
	: "=&r" (old), "=&r" (t), "=m" (*p)
	: "r" (mask), "r" (p), "m" (*p)
	: "cc", "memory");

	return (old & mask) != 0;
}

Ahem. It does look to me like atomics imply smp_wmb.

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies
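
The pattern being debated (write some data, then set a flag bit with an
atomic read-modify-write, and rely on the two being ordered) can be
illustrated with a userspace C11 analogy. This is only a sketch under
assumptions: the names payload, flags, publish and consume are hypothetical,
and memory_order_seq_cst stands in for the barriers in the powerpc code
quoted above. It does not say what any particular kernel version guarantees.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>

/* Hypothetical shared state: bit 0 of `flags` means "payload is ready". */
static int payload;            /* plain data, written before the bit is set */
static atomic_ulong flags;     /* word holding the flag bits, starts at 0 */

/* Returns the previous value of bit `nr`, like test_and_set_bit(). */
static int test_and_set_bit_seq_cst(unsigned nr, atomic_ulong *addr)
{
    assert(nr < sizeof(unsigned long) * CHAR_BIT);
    unsigned long mask = 1UL << nr;
    /* A seq_cst read-modify-write acts as both release and acquire in the
     * C11 model, so the earlier store to `payload` is ordered before it. */
    unsigned long old = atomic_fetch_or_explicit(addr, mask,
                                                 memory_order_seq_cst);
    return (old & mask) != 0;
}

/* Writer: store the data, then publish it by setting the ready bit. */
static int publish(int value)
{
    payload = value;
    return test_and_set_bit_seq_cst(0, &flags);
}

/* Reader: only touch `payload` after observing the ready bit. */
static int consume(int *out)
{
    if (!(atomic_load_explicit(&flags, memory_order_acquire) & 1UL))
        return 0;
    *out = payload;
    return 1;
}
```

If the read-modify-write were relaxed instead, nothing in this model would
order the payload store before the bit becomes visible, which is the case an
explicit smp_wmb() would be guarding against.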
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


Re: [openib-general] Re: [PATCH] [RFC] - example user mode rdmaping/pong program using CMA

2006-02-08 Thread Michael Krause


At 11:04 AM 2/8/2006, Michael S. Tsirkin wrote:
> Quoting r. Steve Wise [EMAIL PROTECTED]:
> > Subject: Re: [openib-general] Re: [PATCH] [RFC] - example user
> > mode rdmaping/pong program using CMA
> >
> > On Wed, 2006-02-08 at 19:10 +0200, Michael S. Tsirkin wrote:
> > > Quoting r. Sean Hefty [EMAIL PROTECTED]:
> > > > Subject: RE: [openib-general] Re: [PATCH] [RFC] - example
> > > > user mode rdmaping/pong program using CMA
> > > >
> > > > Steve, looks like you have at most a single receive work
> > > > request posted at the receive workqueue at all times.
> > > > If true, this is *really* not a good idea, performance-wise,
> > > > even if you actually have at most 1 packet in flight.
> > > >
> > > > Can you provide some more details on this?
> > >
> > > See 9.7.7.2 end-to-end (message level) flow control
> >
> > I just read this section in the 1.2 version of the spec, and I still
> > don't understand what the issue really is? 9.7.7.2 talks about IBA
> > doing flow control based on the RECV WQEs posted. rping always ensures
> > that there is a RECV posted before the peer can send. This is ensured
> > by the rping protocol itself (see the comment at the front of rping.c
> > describing the ping loop).
> >
> > I'm only ever sending one outstanding message via SEND/RECV. I would
> > rather post exactly what is needed, than post some number of RECVs just
> > to be safe. Sorry if I'm being dense. What am I missing here?
> >
> > Steve.
>
> As far as I know, the credits are only updated by the ACK messages.
> If there is a single work request outstanding on the RQ,
> the ACK of the SEND message will have the credit field value 0
> (since exactly one receive WR was outstanding, and that is now
> consumed).
> As a result the remote side will think that there are no
> receive WQEs and will slow down (what spec refers to as limited WQE).

Correct. The ACK / NAK protocol used by IB is used to return
credits. To pipeline and improve performance, you must
post multiple receive work requests to account for the expected
round trip time of the fabric and the associated CA processing.

Mike
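
The credit behaviour described above can be put into a small back-of-envelope
model. This is a sketch under assumptions, not something taken from the IB
specification: the helper names, the "one extra" rule, and the idea of sizing
the receive queue from a round-trip time divided by a per-message time are all
hypothetical illustrations of "post enough receives to cover the round trip".

```c
#include <assert.h>
#include <stdint.h>

/* Credits an ACK could advertise once `consumed` of the `posted` receive
 * work requests have been used up (simplified model). */
static uint32_t credits_after(uint32_t posted, uint32_t consumed)
{
    assert(consumed <= posted);
    return posted - consumed;
}

/* Hypothetical sizing helper: how many receive work requests to keep
 * posted so the sender can keep messages in flight for a full round trip.
 *   rtt_ns     - assumed fabric round trip plus CA processing time
 *   per_msg_ns - assumed time the sender needs to emit one message
 */
static uint32_t recv_wrs_for_pipeline(uint64_t rtt_ns, uint64_t per_msg_ns)
{
    assert(per_msg_ns > 0);
    /* Messages that fit into one round trip, rounded up. */
    uint64_t in_flight = (rtt_ns + per_msg_ns - 1) / per_msg_ns;
    /* One extra so the advertised credit does not drop to zero. */
    uint64_t depth = in_flight + 1;
    return depth > UINT32_MAX ? UINT32_MAX : (uint32_t)depth;
}
```

With a single posted receive, credits_after(1, 1) is 0, which is the
situation described for rping; with, say, a 10 us round trip and 2 us per
message, the model suggests keeping 6 receives posted.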


Re: [openib-general] Re: [PATCH] [RFC] - example user mode rdmaping/pong program using CMA

2006-02-08 Thread Michael Krause


At 11:35 AM 2/8/2006, Steve Wise wrote:
> > > I just read this section in the 1.2 version of the spec, and I still
> > > don't understand what the issue really is? 9.7.7.2 talks about IBA
> > > doing flow control based on the RECV WQEs posted. rping always ensures
> > > that there is a RECV posted before the peer can send. This is ensured
> > > by the rping protocol itself (see the comment at the front of rping.c
> > > describing the ping loop).
> > >
> > > I'm only ever sending one outstanding message via SEND/RECV. I would
> > > rather post exactly what is needed, than post some number of RECVs just
> > > to be safe. Sorry if I'm being dense. What am I missing here?
> > >
> > > Steve.
> >
> > As far as I know, the credits are only updated by the ACK messages.
> > If there is a single work request outstanding on the RQ,
> > the ACK of the SEND message will have the credit field value 0
> > (since exactly one receive WR was outstanding, and that is now consumed).
> >
> > As a result the remote side will think that there are no
> > receive WQEs and will slow down (what spec refers to as limited WQE).
>
> Oh. I understand now. This is an issue with only 1 RQ WQE posted and
> how IB tries to inform the peer transport of the WQE count. For iWARP,
> none of this transport-level flow control happens (and I'm more familiar
> with iWARP than IB).

For iWARP, we decided not to implement application receiver based flow
control for two reasons: TCP provides transport-level flow control (IB
does not provide the equivalent per se), and, upon examination, the
majority of ULPs already exchange and track the number of receive
buffers allowed to be processed, so there is no need to replicate this
in iWARP. There are also some subtleties between a message-based
transport and a byte stream such as TCP that go into the equation, but
these are not that important for most application writers to deal
with.

Mike


[openib-general] Re: ipoib_mcast_send.patch

2006-02-08 Thread Roland Dreier
So something like this should be good enough:

--- infiniband/ulp/ipoib/ipoib_multicast.c	(revision 5337)
+++ infiniband/ulp/ipoib/ipoib_multicast.c	(working copy)
@@ -533,8 +533,10 @@ void ipoib_mcast_join_task(void *dev_ptr
 	}
 
 	if (!priv->broadcast) {
-		priv->broadcast = ipoib_mcast_alloc(dev, 1);
-		if (!priv->broadcast) {
+		struct ipoib_mcast *broadcast;
+
+		broadcast = ipoib_mcast_alloc(dev, 1);
+		if (!broadcast) {
 			ipoib_warn(priv, "failed to allocate broadcast group\n");
 			mutex_lock(&mcast_mutex);
 			if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
@@ -544,10 +546,11 @@ void ipoib_mcast_join_task(void *dev_ptr
 			return;
 		}
 
-		memcpy(priv->broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
+		spin_lock_irq(&priv->lock);
+		memcpy(broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
 		       sizeof (union ib_gid));
+		priv->broadcast = broadcast;
 
-		spin_lock_irq(&priv->lock);
 		__ipoib_mcast_add(dev, priv->broadcast);
 		spin_unlock_irq(&priv->lock);
 	}
@@ -701,7 +704,9 @@ void ipoib_mcast_send(struct net_device 
 	 */
 	spin_lock(&priv->lock);
 
-	if (!test_bit(IPOIB_MCAST_STARTED, &priv->flags) || !priv->broadcast) {
+	if (!test_bit(IPOIB_MCAST_STARTED, &priv->flags)		||
+	    !priv->broadcast					||
+	    !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
 		++priv->stats.tx_dropped;
 		dev_kfree_skb_any(skb);
 		goto unlock;


RE: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediate data proposal

2006-02-08 Thread Michael Krause


At 09:16 PM 2/6/2006, Sean Hefty wrote:
> The requirement is to provide an API that supports RDMA writes with
> immediate data. A send that follows an RDMA write is not immediate data,
> and the API should not be constructed around trying to make it so.
>
> To be clear, I believe that write with immediate should be part of the
> normal APIs, rather than an extension, but should be designed around
> those devices that provide it natively.

One thing to keep in mind is that the IBTA workgroup responsible for the
transport wanted to eliminate immediate data support entirely, but it was
retained solely to enable VIA application migration (even though the
application base was quite small). If that requirement could have
been eliminated, then it would have been gone in a heartbeat.
Given that an RDMA WRITE followed by a SEND provides the same application
semantics based on the use models, iWARP chose not to support immediate
data.

So, here we have a long discussion on attempting to perpetuate a concept
that is not universal across transports and was deemed to have minimal
value that most wanted to see removed from the architecture. One
has to question the value of trying to develop any API / software to
support immediate data instead of just enabling the preferred method,
which is RDMA WRITE followed by SEND. I agree with those who have contended
that this is difficult to do in a general purpose fashion. When all
of this is taken into account, it seems the only good engineering answer
is to eliminate immediate data support in the software and focus on the
method that works across all interconnects.

Mike


[openib-general] Re: IPoIB and lid change

2006-02-08 Thread Roland Dreier
    Michael> One simple way to address this would be to have a list of
    Michael> all address handles per net device and kill them on an SM
    Michael> change event.

Seems reasonable.  It seems a little painful to implement at a first
glance but I might be looking at it wrong.

 - R.


Re: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediate data proposal

2006-02-08 Thread Roland Dreier
    Michael> So, here we have a long discussion on attempting to
    Michael> perpetuate a concept that is not universal across
    Michael> transports and was deemed to have minimal value that most
    Michael> wanted to see removed from the architecture.

But this discussion is being driven by an application developer who
does see value in immediate data.

Arlin, can you quantify the benefit you see from RDMA write with
immediate vs. RDMA write followed by a send?

 - R.


[openib-general] Re: ipoib_mcast_send.patch

2006-02-08 Thread Michael S. Tsirkin
Quoting r. Roland Dreier [EMAIL PROTECTED]:
> Subject: Re: ipoib_mcast_send.patch
>
> So something like this should be good enough:
>
> --- infiniband/ulp/ipoib/ipoib_multicast.c	(revision 5337)
> +++ infiniband/ulp/ipoib/ipoib_multicast.c	(working copy)
> @@ -533,8 +533,10 @@ void ipoib_mcast_join_task(void *dev_ptr
>  	}
>  
>  	if (!priv->broadcast) {
> -		priv->broadcast = ipoib_mcast_alloc(dev, 1);
> -		if (!priv->broadcast) {
> +		struct ipoib_mcast *broadcast;
> +
> +		broadcast = ipoib_mcast_alloc(dev, 1);
> +		if (!broadcast) {
>  			ipoib_warn(priv, "failed to allocate broadcast group\n");
>  			mutex_lock(&mcast_mutex);
>  			if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
> @@ -544,10 +546,11 @@ void ipoib_mcast_join_task(void *dev_ptr
>  			return;
>  		}
>  
> -		memcpy(priv->broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
> +		spin_lock_irq(&priv->lock);
> +		memcpy(broadcast->mcmember.mgid.raw, priv->dev->broadcast + 4,
>  		       sizeof (union ib_gid));
> +		priv->broadcast = broadcast;
>  
> -		spin_lock_irq(&priv->lock);
>  		__ipoib_mcast_add(dev, priv->broadcast);
>  		spin_unlock_irq(&priv->lock);
>  	}

That's identical to what I posted up to this point, right?

> @@ -701,7 +704,9 @@ void ipoib_mcast_send(struct net_device 
>  	 */
>  	spin_lock(&priv->lock);
>  
> -	if (!test_bit(IPOIB_MCAST_STARTED, &priv->flags) || !priv->broadcast) {
> +	if (!test_bit(IPOIB_MCAST_STARTED, &priv->flags)		||
> +	    !priv->broadcast					||
> +	    !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
>  		++priv->stats.tx_dropped;
>  		dev_kfree_skb_any(skb);
>  		goto unlock;

I thought it's important for performance to queue packets under
mcast->pkt_queue? If not, why do we do it?
Maybe we shouldn't call netif_carrier_on if we drop all packets?

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies


[openib-general] Re: ipoib_mcast_send.patch

2006-02-08 Thread Roland Dreier
    Michael> That's identical to what I posted till this point - right?

I think I added one blank line, but other than that, yes.

    Michael> I thought it's important for performance to queue packets
    Michael> under mcast->pkt_queue? If not why do we do it?  Maybe we
    Michael> shouldn't call netif_carrier_on if we drop all packets?

The queueing is there so that we aren't guaranteed to drop the first
multicast packet sent to a given group.  I'm not sure that it really
is important, but it does seem like it would be bad to lose that
packet every time.

From reading the code we can't call netif_carrier_on until after
priv->broadcast has the attached flag set.  In ipoib_mcast_join_task(),
we have

	if (!test_bit(IPOIB_MCAST_FLAG_ATTACHED, &priv->broadcast->flags)) {
		ipoib_mcast_join(dev, priv->broadcast, 0);
		return;
	}

and then at the very bottom

	netif_carrier_on(dev);

 - R.


[openib-general] Re: IPoIB and lid change

2006-02-08 Thread Michael S. Tsirkin
Quoting r. Roland Dreier [EMAIL PROTECTED]:
> Subject: Re: IPoIB and lid change
>
>     Michael> One simple way to address this would be to have a list of
>     Michael> all address handles per net device and kill them on an SM
>     Michael> change event.
>
> Seems reasonable.  It seems a little painful to implement at a first
> glance but I might be looking at it wrong.

It will be very easy once you merge ipoib_all_neigh_issues_2.patch
since that gets us a list of neigh to walk on an SM event.

-- 
Michael S. Tsirkin
Staff Engineer, Mellanox Technologies


Re: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediate data proposal

2006-02-08 Thread Arlin Davis

Roland Dreier wrote:

>     Michael> So, here we have a long discussion on attempting to
>     Michael> perpetuate a concept that is not universal across
>     Michael> transports and was deemed to have minimal value that most
>     Michael> wanted to see removed from the architecture.
>
> But this discussion is being driven by an application developer who
> does see value in immediate data.
>
> Arlin, can you quantify the benefit you see from RDMA write with
> immediate vs. RDMA write followed by a send?

We need speed and simplicity.

A very latency sensitive application that requires immediate 
notification of RDMA write completion on the remote node without ANY 
latency penalties associated with combining operations, HCA priority 
rules across QPs, wire congestion, etc. An application that has no 
requirement for messaging outside of remote rdma write completion 
notifications. The application would not have to register and manage 
additional message buffers on either side, we can just size the queues 
accordingly and post zero byte messages. We need something that would be 
equivalent to sitting there polling on the last byte of inbound data. 
But, since data ordering within an operation is not guaranteed, that is 
not an option. So, rdma with immediate data is the most optimal and 
simplistic method for indication of RDMA-write completion that we have 
available today. In fact, I would like to see it increased in size to 
make it even more useful.


-arlin








[openib-general] Could not retrieve handle to the HCA InfiniHost0 (VAPI_EINVAL_HCA_ID)

2006-02-08 Thread Srirangam Addepalli
Hello All,

When I run vstat I get the following error. What does this mean?

vstat
1 HCA found: hca_id=InfiniHost0
Error: Could not retrieve handle to the HCA InfiniHost0 (VAPI_EINVAL_HCA_ID)

/var/log/messages has this:

[KERNEL_IB][_tslbTavorPnPEventHandler][/var/tmp/IBGD//tmp/openib/infiniband/ib_verbs/hw/provider/tavor_main.c:352]_tslbTavorPnPEventHandler: could not add HCA InfiniHost0 (-19)

What might have gone wrong? Does anyone know?


Rangam

Re: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediate data proposal

2006-02-08 Thread Roland Dreier
    Arlin> A very latency sensitive application that requires
    Arlin> immediate notification of RDMA write completion on the
    Arlin> remote node without ANY latency penalties associated with
    Arlin> combining operations, HCA priority rules across QPs, wire
    Arlin> congestion, etc. An application that has no requirement for
    Arlin> messaging outside of remote rdma write completion
    Arlin> notifications. The application would not have to register
    Arlin> and manage additional message buffers on either side, we
    Arlin> can just size the queues accordingly and post zero byte
    Arlin> messages. We need something that would be equivalent to
    Arlin> sitting there polling on the last byte of inbound
    Arlin> data. But, since data ordering within an operation is not
    Arlin> guaranteed that is not an option. So, rdma with immediate
    Arlin> data is the most optimal and simplistic method for
    Arlin> indication of RDMA-write completion that we have available
    Arlin> today. In fact, I would like to see it increased in size to
    Arlin> make it even more useful.

Hmm.  Can you put a number on how much better RDMA write with
immediate is on current HCA hardware?  How does using the underlying
OpenIB verbs ability to post a list of work requests compare (ie
posting an RDMA write followed by a send in one verbs call)?
Maybe post multiple is a better direction for DAT.

 - R.
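
The "post a list of work requests in one call" idea mentioned above comes
down to linking two requests through a next pointer and handing the head of
the list to a single post call. The sketch below only models that shape: the
struct, the opcode names and post_send_list() are hypothetical stand-ins
written for illustration, not the OpenIB verbs definitions, and the mock post
function merely walks the list.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in types for illustration only; not the verbs definitions. */
enum wr_opcode { WR_RDMA_WRITE, WR_SEND };

struct send_wr {
    uint64_t        wr_id;   /* caller-chosen identifier */
    enum wr_opcode  opcode;
    struct send_wr *next;    /* next request in the same post, or NULL */
};

/* Link an RDMA write and the send that signals its completion. */
static struct send_wr *chain_write_then_send(struct send_wr *write_wr,
                                             struct send_wr *notify_wr)
{
    assert(write_wr != NULL && notify_wr != NULL);
    write_wr->opcode  = WR_RDMA_WRITE;
    write_wr->next    = notify_wr;
    notify_wr->opcode = WR_SEND;
    notify_wr->next   = NULL;
    return write_wr;
}

/* Mock "post": walks the list in order and counts the requests. */
static int post_send_list(const struct send_wr *head)
{
    int posted = 0;
    for (const struct send_wr *wr = head; wr != NULL; wr = wr->next)
        posted++;
    return posted;
}
```

In this model the application still sees two operations (a write and a
send), but pays for only one call into the provider, which is the comparison
being asked for.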


RE: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediate data proposal

2006-02-08 Thread Larsen, Roy K








> One thing to keep in mind is that the IBTA workgroup
> responsible for the transport wanted to eliminate immediate data support
> entirely but it was retained solely to enable VIA application migration (even
> though the application base was quite small). If that requirement could
> have been eliminated, then it would have been gone in a heart beat. Given
> a RDMA-WRITE followed by a SEND provides the same application semantics based
> on the use models, iWARP chose not to support immediate data.

Mike,

I was not part of the original IBTA discussions and I won't argue
whether this facility should or shouldn't have been included. Nevertheless,
it is part of the specification, there are HCA vendors that implement it, and
we have applications that make use of it. I would, however, disagree with
your assertion that write followed by a send is semantically equivalent to
write immediate. Ordering may be semantically the same, but the service
is not. Receive work completions are explicitly indicated as being
associated with immediate data and therefore an associated write completion. A
write followed by a send does not provide the same indication semantic.

Roy









RE: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediate data proposal

2006-02-08 Thread Caitlin Bestler
[EMAIL PROTECTED] wrote:
>     Arlin> A very latency sensitive application that requires
>     Arlin> immediate notification of RDMA write completion on the
>     Arlin> remote node without ANY latency penalties associated with
>     Arlin> combining operations, HCA priority rules across QPs, wire
>     Arlin> congestion, etc. An application that has no requirement for
>     Arlin> messaging outside of remote rdma write completion
>     Arlin> notifications. The application would not have to register
>     Arlin> and manage additional message buffers on either side, we
>     Arlin> can just size the queues accordingly and post zero byte
>     Arlin> messages. We need something that would be equivalent to
>     Arlin> sitting there polling on the last byte of inbound
>     Arlin> data. But, since data ordering within an operation is not
>     Arlin> guaranteed that is not an option. So, rdma with immediate
>     Arlin> data is the most optimal and simplistic method for
>     Arlin> indication of RDMA-write completion that we have available
>     Arlin> today. In fact, I would like to see it increased in size to
>     Arlin> make it even more useful.
>
> Hmm.  Can you put a number on how much better RDMA write with
> immediate is on current HCA hardware?  How does using the
> underlying OpenIB verbs ability to post a list of work
> requests compare (ie posting an RDMA write followed by a send
> in one verbs call)?
> Maybe post multiple is a better direction for DAT.

The distinction between Write and Send versus post multiple
is that it maintains a very simple one-to-one correspondence
with the post_recv at the data sink.

I also do not see how the *application* keeping the write and send
semantics can have a negative performance implication if we allow
InfiniBand Providers to encode it as an RDMA Write with Immediate.

If the Data Source needs to communicate to the Data Sink that
a specific RDMA Write transfer is done then it is sending a
message. Information transfer and synchronization are occurring.

I fail to see the value, let alone the optimization, of layering
on an extra bit of information disguised as an opcode and using
a specific transport's encoding methods as the model for a transport
neutral API (particularly one at the DAT layer, at the verb layer
it is a different issue because at the verb layer we do not want
to hide any hardware capabilities even while encouraging safe
harbor transport neutral practices).

If distinguishing between 32-bit messages and 32-bit immediates
that can arrive in indeterminate order is really that important
to your application then maybe you really needed a 33-bit message
to begin with. Encoding application layer information via your
choice of carrier pigeon is not a very robust strategy.





RE: [openib-general] Re: [PATCH] [RFC] - example user mode rdmaping/pong program using CMA

2006-02-08 Thread Caitlin Bestler
[EMAIL PROTECTED] wrote:
> At 11:35 AM 2/8/2006, Steve Wise wrote:
> > > > I just read this section in the 1.2 version of the spec, and I still
> > > > don't understand what the issue really is?  9.7.7.2 talks about IBA
> > > > doing flow control based on the RECV WQEs posted.  rping always ensures
> > > > that there is a RECV posted before the peer can send.  This is ensured
> > > > by the rping protocol itself (see the comment at the front of rping.c
> > > > describing the ping loop).
> > > >
> > > > I'm only ever sending one outstanding message via SEND/RECV.  I would
> > > > rather post exactly what is needed, than post some number of RECVs just
> > > > to be safe.  Sorry if I'm being dense.  What am I missing here?
> > > >
> > > > Steve.
> > >
> > > As far as I know, the credits are only updated by the ACK messages.
> > > If there is a single work request outstanding on the RQ,
> > > the ACK of the SEND message will have the credit field value 0
> > > (since exactly one receive WR was outstanding, and that is now consumed).
> > >
> > > As a result the remote side will think that there are no
> > > receive WQEs and will slow down (what spec refers to as limited WQE).
> >
> > Oh.  I understand now.  This is an issue with only 1 RQ WQE posted and
> > how IB tries to inform the peer transport of the WQE count.  For iWARP,
> > none of this transport-level flow control happens (and I'm more familiar
> > with iWARP than IB).
>
> For iWARP, we decided to not implement application receiver
> based flow control due to two items: TCP provides
> transport-level flow control (IB does not provide the
> equivalent per se) and upon examination of the majority of
> the ULP, they exchange and track the number of receive
> buffers allowed to be processed thus there is no need to
> replicate this in iWARP.  There are some subtleties as well
> between a message-based transport and a byte stream such as
> TCP that go into the equation but these are not that
> important for most application writers to deal with.
>
> Mike


But in terms of compiling the safe harbor transport neutral
recommended programming practices, I think this is a valid
point. Having one spare buffer is a good safety mechanism
at the application layer in general, *and* it may prevent
snarls in the transport layer flow control. 

Suggesting that consumers avoid letting the RQ hit empty
strikes me as a valid transport neutral recommendation.
And we'll improve the public education by following those
recommendations in sample and test programs.



RE: [dat-discussions] [openib-general] [RFC] DAT 2.0 immediate data proposal

2006-02-08 Thread Sean Hefty
> Hmm.  Can you put a number on how much better RDMA write with
> immediate is on current HCA hardware?  How does using the underlying
> OpenIB verbs ability to post a list of work requests compare (ie
> posting an RDMA write followed by a send in one verbs call)?
> Maybe post multiple is a better direction for DAT.

A post multiple call as a general API makes sense, but I think that's a
separate issue.

Given that IB provides true immediate data with RDMA writes, a way should be
available to make use of it.  I don't know what the performance numbers are for
a write with immediate versus a write followed by a send, but I don't
think that anyone could argue that the write with immediate wouldn't perform
better.

To me, the question is whether write with immediate is supported as a transport
specific extension, which was Arlin's original patch, or through some standard
API.  The attempt to make the API standard, so that iWarp could emulate it
(poorly in my view), is what appears to be driving the disagreements.

It also appears to me that the decisions are coming down to one of the
following.  If iWarp can emulate write with immediate, then a generic API should
be used.  If iWarp cannot properly emulate write with immediate, then the API
should be transport specific.  It's curious to me that in both cases, iWarp is
driving the API decision and design for something that is an IB specific
feature.

- Sean




RE: [openib-general] FW: [PATCH 1 of 3] mad: large RMPP support

2006-02-08 Thread Jack Morgenstein
My point was not the total storage used for the array (it ends up more
than the linked list, as you noted).

I'm concerned that an allocation of a 4K buffer may fail in a situation
where lots of small allocations of around 256 bytes would succeed.  Is
your point that if we fail to allocate a 4K buffer, we're in deep
trouble already?  Note that I've only considered a 1000 host cluster.
What about scalability (e.g., 10,000 nodes -- we then need a 40K buffer)
-- the linked list has no scalability problem (no need to push RMPP
handling to user space).

Regarding the list-walk, if we track the last-sent segment in the
list, there is no need to do the list walk (we simply get the next
segment in the list).  We'll only have a short list walk when the ack
pointer gets updated (need to walk forward only
current-RMPP-ack-window-size items in the linked list from the
previously ack'ed item).

--
What is the reason you are thinking about 64-byte boundary support?

Jack


-Original Message-
From: Sean Hefty [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, February 08, 2006 7:13 PM
To: Jack Morgenstein; openib-general@openib.org
Subject: RE: [openib-general] FW: [PATCH 1 of 3] mad: large RMPP support

> For example, a 1000 host cluster, with 2 ports per HCA will have at
> least 4000 records in a SubnAdmGetTableResp for all PortInfo records on
> the network (2000 for HCAs, and at least 2000 for the switch ports).
> Such a query response will generate an RMPP of size 256K -- 1000
> segments, or a 4K buffer on an X86 machine just for the array (assuming
> one allocation per RMPP segment -- N=1).

I think that this is a good reason to use an array.  Walking a 1000 entry list
1000 times is a substantial performance hit.  Lost MADs and retries will make
this worse.

A 4K buffer for the array is less than the 8K total needed for the 1000 list
items.  We're already talking about allocating over 256K of memory just for the
data payload.  An additional contiguous 4k buffer seems like a minor issue.  I'm
not convinced that there's a real issue here.

To support ridiculously large transfers from userspace, we may need to push the
RMPP handling up into userspace.

- Sean


RE: [openib-general] FW: [PATCH 1 of 3] mad: large RMPP support

2006-02-08 Thread Sean Hefty
> I'm concerned that an allocation of a 4K buffer may fail in a situation
> where lots of small allocations of around 256 bytes would succeed.  Is
> your point that if we fail to allocate a 4K buffer, we're in deep
> trouble already?  Note that I've only considered a 1000 host cluster.

Yes - if we can't allocate a 4k buffer, it seems highly unlikely that we'd be
able to allocate 1000 256-byte buffers.

> What about scalability (e.g., 10,000 nodes -- we then need a 40K buffer)
> -- the linked list has no scalability problem (no need to push RMPP
> handling to user space).

I did consider this, and I don't know when we'll start hitting issues allocating
a single data buffer.  But we're going to ask for 10,000 256-byte buffers - over
2.5 MB of kernel memory in order to perform this single data transfer.  Is it
likely that we can allocate that much memory, but not the 40k buffer?  I really
don't know.  If the answer is yes, then I agree that using a linked list would
be better.
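
The sizes being traded off in this exchange can be checked with a few lines
of arithmetic. This is a sketch under assumptions drawn from the thread:
256-byte segments, one pointer per segment for the array, and two pointers
per segment for list linkage (which is what makes the "8K for 1000 list
items" figure work out with 4-byte pointers). The helper names are
hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define RMPP_SEG_BYTES 256u   /* one MAD-sized segment, as in the thread */

/* Total payload memory for `segments` separately allocated segments. */
static size_t payload_bytes(size_t segments)
{
    return segments * RMPP_SEG_BYTES;
}

/* One contiguous array holding a pointer to each segment. */
static size_t pointer_array_bytes(size_t segments, size_t ptr_size)
{
    assert(ptr_size > 0);
    return segments * ptr_size;
}

/* Linkage overhead if each segment carries prev/next pointers instead. */
static size_t list_link_bytes(size_t segments, size_t ptr_size)
{
    assert(ptr_size > 0);
    return segments * 2 * ptr_size;
}
```

With 4-byte pointers, 1000 segments need a 4000-byte array versus 8000 bytes
of list links, and 10,000 segments need a 40,000-byte array alongside
2,560,000 bytes of payload, matching the figures quoted above.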

> Regarding the list-walk, if we track the last-sent segment in the
> list, there is no need to do the list walk (we simply get the next
> segment in the list).  We'll only have a short list walk when the ack
> pointer gets updated (need to walk forward only
> current-RMPP-ack-window-size items in the linked list from the
> previously ack'ed item).

I thought of this as well.  For efficiency, you need to track the last sent and
last acked, meaning that the list will be walked at most twice.  You may be able
to jump the ack pointer to last sent if that is a common case.
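
The two-cursor scheme described here can be sketched as follows. This is a
minimal illustration under assumptions, not the MAD layer code: struct seg,
struct send_state and both functions are hypothetical names, segments are
numbered from 1, and the list is singly linked for brevity.

```c
#include <assert.h>
#include <stddef.h>

struct seg {
    int         seg_num;   /* 1-based RMPP segment number */
    struct seg *next;
};

struct send_state {
    struct seg *head;        /* first segment of the message */
    struct seg *last_sent;   /* most recently sent segment, NULL if none */
    struct seg *last_acked;  /* highest acked segment, NULL if none */
};

/* O(1): step the send cursor; returns NULL when everything was sent. */
static struct seg *next_to_send(struct send_state *st)
{
    struct seg *s = st->last_sent ? st->last_sent->next : st->head;
    if (s)
        st->last_sent = s;
    return s;
}

/* Move the ack cursor to seg_num. Returns the number of list steps taken,
 * or -1 if seg_num is behind the cursor or was never sent. */
static int ack_up_to(struct send_state *st, int seg_num)
{
    struct seg *s = st->last_acked ? st->last_acked : st->head;
    int steps = 0;

    if (!st->last_sent || seg_num > st->last_sent->seg_num)
        return -1;
    /* Common case: the ack covers everything sent so far, so jump. */
    if (st->last_sent->seg_num == seg_num) {
        st->last_acked = st->last_sent;
        return 0;
    }
    /* Otherwise walk forward from the previous ack, at most one window. */
    while (s && s->seg_num < seg_num) {
        s = s->next;
        steps++;
    }
    if (!s || s->seg_num != seg_num)
        return -1;
    st->last_acked = s;
    return steps;
}
```

Each cursor only ever moves forward, so over a whole transfer the list is
traversed at most twice in total, once per cursor.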

> What is the reason you are thinking about 64-byte boundary support?

I was concerned about 64-byte values in the MADs aligned on a 32-byte boundary.
But then I think that some of the MADs have this issue anyway by architectural
design.

- Sean
