[openib-general] Re: [openib-commits] r2426 - gen2/users/jlentini/linux-kernel/dat-provider

2005-05-25 Thread Bernhard Fischer
On Mon, May 23, 2005 at 10:32:12AM -0400, James Lentini wrote:

Bernhard,

Thank you for pointing these items out. Comments below:

On Sat, 21 May 2005, Bernhard Fischer wrote:

On Fri, May 20, 2005 at 03:05:08PM -0700, [EMAIL PROTECTED] wrote:
Author: jlentini
Date: 2005-05-20 15:05:07 -0700 (Fri, 20 May 2005)
New Revision: 2426

-if (DAPL_GET_CQE_OPTYPE(cqe_ptr) == OP_RECEIVE) {
+ "\t\t work_req_id %lli\n", cqe_ptr->wr_id);
[[:space:]] --^ ?

Which space are you referring to? The one between \t\t and 
work_req_id?
Yes, I meant the one between \t\t and work_req_id. I now see that those
are part of the formatting, so I retract this comment. Sorry.

@@ -667,21 +657,21 @@

#ifdef DAPL_DBG
 /* Current gen2 mthca is not setting the opcode in 
 seccesful cqe  */

s/secces/succes/g

-/* The opcode will be OP_SEND or OP_RECEIVE acording 
the is_send bit  */
+/* The opcode will be IB_WC_SEND or IB_WC_RECV 
acording the is_send bit  */
acording -enoparse: s/acord/accord/g

also in:
gen2/branches/shaharf-ibat/src/userspace/management/osm/opensm/osm_state_mgr.c
gen2/branches/roland-uverbs/src/userspace/management/osm/opensm/osm_sa_mcmember_record.c
gen2/utils/src/linux-user/ibdm/datamodel/ibdm.i
gen2/utils/src/linux-user/IBMgtSim/src/ibdm.i
gen2/utils/src/linux-user/IBMgtSim/utils/RunSimTest
there: callabcks: -enoparse;
continue with acord:
gen2/users/jlentini/linux-kernel/dat-provider/dapl_evd_util.c
gen2/trunk/src/userspace/management/osm/opensm/osm_state_mgr.c

I've fixed my error:

gen2/users/jlentini/linux-kernel/dat-provider/dapl_evd_util.c

Ok, thank you.

The others belong to Roland, Shahar, and Eitan.

-{
-struct dat_ep_attr ep_attr;
-struct dat_named_attr ep_state;
+(void) dapl_modify_qp_state_to_error(ep_ptr->qp_handle);

Sun will hopefully fix all of
egrep -ri "(^[[:space:]]*\(void\))"
openib.gen2/upstream/gen2/trunk/|egrep -v "svn-(text|base)"
so I won't comment on that single occurrence above ;)

I didn't realize the convention was to not have spaces before a cast. 
This isn't in Documentation/CodingStyle. I know I've been adding these 
here and there. I'll fix these as I see them.

I meant that Tom Duffy and you will take care of converting a couple of
those into void functions, so the casts will no longer be needed. It's
not worth the effort to remove the space after the cast. On the contrary,
personally, I find that the space makes the code easier to read.

cheers,
Bernhard
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general


RE: [openib-general] [HELP] Encounter Kernel Panic when Add MellanoxHCA Supporting on 2.6.9 Kernel

2005-05-25 Thread Cong, Lenber
I tried the patches (2.6.12-to-2.6.9, not svn backport) on an EM64T desktop 
(without HCA card). The kernel can be installed successfully.

I still can't reboot the kernel on Xeon SMP server, even with the new patches 
(svn backport). The same error was encountered.

Then I disabled the option CONFIG_DEBUG_SPINLOCK.
The error message disappeared, but the kernel still can't be booted.

Can I assume the problem is the HCA card? Or is the issue related to the
SMP platform? So strange..

Thanks - Lenber

-----Original Message-----
From: Woodruff, Robert J
Sent: 2005-05-25 6:34
To: Cong, Lenber; openib-general@openib.org
Cc: 'Roland Dreier'
Subject: RE: [openib-general] [HELP] Encounter Kernel Panic when Add 
MellanoxHCA Supporting on 2.6.9 Kernel

Roland wrote,  
I just tried the latest svn on 2.6.11 with CONFIG_DEBUG_SPINLOCK
turned on, and I didn't see any problems.  The message

driver/infiniband/hw/mthca/mthca_allocator.c: 46: spin_is_locked on 
uninitialized spinlock: f70f7dac

is coming from CHECK_LOCK, which is turned on with
CONFIG_DEBUG_SPINLOCK.  However there should be more traceback
information printed to the console as well... did that get dumped as
well?

Bob> Roland, has anything been fixed since the 2.6.12 drop in
Bob> mthca that could account for this panic ?

Not that I know of...

 - R.

I just installed the 

infiniband-backport-2.6.12-to-2.6.9-kernel-fixups-01.diff   
infiniband-backport-2.6.12-to-2.6.9-openib-drivers-02.diff  
infiniband-backport-2.6.12-to-2.6.9-openib-fixups-03.diff  

backport patches on a couple of old 900 MHz IA32 Xeon boxes
and was able to build the kernel, load IPoIB and ping another node.
I used the Redhat configuration file /boot/config-2.6.9-5.ELsmp,
did a make oldconfig and selected modules for all of the infiniband drivers.
Then I built and installed the kernel with no problems. 

Maybe it is the platform (I have seen problems in the past with
the BIOS on some platforms being able to map the Mellanox H/W correctly)
or could bad Mellanox H/W cause this ?

Do you have any other platforms that you could try it on ?

woody




[openib-general] [PATCH] pingpong test: zero-initialize all attributes

2005-05-25 Thread Michael S. Tsirkin
Some address handle attributes (notably static rate flow control)
were uninitialized. Fix this by initializing all fields to 0.

(Other examples may need fixing, too).

Signed-off-by: Michael S. Tsirkin [EMAIL PROTECTED]

Index: pingpong.c
===
--- pingpong.c  (revision 2437)
+++ pingpong.c  (working copy)
@@ -360,19 +360,19 @@ static int pp_post_send(struct pingpong_
 static int pp_connect_ctx(struct pingpong_context *ctx, int port, int my_psn,
  struct pingpong_dest *dest)
 {
-   struct ibv_qp_attr attr;
-
-   attr.qp_state           = IBV_QPS_RTR;
-   attr.path_mtu           = IBV_MTU_1024;
-   attr.dest_qp_num        = dest->qpn;
-   attr.rq_psn             = dest->psn;
-   attr.max_dest_rd_atomic = 1;
-   attr.min_rnr_timer      = 12;
-   attr.ah_attr.is_global  = 0;
-   attr.ah_attr.dlid       = dest->lid;
-   attr.ah_attr.sl         = 0;
-   attr.ah_attr.src_path_bits = 0;
-   attr.ah_attr.port_num   = port;
+   struct ibv_qp_attr attr = {
+       .qp_state               = IBV_QPS_RTR,
+       .path_mtu               = IBV_MTU_1024,
+       .dest_qp_num            = dest->qpn,
+       .rq_psn                 = dest->psn,
+       .max_dest_rd_atomic     = 1,
+       .min_rnr_timer          = 12,
+       .ah_attr.is_global      = 0,
+       .ah_attr.dlid           = dest->lid,
+       .ah_attr.sl             = 0,
+       .ah_attr.src_path_bits  = 0,
+       .ah_attr.port_num       = port,
+   };
    if (ibv_modify_qp(ctx->qp, &attr,
  IBV_QP_STATE  |
  IBV_QP_AV |

-- 
MST - Michael S. Tsirkin
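
The effect the patch relies on, that a designated initializer zero-fills
every member it does not name, can be checked with a small stand-alone
sketch. The structs below are hypothetical stand-ins, not the real
struct ibv_qp_attr; only a few field names echo the patch.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the address-handle and QP attribute structs. */
struct demo_ah_attr {
	uint16_t dlid;
	uint8_t  sl;
	uint8_t  static_rate;   /* the kind of field the old code never set */
	uint8_t  port_num;
};

struct demo_qp_attr {
	int                 qp_state;
	uint32_t            dest_qp_num;
	struct demo_ah_attr ah_attr;
};

/* Build the attributes with a designated initializer: every member that is
 * not named (here sl and static_rate) is implicitly initialized to zero. */
struct demo_qp_attr demo_make_attr(uint32_t qpn, uint16_t lid, uint8_t port)
{
	struct demo_qp_attr attr = {
		.qp_state         = 1,
		.dest_qp_num      = qpn,
		.ah_attr.dlid     = lid,
		.ah_attr.port_num = port,
	};
	return attr;
}
```

With member-by-member assignment to an uninitialized local, static_rate
would instead hold whatever happened to be on the stack.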


Re: [openib-general] RE: OpenSM Routing Scalability Proposal

2005-05-25 Thread Hal Rosenstock
Hi Eitan,

On Tue, 2005-05-17 at 10:53, Eitan Zahavi wrote: 
 Hi All,
 
 This is an updated proposal document for your comments.

I finally got a chance to read this. Some comments below.

 The main change is in describing the need for preserving enough data
 to enable incremental routing algorithm. 

I think incremental can help but presents some new issues.

 So the actual proposal is to implement the algorithm described in
 section 4.3.

4.1 (min hop) and 4.2 (up/down) are already implemented, right ?

 EZ OpenSM Routing.pdf 

It seems like there are 2 parts to 4.3:
1. Min hop table per leaf switch rather than per LID
What are the savings for this ? Seems like in terms of memory, this is
something like a divisor of L times the number of LIDs per HCA port.
Of course, switch port 0s on non leaf switches need to be accommodated.

2. Incremental routing (5)
a. Subcase of 5 where there is no other link between 2 adjacent
switches. Is another way of stating this, examine next hop switches to
see if there is a path between the 2 original switches and keep
expanding the depth until 1 is found ? Couldn't this be worse from a
compute standpoint than rerouting everything depending on the topology
(the likelihood of another path between the 2 original switches) ?

b. 5 asks "How do we support topology changes like moving an HCA from
one Switch to another?" Also, what about a link moving from one switch
to another ? It seems that link down is handled, but nothing is done on
a link up. Doesn't there need to be incremental defined for links being
added ?

c. Also, with incremental routing, it's unclear to me how the paths
found would compare with the ones which would be determined from the
full algorithm (from scratch). Also, would there be some point at which
the full routing would be retriggered ? 

d. Clearly, there are end node responsibilities here as well (whether
this is done incrementally or fully or something else). 

3. Persistency (6)
a. Full LFT storage (6.1) This presumes that the determination of a
topology change upon discovery is cheaper computationally than running
the routing. Has this been proven ? (I hope this is the case).
b. Root nodes storage (6.2) Are the root nodes determined by the routing
or supplied to the routing ? Are they different for unicast and
multicast ?
 
-- Hal
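
The worst case raised in 2a can be pictured with a small sketch. It
assumes a plain breadth-first search over a switch adjacency matrix,
which is not necessarily the algorithm in the proposal document; all
names are hypothetical. The search widens one hop at a time from one
end of the lost link until it reaches the other end, and reports how
many switches it had to examine on the way.

```c
#define DEMO_MAX_SW 16

/* adj[a][b] != 0 means switches a and b share at least one link. */
struct demo_fabric {
	int n;
	unsigned char adj[DEMO_MAX_SW][DEMO_MAX_SW];
};

/* Mark the link between a and b as up (1) or down (0). */
void demo_link(struct demo_fabric *f, int a, int b, int up)
{
	f->adj[a][b] = f->adj[b][a] = (unsigned char)up;
}

/* Breadth-first search from src, one hop of depth at a time, until dst is
 * reached. Returns the hop count, or -1 if dst is unreachable. *visited
 * reports how many switches had to be examined. */
int demo_alt_path_hops(const struct demo_fabric *f, int src, int dst,
		       int *visited)
{
	int dist[DEMO_MAX_SW], queue[DEMO_MAX_SW];
	int head = 0, tail = 0, i;

	for (i = 0; i < f->n; i++)
		dist[i] = -1;
	dist[src] = 0;
	queue[tail++] = src;
	*visited = 0;

	while (head < tail) {
		int cur = queue[head++];

		(*visited)++;
		if (cur == dst)
			return dist[cur];
		for (i = 0; i < f->n; i++) {
			if (f->adj[cur][i] && dist[i] < 0) {
				dist[i] = dist[cur] + 1;
				queue[tail++] = i;
			}
		}
	}
	return -1;
}
```

In a four-switch ring, dropping the link between switches 0 and 1 makes
the search visit all four switches before it finds the three-hop detour,
which is the "ever expanding" cost the question is about.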



Re: [openib-general] [HELP] Encounter Kernel Panic when Add MellanoxHCA Supporting on 2.6.9 Kernel

2005-05-25 Thread Roland Dreier
    Lenber> Can I assume the problem is the HCA card? Or is the issue
    Lenber> related to the SMP platform? So strange..

It's possible it's the HCA but I'm not sure what could be wrong.  With
CONFIG_DEBUG_SPINLOCK can you get more of the traceback?  The BUG()
should be producing a full stack trace.

Thanks,
  Roland


[openib-general] Re: [PATCH] [kdapl CM] Fix endian conversions of service ID

2005-05-25 Thread Hal Rosenstock
On Wed, 2005-05-25 at 11:38, James Lentini wrote:
> halr> On Tue, 2005-05-24 at 15:20, James Lentini wrote:
> halr> > halr> [kdapl CM] Fix endian conversions of service ID
> halr> > halr> Problem pointed out by James Lentini
> halr> > halr>
> halr> > halr> Signed-off-by: Hal Rosenstock [EMAIL PROTECTED]
> halr> > halr>
> halr> > halr> Index: dapl_openib_cm.c
> halr> > halr> ===
> halr> > halr> --- dapl_openib_cm.c (revision 2468)
> halr> > halr> +++ dapl_openib_cm.c (working copy)
> halr> > halr> @@ -309,7 +309,7 @@
> halr> > halr> if (conn->dapl_path.mtu > IB_MTU_1024)
> halr> > halr> conn->dapl_path.mtu = IB_MTU_1024;
> halr> > halr>
> halr> > halr> -   conn->param.service_id = be64_to_cpu(conn->service_id);
> halr> > halr> +   conn->param.service_id = conn->service_id;
> halr> >
> halr> > With the change to dapl_ib_connect below, the conn->service_id is in
> halr> > CPU byte order at this point. The conn->param is an ib_cm_req_param
> halr> > structure. The comment describing this structure's service_id field
> halr> > says that it should be in network (big endian) byte order.
> halr> >
> halr> > So...
> halr> >
> halr> > halr> conn->param.primary_path = &conn->dapl_path;
> halr> > halr> conn->param.alternate_path = NULL;
> halr> > halr>
> halr> > halr> @@ -445,8 +445,7 @@
> halr> > halr> conn->param.local_cm_response_timeout =
> halr> > halr> DAPL_OPENIB_CM_RESPONSE_TIMEOUT;
> halr> > halr> conn->param.max_cm_retries = DAPL_OPENIB_MAX_CM_RETRIES;
> halr> > halr>
> halr> > halr> -   memcpy(&conn->service_id, &remote_conn_qual, sizeof
> halr> > halr> conn->service_id);
> halr> > halr> -
> halr> > halr> +   conn->service_id = be64_to_cpu(remote_conn_qual);
> halr> >
> halr> > ...that makes me think we should change the line above to
> halr> >
> halr> > conn->service_id = remote_conn_qual;
> halr> >
> halr> > and require that consumers specify their connection qualifier values
> halr> > in network byte order here and ...
> halr>
> halr> I think the convention OpenIB has been using is to supply parameters in
> halr> CPU endian but it can work either way.
>
> The comments in ib_cm.h say that service id parameters should be in
> network byte order. Are these incorrect?

No. I was referring to your alternative of where this is performed.

> halr>
> halr> > halr> conn->remote_ia_address = remote_ia_address;
> halr> > halr> conn->dapl_comp.fn = dapl_rt_comp_handler;
> halr> > halr> conn->dapl_comp.context = conn;
> halr> > halr> @@ -627,7 +626,7 @@
> halr> > halr> }
> halr> > halr>
> halr> > halr> status = ib_cm_listen(sp_ptr->cm_srvc_handle,
> halr> > halr> - be64_to_cpu(sp_ptr->conn_qual), 0);
> halr> > halr> + cpu_to_be64(sp_ptr->conn_qual), 0);
> halr> >
> halr> > ... do the same here. What do you think?
> halr>
> halr>
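
Both positions in this exchange agree that the service ID must reach the
CM in network byte order; the open question is only where the conversion
happens. A minimal user-space sketch of the two conversions follows. The
helper names are hypothetical and written with shifts so they behave the
same on little- and big-endian hosts; they are not the kernel's
cpu_to_be64() and be64_to_cpu().

```c
#include <stdint.h>
#include <string.h>

/* Host order to big-endian ("network") order: the most significant byte is
 * stored first in memory, whatever the host's native layout is. */
uint64_t demo_cpu_to_be64(uint64_t host)
{
	unsigned char b[8];
	uint64_t wire;
	int i;

	for (i = 0; i < 8; i++)
		b[i] = (unsigned char)(host >> (56 - 8 * i));
	memcpy(&wire, b, sizeof(wire));
	return wire;
}

/* Big-endian order back to host order: read the bytes most significant
 * first and rebuild the value arithmetically. */
uint64_t demo_be64_to_cpu(uint64_t wire)
{
	unsigned char b[8];
	uint64_t host = 0;
	int i;

	memcpy(b, &wire, sizeof(b));
	for (i = 0; i < 8; i++)
		host = (host << 8) | b[i];
	return host;
}
```

Whichever layer owns the conversion, it should be applied exactly once on
the way in; applying be64_to_cpu where cpu_to_be64 is meant only happens
to work because the two are the same byte swap on little-endian hosts.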



[openib-general] Re: Latest CM and retransmissions

2005-05-25 Thread Sean Hefty

Hal Rosenstock wrote:

DREQ 
---  DREP
DREQ 
DREQ 


Do you know if the code that issues the DREQ destroys the cm_id immediately 
afterwards?


- Sean


[openib-general] Re: Latest CM and retransmissions

2005-05-25 Thread Hal Rosenstock
On Wed, 2005-05-25 at 12:30, Sean Hefty wrote:
 Hal Rosenstock wrote:
  DREQ 
  ---  DREP
  DREQ 
  DREQ 
 
 Do you know if the code that issues the DREQ destroys the cm_id immediately 
 afterwards?

It likely is. I was going to ask about this yesterday.

-- Hal



Re: [openib-general] OOPS: ib_mad crashery on bootup

2005-05-25 Thread Sean Hefty
I've been able to hit an issue in the MAD layer that might be related.  I 
still do not know the root cause, however.


- Sean


cmpost: starting client
cmpost: connecting
cmpost: connect time: 4479000 us
cmpost: completing data transfers
cmpost: waiting to disconnect
cmpost: test complete
Unable to handle kernel paging request at virtual address 6b6b6b83
 printing eip:
f8ca36e2
*pde = 
Oops:  [#1]
SMP
Modules linked in: ib_cmpost ib_cm ib_sa ib_mthca ib_mad ib_core edd st 
sr_mod ide_cd cdrom thermal processor fan button battery ac e100 mii e1000 
hw_random uhci_hcd usbcore evdev reiserfs aic7xxx sd_mod scsi_mod

CPU:0
EIP:0060:[pg0+948262626/1069220864]Not tainted VLI
EIP:0060:[f8ca36e2]Not tainted VLI
EFLAGS: 00010292   (2.6.9)
EIP is at ib_mad_send_done_handler+0x12/0x120 [ib_mad]
eax: dfda185c   ebx: c5cf4790   ecx: c967855c   edx: d92e9f24
esi: 6b6b6b6b   edi: d92e9f24   ebp: c5cf4790   esp: d92e9f04
ds: 007b   es: 007b   ss: 0068
Process ib_mad1 (pid: 4493, threadinfo=d92e8000 task=f7738230)
Stack: 0001 d92e9f24 d810795c f6c6f5bc dfda185c d92e9f24 dfda185c f8ca3979
   c5cf4790    d92e9f58 c011e4f9  0402
   0006  0001f341 f7612c30 dfda18cc f7612c30 dfda18d0 c01305f8
Call Trace:
 [pg0+948263289/1069220864] ib_mad_completion_handler+0x89/0xa0 [ib_mad]
 [f8ca3979] ib_mad_completion_handler+0x89/0xa0 [ib_mad]
 [__wake_up+41/64] __wake_up+0x29/0x40
 [c011e4f9] __wake_up+0x29/0x40
 [worker_thread+424/560] worker_thread+0x1a8/0x230
 [c01305f8] worker_thread+0x1a8/0x230
 [pg0+948263152/1069220864] ib_mad_completion_handler+0x0/0xa0 [ib_mad]
 [f8ca38f0] ib_mad_completion_handler+0x0/0xa0 [ib_mad]
 [default_wake_function+0/16] default_wake_function+0x0/0x10
 [c011e460] default_wake_function+0x0/0x10
 [default_wake_function+0/16] default_wake_function+0x0/0x10
 [c011e460] default_wake_function+0x0/0x10
 [worker_thread+0/560] worker_thread+0x0/0x230
 [c0130450] worker_thread+0x0/0x230
 [kthread+136/176] kthread+0x88/0xb0
 [c0134128] kthread+0x88/0xb0
 [kthread+0/176] kthread+0x0/0xb0
 [c01340a0] kthread+0x0/0xb0
 [kernel_thread_helper+5/16] kernel_thread_helper+0x5/0x10
 [c0105275] kernel_thread_helper+0x5/0x10
Code: 00 6a 00 e8 11 ae 47 c7 5d eb 98 8d b4 26 00 00 00 00 8d bc 27 00 00 
00 00 55 57 56 53 83 ec 0c 89 54 24 04 8b 1a 89 dd 8b 73 08 8b 46 18 89 04 
24 eb 50 8d b6 00 00 00 00 8b 54 24 04 89 d8 e8
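
One possible reading of the register dump, offered as an assumption
rather than a diagnosis: esi holds 6b6b6b6b and the faulting address
6b6b6b83 is that value plus 0x18. 0x6b is the byte the kernel's slab
debugging writes over freed objects, so if slab debugging was enabled on
this kernel, the handler may have followed a pointer read from memory
that had already been freed. The arithmetic can be checked with the
hypothetical helpers below.

```c
#include <stdbool.h>
#include <stdint.h>

/* 0x6b is the byte slab debugging writes over freed objects. */
#define DEMO_POISON_FREE 0x6bU

/* True if every byte of a 32-bit register value is the free-poison byte. */
bool demo_is_free_poison(uint32_t reg)
{
	return reg == DEMO_POISON_FREE * 0x01010101U;
}

/* Address touched when dereferencing a member at the given offset. */
uint32_t demo_fault_addr(uint32_t base, uint32_t offset)
{
	return base + offset;
}
```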



[openib-general] Re: Latest CM and retransmissions

2005-05-25 Thread Sean Hefty

Hal Rosenstock wrote:

On Wed, 2005-05-25 at 12:30, Sean Hefty wrote:


Hal Rosenstock wrote:


DREQ 
   ---  DREP
DREQ 
DREQ 


Do you know if the code that issues the DREQ destroys the cm_id immediately 
afterwards?


It likely is. I was going to ask about this yesterday.


If a client destroys the cm_id immediately after sending a DREQ (before the 
DREP is received), the CM will transition the cm_id directly into the 
timewait state.  I've just committed a change to the CM to cancel the DREQ 
if the cm_id is destroyed.  Note that this won't result in the DREP matching 
with the DREQ, since the cm_id has been destroyed, but should prevent the 
DREQ from being resent, if this is indeed what is happening.  Can you pull 
the CM from 2485 and retest?


- Sean
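
The behaviour described above can be modelled with a toy state machine.
This is only a sketch with hypothetical names and fields, not the ib_cm
code: sending a DREQ arms a retransmission, and destroying the cm_id
before the DREP arrives cancels it and moves straight to timewait.

```c
#include <stdbool.h>

enum demo_cm_state { DEMO_ESTABLISHED, DEMO_DREQ_SENT, DEMO_TIMEWAIT };

struct demo_cm_id {
	enum demo_cm_state state;
	bool dreq_pending;   /* a DREQ is queued for retransmission */
	int  dreq_sent;      /* number of DREQs put on the wire */
};

/* Send the first DREQ and arm its retransmission. */
void demo_send_dreq(struct demo_cm_id *id)
{
	id->state = DEMO_DREQ_SENT;
	id->dreq_pending = true;
	id->dreq_sent++;
}

/* Retransmit timer: resend only while the DREQ is still pending. */
void demo_retry_timer(struct demo_cm_id *id)
{
	if (id->dreq_pending)
		id->dreq_sent++;
}

/* Destroy before the DREP arrives: cancel the outstanding DREQ, then go
 * straight to timewait. */
void demo_destroy(struct demo_cm_id *id)
{
	id->dreq_pending = false;
	id->state = DEMO_TIMEWAIT;
}
```

Without the cancel in demo_destroy(), each timer tick would put another
DREQ on the wire, matching the repeated DREQs in the trace.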


RE: [openib-general] Re: Latest CM and retransmissions

2005-05-25 Thread Fab Tillier
 From: Sean Hefty [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, May 25, 2005 10:40 AM
 
 Hal Rosenstock wrote:
  On Wed, 2005-05-25 at 12:30, Sean Hefty wrote:
 
 Hal Rosenstock wrote:
 
 DREQ 
 ---  DREP
 DREQ 
 DREQ 
 
 Do you know if the code that issues the DREQ destroys the cm_id
 immediately afterwards?
 
  It likely is. I was going to ask about this yesterday.
 
 If a client destroys the cm_id immediately after sending a DREQ (before
 the
 DREP is received), the CM will transition the cm_id directly into the
 timewait state.  I've just committed a change to the CM to cancel the DREQ
 if the cm_id is destroyed.  Note that this won't result in the DREP
 matching
 with the DREQ, since the cm_id has been destroyed, but should prevent the
 DREQ from being resent, if this is indeed what is happening.  Can you pull
 the CM from 2485 and retest

Why not just delay the transition into timewait until the DREP is received
or the DREQ times out?

- Fab



[openib-general] Re: [PATCH] [kdapl CM] Fix endian conversions of service ID

2005-05-25 Thread James Lentini


I committed the fix for this in revision 2486. I made one small 
simplification: I used the dapl_cm_id's param.service_id field 
throughout.


On Wed, 25 May 2005, Hal Rosenstock wrote:


On Wed, 2005-05-25 at 11:38, James Lentini wrote:

halr> On Tue, 2005-05-24 at 15:20, James Lentini wrote:
halr> > halr> [kdapl CM] Fix endian conversions of service ID
halr> > halr> Problem pointed out by James Lentini
halr> > halr>
halr> > halr> Signed-off-by: Hal Rosenstock [EMAIL PROTECTED]
halr> > halr>
halr> > halr> Index: dapl_openib_cm.c
halr> > halr> ===
halr> > halr> --- dapl_openib_cm.c (revision 2468)
halr> > halr> +++ dapl_openib_cm.c (working copy)
halr> > halr> @@ -309,7 +309,7 @@
halr> > halr> if (conn->dapl_path.mtu > IB_MTU_1024)
halr> > halr> conn->dapl_path.mtu = IB_MTU_1024;
halr> > halr>
halr> > halr> -   conn->param.service_id = be64_to_cpu(conn->service_id);
halr> > halr> +   conn->param.service_id = conn->service_id;
halr> >
halr> > With the change to dapl_ib_connect below, the conn->service_id is in
halr> > CPU byte order at this point. The conn->param is an ib_cm_req_param
halr> > structure. The comment describing this structure's service_id field
halr> > says that it should be in network (big endian) byte order.
halr> >
halr> > So...
halr> >
halr> > halr> conn->param.primary_path = &conn->dapl_path;
halr> > halr> conn->param.alternate_path = NULL;
halr> > halr>
halr> > halr> @@ -445,8 +445,7 @@
halr> > halr> conn->param.local_cm_response_timeout =
halr> > halr> DAPL_OPENIB_CM_RESPONSE_TIMEOUT;
halr> > halr> conn->param.max_cm_retries = DAPL_OPENIB_MAX_CM_RETRIES;
halr> > halr>
halr> > halr> -   memcpy(&conn->service_id, &remote_conn_qual, sizeof
halr> > halr> conn->service_id);
halr> > halr> -
halr> > halr> +   conn->service_id = be64_to_cpu(remote_conn_qual);
halr> >
halr> > ...that makes me think we should change the line above to
halr> >
halr> > conn->service_id = remote_conn_qual;
halr> >
halr> > and require that consumers specify their connection qualifier values
halr> > in network byte order here and ...
halr>
halr> I think the convention OpenIB has been using is to supply parameters in
halr> CPU endian but it can work either way.

The comments in ib_cm.h say that service id parameters should be in
network byte order. Are these incorrect?


No. I was referring to your alternative of where this is performed.


halr>
halr> > halr> conn->remote_ia_address = remote_ia_address;
halr> > halr> conn->dapl_comp.fn = dapl_rt_comp_handler;
halr> > halr> conn->dapl_comp.context = conn;
halr> > halr> @@ -627,7 +626,7 @@
halr> > halr> }
halr> > halr>
halr> > halr> status = ib_cm_listen(sp_ptr->cm_srvc_handle,
halr> > halr> - be64_to_cpu(sp_ptr->conn_qual), 0);
halr> > halr> + cpu_to_be64(sp_ptr->conn_qual), 0);
halr> >
halr> > ... do the same here. What do you think?
halr>
halr>





RE: [openib-general] RE: OpenSM Routing Scalability Proposal

2005-05-25 Thread Eitan Zahavi





Hi Hal,


All your points are valid. Especially the ones regarding the incremental algorithm for routing. Please see below.


Eitan Zahavi
Design Technology Director
Mellanox Technologies LTD
Tel:+972-4-9097208
Fax:+972-4-9593245
P.O. Box 586 Yokneam 20692 ISRAEL



 -----Original Message-----
 From: Hal Rosenstock [mailto:[EMAIL PROTECTED]]
 Sent: Wednesday, May 25, 2005 5:53 PM
 To: Eitan Zahavi
 Cc: 'openib-general@openib.org'
 Subject: Re: [openib-general] RE: OpenSM Routing Scalability Proposal
 
 Hi Eitan,
 
 On Tue, 2005-05-17 at 10:53, Eitan Zahavi wrote:
  Hi All,
 
  This is an updated proposal document for your comments.
 
 I finally got a chance to read this. Some comments below.
 
  The main change is in describing the need for preserving enough data
  to enable incremental routing algorithm.
 
 I think incremental can help but presents some new issues.
[EZ] Yes very true. I do not have a full algorithm in place that will cover all possible cases.
 
  So the actual proposal is to implement the algorithm described in
  section 4.3.
 
 4.1 (min hop) and 4.2 (up/down) are already implemented, right ?
[EZ] No they are not. The current implementation uses MinHop tables etc.
 
  EZ OpenSM Routing.pdf
 
 It seems like there are 2 parts to 4.3:
 1. Min hop table per leaf switch rather than per LID
 What are the savings for this ? Seems like in terms of memory, this is
 something like a divisor of L times the number of LIDs per HCA port.
[EZ] Yes. 
 Of course, switch port 0s on non leaf switches need to be accomodated.
[EZ] True.
 
 2. Incremental routing (5)
 a. Subcase of 5 where there is no other link between 2 adjacent
 switches. Is another way of stating this, examine next hop switches to
 see if there is a path between the 2 original switches and keep
 expanding the depth until 1 is found ? Couldn't this be worse from a
 compute standpoint than rerouting everything depending on the topology
 (the likelihood of another path between the 2 original switches) ?
[EZ] If one knows which ports have changed, this will be faster than a full recalculation.
 
 b. 5 asks How do we support topology changes line moving an HCA from
 one Switch to another? Also, what about a link moving from one switch
 to another ? It seems that link down is handled, but nothing is done on
 a link up. Doesn't there need to be incremental defined for links being
 added ?
[EZ] Yes. This is not even close to a full algorithm.
 
 c. Also, with incremental routing, it's unclear to me how the paths
 found would compare with the ones which would be determined from the
 full algorithm (from scratch). Also, would there be some point at which
 the full routing would be retriggered ?
[EZ] Good point.
 
 d. Clearly, there are end node responsibilities here as well (whether
 this is done incrementally or fully or something else).
[EZ] Not sure what you mean.
 
 3. Persistency (6)
 a. Full LFT storage (6.1) This presumes that the determination of a
 topology change upon discovery is cheaper computationally than running
 the routing. Has this been proven ? (I hope this is the case).
[EZ] Comparing two graphs is O(Links). 
 b. Root nodes storage (6.2) Are the root nodes determined by the routing
 or supplied to the routing ? Are they different for unicast and
 multicast ?
[EZ] Both is true. If not provided by a human, they are extracted using heuristics.
 
 -- Hal
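
The O(Links) claim for detecting a topology change can be illustrated
with a minimal sketch. It assumes, purely for illustration, that two
discovery sweeps list the same ports in the same order, so a single pass
over the link ends is enough; the struct layout and names are
hypothetical and not taken from OpenSM.

```c
#include <stddef.h>

/* One discovered link end: which remote node and port a local port reaches.
 * remote_node < 0 means the port is down. Hypothetical layout. */
struct demo_port {
	int remote_node;
	int remote_port;
};

/* Compare two discovery snapshots that list the same ports in the same
 * order. One pass over the ports, so the cost is linear in the number of
 * link ends. Returns the number of ports whose peer changed. */
size_t demo_count_changed(const struct demo_port *prev,
			  const struct demo_port *cur, size_t nports)
{
	size_t i, changed = 0;

	for (i = 0; i < nports; i++) {
		if (prev[i].remote_node != cur[i].remote_node ||
		    prev[i].remote_port != cur[i].remote_port)
			changed++;
	}
	return changed;
}
```

A zero result would mean the stored forwarding tables can be reused as
they are. Note that this form of comparison keeps both snapshots in
memory while it runs.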




[openib-general] Re: Latest CM and retransmissions

2005-05-25 Thread Hal Rosenstock
On Wed, 2005-05-25 at 13:39, Sean Hefty wrote: 
 Do you know if the code that issues the DREQ destroys the cm_id immediately 
 afterwards?
  
  It likely is. I was going to ask about this yesterday.
 
 If a client destroys the cm_id immediately after sending a DREQ (before the 
 DREP is received), the CM will transition the cm_id directly into the 
 timewait state.  I've just committed a change to the CM to cancel the DREQ 
 if the cm_id is destroyed.  Note that this won't result in the DREP matching 
 with the DREQ, since the cm_id has been destroyed, but should prevent the 
 DREQ from being resent, if this is indeed what is happening.  Can you pull 
 the CM from 2485 and retest

Yes, that's better :-) Only 1 DREQ/DREP. Thanks.

-- Hal



RE: [openib-general] RE: OpenSM Routing Scalability Proposal

2005-05-25 Thread Hal Rosenstock
On Wed, 2005-05-25 at 14:45, Eitan Zahavi wrote: 
 Hi Hal,
 
 All your points are valid. Especially the ones regarding the
 incremental algorithm for routing. Please see below.

One other question:
Is there an impact of LMC > 0 on this ?

 Eitan Zahavi
 Design Technology Director
 Mellanox Technologies LTD
 Tel:+972-4-9097208
 Fax:+972-4-9593245
 P.O. Box 586 Yokneam 20692 ISRAEL
 
 
  --Original Message--
  From: Hal Rosenstock [mailto:[EMAIL PROTECTED]
  Sent: Wednesday, May 25, 2005 5:53 PM
  To: Eitan Zahavi
  Cc: 'openib-general@openib.org'
  Subject: Re: [openib-general] RE: OpenSM Routing Scalability
 Proposal
  
  Hi Eitan,
  
  On Tue, 2005-05-17 at 10:53, Eitan Zahavi wrote:
   Hi All,
  
   This is an updated proposal document for your comments.
  
  I finally got a chance to read this. Some comments below.
  
   The main change is in describing the need for preserving enough
 data
   to enable incremental routing algorithm.
  
  I think incremental can help but presents some new issues.
 [EZ] Yes very true. I do not have a full algorithm in place that will
 cover all possible cases.
  
   So the actual proposal is to implement the algorithm described in
   section 4.3.
  
  4.1 (min hop) and 4.2 (up/down) are already implemented, right ?
 [EZ] No they are not. The current implementation uses MinHop tables
 etc.
 

I'm not sure I'm following you. Are you saying min hop is implemented
and up/down isn't (just analyzed) ?

   EZ OpenSM Routing.pdf
  
  It seems like there are 2 parts to 4.3:
  1. Min hop table per leaf switch rather than per LID
  What are the savings for this ? Seems like in terms of memory, this
 is
  something like a divisor of L times the number of LIDs per HCA port.
 [EZ] Yes. 
  Of course, switch port 0s on non leaf switches need to be
 accomodated.
 [EZ] True.
  
  2. Incremental routing (5)
  a. Subcase of 5 where there is no other link between 2 adjacent
  switches. Is another way of stating this, examine next hop switches
 to
  see if there is a path between the 2 original switches and keep
  expanding the depth until 1 is found ? Couldn't this be worse from a
  compute standpoint than rerouting everything depending on the
 topology
  (the likelihood of another path between the 2 original switches) ?
 [EZ] If one knows which ports have changed this will be faster then
 full recalc.

Sure but if the depth keeps expanding because no path is found between
the switches which lost a trunk link between them, then isn't the
calculation done on an ever expanding horizon of switches ? That was the
case I was referring to.
 
  b. 5 asks How do we support topology changes line moving an HCA
 from
  one Switch to another? Also, what about a link moving from one
 switch
  to another ? It seems that link down is handled, but nothing is done
 on
  a link up. Doesn't there need to be incremental defined for links
 being
  added ?
 [EZ] Yes. This is not even close to full algorithm.
  
  c. Also, with incremental routing, it's unclear to me how the paths
  found would compare with the ones which would be determined from the
  full algorithm (from scratch). Also, would there be some point at
 which
  the full routing would be retriggered ?
 [EZ] Good point.
  
  d. Clearly, there are end node responsibilities here as well
 (whether
  this is done incrementally or fully or something else).
 [EZ] Not sure what you mean.

I'm referring to path changes and their implications on connections.
 
  3. Persistency (6)
  a. Full LFT storage (6.1) This presumes that the determination of a
  topology change upon discovery is cheaper computationally than
 running
  the routing. Has this been proven ? (I hope this is the case).
 [EZ] Comparing two graphs is O(Links).

OK. I hope this does not need twice the memory.

  b. Root nodes storage (6.2) Are the root nodes determined by the
 routing
  or supplied to the routing ? Are they different for unicast and
  multicast ?
 [EZ] Both is true. If not provided by human extracted using
 heuristics.

By "both is true", do you mean that the root nodes can either be
determined by the routing or supplied to the routing ?

Are the unicast and multicast roots the same or different ?

-- Hal



RE: [openib-general] RE: OpenSM Routing Scalability Proposal

2005-05-25 Thread Eitan Zahavi





 
 On Wed, 2005-05-25 at 14:45, Eitan Zahavi wrote:
  Hi Hal,
 
  All your points are valid. Especially the ones regarding the
  incremental algorithm for routing. Please see below.
 
 One other question:
 Is there an impact of LMC > 0 on this ?


[EZ] If LMC > 0 then the proposed algorithm for calculating min hop tables (step 1) is going to be even faster than today's implementation.

 
  
   Hi Eitan,
  
   On Tue, 2005-05-17 at 10:53, Eitan Zahavi wrote:
Hi All,
   
This is an updated proposal document for your comments.
  
   I finally got a chance to read this. Some comments below.
  
The main change is in describing the need for preserving enough
  data
to enable incremental routing algorithm.
  
   I think incremental can help but presents some new issues.
  [EZ] Yes very true. I do not have a full algorithm in place that will
  cover all possible cases.
  
So the actual proposal is to implement the algorithm described in
section 4.3.
  
   4.1 (min hop) and 4.2 (up/down) are already implemented, right ?
  [EZ] No they are not. The current implementation uses MinHop tables
  etc.
 
 
 I'm not sure I'm following you. Are you saying min hop is implemented
 and up/down isn't (just analyzed) ?
[EZ] No - I use the term min hop for the first stage of the routing.
Today this first stage generates a different kind of table than the proposed one, and it does so using a different traversal algorithm.

 
EZ OpenSM Routing.pdf
  
   It seems like there are 2 parts to 4.3:
   1. Min hop table per leaf switch rather than per LID
   What are the savings for this ? Seems like in terms of memory, this
  is
   something like a divisor of L times the number of LIDs per HCA port.
  [EZ] Yes.
   Of course, switch port 0s on non leaf switches need to be
  accomodated.
  [EZ] True.
  
   2. Incremental routing (5)
   a. Subcase of 5 where there is no other link between 2 adjacent
   switches. Is another way of stating this, examine next hop switches
  to
   see if there is a path between the 2 original switches and keep
   expanding the depth until 1 is found ? Couldn't this be worse from a
   compute standpoint than rerouting everything depending on the
  topology
   (the likelihood of another path between the 2 original switches) ?
  [EZ] If one knows which ports have changed this will be faster than a
  full recalc.
 
 Sure but if the depth keeps expanding because no path is found between
 the switches which lost a trunk link between them, then isn't the
 calculation done on an ever expanding horizon of switches ? That was the
 case I was referring to.
[EZ] But not all switches need to be recomputed.
 
   b. 5 asks How do we support topology changes like moving an HCA
  from
   one Switch to another? Also, what about a link moving from one
  switch
   to another ? It seems that link down is handled, but nothing is done
  on
   a link up. Doesn't there need to be incremental defined for links
  being
   added ?
  [EZ] Yes. This is not even close to full algorithm.
  
   c. Also, with incremental routing, it's unclear to me how the paths
   found would compare with the ones which would be determined from the
   full algorithm (from scratch). Also, would there be some point at
  which
   the full routing would be retriggered ?
  [EZ] Good point.
  
   d. Clearly, there are end node responsibilities here as well
  (whether
   this is done incrementally or fully or something else).
  [EZ] Not sure what you mean.
 
 I'm referring to path changes and their implications on connections.
[EZ] I assume the QP has already timed out.
 
   3. Persistency (6)
   a. Full LFT storage (6.1) This presumes that the determination of a
   topology change upon discovery is cheaper computationally than
  running
   the routing. Has this been proven ? (I hope this is the case).
  [EZ] Comparing two graphs is O(Links).
 
 OK. I hope not twice the memory is needed for this.
[EZ] The memory involved with keeping the connectivity is small compared to the routing data and various other tables (PKey SL2VL...)
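
A minimal sketch of why the comparison can be O(Links), assuming each discovery sweep produces a list of links sorted by (GUID, port). The `struct link` layout and the `topo_diff` name are hypothetical and are not taken from OpenSM; the point is only that one merge pass over two sorted lists touches each link once.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical link record: one end of a cable and its peer. */
struct link {
        uint64_t guid;       /* node GUID of the local end */
        uint8_t  port;       /* local port number */
        uint64_t peer_guid;  /* node GUID of the remote end */
        uint8_t  peer_port;  /* remote port number */
};

/* Total order on links so two sorted lists can be merged. */
static int link_cmp(const struct link *a, const struct link *b)
{
        if (a->guid != b->guid)
                return a->guid < b->guid ? -1 : 1;
        if (a->port != b->port)
                return a->port < b->port ? -1 : 1;
        if (a->peer_guid != b->peer_guid)
                return a->peer_guid < b->peer_guid ? -1 : 1;
        if (a->peer_port != b->peer_port)
                return a->peer_port < b->peer_port ? -1 : 1;
        return 0;
}

/*
 * Count links present in only one of the two sorted lists.
 * A single merge pass: O(n_old + n_new), i.e. O(Links).
 */
size_t topo_diff(const struct link *old, size_t n_old,
                 const struct link *cur, size_t n_cur)
{
        size_t i = 0, j = 0, changed = 0;

        while (i < n_old && j < n_cur) {
                int c = link_cmp(&old[i], &cur[j]);

                if (c == 0) {
                        i++;
                        j++;
                } else if (c < 0) {
                        changed++;      /* link disappeared */
                        i++;
                } else {
                        changed++;      /* link appeared */
                        j++;
                }
        }
        /* Whatever is left on either side exists in one list only. */
        return changed + (n_old - i) + (n_cur - j);
}
```

A result of 0 would mean the stored LFTs could be reused as they are. The sketch keeps only the link list from the previous sweep, which is in line with the remark that connectivity data is small compared to the routing tables.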

 
   b. Root nodes storage (6.2) Are the root nodes determined by the
  routing
   or supplied to the routing ? Are they different for unicast and
   multicast ?
  [EZ] Both is true. If not provided by human extracted using
  heuristics.
 
 By both is true, do you mean that the root nodes can either be
 determined by the routing or supplied to the routing ?
[EZ] OpenSM can take roots from file or calculate them using some heuristics.
 
 Are the unicast and multicast roots the same or different ?
[EZ] Multicast roots are calculated only. If you have a small group then the roots will be the lowest level in the tree that fits all the members.

 
 -- Hal




[openib-general] [PATCH] [TRIVIAL] [kdapl CM] Change messages to be consistent with routine names

2005-05-25 Thread Hal Rosenstock
[kdapl CM] Change messages to be consistent with routine names

Signed-off-by: Hal Rosenstock [EMAIL PROTECTED]

Index: dapl_openib_cm.c
===
--- dapl_openib_cm.c(revision 2486)
+++ dapl_openib_cm.c(working copy)
@@ -78,7 +78,7 @@
int status;
 
if (conn->ep->qp_handle == NULL) {
-   dapl_dbg_log(DAPL_DBG_TYPE_ERR, " do_rep_recv: invalid qp "
+   dapl_dbg_log(DAPL_DBG_TYPE_ERR, " dapl_rep_recv: invalid qp "
 "handle\n");
goto disc;
}
@@ -86,7 +86,7 @@
/* First, transition QP to RTR */
status = dapl_modify_qp_state_to_rtr(conn->cm_id, conn->ep->qp_handle);
if (status) {
-   dapl_dbg_log(DAPL_DBG_TYPE_ERR, " do_rep_recv: could not "
+   dapl_dbg_log(DAPL_DBG_TYPE_ERR, " dapl_rep_recv: could not "
 "modify QP state to RTR status %d\n", status);
goto disc;
}
@@ -94,14 +94,14 @@
/* Now, transition QP to RTS */
status = dapl_modify_qp_state_to_rts(conn->cm_id, conn->ep->qp_handle);
if (status) {
-   dapl_dbg_log(DAPL_DBG_TYPE_ERR, " do_rep_recv: could not "
+   dapl_dbg_log(DAPL_DBG_TYPE_ERR, " dapl_rep_recv: could not "
 "modify QP state to RTS status %d\n", status);
goto disc;
}
 
status = ib_send_cm_rtu(conn->cm_id, NULL, 0);
if (status) {
-   dapl_dbg_log(DAPL_DBG_TYPE_ERR, " do_rep_recv: ib_send_cm_rtu "
+   dapl_dbg_log(DAPL_DBG_TYPE_ERR, " dapl_rep_recv: ib_send_cm_rtu "
 "failed: %d\n", status);
goto disc;
}
@@ -181,7 +181,7 @@
 
status = dapl_modify_qp_state_to_rts(conn->cm_id, conn->ep->qp_handle);
if (status) {
-   dapl_dbg_log(DAPL_DBG_TYPE_ERR, " do_rtu_recv: could not "
+   dapl_dbg_log(DAPL_DBG_TYPE_ERR, " dapl_rtu_recv: could not "
 "modify QP state to RTS status %d\n", status);
goto reject;
}




[openib-general] Re: [PATCH] pingpong test: zero-initialize all attributes

2005-05-25 Thread Roland Dreier
Thanks, good catch.  I fixed your patch so that it compiles (including
with gcc-2.95) and committed it.

 - R.


[openib-general] Re: Re: [PATCH] (repost) sdp: replace mlock with get_user_pages

2005-05-25 Thread Michael S. Tsirkin
Quoting r. Libor Michalek [EMAIL PROTECTED]:
 Subject: Re: Re: [PATCH] (repost) sdp: replace mlock with get_user_pages
 
 On Fri, May 13, 2005 at 04:51:45PM +0300, Michael S. Tsirkin wrote:
  Quoting r. Roland Dreier [EMAIL PROTECTED]:
   
   Libor   Always call aio_complete() immediately when
   Libor iocb_complete() is called, and only spawn the work thread
   Libor to unlock the memory after the aio_complete() call. The
   Libor patch is below.
  
  Libor, I don't think it's a good idea - this will break other assumptions,
  like the assumption that the task mm isn't destroyed before we unlock
  the memory.
 
   That's a good point.
 
   Another alternative would be to always complete aios asynchronously,
   which should preserve the order.  I guess this would hurt latency for
   small ios...
  
  To avoid hurting latency, let's count the number of outstanding
  asynchronous AIOs, and if there are asynchronous AIOs complete
  all of them asynchronously.
  
  Does this make sense?
 
   Yes, except that the current iocb code does not reference individual
 sockets anywhere, and do_iocb_complete would have to be the function
 which decremented the per connection counter of outstanding AIOs. Also,
 since we don't have to do get_user_pages a second time on the send
 path, this would only need to be done on the recv path.
 
 -Libor
 

I thought about this some more: what if we set users to 1 before
releasing the irq spinlock, and call sdp_conn_unlock in thread context
after completing the aio iocb?
Any synchronous transfer would then wait till the socket is unlocked.




-- 
MST - Michael S. Tsirkin


[openib-general] Re: [PATCH] kDAPL: consolidate rmr files into one

2005-05-25 Thread James Lentini

Committed in revision 2488.

On Wed, 25 May 2005, Tom Duffy wrote:

tduffy Signed-off-by: Tom Duffy [EMAIL PROTECTED]
tduffy 
tduffy Index: linux-kernel-rmr/dat-provider/dapl_rmr_create.c
tduffy ===
tduffy --- linux-kernel-rmr/dat-provider/dapl_rmr_create.c (revision 2483)
tduffy +++ linux-kernel-rmr/dat-provider/dapl_rmr_create.c (working copy)
tduffy @@ -1,89 +0,0 @@
tduffy -/*
tduffy - * Copyright (c) 2002-2005, Network Appliance, Inc. All rights 
reserved.
tduffy - *
tduffy - * This Software is licensed under one of the following licenses:
tduffy - *
tduffy - * 1) under the terms of the Common Public License 1.0 a copy of 
which is
tduffy - *available from the Open Source Initiative, see
tduffy - *http://www.opensource.org/licenses/cpl.php.
tduffy - *
tduffy - * 2) under the terms of the The BSD License a copy of which is
tduffy - *available from the Open Source Initiative, see
tduffy - *http://www.opensource.org/licenses/bsd-license.php.
tduffy - *
tduffy - * 3) under the terms of the GNU General Public License (GPL) Version 
2 a
tduffy - *copy of which is available from the Open Source Initiative, see
tduffy - *http://www.opensource.org/licenses/gpl-license.php.
tduffy - *
tduffy - * Licensee has the right to choose one of the above licenses.
tduffy - *
tduffy - * Redistributions of source code must retain the above copyright
tduffy - * notice and one of the license notices.
tduffy - *
tduffy - * Redistributions in binary form must reproduce both the above 
copyright
tduffy - * notice, one of the license notices in the documentation
tduffy - * and/or other materials provided with the distribution.
tduffy - */
tduffy -
tduffy -/*
tduffy - * $Id$
tduffy - */
tduffy -
tduffy -#include "dapl_rmr_util.h"
tduffy -#include "dapl_openib_util.h"
tduffy -
tduffy -/*
tduffy - * dapl_rmr_create
tduffy - *
tduffy - * Create a remote memory region for the specified protection zone
tduffy - *
tduffy - * Input:
tduffy - * pz_handle
tduffy - *
tduffy - * Output:
tduffy - * rmr_handle
tduffy - *
tduffy - * Returns:
tduffy - * DAT_SUCCESS
tduffy - * DAT_INSUFFICIENT_RESOURCES
tduffy - * DAT_INVALID_PARAMETER
tduffy - */
tduffy -u32 dapl_rmr_create(DAT_PZ_HANDLE pz_handle, DAT_RMR_HANDLE 
*rmr_handle)
tduffy -{
tduffy -   struct dapl_pz *pz;
tduffy -   struct dapl_rmr *rmr;
tduffy -   u32 dat_status = DAT_SUCCESS;
tduffy -
tduffy -   if (DAPL_BAD_HANDLE(pz_handle, DAPL_MAGIC_PZ)) {
tduffy -   dat_status =
tduffy -   DAT_ERROR(DAT_INVALID_HANDLE, 
DAT_INVALID_HANDLE_PZ);
tduffy -   goto bail;
tduffy -   }
tduffy -
tduffy -   pz = (struct dapl_pz *)pz_handle;
tduffy -
tduffy -   rmr = dapl_rmr_alloc(pz);
tduffy -
tduffy -   if (rmr == NULL) {
tduffy -   dat_status =
tduffy -   DAT_ERROR(DAT_INSUFFICIENT_RESOURCES, 
DAT_RESOURCE_MEMORY);
tduffy -   goto bail;
tduffy -   }
tduffy -
tduffy -   dat_status = dapl_ib_mw_alloc(rmr);
tduffy -
tduffy -   if (dat_status != DAT_SUCCESS) {
tduffy -   dapl_rmr_dealloc(rmr);
tduffy -   dat_status =
tduffy -   DAT_ERROR(DAT_INSUFFICIENT_RESOURCES,
tduffy - DAT_RESOURCE_MEMORY_REGION);
tduffy -   goto bail;
tduffy -   }
tduffy -
tduffy -   atomic_inc(&pz->pz_ref_count);
tduffy -
tduffy -   *rmr_handle = rmr;
tduffy -
tduffy -bail:
tduffy -   return dat_status;
tduffy -}
tduffy Index: linux-kernel-rmr/dat-provider/Makefile
tduffy ===
tduffy --- linux-kernel-rmr/dat-provider/Makefile  (revision 2483)
tduffy +++ linux-kernel-rmr/dat-provider/Makefile  (working copy)
tduffy @@ -76,11 +61,7 @@ PROVIDER_MODULES := \
tduffy  dapl_psp_query \
tduffy  dapl_pz \
tduffy  dapl_ring_buffer_util  \
tduffy -dapl_rmr_bind  \
tduffy -dapl_rmr_create\
tduffy -dapl_rmr_free  \
tduffy -dapl_rmr_query \
tduffy -dapl_rmr_util  \
tduffy +dapl_rmr   \
tduffy  dapl_rsp_create\
tduffy  dapl_rsp_free  \
tduffy  dapl_rsp_query \
tduffy @@ -98,5 +79,25 @@ PROVIDER_MODULES := \
tduffy Index: linux-kernel-rmr/dat-provider/dapl_rmr_free.c
tduffy ===
tduffy --- linux-kernel-rmr/dat-provider/dapl_rmr_free.c   (revision 2483)
tduffy +++ linux-kernel-rmr/dat-provider/dapl_rmr_free.c   (working copy)
tduffy @@ -1,84 +0,0 @@
tduffy -/*
tduffy - * Copyright (c) 2002-2005, Network Appliance, Inc. All rights 
reserved.
tduffy - *
tduffy - * This Software is licensed under 

[openib-general] Re: [PATCH] remove redundant check in mthca_provider.c

2005-05-25 Thread Roland Dreier
James Did this patch get lost in the shuffle? Is the proposed
James change incorrect?

Sorry, I just missed it the first time around... I just applied it.

 - R.


RE: [openib-general] [HELP] Encounter Kernel Panic when Add MellanoxHCA Supporting on 2.6.9 Kernel

2005-05-25 Thread Bob Woodruff
 

Lenber Can I assume it is a problem with the HCA card? Or is the issue
Lenber related to the SMP platform? So strange..

Roland It's possible it's the HCA but I'm not sure what could be wrong. With
Roland CONFIG_DEBUG_SPINLOCK can you get more of the traceback?  The BUG()
Roland should be producing a full stack trace.

Ok, I was able to reproduce this error on an IA32 system, running the 
redhat 2.6.9-5.EL (UP kernel) with the IB patches applied. It turned out
to be a problem with the HCA card that had older firmware, 3.0.1.

I updated the firmware to 3.3.2 and the system booted OK and everything 
seems to work fine. 

woody



Re: [openib-general] [HELP] Encounter Kernel Panic when Add MellanoxHCA Supporting on 2.6.9 Kernel

2005-05-25 Thread Roland Dreier
Bob Ok, I was able to reproduce this error on an IA32 system,
Bob running the redhat 2.6.9-5.EL (UP kernel) with the IB patches
Bob applied. It turned out to be a problem with the HCA card that
Bob had older firmware, 3.0.1.

Did you get any kind of stack dump or traceback?

We really shouldn't panic on downrev FW, so I'd like to get to the
bottom of this.

 - R.


RE: [openib-general] [HELP] Encounter Kernel Panic when Add MellanoxHCA Supporting on 2.6.9 Kernel

2005-05-25 Thread Woodruff, Robert J
 
Bob Ok, I was able to reproduce this error on an IA32 system,
Bob running the redhat 2.6.9-5.EL (UP kernel) with the IB patches
Bob applied. It turned out to be a problem with the HCA card that
Bob had older firmware, 3.0.1.

Roland Did you get any kind of stack dump or traceback?

We really shouldn't panic on downrev FW, so I'd like to get to the
bottom of this.

 - R.

Unfortunately not, I did not have CONFIG_DEBUG_SPINLOCK set and 
I did not save the old firmware before loading in the new firmware.

Perhaps Lenber can get the traceback info before he updates his card.
That would be helpful as I agree it is not desirable to panic
on cards with old firmware.

woody





Re: [openib-general] [HELP] Encounter Kernel Panic when Add MellanoxHCA Supporting on 2.6.9 Kernel

2005-05-25 Thread Roland Dreier
Robert Unfortunately not, I did not have CONFIG_DEBUG_SPINLOCK
Robert set and I did not save the old firmware before loading in
Robert the new firmware.

I'll try to build an old fw image and see if I can reproduce it here.

 - R.


Re: [openib-general] [Fwd: RE: Use of dapl_ring_buffer in KDAPL !]

2005-05-25 Thread Caitlin Bestler
My recollection is that one of the original design factors
for the ring_buffers was that they are invoked from common
code (i.e., user or kernel mode). In user mode, the atomics
can be more efficient than safely updating two linked lists
in a multi-threaded environment.

That benefit may have been largely negated by the decision
to use dynamic memory allocation though.

The third reason was to minimize OS dependencies, which
is obviously no longer a factor when coding inside the kernel.

In any event, if none of the code using these routines will
ever be compiled in user mode then there are no remaining
motivations for a specialized ring buffer. If any of the code
will still be common between uDAPL and kDAPL then care
should be taken to ensure that the ring buffer substitution
is transparent to the caller and/or works in user mode.

On 5/25/05, Tom Duffy [EMAIL PROTECTED] wrote:
 Moving this discussion on list since more people might know of a way to
 do this with already existing linux primitives.
 
 -tduffy
 
 
 
 -- Forwarded message --
 From: James Lentini [EMAIL PROTECTED]
 To: Tom Duffy [EMAIL PROTECTED]
 Date: Wed, 25 May 2005 15:05:42 -0400 (EDT)
 Subject: RE: Use of dapl_ring_buffer in KDAPL !
 
 The original idea was that the ring buffer would perform better than
 other data structures because it used atomic operations. I don't
 believe that this theory was ever validated though.
 
 Two ring buffers were used because there are two different classes of
 events stored: free events and pending events.
 
 If there is a native linux data structure that provides equivalent
 functionality, dapl should use it.
 
 james
 
 On Tue, 24 May 2005, Tom Duffy wrote:
 
  On Wed, 2005-05-25 at 00:09 +0300, Itamar Rabenstein wrote:
  I just realized that we can implement it with 2 lists, empty_list and
  events_list, where the events_list will be popped from the head and
  pushed at the tail
 
  no need for dapl_ring_buffer
 
  What do you think ?
 
 
  Since the lists in Linux are doubly linked and circular and they have
  the ability to act like a stack or a queue, I think they should suffice.
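
  A minimal user-space sketch of the two-list idea, assuming a fixed pool
  of event slots. The tiny list helpers below only imitate the shape of
  the kernel's circular doubly linked lists; they are not the kernel
  implementation, and the `evd_*` names, `EVD_QLEN`, and the `int` payload
  are hypothetical. Locking is deliberately left out; kernel code would
  need a spinlock (or similar) around each operation.

```c
#include <stddef.h>

/* Circular doubly linked list node, shaped like the kernel's list_head. */
struct list_node {
        struct list_node *next, *prev;
};

static void list_init(struct list_node *h)
{
        h->next = h->prev = h;
}

/* Insert n just before the head, i.e. at the tail of the list. */
static void list_push_tail(struct list_node *n, struct list_node *h)
{
        n->prev = h->prev;
        n->next = h;
        h->prev->next = n;
        h->prev = n;
}

/* Remove and return the first entry, or NULL if the list is empty. */
static struct list_node *list_pop_head(struct list_node *h)
{
        struct list_node *n = h->next;

        if (n == h)
                return NULL;
        h->next = n->next;
        n->next->prev = h;
        n->next = n->prev = n;
        return n;
}

#define EVD_QLEN 4      /* hypothetical queue depth */

struct event {
        struct list_node node;  /* first member, so a node pointer is an event pointer */
        int data;
};

struct evd {
        struct list_node free_list;     /* unused event slots */
        struct list_node pending_list;  /* posted, not yet dequeued (FIFO) */
        struct event slots[EVD_QLEN];
};

void evd_init(struct evd *e)
{
        size_t i;

        list_init(&e->free_list);
        list_init(&e->pending_list);
        for (i = 0; i < EVD_QLEN; i++)
                list_push_tail(&e->slots[i].node, &e->free_list);
}

/* Take a free slot, fill it, queue it at the tail. Returns -1 when full. */
int evd_post(struct evd *e, int data)
{
        struct list_node *n = list_pop_head(&e->free_list);

        if (n == NULL)
                return -1;
        ((struct event *)n)->data = data;
        list_push_tail(n, &e->pending_list);
        return 0;
}

/* Pop the oldest pending event and recycle its slot. Returns -1 when empty. */
int evd_dequeue(struct evd *e, int *data)
{
        struct list_node *n = list_pop_head(&e->pending_list);

        if (n == NULL)
                return -1;
        *data = ((struct event *)n)->data;
        list_push_tail(n, &e->free_list);
        return 0;
}
```

  The free list plays the role of the first ring buffer and the pending
  list the role of the second, so the caller-visible behaviour (bounded
  FIFO, no allocation on the fast path) stays the same in this sketch.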
 
  -tduffy
 
  P.S.  Any reason not to CC the list?
 
 
 
 



[openib-general] Re: Re: [PATCH] (repost) sdp: replace mlock with get_user_pages

2005-05-25 Thread Libor Michalek
On Wed, May 25, 2005 at 11:21:28PM +0300, Michael S. Tsirkin wrote:
 Quoting r. Libor Michalek [EMAIL PROTECTED]:
  Subject: Re: Re: [PATCH] (repost) sdp: replace mlock with get_user_pages
  
  On Fri, May 13, 2005 at 04:51:45PM +0300, Michael S. Tsirkin wrote:
   Quoting r. Roland Dreier [EMAIL PROTECTED]:

Libor   Always call aio_complete() immediately when
Libor iocb_complete() is called, and only spawn the work thread
Libor to unlock the memory after the aio_complete() call. The
Libor patch is below.
   
   Libor, I don't think it's a good idea - this will break other assumptions,
   like the assumption that the task mm isn't destroyed before we unlock
   the memory.
  
That's a good point.
  
Another alternative would be to always complete aios asynchronously,
which should preserve the order.  I guess this would hurt latency for
small ios...
   
   To avoid hurting latency, let's count the number of outstanding
   asynchronous AIOs, and if there are asynchronous AIOs complete
   all of them asynchronously.
   
   Does this make sense?
  
Yes, except that the current iocb code does not reference individual
  sockets anywhere, and do_iocb_complete would have to be the function
  which decremented the per connection counter of outstanding AIOs. Also,
  since we don't have to do get_user_pages a second time on the send
  path, this would only need to be done on the recv path.
 
 I thought about this some more: what if we set users to 1 before
 releasing the irq spinlock, and call sdp_conn_unlock in thread context
 after completing the aio iocb?
 Any synchronous transfer would then wait till the socket is unlocked.

  So use 'users' as a reference count, and basically increment the
lock before spawning do_iocb_complete. do_iocb_complete would then
unlock one reference and the function calling iocb_complete would 
unlock the other reference? This could work... Remember that it's
possible for many do_iocb_complete functions to be in flight for a
given connection. 
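
  A rough user-space sketch of that reading of 'users' as a reference
  count, assuming C11 atomics stand in for whatever the connection lock
  really uses. The `struct conn` layout and the `conn_*` names are
  hypothetical and are not the SDP code; the `unlocked` flag merely marks
  the point where the real unlock work would run.

```c
#include <stdatomic.h>

/* Hypothetical, stripped-down connection: only the lock accounting. */
struct conn {
        atomic_int users;   /* holders of the connection lock */
        int unlocked;       /* set once the last holder lets go */
};

/* The caller takes the lock: one reference. */
void conn_lock_init(struct conn *c)
{
        atomic_init(&c->users, 1);
        c->unlocked = 0;
}

/* Taken once per deferred completion before the work is spawned. */
void conn_hold(struct conn *c)
{
        atomic_fetch_add(&c->users, 1);
}

/*
 * Dropped by the caller and by each deferred completion.
 * atomic_fetch_sub returns the previous value, so seeing 1 means
 * this was the last reference and the unlock can really happen.
 */
void conn_put(struct conn *c)
{
        if (atomic_fetch_sub(&c->users, 1) == 1)
                c->unlocked = 1;
}
```

  With several deferred completions in flight, each one holds its own
  reference, so the connection would stay locked until the last of them
  and the original caller have all dropped theirs.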

  FYI, the problem I saw occurred when the iocb_complete was called
during the sdp_conn_unlock CQ poll, which is done with IRQs disabled.

-Libor


[openib-general] OpenIB and OpenRDMA: Convergence on common RDMA APIs and ULPs for Linux

2005-05-25 Thread Venkata Jagana

I would like to start a discussion around the convergence of RDMA APIs and ULPs
between OpenIB and OpenRDMA projects.

As you all know, InfiniBand and iWARP based RNICs provide RDMA capabilities that
both kernel and user based applications can take advantage of through
standards based RDMA APIs such as DAPL and IT-API (v1/v2).

There exists a set of upper layer protocols, such as NFS, SRP/iSER, and SDP, which are mostly
kernel based, and there are also user based middleware and applications such as DB2, Oracle, and scientific
applications, which would like to use a common set of APIs supported by the underlying
operating system in order to work over different RDMA fabrics like IB and RNICs.

From a Linux kernel perspective, it is undesirable to have different sets of APIs and ULPs
supported, for a variety of reasons including but not limited to duplication and testing effort.
OpenIB and OpenRDMA are separate efforts, each actively working along its own path
to develop the corresponding RDMA support in Linux, but we want to make sure
we work together to avoid duplication in providing the support.

The proposal for both communities is to start thinking about and discussing how best
we could accomplish this commonality between the two projects. To make the objective
clearer: this proposal is not about merging the two projects, since each project
has its own objective of supporting its RDMA function; rather, it is intended to steer both
projects toward the goal of standardizing RDMA APIs and providing common ULPs as applicable.

However, we also have a challenge to address in implementing these common ULPs and APIs
since OpenIB is currently using verbs PI for Linux defined through an open source process and 
OpenRDMA is currently defining RNIC-PI (supporting RNIC and IB compatible verbs) for Linux 
based on the industry standard evolving through Opengroup/ICSC and open source community reviews.

The ultimate challenge for us is to come up with a common PI acceptable in Linux while
taking into account the standards, hardware vendors portability for device drivers, ULPs etc.

Thanks,
Venkat

Re: [openib-general] OpenIB and OpenRDMA: Convergence on common RDMA APIs and ULPs for Linux

2005-05-25 Thread Roland Dreier
I believe the way forward is to evolve the existing drivers/infiniband
code already in Linux into a drivers/rdma that supports both IB and
RNICs.  To be extremely blunt, I believe the RNIC-PI is irrelevant to
the Linux kernel -- no IB vendors will support ripping out a working
midlayer and starting from scratch, and it doesn't make sense to have
two essentially equivalent midlayers in the same kernel.

To put a really concrete proposal on the table, I would suggest to
start by extending the current ib_client registration structure, which
looks like

struct ib_client {
        char  *name;
        void (*add)   (struct ib_device *);
        void (*remove)(struct ib_device *);

        struct list_head list;
};

and by extending the current enum ib_node_type to something like

enum rdma_device_type {
        RDMA_DEVICE_IB_CA,
        RDMA_DEVICE_IB_SWITCH,
        RDMA_DEVICE_IB_ROUTER,
        RDMA_DEVICE_RNIC
};

Then the various pieces of code layered on top of the RDMA midlayer
can decide whether they want to deal with a particular device or not
by looking at the node_type member.  For example, the IB CM, IPoIB,
etc. could ignore devices of type RDMA_DEVICE_RNIC, while SDP or iSER
would use all devices and the RNIC CM would take only devices of type
RDMA_DEVICE_RNIC.
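
As a rough sketch of that filtering, assuming a simplified device structure that carries only a name and the node_type member: `struct rdma_device` and the three `*_wants_device` helpers are hypothetical names made up for this illustration, and a real client would make the same decision inside its add callback rather than in a separate predicate.

```c
enum rdma_device_type {
        RDMA_DEVICE_IB_CA,
        RDMA_DEVICE_IB_SWITCH,
        RDMA_DEVICE_IB_ROUTER,
        RDMA_DEVICE_RNIC
};

/* Simplified stand-in for the device structure; only what the sketch needs. */
struct rdma_device {
        const char            *name;
        enum rdma_device_type  node_type;
};

/* IB-only consumers (IB CM, IPoIB, ...) ignore RNICs. */
int ib_only_wants_device(const struct rdma_device *dev)
{
        return dev->node_type != RDMA_DEVICE_RNIC;
}

/* Transport-neutral consumers (SDP, iSER) use every device. */
int any_rdma_wants_device(const struct rdma_device *dev)
{
        (void)dev;      /* the device type does not matter here */
        return 1;
}

/* The RNIC CM takes only RNICs. */
int rnic_cm_wants_device(const struct rdma_device *dev)
{
        return dev->node_type == RDMA_DEVICE_RNIC;
}
```

Keeping the decision in the consumer means the midlayer itself would not need to know which upper layers are IB-specific.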

Then someone would have to start implementing a low-level driver for a
specific RNIC, and find which modifications to the existing verbs are
required.  For example, I believe the QP attribute structure passed
into the QP modify verb probably has to become a union containing the
IB attributes and the RNIC attributes.  However, most verbs should
work fine with at most trivial modifications.

The existing OpenIB SDP code will be a good example to study as we
determine what abstractions need to be added to make it simple for
consumers to deal with the differences between IB and RNIC.

 - R.


Re: [openib-general] OpenIB and OpenRDMA: Convergence on common RDMA APIs and ULPs for Linux

2005-05-25 Thread Troy Benjegerdes
 Then someone would have to start implementing a low-level driver for a
 specific RNIC, and find which modifications to the existing verbs are


I'll believe that RNICs are actually going to work and it's worth
talking about OpenRDMA when I can see code that runs. Initially, I'd say
that extending OpenIB is going to be the best way forward, once there is
a working RNIC driver.


Re: [openib-general] umad abi 2 v 3 and multicast join failed

2005-05-25 Thread Hal Rosenstock
On Wed, 2005-05-25 at 19:06, Troy Benjegerdes wrote:
 I was running a crufty version of opensm (compiled from the
 roland-uverbs branch),

roland-uverbs is an orphaned branch at this point.

You should not be using this version. It is not supported and out of
date. You should use the one on the trunk. See below.

  and I started getting these kinds of errors for
 no apparent reason:
 
 ib0: multicast join failed for ff12:401b::0:0:0::, status
 -22
 ib0: multicast join failed for ff12:401b::0:0:0::, status
 -22
 
 I'm running 2.6.11 kernels, and 'stock' modules.. I just tried
 rebuilding opensm from the latest SVN, but it apparently needs a new
 umad driver..
 
 warn: [24878] umad_init: wrong ABI version:
 /sys/class/infiniband_mad/abi_version is 2 but library ABI is 3

Right. This is old OpenSM (actually old libibumad) with the latest from
OpenIB svn (past where I put the changes to support send side RMPP in).

Note that I did say the following:
user_mad: Support RMPP on send side

Note that this change will need a coordinated change to OpenSM and some
userspace/management libraries which will be done as soon as possible
once this patch is accepted.

It was followed by a patch to userspace/management which includes OpenSM
for this:
userspace/management changes to support send side RMPP
(needs change to linux-kernel/infiniband/core/user_mad.c)
ABI_VERSION is now 3
RMPP is enabled in build
SA GetTable is now supported properly (within current RMPP limitations)

 I suppose I need to rebuild the kernel ib_umad (and maybe everything
 else for good measure)..

No. It's the other way around. You need to rebuild OpenSM.
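
For illustration only, a small sketch of the kind of check behind that warning, assuming the library compares a compiled-in ABI number with the contents of the sysfs file named in the message. The function names and `LIB_UMAD_ABI_VERSION` are hypothetical and are not the libibumad API; only the file path and the numbers 2 and 3 come from the log above.

```c
#include <stdio.h>
#include <stdlib.h>

/* ABI the userspace library was built for ("library ABI is 3"). */
#define LIB_UMAD_ABI_VERSION 3

/* Parse the text of an abi_version file; -1 if it is not a number. */
int parse_abi_version(const char *text)
{
        char *end;
        long v = strtol(text, &end, 10);

        if (end == text || v < 0)
                return -1;
        return (int)v;
}

/* Nonzero when the kernel's reported ABI matches the library's. */
int abi_matches(const char *text)
{
        return parse_abi_version(text) == LIB_UMAD_ABI_VERSION;
}

/* Read the version from a file such as
 * /sys/class/infiniband_mad/abi_version; -1 if it cannot be read. */
int read_kernel_abi(const char *path)
{
        char buf[32];
        FILE *f = fopen(path, "r");

        if (f == NULL)
                return -1;
        if (fgets(buf, sizeof(buf), f) == NULL) {
                fclose(f);
                return -1;
        }
        fclose(f);
        return parse_abi_version(buf);
}
```

In this sketch a file containing 2 fails the check against a library built for 3, which corresponds to the mismatch reported in the umad_init warning.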

  And if I do that, should I expect OpenSM to
 work better regarding the multicast issue?
 
 Also, what will happen if I run opensm on two different nodes? Will they
 fight, or will one of them figure out how to be a backup slave SM if the
 first goes down?

SM mastership should work. You should be able to run any number of
OpenSMs in a subnet and one of them will become master. [This is a
separate issue from the ABI version change.]

-- Hal



Re: [openib-general] OpenIB and OpenRDMA: Convergence on common RDMA APIs and ULPs for Linux

2005-05-25 Thread Grant Grundler
On Wed, May 25, 2005 at 07:35:52PM -0700, Roland Dreier wrote:
 I believe the way forward is to evolve the existing drivers/infiniband
 code already in Linux into a drivers/rdma that supports both IB and
 RNICs.  To be extremely blunt, I believe the RNIC-PI is irrelevant to
 the Linux kernel -- no IB vendors will support ripping out a working
 midlayer and starting from scratch, and it doesn't make sense to have
 two essentially equivalent midlayers in the same kernel.

Yes, I think that's an accurate assessment.

...
 The existing OpenIB SDP code will be a good example to study as we
 determine what abstractions need to be added to make it simple for
 consumers to deal with the differences between IB and RNIC.

Venkata,
Interesting coincidence: I was talking with someone (at HP) today
who knows substantially more than I do about RNICs.
They indicated RNICs need to manage TCP state on the card from userspace.
I suspect that's only possible through a private interface
(e.g. ioctl() or /proc) or the nonexistent (in kernel.org)
TOE implementation. Is this correct?

If it is correct, any ideas/proposals on how that functionality
will get into kernel.org and where it might fit?

Solving the RDMA part of the problem isn't useful if one can't
configure/manage the TCP part of the RNIC.

hth,
grant


Re: [openib-general] umad abi 2 v 3 and multicast join failed

2005-05-25 Thread Grant Grundler
On Wed, May 25, 2005 at 09:07:25PM -0700, Roland Dreier wrote:
 In general, given that kernels 2.6.11 and 2.6.12 are shipping with ABI
 version 2, does it make sense to avoid problems like this by keeping
 the old userspace code around and having the library decide at runtime
 which ABI to use?

Only if you want to continue providing fixes for 2.6.11 and 2.6.12.
It would be nice if someone put together a source snapshot that
would work with those releases. But I'm not sure how to
make that available to some random person who is using it.

Personally, since no distro is providing support for ABI v2,
I would not bundle both ABIs together in one library.

grant