Re: [openib-general] IBM eHCA testing..

2005-10-12 Thread IBMEHCA DD

This is basically the reason why it's so sensitive to which port is
plugged in. We're working on a solution to that problem, but currently
we only see a chance to change this behaviour by also changing the
firmware interface, which needs to be coordinated with firmware
development.

Roland Dreier wrote on 10.10.2005 23:44:21:

>     IBMEHCA> So you need some kind of signal from the operating system
>     IBMEHCA> to system firmware, which in the eHCA case is the
>     IBMEHCA> H_DEFINE_AQP1 triggered by ib_create_qp with IB_QPT_GSI
>     IBMEHCA> parameter.  AFTER that call handshaking between system
>     IBMEHCA> firmware and the SM will start, here's a new adapter
>     IBMEHCA> active on a switch port... what's your guid? here's your
>     IBMEHCA> LID, p_key, SM lid  ...and after all that it's
>     IBMEHCA> possible to send and receive packets from the fabric.
>     IBMEHCA> The openib stack expects that a port is fully functional
>     IBMEHCA> after this create_qp returns, and starts to do all sorts
>     IBMEHCA> of modify QP and post send.  So the only choice we have
>     IBMEHCA> there is to delay create_qp until the complete
>     IBMEHCA> handshaking between system firmware and the SM has
>     IBMEHCA> finished (until we see an IB_PORT_ACTIVE in hcad_mod). If
>     IBMEHCA> we don't see that until EHCA_PORT_ACTIVE_TIMEOUT we have
>     IBMEHCA> to return an error code to openib, otherwise we're
>     IBMEHCA> seriously in trouble (tried that).
>
> I think this scheme breaks the IB model.  When consumers get access to
> an HCA, they expect to be able to access the HCA, even if an SM has
> not configured it (and even in the case no cable is connected).  As an
> example of why this is useful, if the link won't come up, it's nice to
> be able to get to query the port's PMA counters to see if there are
> excessive errors or something like that.
>
> I understand that you don't want to have all HCAs always visible to
> the SM, but the scheme you've chosen puts an unneeded dependency
> between driver initialization and the external SM.  It would be fine
> if creating QP1 triggered the transition of the port from DOWN to INIT
> so that it is discoverable by the SM, but there's no reason for
> creation of QP1 to wait to finish until the SM has brought the port up.
>
> (As a side note, Mellanox HCAs don't bring a port to INIT until the
> host driver has transitioned QP0 to the RTR state, which seems more
> sensible than waiting for QP1 to be created)
>
> I hope this can be fixed in firmware with your current HCA hardware.
>
>  - R.
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Re: [openib-general] IBM eHCA testing..

2005-10-12 Thread Troy Benjegerdes
What is the turnaround time on a firmware change? If we can get an
update, I think that would be the best solution. I'll be happy to test
this.

On Wed, Oct 12, 2005 at 11:36:59AM +0200, IBMEHCA DD wrote:
> This is basically the reason why it's so sensitive to which port is
> plugged in. We're working on a solution to that problem, but currently
> we only see a chance to change this behaviour by also changing the
> firmware interface, which needs to be coordinated with firmware
> development.
>
> Roland Dreier wrote on 10.10.2005 23:44:21:
>
> >     IBMEHCA> So you need some kind of signal from the operating system
> >     IBMEHCA> to system firmware, which in the eHCA case is the
> >     IBMEHCA> H_DEFINE_AQP1 triggered by ib_create_qp with IB_QPT_GSI
> >     IBMEHCA> parameter.  AFTER that call handshaking between system
> >     IBMEHCA> firmware and the SM will start, here's a new adapter
> >     IBMEHCA> active on a switch port... what's your guid? here's your
> >     IBMEHCA> LID, p_key, SM lid  ...and after all that it's
> >     IBMEHCA> possible to send and receive packets from the fabric.
> >     IBMEHCA> The openib stack expects that a port is fully functional
> >     IBMEHCA> after this create_qp returns, and starts to do all sorts
> >     IBMEHCA> of modify QP and post send.  So the only choice we have
> >     IBMEHCA> there is to delay create_qp until the complete
> >     IBMEHCA> handshaking between system firmware and the SM has
> >     IBMEHCA> finished (until we see an IB_PORT_ACTIVE in hcad_mod). If
> >     IBMEHCA> we don't see that until EHCA_PORT_ACTIVE_TIMEOUT we have
> >     IBMEHCA> to return an error code to openib, otherwise we're
> >     IBMEHCA> seriously in trouble (tried that).



Re: Re: Re: [openib-general] IBM eHCA testing..

2005-10-11 Thread Heiko J Schick
Hello Troy,

this morning I've looked in detail into the problem you reported on Oct
10 via the OpenIB mailing list [1]. It seems that the kernel panic is an
IPoIB issue.

[1]:  http://openib.org/pipermail/openib-general/2005-October/012353.html

The following things happen:

1.  modprobe hcad_mod ehca_nr_ports=1
    The eHCA InfiniBand Device Driver is loaded.

2.  modprobe ib_mad
    The ib_mad stack creates an AQP1. This starts the port activation
    process. By my count it takes more than 110 / 120 seconds to
    activate a port. Our device driver gets a timeout, which means that
    the port is NOT active and ib_modify_qp will not work (for any QP,
    no matter whether it was created in the ib_mad stack or in the
    ib_ipoib stack).

3.  modprobe ib_ipoib
    All resources for IPoIB are allocated (CQ, QPs, MR, etc.)

4.  A user runs ifconfig ib0 xxx.xxx.xxx.xxx, which executes the
    following functions:
    ipoib_open -> ipoib_ib_dev_open -> ipoib_qp_create. The user should
    see the following error message:

    l2:/home/schickhj/ibt/linstack/ehca2/ehca2 # ifconfig ib0 192.168.8.8
    SIOCSIFFLAGS: Invalid argument

5.  The function ipoib_qp_create modifies the QP from Reset to Init to
    RTR to RTS. If one of these three ib_modify_qp calls fails, the
    IPoIB QP (priv->qp) is destroyed (by the ipoib_qp_create error
    routine / out_fail) and priv->qp is set to NULL.

    --> see /src/linux-kernel/infiniband/ulp/ipoib/ipoib_verbs.c,
    function ipoib_qp_create

6.  A user runs (again) ifconfig ib0 xxx.xxx.xxx, which executes
    (again) the following functions:
    ipoib_open -> ipoib_ib_dev_open -> ipoib_qp_create

7.  ipoib_qp_create wants to modify the IPoIB QP (priv->qp), which is
    NULL because the QP was destroyed earlier by the error handling
    routine in ipoib_qp_create (see 5.)

I think this error could also show up on Mellanox based IB cards when
ib_modify_qp fails in ipoib_qp_create.
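
Steps 5 to 7 can be modelled in plain userspace C. This is only a sketch, not the IPoIB code: toy_qp, toy_priv, toy_modify_qp and toy_qp_create are hypothetical stand-ins, and the NULL check that returns -EFAULT is added purely so the model reports the condition instead of crashing the way the kernel did.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

enum toy_state { TOY_RESET, TOY_INIT, TOY_RTR, TOY_RTS };

struct toy_qp {
	enum toy_state state;
};

struct toy_priv {
	struct toy_qp *qp;   /* allocated once, before the interface is opened */
	int port_active;     /* 0 while port activation has not finished */
};

/* Stand-in for ib_modify_qp(): every transition fails on an inactive port. */
int toy_modify_qp(struct toy_priv *priv, enum toy_state next)
{
	if (!priv->port_active)
		return -EINVAL;
	priv->qp->state = next;
	return 0;
}

/* Models the error path from step 5: a failed transition destroys the QP. */
int toy_qp_create(struct toy_priv *priv)
{
	static const enum toy_state steps[] = { TOY_INIT, TOY_RTR, TOY_RTS };
	size_t i;

	if (priv->qp == NULL)
		return -EFAULT; /* step 7: the kernel dereferenced NULL here */

	for (i = 0; i < sizeof(steps) / sizeof(steps[0]); i++) {
		if (toy_modify_qp(priv, steps[i]) != 0) {
			free(priv->qp);   /* models ib_destroy_qp(priv->qp) */
			priv->qp = NULL;  /* nothing is left for the next open */
			return -EINVAL;
		}
	}
	return 0;
}
```

Calling toy_qp_create() twice on an inactive port returns -EINVAL the first time (matching the "Invalid argument" from ifconfig) and hits the NULL pointer the second time.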

In dmesg you should see:

(see 1.)
eHCA Infiniband Device Driver (Rel.: )
xics_enable_irq: irq=9029: ibm_int_on returned fffd
eHCA Infiniband Device Driver (Rel.: )

(see 2.)
PU 000b0078:ehca_define_sqp HCAD_ERROR  Port 1 is not active.
PU 000b0387:ehca_create_qp HCAD_ERROR  ehca_define_sqp() failed rc=
PU 000b03ae:ehca_create_qp  failed ret=ffea
ib_mad: Couldn't create ib_mad QP1
ib_mad: Couldn't open ehca0 port 1
PU0001 00060103:ehca_parse_ec  EHCA port 1 is available.
PU 000b00bd:plpar_hcall_7arg_7ret HCAD_ERROR  HCALL77_IN r3=168 r4=100100050304 r5=2001002c r6=8a40 3ed48000 r8=0 r9=0 r10=0
PU 000b00c4:plpar_hcall_7arg_7ret HCAD_ERROR  HCALL77_OUT r3=ffd3 r4=0 r5=0 r6=0 r7=4 r8=0 r9=8005aa18 r10=0

(see 4.)
PU 000b0564:internal_modify_qp HCAD_ERROR  hipz_h_modify_qp() failed rc=ffd3 ehca_qp=c3ba4e00 qp_num=2c
ib0: failed to modify QP to init, ret = -22
ib0: ipoib_qp_create returned -22
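
The hex return codes in this log look truncated by the archive. Assuming (this is an assumption, not something the log states) that the surviving digits are the low 16 bits of a negative two's-complement value, they can be decoded with a small helper. decode_rc16 is a hypothetical name used only for this sketch.

```c
#include <errno.h>
#include <stdint.h>

/* Interpret a 16-bit pattern as a two's-complement signed number. */
int decode_rc16(uint16_t raw)
{
	/* values with the top bit set are negative: subtract 2^16 */
	return raw >= 0x8000u ? (int)raw - 0x10000 : (int)raw;
}
```

Under that assumption ret=ffea decodes to -22, the same -EINVAL that IPoIB prints as "ret = -22", and rc=ffd3 decodes to -45.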

Mit freundlichen Gruessen / Kind Regards
Heiko Joerg Schick

IBM Deutschland Entwicklung GmbH
I/Ox Microcode Development
Linux Infiniband Device Drivers

Schoenaicher Str. 220
71032 Boeblingen
E-Mail: [EMAIL PROTECTED]
External: 49-7031-16-0 x4219,   t/l: 120-4219



Re: Re: Re: [openib-general] IBM eHCA testing..

2005-10-11 Thread Hal Rosenstock
Hi Heiko,

On Tue, 2005-10-11 at 08:43, Heiko J Schick wrote:
> this morning I've looked in detail into the problem you reported on Oct
> 10 via the OpenIB mailing list [1]. It seems that the kernel panic is an
> IPoIB issue.
>
> [1]:  http://openib.org/pipermail/openib-general/2005-October/012353.html
>
> The following things happen:
>
> 1.  modprobe hcad_mod ehca_nr_ports=1
>     The eHCA InfiniBand Device Driver is loaded.
>
> 2.  modprobe ib_mad
>     The ib_mad stack creates an AQP1. This starts the port activation
>     process. By my count it takes more than 110 / 120 seconds to
>     activate a port. Our device driver gets a timeout, which means that
>     the port is NOT active and ib_modify_qp will not work (for any QP,
>     no matter whether it was created in the ib_mad stack or in the
>     ib_ipoib stack).

Where does this time to activate a port come from ? Is there some
maximum time in which the eHCA firmware requires this to be completed ?

-- Hal



Re: Re: Re: [openib-general] IBM eHCA testing..

2005-10-11 Thread Heiko J Schick

Hello Hal,

normally the timeout is set to 30 seconds.
If you need more information about the activation please
see [1].

[1]: http://openib.org/pipermail/openib-general/2005-October/012355.html


Mit freundlichen Gruessen / Kind Regards
Heiko Joerg Schick

IBM Deutschland Entwicklung GmbH
I/Ox Microcode Development
Linux Infiniband Device Drivers

Schoenaicher Str. 220
71032 Boeblingen
E-Mail: [EMAIL PROTECTED]
External: 49-7031-16-0 x4219,  t/l: 120-4219

From: Hal Rosenstock [EMAIL PROTECTED]
Date: 11.10.2005 14:48
To: Heiko J Schick/Germany/[EMAIL PROTECTED]
cc: openib-general@openib.org, Christoph Raisch/Germany/[EMAIL PROTECTED]
Subject: Re: Re: Re: [openib-general] IBM eHCA testing..

> Hi Heiko,
>
> On Tue, 2005-10-11 at 08:43, Heiko J Schick wrote:
> > this morning I've looked in detail into the problem you've reported
> > on Oct 10 via the OpenIB mailing-list [1]. It seems that the kernel
> > panic is an IPoIB issue.
> >
> > [1]: http://openib.org/pipermail/openib-general/2005-October/012353.html
> >
> > The following things happen:
> >
> > 1.  modprobe hcad_mod ehca_nr_ports=1
> >     The eHCA InfiniBand Device Driver is loaded.
> >
> > 2.  modprobe ib_mad
> >     The ib_mad stack creates an AQP1. This will start the port
> >     activation process. By my count it will take more than 110 / 120
> >     seconds to activate a port. Our device driver gets a timeout,
> >     which means that the port is NOT active and ib_modify_qp will
> >     not work (for any QP, doesn't matter if it was created in the
> >     ib_mad stack or in the ib_ipoib stack).
>
> Where does this time to activate a port come from ? Is there some
> maximum time in which the eHCA firmware requires this to be completed ?
>
> -- Hal


Re: Re: Re: [openib-general] IBM eHCA testing..

2005-10-11 Thread Hal Rosenstock
Hi again Heiko,

On Tue, 2005-10-11 at 09:21, Heiko J Schick wrote:
> Hello Hal,
>
> normally the timeout is set to 30 seconds.

Why does there need to be a timeout for this ? There is no time defined
in the IB spec for activating a port. The SM may or may not be up and it
is implementation specific when it activates any particular port.

> If you need more information about the activation please see [1].
>
> [1]:
> http://openib.org/pipermail/openib-general/2005-October/012355.html

Yes, I saw that post yesterday.

-- Hal



Re: Re: Re: [openib-general] IBM eHCA testing..

2005-10-11 Thread Hal Rosenstock
Hi again Heiko,

On Tue, 2005-10-11 at 09:21, Heiko J Schick wrote:
> Hello Hal,
>
> normally the timeout is set to 30 seconds.

One more thing:

How can the timeout be adjusted ? Is it a module parameter ?

-- Hal



Re: Re: Re: [openib-general] IBM eHCA testing..

2005-10-11 Thread Shirley Ma

The IB stack doesn't handle errors during client initialization. This
problem is easy to reproduce by inducing errors (resource allocation
failure or query failure) in mad_client or sa_client registration. I am
working on a patch, but I am in class the whole week and don't have time
to verify it. I hope the patch will be available early next week to fix
the panic.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638

Re: [openib-general] IBM eHCA testing..

2005-10-11 Thread Roland Dreier
    Heiko> 7.  ipoib_qp_create wants to modify the IPoIB QP (priv->qp)
    Heiko> which is NULL, because the QP was destroyed earlier in time
    Heiko> by the error handling routine in ipoib_qp_create (see 5.)

    Heiko> I think this error could also show up on Mellanox based IB
    Heiko> cards when ib_modify_qp fails in ipoib_qp_create.

Yes, this is a bug.  I think something like the patch below is needed
-- ipoib_qp_create() should not destroy the QP on failure, since it no
longer creates the QP.  In fact we should fix the name as well, since
creation of the QP has moved elsewhere.

I'll check this in and queue it for 2.6.15.

Thanks,
  Roland

--- infiniband/ulp/ipoib/ipoib_verbs.c	(revision 3707)
+++ infiniband/ulp/ipoib/ipoib_verbs.c	(working copy)
@@ -92,7 +92,7 @@ int ipoib_mcast_detach(struct net_device
 	return ret;
 }
 
-int ipoib_qp_create(struct net_device *dev)
+int ipoib_init_qp(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int ret;
@@ -149,10 +149,11 @@ int ipoib_qp_create(struct net_device *d
 	return 0;
 
 out_fail:
-	ib_destroy_qp(priv->qp);
-	priv->qp = NULL;
+	qp_attr.qp_state = IB_QPS_RESET;
+	if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE))
+		ipoib_warn(priv, "Failed to modify QP to RESET state\n");
 
-	return -EINVAL;
+	return ret;
 }
 
 int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
--- infiniband/ulp/ipoib/ipoib.h	(revision 3707)
+++ infiniband/ulp/ipoib/ipoib.h	(working copy)
@@ -277,7 +277,7 @@ int ipoib_mcast_attach(struct net_device
 int ipoib_mcast_detach(struct net_device *dev, u16 mlid,
 		       union ib_gid *mgid);
 
-int ipoib_qp_create(struct net_device *dev);
+int ipoib_init_qp(struct net_device *dev);
 int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca);
 void ipoib_transport_dev_cleanup(struct net_device *dev);
 
--- infiniband/ulp/ipoib/ipoib_ib.c	(revision 3707)
+++ infiniband/ulp/ipoib/ipoib_ib.c	(working copy)
@@ -387,9 +387,9 @@ int ipoib_ib_dev_open(struct net_device
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int ret;
 
-	ret = ipoib_qp_create(dev);
+	ret = ipoib_init_qp(dev);
 	if (ret) {
-		ipoib_warn(priv, "ipoib_qp_create returned %d\n", ret);
+		ipoib_warn(priv, "ipoib_init_qp returned %d\n", ret);
 		return -1;
 	}
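
The effect of the changed out_fail path can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the kernel code: fix_qp, fix_priv, fix_modify_qp, fix_init_qp and the fail_at field are all hypothetical names invented for the illustration. The point it shows is that on failure the QP is kept and only moved back to RESET, so a later open can retry.

```c
#include <errno.h>
#include <stddef.h>

enum fix_state { FIX_RESET, FIX_INIT, FIX_RTR, FIX_RTS };

struct fix_qp {
	enum fix_state state;
};

struct fix_priv {
	struct fix_qp *qp;
	int fail_at;   /* transition that should fail, or -1 for none */
};

/* Stand-in for ib_modify_qp(): fails only for the configured transition. */
int fix_modify_qp(struct fix_priv *priv, enum fix_state next)
{
	if ((int)next == priv->fail_at)
		return -EINVAL;
	priv->qp->state = next;
	return 0;
}

/* Models the patched function: the QP survives a failed transition. */
int fix_init_qp(struct fix_priv *priv)
{
	static const enum fix_state steps[] = { FIX_INIT, FIX_RTR, FIX_RTS };
	size_t i;
	int ret = 0;

	for (i = 0; i < sizeof(steps) / sizeof(steps[0]); i++) {
		ret = fix_modify_qp(priv, steps[i]);
		if (ret)
			goto out_fail;
	}
	return 0;

out_fail:
	/* keep the QP and move it back to RESET instead of destroying it */
	fix_modify_qp(priv, FIX_RESET);
	return ret;
}
```

With fail_at set to FIX_RTR the first call fails and leaves the QP in RESET; after clearing fail_at, a second call on the same QP reaches RTS.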
 


Re: [openib-general] IBM eHCA testing..

2005-10-11 Thread Troy Benjegerdes
On Tue, Oct 11, 2005 at 09:13:20AM -0700, Shirley Ma wrote:
> The IB stack doesn't handle errors during client initialization. This
> problem is easy to reproduce by inducing errors (resource allocation
> failure or query failure) in mad_client or sa_client registration. I am
> working on a patch, but I am in class the whole week, don't have time to
> verify the patch. I hope the patch will be available early next week to
> fix the panic.

I'd be happy to verify the patch, but I need to get the latest version
of the ehca driver, ideally already integrated into the subversion tree.
Otherwise a tar.gz I can extract and drop in drivers/infiniband/hw/ehca
would work just fine.

I'm still not sure I got an answer as to why the ehca is so sensitive to
which port is plugged in.


Re: [openib-general] IBM eHCA testing..

2005-10-10 Thread IBMEHCA DD

This is caused by a complex interaction of ib_mad, hcad_mod and pSeries
firmware.

As you might already have noticed, an eHCA doesn't show up as a port but
as a switch in the fabric. The reason for this is partition support and
virtualisation in InfiniBand.

If you want to give each partition in a system its own IB adapter, it
has to have its own LID(s) and therefore its own GUIDs. The IB standard
currently allows only one way to accomplish this: you need a switch and
multiple adapters behind it. So that's exactly how the eHCA shows up in
the fabric. In our case system firmware handles the SMA traffic for that
switch and for all adapters (running an SMA or SM on QP0 is currently
not supported).

This brings up another problem: you definitely won't want to allocate
LIDs for all potentially possible operating system partitions (not to be
confused with IB partitioning), otherwise you could come close to the
48000 LIDs/subnet limit pretty quickly. So you need some kind of signal
from the operating system to system firmware, which in the eHCA case is
the H_DEFINE_AQP1 triggered by ib_create_qp with the IB_QPT_GSI
parameter. AFTER that call, handshaking between system firmware and the
SM will start: here's a new adapter active on a switch port... what's
your guid? here's your LID, p_key, SM lid ...and after all that it's
possible to send and receive packets from the fabric.

The openib stack expects that a port is fully functional after this
create_qp returns, and starts to do all sorts of modify QP and post
send. So the only choice we have there is to delay create_qp until the
complete handshaking between system firmware and the SM has finished
(until we see an IB_PORT_ACTIVE in hcad_mod). If we don't see that
before EHCA_PORT_ACTIVE_TIMEOUT expires, we have to return an error code
to openib, otherwise we're seriously in trouble (tried that).
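
The "delay create_qp until the port is active or a timeout expires" behaviour described above can be sketched as a bounded polling loop. This is a userspace illustration only, not the hcad_mod source: wait_port_active, port_active_fn, fake_port and the use of -ETIMEDOUT are assumptions made for the example, and the tick counts are arbitrary.

```c
#include <errno.h>

/* Hypothetical predicate: returns nonzero once the port is ACTIVE. */
typedef int (*port_active_fn)(void *ctx);

/* Poll until the port is active or timeout_ticks polls have been made. */
int wait_port_active(port_active_fn is_active, void *ctx,
		     unsigned int timeout_ticks)
{
	unsigned int tick;

	for (tick = 0; tick < timeout_ticks; tick++) {
		if (is_active(ctx))
			return 0;
		/* a real driver would sleep between polls */
	}
	return -ETIMEDOUT; /* caller must fail the QP creation */
}

/* Test double: reports ACTIVE only after active_after polls. */
struct fake_port {
	unsigned int polls;
	unsigned int active_after;
};

int fake_port_is_active(void *ctx)
{
	struct fake_port *p = ctx;

	return ++p->polls > p->active_after;
}
```

If the SM needs longer than the timeout, the wait fails even though the port would have come up shortly afterwards, which is the coupling between driver initialization and the SM that this message describes.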

Shirley already pointed out on the mailing list that ib_mad and others
have different recovery behaviour depending on the success of
ib_create_qp(IB_QPT_GSI); in particular, ib_mad decides the best thing
to do is to kill the complete adapter if that call fails on a single
port.

So that's the full explanation of ehca_nr_ports, which hopefully answers
your question.

From: Troy Benjegerdes [EMAIL PROTECTED]
Date: 08.10.2005 04:03
To: Shirley Ma [EMAIL PROTECTED]
cc: Pradeep Satyanarayana [EMAIL PROTECTED], Troy Benjegerdes [EMAIL PROTECTED], IBMEHCA DD/Germany/[EMAIL PROTECTED], openib-general@openib.org, [EMAIL PROTECTED]
Subject: Re: [openib-general] IBM eHCA testing..

> On Fri, Oct 07, 2005 at 09:33:27AM -0700, Shirley Ma wrote:
> > Hi, Troy,
> >
> > There is INSTALL file in the EHCA driver package.
> > In OpenPower 720 port 1 is at the top, port 2 is at the bottom.
> > In P570, port1 is at the bottom, port2 is at the top.
>
> Okay, I guess I should read more carefully ;)
>
> What is the issue with needing to use port 1? Can that be fixed in the
> driver, or does that need a firmware update?

Re: Re: [openib-general] IBM eHCA testing..

2005-10-10 Thread Heiko J Schick

Hello Troy,

below you will find our preliminary analysis of the problem you reported
on Oct 10 via the OpenIB mailing list [1]:

[1]: http://openib.org/pipermail/openib-general/2005-October/012353.html

[ 381.453731] eHCA Infiniband Device Driver (Rel.: EHCA2_0025)
[ 381.458602] xics_enable_irq: irq=36868: ibm_int_on returned fffd
[ 393.378143] eHCA Infiniband Device Driver (Rel.: EHCA2_0025)
[ 452.658083] PU0002 000b0075:ehca_define_sqp HCAD_ERROR Port 1 is not active.
[ 452.658106] PU0002 000b0383:ehca_create_qp HCAD_ERROR ehca_define_sqp() failed rc=
[ 452.821917] PU0002 000b03aa:ehca_create_qp  failed ret=ffea
[ 452.821939] ib_mad: Couldn't create ib_mad QP1
[ 453.313412] ib_mad: Couldn't open ehca0 port 1
[ 475.132318] PU0002 00060100:ehca_parse_ec EHCA port 1 is available.
[ 518.249381] PU0007 000b00b9:plpar_hcall_7arg_7ret HCAD_ERROR HCALL77_IN r3=168 r4=1304 r5=2008 r6=8a40 r7=1e4e49000 r8=0 r9=0 r10=0
[ 518.249411] PU0007 000b00c0:plpar_hcall_7arg_7ret HCAD_ERROR HCALL77_OUT r3=ffd3 r4=0 r5=0 r6=0 r7=4 r8=0 r9=8005aa18 r10=0
[ 518.249438] PU0007 000b0560:internal_modify_qp HCAD_ERROR hipz_h_modify_qp() failed rc=ffd3 ehca_qp=cf2cd080 qp_num=8
[ 518.249460] ib0: failed to modify QP to init, ret = -22
[ 518.418976] ib0: ipoib_qp_create returned -22
[ 528.813491] Oops: Kernel access of bad area, sig: 11 [#1]
[ 528.813505] SMP NR_CPUS=8 NUMA PSERIES LPAR
[ 528.813517] Modules linked in: ib_ipoib ib_sa ib_mad hcad_mod ib_core ebus
[ 528.813540] NIP: D0049C6C XER: 2020 LR: D00760A0 CTR: D0049C60
[ 528.813554] REGS: cf1eb1d0 TRAP: 0300  Not tainted (2.6.13.3-power5)
[ 528.813568] MSR: 80009032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11 CR: 22028422
[ 528.813580] DAR:  DSISR: 4000
[ 528.813592] TASK: c209a9a0[2021] 'ifconfig' THREAD: cf1e8000 CPU: 0
[ 528.813605] GPR00: D00760A0 CF1EB450 D005FFF0
[ 528.813625] GPR04: CF1EB548 0071 CF1EB540 0001
[ 528.813645] GPR08: 000B 0001 0004 D0049C60
[ 528.813664] GPR12: D00774C0 C04B4000 100C 100A
[ 528.813685] GPR16:   1002 1002
[ 528.813704] GPR20: 1001E71C C001E466C000 8914 C001E46D4810
[ 528.813725] GPR24: C001E46D4800 CF43B900 CF1EBD10 0002
[ 528.813745] GPR28:  C001E466C380 D0084640 CF1EB548
[ 528.813768] NIP [d0049c6c] .ib_modify_qp+0xc/0x40 [ib_core]
[ 528.813797] LR [d00760a0] .ipoib_qp_create+0xe0/0x1c0 [ib_ipoib]
[ 528.813822] Call Trace:
[ 528.813829] [cf1eb450] [434849c5] 0x434849c5 (unreliable)
[ 528.813846] [cf1eb4d0] [d00760a0] .ipoib_qp_create+0xe0/0x1c0 [ib_ipoib]
[ 528.813873] [cf1eb5f0] [d007261c] .ipoib_ib_dev_open+0x2c/0x120 [ib_ipoib]
[ 528.813899] [cf1eb680] [d006f38c] .ipoib_open+0x7c/0x190 [ib_ipoib]
[ 528.813923] [cf1eb720] [c032a650] .dev_open+0xc0/0x120
[ 528.813942] [cf1eb7c0] [c0328c70] .dev_change_flags+0x180/0x1c0
[ 528.813961] [cf1eb860] [c037a02c] .devinet_ioctl+0x81c/0x850
[ 528.813980] [cf1eb970] [c037a65c] .inet_ioctl+0x27c/0x2d0
[ 528.813998] [cf1eba00] [c031bc4c] .sock_ioctl+0x8c/0x440
[ 528.814016] [cf1ebaa0] [c00c22f0] .do_ioctl+0x60/0x120
[ 528.814033] [cf1ebb40] [c00c244c] .vfs_ioctl+0x9c/0x4d0
[ 528.814050] [cf1ebbf0] [c00c28cc] .sys_ioctl+0x4c/0xa0
[ 528.814066] [cf1ebca0] [c001bb24] .dev_ifsioc+0x84/0x390
[ 528.814084] [cf1ebd70] [c00e4d88] .compat_sys_ioctl+0x158/0x500
[ 528.814103] [cf1ebe30] [c000d300] syscall_exit+0x0/0x18
[ 528.814119] Instruction dump:
[ 528.814126] 7c601b78 38210080 7c030378 e8010010 7c0803a6 4e800020 6000 6000
[ 528.814150] 6000 7c0802a6 f8010010 f821ff81 e923 e9490170 e80a f8410028
[ 528.814174] 7RTAS: event: 3, Type: Platform Error, Severity: 2

It looks like IPoIB uses resources which have already been freed. We
don't receive a port active event for port 1 in time (within 20
seconds). This happens when the ib_mad stack tries to create an AQP1:
here, our eHCA InfiniBand Device Driver waits for a maximum of 20
seconds for a port active event. It seems that with OpenSM we receive
the port active event only after ca. 45 seconds.

For the MAD and IPoIB stack this means the following:

MAD:

1. No AQP1 QP will exist for port 1, because of the missing port active
   event.

2. All resources are freed, because of the error handling routines in
   ib_mad. create_mad_qp reports an error to ib_mad_port_open which
   destroys all

Re: [openib-general] IBM eHCA testing..

2005-10-10 Thread Roland Dreier
    IBMEHCA> So you need some kind of signal from the operating system
    IBMEHCA> to system firmware, which in the eHCA case is the
    IBMEHCA> H_DEFINE_AQP1 triggered by ib_create_qp with IB_QPT_GSI
    IBMEHCA> parameter.  AFTER that call handshaking between system
    IBMEHCA> firmware and the SM will start, here's a new adapter
    IBMEHCA> active on a switch port... what's your guid? here's your
    IBMEHCA> LID, p_key, SM lid  ...and after all that it's
    IBMEHCA> possible to send and receive packets from the fabric.
    IBMEHCA> The openib stack expects that a port is fully functional
    IBMEHCA> after this create_qp returns, and starts to do all sorts
    IBMEHCA> of modify QP and post send.  So the only choice we have
    IBMEHCA> there is to delay create_qp until the complete
    IBMEHCA> handshaking between system firmware and the SM has
    IBMEHCA> finished (until we see an IB_PORT_ACTIVE in hcad_mod). If
    IBMEHCA> we don't see that until EHCA_PORT_ACTIVE_TIMEOUT we have
    IBMEHCA> to return an error code to openib, otherwise we're
    IBMEHCA> seriously in trouble (tried that).

I think this scheme breaks the IB model.  When consumers get access to
an HCA, they expect to be able to access the HCA, even if an SM has
not configured it (and even in the case no cable is connected).  As an
example of why this is useful, if the link won't come up, it's nice to
be able to get to query the port's PMA counters to see if there are
excessive errors or something like that.

I understand that you don't want to have all HCAs always visible to
the SM, but the scheme you've chosen puts an unneeded dependency
between driver initialization and the external SM.  It would be fine
if creating QP1 triggered the transition of the port from DOWN to INIT
so that it is discoverable by the SM, but there's no reason for
creation of QP1 to wait to finish until the SM has brought the port up.

(As a side note, Mellanox HCAs don't bring a port to INIT until the
host driver has transitioned QP0 to the RTR state, which seems more
sensible than waiting for QP1 to be created)

I hope this can be fixed in firmware with your current HCA hardware.

 - R.
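
The alternative suggested in this message (creating QP1 only moves the port from DOWN to INIT and returns at once, while the SM activates the port later) can be sketched as a tiny state machine. This is a simplified model with hypothetical names (lport, lport_create_qp1, lport_sm_activate, lport_can_query, lport_can_send); it is not eHCA firmware or driver behaviour, and treating local queries as always available once QP1 exists is a modelling assumption.

```c
enum lp_state { LP_DOWN, LP_INIT, LP_ARMED, LP_ACTIVE };

struct lport {
	enum lp_state state;
	int has_qp1;
};

/* QP1 creation only makes the port discoverable; it never waits for the SM. */
int lport_create_qp1(struct lport *p)
{
	p->has_qp1 = 1;
	if (p->state == LP_DOWN)
		p->state = LP_INIT;
	return 0;
}

/* Runs whenever the external SM gets around to configuring the port. */
void lport_sm_activate(struct lport *p)
{
	if (p->state == LP_INIT)
		p->state = LP_ARMED;
	if (p->state == LP_ARMED)
		p->state = LP_ACTIVE;
}

/* Local queries (e.g. port counters) work as soon as QP1 exists. */
int lport_can_query(const struct lport *p)
{
	return p->has_qp1;
}

/* Traffic to the fabric additionally needs the port to be ACTIVE. */
int lport_can_send(const struct lport *p)
{
	return p->has_qp1 && p->state == LP_ACTIVE;
}
```

In this model driver initialization completes without any SM present, and only fabric traffic depends on the SM having brought the port up.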


Re: [openib-general] IBM eHCA testing..

2005-10-09 Thread Troy Benjegerdes
What's the status on getting the ehca driver integrated into subversion?
If there's something holding it up, can we at least get a version that
can be dropped into drivers/infiniband/hw ?

Also, one final note, is it really appropriate to have ehca/ebus in the
infiniband directory? It's really a PPC64 specific driver that works for
more than just the ehca device, correct?

I have the correct port plugged in now, and I can see the logical HCA device
in the output of 'ibnetdiscover' (from another node), but trying to bring up 
ib0 caused this:

[  381.453731] eHCA Infiniband Device Driver (Rel.: EHCA2_0025)
[  381.458602] xics_enable_irq: irq=36868: ibm_int_on returned fffd
[  393.378143] eHCA Infiniband Device Driver (Rel.: EHCA2_0025)
[  452.658083] PU0002 000b0075:ehca_define_sqp HCAD_ERROR  Port 1 is not active.
[  452.658106] PU0002 000b0383:ehca_create_qp HCAD_ERROR ehca_define_sqp() failed rc=
[  452.821917] PU0002 000b03aa:ehca_create_qp  failed ret=ffea
[  452.821939] ib_mad: Couldn't create ib_mad QP1
[  453.313412] ib_mad: Couldn't open ehca0 port 1
[  475.132318] PU0002 00060100:ehca_parse_ec  EHCA port 1 is available.
[  518.249381] PU0007 000b00b9:plpar_hcall_7arg_7ret HCAD_ERROR HCALL77_IN r3=168 r4=1304 r5=2008 r6=8a40 r7=1e4e49000 r8=0 r9=0 r10=0
[  518.249411] PU0007 000b00c0:plpar_hcall_7arg_7ret HCAD_ERROR HCALL77_OUT r3=ffd3 r4=0 r5=0 r6=0 r7=4 r8=0 r9=8005aa18 r10=0
[  518.249438] PU0007 000b0560:internal_modify_qp HCAD_ERROR hipz_h_modify_qp() failed rc=ffd3 ehca_qp=cf2cd080 qp_num=8
[  518.249460] ib0: failed to modify QP to init, ret = -22
[  518.418976] ib0: ipoib_qp_create returned -22
[  528.813491] Oops: Kernel access of bad area, sig: 11 [#1]
[  528.813505] SMP NR_CPUS=8 NUMA PSERIES LPAR
[  528.813517] Modules linked in: ib_ipoib ib_sa ib_mad hcad_mod ib_core ebus
[  528.813540] NIP: D0049C6C XER: 2020 LR: D00760A0 CTR: D0049C60
[  528.813554] REGS: cf1eb1d0 TRAP: 0300   Not tainted (2.6.13.3-power5)
[  528.813568] MSR: 80009032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11 CR: 22028422
[  528.813580] DAR:  DSISR: 4000
[  528.813592] TASK: c209a9a0[2021] 'ifconfig' THREAD: cf1e8000 CPU: 0
[  528.813605] GPR00: D00760A0 CF1EB450 D005FFF0
[  528.813625] GPR04: CF1EB548 0071 CF1EB540 0001
[  528.813645] GPR08: 000B 0001 0004 D0049C60
[  528.813664] GPR12: D00774C0 C04B4000 100C 100A
[  528.813685] GPR16:   1002 1002
[  528.813704] GPR20: 1001E71C C001E466C000 8914 C001E46D4810
[  528.813725] GPR24: C001E46D4800 CF43B900 CF1EBD10 0002
[  528.813745] GPR28:  C001E466C380 D0084640 CF1EB548
[  528.813768] NIP [d0049c6c] .ib_modify_qp+0xc/0x40 [ib_core]
[  528.813797] LR [d00760a0] .ipoib_qp_create+0xe0/0x1c0 [ib_ipoib]
[  528.813822] Call Trace:
[  528.813829] [cf1eb450] [434849c5] 0x434849c5 (unreliable)
[  528.813846] [cf1eb4d0] [d00760a0] .ipoib_qp_create+0xe0/0x1c0 [ib_ipoib]
[  528.813873] [cf1eb5f0] [d007261c] .ipoib_ib_dev_open+0x2c/0x120 [ib_ipoib]
[  528.813899] [cf1eb680] [d006f38c] .ipoib_open+0x7c/0x190 [ib_ipoib]
[  528.813923] [cf1eb720] [c032a650] .dev_open+0xc0/0x120
[  528.813942] [cf1eb7c0] [c0328c70] .dev_change_flags+0x180/0x1c0
[  528.813961] [cf1eb860] [c037a02c] .devinet_ioctl+0x81c/0x850
[  528.813980] [cf1eb970] [c037a65c] .inet_ioctl+0x27c/0x2d0
[  528.813998] [cf1eba00] [c031bc4c] .sock_ioctl+0x8c/0x440
[  528.814016] [cf1ebaa0] [c00c22f0] .do_ioctl+0x60/0x120
[  528.814033] [cf1ebb40] [c00c244c] .vfs_ioctl+0x9c/0x4d0
[  528.814050] [cf1ebbf0] [c00c28cc] .sys_ioctl+0x4c/0xa0
[  528.814066] [cf1ebca0] [c001bb24] .dev_ifsioc+0x84/0x390
[  528.814084] [cf1ebd70] [c00e4d88] .compat_sys_ioctl+0x158/0x500
[  528.814103] [cf1ebe30] [c000d300] syscall_exit+0x0/0x18
[  528.814119] Instruction dump:
[  528.814126] 7c601b78 38210080 7c030378 e8010010 7c0803a6 4e800020 6000 6000
[  528.814150] 6000 7c0802a6 f8010010 f821ff81 e923 e9490170 e80a f8410028
[  528.814174]  7RTAS: event: 3, Type: Platform Error, Severity: 2



[openib-general] IBM eHCA testing..

2005-10-07 Thread Troy Benjegerdes
I have two IBM eHCA cards installed and it appears that OpenSM
is happily talking to the firmware and bringing up the links.

So now I'm looking at the install instructions for the ehca2_EHCA2_0025.tgz
code drop, and wondering what (if any) issues there are with a 2.6.13
kernel, or later OpenIB svn drops. 

Is there a later code drop I can get ahold of? Is the nr_ports issue
something in the driver? I wound up connecting to the lower port in the
Openpower720 machine.. do you know if that's port 1 or 2?


Re: [openib-general] IBM eHCA testing..

2005-10-07 Thread Pradeep Satyanarayana

I believe the lower port is port 1. I will defer to the EHCA team
regarding any issues with 2.6.13. We have minimally used both ports on
p570, so my guess is that this should work on an Openpower720.

Pradeep
[EMAIL PROTECTED]

[EMAIL PROTECTED] wrote on 10/07/2005 07:12:07 AM:

> I have two IBM eHCA cards installed and it appears that OpenSM
> is happily talking to the firmware and bringing up the links.
>
> So now I'm looking at the install instructions for the ehca2_EHCA2_0025.tgz
> code drop, and wondering what (if any) issues there are with a 2.6.13
> kernel, or later OpenIB svn drops.
>
> Is there a later code drop I can get ahold of? Is the nr_ports issue
> something in the driver? I wound up connecting to the lower port in the
> Openpower720 machine.. do you know if that's port 1 or 2?

Re: [openib-general] IBM eHCA testing..

2005-10-07 Thread Shirley Ma

Hi, Troy,

There is an INSTALL file in the EHCA driver package.
In OpenPower 720, port 1 is at the top and port 2 is at the bottom.
In P570, port 1 is at the bottom and port 2 is at the top.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638

Re: [openib-general] IBM eHCA testing..

2005-10-07 Thread Troy Benjegerdes
On Fri, Oct 07, 2005 at 09:33:27AM -0700, Shirley Ma wrote:
> Hi, Troy,
>
> There is INSTALL file in the EHCA driver package.
> In OpenPower 720 port 1 is at the top, port 2 is at the bottom.
> In P570, port1 is at the bottom, port2 is at the top.

Okay, I guess I should read more carefully ;)

What is the issue with needing to use port 1? Can that be fixed in the
driver, or does that need a firmware update?