Re: [openib-general] IBM eHCA testing..
This is basically the answer to why it's so sensitive to which port is plugged in. We're working on a solution to that problem, but currently we only see a chance to change this behaviour by also changing the firmware interface, which needs to be coordinated with firmware development.

Roland Dreier wrote on 10.10.2005 23:44:21:

> IBMEHCA> So you need some kind of signal from the operating system
> IBMEHCA> to system firmware, which in the eHCA case is the
> IBMEHCA> H_DEFINE_AQP1 triggered by ib_create_qp with IB_QPT_GSI
> IBMEHCA> parameter. AFTER that call handshaking between system
> IBMEHCA> firmware and the SM will start, here's a new adapter
> IBMEHCA> active on a switch port... what's your guid? here's your
> IBMEHCA> LID, p_key, SM lid ...and after all that it's
> IBMEHCA> possible to send and receive packets from the fabric.
> IBMEHCA> The openib stack expects that a port is fully functional
> IBMEHCA> after this create_qp returns, and starts to do all sorts
> IBMEHCA> of modify QP and post send. So the only choice we have
> IBMEHCA> there is to delay create_qp until the complete
> IBMEHCA> handshaking between system firmware and the SM has
> IBMEHCA> finished (until we see a IB_PORT_ACTIVE in hcad_mod). If
> IBMEHCA> we don't see that until EHCA_PORT_ACTIVE_TIMEOUT we have
> IBMEHCA> to return an error code to openib, otherwise we're
> IBMEHCA> seriously in trouble (tried that).
>
> I think this scheme breaks the IB model. When consumers get access to an HCA, they expect to be able to access the HCA, even if an SM has not configured it (and even in the case no cable is connected). As an example of why this is useful, if the link won't come up, it's nice to be able to get to query the port's PMA counters to see if there are excessive errors or something like that.
>
> I understand that you don't want to have all HCAs always visible to the SM, but the scheme you've chosen puts an unneeded dependency between driver initialization and the external SM. It would be fine if creating QP1 triggered the transition of the port from DOWN to INIT so that it is discoverable by the SM, but there's no reason for creation of QP1 to wait to finish until the SM has brought the port up.
>
> (As a side note, Mellanox HCAs don't bring a port to INIT until the host driver has transitioned QP0 to the RTR state, which seems more sensible than waiting for QP1 to be created)
>
> I hope this can be fixed in firmware with your current HCA hardware.
>
> - R.

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
Re: [openib-general] IBM eHCA testing..
What is the turnaround time on a firmware change? If we can get an update, I think that would be the best solution. I'll be happy to test this.

On Wed, Oct 12, 2005 at 11:36:59AM +0200, IBMEHCA DD wrote:
> This is basically the answer why its so sensitive which port is plugged. We're working on a solution to that problem. But currently we only see a chance to change this behaviour by also changing the firmware interface, which needs to be coordinated with firmware development.
>
> Roland Dreier wrote on 10.10.2005 23:44:21:
>
> > IBMEHCA> So you need some kind of signal from the operating system
> > IBMEHCA> to system firmware, which in the eHCA case is the
> > IBMEHCA> H_DEFINE_AQP1 triggered by ib_create_qp with IB_QPT_GSI
> > IBMEHCA> parameter. AFTER that call handshaking between system
> > IBMEHCA> firmware and the SM will start, here's a new adapter
> > IBMEHCA> active on a switch port... what's your guid? here's your
> > IBMEHCA> LID, p_key, SM lid ...and after all that it's
> > IBMEHCA> possible to send and receive packets from the fabric.
> > IBMEHCA> The openib stack expects that a port is fully functional
> > IBMEHCA> after this create_qp returns, and starts to do all sorts
> > IBMEHCA> of modify QP and post send. So the only choice we have
> > IBMEHCA> there is to delay create_qp until the complete
> > IBMEHCA> handshaking between system firmware and the SM has
> > IBMEHCA> finished (until we see a IB_PORT_ACTIVE in hcad_mod). If
> > IBMEHCA> we don't see that until EHCA_PORT_ACTIVE_TIMEOUT we have
> > IBMEHCA> to return an error code to openib, otherwise we're
> > IBMEHCA> seriously in trouble (tried that).
Re: Re: Re: [openib-general] IBM eHCA testing..
Hello Troy,

this morning I looked in detail into the problem you reported on Oct 10 via the OpenIB mailing list [1]. It seems that the kernel panic is an IPoIB issue.

[1]: http://openib.org/pipermail/openib-general/2005-October/012353.html

The following things happen:

1. modprobe hcad_mod ehca_nr_ports=1
   The eHCA InfiniBand Device Driver is loaded.

2. modprobe ib_mad
   The ib_mad stack creates an AQP1. This starts the port activation process. By my count it will take more than 110 / 120 seconds to activate a port. Our device driver gets a timeout, which means that the port is NOT active and ib_modify_qp will not work (for any QP, no matter whether it was created in the ib_mad stack or in the ib_ipoib stack).

3. modprobe ib_ipoib
   All resources for IPoIB are allocated (CQ, QPs, MR, etc.).

4. A user runs ifconfig ib0 xxx.xxx.xxx.xxx, which executes the following functions: ipoib_open -> ipoib_ib_dev_open -> ipoib_qp_create. The user should see the following error message:

   l2:/home/schickhj/ibt/linstack/ehca2/ehca2 # ifconfig ib0 192.168.8.8
   SIOCSIFFLAGS: Invalid argument

5. The function ipoib_qp_create modifies the QP from Reset to Init to RTR to RTS. If one of these three ib_modify_qp calls doesn't work, the IPoIB QP (priv->qp) is destroyed (by the ipoib_qp_create error routine / out_fail) and priv->qp is set to NULL.
   --> see /src/linux-kernel/infiniband/ulp/ipoib/ipoib_verbs.c, function ipoib_qp_create

6. A user runs (again) ifconfig ib0 xxx.xxx.xxx, which executes (again) the following functions: ipoib_open -> ipoib_ib_dev_open -> ipoib_qp_create

7. ipoib_qp_create wants to modify the IPoIB QP (priv->qp), which is NULL because the QP was destroyed earlier by the error handling routine in ipoib_qp_create (see 5.)

I think this error could also show up on Mellanox based IB cards when ib_modify_qp fails in ipoib_qp_create.

In dmesg you should see:

(see 1.)
eHCA Infiniband Device Driver (Rel.: )
xics_enable_irq: irq=9029: ibm_int_on returned fffd
eHCA Infiniband Device Driver (Rel.: )

(see 2.)
PU 000b0078:ehca_define_sqp HCAD_ERROR Port 1 is not active.
PU 000b0387:ehca_create_qp HCAD_ERROR ehca_define_sqp() failed rc=
PU 000b03ae:ehca_create_qp failed ret=ffea
ib_mad: Couldn't create ib_mad QP1
ib_mad: Couldn't open ehca0 port 1
PU0001 00060103:ehca_parse_ec EHCA port 1 is available.
PU 000b00bd:plpar_hcall_7arg_7ret HCAD_ERROR HCALL77_IN r3=168 r4=100100050304 r5=2001002c r6=8a40 3ed48000 r8=0 r9=0 r10=0
PU 000b00c4:plpar_hcall_7arg_7ret HCAD_ERROR HCALL77_OUT r3=ffd3 r4=0 r5=0 r6=0 r7=4 r8=0 r9=8005aa18 r10=0

(see 4.)
PU 000b0564:internal_modify_qp HCAD_ERROR hipz_h_modify_qp() failed rc=ffd3 ehca_qp=c3ba4e00 qp_num=2c
ib0: failed to modify QP to init, ret = -22
ib0: ipoib_qp_create returned -22

Mit freundlichen Gruessen / Kind Regards

Heiko Joerg Schick
IBM Deutschland Entwicklung GmbH
I/Ox Microcode Development
Linux Infiniband Device Drivers
Schoenaicher Str. 220
71032 Boeblingen
E-Mail: [EMAIL PROTECTED]
External: 49-7031-16-0 x4219, t/l: 120-4219
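Editor's sketch: steps 5 to 7 above can be modelled in a few lines of user-space C. This is only an illustration under stated assumptions, not the IPoIB code: every `mock_*` name is hypothetical, the QP is a plain heap object, and "modify QP" simply fails with -22 while the port is inactive. The NULL check at the top of `mock_qp_create_old()` marks the spot where, according to the analysis above, the second `ifconfig` call runs into the NULL `priv->qp`.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the real verbs objects; not the kernel types. */
enum mock_qp_state { MOCK_RESET, MOCK_INIT, MOCK_RTR, MOCK_RTS };

struct mock_qp {
    enum mock_qp_state state;
};

struct mock_priv {
    struct mock_qp *qp;  /* plays the role of priv->qp */
    int port_active;     /* 0 while port activation has not finished */
};

/* Mimics a modify-QP call that fails with -22 while the port is inactive. */
static int mock_modify_qp(struct mock_priv *priv, enum mock_qp_state next)
{
    if (!priv->port_active)
        return -22;
    priv->qp->state = next;
    return 0;
}

/*
 * Mimics the error path described in step 5: on failure the QP is
 * destroyed and the pointer is cleared. The NULL check is an added
 * guard; without it the second call would dereference NULL (step 7).
 */
static int mock_qp_create_old(struct mock_priv *priv)
{
    static const enum mock_qp_state steps[] = { MOCK_INIT, MOCK_RTR, MOCK_RTS };
    size_t i;

    if (priv->qp == NULL)
        return -22;

    for (i = 0; i < sizeof(steps) / sizeof(steps[0]); i++) {
        if (mock_modify_qp(priv, steps[i]) != 0) {
            free(priv->qp);
            priv->qp = NULL;
            return -22;
        }
    }
    return 0;
}
```

Calling `mock_qp_create_old()` twice on an inactive port returns -22 both times; the first call leaves `priv->qp` NULL, which is the state the second call in the report trips over.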
Re: Re: Re: [openib-general] IBM eHCA testing..
Hi Heiko,

On Tue, 2005-10-11 at 08:43, Heiko J Schick wrote:
> this morning I've looked in detail into the problem you've reported on Oct 10 via the OpenIB mailing-list [1]. It seems that the kernel panic is an IPoIB issue.
>
> [1]: http://openib.org/pipermail/openib-general/2005-October/012353.html
>
> The following things happen:
>
> 1. modprobe hcad_mod ehca_nr_ports=1
>    The eHCA InfiniBand Device Driver is loaded.
>
> 2. modprobe ib_mad
>    The ib_mad stack creates an AQP1. This will start the port activation process. By my count it will take more than 110 / 120 seconds to activate a port. Our device driver gets a timeout, which means that the port is NOT active and ib_modify_qp will not work (for any QP, doesn't matter if it was created in the ib_mad stack or in the ib_ipoib stack).

Where does this time to activate a port come from? Is there some maximum time in which the eHCA firmware requires this to be completed?

-- Hal
Re: Re: Re: [openib-general] IBM eHCA testing..
Hello Hal,

normally the timeout is set to 30 seconds. If you need more information about the activation please see [1].

[1]: http://openib.org/pipermail/openib-general/2005-October/012355.html

Mit freundlichen Gruessen / Kind Regards

Heiko Joerg Schick
IBM Deutschland Entwicklung GmbH
I/Ox Microcode Development
Linux Infiniband Device Drivers
Schoenaicher Str. 220
71032 Boeblingen
E-Mail: [EMAIL PROTECTED]
External: 49-7031-16-0 x4219, t/l: 120-4219

Hal Rosenstock [EMAIL PROTECTED]
11.10.2005 14:48
To: Heiko J Schick/Germany/[EMAIL PROTECTED]
cc: openib-general@openib.org, Christoph Raisch/Germany/[EMAIL PROTECTED]
Subject: Re: Re: Re: [openib-general] IBM eHCA testing..

> Hi Heiko,
>
> On Tue, 2005-10-11 at 08:43, Heiko J Schick wrote:
> > this morning I've looked in detail into the problem you've reported on Oct 10 via the OpenIB mailing-list [1]. It seems that the kernel panic is an IPoIB issue.
> >
> > [1]: http://openib.org/pipermail/openib-general/2005-October/012353.html
> >
> > The following things happen:
> >
> > 1. modprobe hcad_mod ehca_nr_ports=1
> >    The eHCA InfiniBand Device Driver is loaded.
> >
> > 2. modprobe ib_mad
> >    The ib_mad stack creates an AQP1. This will start the port activation process. By my count it will take more than 110 / 120 seconds to activate a port. Our device driver gets a timeout, which means that the port is NOT active and ib_modify_qp will not work (for any QP, doesn't matter if it was created in the ib_mad stack or in the ib_ipoib stack).
>
> Where does this time to activate a port come from? Is there some maximum time in which the eHCA firmware requires this to be completed?
>
> -- Hal
Re: Re: Re: [openib-general] IBM eHCA testing..
Hi again Heiko,

On Tue, 2005-10-11 at 09:21, Heiko J Schick wrote:
> Hello Hal,
>
> normally the timeout is set to 30 seconds.

Why does there need to be a timeout for this? There is no time defined in the IB spec for activating a port. The SM may or may not be up and it is implementation specific when it activates any particular port.

> If you need more information about the activation please see [1].
>
> [1]: http://openib.org/pipermail/openib-general/2005-October/012355.html

Yes, I saw that post yesterday.

-- Hal
Re: Re: Re: [openib-general] IBM eHCA testing..
Hi again Heiko,

On Tue, 2005-10-11 at 09:21, Heiko J Schick wrote:
> Hello Hal,
>
> normally the timeout is set to 30 seconds.

One more thing: How can the timeout be adjusted? Is it a module parameter?

-- Hal
Re: Re: Re: [openib-general] IBM eHCA testing..
The IB stack doesn't handle errors during client initialization. This problem is easy to reproduce by inducing errors (resource allocation failure or query failure) in mad_client or sa_client registration. I am working on a patch, but I am in class the whole week and don't have time to verify it. I hope the patch to fix the panic will be available early next week.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
Re: [openib-general] IBM eHCA testing..
    Heiko> 7. ipoib_qp_create wants to modify the IPoIB QP (priv->qp)
    Heiko> which is NULL, because the QP was destroyed earlier in time
    Heiko> by the error handling routine in ipoib_qp_create (see 5.)

    Heiko> I think this error could also show up on Mellanox based IB
    Heiko> cards when ib_modify_qp fails in ipoib_qp_create.

Yes, this is a bug. I think something like the patch below is needed -- ipoib_qp_create() should not destroy the QP on failure, since it no longer creates the QP. In fact we should fix the name as well, since creation of the QP has moved elsewhere.

I'll check this in and queue it for 2.6.15.

Thanks,
  Roland

--- infiniband/ulp/ipoib/ipoib_verbs.c	(revision 3707)
+++ infiniband/ulp/ipoib/ipoib_verbs.c	(working copy)
@@ -92,7 +92,7 @@ int ipoib_mcast_detach(struct net_device
 	return ret;
 }
 
-int ipoib_qp_create(struct net_device *dev)
+int ipoib_init_qp(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int ret;
@@ -149,10 +149,11 @@ int ipoib_qp_create(struct net_device *d
 	return 0;
 
 out_fail:
-	ib_destroy_qp(priv->qp);
-	priv->qp = NULL;
+	qp_attr.qp_state = IB_QPS_RESET;
+	if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE))
+		ipoib_warn(priv, "Failed to modify QP to RESET state\n");
 
-	return -EINVAL;
+	return ret;
 }
 
 int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
--- infiniband/ulp/ipoib/ipoib.h	(revision 3707)
+++ infiniband/ulp/ipoib/ipoib.h	(working copy)
@@ -277,7 +277,7 @@ int ipoib_mcast_attach(struct net_device
 int ipoib_mcast_detach(struct net_device *dev, u16 mlid,
 		       union ib_gid *mgid);
 
-int ipoib_qp_create(struct net_device *dev);
+int ipoib_init_qp(struct net_device *dev);
 int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca);
 void ipoib_transport_dev_cleanup(struct net_device *dev);
 
--- infiniband/ulp/ipoib/ipoib_ib.c	(revision 3707)
+++ infiniband/ulp/ipoib/ipoib_ib.c	(working copy)
@@ -387,9 +387,9 @@ int ipoib_ib_dev_open(struct net_device
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int ret;
 
-	ret = ipoib_qp_create(dev);
+	ret = ipoib_init_qp(dev);
 	if (ret) {
-		ipoib_warn(priv, "ipoib_qp_create returned %d\n", ret);
+		ipoib_warn(priv, "ipoib_init_qp returned %d\n", ret);
 		return -1;
 	}
 
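Editor's sketch: the idea behind the patch (keep the QP, put it back into RESET on failure, and pass the real error code up) can be tried out in user space. The code below is an assumption-laden model, not the kernel code: the `sim_*` names are hypothetical, `fail_at` is an invented knob for injecting a failure, and -22 is used because that is the value seen in the reported log.

```c
#include <stdio.h>

/* Hypothetical user-space stand-ins; the real code uses the kernel verbs API. */
enum sim_state { SIM_RESET, SIM_INIT, SIM_RTR, SIM_RTS };

struct sim_qp {
    enum sim_state state;
    int fail_at;  /* state whose transition should fail, or -1 for none */
};

/* Simulated modify-QP: fails with -22 for the injected state. */
static int sim_modify_qp(struct sim_qp *qp, enum sim_state next)
{
    if ((int)next == qp->fail_at)
        return -22;
    qp->state = next;
    return 0;
}

/* Follows the shape of the patched function: the QP is never destroyed. */
static int sim_init_qp(struct sim_qp *qp)
{
    int ret;

    ret = sim_modify_qp(qp, SIM_INIT);
    if (ret)
        goto out_fail;
    ret = sim_modify_qp(qp, SIM_RTR);
    if (ret)
        goto out_fail;
    ret = sim_modify_qp(qp, SIM_RTS);
    if (ret)
        goto out_fail;
    return 0;

out_fail:
    /* Return the QP to RESET so that a later open can simply retry. */
    if (sim_modify_qp(qp, SIM_RESET))
        fprintf(stderr, "Failed to modify QP to RESET state\n");
    return ret;
}
```

With this shape a failed open leaves a valid QP in RESET, so a second open either succeeds or fails cleanly instead of touching a destroyed object.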
Re: [openib-general] IBM eHCA testing..
On Tue, Oct 11, 2005 at 09:13:20AM -0700, Shirley Ma wrote:
> The IB stack doesn't handle errors during client initialization. This problem is easy to reproduce by inducing errors (resource allocation failure or query failure) in mad_client or sa_client registration. I am working on a patch, but I am in class the whole week, don't have time to verify the patch. I hope the patch will be available early next week to fix the panic.

I'd be happy to verify the patch, but I need to get the latest version of the ehca driver, ideally already integrated into the subversion tree. Otherwise a tar.gz I can extract and drop in drivers/infiniband/hw/ehca would work just fine.

I'm still not sure I got an answer as to why the ehca is so sensitive to which port is plugged in.
Re: [openib-general] IBM eHCA testing..
This is caused by a complex interaction of ib_mad, hcad_mod and pSeries firmware.

As you might already have noticed, an eHCA doesn't show up as a port but as a switch in the fabric. The reason for this is partition support and virtualisation in InfiniBand. If you want to give each partition in a system its own IB adapter, it has to have its own LID(s) and therefore its own GUIDs. The IB standard currently allows only one way to accomplish this: you need a switch with multiple adapters behind it. So that's exactly how the eHCA shows up in the fabric. In our case system firmware handles the SMA traffic for that switch and for all adapters (running an SMA or SM on QP0 is currently not supported).

This brings up another problem: you definitely won't want to allocate LIDs for all potentially possible operating system partitions (not to be confused with IB partitioning), otherwise you could come close to the 48000 LIDs/subnet limit pretty quickly.

So you need some kind of signal from the operating system to system firmware, which in the eHCA case is the H_DEFINE_AQP1 triggered by ib_create_qp with the IB_QPT_GSI parameter. AFTER that call, handshaking between system firmware and the SM will start: "here's a new adapter active on a switch port", "what's your guid?", "here's your LID, p_key, SM lid", and after all that it's possible to send and receive packets from the fabric.

The openib stack expects that a port is fully functional after this create_qp returns, and starts to do all sorts of modify QP and post send. So the only choice we have there is to delay create_qp until the complete handshaking between system firmware and the SM has finished (until we see an IB_PORT_ACTIVE in hcad_mod). If we don't see that until EHCA_PORT_ACTIVE_TIMEOUT, we have to return an error code to openib, otherwise we're seriously in trouble (tried that).

Shirley already pointed out on the mailing list that ib_mad and others have different recovery depending on the success of ib_create_qp(IB_QPT_GSI); in particular, ib_mad decides that the best thing is to kill the complete adapter if that call fails on a single port.

So that's the full explanation of ehca_nr_ports, and it hopefully answers your question.

Troy Benjegerdes [EMAIL PROTECTED]
08.10.2005 04:03
To: Shirley Ma [EMAIL PROTECTED]
cc: Pradeep Satyanarayana [EMAIL PROTECTED], Troy Benjegerdes [EMAIL PROTECTED], IBMEHCA DD/Germany/[EMAIL PROTECTED], openib-general@openib.org, [EMAIL PROTECTED]
Subject: Re: [openib-general] IBM eHCA testing..

> On Fri, Oct 07, 2005 at 09:33:27AM -0700, Shirley Ma wrote:
> > Hi, Troy,
> >
> > There is INSTALL file in the EHCA driver package. In OpenPower 720 port 1 is at the top, port 2 is at the bottom. In P570, port1 is at the bottom, port2 is at the top.
>
> Okay, I guess I should read more carefully ;)
>
> What is the issue with needing to use port 1? Can that be fixed in the driver, or does that need a firmware update?
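Editor's sketch: the "delay create_qp until the port is active or a timeout expires" behaviour described above amounts to a bounded polling loop. The model below is a hedged illustration only: `fake_port`, `fake_port_is_active()` and `wait_for_port_active()` are hypothetical names, time is a simulated counter in whole seconds, and nothing here reflects how hcad_mod actually waits.

```c
#include <stdbool.h>

/* Hypothetical port model: the port turns active after `active_after` seconds. */
struct fake_port {
    unsigned int active_after;  /* how long the SM handshake takes */
};

static bool fake_port_is_active(const struct fake_port *port, unsigned int now)
{
    return now >= port->active_after;
}

/*
 * Polls once per simulated second until the port is active or the
 * timeout has passed. Returns 0 on success and -1 on timeout, which
 * stands in for "return an error code to openib".
 */
static int wait_for_port_active(const struct fake_port *port,
                                unsigned int timeout_s)
{
    unsigned int t;

    for (t = 0; t <= timeout_s; t++) {
        if (fake_port_is_active(port, t))
            return 0;
    }
    return -1;
}
```

Using figures quoted elsewhere in this thread (a 20 second wait in the driver, a port active event after roughly 45 seconds with OpenSM), `wait_for_port_active()` with a 20 second limit times out, which is the failure that ib_mad then reacts to.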
Re: Re: [openib-general] IBM eHCA testing..
Hello Troy,

below you will find our preliminary analysis of the problem you reported on Oct 10 via the OpenIB mailing list [1]:

[1]: http://openib.org/pipermail/openib-general/2005-October/012353.html

> [ 381.453731] eHCA Infiniband Device Driver (Rel.: EHCA2_0025)
> [ 381.458602] xics_enable_irq: irq=36868: ibm_int_on returned fffd
> [ 393.378143] eHCA Infiniband Device Driver (Rel.: EHCA2_0025)
> [ 452.658083] PU0002 000b0075:ehca_define_sqp HCAD_ERROR Port 1 is not active.
> [ 452.658106] PU0002 000b0383:ehca_create_qp HCAD_ERROR ehca_define_sqp() failed rc=
> [ 452.821917] PU0002 000b03aa:ehca_create_qp failed ret=ffea
> [ 452.821939] ib_mad: Couldn't create ib_mad QP1
> [ 453.313412] ib_mad: Couldn't open ehca0 port 1
> [ 475.132318] PU0002 00060100:ehca_parse_ec EHCA port 1 is available.
> [ 518.249381] PU0007 000b00b9:plpar_hcall_7arg_7ret HCAD_ERROR HCALL77_IN r3=168 r4=1304 r5=2008 r6=8a40 r7=1e4e49000 r8=0 r9=0 r10=0
> [ 518.249411] PU0007 000b00c0:plpar_hcall_7arg_7ret HCAD_ERROR HCALL77_OUT r3=ffd3 r4=0 r5=0 r6=0 r7=4 r8=0 r9=8005aa18 r10=0
> [ 518.249438] PU0007 000b0560:internal_modify_qp HCAD_ERROR hipz_h_modify_qp() failed rc=ffd3 ehca_qp=cf2cd080 qp_num=8
> [ 518.249460] ib0: failed to modify QP to init, ret = -22
> [ 518.418976] ib0: ipoib_qp_create returned -22
> [ 528.813491] Oops: Kernel access of bad area, sig: 11 [#1]
> [ 528.813505] SMP NR_CPUS=8 NUMA PSERIES LPAR
> [ 528.813517] Modules linked in: ib_ipoib ib_sa ib_mad hcad_mod ib_core ebus
> [ 528.813540] NIP: D0049C6C XER: 2020 LR: D00760A0 CTR: D0049C60
> [ 528.813554] REGS: cf1eb1d0 TRAP: 0300 Not tainted (2.6.13.3-power5)
> [ 528.813568] MSR: 80009032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11 CR: 22028422
> [ 528.813580] DAR: DSISR: 4000
> [ 528.813592] TASK: c209a9a0[2021] 'ifconfig' THREAD: cf1e8000 CPU: 0
> [ 528.813605] GPR00: D00760A0 CF1EB450 D005FFF0
> [ 528.813625] GPR04: CF1EB548 0071 CF1EB540 0001
> [ 528.813645] GPR08: 000B 0001 0004 D0049C60
> [ 528.813664] GPR12: D00774C0 C04B4000 100C 100A
> [ 528.813685] GPR16: 1002 1002
> [ 528.813704] GPR20: 1001E71C C001E466C000 8914 C001E46D4810
> [ 528.813725] GPR24: C001E46D4800 CF43B900 CF1EBD10 0002
> [ 528.813745] GPR28: C001E466C380 D0084640 CF1EB548
> [ 528.813768] NIP [d0049c6c] .ib_modify_qp+0xc/0x40 [ib_core]
> [ 528.813797] LR [d00760a0] .ipoib_qp_create+0xe0/0x1c0 [ib_ipoib]
> [ 528.813822] Call Trace:
> [ 528.813829] [cf1eb450] [434849c5] 0x434849c5 (unreliable)
> [ 528.813846] [cf1eb4d0] [d00760a0] .ipoib_qp_create+0xe0/0x1c0 [ib_ipoib]
> [ 528.813873] [cf1eb5f0] [d007261c] .ipoib_ib_dev_open+0x2c/0x120 [ib_ipoib]
> [ 528.813899] [cf1eb680] [d006f38c] .ipoib_open+0x7c/0x190 [ib_ipoib]
> [ 528.813923] [cf1eb720] [c032a650] .dev_open+0xc0/0x120
> [ 528.813942] [cf1eb7c0] [c0328c70] .dev_change_flags+0x180/0x1c0
> [ 528.813961] [cf1eb860] [c037a02c] .devinet_ioctl+0x81c/0x850
> [ 528.813980] [cf1eb970] [c037a65c] .inet_ioctl+0x27c/0x2d0
> [ 528.813998] [cf1eba00] [c031bc4c] .sock_ioctl+0x8c/0x440
> [ 528.814016] [cf1ebaa0] [c00c22f0] .do_ioctl+0x60/0x120
> [ 528.814033] [cf1ebb40] [c00c244c] .vfs_ioctl+0x9c/0x4d0
> [ 528.814050] [cf1ebbf0] [c00c28cc] .sys_ioctl+0x4c/0xa0
> [ 528.814066] [cf1ebca0] [c001bb24] .dev_ifsioc+0x84/0x390
> [ 528.814084] [cf1ebd70] [c00e4d88] .compat_sys_ioctl+0x158/0x500
> [ 528.814103] [cf1ebe30] [c000d300] syscall_exit+0x0/0x18
> [ 528.814119] Instruction dump:
> [ 528.814126] 7c601b78 38210080 7c030378 e8010010 7c0803a6 4e800020 6000 6000
> [ 528.814150] 6000 7c0802a6 f8010010 f821ff81 e923 e9490170 e80a f8410028
> [ 528.814174] 7RTAS: event: 3, Type: Platform Error, Severity: 2

It looks like IPoIB uses resources which have already been freed.

We don't receive a port active event for port 1 in time (after 20 seconds). This means that the ib_mad stack tries to create an AQP1. Here, our eHCA InfiniBand Device Driver waits for a maximum of 20 seconds for a port active event. It seems that with the usage of OpenSM we will receive the port active event after ca. 45 seconds.

For the MAD and IPoIB stack this means the following:

MAD:

1. No AQP1 QP will exist for port 1, because of the missing port active event.

2. All resources are freed, because of the error handling routines in ib_mad. create_mad_qp reports an error to ib_mad_port_open which destroys all
Re: [openib-general] IBM eHCA testing..
    IBMEHCA> So you need some kind of signal from the operating system
    IBMEHCA> to system firmware, which in the eHCA case is the
    IBMEHCA> H_DEFINE_AQP1 triggered by ib_create_qp with IB_QPT_GSI
    IBMEHCA> parameter. AFTER that call handshaking between system
    IBMEHCA> firmware and the SM will start, here's a new adapter
    IBMEHCA> active on a switch port... what's your guid? here's your
    IBMEHCA> LID, p_key, SM lid ...and after all that it's
    IBMEHCA> possible to send and receive packets from the fabric.

    IBMEHCA> The openib stack expects that a port is fully functional
    IBMEHCA> after this create_qp returns, and starts to do all sorts
    IBMEHCA> of modify QP and post send. So the only choice we have
    IBMEHCA> there is to delay create_qp until the complete
    IBMEHCA> handshaking between system firmware and the SM has
    IBMEHCA> finished (until we see a IB_PORT_ACTIVE in hcad_mod). If
    IBMEHCA> we don't see that until EHCA_PORT_ACTIVE_TIMEOUT we have
    IBMEHCA> to return an error code to openib, otherwise we're
    IBMEHCA> seriously in trouble (tried that).

I think this scheme breaks the IB model. When consumers get access to an HCA, they expect to be able to access the HCA, even if an SM has not configured it (and even in the case no cable is connected). As an example of why this is useful, if the link won't come up, it's nice to be able to get to query the port's PMA counters to see if there are excessive errors or something like that.

I understand that you don't want to have all HCAs always visible to the SM, but the scheme you've chosen puts an unneeded dependency between driver initialization and the external SM. It would be fine if creating QP1 triggered the transition of the port from DOWN to INIT so that it is discoverable by the SM, but there's no reason for creation of QP1 to wait to finish until the SM has brought the port up.

(As a side note, Mellanox HCAs don't bring a port to INIT until the host driver has transitioned QP0 to the RTR state, which seems more sensible than waiting for QP1 to be created)

I hope this can be fixed in firmware with your current HCA hardware.

 - R.
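Editor's sketch: the decoupling proposed above (creating QP1 moves the port from DOWN to INIT and returns at once, while the SM brings the port to ACTIVE whenever it gets around to it) can be written down as a tiny state model. Everything here is hypothetical and for illustration only: the `model_*` names, the three-state enum and the "local query" predicate are assumptions, not eHCA firmware or OpenIB behaviour.

```c
/* Hypothetical port states, named after the IB port states mentioned above. */
enum port_state { PORT_DOWN, PORT_INIT, PORT_ACTIVE };

struct model_port {
    enum port_state state;
    int qp1_created;
};

/* Creating QP1 only makes the port discoverable; it never waits for the SM. */
static int model_create_qp1(struct model_port *port)
{
    port->qp1_created = 1;
    if (port->state == PORT_DOWN)
        port->state = PORT_INIT;
    return 0;
}

/* Runs later, whenever the SM has finished configuring the port. */
static void model_sm_activates(struct model_port *port)
{
    if (port->state == PORT_INIT)
        port->state = PORT_ACTIVE;
}

/* Local queries (for example PMA counters) need the QP, not an active port. */
static int model_can_query_locally(const struct model_port *port)
{
    return port->qp1_created;
}
```

In this model driver initialization never depends on the SM: `model_create_qp1()` always returns immediately, and local queries are possible while the port is still in INIT.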
Re: [openib-general] IBM eHCA testing..
What's the status on getting the ehca driver integrated into subversion? If there's something holding it up, can we at least get a version that can be dropped into drivers/infiniband/hw ?

Also, one final note, is it really appropriate to have ehca/ebus in the infiniband directory? It's really a PPC64 specific driver that works for more than just the ehca device, correct?

I have the correct port plugged in now, and I can see the logical HCA device in the output of 'ibnetdiscover' (from another node), but trying to bring up ib0 caused this:

[ 381.453731] eHCA Infiniband Device Driver (Rel.: EHCA2_0025)
[ 381.458602] xics_enable_irq: irq=36868: ibm_int_on returned fffd
[ 393.378143] eHCA Infiniband Device Driver (Rel.: EHCA2_0025)
[ 452.658083] PU0002 000b0075:ehca_define_sqp HCAD_ERROR Port 1 is not active.
[ 452.658106] PU0002 000b0383:ehca_create_qp HCAD_ERROR ehca_define_sqp() failed rc=
[ 452.821917] PU0002 000b03aa:ehca_create_qp failed ret=ffea
[ 452.821939] ib_mad: Couldn't create ib_mad QP1
[ 453.313412] ib_mad: Couldn't open ehca0 port 1
[ 475.132318] PU0002 00060100:ehca_parse_ec EHCA port 1 is available.
[ 518.249381] PU0007 000b00b9:plpar_hcall_7arg_7ret HCAD_ERROR HCALL77_IN r3=168 r4=1304 r5=2008 r6=8a40 r7=1e4e49000 r8=0 r9=0 r10=0
[ 518.249411] PU0007 000b00c0:plpar_hcall_7arg_7ret HCAD_ERROR HCALL77_OUT r3=ffd3 r4=0 r5=0 r6=0 r7=4 r8=0 r9=8005aa18 r10=0
[ 518.249438] PU0007 000b0560:internal_modify_qp HCAD_ERROR hipz_h_modify_qp() failed rc=ffd3 ehca_qp=cf2cd080 qp_num=8
[ 518.249460] ib0: failed to modify QP to init, ret = -22
[ 518.418976] ib0: ipoib_qp_create returned -22
[ 528.813491] Oops: Kernel access of bad area, sig: 11 [#1]
[ 528.813505] SMP NR_CPUS=8 NUMA PSERIES LPAR
[ 528.813517] Modules linked in: ib_ipoib ib_sa ib_mad hcad_mod ib_core ebus
[ 528.813540] NIP: D0049C6C XER: 2020 LR: D00760A0 CTR: D0049C60
[ 528.813554] REGS: cf1eb1d0 TRAP: 0300 Not tainted (2.6.13.3-power5)
[ 528.813568] MSR: 80009032 EE: 1 PR: 0 FP: 0 ME: 1 IR/DR: 11 CR: 22028422
[ 528.813580] DAR: DSISR: 4000
[ 528.813592] TASK: c209a9a0[2021] 'ifconfig' THREAD: cf1e8000 CPU: 0
[ 528.813605] GPR00: D00760A0 CF1EB450 D005FFF0
[ 528.813625] GPR04: CF1EB548 0071 CF1EB540 0001
[ 528.813645] GPR08: 000B 0001 0004 D0049C60
[ 528.813664] GPR12: D00774C0 C04B4000 100C 100A
[ 528.813685] GPR16: 1002 1002
[ 528.813704] GPR20: 1001E71C C001E466C000 8914 C001E46D4810
[ 528.813725] GPR24: C001E46D4800 CF43B900 CF1EBD10 0002
[ 528.813745] GPR28: C001E466C380 D0084640 CF1EB548
[ 528.813768] NIP [d0049c6c] .ib_modify_qp+0xc/0x40 [ib_core]
[ 528.813797] LR [d00760a0] .ipoib_qp_create+0xe0/0x1c0 [ib_ipoib]
[ 528.813822] Call Trace:
[ 528.813829] [cf1eb450] [434849c5] 0x434849c5 (unreliable)
[ 528.813846] [cf1eb4d0] [d00760a0] .ipoib_qp_create+0xe0/0x1c0 [ib_ipoib]
[ 528.813873] [cf1eb5f0] [d007261c] .ipoib_ib_dev_open+0x2c/0x120 [ib_ipoib]
[ 528.813899] [cf1eb680] [d006f38c] .ipoib_open+0x7c/0x190 [ib_ipoib]
[ 528.813923] [cf1eb720] [c032a650] .dev_open+0xc0/0x120
[ 528.813942] [cf1eb7c0] [c0328c70] .dev_change_flags+0x180/0x1c0
[ 528.813961] [cf1eb860] [c037a02c] .devinet_ioctl+0x81c/0x850
[ 528.813980] [cf1eb970] [c037a65c] .inet_ioctl+0x27c/0x2d0
[ 528.813998] [cf1eba00] [c031bc4c] .sock_ioctl+0x8c/0x440
[ 528.814016] [cf1ebaa0] [c00c22f0] .do_ioctl+0x60/0x120
[ 528.814033] [cf1ebb40] [c00c244c] .vfs_ioctl+0x9c/0x4d0
[ 528.814050] [cf1ebbf0] [c00c28cc] .sys_ioctl+0x4c/0xa0
[ 528.814066] [cf1ebca0] [c001bb24] .dev_ifsioc+0x84/0x390
[ 528.814084] [cf1ebd70] [c00e4d88] .compat_sys_ioctl+0x158/0x500
[ 528.814103] [cf1ebe30] [c000d300] syscall_exit+0x0/0x18
[ 528.814119] Instruction dump:
[ 528.814126] 7c601b78 38210080 7c030378 e8010010 7c0803a6 4e800020 6000 6000
[ 528.814150] 6000 7c0802a6 f8010010 f821ff81 e923 e9490170 e80a f8410028
[ 528.814174] 7RTAS: event: 3, Type: Platform Error, Severity: 2
[openib-general] IBM eHCA testing..
I have two IBM eHCA cards installed and it appears that OpenSM is happily talking to the firmware and bringing up the links.

So now I'm looking at the install instructions for the ehca2_EHCA2_0025.tgz code drop, and wondering what (if any) issues there are with a 2.6.13 kernel, or later OpenIB svn drops.

Is there a later code drop I can get ahold of?

Is the nr_ports issue something in the driver? I wound up connecting to the lower port in the Openpower720 machine.. do you know if that's port 1 or 2?
Re: [openib-general] IBM eHCA testing..
I believe the lower port is port 1. I will defer to the EHCA team as regards to issues with 2.6.13 (if any). We have minimally used both ports on p570. So, my guess is that should work on a Openpower720.

Pradeep
[EMAIL PROTECTED]

[EMAIL PROTECTED] wrote on 10/07/2005 07:12:07 AM:

> I have two IBM eHCA cards installed and it appears that OpenSM is happily talking to the firmware and bringing up the links.
>
> So now I'm looking at the install instructions for the ehca2_EHCA2_0025.tgz code drop, and wondering what (if any) issues there are with a 2.6.13 kernel, or later OpenIB svn drops.
>
> Is there a later code drop I can get ahold of?
>
> Is the nr_ports issue something in the driver? I wound up connecting to the lower port in the Openpower720 machine.. do you know if that's port 1 or 2?
Re: [openib-general] IBM eHCA testing..
Hi, Troy,

There is an INSTALL file in the EHCA driver package. In OpenPower 720, port 1 is at the top and port 2 is at the bottom. In P570, port 1 is at the bottom and port 2 is at the top.

Thanks
Shirley Ma
IBM Linux Technology Center
15300 SW Koll Parkway
Beaverton, OR 97006-6063
Phone(Fax): (503) 578-7638
Re: [openib-general] IBM eHCA testing..
On Fri, Oct 07, 2005 at 09:33:27AM -0700, Shirley Ma wrote:
> Hi, Troy,
>
> There is INSTALL file in the EHCA driver package. In OpenPower 720 port 1 is at the top, port 2 is at the bottom. In P570, port1 is at the bottom, port2 is at the top.

Okay, I guess I should read more carefully ;)

What is the issue with needing to use port 1? Can that be fixed in the driver, or does that need a firmware update?