[openib-general] opensm crash with topspin HCA
When we run the opensm (OFED) release with a Topspin HCA in the IB network, opensm crashes in umad_receiver with a NULL pointer exception. The transaction ID is zero in the MADs from the Topspin HCA on Windows. The crashes seem to occur at random points in umad_receiver.

HCA found:
hca_id=InfiniHost0
vendor_id=0x02C9
vendor_part_id=0x5A44
hw_ver=0xA0
fw_ver=0x40006

___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
[openib-general] CM and REP handling
In the current communication manager (CM) implementation, how is a lost REP MAD handled? When the REP gets lost, cm_dup_req_handler gets called, which currently falls into the default case and does nothing. The client retries the number of times it is configured to and then fails. If the first REP gets lost, the connection never gets established. So what should the behavior be?

-Viswa
[openib-general] Disabling end-to-end flow control
Is there a way to disable end-to-end flow control using any of the APIs?

Thanks,
-Viswa
Re: [openib-general] opensm and NPTL
I am using the trunk. Should I be using 1.0?

-Viswa

On 13 Jun 2006 12:35:17 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
> On Tue, 2006-06-13 at 12:21, Viswanath Krishnamurthy wrote:
> > Yes.. I want to test the waters again and see if the issues went away.
>
> Are you using the trunk or 1.0 ?
>
> -- Hal
>
> > -Viswa
> >
> > On 13 Jun 2006 06:15:34 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
> > > Hi Viswa,
> > >
> > > On Mon, 2006-06-12 at 23:16, Viswanath Krishnamurthy wrote:
> > > > There were some issues with opensm running with NPTL (thread
> > > > library). Have the issues been resolved ?
> > >
> > > There were some fixes to the signal handling which went in back in the
> > > Feb/early March time frame. OpenSM should be better with NPTL now. Is it
> > > working for you or are you asking before stepping into these waters
> > > again ?
> > >
> > > -- Hal
> > >
> > > > Regards,
> > > > Viswa
Re: [openib-general] opensm and NPTL
Yes.. I want to test the waters again and see if the issues went away.

-Viswa

On 13 Jun 2006 06:15:34 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
> Hi Viswa,
>
> On Mon, 2006-06-12 at 23:16, Viswanath Krishnamurthy wrote:
> > There were some issues with opensm running with NPTL (thread
> > library). Have the issues been resolved ?
>
> There were some fixes to the signal handling which went in back in the
> Feb/early March time frame. OpenSM should be better with NPTL now. Is it
> working for you or are you asking before stepping into these waters
> again ?
>
> -- Hal
>
> > Regards,
> > Viswa
[openib-general] opensm and NPTL
There were some issues with opensm running with NPTL (thread library). Have the issues been resolved?

Regards,
Viswa
Re: [openib-general] Fix for ibping
Works like a charm...

-Viswa

On 12 Apr 2006 21:32:33 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
> On Wed, 2006-04-12 at 20:46, Hal Rosenstock wrote:
> > On Wed, 2006-04-12 at 18:25, Viswanath Krishnamurthy wrote:
> > > The RMPP version needs to be 1.
> >
> > Thanks. I'm not sure what changed here to require this. I need to do
> > some more digging.
>
> I figured it out. The fix is in r6448. Can you update and try it ?
>
> Thanks.
>
> -- Hal
>
> > -- Hal
> >
> > > [EMAIL PROTECTED] src]# svn diff ibping.c
> > > Index: ibping.c
> > > ===
> > > --- ibping.c (revision 6446)
> > > +++ ibping.c (working copy)
> > > @@ -336,7 +336,7 @@
> > >                 exit(0);
> > >         }
> > >
> > > -       if (mad_register_client(ping_class, 0) < 0)
> > > +       if (mad_register_client(ping_class, 1) < 0)
> > >                 IBERROR("can't register to ping class %d", ping_class);
> > >
> > >         if (ib_resolve_portid_str(&portid, argv[0], dest_type, sm_id) < 0)
Re: [openib-general] Fix for ibping
The mad_register_agent function in the kernel file mad.c was checking rmpp_version. That check was failing, and the failure was propagated to umad (through the ioctl).

On 12 Apr 2006 20:46:33 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
> On Wed, 2006-04-12 at 18:25, Viswanath Krishnamurthy wrote:
> > The RMPP version needs to be 1.
>
> Thanks. I'm not sure what changed here to require this. I need to do
> some more digging.
>
> -- Hal
>
> > [EMAIL PROTECTED] src]# svn diff ibping.c
> > Index: ibping.c
> > ===
> > --- ibping.c (revision 6446)
> > +++ ibping.c (working copy)
> > @@ -336,7 +336,7 @@
> >                 exit(0);
> >         }
> >
> > -       if (mad_register_client(ping_class, 0) < 0)
> > +       if (mad_register_client(ping_class, 1) < 0)
> >                 IBERROR("can't register to ping class %d", ping_class);
> >
> >         if (ib_resolve_portid_str(&portid, argv[0], dest_type, sm_id) < 0)
[openib-general] Fix for ibping
The RMPP version needs to be 1.

[EMAIL PROTECTED] src]# svn diff ibping.c
Index: ibping.c
===
--- ibping.c (revision 6446)
+++ ibping.c (working copy)
@@ -336,7 +336,7 @@
                exit(0);
        }

-       if (mad_register_client(ping_class, 0) < 0)
+       if (mad_register_client(ping_class, 1) < 0)
                IBERROR("can't register to ping class %d", ping_class);

        if (ib_resolve_portid_str(&portid, argv[0], dest_type, sm_id) < 0)
[openib-general] ibping broken in SVN 6446 ?
When I run ibping I get an error (on a 32-bit machine).

Linux kernel: 2.6.16
infiniband directory replaced with SVN 6446

When I enable debug in umad.c, I get the following error. The ioctl call to the umad driver (umad device) is failing: the return value of the ioctl is -1, and errno is -22 (EINVAL).

portid 0 registering qp 1 class 50 version 1 failed:
ibping: iberror: can't register to ping class 50

-Viswa
Re: [openib-general] Mainline 2.6.16 kernel with openib userland libraries
My guess is that the bug is in the userspace library, since a kernel module which uses the same APIs in kernel mode works fine. I will work on the sample code and send it..

-Viswa

On 3/27/06, Roland Dreier <[EMAIL PROTECTED]> wrote:
>     Roland> Did this code work with mainline kernel 2.6.15? If so you
>     Roland> could do a bisection on the changes between 2.6.15 and
>     Roland> 2.6.16 to pin down which patch broke things.
>
> Just to be clear: the thing to check would be the same userspace code
> on kernel 2.6.15 and 2.6.16. Because it's entirely possible that the
> bug is in a userspace library.
>
>  - R.
[openib-general] Mainline 2.6.16 kernel with openib userland libraries
I tried using the openib userland libraries with the mainline 2.6.16 kernel but ran into a strange problem. A userland application that uses the CM and VERBS libraries, and which works fine with earlier releases, stopped working with no error (in the APIs). When I put the analyser on, I see that the CM connect sequence is fine, but when ibv_post_send is called (RC send) the DLID field in the LRH header is zero, causing the packet to be dropped.

I tried the mainline 2.6.16 kernel (with the IB stack from the kernel tree) and these openib userland libraries:

[EMAIL PROTECTED] 216GEN2]# svn info
Path: .
URL: https://openib.org/svn/gen2/trunk
Repository UUID: 21a7a0b7-18d7-0310-8e21-e8b31bdbf5cd
Revision: 5989
Node Kind: directory
Schedule: normal
Last Changed Author: halr
Last Changed Rev: 5989
Last Changed Date: 2006-03-23 10:17:02 -0800 (Thu, 23 Mar 2006)

Any idea about the compatibility issues?

Thanks,
Viswa
[openib-general] mthca and coalesced ACK
When the HCA receives a back-to-back RDMA write followed by RDMA read requests, it generates a coalesced ACK (an implicit ACK for the RDMA write). Is there a configuration option in the mthca driver that will make the HCA firmware generate individual ACKs? I am trying to debug another issue and this would be helpful.

Thanks,
Viswa
[openib-general] Getting the right userspace libraries
How does one pull out the correct userland libraries for the 2.6.16 kernel IB stack? Is it to look at the SVN revision number in the driver code and pull that version?

-Viswa
[openib-general] mthca and non-MSI system
Has the mthca driver been tested on a non-MSI (interrupt) system? I seem to have a problem where interrupts are not generated on a non-MSI system, with the following message:

"NOP command failed to generate interrupt (IRQ 9), aborting."
BIOS or ACPI interrupt routing problem?

-Viswa
[openib-general] Vendor specific MAD support
Does the OpenIB Gen2 stack umad/mad library support vendor-specific MAD extensions?

-Viswa
[openib-general] mthca error ?
Roland,

I see the following when I use the latest mthca driver on a different HCA card:

[ 193.882759] ib_mthca: Initializing :03:00.0
[ 193.887546] ib_mthca :03:00.0: Found bridge: :02:0c.0
[ 194.894937] ib_mthca :03:00.0: SYS_EN DDR error: syn=4, sock=0, sladdr=0, SPD source=DIMM
[ 194.903781] ib_mthca :03:00.0: SYS_EN returned status 0x07, aborting.
[ 194.910823] ib_mthca: probe of :03:00.0 failed with error -22

lspci output:

:03:00.0 InfiniBand: Mellanox Technology MT23108 InfiniHost (rev a1)

Any idea what the error is?

Thanks,
Viswa
Re: [openib-general] opensm and faulty hardware
Hal,

Thanks.. works like a charm...

-Viswa

On 27 Sep 2005 16:13:01 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
> On Tue, 2005-09-27 at 16:00, Viswanath Krishnamurthy wrote:
> > Hal,
> >
> > I added a hack now to get around the problem. There needs to be a
> > proper fix later..
>
> Can you try this instead ? Thanks.
>
> -- Hal
>
> Index: include/opensm/osm_port.h
> ===
> --- include/opensm/osm_port.h (revision 3567)
> +++ include/opensm/osm_port.h (working copy)
> @@ -346,7 +346,7 @@ osm_physp_is_healthy(
>  *      Returns TRUE if the Physical Port has been maked as healthy
>  *      FALSE otherwise.
>  *      All physical ports are initialized as "healthy" but may be marked
> -*      otherwise if a received trap claims otherwise.
> +*      otherwise if a received trap claims otherwise.
>  *
>  * NOTES
>  *
> @@ -456,6 +456,42 @@ osm_physp_set_port_info(
>  *      Port, Physical Port
>  */
>
> +/f* OpenSM: Physical Port/osm_physp_validate_base_lid
> +* NAME
> +*      osm_physp_validate_base_lid
> +*
> +* DESCRIPTION
> +*      Validates the base LID in the Physical Port object.
> +*
> +* SYNOPSIS
> +*/
> +static inline boolean_t
> +osm_physp_validate_base_lid(
> +  IN osm_physp_t* const p_physp )
> +{
> +  CL_ASSERT( osm_physp_is_valid( p_physp ) );
> +  if ( cl_ntoh16( p_physp->port_info.base_lid ) > IB_LID_UCAST_END_HO )
> +  {
> +    p_physp->port_info.base_lid = 0;
> +    return FALSE;
> +  }
> +  return TRUE;
> +}
> +/*
> +* PARAMETERS
> +*      p_physp
> +*              [in] Pointer to an osm_physp_t object.
> +*
> +* RETURN VALUES
> +*      Returns TRUE if the base LID in the Physical port object is valid.
> +*      FALSE otherwise.
> +*
> +* NOTES
> +*
> +* SEE ALSO
> +*      Port, Physical Port
> +*/
> +
>  /f* OpenSM: Physical Port/osm_physp_set_pkey_tbl
>  * NAME
>  *      osm_physp_set_pkey_tbl
> Index: opensm/osm_port_info_rcv.c
> ===
> --- opensm/osm_port_info_rcv.c (revision 3579)
> +++ opensm/osm_port_info_rcv.c (working copy)
> @@ -346,8 +346,12 @@ __osm_pi_rcv_process_switch_port(
>
>    if (port_num == 0)
>    {
> -    /* This is a management port 0 */
> -    __osm_pi_rcv_process_endport(p_rcv, p_physp, p_pi);
> +    /* This is switch management port 0 */
> +    if ( !osm_physp_validate_base_lid( p_physp ) )
> +      osm_log( p_rcv->p_log, OSM_LOG_ERROR,
> +               "__osm_pi_rcv_process_switch_port: ERR 0F04: "
> +               "Invalid base LID corrected.\n" );
> +    __osm_pi_rcv_process_endport(p_rcv, p_physp, p_pi);
>    }
>
>    OSM_LOG_EXIT( p_rcv->p_log );
> @@ -367,6 +371,10 @@ __osm_pi_rcv_process_ca_port(
>    UNUSED_PARAM( p_node );
>
>    osm_physp_set_port_info( p_physp, p_pi );
> +  if ( !osm_physp_validate_base_lid( p_physp ) )
> +    osm_log( p_rcv->p_log, OSM_LOG_ERROR,
> +             "__osm_pi_rcv_process_ca_port: ERR 0F08: "
> +             "Invalid base LID corrected.\n" );
>
>    __osm_pi_rcv_process_endport(p_rcv, p_physp, p_pi);
>
> @@ -390,6 +398,10 @@ __osm_pi_rcv_process_router_port(
>      Update the PortInfo attribute.
>    */
>    osm_physp_set_port_info( p_physp, p_pi );
> +  if ( !osm_physp_validate_base_lid( p_physp ) )
> +    osm_log( p_rcv->p_log, OSM_LOG_ERROR,
> +             "__osm_pi_rcv_process_router_port: ERR 0F09: "
> +             "Invalid base LID corrected.\n" );
>
>    OSM_LOG_EXIT( p_rcv->p_log );
>  }
Re: [openib-general] opensm and faulty hardware
Hal,

I added a hack now to get around the problem. There needs to be a proper fix later..

[EMAIL PROTECTED] opensm]# svn diff osm_port.h
Index: osm_port.h
===
--- osm_port.h (revision 3549)
+++ osm_port.h (working copy)
@@ -1049,6 +1049,8 @@
 {
   CL_ASSERT( p_physp );
   CL_ASSERT( osm_physp_is_valid( p_physp ) );
+  if (p_physp->port_info.base_lid == 0x)
+    return (0);
   return( p_physp->port_info.base_lid );
 }
 /*

On 27 Sep 2005 15:11:05 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
> On Tue, 2005-09-27 at 14:13, Viswanath Krishnamurthy wrote:
> > I tracked down the issue to a bug in osm_lid_mgr.c
> >
> > function: __osm_lid_mgr_init_sweep(...)
> >
> > The bad hardware was returning an assigned LID of 0x. In this
> > function there is a loop as follows where opensm is getting stuck..
> > (with line number)
> >
> > 392   p_port_guid_tbl = &p_mgr->p_subn->port_guid_tbl;
> > 393
> > 394   for( p_port = (osm_port_t*)cl_qmap_head( p_port_guid_tbl );
> > 395        p_port != (osm_port_t*)cl_qmap_end( p_port_guid_tbl );
> > 396        p_port = (osm_port_t*)cl_qmap_next( &p_port->map_item ) )
> > 397   {
> > 398     osm_port_get_lid_range_ho(p_port, &disc_min_lid, &disc_max_lid);
> > 399     for (lid = disc_min_lid; lid <= disc_max_lid; lid++)   <= Bug here
> > 400       cl_ptr_vector_set(p_discovered_vec, lid, p_port );
> > 401   }
> >
> > Since the disc_max_lid and disc_min_lid are 0x, and these are
> > unsigned 16 bit numbers, the condition in the for loop never becomes
> > false, and opensm is stuck in the loop. There are a couple of other
> > places in that function that need fixing too.
>
> Sep 26 15:26:03 424135 [B66CFBB0] -> SMP dump:
>     base_ver................0x1
>     mgmt_class..............0x81
>     class_ver...............0x1
>     method..................0x1 (SubnGet)
>     D bit...................0x0
>     status..................0x0
>     hop_ptr.................0x0
>     hop_count...............0x2
>     trans_id................0x1274
>     attr_id.................0x15 (PortInfo)
>     resv....................0x0
>     attr_mod................0x1
>     m_key...................0x
>     dr_slid.................0x
>     dr_dlid.................0x
>
> Sep 26 15:26:03 424407 [B6ED0BB0] -> __osm_nd_rcv_process_nd: Node 0x30d32c7234
>     Description = Agilent E2954A 4x Generator for InfiniBand.
> Sep 26 15:26:03 424426 [B6ED0BB0] -> __osm_nd_rcv_process_nd: ]
> Sep 26 15:26:03 679882 [B56CDBB0] -> SMP dump:
>     base_ver................0x1
>     mgmt_class..............0x81
>     class_ver...............0x1
>     method..................0x81 (SubnGetResp)
>     D bit...................0x1
>     status..................0x0
>     hop_ptr.................0x0
>     hop_count...............0x2
>     trans_id................0x1274
>     attr_id.................0x15 (PortInfo)
>     resv....................0x0
>     attr_mod................0x1
>     m_key...................0x
>     dr_slid.................0x
>     dr_dlid.................0x
>
>     Initial path: [0][1][12]
>     Return path: [0][E][0]
>
> Sep 26 15:26:03 680291 [B76D1BB0] -> osm_pi_rcv_process: [
> Sep 26 15:26:03 680323 [B56CDBB0] -> __osm_sm_mad_ctrl_rcv_callback: ]
> Sep 26 15:26:03 680343 [B76D1BB0] -> PortInfo dump:
>     port number.............0x1
>     node_guid...............0x0030d32c7234
>     port_guid...............0x0030d32c7234
>     m_key...................0x
>     subnet_prefix...........0xfe80
>     base_lid................0x
>
> Yes, it appears the Agilent exerciser returned good status to a SM Get
> PortInfo with a base_lid of 0x. The base_lid should be validated by
> OpenSM.
>
> -- Hal
Re: [openib-general] opensm and faulty hardware
I tracked down the issue to a bug in osm_lid_mgr.c

function: __osm_lid_mgr_init_sweep(...)

The bad hardware was returning an assigned LID of 0x. In this function there is a loop as follows where opensm is getting stuck.. (with line number)

392   p_port_guid_tbl = &p_mgr->p_subn->port_guid_tbl;
393
394   for( p_port = (osm_port_t*)cl_qmap_head( p_port_guid_tbl );
395        p_port != (osm_port_t*)cl_qmap_end( p_port_guid_tbl );
396        p_port = (osm_port_t*)cl_qmap_next( &p_port->map_item ) )
397   {
398     osm_port_get_lid_range_ho(p_port, &disc_min_lid, &disc_max_lid);
399     for (lid = disc_min_lid; lid <= disc_max_lid; lid++)   <= Bug here
400       cl_ptr_vector_set(p_discovered_vec, lid, p_port );
401   }

Since the disc_max_lid and disc_min_lid are 0x, and these are unsigned 16 bit numbers, the condition in the for loop never becomes false, and opensm is stuck in the loop. There are a couple of other places in that function that need fixing too.

-Viswa

On 9/27/05, Viswanath Krishnamurthy <[EMAIL PROTECTED]> wrote:
> Log sent off-list...
>
> -Viswa
>
> On 9/27/05, Eitan Zahavi <[EMAIL PROTECTED]> wrote:
> > Hi Viswa,
> >
> > Please send a full /var/log/osm.log file of opensm -V .
> > You can send us a copy off the list if it is too big:
> > yael and eitan in @ mellanox.co.il
> >
> > EZ
> >
> > Hal Rosenstock wrote:
> > > On Mon, 2005-09-26 at 19:57, Viswanath Krishnamurthy wrote:
> > > > I have an exerciser in the IB network. The exerciser seems to be
> > > > faulty/buggy. When opensm starts I do not see "SUBNET UP" message.
> > > > It says "Entering MASTER" and waits there. Any new node inserted
> > > > in this state is not assigned any LID. Anybody seen such behavior ?
> > >
> > > Any idea on how the IB exerciser misbehaves on the network ? Do you
> > > have an analyzer too ?
> > >
> > > What does the OSM log show ?
> > >
> > > -- Hal
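[Editor's illustration] The hang described above follows from 16-bit wrap-around: for an unsigned 16-bit counter, `lid <= max` can only stay true forever when max is the largest 16-bit value, because `lid++` then wraps back to 0. The sketch below is not the OpenSM fix; `mark_lid_range`, `LID_SPACE` and the `seen` array are hypothetical names used only to show one way to make such a loop terminate, by iterating with a counter wider than the LID type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical size of a table indexed by every possible 16-bit LID. */
#define LID_SPACE 65536u

/*
 * Marks every LID in [min_lid, max_lid] in a caller-supplied table of
 * LID_SPACE entries and returns how many entries were marked.
 *
 * The loop counter is 32 bits wide on purpose: with a uint16_t counter and
 * max_lid == UINT16_MAX, the increment would wrap to 0 and the condition
 * "lid <= max_lid" would never become false.
 */
size_t mark_lid_range(uint8_t *seen, uint16_t min_lid, uint16_t max_lid)
{
    size_t marked = 0;
    uint32_t lid;

    assert(seen != NULL);

    for (lid = min_lid; lid <= max_lid; lid++) {
        seen[lid] = 1;
        marked++;
    }
    return marked;
}
```

With this shape, a range whose upper bound is the 16-bit maximum is walked once and the loop exits; whether such a LID should be accepted at all is a separate validation question.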
Re: [openib-general] opensm and faulty hardware
Log sent off-list...

-Viswa

On 9/27/05, Eitan Zahavi <[EMAIL PROTECTED]> wrote:
> Hi Viswa,
>
> Please send a full /var/log/osm.log file of opensm -V .
> You can send us a copy off the list if it is too big:
> yael and eitan in @mellanox.co.il
>
> EZ
>
> Hal Rosenstock wrote:
> > On Mon, 2005-09-26 at 19:57, Viswanath Krishnamurthy wrote:
> > > I have an exerciser in the IB network. The exerciser seems to be
> > > faulty/buggy. When opensm starts I do not see "SUBNET UP" message.
> > > It says "Entering MASTER" and waits there. Any new node inserted
> > > in this state is not assigned any LID. Anybody seen such behavior ?
> >
> > Any idea on how the IB exerciser misbehaves on the network ? Do you
> > have an analyzer too ?
> >
> > What does the OSM log show ?
> >
> > -- Hal
[openib-general] opensm and faulty hardware
I have an exerciser in the IB network. The exerciser seems to be faulty/buggy. When opensm starts I do not see the "SUBNET UP" message. It says "Entering MASTER" and waits there. Any new node inserted in this state is not assigned a LID. Has anybody seen such behavior?

-Viswa
[openib-general] Another opensm bug ?
I ran into another opensm bug which caused opensm to stop functioning. This happened only once. Here is the test case:

1. Run opensm on machine A
2. Run the following script on machine B
   a. Check ibstatus
   b. Ping machine A
   c. Run osmtest
   d. Reboot

The test case is to make sure opensm configures the machine correctly. Out of 850 iterations, I saw this error once. opensm started receiving a subnet trap continuously. (I did not see any message in the log about preventing DOS attacks.) The trap has the same transaction ID (0x224 in this case). The opensm MAD receive thread was getting called continuously.

Initially I suspected the situation which Eitan described (bad hardware causing traps etc). But when I stopped and restarted opensm, the problem went away.

Log attached off-list.

-Viswa
Re: [openib-general] Re: Another opensm problem ?
Hi Eitan,

I see that message in the log.

-Viswa

On 9/24/05, Eitan Zahavi <[EMAIL PROTECTED]> wrote:
> Hi Viswa and Hal,
>
> I have read through the thread and have a few comments.
> But first let me see if I understand the test run correctly. The test is
> as follows:
> 1. OpenSM starts up configuring the subnet.
> 2. Then the user tears off a cable and connects it to the other side port
>    of a switch
> 3. The SM is supposed to bring up the new connection
> 4. Step 2 is repeated until the SM stops responding.
>
> Well, if this is the case then OpenSM might stop responding due to the
> following features:
> 1. We had in the past cases where bad hardware continuously flooded the SM
>    with Traps. To protect against this kind of DOS attack we have
>    implemented an adaptive filter in the SM trap receiver:
>    If the exact same trap is received continuously from the same source
>    more than 10 times (with no more than 5 sec between the traps) they are
>    considered DOS and are ignored.
>    Please see osm_trap_rcv.c for details.
> 2. The way IB switches work is that each time a port of theirs changes
>    state they:
>    a. Set the "change bit" in the SwitchInfo
>    b. Send a trap 128 to the SM. But Trap 128 does not carry the changed
>       port number.
>
> So under a test case like you describe what can happen:
> 1. The SM decides to ignore trap 128 from the switch as more than 5
>    connect/reconnect sequences happen with not enough "quiet" time to
>    recover.
> 2. The SwitchInfo ChangeBit is sampled during the OSM light sweep. There is
>    a race between the reading of the change bit and the clearing of it. If
>    the connect/disconnect happen very fast the change bit set by the
>    re-connect can be cleaned by the clear started by the disconnect.
>
> It is easy to see in the log file if the SM did ignore traps. Run with -V
> and look for:
> grep "Continuously received this trap" /var/log/osm.log
> (for some reason I did not get any log attachments with this thread -
> otherwise I would do some analysis on it too).
>
> Anyway, if the SM does not heavy sweep (due to the above) it is very likely
> it will continue to poll the non existing node that was previously attached
> to a switch port with no success.
>
> So testing of cable tear off and reconnect should be done with at least 10
> seconds recovery time.
>
> Also you could try sending kill -HUP to the OpenSM process and see if the
> full sweep you start is able to bring all ports up.
>
> Viswa, with all that said, it is very possible you are experiencing a bug
> in OpenSM and we want to encourage your effort finding those. With your,
> and others, help we will be able to flush them out.
>
> Thanks
>
> Eitan
>
> Hal Rosenstock wrote:
> > On Fri, 2005-09-23 at 14:57, Hal Rosenstock wrote:
> > > On Fri, 2005-09-23 at 13:50, Viswanath Krishnamurthy wrote:
> > > > - After 7-8 iterations, I ran into a weird problem, where opensm was
> > > > showing the HCA as UNKNOWN. The port never came up to ACTIVE state.
> > > > The unplugged and replugged into different slots, the port remained
> > > > in INIT state.
> > > >
> > > > Mellanox: SW : 12 : INI : : : 2048 : 1x : 2.5 :
> > > > 0002c9010d26e780 : UNKNOWN
> > >
> > > OpenSM thinks that either there is no physical port on the other end of
> > > the link or it is not "valid" (GUID non 0). Obviously it is there as the
> > > port state is INIT so the physical link came up which requires the
> > > remote end to be there.
> >
> > From the log you sent, this is exactly what is happening.
> > Sep 23 10:07:23 451191 [B7751BB0] -> osm_drop_mgr_process: Checking port 0x0002c9010d26e780.
> > Sep 23 10:07:23 451209 [B7751BB0] -> osm_drop_mgr_process: Checking port 0x0002c90200400cfd.
> > Sep 23 10:07:23 451226 [B7751BB0] -> osm_drop_mgr_process: ERR 0108: Unknown remote side for node 0x0002c9010d26e780 port 20. Adding to light sweep sampling list.
> > Sep 23 10:07:23 451251 [B7751BB0] -> Directed Path Dump of 1 hop path: Path = [0][1]
> > Sep 23 10:07:23 451267 [B7751BB0] -> osm_drop_mgr_process: ]
> >
> > So look in osm_drop_mgr.c line 707:
> > Can you enhance the log display to see which is failing:
> > osm_physp_is_valid(p_physp) or osm_physp_get_remote(p_physp) ?
> >
> > Also, it appears to keep light sweeping this port but whichever switch
> > port it is on, it does not respond. Not sure where the problem is. It
> > could be on the outgoing side of the switch (we could run diags against
> > the switch and various ports; I would be curious what they return when
> > the subnet is in this broken state) or on the HCA. However, the fact
> > that restarting opensm made it go away without touching anything else
> > makes this appear otherwise.
> >
> > > One other note is that it appears to have come up as 1x. Is that what
> > > should happen ?
> >
> > -- Hal
Re: [openib-general] Re: opensm and SIGINT
On 23 Sep 2005 13:49:31 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
> Hi Viswa,
>
> On Fri, 2005-09-23 at 13:43, Viswanath Krishnamurthy wrote:
> > More information,
> >
> > The test case is as follows
> >
> > 1. Start opensm in verbose mode (-V)
> > 2. Ping remote node
> > 3. osmtest -f c
> > 4. osmtest -f a
> > 5. pkill -9 opensm
> > 6. Repeat over
> >
> > Out of about 2500 iterations, 143 osmtest failed. Keep in mind,
> > only Step 4 failed.
>
> Yes. Do you see any port LEDs on the switch blink indicating the port went
> down from active and back while running this ?

No, I ran this test overnight and logged the results. I will try it next week and let you know.

> > Step 3 which is inventory file creation *never* failed. (I think
> > inventory file creation also talks to SA right ?)
>
> Right.
>
> -- Hal
Re: [openib-general] Forcing IB link state down
On 23 Sep 2005 13:59:28 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
> Hi Viswa,
>
> On Fri, 2005-09-23 at 13:55, Viswanath Krishnamurthy wrote:
> > Is there an API or command to force an IB link to go down.
>
> Not currently.
>
> > This will be helpful in running tests on opensm.
>
> Yes, I can understand that. Technically (per the IBA spec), the SM is
> the only one allowed to do Sets. I think it would be possible to have a
> diag command do this as long as the MKey protection is weak (which it is
> now). A better way might be to have a CLI on the OpenSM and be able to
> issue a down command to a port.

I was looking to see if the mthca driver has any API/ioctl to disable/enable the link..

-Viswa

> -- Hal
[openib-general] Re: Another opensm problem ?
Hal,

On 23 Sep 2005 14:04:00 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
> Hi again Viswa,
>
> On Fri, 2005-09-23 at 13:50, Viswanath Krishnamurthy wrote:
>
> Good test. Hadn't tried this. I will try it and will recreate this.
>
> > - 2 machines with a switch in between. One m/c running opensm.
>
> How was opensm started ?

Manually

# opensm -V

> > Attached is the log
>
> The default log is in /var/log/osm.log

I captured what appeared on the screen. I will send the osm.log file too.. It is a big one and had accumulated over a period of time..

> -- Hal
[openib-general] Forcing IB link state down
Is there an API or command to force an IB link to go down? This would be helpful in running tests on opensm.

-Viswa
Re: [openib-general] Re: opensm and SIGINT
More information,

The test case is as follows:

1. Start opensm in verbose mode (-V)
2. Ping remote node
3. osmtest -f c
4. osmtest -f a
5. pkill -9 opensm
6. Repeat

Out of about 2500 iterations, 143 osmtest runs failed. Keep in mind, only Step 4 failed. Step 3, which is inventory file creation, *never* failed. (I think inventory file creation also talks to the SA, right?)

-Viswa

On 23 Sep 2005 12:54:56 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
> Hi Eitan,
>
> On Fri, 2005-09-23 at 12:19, Eitan Zahavi wrote:
> > Hi Hal, Viswa,
> >
> > Sorry I'm joining late on this thread due to the weekend (which starts
> > here on Friday ending Saturday night).
> > Is there any conclusion on this one?
>
> No.
>
> > The only log I have seen was from osmtest failing to send a MAD.
>
> True.
>
> > Looks like a umad issue?
>
> Not sure why you say that. There are other possibilities I'm aware of here:
> Note that that failed sent MAD is one which has a response expected so
> this means that the response was not received. It also goes through the
> transmit retry strategy (I could see this on the SA side). So the only
> thing I can say at this point is that for some reason, the response does
> not make it back from the SA to the SA client (osmtest). That's where
> this one is right now.
>
> -- Hal
>
> > Eitan
> >
> > Hal Rosenstock wrote:
> > > Hi again Viswa,
> > >
> > > On Wed, 2005-09-21 at 21:00, Hal Rosenstock wrote:
> > > > Hi Viswa,
> > > >
> > > > On Wed, 2005-09-21 at 20:23, Viswanath Krishnamurthy wrote:
> > > > > Currently opensm traps SIGINT. There was some discussion to
> > > > > remove it. I am currently running some tests on opensm
> > > > > by killing (SIGKILL) and restarting opensm. So far I have not found
> > > > > any resource leak issues. Is there a plan to remove that
> > > > > signal handler. Ideally it should not exist.
> > > >
> > > > Eitan stated that this was historical in nature for gen1 drivers which
> > > > had resource tracking problems: "if OpenSM left without cleaning up all
> > > > used resources (like MAD buffers and UD-AVs), the driver oops'ed."
> > > >
> > > > I think that (eliminating the handler for SIGINT) can at least be done
> > > > for OSM_VENDOR_INTF_OPENIB and leave it there for the other vendor
> > > > layers for starters. I will experiment with gen2 and let you know.
> > >
> > > Does the patch below do what you want ? Can you try it ?
> > >
> > > -- Hal
> > >
> > > Index: opensm/osm_opensm.c
> > > ===
> > > --- opensm/osm_opensm.c (revision 3513)
> > > +++ opensm/osm_opensm.c (working copy)
> > > @@ -182,7 +182,9 @@ osm_reg_sig_handler(
> > >    IN osm_opensm_t * const p_osm )
> > >  {
> > >    __p_osm_to_signal = p_osm;
> > > +#ifndef OSM_VENDOR_INTF_OPENIB
> > >    cl_reg_sig_hdl( SIGINT, __sig_handler );
> > > +#endif
> > >    cl_reg_sig_hdl( SIGTERM, __sig_handler );
> > >    cl_reg_sig_hdl( SIGHUP, __sig_handler );
> > >    osm_exit_flag = 0;
Re: [openib-general] Re: opensm and SIGINT
On 22 Sep 2005 18:44:44 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
> Hi Viswa,
>
> On Thu, 2005-09-22 at 15:55, Viswanath Krishnamurthy wrote:
> > Here is the log of osmtest failure. This was seen 150 times out of
> > 2500 iterations. The opensm SUBNET UP failure is tough to reproduce.
> > Saw it once in 2500 iterations. Unfortunately I did not collect the
> > log on that error.
>
> I understand but it is hard to know whether this is a known issue or
> something else without a log of the failure.
>
> > The patch worked as expected and did not see any issues with ctrl-C.
> > When I tried to apply the patch, I got a failure. (I used the patch
> > command). I manually added those 2 lines.
>
> Not sure why the patch wouldn't apply.
>
> > Command Line Arguments
> > Done with args
> > Flow = All Validations
> > Sep 21 17:50:56 684254 [B7F026C0] -> osm_vendor_get_all_port_attr: assign CA mthca0 port 1 guid (0x2c90200400cfd) as the default port.
> > using default guid 0x2c90200400cfd
> > Sep 21 17:50:56 686301 [B7F026C0] -> osm_vendor_get_all_port_attr: assign CA mthca0 port 1 guid (0x2c90200400cfd) as the default port.
> > Sep 21 17:50:56 686347 [B7F026C0] -> osm_vendor_bind: Binding to port 0x2c90200400cfd.
> > Sep 21 17:50:56 689963 [B7F026C0] -> osm_vendor_get_all_port_attr: assign CA mthca0 port 1 guid (0x2c90200400cfd) as the default port.
> > Sep 21 17:50:56 691969 [B7F026C0] -> osm_vendor_get_all_port_attr: assign CA mthca0 port 1 guid (0x2c90200400cfd) as the default port.
> > Sep 21 17:50:56 693187 [B7F026C0] -> osmtest_validate_sa_class_port_info:
> > -
> > SA Class Port Info:
> > base_ver:1
> > class_ver:2
> > cap_mask:0x202
> > resp_time_val:0x64
> > -
> > Sep 21 17:50:56 775383 [B7F026C0] -> osmtest_wrong_sm_key_ignored: Try PortRecord for port with LID 0x0 Num:0x1.
> > Sep 21 17:51:00 775320 [B76FFBB0] -> umad_receiver: ERR 5409: send completed with error (method=1 attr=12 trans_id=0x34) -- dropping.
> > Sep 21 17:51:00 775389 [B76FFBB0] -> umad_receiver: ERR 5410: class 0x3 LID 0x0
> > Sep 21 17:51:00 775418 [B76FFBB0] -> osmtest_query_res_cb: ERR 0003: Error on query (IB_TIMEOUT).
> > Sep 21 17:51:00 775465 [B7F026C0] -> osmtest_wrong_sm_key_ignored: ERR 0011: Did not get a timeout but got (IB_SUCCESS).
> > Sep 21 17:51:00 775581 [B7F026C0] -> osmt_register_service: Registering Service: name:osmt.srvc.1804289383.7793 id:0x6b8b26f6.
> > Sep 21 17:51:00 777143 [B7F026C0] -> osmt_register_service: Registering Service: name:osmt.srvc.846930885.7793 id:0x327b0554
> > Sep 21 17:51:00 777143 [B7F026C0] -> osmt_register_service: Registering Service: name:osmt.srvc.846930885.7793 id:0x327b0554.
> > Sep 21 17:51:04 779578 [B76FFBB0] -> umad_receiver: ERR 5409: send completed with error (method=2 attr=31 trans_id=0x36) -- dropping.
> > Sep 21 17:51:04 779604 [B76FFBB0] -> umad_receiver: ERR 5410: class 0x3 LID 0x0
> > Sep 21 17:51:04 779631 [B76FFBB0] -> osmtest_query_res_cb: ERR 0003: Error on query (IB_TIMEOUT).
> > Sep 21 17:51:04 779674 [B7F026C0] -> osmt_register_service: ERR 0364: ib_query failed (IB_TIMEOUT).
> > Sep 21 17:51:04 779740 [B7F026C0] -> osmtest_run: ERR 00148: Service Flow failed (IB_TIMEOUT)
> > OSMTEST: TEST "All Validations" FAIL
>
> The final FAIL/PASS is definitive so there are real failures here. Is
> this consistent or intermittent ? Does this work sometimes or always
> fail ?

Intermittent. As I said, 150 out of 2500 iterations failed. Is there any log you want me to collect?

> -- Hal
Re: [openib-general] Re: opensm and SIGINT
Hal,

Here is the log of the osmtest failure. This was seen 150 times out of 2500 iterations. The opensm SUBNET UP failure is tough to reproduce; I saw it once in 2500 iterations. Unfortunately I did not collect the log for that error.

The patch worked as expected and I did not see any issues with Ctrl-C. When I tried to apply the patch with the patch command, it failed, so I added those 2 lines manually.

Command Line Arguments
Done with args
Flow = All Validations
Sep 21 17:50:56 684254 [B7F026C0] -> osm_vendor_get_all_port_attr: assign CA mthca0 port 1 guid (0x2c90200400cfd) as the default port.
using default guid 0x2c90200400cfd
Sep 21 17:50:56 686301 [B7F026C0] -> osm_vendor_get_all_port_attr: assign CA mthca0 port 1 guid (0x2c90200400cfd) as the default port.
Sep 21 17:50:56 686347 [B7F026C0] -> osm_vendor_bind: Binding to port 0x2c90200400cfd.
Sep 21 17:50:56 689963 [B7F026C0] -> osm_vendor_get_all_port_attr: assign CA mthca0 port 1 guid (0x2c90200400cfd) as the default port.
Sep 21 17:50:56 691969 [B7F026C0] -> osm_vendor_get_all_port_attr: assign CA mthca0 port 1 guid (0x2c90200400cfd) as the default port.
Sep 21 17:50:56 693187 [B7F026C0] -> osmtest_validate_sa_class_port_info:
-
SA Class Port Info:
base_ver:1
class_ver:2
cap_mask:0x202
resp_time_val:0x64
-
Sep 21 17:50:56 775383 [B7F026C0] -> osmtest_wrong_sm_key_ignored: Try PortRecord for port with LID 0x0 Num:0x1.
Sep 21 17:51:00 775320 [B76FFBB0] -> umad_receiver: ERR 5409: send completed with error (method=1 attr=12 trans_id=0x34) -- dropping.
Sep 21 17:51:00 775389 [B76FFBB0] -> umad_receiver: ERR 5410: class 0x3 LID 0x0
Sep 21 17:51:00 775418 [B76FFBB0] -> osmtest_query_res_cb: ERR 0003: Error on query (IB_TIMEOUT).
Sep 21 17:51:00 775465 [B7F026C0] -> osmtest_wrong_sm_key_ignored: ERR 0011: Did not get a timeout but got (IB_SUCCESS).
Sep 21 17:51:00 775581 [B7F026C0] -> osmt_register_service: Registering Service: name:osmt.srvc.1804289383.7793 id:0x6b8b26f6.
Sep 21 17:51:00 777143 [B7F026C0] -> osmt_register_service: Registering Service: name:osmt.srvc.846930885.7793 id:0x327b0554
Sep 21 17:51:00 777143 [B7F026C0] -> osmt_register_service: Registering Service: name:osmt.srvc.846930885.7793 id:0x327b0554.
Sep 21 17:51:04 779578 [B76FFBB0] -> umad_receiver: ERR 5409: send completed with error (method=2 attr=31 trans_id=0x36) -- dropping.
Sep 21 17:51:04 779604 [B76FFBB0] -> umad_receiver: ERR 5410: class 0x3 LID 0x0
Sep 21 17:51:04 779631 [B76FFBB0] -> osmtest_query_res_cb: ERR 0003: Error on query (IB_TIMEOUT).
Sep 21 17:51:04 779674 [B7F026C0] -> osmt_register_service: ERR 0364: ib_query failed (IB_TIMEOUT).
Sep 21 17:51:04 779740 [B7F026C0] -> osmtest_run: ERR 00148: Service Flow failed (IB_TIMEOUT)
OSMTEST: TEST "All Validations" FAIL

-Viswa

On 22 Sep 2005 15:08:02 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
> On Thu, 2005-09-22 at 15:06, Viswanath Krishnamurthy wrote:
> > I do not think this would help. The system is never rebooted. Just
> > opensm is started and stopped. On the next opensm start/stop the
> > subnet came up. I think it is more of an opensm issue than any kernel
> > module issue.
>
> Can you run opensm in -V mode and send the log. It might be related to
> the SM Set PortInfo armed->active issue which has been documented but
> not resolved.
>
> -- Hal
Re: [openib-general] Re: opensm and SIGINT
Hal,

On 22 Sep 2005 14:41:04 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
> Hi Viswa,
>
> On Thu, 2005-09-22 at 14:37, Viswanath Krishnamurthy wrote:
> > Hi Hal,
> >
> > Sure will test it out. I see no issue in this fix. I have run the
> > following test overnight
> > in a script with yesterday's code
> >
> > 1. Start opensm
> > 2. Ping another node over IB
> > 3. Run osmtest (osmtest -f c, osmtest -f a)
> > 4. Kill opensm with -9 signal and repeat over
> >
> > The failures are captured in a log.
> >
> > This has run more than 2500 times without resource leak issues. I saw
> > about 150 osmtest
> > failures which I will followup with another mail.
>
> Some failures are intentional (bad flow tests). They are all not marked
> obviously. Some of this has been documented on the list but not fixed
> yet but I am interested in seeing what you are referring to.

I will attach the log later.

> > Once opensm failed to start correctly with SUBNET UP message in the
> > log.
>
> So the subnet didn't come up and the ports didn't become active ? Just
> out of curiosity, could you unload and reload ib_umad and then start
> opensm when that occurs to see if that fixes things ? I'm not sure it would.

I do not think this would help. The system is never rebooted. Just opensm is started and stopped. On the next opensm start/stop the subnet came up. I think it is more of an opensm issue than any kernel module issue.

> Thanks.
>
> -- Hal
Re: [openib-general] Re: opensm and SIGINT
Hi Hal,

Sure, I will test it out. I see no issue with this fix. I ran the following test overnight in a script with yesterday's code:

1. Start opensm
2. Ping another node over IB
3. Run osmtest (osmtest -f c, osmtest -f a)
4. Kill opensm with signal -9 and repeat

The failures are captured in a log.

This has run more than 2500 times without resource leak issues. I saw about 150 osmtest failures, which I will follow up on in another mail. Once opensm failed to start correctly with SUBNET UP message in the log.

-Viswa

On 22 Sep 2005 11:17:46 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
> Hi again Viswa,
>
> On Wed, 2005-09-21 at 21:00, Hal Rosenstock wrote:
> > Hi Viswa,
> >
> > On Wed, 2005-09-21 at 20:23, Viswanath Krishnamurthy wrote:
> > > Currently opensm traps SIGINT. There was some discussion to remove it.
> > > I am currently running some tests on opensm
> > > by killing (SIGKILL) and restarting opensm. So far I have not found
> > > any resource leak issues. Is there a plan to remove that
> > > signal handler. Ideally it should not exist.
> >
> > Eitan stated that this was historical in nature for gen1 drivers which
> > had resource tracking problems: "if OpenSM left without cleaning up all
> > used resources (like MAD buffers and UD-AVs), the driver oops'ed."
> >
> > I think that (eliminating the handler for SIGINT) can at least be done
> > for OSM_VENDOR_INTF_OPENIB and leave it there for the other vendor
> > layers for starters. I will experiment with gen2 and let you know.
>
> Does the patch below do what you want ? Can you try it ?
>
> -- Hal
>
> Index: opensm/osm_opensm.c
> ===
> --- opensm/osm_opensm.c (revision 3513)
> +++ opensm/osm_opensm.c (working copy)
> @@ -182,7 +182,9 @@ osm_reg_sig_handler(
>      IN osm_opensm_t * const p_osm )
>  {
>      __p_osm_to_signal = p_osm;
> +#ifndef OSM_VENDOR_INTF_OPENIB
>      cl_reg_sig_hdl( SIGINT, __sig_handler );
> +#endif
>      cl_reg_sig_hdl( SIGTERM, __sig_handler );
>      cl_reg_sig_hdl( SIGHUP, __sig_handler );
>      osm_exit_flag = 0;
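To illustrate the effect of the #ifndef guard in the patch above, here is a minimal standalone sketch of conditional signal registration. It is not OpenSM code: it uses POSIX sigaction() rather than cl_reg_sig_hdl(), and the names SKIP_SIGINT_HANDLER, exit_flag, on_signal, register_handler and register_exit_signals are made up for this example. When SKIP_SIGINT_HANDLER is defined at build time, SIGINT keeps its default action (terminate the process), while SIGTERM and SIGHUP still only set the exit flag.

```c
#define _POSIX_C_SOURCE 200809L  /* ask for sigaction() under strict -std= modes */
#include <signal.h>
#include <string.h>

/* Hypothetical exit flag, playing the role osm_exit_flag has in the patch. */
static volatile sig_atomic_t exit_flag = 0;

/* The handler only records that an exit was requested. */
static void on_signal(int signo)
{
    (void)signo;
    exit_flag = 1;
}

/* Install on_signal for one signal; returns 0 on success, -1 on error. */
static int register_handler(int signo)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_signal;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/*
 * Register the exit signals. SKIP_SIGINT_HANDLER is an assumed build-time
 * switch standing in for OSM_VENDOR_INTF_OPENIB: when it is defined,
 * SIGINT is left at its default disposition.
 */
int register_exit_signals(void)
{
    exit_flag = 0;
#ifndef SKIP_SIGINT_HANDLER
    if (register_handler(SIGINT) != 0)
        return -1;
#endif
    if (register_handler(SIGTERM) != 0)
        return -1;
    if (register_handler(SIGHUP) != 0)
        return -1;
    return 0;
}
```

A caller would invoke register_exit_signals() once at startup and poll exit_flag in its main loop; building with -DSKIP_SIGINT_HANDLER gives the behaviour the patch aims for on the OpenIB vendor layer.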
Re: [openib-general] ib_create_cq memory leak?
Roland,

Thanks. Tested this out.. Works like a charm...

-Viswa

On 9/21/05, Roland Dreier <[EMAIL PROTECTED]> wrote:
> Thanks very much for the excellent test case. The following patch
> (already checked into svn and queued in git for merging into 2.6.14)
> should fix things -- on my system, your test case ran successfully for
> many hundreds of iterations.
>
> --- linux-kernel/infiniband/hw/mthca/mthca_memfree.c (revision 3500)
> +++ linux-kernel/infiniband/hw/mthca/mthca_memfree.c (working copy)
> @@ -529,12 +529,25 @@ int mthca_alloc_db(struct mthca_dev *dev
>              goto found;
>          }
>  
> +    for (i = start; i != end; i += dir)
> +        if (!dev->db_tab->page[i].db_rec) {
> +            page = dev->db_tab->page + i;
> +            goto alloc;
> +        }
> +
>      if (dev->db_tab->max_group1 >= dev->db_tab->min_group2 - 1) {
>          ret = -ENOMEM;
>          goto out;
>      }
>  
> +    if (group == 0)
> +        ++dev->db_tab->max_group1;
> +    else
> +        --dev->db_tab->min_group2;
> +
>      page = dev->db_tab->page + end;
> +
> +alloc:
>      page->db_rec = dma_alloc_coherent(&dev->pdev->dev, 4096,
>                                        &page->mapping, GFP_KERNEL);
>      if (!page->db_rec) {
> @@ -554,10 +567,6 @@ int mthca_alloc_db(struct mthca_dev *dev
>      }
>      bitmap_zero(page->used, MTHCA_DB_REC_PER_PAGE);
>  
> -    if (group == 0)
> -        ++dev->db_tab->max_group1;
> -    else
> -        --dev->db_tab->min_group2;
>  
>  found:
>      j = find_first_zero_bit(page->used, MTHCA_DB_REC_PER_PAGE);
[openib-general] opensm and SIGINT
Hal,

Currently opensm traps SIGINT. There was some discussion about removing it. I am currently running some tests on opensm by killing (SIGKILL) and restarting it. So far I have not found any resource leak issues. Is there a plan to remove that signal handler? Ideally it should not exist.

-Viswa
Re: [openib-general] Modifying QP state error
The mthca state transition code allows this transition (RTS --> RESET), but the mthca hardware/firmware does not allow it. It allows RTS->ERR->RESET. I will post the code to reproduce this later. I was trying to work around the CQ destroy memory leak by caching QP entries and reusing them, but ran into other issues.

-Viswa

On 9/21/05, Roland Dreier <[EMAIL PROTECTED]> wrote:
>     Hal> You can only get to RESET from ERROR. See Figure 124 QP
>     Hal> Context State Diagram IBA 1.2 p. 452.
>
> I think the figure is drawn in a slightly misleading way. The text at
> the lower left says:
>
>     It is possible to transition from any state to either the Error or
>     the Reset state with the Modify QP/EE Verb.
>
>  - R.
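The two-step path mentioned in this message (RTS -> ERR -> RESET instead of RTS -> RESET directly) can be sketched as follows. This is a userspace illustration under stated assumptions, not mthca or verbs code: the enum, the modify_fn callback type, and the recording stub record_transition are hypothetical stand-ins for a real modify-QP call, so that the ordering of the transitions can be shown and checked.

```c
#include <stddef.h>

/* Simplified QP states (names are local to this sketch). */
enum qp_state { QPS_RESET, QPS_INIT, QPS_RTR, QPS_RTS, QPS_SQD, QPS_SQE, QPS_ERR };

/* Hypothetical "modify QP" callback: returns 0 on success, non-zero on error. */
typedef int (*modify_fn)(void *qp, enum qp_state next);

/*
 * Bring a QP to RESET by going through ERR first, which is the path the
 * message above reports the hardware/firmware as accepting.
 */
int qp_reset_via_error(void *qp, enum qp_state cur, modify_fn modify)
{
    int ret;

    if (cur == QPS_RESET)
        return 0;               /* already there, nothing to do */

    if (cur != QPS_ERR) {
        ret = modify(qp, QPS_ERR);
        if (ret)
            return ret;         /* give up if the first step fails */
    }
    return modify(qp, QPS_RESET);
}

/* Recording stub used in place of a real QP, for demonstration only. */
struct transition_log {
    enum qp_state steps[4];
    size_t count;
};

int record_transition(void *qp, enum qp_state next)
{
    struct transition_log *log = qp;

    if (log->count >= sizeof(log->steps) / sizeof(log->steps[0]))
        return -1;
    log->steps[log->count++] = next;
    return 0;
}
```

In real code the callback would wrap the actual modify-QP verb with the state attribute set; the stub here only records the requested states so that the order ERR, then RESET, is visible.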
[openib-general] Modifying QP state error
When I try to modify the QP state from RTS to RESET, I get the following error:

ib_mthca :05:00.0: Command 1e completed with status 0a
ib_mthca :05:00.0: modify QP 7 returned status 0a.

Is modifying the QP state from RTS to RESET a valid state transition? (I guess so.) Is there anything else that needs to be taken care of?

-Viswa
[openib-general] ib_create_cq memory leak?
I ran into this issue when using the kernel API to create CQs. To reproduce the problem, I wrote a small kernel module which creates 4K CQs and destroys them. After running the test 8-10 times, create_cq failed with error -12 (ENOMEM). I am attaching the test module source code with Makefiles.

[root src]# svn info     (Latest code)
Path: .
URL: https://openib.org/svn/gen2/trunk/src
Repository UUID: 21a7a0b7-18d7-0310-8e21-e8b31bdbf5cd
Revision: 3512
Node Kind: directory
Schedule: normal
Last Changed Author: halr
Last Changed Rev: 3511
Last Changed Date: 2005-09-21 08:57:38 -0700 (Wed, 21 Sep 2005)

To compile the code, change the KERNELSRC variable in mysock.mak to point to your kernel source tree:

#make -f mysock.mak
#insmod mysock.ko

To run the test:

#echo 1 > /dev/mysock

After running the above 8-10 times, you will see a -12 error on the console. The problem does not occur when you create a single CQ and destroy it immediately in a loop (I tried 10 times). It occurs when you create 4K CQs and then destroy them.

ib_mthca :05:00.0: Mapped page at 362f9000 to 7e000 for ICM.
ib_mthca :05:00.0: Mapped page at 362fa000 to 41000 for ICM.
ib_mthca :05:00.0: Mapped page at 35d1 to 7d000 for ICM.
ib_mthca :05:00.0: Mapped page at 35d11000 to 42000 for ICM.
ib_mthca :05:00.0: Mapped page at 35f27000 to 7c000 for ICM.
ib_mthca :05:00.0: Mapped page at 35f28000 to 43000 for ICM.
ib_mthca :05:00.0: Mapped page at 3593f000 to 7b000 for ICM.
ib_mthca :05:00.0: Mapped page at 3594 to 44000 for ICM.
ib_mthca :05:00.0: Mapped page at 35b56000 to 7a000 for ICM.
ib_mthca :05:00.0: Mapped page at 35b57000 to 45000 for ICM.
ib_mthca :05:00.0: Mapped page at 3556d000 to 79000 for ICM.
ib_mthca :05:00.0: Mapped page at 3556e000 to 46000 for ICM.
ib_mthca :05:00.0: Mapped page at 35785000 to 78000 for ICM.
ib_mthca :05:00.0: Mapped page at 35786000 to 47000 for ICM.
ib_mthca :05:00.0: Mapped 1 chunks/256 KB at 2604 for ICM.
ib_mthca :05:00.0: Mapped 1 chunks/256 KB at 2004 for ICM.
ib_mthca :05:00.0: Mapped 1 chunks/256 KB at 2584 for ICM.
ib_mthca :05:00.0: Mapped page at 3519b000 to 77000 for ICM.
ib_mthca :05:00.0: Mapped page at 3519c000 to 48000 for ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 7e000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 7d000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 7c000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 7b000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 7a000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 79000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 78000 from ICM.
ib_mthca :05:00.0: Unmapping 64 pages at 2584 from ICM.
ib_mthca :05:00.0: Unmapping 64 pages at 2004 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 48000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 77000 from ICM.
ib_mthca :05:00.0: Unmapping 64 pages at 2604 from ICM.
ib_mthca :05:00.0: Mapped page at 35b03000 to 76000 for ICM.
ib_mthca :05:00.0: Mapped page at 362ba000 to 75000 for ICM.
ib_mthca :05:00.0: Mapped page at 35d2c000 to 74000 for ICM.
ib_mthca :05:00.0: Mapped page at 35c83000 to 73000 for ICM.
ib_mthca :05:00.0: Mapped page at 35b99000 to 72000 for ICM.
ib_mthca :05:00.0: Mapped page at 35db to 71000 for ICM.
ib_mthca :05:00.0: Mapped page at 356c5000 to 7 for ICM.
ib_mthca :05:00.0: Mapped 1 chunks/256 KB at 2604 for ICM.
ib_mthca :05:00.0: Mapped 1 chunks/256 KB at 2004 for ICM.
ib_mthca :05:00.0: Mapped 1 chunks/256 KB at 2584 for ICM.
ib_mthca :05:00.0: Mapped page at 35adc000 to 6f000 for ICM.
ib_mthca :05:00.0: Mapped page at 35add000 to 49000 for ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 76000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 75000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 74000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 73000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 72000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 71000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 7 from ICM.
ib_mthca :05:00.0: Unmapping 64 pages at 2584 from ICM.
ib_mthca :05:00.0: Unmapping 64 pages at 2004 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 49000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 6f000 from ICM.
ib_mthca :05:00.0: Unmapping 64 pages at 2604 from ICM.
ib_mthca :05:00.0: Mapped page at 362cf000 to 6e000 for ICM.
ib_mthca :05:00.0: Mapped page at 35a0f000 to 6d000 for ICM.
ib_mthca :05:00.0: Mapped page at 3507 to 6c000 for ICM.
ib_mthca :05:00.0: Mapped page at 35e83000 to 6b000 for ICM.
ib_mthca :05:00.0: Mapped page at 35bd8000 to 6a000 for ICM.
ib_mthca :05:00.0: Mapped page at 351ef000 to 69000 for ICM.
ib_mthca :05:00.0: Mapped page at 35c44000 to 6800
[openib-general] Re: [PATCH] libmthca: fix wqe post
Just wanted to confirm that kernel mthca also works fine. Thanks, Roland and Michael.

-Viswa

On 9/13/05, Viswanath Krishnamurthy <[EMAIL PROTECTED]> wrote:
> Thanks.. yes that was the problem... The panic was happening when I was
> getting these errors and pressed Ctrl-C on the server. This may be an
> error path issue. I am not seeing it now..
>
> -Viswa
>
> On 9/13/05, Roland Dreier <[EMAIL PROTECTED]> wrote:
> >     Viswanath> When I ran the cmpost program which I sent you, I
> >     Viswanath> started getting errors from the mthca library even for
> >     Viswanath> smaller number of connections (Earlier it was
> >     Viswanath> working).
> >
> > Yeah, I found another problem with your cmpost program. I think
> > you're setting the packet lifetime far too low. You have:
> >
> >     sa.packet_life_time = 2;
> >
> > This ends up having the CM set an ACK timeout of something like 32
> > microseconds, which is way too low. If you poll the send CQ, you'll
> > probably see some "retries exceeded" errors. Setting the
> > packet_life_time to something like 14 or 15 should work better.
> >
> >     Viswanath> Also it is now easier to create the panic when you kill
> >     Viswanath> the cmpost server program. The panic may be happening
> >     Viswanath> on an error path.
> >
> > I still have never been able to reproduce this panic (and believe me,
> > I've killed the cmpost program many times). Anyway, I'll take a look
> > at the traceback and see if anything jumps out at me.
> >
> >  - R.
[openib-general] Re: [PATCH] libmthca: fix wqe post
Thanks.. yes that was the problem... The panic was happening when I was getting these errors and pressed Ctrl-C on the server. This may be an error path issue. I am not seeing it now..

-Viswa

On 9/13/05, Roland Dreier <[EMAIL PROTECTED]> wrote:
>     Viswanath> When I ran the cmpost program which I sent you, I
>     Viswanath> started getting errors from the mthca library even for
>     Viswanath> smaller number of connections (Earlier it was
>     Viswanath> working).
>
> Yeah, I found another problem with your cmpost program. I think
> you're setting the packet lifetime far too low. You have:
>
>     sa.packet_life_time = 2;
>
> This ends up having the CM set an ACK timeout of something like 32
> microseconds, which is way too low. If you poll the send CQ, you'll
> probably see some "retries exceeded" errors. Setting the
> packet_life_time to something like 14 or 15 should work better.
>
>     Viswanath> Also it is now easier to create the panic when you kill
>     Viswanath> the cmpost server program. The panic may be happening
>     Viswanath> on an error path.
>
> I still have never been able to reproduce this panic (and believe me,
> I've killed the cmpost program many times). Anyway, I'll take a look
> at the traceback and see if anything jumps out at me.
>
>  - R.
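To make the "something like 32 microseconds" figure concrete, here is a small worked sketch. InfiniBand timeout fields are exponents: a value n stands for 4.096 microseconds times 2^n. The relation between packet_life_time and the ACK timeout used below (ACK timeout exponent = packet_life_time + 1, i.e. twice the one-way lifetime) is an assumption made for this example; it is consistent with the number quoted above, but the real CM may fold in other terms such as the CA's ACK delay. The function names are made up for this sketch.

```c
/* Convert an IB timeout exponent to microseconds: 4.096 us * 2^exponent. */
double ib_timeout_usec(unsigned exponent)
{
    if (exponent > 31)          /* the field is 5 bits wide */
        exponent = 31;
    return 4.096 * (double)(1ULL << exponent);
}

/*
 * Assumed relation for this sketch only: the ACK timeout covers a round
 * trip, so its exponent is one more than the one-way packet lifetime.
 */
unsigned ack_timeout_exponent(unsigned packet_life_time)
{
    unsigned exponent = packet_life_time + 1;

    return exponent > 31 ? 31 : exponent;
}
```

Under that assumption, packet_life_time = 2 gives an exponent of 3, or about 32.8 microseconds, while packet_life_time = 14 gives an exponent of 15, or roughly 134 milliseconds, which is why the larger value leaves far more room before the retry count is exhausted.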
[openib-general] Re: [PATCH] libmthca: fix wqe post
Roland,

I got the latest sources and built them along with the drivers.

Userland mthca

Your test application (rctest) ran fine without any issue.

When I ran the cmpost program which I sent you, I started getting errors from the mthca library even for a smaller number of connections (earlier it was working). This looks like an error dump in the mthca library.

..
[ 0] 0493 [ 4] [ 8] [ c] [10] 05f4 [14] [18] 0042 [1c] fe10
failed polling CQ: 142: err 1     <=== This is from cmpost program
[ 0] 0493 [ 4] [ 8] [ c] [10] 05f9 [14] [18] 0082 [1c] fe10
failed polling CQ: 142: err 1
[ 0] 0493

Also, it is now easier to trigger the panic when you kill the cmpost server program. The panic may be happening on an error path.

printing eip:
c029197d
*pde = 35d56001
Oops: [#1]
SMP
Modules linked in: nfs nfsd exportfs lockd autofs4 sunrpc uhci_hcd ehci_hcd hw_random e1000 ext3 jbd sd_mod
CPU: 0
EIP: 0060:[] Not tainted VLI
EFLAGS: 00010002 (2.6.13)
EIP is at mthca_poll_cq+0x158/0x534
eax: ebx: f5e90280 ecx: 0006 edx: 1250
esi: 023a edi: f5e90304 ebp: f7941f0c esp: f7941ea4
ds: 007b es: 007b ss: 0068
Process ib_mad1 (pid: 308, threadinfo=f794 task=f7cb7540)
Stack: f7941ed0 c0118c7d f7def41c c0355dc0 f7cb7540 f7dea41c c1a01bc0 0080 0286 f7ce1000 f7941f0c 0001 f7dea400 f8806000 0292 0001 f5e90280 f7ce1000 f7def400 f7941f0c
Call Trace:
[] load_balance_newidle+0x23/0xa2
[] ib_mad_completion_handler+0x2c/0x8d
[] remove_wait_queue+0xf/0x34
[] worker_thread+0x1b0/0x23a
[] schedule+0x5d3/0xbdf
[] ib_mad_completion_handler+0x0/0x8d
[] default_wake_function+0x0/0xc
[] default_wake_function+0x0/0xc
[] worker_thread+0x0/0x23a
[] kthread+0x8a/0xb2
[] kthread+0x0/0xb2
[] kernel_thread_helper+0x5/0xb
Code: 01 00 00 8b 44 24 18 8d bb 84 00 00 00 8b 53 5c 8b 70 18 8b 4f 24 0f ce 2b b3 b8 00 00 00 8b 83 bc 00 00 00 d3 ee 01 f2 8d 14 d0 <8b> 02 8b 52 04 85 ff 89 45 00 89 55 04 74 16 8b 57 10 89 f0 39

-Viswa

On 9/13/05, Roland Dreier <[EMAIL PROTECTED]> wrote:
>     Viswanath> Once you generate a kernel patch, I can test out both
>     Viswanath> user and kernel mthca since I have the tests ready..
>
> Excellent. I merged MST's patch, and applied the patch below to the
> kernel. (So you can either update from svn or apply the patches)
>
> Thanks for testing -- let me know if you still see problems.
>
> Index: infiniband/hw/mthca/mthca_srq.c
> ===
> --- infiniband/hw/mthca/mthca_srq.c (revision 3404)
> +++ infiniband/hw/mthca/mthca_srq.c (working copy)
> @@ -189,7 +189,6 @@ int mthca_alloc_srq(struct mthca_dev *de
>      srq->max = attr->max_wr;
>      srq->max_gs = attr->max_sge;
> -    srq->last = NULL;
>      srq->counter = 0;
>      if (mthca_is_memfree(dev))
> @@ -264,6 +263,7 @@ int mthca_alloc_srq(struct mthca_dev *de
>      srq->first_free = 0;
>      srq->last_free = srq->max - 1;
> +    srq->last = get_wqe(srq, srq->max - 1);
>      return 0;
> @@ -446,13 +446,11 @@ int mthca_tavor_post_srq_recv(struct ib_
>              ((struct mthca_data_seg *) wqe)->addr = 0;
>          }
> -        if (likely(prev_wqe)) {
> -            ((struct mthca_next_seg *) prev_wqe)->nda_op =
> -                cpu_to_be32((ind << srq->wqe_shift) | 1);
> -            wmb();
> -            ((struct mthca_next_seg *) prev_wqe)->ee_nds =
> -                cpu_to_be32(MTHCA_NEXT_DBD);
> -        }
> +        ((struct mthca_next_seg *) prev_wqe)->nda_op =
> +            cpu_to_be32((ind << srq->wqe_shift) | 1);
> +        wmb();
> +        ((struct mthca_next_seg *) prev_wqe)->ee_nds =
> +            cpu_to_be32(MTHCA_NEXT_DBD);
>          srq->wrid[ind] = wr->wr_id;
>          srq->first_free = next_ind;
> Index: infiniband/hw/mthca/mthca_qp.c
> ===
> --- infiniband/hw/mthca/mthca_qp.c (revision 3404)
> +++ infiniband/hw/mthca/mthca_qp.c (working copy)
> @@ -227,7 +227,6 @@ static void mthca_wq_init(struct mthca_w
>      wq->last_comp = wq->max - 1;
>      wq->head = 0;
>      wq->tail = 0;
> -    wq->last = NULL;
>  }
>  void mthca_qp_event(struct mthca_dev *dev, u32 qpn,
> @@ -1103,6 +1102,9 @@ static int mthca_alloc_qp_common(struct
>          }
>      }
> +    qp->sq.last = get_send_wqe(qp, qp->sq.max - 1);
> +    qp->rq.last = get_recv_wqe(qp, qp->rq.max - 1);
> +
>      return 0;
>  }
> @@ -1583,15 +1585,13 @@ int mthca_tavor_post_send(struct ib_qp *
>              goto out;
>          }
> -        if (prev_wqe) {
> -            ((struct mthca_next_seg *) prev_wqe)->nda_op =
> -                cpu_to_be32((
[openib-general] Strange configure error in libibcm
I got the latest code from the repository to verify the mthca fixes and ran into this strange configure error in libibcm:

checking infiniband/at.h usability... yes
checking infiniband/at.h presence... yes
checking for infiniband/at.h... yes
checking for ANSI C header files... (cached) yes
checking for an ANSI C-conforming const... yes
checking for long... yes
checking size of long... configure: error: cannot compute sizeof (long), 77
See `config.log' for more details.

gcc version is 3.4
Linux 2.6.13

I was able to build earlier versions on the same machine. This happens only with libibcm.

Any clues?

-Viswa
[openib-general] Re: [PATCH] libmthca: fix wqe post (was Re: strange mem-free bug)
Michael, Thanks..

Roland, Once you generate a kernel patch, I can test out both user and kernel mthca since I have the tests ready..

-Viswa

On 9/13/05, Michael S. Tsirkin <[EMAIL PROTECTED]> wrote:

Quoting r. Roland Dreier <[EMAIL PROTECTED]>:
> Subject: strange mem-free bug (was: [openib-general] completion Q overflow error/panic)
>
> While looking at Viswa's example, I've found what seems to be a
> problem using lots of QPs on mem-free HCAs.

Hi, Roland!
This seems to be a bug in libmthca. Patch below.
We probably need a similar fix for kernel mthca - let me know if you plan to work on that, otherwise I'll look into it tomorrow. And it's probably something we want fixed for 2.6.14, right?
Let me know.

With regard to the test code that you posted - I also have some small comments. If you plan to use it in the future, you can stick it in svn somewhere and I'll send patches.

---

Fix posting of the first work request for memfree hardware.
Simplify code for tavor mode hardware.

Signed-off-by: Michael S. Tsirkin <[EMAIL PROTECTED]>

Index: userspace/libmthca/src/qp.c
===
--- userspace.orig/libmthca/src/qp.c 2005-09-13 17:17:58.0 +0300
+++ userspace/libmthca/src/qp.c 2005-09-13 17:26:23.0 +0300
@@ -259,15 +259,13 @@ int mthca_tavor_post_send(struct ibv_qp
         goto out;
     }
-    if (prev_wqe) {
-        ((struct mthca_next_seg *) prev_wqe)->nda_op =
-            htonl(((ind << qp->sq.wqe_shift) +
-                   qp->send_wqe_offset) |
-                  mthca_opcode[wr->opcode]);
+    ((struct mthca_next_seg *) prev_wqe)->nda_op =
+        htonl(((ind << qp->sq.wqe_shift) +
+               qp->send_wqe_offset) |
+              mthca_opcode[wr->opcode]);
-        ((struct mthca_next_seg *) prev_wqe)->ee_nds =
-            htonl((size0 ? 0 : MTHCA_NEXT_DBD) | size);
-    }
+    ((struct mthca_next_seg *) prev_wqe)->ee_nds =
+        htonl((size0 ? 0 : MTHCA_NEXT_DBD) | size);
     if (!size0) {
         size0 = size;
@@ -353,12 +351,10 @@ int mthca_tavor_post_recv(struct ibv_qp
     qp->wrid[ind] = wr->wr_id;
-    if (prev_wqe) {
-        ((struct mthca_next_seg *) prev_wqe)->nda_op =
-            htonl((ind << qp->rq.wqe_shift) | 1);
-        ((struct mthca_next_seg *) prev_wqe)->ee_nds =
-            htonl(MTHCA_NEXT_DBD | size);
-    }
+    ((struct mthca_next_seg *) prev_wqe)->nda_op =
+        htonl((ind << qp->rq.wqe_shift) | 1);
+    ((struct mthca_next_seg *) prev_wqe)->ee_nds =
+        htonl(MTHCA_NEXT_DBD | size);
     if (!size0)
         size0 = size;
@@ -562,15 +558,13 @@ int mthca_arbel_post_send(struct ibv_qp
         goto out;
     }
-    if (prev_wqe) {
-        ((struct mthca_next_seg *) prev_wqe)->nda_op =
-            htonl(((ind << qp->sq.wqe_shift) +
-                   qp->send_wqe_offset) |
-                  mthca_opcode[wr->opcode]);
-        mb();
-        ((struct mthca_next_seg *) prev_wqe)->ee_nds =
-            htonl(MTHCA_NEXT_DBD | size);
-    }
+    ((struct mthca_next_seg *) prev_wqe)->nda_op =
+        htonl(((ind << qp->sq.wqe_shift) +
+               qp->send_wqe_offset) |
+              mthca_opcode[wr->opcode]);
+    mb();
+    ((struct mthca_next_seg *) prev_wqe)->ee_nds =
+        htonl(MTHCA_NEXT_DBD | size);
     if (!size0) {
         size0 = size;
@@ -767,6 +761,8 @@ int mthca_alloc_qp_buf(struct ibv_pd *pd
         }
     }
+    qp->sq.last = get_send_wqe(qp, qp->sq.max - 1);
+    qp->rq.last = get_recv_wqe(qp, qp->sq.max - 1);
     return 0;
 }
Index: userspace/libmthca/src/srq.c
===
--- userspace.orig/libmthca/src/srq.c 2005-09-13 17:25:41.0 +0300
+++ userspace/libmthca/src/srq.c 2005-09-13 17:25:51.0 +0300
@@ -142,13 +142,11 @@ int mthca_tavor_post_srq_recv(struct ibv
         ((struct mthca_data_seg *) wqe)->addr = 0;
     }
-    if (prev_wqe) {
-        ((struct mthca_next_seg *) prev_wqe)->nda_op =
-            htonl((ind << srq->wqe_shift) | 1);
-        mb();
-        ((struct mthca_next_seg *) prev_wqe)->ee_nds =
-            htonl(MTHCA_NEXT_DBD);
-    }
+    ((struct mthca_next_seg *) prev
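The idea behind the patch above can be illustrated in isolation: instead of starting with last set to NULL and guarding every post with "if (prev_wqe)", the queue's "last" pointer is initialised to the final slot, so the first post already has a valid previous entry to link from. The following is only a minimal sketch of that pattern; struct ring, struct wqe, ring_init and ring_post are hypothetical names and are not taken from libmthca or the kernel driver.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 4   /* illustrative queue depth, not a real mthca value */

/* Hypothetical stand-in for a work queue entry; "next" plays the role
 * of the nda_op link that the real patch writes into the previous WQE. */
struct wqe {
    int next;           /* index of the entry posted after this one, -1 if none */
};

struct ring {
    struct wqe entries[RING_SIZE];
    struct wqe *last;   /* entry that was posted most recently */
    int next_ind;       /* slot the next post will use */
};

static void ring_init(struct ring *r)
{
    for (int i = 0; i < RING_SIZE; i++)
        r->entries[i].next = -1;
    r->next_ind = 0;
    /* Point "last" at the final slot instead of NULL, so the very first
     * post links from a valid entry without a special case. */
    r->last = &r->entries[RING_SIZE - 1];
}

static int ring_post(struct ring *r)
{
    int ind = r->next_ind;
    struct wqe *prev = r->last;

    assert(prev != NULL);           /* holds by construction after ring_init */
    prev->next = ind;               /* unconditional link, no "if (prev)" */
    r->last = &r->entries[ind];
    r->next_ind = (ind + 1) % RING_SIZE;
    return ind;
}
```

After ring_init, the first ring_post returns slot 0 and writes the link into slot RING_SIZE - 1, which mirrors what the "+ qp->sq.last = get_send_wqe(qp, qp->sq.max - 1);" lines set up in the real code.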
[openib-general] Status of opensm 1.8 merge
Can I start testing opensm 1.8 merge on gen2 ? What is the current status ?

-Viswa
Re: [openib-general] completion Q overflow error/panic
Here is the ibv_devinfo output. It is InfiniHost_III_Lx0.

]# ibv_devinfo
hca_id: mthca0
fw_ver: 1.0.1
node_guid: 0002:c902:0040:0cfc
sys_image_guid: 0002:c902:0040:0cff
max_mr_size: 0x
page_size_cap: 0x0
vendor_id: 0x02c9
vendor_part_id: 25204
hw_ver: 0x0
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: invalid MTU (0)
active_mtu: invalid MTU (0)
sm_lid: 1
port_lid: 1
port_lmc: 0x00

Yes, the cqe value is a bug. But in this case there should be only one outstanding packet in the pipe at any time. The client sends 1 packet, waits for the response with a pause (delay), then sends the next packet. If everything works, we should be using at most 1 CQ entry. Initially I had a larger number of CQ entries, but then the problem just appeared later. It looks like a packet is getting stuck somewhere, with no error notification coming back. Do we need to tweak any of the QP parameters (packet life time, retries etc) ?

-Viswa

On 9/9/05, Roland Dreier <[EMAIL PROTECTED]> wrote:
> I found one bug in your cmpost.c program that could cause CQ
> overruns. When you create your receive and send CQs, you create them
> with a cqe value of 5, so they can hold at most 5 entries. However,
> you create the send and receive work queues so they can hold up to 10
> entries, and in fact the code will post up to 8 entries at a time. So
> it's possible to overflow the CQ.
>
> The fix is to create the CQs to have at least as many entries as the
> work queues -- in other words, change cqe to 10.
>
> However, even with this fixed I do see some strange behavior that I'm
> still debugging. More details on Monday.
>
> What HCA firmware version do your systems have?
>
> - R.
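The sizing rule Roland describes (a CQ needs at least as many entries as the work queues that complete into it) can be written down as a small check. This is only a sketch under stated assumptions: min_cq_entries and cq_can_overflow are hypothetical helpers, not libibverbs calls, and the multiplication by the number of queues sharing a CQ is an extrapolation of the rule for the shared-CQ case rather than something stated in the thread.

```c
#include <assert.h>

/* Hypothetical helper: smallest safe cqe value when "queues_sharing_cq"
 * work queues, each able to hold "max_wr_per_queue" outstanding work
 * requests, all report completions to the same CQ. */
static int min_cq_entries(int max_wr_per_queue, int queues_sharing_cq)
{
    assert(max_wr_per_queue >= 0 && queues_sharing_cq >= 0);
    return max_wr_per_queue * queues_sharing_cq;
}

/* Returns 1 if a CQ created with "cqe" entries could be overrun. */
static int cq_can_overflow(int cqe, int max_wr_per_queue, int queues_sharing_cq)
{
    return cqe < min_cq_entries(max_wr_per_queue, queues_sharing_cq);
}
```

With the numbers from the thread (cqe of 5, work queues of 10, one work queue per CQ), cq_can_overflow(5, 10, 1) reports a possible overrun, while cq_can_overflow(10, 10, 1) does not.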
Re: [openib-general] completion Q overflow error/panic
Some more info.. This also happens at the kernel level. I have a small kernel module which does the echo reply. After about 100-200 connections, I start to see the following messages:

ib_mthca :05:00.0: SQ 590473 full (8 head, 0 tail, 8 max, 0 nreq)
ib_mthca :05:00.0: SQ 590477 full (8 head, 0 tail, 8 max, 0 nreq)
ib_mthca :05:00.0: SQ 59040c full (8 head, 0 tail, 8 max, 0 nreq)

Below 100 connections I do not see any such messages. It looks like if there is a problem, it exists in both the kernel and userland APIs.

-Viswa

On 9/9/05, Roland Dreier <[EMAIL PROTECTED]> wrote:
> Thanks for the excellent bug report. I'll try your code and see if I
> can reproduce the problem. If I can, then I should be able to fix the
> bugs.
>
> - R.
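The "SQ ... full (8 head, 0 tail, 8 max, 0 nreq)" messages are consistent with a head/tail occupancy check of the following shape. This is a simplified sketch, assuming free-running unsigned head and tail counters; the struct and function names are illustrative, and the real driver check (which may also re-read the counters under a lock) is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative work queue bookkeeping: head counts posted requests,
 * tail counts requests whose completions have been reaped. */
struct wq_state {
    unsigned head;
    unsigned tail;
    unsigned max;
};

/* Returns 1 if posting "nreq" more requests on top of the current one
 * would exceed the queue depth. Unsigned subtraction keeps the result
 * correct when the counters wrap around. */
static int wq_would_overflow(const struct wq_state *wq, unsigned nreq)
{
    unsigned cur;

    assert(wq != NULL);
    cur = wq->head - wq->tail;      /* entries still outstanding */
    return cur + nreq >= wq->max;
}
```

For the logged values, head - tail is 8 and max is 8, so the queue is reported full even with nreq at 0: tail never advanced, meaning no send completion was ever accounted for on that QP.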
[openib-general] completion Q overflow error/panic
Somehow gmail ate away the main content of my mail.. Here it is..

I modified the cmpost program to have individual send/receive completion queues. The cmpost server acts like an echo server, echoing back anything it receives. The client program keeps sending packets. The test works fine up to around 600 connections. After 600 connections, I start to see ibv_post_send errors. I added some debug messages in libmthca/src/qp.c where the check for wq_overflow is made, and the work queue is in fact overflowing. I checked the code to make sure all the send descriptors are recovered with the cq_poll operation. The wc.status field is also checked for any errors. I am attaching the modified code.

bash-3.00$ svn info
Path: .
URL: https://openib.org/svn/gen2/trunk
Repository UUID: 21a7a0b7-18d7-0310-8e21-e8b31bdbf5cd
Revision: 3344
Node Kind: directory
Schedule: normal
Last Changed Author: jlentini
Last Changed Rev: 3344
Last Changed Date: 2005-09-08 16:39:25 -0700 (Thu, 08 Sep 2005)

To run the test, compile the code:

cc -o cmpost cmpost.c -libcm -libverbs -libat

$ cmpost -n 1024 <=== as server
$ cmpost -c -n 1024 -l -g

After some time you start seeing post_send errors. On my system up to 600 connections work fine.

While running the test I saw panics a couple of times, but they are difficult to reproduce:

kernel BUG at include/asm/spinlock.h:149!
invalid operand: [#1]
SMP
Modules linked in: nfs nfsd exportfs lockd autofs4 sunrpc uhci_hcd ehci_hcd hw_random e1000 ext3 jbd sd_mod
CPU: 1
EIP: 0060:[] Not tainted VLI
EFLAGS: 00010086 (2.6.13)
EIP is at _spin_lock_irqsave+0x47/0x51
eax: 0011 ebx: 0282 ecx: c035950c edx: 0082
esi: f7d82010 edi: ebp: f6792c80 esp: c1a33ed0
ds: 007b es: 007b ss: 0068
Process ib_mad1 (pid: 308, threadinfo=c1a32000 task=f7e3c540)
Stack: c03123ee c0276963 f6792c80 f7d82010 c0276963 f79a6adc f7974b00 0001 c1a33f0c f7912e00 f7df2000 f7df4200 c1a33f0c 0292 c0276b96 f6792c80 b93e2c00 0128 0296 0402 0001
Call Trace:
 [] ib_mad_send_done_handler+0x72/0x11e
 [] ib_mad_send_done_handler+0x72/0x11e
 [] ib_mad_completion_handler+0x80/0x8d
 [] wait_noreap_copyout+0x55/0xbe
 [] worker_thread+0x1b0/0x23a
 [] schedule+0x5d3/0xbdf
 [] ib_mad_completion_handler+0x0/0x8d
 [] default_wake_function+0x0/0xc
 [] default_wake_function+0x0/0xc
 [] worker_thread+0x0/0x23a
 [] kthread+0x8a/0xb2
 [] kthread+0x0/0xb2
 [] kernel_thread_helper+0x5/0xb
Code: 00 00 74 01 fb f3 90 80 3e 00 7e f9 fa eb e8 83 c4 08 89 d8 5b 5e c3 8b 44 24 10 c7 04 24 ee 23 31 c0 89 44 24 04 e8 2f e7 e1 ff <0f> 0b 95 00 39 1c 31 c0 eb c2 53 89 c3 83 ec 08 fa 81 78 04 ad

-Viswa
[openib-general] completion Q overflow error/panic
I modified the cmpost program to have individual send/receive completion queues. The cmpost server acts like an echo server, echoing back anything it receives. The client program keeps sending packets. The test works fine up to around 600 connections. After 600 connections, I start to see ibv_post_send errors. I added some debug messages in libmthca/src/qp.c where the check for wq_overflow is made, and the work queue is in fact overflowing. I checked the code to make sure all the send descriptors are recovered with the cq_poll operation. The wc.status field is also checked for any errors. I am attaching the modified code.

bash-3.00$ svn info
Path: .
URL: https://openib.org/svn/gen2/trunk
Repository UUID: 21a7a0b7-18d7-0310-8e21-e8b31bdbf5cd
Revision: 3344
Node Kind: directory
Schedule: normal
Last Changed Author: jlentini
Last Changed Rev: 3344
Last Changed Date: 2005-09-08 16:39:25 -0700 (Thu, 08 Sep 2005)

To run the test, compile the code:

cc -o cmpost cmpost.c -libcm -libverbs -libat

$ cmpost -n 1024 <=== as server
$ cmpost -c -n 1024 -l -g

After some time you start seeing post_send errors. On my system up to 600 connections work fine.

While running the test I saw panics a couple of times, but they are difficult to reproduce:

kernel BUG at include/asm/spinlock.h:149!
invalid operand: [#1]
SMP
Modules linked in: nfs nfsd exportfs lockd autofs4 sunrpc uhci_hcd ehci_hcd hw_random e1000 ext3 jbd sd_mod
CPU: 1
EIP: 0060:[] Not tainted VLI
EFLAGS: 00010086 (2.6.13)
EIP is at _spin_lock_irqsave+0x47/0x51
eax: 0011 ebx: 0282 ecx: c035950c edx: 0082
esi: f7d82010 edi: ebp: f6792c80 esp: c1a33ed0
ds: 007b es: 007b ss: 0068
Process ib_mad1 (pid: 308, threadinfo=c1a32000 task=f7e3c540)
Stack: c03123ee c0276963 f6792c80 f7d82010 c0276963 f79a6adc f7974b00 0001 c1a33f0c f7912e00 f7df2000 f7df4200 c1a33f0c 0292 c0276b96 f6792c80 b93e2c00 0128 0296 0402 0001
Call Trace:
 [] ib_mad_send_done_handler+0x72/0x11e
 [] ib_mad_send_done_handler+0x72/0x11e
 [] ib_mad_completion_handler+0x80/0x8d
 [] wait_noreap_copyout+0x55/0xbe
 [] worker_thread+0x1b0/0x23a
 [] schedule+0x5d3/0xbdf
 [] ib_mad_completion_handler+0x0/0x8d
 [] default_wake_function+0x0/0xc
 [] default_wake_function+0x0/0xc
 [] worker_thread+0x0/0x23a
 [] kthread+0x8a/0xb2
 [] kthread+0x0/0xb2
 [] kernel_thread_helper+0x5/0xb
Code: 00 00 74 01 fb f3 90 80 3e 00 7e f9 fa eb e8 83 c4 08 89 d8 5b 5e c3 8b 44 24 10 c7 04 24 ee 23 31 c0 89 44 24 04 e8 2f e7 e1 ff <0f> 0b 95 00 39 1c 31 c0 eb c2 53 89 c3 83 ec 08 fa 81 78 04 ad

-Viswa

cmpost.c
Description: Binary data
Re: [Fwd: Re: [openib-general] kernel oops]
See inline..

On 02 Sep 2005 17:04:42 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
On Fri, 2005-09-02 at 16:59, Viswanath Krishnamurthy wrote:
> Here is the setup..

Thanks. A couple more questions:

> #svn info
> Path: .
> URL: https://openib.org/svn/gen2/trunk
> Repository UUID: 21a7a0b7-18d7-0310-8e21-e8b31bdbf5cd
> Revision: 3295
> Node Kind: directory
> Schedule: normal
> Last Changed Author: halr
> Last Changed Rev: 3295
> Last Changed Date: 2005-09-01 12:07:54 -0700 (Thu, 01 Sep 2005)
>
> Patch applied to core/at.c and kernel 2.6.13 recompiled.
>
> Machine A
> =
> Running opensm
> Run ucmpost
>
> machine B
> =
> ./ucmpost

Are these back to back HCAs or is there a switch in between ?

There is a switch in between. A simple setup with 2 machines and a switch. The machines are running 2.6.13. One of them is running opensm.

> The problem is reproducible when you *cannot* ping each other

over IPoIB ?

Yes..

> [EMAIL PROTECTED] ~]# ibv_devinfo
> hca_id: mthca0
> fw_ver: 1.0.1
> node_guid: 0002:c902:0040:0d00
> sys_image_guid: 0002:c902:0040:0d03
> max_mr_size: 0x
> page_size_cap: 0x0
> vendor_id: 0x02c9
> vendor_part_id: 25204
> hw_ver: 0x0
> phys_port_cnt: 1
> port: 1
> state: PORT_ACTIVE (4)
> max_mtu: invalid MTU (0) < What is this ??>
> active_mtu: invalid MTU (0)

If the program is right and those are the real values, somehow max_mtu is trashed which causes active_mtu to be invalid which could break all sorts of things...

Is there some issue with the HCA ?

> sm_lid: 1
> port_lid: 3
> port_lmc: 0x00

That's on the remote (from the SM) machine.

-- Hal
Re: [Fwd: Re: [openib-general] kernel oops]
Here is the setup..

#svn info
Path: .
URL: https://openib.org/svn/gen2/trunk
Repository UUID: 21a7a0b7-18d7-0310-8e21-e8b31bdbf5cd
Revision: 3295
Node Kind: directory
Schedule: normal
Last Changed Author: halr
Last Changed Rev: 3295
Last Changed Date: 2005-09-01 12:07:54 -0700 (Thu, 01 Sep 2005)

Patch applied to core/at.c and kernel 2.6.13 recompiled.

Machine A
=
Running opensm
Run ucmpost

machine B
=
./ucmpost

The problem is reproducible when you *cannot* ping each other

[EMAIL PROTECTED] ~]# ibv_devinfo
hca_id: mthca0
fw_ver: 1.0.1
node_guid: 0002:c902:0040:0d00
sys_image_guid: 0002:c902:0040:0d03
max_mr_size: 0x
page_size_cap: 0x0
vendor_id: 0x02c9
vendor_part_id: 25204
hw_ver: 0x0
phys_port_cnt: 1
port: 1
state: PORT_ACTIVE (4)
max_mtu: invalid MTU (0) < What is this ??>
active_mtu: invalid MTU (0)
sm_lid: 1
port_lid: 3
port_lmc: 0x00

-Viswa

On 02 Sep 2005 16:02:44 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
On Fri, 2005-09-02 at 15:39, Viswanath Krishnamurthy wrote:
> The patch failed to fix the panic..

Can you describe your setup ? Did you just run ucmpost without an SM/SA running or is it a different scenario ? Thanks.

-- Hal
Re: [Fwd: Re: [openib-general] kernel oops]
I am working on it. With the updated version of the code, it is slightly more difficult to reproduce.

-Viswa

On 9/2/05, Roland Dreier <[EMAIL PROTECTED]> wrote:
> Not really related to the ib_at oops, since I don't know that code.
> But have you made any progress in being able to post the code to
> reproduce the other oops (at mthca_poll_cq)?
>
> Thanks,
> Roland
Re: [Fwd: Re: [openib-general] kernel oops]
The patch failed to fix the panic..

subnetmgr5 login: ib_at: ib_dev_ats_op: dev (c0449800) ib0 already has pending op 2
Unable to handle kernel NULL pointer dereference at virtual address 0068
 printing eip:
c02fee65
*pde = 365a7001
Oops: [#1]
SMP
Modules linked in: nfsd exportfs lockd autofs4 sunrpc uhci_hcd ehci_hcd hw_random e1000 ext3 jbd sd_mod
CPU: 0
EIP: 0060:[] Not tainted VLI
EFLAGS: 00010086 (2.6.13)
EIP is at _spin_lock_irqsave+0xa/0x51
eax: 0064 ebx: 0286 ecx: f665de6c edx: c037bcd0
esi: 0064 edi: 0064 ebp: esp: f665de00
ds: 007b es: 007b ss: 0068
Process lt-ucmpost (pid: 3749, threadinfo=f665c000 task=f6478020)
Stack: c01410ed 0001 c037bcd0 c0272f87 00d0 f665deac f67abe80 c027f14c c035ef80 c17f8ec0 f665de6c 0c30 0064 f665deac f67abe80 c0284cfa 0c30 0064 00d0 c02847b8 f67abe80
Call Trace:
 [] __alloc_pages+0x324/0x3f1
 [] ib_get_client_data+0x14/0x54
 [] ib_sa_path_rec_get+0x1b/0x138
 [] resolve_path+0x8c/0x15b
 [] path_req_complete+0x0/0xf7
 [] rtnetlink_dump_all+0x0/0x9e
 [] rtnetlink_done+0x0/0x3
 [] ib_at_paths_by_route+0xf5/0x10f
 [] same_path_req+0x0/0x95
 [] ib_uat_paths_by_route+0xef/0x1c4
 [] rtnetlink_dump_all+0x0/0x9e

-- Forwarded message --
From: Sean Hefty <[EMAIL PROTECTED]>
To: Hal Rosenstock <[EMAIL PROTECTED]>
Date: Thu, 01 Sep 2005 09:04:37 -0700
Subject: Re: [openib-general] kernel oops

Hal Rosenstock wrote:
> Here's a patch for this. Let me know if it works. [I tried it out and it
> works for me.] If it does, the next question is how does the pointer get
> trashed.

I don't think that the pointer is getting trashed. The SA was not running, so I don't think that any route was returned.

- Sean
[openib-general] Re: List of issues in uverbs
--- Roland Dreier <[EMAIL PROTECTED]> wrote:

>     viswanath> Here is new list of issues with uverbs
>
> Thanks for the reports.
>
>     viswanath> I have attached the firmware version/svn info in the
>     viswanath> attachment.
>
> In the future can you attach things as text/plain (or just include
> them in your email)? If you attach it as application/octet-stream
> then I have to save the attachment and open it manually, rather than
> just reading it as part of your email.

OK..

>     viswanath> 2. libmthca library crashes when a server accepts lots
>     viswanath> of new incoming sessions. See log (gdb) in the
>     viswanath> attachment. (It accepts about 170 connections) Looks
>     viswanath> like a memory allocation issue.
>
> I found a few bugs in libmthca relating to allocating doorbell records
> for memfree HCAs. I've checked in fixes. Please try the latest
> subversion libmthca and let me know if it helps.

This definitely helped. No more crashes in the library. Thanks

>     viswanath> 3. Kernel oops when lots of traffic between multiple
>     viswanath> clients and server. Very consistently reproducible.
>     viswanath> See attachment for details
>
> Can you post the application you use to reproduce this?

I still see the crash with yesterday's checkout, consistently at the same place. I will send the application to reproduce it today. If some debug log needs to be collected, let me know.

> Thanks,
> Roland

Thanks,
Viswa
Re: [openib-general] kernel oops
I will try out this patch and let you know..

Hal Rosenstock wrote:
> Here's a patch for this. Let me know if it works. [I tried it out and it
> works for me.] If it does, the next question is how does the pointer get
> trashed.

I don't think that the pointer is getting trashed. The SA was not running, so I don't think that any route was returned.

- Sean
Re: [openib-general] List of issues in uverbs
--- Sean Hefty <[EMAIL PROTECTED]> wrote:

> viswanath krishnamurthy wrote:
> > 1. ib_cm_destroy_id(cm_id)
> >    hangs (does not return to the caller)
> >    Is there a particular shutdown sequence
> >    that needs to be followed ? Is there a trace/debug
> >    I can enable ?
>
> There's no significant debug to enable. What app are you running that's calling
> ib_cm_destroy_id()? I didn't think that the ping pong tests used it. Are you
> trying to call this function from within a CM callback?

Probably called from a callback.. The application is a small application which accepts incoming connections (like a socket server). When is a good time to call the destroy ?

> The call will hang while there is a CM callback outstanding or if a CM event has
> not been completed by calling put_event.
>
> > 2. libmthca library crashes when a server accepts
> >    lots of new incoming sessions. See log (gdb)
> >    in the attachment. (It accepts about 170
> >    connections) Looks like a memory allocation issue.
>
> The log file borders on unreadable.

Hope this time the attachment is better.. See information here

==
A server program that accepts multiple incoming connections. After about 170 connections the library dies as seen in the gdb output
==

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread -1208648784 (LWP 21309)]
0xb7f79de8 in mthca_free_db (db_tab=0x805c688, type=MTHCA_DB_TYPE_CQ_SET_CI, db_index=494) at src/memfree.c:150
150 db_tab->page[db_index / MTHCA_DB_REC_PER_PAGE].
(gdb) bt
#0 0xb7f79de8 in mthca_free_db (db_tab=0x805c688, type=MTHCA_DB_TYPE_CQ_SET_CI, db_index=494) at src/memfree.c:150
#1 0xb7f7c699 in mthca_create_cq (context=0x805a0b4, cqe=10) at mthca.h:243
#2 0xb7f81eb5 in ibv_create_cq (context=0x805a0b4, cqe=10, cq_context=0x0) at src/verbs.c:107
#3 0xb7f5d6c0 in xib_qp_alloc_init (hp=0x865c958, port=1) at xsocket_trans2.c:157
#4 0xb7f5e19f in xib_conn_init (xcbp=0x865c958) at xsocket_trans2.c:496
#5 0xb7f5bd06 in handle_cm_req (hp=0x805da08, comm_id=0x865cab0, rguid=0x805db64 "", rn_guid=0x805db64 "", data=0x805d7b0, len=90) at xsocket.c:230
#6 0xb7f5ec73 in cm_handler () at xsocket_trans2.c:799
#7 0x007993ae in start_thread () from /lib/tls/libpthread.so.0
#8 0x00619aee in clone () from /lib/tls/libc.so.6

> > 3. Kernel oops when lots of traffic between multiple
> >    clients and server. Very consistently
> >    reproducible. See attachment for details
>
> Can you clarify what application you're running? I can't understand your
> configuration from the log file.

The application is a simple one, which accepts incoming requests and spawns a thread to handle each of them. The application does a simple "ping-pong" of data.

 printing eip:
c0285f7d
*pde = 3649a001
Oops: [#1]
SMP
Modules linked in: nfs nfsd exportfs lockd autofs4 sunrpc uhci_hcd ehci_hcd hw_random e1000 ext3 jbd sd_mod
CPU: 0
EIP: 0060:[] Not tainted VLI
EFLAGS: 00010002 (2.6.12.5)
EIP is at mthca_poll_cq+0x158/0x534
eax: ebx: c2027080 ecx: 0007 edx: 0a60
esi: 013c edi: c2027104 ebp: c1a33f0c esp: c1a33ea4
ds: 007b es: 007b ss: 0068
Process ib_mad1 (pid: 312, threadinfo=c1a32000 task=f7f16540)
Stack: c1800560 c17f8560 c17f8ec0 c1a33edc c0116819 f7d9489c f78a31e0 0080 0286 f7d83000 c1a33f0c 0001 f7d94880 f8806000 0292 0001 c2027080 f7d83000 f789bc00 c1a33f0c
Call Trace:
 [] load_balance_newidle+0x76/0x81
 [] ib_mad_completion_handler+0x2c/0x8d
 [] remove_wait_queue+0xf/0x34
 [] worker_thread+0x1b0/0x23a
 [] ib_mad_completion_handler+0x0/0x8d
 [] default_wake_function+0x0/0xc
 [] default_wake_function+0x0/0xc
 [] worker_thread+0x0/0x23a
 [] kthread+0x8a/0xb2
 [] kthread+0x0/0xb2
 [] kernel_thread_helper+0x5/0xb
Code: 01 00 00 8b 44 24 18 8d bb 84 00 00 00 8b 53 5c 8b 70 18 8b 4f 24 0f ce 2b b3 b8 00 00 00 8b 83 bc 00 00 00 d3 ee 01 f2 8d 14 d0 <8b> 02 8b 52 04 85 ff 89 45 00 89 55 04 74 16 8b 57 10 89 f0 39

After about 170 incoming connections the library (hence the application) dies..

> > 4. Is there a way to get the Port GUID from
> >    incoming connection. I can only get the remote
> >    node guid, but not the port GUID from the CM REQ
> >    data. This was possible in gen1 stack.
>
> You can use the returned path record to obtain port information. What do you
> need the port GUID for?

If an HCA has multiple ports, the node guid will be the same. It would be good to get the port guid to uniquely identify the port.

> - Sean

Here is the code version used..

[EMAIL PROTECTED] svn in
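Sean's point about ib_cm_destroy_id() (it blocks while a CM callback is outstanding or an event has not been released) suggests a common pattern: only mark the connection for destruction inside the handler, and perform the destroy after the event has been handed back. The following is a self-contained model of that ordering, not real libibcm code; struct conn, conn_try_destroy, conn_handle_event and conn_dispatch are hypothetical names, and the counter stands in for "events fetched but not yet put".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one CM identifier. */
struct conn {
    int events_outstanding;   /* events fetched but not yet released */
    int destroy_requested;    /* set by the handler, acted on later */
    int destroyed;
};

/* Models the destroy call: it can only complete when no event is
 * outstanding. The real call would block instead of returning 0. */
static int conn_try_destroy(struct conn *c)
{
    assert(c != NULL);
    if (c->events_outstanding > 0)
        return 0;
    c->destroyed = 1;
    return 1;
}

/* Event handler: never destroys directly, only records the request. */
static void conn_handle_event(struct conn *c, int is_disconnect)
{
    if (is_disconnect)
        c->destroy_requested = 1;
}

/* Event loop body: fetch, handle, release, and only then destroy. */
static void conn_dispatch(struct conn *c, int is_disconnect)
{
    c->events_outstanding++;              /* event fetched */
    conn_handle_event(c, is_disconnect);
    c->events_outstanding--;              /* event released (put_event) */
    if (c->destroy_requested && !c->destroyed)
        conn_try_destroy(c);
}
```

In this model a destroy attempted while an event is still outstanding cannot complete, which corresponds to the hang described in the thread; deferring it until after the release lets it go through.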
[openib-general] List of issues in uverbs
I have attached the firmware version/svn info in the attachment.

Here is a new list of issues with uverbs:

1. ib_cm_destroy_id(cm_id) hangs (does not return to the caller). Is there a particular shutdown sequence that needs to be followed ? Is there a trace/debug I can enable ?

2. The libmthca library crashes when a server accepts lots of new incoming sessions. See log (gdb) in the attachment. (It accepts about 170 connections.) Looks like a memory allocation issue.

3. Kernel oops when there is lots of traffic between multiple clients and the server. Very consistently reproducible. See attachment for details.

4. Is there a way to get the port GUID from an incoming connection ? I can only get the remote node GUID, but not the port GUID, from the CM REQ data. This was possible in the gen1 stack.

I will look into the rc pingpong issue and try to reproduce it.

--- Roland Dreier <[EMAIL PROTECTED]> wrote:

>     viswanath> I have the latest openib code on 2.16 machine, when I
>     viswanath> run the rc pingpong program I get the following error
>     viswanath> (The first time it passed, but subsequent ones got an
>     viswanath> error, I tried changing the iteration count to a large
>     viswanath> number, 10 after the first time)
>
> I left "ibv_rc_pingpong -n 10" running in a loop between two of my
> machines with no problems, so there's something specific to your setup.
>
> When you say "latest openib code," what does this mean? Are you
> running something from subversion or a standard Linux kernel? Do you
> have 1-port or 2-port HCAs? What HCA firmware version are you
> running?
>
> - R.

ib.log
Description: 2164448128-ib.log
[openib-general] rc ping pong error
I have the latest openib code on a 2.16 machine. When I run the rc pingpong program I get the following error. (The first run passed, but subsequent runs got an error. After the first run I tried changing the iteration count to a large number, 10.)

#dmesg
ib_mthca :05:00.0: Mapped page at 395aa000 to 8 for ICM.
ib_mthca :05:00.0: CQ overrun on CQN 5b0083 <=
ib_mthca :05:00.0: Unmapping 1 pages at 8 from ICM.

[EMAIL PROTECTED] ./ibv_rc_pingpong 192.169.8.117
local address: LID 0x0003, QPN 0x440405, PSN 0xd6ae4e
remote address: LID 0x0001, QPN 0x3a0405, PSN 0x9317a4
[ 0] 00440405
[ 4]
[ 8]
[ c]
[10] 1581
[14]
[18] 8002
[1c] ff10
Failed status 12 for wr_id 2
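To read a line such as "Failed status 12 for wr_id 2", the numeric status has to be mapped back to a work completion status name. The table below is a sketch that assumes the ordering of enum ibv_wc_status found in current libibverbs headers also applied to the 2005 tree, which this thread does not confirm; wc_status_name and wc_status_is are hypothetical helpers, not library functions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Names in the order of enum ibv_wc_status as found in current
 * libibverbs headers (assumption: same numbering in the 2005 tree). */
static const char *const wc_status_names[] = {
    "IBV_WC_SUCCESS",            /*  0 */
    "IBV_WC_LOC_LEN_ERR",        /*  1 */
    "IBV_WC_LOC_QP_OP_ERR",      /*  2 */
    "IBV_WC_LOC_EEC_OP_ERR",     /*  3 */
    "IBV_WC_LOC_PROT_ERR",       /*  4 */
    "IBV_WC_WR_FLUSH_ERR",       /*  5 */
    "IBV_WC_MW_BIND_ERR",        /*  6 */
    "IBV_WC_BAD_RESP_ERR",       /*  7 */
    "IBV_WC_LOC_ACCESS_ERR",     /*  8 */
    "IBV_WC_REM_INV_REQ_ERR",    /*  9 */
    "IBV_WC_REM_ACCESS_ERR",     /* 10 */
    "IBV_WC_REM_OP_ERR",         /* 11 */
    "IBV_WC_RETRY_EXC_ERR",      /* 12 */
    "IBV_WC_RNR_RETRY_EXC_ERR",  /* 13 */
    "IBV_WC_LOC_RDD_VIOL_ERR",   /* 14 */
    "IBV_WC_REM_INV_RD_REQ_ERR", /* 15 */
    "IBV_WC_REM_ABORT_ERR",      /* 16 */
    "IBV_WC_INV_EECN_ERR",       /* 17 */
    "IBV_WC_INV_EEC_STATE_ERR",  /* 18 */
    "IBV_WC_FATAL_ERR",          /* 19 */
    "IBV_WC_RESP_TIMEOUT_ERR",   /* 20 */
    "IBV_WC_GENERAL_ERR",        /* 21 */
};

/* Hypothetical helper: name for a numeric status, "unknown" if out of range. */
static const char *wc_status_name(int status)
{
    size_t n = sizeof(wc_status_names) / sizeof(wc_status_names[0]);

    if (status < 0 || (size_t) status >= n)
        return "unknown";
    return wc_status_names[status];
}

/* Hypothetical predicate, convenient when filtering logs by status name. */
static int wc_status_is(int status, const char *name)
{
    assert(name != NULL);
    return strcmp(wc_status_name(status), name) == 0;
}
```

Under that assumption, status 12 corresponds to a transport retry counter exceeded error, meaning the sender gave up waiting for acknowledgements from the remote side.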
Re: [openib-general] kernel oops
Still see the issue.

1. I rebooted both machines, started opensm, and killed opensm after LID assignment. Next I started the ucmpost client/server; killing it panics the system.

-Viswa

Unable to handle kernel NULL pointer dereference at virtual address 0068
 printing eip:
c02f2635
*pde = 3661e001
Oops: [#1]
SMP
Modules linked in: nfsd exportfs lockd autofs4 sunrpc uhci_hcd ehci_hcd hw_random e1000 ext3 jbd sd_mod
CPU: 0
EIP: 0060:[] Not tainted VLI
EFLAGS: 00010086 (2.6.12.5)
EIP is at _spin_lock_irqsave+0xa/0x51
eax: 0064 ebx: 0286 ecx: f689be6c edx: c036cbcc
esi: 0064 edi: 0064 ebp: esp: f689be00
ds: 007b es: 007b ss: 0068
Process lt-ucmpost (pid: 3993, threadinfo=f689a000 task=f6ef9540)
Stack: c013e3f0 c036cbcc c0267667 00d0 f689beac f66a9e80 c027393f c0350d00 f689be6c 0c30 0064 f689beac f66a9e80 c027955f 0c30 0064 00d0 c0279022 f66a9e80
Call Trace:
 [] __alloc_pages+0x166/0x3b6
 [] ib_get_client_data+0x14/0x54
 [] ib_sa_path_rec_get+0x1b/0x13e
 [] resolve_path+0x8c/0x15b
 [] path_req_complete+0x0/0xf7
 [] rtnetlink_dump_all+0x0/0x9e
 [] rtnetlink_done+0x0/0x3
 [] ib_at_paths_by_route+0xc4/0xd9
 [] same_path_req+0x0/0x95

Sean Hefty wrote:
> > I downloaded the latest openib gen2 stack and ran into kernel panic
> > when I run the cmpost/ucmpost example. I modified the program to
> > continously send and receive data in an infinite loop and killed the
> > application with ctrl-c. The kernel panics pretty consistently. I am
> > currently running 2.6.12 version of the kernel . Log attached. I will
> > try upgrading to newer kernel and see if I can reproduce it.
>
> I have gotten something similar to this in my own testing, but haven't
> had the time to track it down. It seems to be related to how the IB AT
> code interacts with the SM, and if the SM has been restarted. Can you
> try resetting the SM node, then rebooting your other systems?
>
> - Sean
[openib-general] kernel oops
I downloaded the latest openib gen2 stack and ran into a kernel panic when I run the cmpost/ucmpost example. I modified the program to continuously send and receive data in an infinite loop and killed the application with ctrl-c. The kernel panics pretty consistently. I am currently running the 2.6.12 version of the kernel. Log attached. I will try upgrading to a newer kernel and see if I can reproduce it.

-Viswa

[EMAIL PROTECTED] examples]# uname -a
Linux subnetmgr4 2.6.12 #7 SMP Thu Aug 25 22:33:36 PDT 2005 i686 i686 i386 GNU/Linux

# svn info
Path: .
URL: https://openib.org/svn/gen2/trunk/src/linux-kernel
Repository UUID: 21a7a0b7-18d7-0310-8e21-e8b31bdbf5cd
Revision: 3197
Node Kind: directory
Schedule: normal
Last Changed Author: roland
Last Changed Rev: 3197
Last Changed Date: 2005-08-25 18:07:09 -0700 (Thu, 25 Aug 2005)

mgr4 login: Unable to handle kernel NULL pointer dereference at virtual address 0068
 printing eip:
c02f35c5
*pde = 365b6001
Oops: [#1]
SMP
Modules linked in: nfsd exportfs lockd autofs4 sunrpc uhci_hcd ehci_hcd hw_random e1000 ext3 jbd sd_mod
CPU: 1
EIP: 0060:[] Not tainted VLI
EFLAGS: 00010086 (2.6.12)
EIP is at _spin_lock_irqsave+0xa/0x51
eax: 0064 ebx: 0286 ecx: f6607e6c edx: c036dbcc
esi: 0064 edi: 0064 ebp: esp: f6607e00
ds: 007b es: 007b ss: 0068
Process lt-ucmpost (pid: 3837, threadinfo=f6606000 task=f6feb020)
Stack: c013e410 c036dbcc c0267637 0073 00d0 f6607eac f6504e80 c027390f c0351d00 f6607e6c 0c30 0064 f6607eac f6504e80 c027952f 0c30 0064 00d0 c0278ff2 f6504e80
Call Trace:
 [] __alloc_pages+0x166/0x3b6
 [] ib_get_client_data+0x14/0x54
 [] ib_sa_path_rec_get+0x1b/0x13e
 [] resolve_path+0x8c/0x15b
 [] path_req_complete+0x0/0xf7
 [] rtnetlink_dump_all+0x0/0x9e
 [] rtnetlink_done+0x0/0x3
 [] ib_at_paths_by_route+0xc4/0xd9
 [] same_path_req+0x0/0x95
 [] ib_uat_paths_by_route+0xef/0x1c4
 [] rtnetlink_dump_all+0x0/0x9e
 [] rtnetlink_done+0x0/0x3
 [] ib_uat_write+0x96/0xa2
 [] vfs_write+0x108/0x10a
 [] sys_write+0x41/0x6a
 [] sysenter_past_esp+0x54/0x75
Code: c8 c3 81 78 04 ed 1e af de 75 0c f0 83 28 01 79 05 e8 94 e5 ff ff c3 0f 0b d7 00 56 60 30 c0 eb ea 56 89 c6 53 83 ec 08 9c 5b fa <81> 78 04 ad 4e ad de 75 20 f0 fe 0e 79 13 f7 c3 00 02 00 00 74
<7>ib_mthca :05:00.0: Unmapping 1 pages at 8 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at bf000 from ICM.
Unable to handle kernel NULL pointer dereference at virtual address 0005
 printing eip:
c027a160
*pde = 37e01001
Oops: 0002 [#2]
SMP
Modules linked in: nfsd exportfs lockd autofs4 sunrpc uhci_hcd ehci_hcd hw_random e1000 ext3 jbd sd_mod
CPU: 0
EIP: 0060:[] Not tainted VLI
EFLAGS: 00010286 (2.6.12)
EIP is at ib_uat_create_event+0x4e/0xb4
eax: 0246 ebx: 0005 ecx: edx: 0066
esi: c19fa180 edi: c19fa1b4 ebp: f72f5f00 esp: f7f4def4
ds: 007b es: 007b ss: 0068
Process ib_at_wq/0 (pid: 309, threadinfo=f7f4c000 task=f7cfca60)
Stack: 0096 f72f5f00 0001 f6504ed4 c027a219 0002 ff92 f6504eb0 f6504ed4 0292 c027a291 f72f5f00 ff92 ff92 c0278913 ff92 f6504eb0 c1a12000 c0129aad 000f41fd
Call Trace:
 [] ib_uat_callback+0x53/0x6d
 [] ib_uat_path_callback+0x1a/0x1f
 [] req_comp_work+0x19/0x25
 [] worker_thread+0x1b0/0x23a
 [] req_comp_work+0x0/0x25
 [] default_wake_function+0x0/0xc
 [] default_wake_function+0x0/0xc
 [] worker_thread+0x0/0x23a
 [] kthread+0x8a/0xb2
 [] kthread+0x0/0xb2
 [] kernel_thread_helper+0x5/0xb
Code: 84 82 00 00 00 89 c7 b9 0d 00 00 00 89 d8 f3 ab 89 6e 04 ba 66 00 00 00 8b 44 24 04 89 06 b8 fa 44 30 c0 8b 5d 08 e8 84 ea e9 ff ff 0b 0f 88 27 0d 00 00 8b 45 08 8d 56 08 83 c0 24 8b 48 04
RE: [openib-general] Re: useraccess_cm sample client/server (gen1)
Itamar,

Thanks. I was able to use it. One more question: once the connection is established, which APIs need to be used from userland to send and receive data? Any sample code or pointers would be appreciated.

Thanks,
-Vish

--- Itamar Rabenstein <[EMAIL PROTECTED]> wrote:

> openib is now working on gen2,
> but if you want you can look at Mellanox IBGD 1.7.0 from
> www.mellanox.com; follow the link "Download IB GOLD - 1.7.0"
> and look for the udapl code.
> The code is using the user_cm IF.
>
> Itamar
>
> > -----Original Message-----
> > From: viswanath krishnamurthy [mailto:[EMAIL PROTECTED]
> > Sent: Wednesday, July 06, 2005 8:28 PM
> > To: viswanath krishnamurthy; openib-general@openib.org
> > Subject: [openib-general] Re: useraccess_cm sample client/server (gen1)
> >
> > I looked further into the whole gen1 source tree.
> > There is no consumer of this useraccess_cm API
> > (ioctl). Are there any consumers of this API? Is it
> > supported?
> >
> > Thanks,
> > Vish
> >
> > --- viswanath krishnamurthy <[EMAIL PROTECTED]> wrote:
> >
> > > Is there any sample code (examples) that uses the gen1
> > > stack's user-level CM API (ioctls)? Any pointers are
> > > appreciated.
> > >
> > > Thanks,
> > > Vish
[openib-general] Re: useraccess_cm sample client/server (gen1)
I looked further into the whole gen1 source tree. There is no consumer of this useraccess_cm API (ioctl). Are there any consumers of this API? Is it supported?

Thanks,
Vish

--- viswanath krishnamurthy <[EMAIL PROTECTED]> wrote:

> Is there any sample code (examples) that uses the gen1
> stack's user-level CM API (ioctls)? Any pointers are
> appreciated.
>
> Thanks,
> Vish
[openib-general] useraccess_cm sample client/server (gen1)
Is there any sample code (examples) that uses the gen1 stack's user-level CM API (ioctls)? Any pointers are appreciated.

Thanks,
Vish