from:"viswanath krishnamurthy"

Hal,

Thanks.. works like a charm...

-Viswa
On 27 Sep 2005 16:13:01 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
On Tue, 2005-09-27 at 16:00, Viswanath Krishnamurthy wrote:
> Hal,
>
> I added a hack now to get around the problem. There needs to be a
> proper fix later..

Can you try this instead ? Thanks.

-- Hal

Index: include/opensm/osm_port.h
===
--- include/opensm/osm_port.h   (revision 3567)
+++ include/opensm/osm_port.h   (working copy)
@@ -346,7 +346,7 @@ osm_physp_is_healthy(
 *  Returns TRUE if the Physical Port has been maked as healthy
 *  FALSE otherwise.
 *  All physical ports are initialized as "healthy" but may be marked
-*  otherwise if a  received trap claims otherwise.
+*  otherwise if a received trap claims otherwise.
 *
 * NOTES
 *
@@ -456,6 +456,42 @@ osm_physp_set_port_info(
 *  Port, Physical Port
 */
+/f* OpenSM: Physical Port/osm_physp_validate_base_lid
+* NAME
+*  osm_physp_validate_base_lid
+*
+* DESCRIPTION
+*  Validates the base LID in the Physical Port object.
+*
+* SYNOPSIS
+*/
+static inline boolean_t
+osm_physp_validate_base_lid(
+   IN osm_physp_t* const p_physp )
+{
+   CL_ASSERT( osm_physp_is_valid( p_physp ) );
+   if ( cl_ntoh16( p_physp->port_info.base_lid ) > IB_LID_UCAST_END_HO )
+   {
+   p_physp->port_info.base_lid = 0;
+   return FALSE;
+   }
+   return TRUE;
+}
+/*
+* PARAMETERS
+*  p_physp
+*  [in] Pointer to an osm_physp_t object.
+*
+* RETURN VALUES
+*  Returns TRUE if the base LID in the Physical port object is valid.
+*  FALSE otherwise.
+*
+* NOTES
+*
+* SEE ALSO
+*  Port, Physical Port
+*/
+
 /f* OpenSM: Physical Port/osm_physp_set_pkey_tbl
 * NAME
 *  osm_physp_set_pkey_tbl
Index: opensm/osm_port_info_rcv.c
===
--- opensm/osm_port_info_rcv.c  (revision 3579)
+++ opensm/osm_port_info_rcv.c  (working copy)
@@ -346,8 +346,12 @@ __osm_pi_rcv_process_switch_port(
   if (port_num == 0)
   {
-/* This is a management port 0 */
-   __osm_pi_rcv_process_endport(p_rcv, p_physp, p_pi);
+   /* This is switch management port 0 */
+   if ( !osm_physp_validate_base_lid( p_physp ) )
+   osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+"__osm_pi_rcv_process_switch_port: ERR 0F04: "
+"Invalid base LID corrected.\n" );
+   __osm_pi_rcv_process_endport(p_rcv, p_physp, p_pi);
   }
   OSM_LOG_EXIT( p_rcv->p_log );
@@ -367,6 +371,10 @@ __osm_pi_rcv_process_ca_port(
   UNUSED_PARAM( p_node );
   osm_physp_set_port_info( p_physp, p_pi );
+  if ( !osm_physp_validate_base_lid( p_physp ) )
+osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+"__osm_pi_rcv_process_ca_port: ERR 0F08: "
+"Invalid base LID corrected.\n" );
   __osm_pi_rcv_process_endport(p_rcv, p_physp, p_pi);
@@ -390,6 +398,10 @@ __osm_pi_rcv_process_router_port(
 Update the PortInfo attribute.
   */
   osm_physp_set_port_info( p_physp, p_pi );
+  if ( !osm_physp_validate_base_lid( p_physp ) )
+osm_log( p_rcv->p_log, OSM_LOG_ERROR,
+"__osm_pi_rcv_process_router_port: ERR 0F09: "
+"Invalid base LID corrected.\n" );
   OSM_LOG_EXIT( p_rcv->p_log );
 }
___
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
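
The check that Hal's patch in this message adds (convert the received base LID to host byte order, compare it with the top of the unicast range, and zero it if it is out of range) can be tried outside OpenSM. What follows is only a sketch under stated assumptions: `struct port_info`, `validate_base_lid`, `net16_to_host` and `host_to_net16` are hypothetical stand-ins for the OpenSM types and `cl_ntoh16`, and `LID_UCAST_END_HO` is assumed to be 0xBFFF, the highest unicast LID.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value: highest unicast LID, in host byte order. */
#define LID_UCAST_END_HO 0xBFFF

/* Hypothetical stand-in for the PortInfo field used by the patch. */
struct port_info {
    uint16_t base_lid;   /* network byte order, as received in the MAD */
};

/* Portable 16-bit network-to-host conversion (stand-in for cl_ntoh16). */
static uint16_t net16_to_host(uint16_t net)
{
    const uint8_t *b = (const uint8_t *)&net;
    return (uint16_t)((b[0] << 8) | b[1]);
}

/* Portable 16-bit host-to-network conversion, used to build test input. */
static uint16_t host_to_net16(uint16_t host)
{
    uint16_t net;
    uint8_t *b = (uint8_t *)&net;
    b[0] = (uint8_t)(host >> 8);
    b[1] = (uint8_t)(host & 0xFF);
    return net;
}

/* Returns true if base_lid does not exceed the unicast range.
 * Otherwise resets base_lid to 0 and returns false, mirroring
 * what the patch does before the endport processing runs. */
static bool validate_base_lid(struct port_info *pi)
{
    if (net16_to_host(pi->base_lid) > LID_UCAST_END_HO) {
        pi->base_lid = 0;
        return false;
    }
    return true;
}
```

Resetting the field rather than rejecting the whole PortInfo keeps the rest of the sweep going, at the cost of treating the port as if it had no LID assigned yet.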

Re: [openib-general] opensm and faulty hardware

Hal,

I added a hack now to get around the problem. There needs to be a proper fix later..

[EMAIL PROTECTED] opensm]# svn diff osm_port.h
Index: osm_port.h
===
--- osm_port.h  (revision 3549)
+++ osm_port.h  (working copy)
@@ -1049,6 +1049,8 @@
 {
    CL_ASSERT( p_physp );
    CL_ASSERT( osm_physp_is_valid( p_physp ) );
+   if (p_physp->port_info.base_lid == 0x)
+   return (0);
    return( p_physp->port_info.base_lid );
 }
 /*
On 27 Sep 2005 15:11:05 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
On Tue, 2005-09-27 at 14:13, Viswanath Krishnamurthy wrote:
> I tracked down the issue to a bug in osm_lid_mgr.c
>
> function:  __osm_lid_mgr_init_sweep(...)
>
> The bad hardware was returning an assigned LID of 0x. In this
> function there is a loop
> as follows where opensm is getting stuck.. (with line number)
>
> 392   p_port_guid_tbl = &p_mgr->p_subn->port_guid_tbl;
> 393
> 394   for( p_port = (osm_port_t*)cl_qmap_head( p_port_guid_tbl );
> 395        p_port != (osm_port_t*)cl_qmap_end( p_port_guid_tbl );
> 396        p_port = (osm_port_t*)cl_qmap_next( &p_port->map_item ) )
> 397   {
> 398     osm_port_get_lid_range_ho(p_port, &disc_min_lid, &disc_max_lid);
> 399     for (lid = disc_min_lid; lid <= disc_max_lid; lid++)  <= Bug here
> 400       cl_ptr_vector_set(p_discovered_vec, lid, p_port );
> 401   }
>
> Since the disc_max_lid and disc_min_lid are 0x, and these are
> unsigned 16 bit numbers, the condition
> in the for loop never becomes false, and opensm is stuck in the loop.
> There are couple of other places in that
> function that needs fixing too.

Sep 26 15:26:03 424135 [B66CFBB0] -> SMP dump:
    base_ver       0x1
    mgmt_class     0x81
    class_ver      0x1
    method         0x1 (SubnGet)
    D bit          0x0
    status         0x0
    hop_ptr        0x0
    hop_count      0x2
    trans_id       0x1274
    attr_id        0x15 (PortInfo)
    resv           0x0
    attr_mod       0x1
    m_key          0x
    dr_slid        0x
    dr_dlid        0x
Sep 26 15:26:03 424407 [B6ED0BB0] -> __osm_nd_rcv_process_nd: Node 0x30d32c7234
    Description = Agilent E2954A 4x Generator for InfiniBand.
Sep 26 15:26:03 424426 [B6ED0BB0] -> __osm_nd_rcv_process_nd: ]
Sep 26 15:26:03 679882 [B56CDBB0] -> SMP dump:
    base_ver       0x1
    mgmt_class     0x81
    class_ver      0x1
    method         0x81 (SubnGetResp)
    D bit          0x1
    status         0x0
    hop_ptr        0x0
    hop_count      0x2
    trans_id       0x1274
    attr_id        0x15 (PortInfo)
    resv           0x0
    attr_mod       0x1
    m_key          0x
    dr_slid        0x
    dr_dlid        0x
    Initial path: [0][1][12]
    Return path:  [0][E][0]
Sep 26 15:26:03 680291 [B76D1BB0] -> osm_pi_rcv_process: [
Sep 26 15:26:03 680323 [B56CDBB0] -> __osm_sm_mad_ctrl_rcv_callback: ]
Sep 26 15:26:03 680343 [B76D1BB0] -> PortInfo dump:
    port number    0x1
    node_guid      0x0030d32c7234
    port_guid      0x0030d32c7234
    m_key          0x
    subnet_prefix  0xfe80
    base_lid       0x

Yes, it appears the Agilent exerciser returned good status to a SM Get
PortInfo with a base_lid of 0x. The base_lid should be validated by
OpenSM.

-- Hal

Re: [openib-general] opensm and faulty hardware

I tracked down the issue to a bug in osm_lid_mgr.c 

function:  __osm_lid_mgr_init_sweep(...)

The bad hardware was returning an assigned LID of 0x. In this function there is a loop
as follows where opensm is getting stuck.. (with line number)

    392   p_port_guid_tbl = &p_mgr->p_subn->port_guid_tbl;
    393
    394   for( p_port = (osm_port_t*)cl_qmap_head( p_port_guid_tbl );
    395        p_port != (osm_port_t*)cl_qmap_end( p_port_guid_tbl );
    396        p_port = (osm_port_t*)cl_qmap_next( &p_port->map_item ) )
    397   {
    398     osm_port_get_lid_range_ho(p_port, &disc_min_lid, &disc_max_lid);
    399     for (lid = disc_min_lid; lid <= disc_max_lid; lid++)   <= Bug here
    400       cl_ptr_vector_set(p_discovered_vec, lid, p_port );
    401   }

Since the disc_max_lid and disc_min_lid are 0x, and these are unsigned 16 bit numbers, the condition
in the for loop never becomes false, and opensm is stuck in the loop.  There are a couple of other places in that
function that need fixing too.
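
The hang can be illustrated in isolation. The sketch below is not the OpenSM code: it assumes the LID value elided in this archive was 0xFFFF (the only 16-bit value for which `lid <= disc_max_lid` can never become false), and `count_lids_safe` and `next_lid_u16` are hypothetical helpers. The first iterates with a wider counter so the loop ends for every range; the second shows the wraparound that keeps the original loop alive.

```c
#include <stdint.h>

/* Hypothetical helper, not part of OpenSM: walks the LIDs in
 * [min_lid, max_lid] and returns how many were visited.
 *
 * With a uint16_t counter, `lid <= 0xFFFF` is always true and
 * lid++ wraps from 0xFFFF back to 0, so the loop never exits.
 * A 32-bit counter can reach 0x10000 and terminate the loop. */
static uint32_t count_lids_safe(uint16_t min_lid, uint16_t max_lid)
{
    uint32_t count = 0;
    uint32_t lid;

    for (lid = min_lid; lid <= max_lid; lid++)
        count++;   /* stand-in for cl_ptr_vector_set(..., lid, p_port) */
    return count;
}

/* Shows the wraparound itself: incrementing a uint16_t that holds
 * 0xFFFF yields 0, which is still <= any max_lid. */
static uint16_t next_lid_u16(uint16_t lid)
{
    return (uint16_t)(lid + 1);
}
```

Widening the counter only addresses the termination problem; whether such a LID should be accepted at all is a separate validation question.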

-Viswa
On 9/27/05, Viswanath Krishnamurthy <[EMAIL PROTECTED]> wrote:
Log sent off-list...

-Viswa
On 9/27/05, Eitan Zahavi <[EMAIL PROTECTED]> wrote:

Hi Viswa,

Please send a full /var/log/osm.log file of opensm -V .
You can send us a copy off the list if it is too big:
yael and eitan in @mellanox.co.il

EZ

Hal Rosenstock wrote:
> On Mon, 2005-09-26 at 19:57, Viswanath Krishnamurthy wrote:
>
>>I have an exerciser in the IB network. The exerciser seems to be
>>faulty/buggy. When opensm starts I do not
>>see 'SUBNET UP" message. It says "Entering MASTER"  and waits there.
>>Any new node inserted in this state is not assigned any LID.   Anybody
>>seen such behavior ?
>
> Any idea on how the IB exerciser misbehaves on the network ? Do you have
> an analyzer too ?
>
> What does the OSM log show ?
>
> -- Hal



Re: [openib-general] opensm and faulty hardware

Log sent off-list...

-Viswa
On 9/27/05, Eitan Zahavi <[EMAIL PROTECTED]> wrote:
Hi Viswa,

Please send a full /var/log/osm.log file of opensm -V .
You can send us a copy off the list if it is too big:
yael and eitan in @mellanox.co.il

EZ

Hal Rosenstock wrote:
> On Mon, 2005-09-26 at 19:57, Viswanath Krishnamurthy wrote:
>
>>I have an exerciser in the IB network. The exerciser seems to be
>>faulty/buggy. When opensm starts I do not
>>see 'SUBNET UP" message. It says "Entering MASTER"  and waits there.
>>Any new node inserted in this state is not assigned any LID.   Anybody
>>seen such behavior ?
>
> Any idea on how the IB exerciser misbehaves on the network ? Do you have
> an analyzer too ?
>
> What does the OSM log show ?
>
> -- Hal

[openib-general] opensm and faulty hardware

2005-09-26 Thread Viswanath Krishnamurthy

I have an exerciser in the IB network. The exerciser seems to be faulty/buggy. When opensm starts I do not
see the "SUBNET UP" message. It says "Entering MASTER" and waits there.
Any new node inserted in this state is not assigned any LID. Has anybody seen such behavior?

-Viswa


[openib-general] Another opensm bug ?

2005-09-26 Thread Viswanath Krishnamurthy

I ran into another opensm bug which caused opensm to stop functioning. This happened only once.

Here is the test case

1. Run opensm on Machine A
2. Run the following script on M/c B
    a. Check ibstatus
    b. Ping machine A
    c. Run osmtest
 d. reboot

The test case is to make sure opensm configures the machine correctly.
Out of 850 iterations, I saw this error once.  The opensm started receiving
Subnet traps continuously. (I did not see any message in the log to prevent DOS attacks)
The trap has the same transaction id (0x224 in this case). The opensm mad receive thread
was getting called continuously.

Initially I suspected the situation which Eitan described.. (Bad hardware causing traps etc).
But when I stopped and restarted opensm, the problem went away. Log attached off-list.

-Viswa




   

Re: [openib-general] Re: Another opensm problem ?

2005-09-26 Thread Viswanath Krishnamurthy

Hi Eitan,

I see that message in the log. 

-Viswa
On 9/24/05, Eitan Zahavi <[EMAIL PROTECTED]> wrote:
Hi Viswa and Hal,

I have read through the thread and have a few comments.
But first let me see if I understand the test run correctly. The test is as follows:
1. OpenSM starts up configuring the subnet.
2. Then the user tears off a cable and connects it to the other side port of a switch
3. The SM is supposed to bring up the new connection
4. Step 2 is repeated until the SM stops responding.

Well, if this is the case then OpenSM might stop responding due to the following features:
1. We had in the past cases where bad hardware continuously flooded the SM with Traps.
To protect against this kind of DOS attack we have implemented an adaptive filter in
the SM trap receiver:
If the exact same trap is received continuously from the same source more than 10 times
(with no more than 5 sec between the traps) they are considered DOS and are ignored.
Please see osm_trap_rcv.c for details.
2. The way IB switches work is that each time a port of theirs changes state they:
a. Set the "change bit" in the SwitchInfo
b. Send a trap 128 to the SM. But Trap 128 does not carry the changed port number.

So under a test case like you describe what can happen:
1. The SM decides to ignore trap 128 from the switch as more than 5 connect/reconnect sequences
happen with not enough "quiet" time to recover.
2. The SwitchInfo ChangeBit is sampled during the OSM light sweep. There is a race between the
reading of the change bit and the clearing of it. If the connect and disconnect happen very fast
the change bit set by the re-connect can be cleaned by the clear started by the disconnect.

It is easy to see in the log file if the SM did ignore traps. Run with -V and look for:
grep "Continuously received this trap" /var/log/osm.log
(for some reason I did not get any log attachments with this thread - otherwise I would
do some analysis on it too).

Anyway, if the SM does not heavy sweep (due to the above) it is very likely it will continue to
poll the non existing node that was previously attached to a switch port with no success.
So testing of cable tear off and reconnect should be done with at least 10 seconds recovery time.
Also you could try sending kill -HUP to the OpenSM process and see if the full sweep you start
is able to bring all ports up.

Viswa, with all that said, it is very possible you are experiencing a bug in OpenSM and we
want to encourage your effort finding those. With your, and others, help we will be able to
flush them out.

Thanks

Eitan

Hal Rosenstock wrote:
> On Fri, 2005-09-23 at 14:57, Hal Rosenstock wrote:
>
>>On Fri, 2005-09-23 at 13:50, Viswanath Krishnamurthy wrote:
>>
>>>- After 7-8 iterations, I ran into a weird problem, where opensm was
>>>showing the HCA as UNKNOWN. The port
>>>never came up to ACTIVE state.  The unplugged and replugged into
>>>different slots, the port remained in INIT
>>>state.
>>
>>Mellanox: SW : 12 : INI :  : : 2048 : 1x  : 2.5 :
>> 0002c9010d26e780 : UNKNOWN
>>
>>OpenSM thinks that either there is no physical port on the other end of
>>the link or it is not "valid" (GUID non 0). Obviously it is there as the
>>port state is INIT so the physical link came up which requires the
>>remote end to be there.
>
> From the log you sent, this is exactly what is happening.
> Sep 23 10:07:23 451191 [B7751BB0] -> osm_drop_mgr_process: Checking port
> 0x0002c9010d26e780.
> Sep 23 10:07:23 451209 [B7751BB0] -> osm_drop_mgr_process: Checking port
> 0x0002c90200400cfd.
> Sep 23 10:07:23 451226 [B7751BB0] -> osm_drop_mgr_process: ERR 0108:
> Unknown remote side for node 0x0002c9010d26e780 port 20. Adding to light
> sweep sampling list.
> Sep 23 10:07:23 451251 [B7751BB0] -> Directed Path Dump of 1 hop path:
> Path = [0][1]
> Sep 23 10:07:23 451267 [B7751BB0] -> osm_drop_mgr_process: ]
>
> So look in osm_drop_mgr.c line 707:
> Can you enhance the log display to see which is failing:
> osm_physp_is_valid(p_physp) or osm_physp_get_remote(p_physp) ?
>
> Also, it appears to keep light sweeping this port but whichever switch
> port it is on, it does not respond. Not sure where the problem is. It
> could be on the outgoing side of the switch (we could run diags against
> the switch and various ports; I would be curious what they return when
> the subnet is in this broken state) or on the HCA. However, the fact
> that restarting opensm made it go away without touching anything else
> makes this appear otherwise.
>
>>One other note is that it appears to have come up as 1x. Is that what
>>should happen ?
>
> -- Hal
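
The trap filter rule Eitan describes (the same trap from the same source, more than 10 times in a row, with at most 5 seconds between consecutive traps, is treated as a flood and ignored) might look roughly like the sketch below. This is not the code in osm_trap_rcv.c: `struct trap_filter`, `trap_filter_accept` and the two constants are hypothetical names, the state covers a single source only, and time is passed in as plain seconds to keep the example self-contained.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_REPEATS      10  /* from the description: more than 10 repeats */
#define MAX_GAP_SECONDS   5  /* from the description: at most 5 s apart    */

/* Hypothetical filter state; not the osm_trap_rcv.c layout. */
struct trap_filter {
    uint16_t last_trap_num;
    uint16_t last_source_lid;
    uint32_t last_seen;      /* seconds, monotonically increasing */
    unsigned repeats;        /* consecutive identical traps seen so far */
};

/* Returns true if the trap should be processed, false if it is
 * considered part of a flood and should be ignored. */
static bool trap_filter_accept(struct trap_filter *f, uint16_t trap_num,
                               uint16_t source_lid, uint32_t now)
{
    /* "Same" means: same trap, same source, and close enough in time. */
    bool same = f->repeats > 0 &&
                f->last_trap_num == trap_num &&
                f->last_source_lid == source_lid &&
                now - f->last_seen <= MAX_GAP_SECONDS;

    f->repeats = same ? f->repeats + 1 : 1;
    f->last_trap_num = trap_num;
    f->last_source_lid = source_lid;
    f->last_seen = now;

    return f->repeats <= MAX_REPEATS;
}
```

In this sketch a quiet period longer than 5 seconds resets the counter, which matches the advice to leave recovery time between cable pulls; a fast connect/reconnect loop keeps the counter climbing and trap 128 ends up ignored.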

Re: [openib-general] Re: opensm and SIGINT

On 23 Sep 2005 13:49:31 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
Hi Viswa,

On Fri, 2005-09-23 at 13:43, Viswanath Krishnamurthy wrote:
> More information,
>
> The test case is as follows
>
> 1. Start opensm in verbose mode (-V)
> 2. Ping remote node
> 3. osmtest -f c
> 4. osmtest -f a
> 5. pkill -9 opensm
> 6. Repeat over
>
> Out of about 2500 iterations, 143  osmtest  failed. Keep in mind,
> only Step 4 failed.

Yes.
Do you see any port LEDs on the switch blink indicating the port went
down from active and back while running this  ?

No, I ran this test overnight and logged the results.  I will try it next week and let you know.

> Step 3 which is inventory file creation *never* failed. (I think
> inventory file creation also talks to SA right ?)

Right.

-- Hal

Re: [openib-general] Forcing IB link state down

On 23 Sep 2005 13:59:28 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
Hi Viswa,

On Fri, 2005-09-23 at 13:55, Viswanath Krishnamurthy wrote:
> Is there an API or command to force an IB link to go down.

Not currently.

>  This will be helpful in running tests on opensm.

Yes, I can understand that. Technically (per the IBA spec), the SM is
the only one allowed to do Sets. I think it would be possible to have a
diag command do this as long as the MKey protection is weak (which it is
now). A better way might be to have a CLI on the OpenSM and be able to
issue a down command to a port.

I was looking if mthca driver has any API/ioctl to disable/enable the link..

-Viswa
 
-- Hal

[openib-general] Re: Another opensm problem ?

Hal,

On 23 Sep 2005 14:04:00 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
Hi again Viswa,

On Fri, 2005-09-23 at 13:50, Viswanath Krishnamurthy wrote:

Good test. Hadn't tried this. I will try it and will recreate this.

> - 2 machines with a switch in between. One m/c running opensm.

How was opensm started ?

Manually   # opensm -V

> Attached is the log

The default log is in /var/log/osm.log

I captured what appeared on the screen.   I will  send
the osm.log file too.. It is a big one and had accumulated over a
period of time..

-- Hal

[openib-general] Forcing IB link state down

Is there an API or command to force an IB link to go down? This will be helpful in running tests on opensm.

-Viswa


Re: [openib-general] Re: opensm and SIGINT

More information,

The test case is as follows

1. Start opensm in verbose mode (-V)
2. Ping remote node 
3. osmtest -f c
4. osmtest -f a
5. pkill -9 opensm
6. Repeat over

Out of about 2500 iterations, 143  osmtest  failed. Keep in
mind,  only Step 4 failed.  Step 3 which is inventory file
creation *never* failed. (I think inventory file creation also talks to
SA right ?)

-Viswa

On 23 Sep 2005 12:54:56 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
Hi Eitan,

On Fri, 2005-09-23 at 12:19, Eitan Zahavi wrote:
> Hi Hal, Viswa,
>
> Sorry I'm joining late on this thread due to the weekend (which starts
> here on Friday ending Saturday night).
> Is there any conclusion on this one?

No.

> The only log I have seen was from osmtest failing to send a MAD.

True.

> Looks like a umad issue?

Not sure why you say that. There are other possibilities I'm aware of
here:

Note that that failed sent MAD is one which has a response expected so
this means that the response was not received. It also goes through the
transmit retry strategy (I could see this on the SA side). So the only
thing I can say at this point is that for some reason, the response does
not make it back from the SA to the SA client (osmtest). That's where
this one is right now.

-- Hal

> Eitan
>
> Hal Rosenstock wrote:
> > Hi again Viswa,
> >
> > On Wed, 2005-09-21 at 21:00, Hal Rosenstock wrote:
> >
> >>Hi Viswa,
> >>
> >>On Wed, 2005-09-21 at 20:23, Viswanath Krishnamurthy wrote:
> >>
> >>>Currently opensm traps SIGINT. There was some discussion to remove it.
> >>>I have currently running some tests on opensm
> >>>by killing (SIGKILL) and restarting opensm. So far I have not found
> >>>any resource leak issues. Is there a plan to remove that
> >>>signal handler. Ideally it should not exist.
> >>
> >>Eitan stated that this was historical in nature for gen1 drivers which
> >>had resource tracking problems: "if OpenSM left without cleaning up all
> >>used resources (like MAD buffers and UD-AVs), the driver oops'ed."
> >>
> >>I think that (eliminating the handler for SIGINT) can at least be done
> >>for OSM_VENDOR_INTF_OPENIB and leave it there for the other vendor
> >>layers for starters. I will experiment with gen2 and let you know.
> >
> > Does the patch below do what you want ? Can you try it ?
> >
> > -- Hal
> >
> > Index: opensm/osm_opensm.c
> > ===
> > --- opensm/osm_opensm.c (revision 3513)
> > +++ opensm/osm_opensm.c (working copy)
> > @@ -182,7 +182,9 @@ osm_reg_sig_handler(
> > IN osm_opensm_t * const p_osm )
> >  {
> > __p_osm_to_signal = p_osm;
> > +#ifndef OSM_VENDOR_INTF_OPENIB
> > cl_reg_sig_hdl( SIGINT, __sig_handler );
> > +#endif
> > cl_reg_sig_hdl( SIGTERM, __sig_handler );
> > cl_reg_sig_hdl( SIGHUP, __sig_handler );
> > osm_exit_flag = 0;

Re: [openib-general] Re: opensm and SIGINT

On 22 Sep 2005 18:44:44 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
Hi Viswa,

On Thu, 2005-09-22 at 15:55, Viswanath Krishnamurthy wrote:
> Here is the log of osmtest failure. This was seen 150 times out of
> 2500 iterations. The opensm SUBNET UP failure is tough to reproduce.
> Saw it once in 2500 iterations. Unfortunately I did not collect the
> log on that error.

I understand but it is hard to know whether this is a known issue or
something else without a log of the failure.

> The patch worked as expected and did not see any issues with ctrl-C.
> When I tried apply the patch, I got a failure.  (I used the patch
> command). I manually added those 2 lines.

Not sure why the patch wouldn't apply.

> Command Line Arguments
> Done with args
> Flow = All Validations
> Sep 21 17:50:56 684254 [B7F026C0] -> osm_vendor_get_all_port_attr:
> assign CA mthca0 port 1 guid (0x2c90200400cfd) as the default port.
> using default guid 0x2c90200400cfd
> Sep 21 17:50:56 686301 [B7F026C0] -> osm_vendor_get_all_port_attr:
> assign CA mthca0 port 1 guid (0x2c90200400cfd) as the default port.
> Sep 21 17:50:56 686347 [B7F026C0] -> osm_vendor_bind: Binding to port
> 0x2c90200400cfd.
> Sep 21 17:50:56 689963 [B7F026C0] -> osm_vendor_get_all_port_attr:
> assign CA mthca0 port 1 guid (0x2c90200400cfd) as the default port.
> Sep 21 17:50:56 691969 [B7F026C0] -> osm_vendor_get_all_port_attr:
> assign CA mthca0 port 1 guid (0x2c90200400cfd) as the default port.
> Sep 21 17:50:56 693187 [B7F026C0] ->
> osmtest_validate_sa_class_port_info:
> -
> SA Class Port Info:
>  base_ver:1
>  class_ver:2
>  cap_mask:0x202
>  resp_time_val:0x64
> -
> Sep 21 17:50:56 775383 [B7F026C0] -> osmtest_wrong_sm_key_ignored: Try
> PortRecord for port with LID 0x0 Num:0x1.
> Sep 21 17:51:00 775320 [B76FFBB0] -> umad_receiver: ERR 5409: send
> completed with error (method=1 attr=12 trans_id=0x34) -- dropping.
> Sep 21 17:51:00 775389 [B76FFBB0] -> umad_receiver: ERR 5410: class
> 0x3 LID 0x0
> Sep 21 17:51:00 775418 [B76FFBB0] -> osmtest_query_res_cb: ERR 0003:
> Error on query (IB_TIMEOUT).
> Sep 21 17:51:00 775465 [B7F026C0] -> osmtest_wrong_sm_key_ignored: ERR
> 0011: Did not get a timeout but got (IB_SUCCESS).
> Sep 21 17:51:00 775581 [B7F026C0] -> osmt_register_service:
> Registering Service: name:osmt.srvc.1804289383.7793 id:0x6b8b26f6.
> Sep 21 17:51:00 777143 [B7F026C0] -> osmt_register_service:
> Registering Service: name:osmt.srvc.846930885.7793 id:0x327b0554
> Sep 21 17:51:00 777143 [B7F026C0] -> osmt_register_service:
> Registering Service: name:osmt.srvc.846930885.7793 id:0x327b0554.
> Sep 21 17:51:04 779578 [B76FFBB0] -> umad_receiver: ERR 5409: send
> completed with error (method=2 attr=31 trans_id=0x36) --dropping.
> Sep 21 17:51:04 779604 [B76FFBB0] -> umad_receiver: ERR 5410: class
> 0x3 LID 0x0
> Sep 21 17:51:04 779631 [B76FFBB0] -> osmtest_query_res_cb: ERR 0003:
> Error on query (IB_TIMEOUT).
> Sep 21 17:51:04 779674 [B7F026C0] -> osmt_register_service: ERR 0364:
> ib_query failed (IB_TIMEOUT).
> Sep 21 17:51:04 779740 [B7F026C0] -> osmtest_run: ERR 00148: Service
> Flow failed (IB_TIMEOUT)
> OSMTEST: TEST "All Validations" FAIL

The final FAIL/PASS is definitive so there are real failures here. Is
this consistent or intermittent ? Does this work sometimes or always
fail ?

Intermittent.. As I said 150 out of  2500 iterations failed. Is there any log you want me to collect ?

-- Hal

Re: [openib-general] Re: opensm and SIGINT

Hal,

Here is the log of osmtest failure. This was seen 150 times out of 2500
iterations. The opensm SUBNET UP failure is tough to reproduce. Saw it
once in 2500 iterations. Unfortunately I did not collect the log on
that error.

The patch worked as expected and I did not see any issues with
ctrl-C.  When I tried to apply the patch, I got a failure.  (I
used the patch command). I manually added those 2 lines.

Command Line Arguments
Done with args
    Flow = All Validations
Sep 21 17:50:56 684254 [B7F026C0] -> osm_vendor_get_all_port_attr: assign CA mthca0 port 1 guid (0x2c90200400cfd) as the default port.
using default guid 0x2c90200400cfd
Sep 21 17:50:56 686301 [B7F026C0] -> osm_vendor_get_all_port_attr: assign CA mthca0 port 1 guid (0x2c90200400cfd) as the default port.
Sep 21 17:50:56 686347 [B7F026C0] -> osm_vendor_bind: Binding to port 0x2c90200400cfd.
Sep 21 17:50:56 689963 [B7F026C0] -> osm_vendor_get_all_port_attr: assign CA mthca0 port 1 guid (0x2c90200400cfd) as the default port.
Sep 21 17:50:56 691969 [B7F026C0] -> osm_vendor_get_all_port_attr: assign CA mthca0 port 1 guid (0x2c90200400cfd) as the default port.
Sep 21 17:50:56 693187 [B7F026C0] -> osmtest_validate_sa_class_port_info:
-
SA Class Port Info:
 base_ver:1
 class_ver:2
 cap_mask:0x202
 resp_time_val:0x64
-
Sep 21 17:50:56 775383 [B7F026C0] -> osmtest_wrong_sm_key_ignored: Try PortRecord for port with LID 0x0 Num:0x1.
Sep 21 17:51:00 775320 [B76FFBB0] -> umad_receiver: ERR 5409: send completed with error (method=1 attr=12 trans_id=0x34) -- dropping.
Sep 21 17:51:00 775389 [B76FFBB0] -> umad_receiver: ERR 5410: class 0x3 LID 0x0
Sep 21 17:51:00 775418 [B76FFBB0] -> osmtest_query_res_cb: ERR 0003: Error on query (IB_TIMEOUT).
Sep 21 17:51:00 775465 [B7F026C0] -> osmtest_wrong_sm_key_ignored: ERR 0011: Did not get a timeout but got (IB_SUCCESS).
Sep 21 17:51:00 775581 [B7F026C0] -> osmt_register_service: Registering Service: name:osmt.srvc.1804289383.7793 id:0x6b8b26f6.
Sep 21 17:51:00 777143 [B7F026C0] -> osmt_register_service: Registering Service: name:osmt.srvc.846930885.7793 id:0x327b0554
Sep 21 17:51:00 777143 [B7F026C0] -> osmt_register_service: Registering Service: name:osmt.srvc.846930885.7793 id:0x327b0554.
Sep 21 17:51:04 779578 [B76FFBB0] -> umad_receiver: ERR 5409: send completed with error (method=2 attr=31 trans_id=0x36) --dropping.
Sep 21 17:51:04 779604 [B76FFBB0] -> umad_receiver: ERR 5410: class 0x3 LID 0x0
Sep 21 17:51:04 779631 [B76FFBB0] -> osmtest_query_res_cb: ERR 0003: Error on query (IB_TIMEOUT).
Sep 21 17:51:04 779674 [B7F026C0] -> osmt_register_service: ERR 0364: ib_query failed (IB_TIMEOUT).
Sep 21 17:51:04 779740 [B7F026C0] -> osmtest_run: ERR 00148: Service Flow failed (IB_TIMEOUT)
OSMTEST: TEST "All Validations" FAIL


-Viswa

On 22 Sep 2005 15:08:02 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
On Thu, 2005-09-22 at 15:06, Viswanath Krishnamurthy wrote:
> I do not think this would help.  The system is never rebooted. Just
> opensm is started  and stopped. On the next opensm start/stop the
> subnet came up. I think it is more of an opensm issue than any kernel
> module issue.

Can you run opensm in -V mode and send the log. It might be related to
the SM Set PortInfo armed->active issue which has been documented but
not resolved.

-- Hal


Re: [openib-general] Re: opensm and SIGINT

Hal,

On 22 Sep 2005 14:41:04 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
Hi Viswa,

On Thu, 2005-09-22 at 14:37, Viswanath Krishnamurthy wrote:
> Hi Hal,
>
> Sure will test it out. I see no issue in this fix. I have run the
> following test overnight
> in a script with yesterday's code
>
> 1. Start opensm
> 2. Ping another  node over IB
> 3. Run osmtest (osmtest -f c, osmtest -f a)
> 4. Kill opensm with -9 signal and repeat over
>
> The failures are  captured in a log.
>
> This has run more than 2500 times without resource leak issues. I saw
> about 150 osmtest
> failures which I will followup with another mail.

Some failures are intentional (bad flow tests). They are all not marked
obviously. Some of this has been documented on the list but not fixed
yet but I am interested in seeing what you are referring to.

I will attach the log later.

>  Once opensm failed to start  correctly with SUBNET UP message in the
> log.

So the subnet didn't come up and the ports didn't become active ? Just
out of curiosity, could you unload and reload ib_umad and then start
opensm when that occurs to see if that fixes things ? I'm not sure it
would.

I do not think this would help.  The system is never rebooted.
Just opensm is started  and stopped. On the next opensm start/stop
the subnet came up. I think it is more of an opensm issue than any
kernel module issue.

Thanks.

-- Hal

Re: [openib-general] Re: opensm and SIGINT

Hi Hal,

Sure will test it out. I see no issue in this fix. I have run the following test overnight
in a script with yesterday's code

1. Start opensm
2. Ping another  node over IB
3. Run osmtest (osmtest -f c, osmtest -f a)
4. Kill opensm with -9 signal and repeat over

The failures are  captured in a log.

This has run more than 2500 times without resource leak issues. I saw about 150 osmtest
failures which I will follow up with another mail. Once, opensm failed to start correctly: the SUBNET UP message did not appear in the log.

-Viswa
On 22 Sep 2005 11:17:46 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
Hi again Viswa,

On Wed, 2005-09-21 at 21:00, Hal Rosenstock wrote:
> Hi Viswa,
>
> On Wed, 2005-09-21 at 20:23, Viswanath Krishnamurthy wrote:
> > Currently opensm traps SIGINT. There was some discussion to remove it.
> > I am currently running some tests on opensm
> > by killing (SIGKILL) and restarting opensm. So far I have not found
> > any resource leak issues. Is there a plan to remove that
> > signal handler? Ideally it should not exist.
>
> Eitan stated that this was historical in nature for gen1 drivers which
> had resource tracking problems: "if OpenSM left without cleaning up all
> used resources (like MAD buffers and UD-AVs), the driver oops'ed."
>
> I think that (eliminating the handler for SIGINT) can at least be done
> for OSM_VENDOR_INTF_OPENIB and leave it there for the other vendor
> layers for starters. I will experiment with gen2 and let you know.

Does the patch below do what you want? Can you try it?

-- Hal

Index: opensm/osm_opensm.c
===
--- opensm/osm_opensm.c (revision 3513)
+++ opensm/osm_opensm.c (working copy)
@@ -182,7 +182,9 @@ osm_reg_sig_handler(
   IN osm_opensm_t * const p_osm )
 {
   __p_osm_to_signal = p_osm;
+#ifndef OSM_VENDOR_INTF_OPENIB
   cl_reg_sig_hdl( SIGINT, __sig_handler );
+#endif
   cl_reg_sig_hdl( SIGTERM, __sig_handler );
   cl_reg_sig_hdl( SIGHUP, __sig_handler );
   osm_exit_flag = 0;
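
For readers who want to try the idea outside OpenSM, here is a minimal sketch in plain C using the standard signal() call. It is not OpenSM code: SKIP_SIGINT_HANDLER, exit_flag, on_signal and register_handlers are made-up names standing in for OSM_VENDOR_INTF_OPENIB, osm_exit_flag, __sig_handler and osm_reg_sig_handler.

```c
#include <signal.h>

/* Hypothetical stand-in for osm_exit_flag. */
static volatile sig_atomic_t exit_flag;

/* Handler only records that a shutdown was requested. */
static void on_signal(int sig)
{
    (void)sig;
    exit_flag = 1;
}

/* Installs the handlers and returns how many were installed. */
static int register_handlers(void)
{
    int n = 0;

#ifndef SKIP_SIGINT_HANDLER   /* assumed build-time switch */
    signal(SIGINT, on_signal);
    n++;
#endif
    signal(SIGTERM, on_signal);
    n++;
    signal(SIGHUP, on_signal);
    n++;
    exit_flag = 0;
    return n;
}
```

Building with -DSKIP_SIGINT_HANDLER leaves SIGINT at its default action, so Ctrl-C terminates the process directly while SIGTERM and SIGHUP still go through the handler.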

Re: [openib-general] ib_create_cq memory leak?

Roland,

Thanks.  Tested this out.. Works like a charm...

-Viswa
On 9/21/05, Roland Dreier <[EMAIL PROTECTED]> wrote:
Thanks very much for the excellent test case.  The following patch
(already checked into svn and queued in git for merging into 2.6.14)
should fix things -- on my system, your test case ran successfully for
many hundreds of iterations.

--- linux-kernel/infiniband/hw/mthca/mthca_memfree.c	(revision 3500)
+++ linux-kernel/infiniband/hw/mthca/mthca_memfree.c	(working copy)
@@ -529,12 +529,25 @@ int mthca_alloc_db(struct mthca_dev *dev
 			goto found;
 		}
 
+	for (i = start; i != end; i += dir)
+		if (!dev->db_tab->page[i].db_rec) {
+			page = dev->db_tab->page + i;
+			goto alloc;
+		}
+
 	if (dev->db_tab->max_group1 >= dev->db_tab->min_group2 - 1) {
 		ret = -ENOMEM;
 		goto out;
 	}
 
+	if (group == 0)
+		++dev->db_tab->max_group1;
+	else
+		--dev->db_tab->min_group2;
+
 	page = dev->db_tab->page + end;
+
+alloc:
 	page->db_rec = dma_alloc_coherent(&dev->pdev->dev, 4096,
 					  &page->mapping, GFP_KERNEL);
 	if (!page->db_rec) {
@@ -554,10 +567,6 @@ int mthca_alloc_db(struct mthca_dev *dev
 	}
 
 	bitmap_zero(page->used, MTHCA_DB_REC_PER_PAGE);
-	if (group == 0)
-		++dev->db_tab->max_group1;
-	else
-		--dev->db_tab->min_group2;
 
 found:
 	j = find_first_zero_bit(page->used, MTHCA_DB_REC_PER_PAGE);
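
As far as can be read from the diff, the change is that a page freed inside a group is reused before the group boundary is pushed further. The following is a rough user-space model of that logic, not the driver code: NPAGES, struct db_tab as written here, db_tab_init, db_page_alloc and db_page_free are illustrative names, and each page is assumed to hold a single record, whereas the driver keeps MTHCA_DB_REC_PER_PAGE records per page.

```c
#include <stdbool.h>

#define NPAGES 8   /* assumed table size for the model */

struct db_tab {
    bool in_use[NPAGES]; /* true while a page is allocated */
    int max_group1;      /* next page group 0 would claim when it grows */
    int min_group2;      /* next page group 1 would claim when it grows */
};

static void db_tab_init(struct db_tab *t)
{
    for (int i = 0; i < NPAGES; i++)
        t->in_use[i] = false;
    t->max_group1 = 0;
    t->min_group2 = NPAGES - 1;
}

/* Returns the page index, or -1 when the two groups would collide. */
static int db_page_alloc(struct db_tab *t, int group)
{
    int start = group == 0 ? 0 : NPAGES - 1;
    int end   = group == 0 ? t->max_group1 : t->min_group2;
    int dir   = group == 0 ? 1 : -1;

    /* Step added by the patch: reuse a freed page inside the group. */
    for (int i = start; i != end; i += dir)
        if (!t->in_use[i]) {
            t->in_use[i] = true;
            return i;
        }

    if (t->max_group1 >= t->min_group2 - 1)
        return -1;

    /* Only now grow the group by one page. */
    if (group == 0)
        ++t->max_group1;
    else
        --t->min_group2;

    t->in_use[end] = true;
    return end;
}

static void db_page_free(struct db_tab *t, int page)
{
    t->in_use[page] = false;
}
```

In this model, repeated allocate/free cycles keep returning the same page and the boundary stops moving. Without the reuse loop every cycle would advance max_group1 until the -1 (ENOMEM in the driver) path is hit, which is consistent with the failure after several runs of the test module.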


[openib-general] opensm and SIGINT

Hal,

Currently opensm traps SIGINT. There was some discussion to remove it. I am currently running some tests on opensm
by killing (SIGKILL) and restarting opensm. So far I have not found any resource leak issues. Is there a plan to remove that
signal handler? Ideally it should not exist.

-Viswa




Re: [openib-general] Modifying QP state error

The mthca state transition code allows this transition (RTS
--> RESET), but the mthca hardware/firmware does not allow it. It
allows RTS->ERR->RESET. I will post the code to reproduce this
later. I was trying to work around the CQ destroy memory
leak by caching QP entries and reusing them, but ran into other issues.

-Viswa

On 9/21/05, Roland Dreier <[EMAIL PROTECTED]> wrote:
Hal> You can only get to RESET from ERROR. See Figure 124 QP
Hal> Context State Diagram IBA 1.2 p. 452.

I think the figure is drawn in a slightly misleading way.  The text at
the lower left says:

    It is possible to transition from any state to either the Error or
    the Reset state with the Modify QP/EE Verb.

 - R.
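
The RTS->ERR->RESET sequence mentioned earlier in this message can be written down as a small helper. This is only a sketch of the workaround, not verbs or mthca code: the enum and both function names are hypothetical, and the helper simply encodes "go through ERR first" rather than what the spec or the firmware permits.

```c
/* Hypothetical state names; the real verbs enum differs. */
enum qp_state {
    QPS_RESET, QPS_INIT, QPS_RTR, QPS_RTS, QPS_SQD, QPS_SQE, QPS_ERR
};

/*
 * Next state to request on the way to RESET when always going
 * through ERR first (the sequence reported to work on mthca).
 */
static enum qp_state next_step_to_reset(enum qp_state cur)
{
    if (cur == QPS_RESET || cur == QPS_ERR)
        return QPS_RESET;
    return QPS_ERR;
}

/* Number of modify-QP calls needed on that path. */
static int steps_to_reset(enum qp_state cur)
{
    int steps = 0;

    while (cur != QPS_RESET) {
        cur = next_step_to_reset(cur);
        steps++;
    }
    return steps;
}
```

A caller would issue one modify-QP per step, so a QP in RTS takes two calls instead of the single RTS to RESET call that fails with status 0a.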

[openib-general] Modifying QP state error

When I try to modify QP state from RTS to RESET I get the following error

ib_mthca :05:00.0: Command 1e completed with status 0a
ib_mthca :05:00.0: modify QP 7 returned status 0a.

Is modifying QP state from RTS to RESET a valid state transition?  (I guess so)
Is there anything else that needs to be taken care of?


-Viswa


[openib-general] ib_create_cq memory leak? (Resend)

I ran into this issue when using the kernel API to create CQ's. In order to reproduce this problem, I wrote
a small kernel module which creates 4K CQ's and destroys them. After running the test (8-10 times), I saw
create_cq error with error -12 (ENOMEM).

I am attaching the test module source code with Makefiles

[root src]# svn info   (Latest code)
Path: .
URL: https://openib.org/svn/gen2/trunk/src
Repository UUID: 21a7a0b7-18d7-0310-8e21-e8b31bdbf5cd
Revision: 3512
Node Kind: directory
Schedule: normal
Last Changed Author: halr
Last Changed Rev: 3511
Last Changed Date: 2005-09-21 08:57:38 -0700 (Wed, 21 Sep 2005)


To compile the code, change the KERNELSRC variable in mysock.mak to point to your kernel source tree

#make -f mysock.mak

#insmod mysock.ko

To run the test

#echo 1 > /dev/mysock

After 8-10 times of running the above, you will see a -12 error on the console.

This problem does not occur when you create a single CQ and destroy it immediately in a loop (I tried 10 times).
It occurs when you create 4K CQ's and then destroy them.


ib_mthca :05:00.0: Mapped page at 362f9000 to 7e000 for ICM.
ib_mthca :05:00.0: Mapped page at 362fa000 to 41000 for ICM.
ib_mthca :05:00.0: Mapped page at 35d1 to 7d000 for ICM.
ib_mthca :05:00.0: Mapped page at 35d11000 to 42000 for ICM.
ib_mthca :05:00.0: Mapped page at 35f27000 to 7c000 for ICM.
ib_mthca :05:00.0: Mapped page at 35f28000 to 43000 for ICM.
ib_mthca :05:00.0: Mapped page at 3593f000 to 7b000 for ICM.
ib_mthca :05:00.0: Mapped page at 3594 to 44000 for ICM.
ib_mthca :05:00.0: Mapped page at 35b56000 to 7a000 for ICM.
ib_mthca :05:00.0: Mapped page at 35b57000 to 45000 for ICM.
ib_mthca :05:00.0: Mapped page at 3556d000 to 79000 for ICM.
ib_mthca :05:00.0: Mapped page at 3556e000 to 46000 for ICM.
ib_mthca :05:00.0: Mapped page at 35785000 to 78000 for ICM.
ib_mthca :05:00.0: Mapped page at 35786000 to 47000 for ICM.
ib_mthca :05:00.0: Mapped 1 chunks/256 KB at 2604 for ICM.
ib_mthca :05:00.0: Mapped 1 chunks/256 KB at 2004 for ICM.
ib_mthca :05:00.0: Mapped 1 chunks/256 KB at 2584 for ICM.
ib_mthca :05:00.0: Mapped page at 3519b000 to 77000 for ICM.
ib_mthca :05:00.0: Mapped page at 3519c000 to 48000 for ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 7e000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 7d000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 7c000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 7b000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 7a000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 79000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 78000 from ICM.
ib_mthca :05:00.0: Unmapping 64 pages at 2584 from ICM.
ib_mthca :05:00.0: Unmapping 64 pages at 2004 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 48000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 77000 from ICM.
ib_mthca :05:00.0: Unmapping 64 pages at 2604 from ICM.
ib_mthca :05:00.0: Mapped page at 35b03000 to 76000 for ICM.
ib_mthca :05:00.0: Mapped page at 362ba000 to 75000 for ICM.
ib_mthca :05:00.0: Mapped page at 35d2c000 to 74000 for ICM.
ib_mthca :05:00.0: Mapped page at 35c83000 to 73000 for ICM.
ib_mthca :05:00.0: Mapped page at 35b99000 to 72000 for ICM.
ib_mthca :05:00.0: Mapped page at 35db to 71000 for ICM.
ib_mthca :05:00.0: Mapped page at 356c5000 to 7 for ICM.
ib_mthca :05:00.0: Mapped 1 chunks/256 KB at 2604 for ICM.
ib_mthca :05:00.0: Mapped 1 chunks/256 KB at 2004 for ICM.
ib_mthca :05:00.0: Mapped 1 chunks/256 KB at 2584 for ICM.
ib_mthca :05:00.0: Mapped page at 35adc000 to 6f000 for ICM.
ib_mthca :05:00.0: Mapped page at 35add000 to 49000 for ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 76000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 75000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 74000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 73000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 72000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 71000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 7 from ICM.
ib_mthca :05:00.0: Unmapping 64 pages at 2584 from ICM.
ib_mthca :05:00.0: Unmapping 64 pages at 2004 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 49000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 6f000 from ICM.
ib_mthca :05:00.0: Unmapping 64 pages at 2604 from ICM.
ib_mthca :05:00.0: Mapped page at 362cf000 to 6e000 for ICM.
ib_mthca :05:00.0: Mapped page at 35a0f000 to 6d000 for ICM.
ib_mthca :05:00.0: Mapped page at 3507 to 6c000 for ICM.
ib_mthca :05:00.0: Mapped page at 35e83000 to 6b000 for ICM.
ib_mthca :05:00.0: Mapped page at 35bd8000 to 6a000 for ICM.
ib_mthca :05:00.0: Mapped page at 351ef000 to 69000 for ICM.
ib_mthca :05:00.0: Mapped page at 35c440

[openib-general] ib_create_cq memory leak?

I ran into this issue when using the kernel API to create CQ's. In order to reproduce this problem, I wrote
a small kernel module which creates 4K CQ's and destroys them. After running the test (8-10 times), I saw
create_cq error with error -12 (ENOMEM).

I am attaching the test module source code with Makefiles

[root src]# svn info   (Latest code)
Path: .
URL: https://openib.org/svn/gen2/trunk/src
Repository UUID: 21a7a0b7-18d7-0310-8e21-e8b31bdbf5cd
Revision: 3512
Node Kind: directory
Schedule: normal
Last Changed Author: halr
Last Changed Rev: 3511
Last Changed Date: 2005-09-21 08:57:38 -0700 (Wed, 21 Sep 2005)


To compile the code, change the KERNELSRC variable in mysock.mak to point to your kernel source tree

#make -f mysock.mak

#insmod mysock.ko

To run the test

#echo 1 > /dev/mysock

After 8-10 times of running the above, you will see a -12 error on the console.

This problem does not occur when you create a single CQ and destroy it immediately in a loop (I tried 10 times).
It occurs when you create 4K CQ's and then destroy them.


ib_mthca :05:00.0: Mapped page at 362f9000 to 7e000 for ICM.
ib_mthca :05:00.0: Mapped page at 362fa000 to 41000 for ICM.
ib_mthca :05:00.0: Mapped page at 35d1 to 7d000 for ICM.
ib_mthca :05:00.0: Mapped page at 35d11000 to 42000 for ICM.
ib_mthca :05:00.0: Mapped page at 35f27000 to 7c000 for ICM.
ib_mthca :05:00.0: Mapped page at 35f28000 to 43000 for ICM.
ib_mthca :05:00.0: Mapped page at 3593f000 to 7b000 for ICM.
ib_mthca :05:00.0: Mapped page at 3594 to 44000 for ICM.
ib_mthca :05:00.0: Mapped page at 35b56000 to 7a000 for ICM.
ib_mthca :05:00.0: Mapped page at 35b57000 to 45000 for ICM.
ib_mthca :05:00.0: Mapped page at 3556d000 to 79000 for ICM.
ib_mthca :05:00.0: Mapped page at 3556e000 to 46000 for ICM.
ib_mthca :05:00.0: Mapped page at 35785000 to 78000 for ICM.
ib_mthca :05:00.0: Mapped page at 35786000 to 47000 for ICM.
ib_mthca :05:00.0: Mapped 1 chunks/256 KB at 2604 for ICM.
ib_mthca :05:00.0: Mapped 1 chunks/256 KB at 2004 for ICM.
ib_mthca :05:00.0: Mapped 1 chunks/256 KB at 2584 for ICM.
ib_mthca :05:00.0: Mapped page at 3519b000 to 77000 for ICM.
ib_mthca :05:00.0: Mapped page at 3519c000 to 48000 for ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 7e000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 7d000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 7c000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 7b000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 7a000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 79000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 78000 from ICM.
ib_mthca :05:00.0: Unmapping 64 pages at 2584 from ICM.
ib_mthca :05:00.0: Unmapping 64 pages at 2004 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 48000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 77000 from ICM.
ib_mthca :05:00.0: Unmapping 64 pages at 2604 from ICM.
ib_mthca :05:00.0: Mapped page at 35b03000 to 76000 for ICM.
ib_mthca :05:00.0: Mapped page at 362ba000 to 75000 for ICM.
ib_mthca :05:00.0: Mapped page at 35d2c000 to 74000 for ICM.
ib_mthca :05:00.0: Mapped page at 35c83000 to 73000 for ICM.
ib_mthca :05:00.0: Mapped page at 35b99000 to 72000 for ICM.
ib_mthca :05:00.0: Mapped page at 35db to 71000 for ICM.
ib_mthca :05:00.0: Mapped page at 356c5000 to 7 for ICM.
ib_mthca :05:00.0: Mapped 1 chunks/256 KB at 2604 for ICM.
ib_mthca :05:00.0: Mapped 1 chunks/256 KB at 2004 for ICM.
ib_mthca :05:00.0: Mapped 1 chunks/256 KB at 2584 for ICM.
ib_mthca :05:00.0: Mapped page at 35adc000 to 6f000 for ICM.
ib_mthca :05:00.0: Mapped page at 35add000 to 49000 for ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 76000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 75000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 74000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 73000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 72000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 71000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 7 from ICM.
ib_mthca :05:00.0: Unmapping 64 pages at 2584 from ICM.
ib_mthca :05:00.0: Unmapping 64 pages at 2004 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 49000 from ICM.
ib_mthca :05:00.0: Unmapping 1 pages at 6f000 from ICM.
ib_mthca :05:00.0: Unmapping 64 pages at 2604 from ICM.
ib_mthca :05:00.0: Mapped page at 362cf000 to 6e000 for ICM.
ib_mthca :05:00.0: Mapped page at 35a0f000 to 6d000 for ICM.
ib_mthca :05:00.0: Mapped page at 3507 to 6c000 for ICM.
ib_mthca :05:00.0: Mapped page at 35e83000 to 6b000 for ICM.
ib_mthca :05:00.0: Mapped page at 35bd8000 to 6a000 for ICM.
ib_mthca :05:00.0: Mapped page at 351ef000 to 69000 for ICM.
ib_mthca :05:00.0: Mapped page at 35c44000 to 6800

[openib-general] Re: [PATCH] libmthca: fix wqe post

Just wanted to confirm kernel mthca also works fine..

Thanks Roland & Michael

-Viswa
On 9/13/05, Viswanath Krishnamurthy <[EMAIL PROTECTED]> wrote:
Thanks.. yes that was the problem...

The panic was happening when I was getting these errors and pressed Ctrl-C on
the server. This may be an error path issue.

I am not seeing it now..

-Viswa
On 9/13/05, Roland Dreier <[EMAIL PROTECTED]> wrote:

Viswanath> When I ran the cmpost program which I sent you, I
Viswanath> started getting errors from the mthca library even for
Viswanath> smaller number of connections (Earlier it was
Viswanath> working).

Yeah, I found another problem with your cmpost program. I think
you're setting the packet lifetime far too low. You have:

    sa.packet_life_time = 2;

This ends up having the CM set an ACK timeout of something like 32
microseconds, which is way too low. If you poll the send CQ, you'll
probably see some "retries exceeded" errors. Setting the
packet_life_time to something like 14 or 15 should work better.

Viswanath> Also it is now easier to create the panic when you kill
Viswanath> the cmpost server program. The panic may be happening
Viswanath> on an error path.

I still have never been able to reproduce this panic (and believe me,
I've killed the cmpost program many times). Anyway, I'll take a look
at the traceback and see if anything jumps out at me.

- R.
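
To make the numbers in the reply concrete: InfiniBand timeout fields encode a duration as 4.096 microseconds times two to the power of the field value. The sketch below only evaluates that formula. ib_timeout_usec is a made-up helper name, and the assumption that the CM ends up with an exponent of roughly packet_life_time + 1 is mine; it is chosen because it matches the "something like 32 microseconds" quoted above.

```c
/*
 * Duration encoded by an IB timeout field:
 * 4.096 us * 2^value.
 */
static double ib_timeout_usec(unsigned int value)
{
    return 4.096 * (double)(1ULL << value);
}
```

Under that assumption, packet_life_time = 2 gives ib_timeout_usec(3), about 32.8 microseconds, while packet_life_time = 14 gives ib_timeout_usec(15), about 134 milliseconds.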


[openib-general] Re: [PATCH] libmthca: fix wqe post

Thanks.. yes that was the problem...

The panic was happening when I was getting these errors  and pressed Ctrl-C on
the server. This may be an error path issue. 

I am not seeing it now..

-Viswa
On 9/13/05, Roland Dreier <[EMAIL PROTECTED]> wrote:
Viswanath> When I ran the cmpost program which I sent you, I
Viswanath> started getting errors from the mthca library even for
Viswanath> smaller number of connections (Earlier it was
Viswanath> working).

Yeah, I found another problem with your cmpost program.  I think
you're setting the packet lifetime far too low.  You have:

    sa.packet_life_time = 2;

This ends up having the CM set an ACK timeout of something like 32
microseconds, which is way too low.  If you poll the send CQ, you'll
probably see some "retries exceeded" errors.  Setting the
packet_life_time to something like 14 or 15 should work better.

Viswanath> Also it is now easier to create the panic when you kill
Viswanath> the cmpost server program. The panic may be happening
Viswanath> on an error path.

I still have never been able to reproduce this panic (and believe me,
I've killed the cmpost program many times).  Anyway, I'll take a look
at the traceback and see if anything jumps out at me.

 - R.

[openib-general] Re: [PATCH] libmthca: fix wqe post

Roland,

I got the latest sources and built them along with the drivers.

Userland mthca

Your test application ran fine without any issue. (rctest)
When I ran the cmpost program which I sent you, I started getting errors from
the mthca library even for smaller number of connections (Earlier it was working). This looks
like an error dump in the mthca library.

..  [ 0] 0493
  [ 4] 
  [ 8] 
  [ c] 
  [10] 05f4
  [14]    
  [18] 0042
  [1c] fe10
failed polling CQ: 142: err 1  <=== This is from cmpost program
  [ 0] 0493
  [ 4] 
  [ 8] 
  [ c] 
  [10] 05f9
  [14] 
  [18] 0082
  [1c] fe10
failed polling CQ: 142: err 1
  [ 0] 0493

Also it is now easier to create the panic when  you kill the cmpost server program. The panic
may be happening on an error path.

printing eip:
c029197d
*pde = 35d56001
Oops:  [#1]
SMP
Modules linked in: nfs nfsd exportfs lockd autofs4 sunrpc uhci_hcd ehci_hcd hw_random e1000 ext3 jbd sd_mod
CPU:    0
EIP:    0060:[]    Not tainted VLI
EFLAGS: 00010002   (2.6.13)
EIP is at mthca_poll_cq+0x158/0x534
eax:    ebx: f5e90280   ecx: 0006   edx: 1250
esi: 023a   edi: f5e90304   ebp: f7941f0c   esp: f7941ea4
ds: 007b   es: 007b   ss: 0068
Process ib_mad1 (pid: 308, threadinfo=f794 task=f7cb7540)
Stack: f7941ed0 c0118c7d f7def41c c0355dc0 f7cb7540 f7dea41c c1a01bc0 
   0080   0286 f7ce1000 f7941f0c 0001 f7dea400
   f8806000 0292 0001  f5e90280 f7ce1000 f7def400 f7941f0c
Call Trace:
 [] load_balance_newidle+0x23/0xa2
 [] ib_mad_completion_handler+0x2c/0x8d
 [] remove_wait_queue+0xf/0x34
 [] worker_thread+0x1b0/0x23a
 [] schedule+0x5d3/0xbdf
 [] ib_mad_completion_handler+0x0/0x8d
 [] default_wake_function+0x0/0xc
 [] default_wake_function+0x0/0xc
 [] worker_thread+0x0/0x23a
 [] kthread+0x8a/0xb2
 [] kthread+0x0/0xb2
 [] kernel_thread_helper+0x5/0xb
Code: 01 00 00 8b 44 24 18 8d bb 84 00 00 00 8b 53 5c 8b 70 18 8b 4f 24
0f ce 2b b3 b8 00 00 00 8b 83 bc 00 00 00 d3 ee 01 f2 8d 14 d0
<8b> 02 8b 52 04 85 ff 89 45 00 89 55 04 74 16 8b 57 10 89 f0 39

-Viswa
On 9/13/05, Roland Dreier <[EMAIL PROTECTED]> wrote:
Viswanath> Once you generate a kernel patch, I can test out both
Viswanath> user and kernel mthca since I have the tests ready..

Excellent.  I merged MST's patch, and applied the patch below to the
kernel.  (So you can either update from svn or apply the patches)

Thanks for testing -- let me know if you still see problems.

Index: infiniband/hw/mthca/mthca_srq.c
===
--- infiniband/hw/mthca/mthca_srq.c (revision 3404)
+++ infiniband/hw/mthca/mthca_srq.c (working copy)
@@ -189,7 +189,6 @@ int mthca_alloc_srq(struct mthca_dev *de
 
 	srq->max      = attr->max_wr;
 	srq->max_gs   = attr->max_sge;
-	srq->last     = NULL;
 	srq->counter  = 0;
 
 	if (mthca_is_memfree(dev))
@@ -264,6 +263,7 @@ int mthca_alloc_srq(struct mthca_dev *de
 
 	srq->first_free = 0;
 	srq->last_free  = srq->max - 1;
+	srq->last       = get_wqe(srq, srq->max - 1);
 
 	return 0;
 
@@ -446,13 +446,11 @@ int mthca_tavor_post_srq_recv(struct ib_
 			((struct mthca_data_seg *) wqe)->addr = 0;
 		}
 
-		if (likely(prev_wqe)) {
-			((struct mthca_next_seg *) prev_wqe)->nda_op =
-				cpu_to_be32((ind << srq->wqe_shift) | 1);
-			wmb();
-			((struct mthca_next_seg *) prev_wqe)->ee_nds =
-				cpu_to_be32(MTHCA_NEXT_DBD);
-		}
+		((struct mthca_next_seg *) prev_wqe)->nda_op =
+			cpu_to_be32((ind << srq->wqe_shift) | 1);
+		wmb();
+		((struct mthca_next_seg *) prev_wqe)->ee_nds =
+			cpu_to_be32(MTHCA_NEXT_DBD);
 
 		srq->wrid[ind]  = wr->wr_id;
 		srq->first_free = next_ind;
Index: infiniband/hw/mthca/mthca_qp.c
===
--- infiniband/hw/mthca/mthca_qp.c  (revision 3404)
+++ infiniband/hw/mthca/mthca_qp.c  (working copy)
@@ -227,7 +227,6 @@ static void mthca_wq_init(struct mthca_w
 	wq->last_comp = wq->max - 1;
 	wq->head      = 0;
 	wq->tail      = 0;
-	wq->last      = NULL;
 }
 
 void mthca_qp_event(struct mthca_dev *dev, u32 qpn,
@@ -1103,6 +1102,9 @@ static int mthca_alloc_qp_common(struct
 		}
 	}
 
+	qp->sq.last = get_send_wqe(qp, qp->sq.max - 1);
+	qp->rq.last = get_recv_wqe(qp, qp->rq.max - 1);
+
 	return 0;
 }
 
@@ -1583,15 +1585,13 @@ int mthca_tavor_post_send(struct ib_qp *
 			goto out;
 		}
 
-		if (prev_wqe) {
-			((struct mthca_next_seg *) prev_wqe)->nda_op =
-				cpu_to_be32((

[openib-general] Strange configure error in libibcm

I got the latest code from the repository to verify mthca fixes, I ran into this
strange configure error in libibcm 

checking infiniband/at.h usability... yes
checking infiniband/at.h presence... yes
checking for infiniband/at.h... yes
checking for ANSI C header files... (cached) yes
checking for an ANSI C-conforming const... yes
checking for long... yes
checking size of long... configure: error: cannot compute sizeof (long), 77
See `config.log' for more details.

gcc version is 3.4
Linux 2.6.13

I was able to build earlier versions on the same machine. This happens only with libibcm. Any clues?

-Viswa


[openib-general] Re: [PATCH] libmthca: fix wqe post (was Re: strange mem-free bug)

Michael,

Thanks..

Roland,

Once you generate a kernel patch, I can test out both user and kernel mthca since I have the tests
ready..

-Viswa
On 9/13/05, Michael S. Tsirkin <[EMAIL PROTECTED]> wrote:
Quoting r. Roland Dreier <[EMAIL PROTECTED]>:
> Subject: strange mem-free bug (was: [openib-general] completion Q overflow error/panic)
>
> While looking at Viswa's example, I've found what seems to be a
> problem using lots of QPs on mem-free HCAs.

Hi, Roland!
This seems to be a bug in libmthca. Patch below.
We probably need a similar fix for kernel mthca - let me know if
you plan to work on that, otherwise I'll look into it tomorrow.
And it's probably something we want fixed for 2.6.14, right?
Let me know.

With regard to the test code that you posted - I also have some small
comments. If you plan to use it in the future, you can stick it
in svn somewhere and I'll send patches.

---

Fix posting of the first work request for memfree hardware.
Simplify code for tavor mode hardware.

Signed-off-by: Michael S. Tsirkin <[EMAIL PROTECTED]>

Index: userspace/libmthca/src/qp.c
===
--- userspace.orig/libmthca/src/qp.c	2005-09-13 17:17:58.0 +0300
+++ userspace/libmthca/src/qp.c	2005-09-13 17:26:23.0 +0300
@@ -259,15 +259,13 @@ int mthca_tavor_post_send(struct ibv_qp
 			goto out;
 		}
 
-		if (prev_wqe) {
-			((struct mthca_next_seg *) prev_wqe)->nda_op =
-				htonl(((ind << qp->sq.wqe_shift) +
-				       qp->send_wqe_offset) |
-				      mthca_opcode[wr->opcode]);
+		((struct mthca_next_seg *) prev_wqe)->nda_op =
+			htonl(((ind << qp->sq.wqe_shift) +
+			       qp->send_wqe_offset) |
+			      mthca_opcode[wr->opcode]);
 
-			((struct mthca_next_seg *) prev_wqe)->ee_nds =
-				htonl((size0 ? 0 : MTHCA_NEXT_DBD) | size);
-		}
+		((struct mthca_next_seg *) prev_wqe)->ee_nds =
+			htonl((size0 ? 0 : MTHCA_NEXT_DBD) | size);
 
 		if (!size0) {
 			size0 = size;
@@ -353,12 +351,10 @@ int mthca_tavor_post_recv(struct ibv_qp
 
 		qp->wrid[ind] = wr->wr_id;
 
-		if (prev_wqe) {
-			((struct mthca_next_seg *) prev_wqe)->nda_op =
-				htonl((ind << qp->rq.wqe_shift) | 1);
-			((struct mthca_next_seg *) prev_wqe)->ee_nds =
-				htonl(MTHCA_NEXT_DBD | size);
-		}
+		((struct mthca_next_seg *) prev_wqe)->nda_op =
+			htonl((ind << qp->rq.wqe_shift) | 1);
+		((struct mthca_next_seg *) prev_wqe)->ee_nds =
+			htonl(MTHCA_NEXT_DBD | size);
 
 		if (!size0)
 			size0 = size;
@@ -562,15 +558,13 @@ int mthca_arbel_post_send(struct ibv_qp
 			goto out;
 		}
 
-		if (prev_wqe) {
-			((struct mthca_next_seg *) prev_wqe)->nda_op =
-				htonl(((ind << qp->sq.wqe_shift) +
-				       qp->send_wqe_offset) |
-				      mthca_opcode[wr->opcode]);
-			mb();
-			((struct mthca_next_seg *) prev_wqe)->ee_nds =
-				htonl(MTHCA_NEXT_DBD | size);
-		}
+		((struct mthca_next_seg *) prev_wqe)->nda_op =
+			htonl(((ind << qp->sq.wqe_shift) +
+			       qp->send_wqe_offset) |
+			      mthca_opcode[wr->opcode]);
+		mb();
+		((struct mthca_next_seg *) prev_wqe)->ee_nds =
+			htonl(MTHCA_NEXT_DBD | size);
 
 		if (!size0) {
 			size0 = size;
@@ -767,6 +761,8 @@ int mthca_alloc_qp_buf(struct ibv_pd *pd
 		}
 	}
 
+	qp->sq.last = get_send_wqe(qp, qp->sq.max - 1);
+	qp->rq.last = get_recv_wqe(qp, qp->sq.max - 1);
 	return 0;
 }
 
Index: userspace/libmthca/src/srq.c
===
--- userspace.orig/libmthca/src/srq.c	2005-09-13 17:25:41.0 +0300
+++ userspace/libmthca/src/srq.c	2005-09-13 17:25:51.0 +0300
@@ -142,13 +142,11 @@ int mthca_tavor_post_srq_recv(struct ibv
 			((struct mthca_data_seg *) wqe)->addr = 0;
 		}
 
-		if (prev_wqe) {
-			((struct mthca_next_seg *) prev_wqe)->nda_op =
-				htonl((ind << srq->wqe_shift) | 1);
-			mb();
-			((struct mthca_next_seg *) prev_wqe)->ee_nds =
-				htonl(MTHCA_NEXT_DBD);
-		}
+		((struct mthca_next_seg *) prev
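
The core of the change, as far as the diff shows, is that the "previous WQE" pointer starts out pointing at the last slot of the ring instead of NULL, so the first post links from a real entry and the `if (prev_wqe)` test can go. A toy model of that idea follows; it is not libmthca code, and WQ_MAX, struct wqe, struct wq, wq_init and wq_post are all illustrative names.

```c
#define WQ_MAX 4  /* assumed ring size for the model */

struct wqe {
    int next;   /* index of the WQE posted after this one, -1 if none */
};

struct wq {
    struct wqe ring[WQ_MAX];
    struct wqe *last;   /* previously posted WQE */
    int next_ind;       /* slot for the next post */
};

static void wq_init(struct wq *wq)
{
    for (int i = 0; i < WQ_MAX; i++)
        wq->ring[i].next = -1;
    wq->next_ind = 0;
    /* As in the patch: point at the final slot instead of NULL. */
    wq->last = &wq->ring[WQ_MAX - 1];
}

/* Post one request; returns the slot it was placed in. */
static int wq_post(struct wq *wq)
{
    int ind = wq->next_ind;
    struct wqe *prev = wq->last;

    prev->next = ind;               /* unconditional: prev is never NULL */
    wq->last = &wq->ring[ind];
    wq->next_ind = (ind + 1) % WQ_MAX;
    return ind;
}
```

After wq_init, the very first wq_post writes the link into slot WQ_MAX - 1, which mirrors the first work request now being chained from the last WQE of the buffer.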

[openib-general] Status of opensm 1.8 merge

2005-09-12 Thread Viswanath Krishnamurthy

Can I start testing opensm 1.8 merge on gen2  ?   What is the current status ?

-Viswa


Re: [openib-general] completion Q overflow error/panic

2005-09-10 Thread Viswanath Krishnamurthy

Here is ibv_devinfo output. It is InfiniHost_III_Lx0

]# ibv_devinfo
hca_id: mthca0
        fw_ver:                 1.0.1
        node_guid:              0002:c902:0040:0cfc
        sys_image_guid:         0002:c902:0040:0cff
        max_mr_size:            0x
        page_size_cap:          0x0
        vendor_id:              0x02c9
        vendor_part_id:         25204
        hw_ver:                 0x0
        phys_port_cnt:          1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        invalid MTU (0)
                        active_mtu:     invalid MTU (0)
                        sm_lid:         1
                        port_lid:       1
                        port_lmc:       0x00


Yes, the CQE value is a bug. But in this case there should be at most one
outstanding packet in the pipe at any time. The client sends 1 packet, waits for the response with a
pause (delay), then sends the next packet. If everything works, we should be
using at most 1 CQ entry. Initially I had a larger number of CQ entries, but the problem
appeared later.

Looks like the packet is getting stuck somewhere, with no notification
back of any error.  Do we need to tweak any of the QP parameters ?
(packet life time, retries etc)  ?

-Viswa


On 9/9/05, Roland Dreier <[EMAIL PROTECTED]> wrote:
I found one bug in your cmpost.c program that could cause CQ
overruns.  When you create your receive and send CQs, you create them
with a cqe value of 5, so they can hold at most 5 entries.  However,
you create the send and receive work queues so they can hold up to 10
entries, and in fact the code will post up to 8 entries at a time.  So
it's possible to overflow the CQ.

The fix is to create the CQs to have at least as many entries as the
work queues -- in other words, change cqe to 10.

However, even with this fixed I do see some strange behavior that I'm
still debugging.  More details on Monday.

What HCA firmware version do your systems have?

 - R.
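
The sizing rule in this reply can be stated as a one-line check. The helper below is hypothetical (cq_can_overflow is not a verbs call) and assumes one completion per posted work request and one work queue feeding each CQ, as in the cmpost setup described.

```c
/*
 * Returns 1 if a CQ with `cqe` entries can be overrun by a work
 * queue that allows `wq_depth` outstanding requests, else 0.
 * Assumes every posted request generates exactly one completion.
 */
static int cq_can_overflow(int cqe, int wq_depth)
{
    return cqe < wq_depth;
}
```

With the values from the message, cq_can_overflow(5, 10) is 1 and cq_can_overflow(10, 10) is 0.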

Re: [openib-general] completion Q overflow error/panic

2005-09-09 Thread Viswanath Krishnamurthy


Some more info..

This also happens in the kernel level. I have a small kernel module which does the echo
reply.  After about 100-200 connections, I start to see the following message

ib_mthca :05:00.0: SQ 590473 full (8 head, 0 tail, 8 max, 0 nreq)
ib_mthca :05:00.0: SQ 590477 full (8 head, 0 tail, 8 max, 0 nreq)
ib_mthca :05:00.0: SQ 59040c full (8 head, 0 tail, 8 max, 0 nreq)

Below 100 connections I do not see any such messages.   
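
For reference, the fields in the "SQ ... full" message read as counters: head requests posted, tail requests completed, max the queue capacity, nreq the requests already added by the current post call. A sketch of the kind of check that would print such a message is given below; it is an assumption about the shape of the test, not a copy of the driver code, and struct wq_counters and wq_is_full are made-up names.

```c
struct wq_counters {
    unsigned int head;  /* requests posted so far */
    unsigned int tail;  /* requests completed so far */
    unsigned int max;   /* queue capacity */
};

/* Returns 1 if posting one more request would exceed the capacity. */
static int wq_is_full(const struct wq_counters *wq, unsigned int nreq)
{
    /* Unsigned subtraction stays correct when the counters wrap. */
    unsigned int in_flight = wq->head - wq->tail;

    return in_flight + nreq >= wq->max;
}
```

With head 8, tail 0 and max 8 the queue is full even though nreq is 0, which suggests that none of the eight earlier sends had been reaped from the CQ when the next post was attempted.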

Looks like if there is a problem, it exists in both the kernel and userland APIs.

-Viswa


On 9/9/05, Roland Dreier <[EMAIL PROTECTED]> wrote:
Thanks for the excellent bug report.  I'll try your code and see if I
can reproduce the problem.  If I can, then I should be able to fix the
bugs.

 - R.

[openib-general] completion Q overflow error/panic

2005-09-09 Thread Viswanath Krishnamurthy

Somehow gmail ate away the main content of my mail..

Here it is..


I modified the cmpost program to have individual completion send/receive Q's.  The cmpost
server acts like an echo server, echoing back anything it receives. The client program keeps sending
the packets.

The test works fine up to around 600 connections. After 600 connections, I start to see ibv_post_send
errors. I added some debug messages in libmthca/src/qp.c where a check is made for wq_overflow. In fact
it is overflowing. I checked the code to make sure all the send descriptors are recovered with the cq_poll operation. Also
the wc.status field is checked for any errors.

I am attaching the modified code.

bash-3.00$ svn info
Path: .
URL: https://openib.org/svn/gen2/trunk
Repository UUID: 21a7a0b7-18d7-0310-8e21-e8b31bdbf5cd
Revision: 3344
Node Kind: directory
Schedule: normal
Last Changed Author: jlentini
Last Changed Rev: 3344
Last Changed Date: 2005-09-08 16:39:25 -0700 (Thu, 08 Sep 2005)


To run the test compile the code 

cc -o cmpost cmpost.c -libcm -libverbs -libat

$ cmpost -n 1024    <=== as server

$ cmpost -c  -n 1024 -l  -g 

After some time you start seeing post_send errors. On my system up to 600 connections work fine.


When running the test I saw panics a couple of times, but they are difficult to reproduce.

kernel BUG at include/asm/spinlock.h:149!
invalid operand:  [#1]
SMP
Modules linked in: nfs nfsd exportfs lockd autofs4 sunrpc uhci_hcd ehci_hcd hw_random e1000 ext3 jbd sd_mod
CPU:    1
EIP:    0060:[]    Not tainted VLI
EFLAGS: 00010086   (2.6.13)
EIP is at _spin_lock_irqsave+0x47/0x51
eax: 0011   ebx: 0282   ecx: c035950c   edx: 0082
esi: f7d82010   edi:    ebp: f6792c80   esp: c1a33ed0
ds: 007b   es: 007b   ss: 0068
Process ib_mad1 (pid: 308, threadinfo=c1a32000 task=f7e3c540)
Stack: c03123ee c0276963 f6792c80 f7d82010 c0276963 f79a6adc f7974b00 0001
   c1a33f0c f7912e00 f7df2000 f7df4200 c1a33f0c 0292 c0276b96 f6792c80
      b93e2c00 0128 0296 0402 0001
Call Trace:
 [] ib_mad_send_done_handler+0x72/0x11e
 [] ib_mad_send_done_handler+0x72/0x11e
 [] ib_mad_completion_handler+0x80/0x8d
 [] wait_noreap_copyout+0x55/0xbe
 [] worker_thread+0x1b0/0x23a
 [] schedule+0x5d3/0xbdf
 [] ib_mad_completion_handler+0x0/0x8d
 [] default_wake_function+0x0/0xc
 [] default_wake_function+0x0/0xc
 [] worker_thread+0x0/0x23a
 [] kthread+0x8a/0xb2
 [] kthread+0x0/0xb2
 [] kernel_thread_helper+0x5/0xb
Code: 00 00 74 01 fb f3 90 80 3e 00 7e f9 fa eb e8 83 c4 08 89 d8 5b 5e
c3 8b 44 24 10 c7 04 24 ee 23 31 c0 89 44 24 04 e8 2f e7 e1 ff
<0f> 0b 95 00 39 1c 31 c0 eb c2 53 89 c3 83 ec 08 fa 81 78 04 ad



-Viswa

[openib-general] completion Q overflow error/panic

2005-09-09 Thread Viswanath Krishnamurthy

I modified the cmpost program to have individual send/receive completion queues. The cmpost
server acts like an echo server, echoing back anything it receives. The client program keeps sending
packets.

The test works fine up to around 600 connections. Beyond 600 connections, I start to see ibv_post_send
errors. I added some debug messages in libmthca/src/qp.c, where the check for wq_overflow is made, and the
work queue is in fact overflowing. I checked the code to make sure all the send descriptors are recovered with the
cq_poll operation. The wc.status field is also checked for errors.
I am attaching the modified code.
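
For readers following along, the bookkeeping described above (never post more sends than the queue can hold, and give a slot back for each send completion polled) can be sketched roughly as below. This is only an illustration under assumptions, not the code in the attached cmpost.c: the struct and the credits_* helpers are hypothetical names, and the sketch assumes every send is posted signaled so that each one eventually shows up as one entry from ibv_poll_cq().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-QP bookkeeping; not taken from cmpost.c or libmthca. */
struct send_credits {
    int max_send_wr;  /* send queue depth requested when the QP was created */
    int outstanding;  /* sends posted whose completions were not yet polled */
};

static void credits_init(struct send_credits *c, int max_send_wr)
{
    assert(max_send_wr > 0);
    c->max_send_wr = max_send_wr;
    c->outstanding = 0;
}

/* Call before ibv_post_send(); returns false when the send queue is full. */
static bool credits_try_take(struct send_credits *c)
{
    if (c->outstanding >= c->max_send_wr)
        return false;
    c->outstanding++;
    return true;
}

/* Call with the number of send completions returned by ibv_poll_cq(). */
static void credits_release(struct send_credits *c, int completed)
{
    assert(completed >= 0 && completed <= c->outstanding);
    c->outstanding -= completed;
}
```

If credits_try_take() returns false, the caller would poll the send CQ and retry instead of calling ibv_post_send(). If the counter never reaches the limit and the library still reports an overflow, that would suggest the completions are not being credited to the queue the application expects.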

bash-3.00$ svn info
Path: .
URL: https://openib.org/svn/gen2/trunk
Repository UUID: 21a7a0b7-18d7-0310-8e21-e8b31bdbf5cd
Revision: 3344
Node Kind: directory
Schedule: normal
Last Changed Author: jlentini
Last Changed Rev: 3344
Last Changed Date: 2005-09-08 16:39:25 -0700 (Thu, 08 Sep 2005)


To run the test, compile the code:

cc -o cmpost cmpost.c -libcm -libverbs -libat

$ cmpost -n 1024    <=== as server

$ cmpost -c  -n 1024 -l  -g 

After some time you start seeing post_send errors. On my system, up to 600 connections work fine.


When running the test I saw panics a couple of times, but they are difficult to reproduce.

kernel BUG at include/asm/spinlock.h:149!
invalid operand:  [#1]
SMP
Modules linked in: nfs nfsd exportfs lockd autofs4 sunrpc uhci_hcd ehci_hcd hw_random e1000 ext3 jbd sd_mod
CPU:    1
EIP:    0060:[]    Not tainted VLI
EFLAGS: 00010086   (2.6.13)
EIP is at _spin_lock_irqsave+0x47/0x51
eax: 0011   ebx: 0282   ecx: c035950c   edx: 0082
esi: f7d82010   edi:    ebp: f6792c80   esp: c1a33ed0
ds: 007b   es: 007b   ss: 0068
Process ib_mad1 (pid: 308, threadinfo=c1a32000 task=f7e3c540)
Stack: c03123ee c0276963 f6792c80 f7d82010 c0276963 f79a6adc f7974b00 0001
   c1a33f0c f7912e00 f7df2000 f7df4200 c1a33f0c 0292 c0276b96 f6792c80
      b93e2c00 0128 0296 0402 0001
Call Trace:
 [] ib_mad_send_done_handler+0x72/0x11e
 [] ib_mad_send_done_handler+0x72/0x11e
 [] ib_mad_completion_handler+0x80/0x8d
 [] wait_noreap_copyout+0x55/0xbe
 [] worker_thread+0x1b0/0x23a
 [] schedule+0x5d3/0xbdf
 [] ib_mad_completion_handler+0x0/0x8d
 [] default_wake_function+0x0/0xc
 [] default_wake_function+0x0/0xc
 [] worker_thread+0x0/0x23a
 [] kthread+0x8a/0xb2
 [] kthread+0x0/0xb2
 [] kernel_thread_helper+0x5/0xb
Code: 00 00 74 01 fb f3 90 80 3e 00 7e f9 fa eb e8 83 c4 08 89 d8 5b 5e
c3 8b 44 24 10 c7 04 24 ee 23 31 c0 89 44 24 04 e8 2f e7 e1 ff
<0f> 0b 95 00 39 1c 31 c0 eb c2 53 89 c3 83 ec 08 fa 81 78 04 ad



-Viswa





cmpost.c
Description: Binary data

Re: [Fwd: Re: [openib-general] kernel oops]

See inline..

On 02 Sep 2005 17:04:42 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
On Fri, 2005-09-02 at 16:59, Viswanath Krishnamurthy wrote:
> Here is the setup..

Thanks. A couple more questions:

> #svn info
> Path: .
>
> URL: https://openib.org/svn/gen2/trunk
> Repository UUID: 21a7a0b7-18d7-0310-8e21-e8b31bdbf5cd
> Revision: 3295
> Node Kind: directory
> Schedule: normal
> Last Changed Author: halr
> Last Changed Rev: 3295
> Last Changed Date: 2005-09-01 12:07:54 -0700 (Thu, 01 Sep 2005)
>
>
> Patch applied to core/at.c and kernel 2.6.13 recompiled.
>
>
> Machine  A
> =
> Running opensm
>
> Run ucmpost
>
> machine B
> =
> ./ucmpost

Are these back to back HCAs or is there a switch in between ?

There is a switch in between. A simple setup with 2 machines and a switch. The machines are running
2.6.13. One of them is running opensm.

> The problem is reproducible when you *cannot* ping each other

over IPoIB ?

Yes..

> [EMAIL PROTECTED] ~]# ibv_devinfo
> hca_id: mthca0
>         fw_ver:                 1.0.1
>         node_guid:              0002:c902:0040:0d00
>         sys_image_guid:         0002:c902:0040:0d03
>         max_mr_size:            0x
>         page_size_cap:          0x0
>         vendor_id:              0x02c9
>         vendor_part_id:         25204
>         hw_ver:                 0x0
>         phys_port_cnt:          1
>                 port:   1
>                         state:          PORT_ACTIVE (4)
>                         max_mtu:        invalid MTU (0)  < What is this ??
>
>                         active_mtu:     invalid MTU (0)

If the program is right and those are the real values, somehow max_mtu
is trashed which causes active_mtu to be invalid which could break all
sorts of things...

Is there some issue with the HCA ?

>                         sm_lid:         1
>                         port_lid:       3
>                         port_lmc:       0x00

That's on the remote (from the SM) machine.

-- Hal
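
As background on why 0 is flagged as invalid: in libibverbs the port MTU fields are small enum codes (IBV_MTU_256 = 1 through IBV_MTU_4096 = 5), not byte counts, so 0 falls outside the valid range. A minimal sketch of the decoding follows; the helper name mtu_code_to_bytes is made up for illustration and is not part of libibverbs or ibv_devinfo.

```c
#include <assert.h>

/*
 * Hypothetical helper: map an IB MTU code to its size in bytes.
 * Valid codes are 1..5 (256, 512, 1024, 2048, 4096 bytes).
 * Returns 0 for anything else, including the code 0 seen above.
 */
static int mtu_code_to_bytes(int code)
{
    if (code < 1 || code > 5)
        return 0;
    return 128 << code;  /* 1 -> 256, 2 -> 512, ..., 5 -> 4096 */
}
```

A port reported as PORT_ACTIVE would normally carry a nonzero code in both max_mtu and active_mtu, which is why a 0 in both fields looks like trashed port attributes rather than a real setting.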

Re: [Fwd: Re: [openib-general] kernel oops]

Here is the setup..

#svn info
Path: .

URL: https://openib.org/svn/gen2/trunk
Repository UUID: 21a7a0b7-18d7-0310-8e21-e8b31bdbf5cd
Revision: 3295
Node Kind: directory
Schedule: normal
Last Changed Author: halr
Last Changed Rev: 3295
Last Changed Date: 2005-09-01 12:07:54 -0700 (Thu, 01 Sep 2005)


Patch applied to core/at.c and kernel 2.6.13 recompiled.


Machine  A
=
Running opensm

Run ucmpost

machine B
=
./ucmpost 

The problem is reproducible when you *cannot* ping each other

[EMAIL PROTECTED] ~]# ibv_devinfo
hca_id: mthca0
        fw_ver:                 1.0.1
        node_guid:              0002:c902:0040:0d00
        sys_image_guid:         0002:c902:0040:0d03
        max_mr_size:            0x
        page_size_cap:          0x0
        vendor_id:              0x02c9
        vendor_part_id:         25204
        hw_ver:                 0x0
        phys_port_cnt:          1
                port:   1
                        state:          PORT_ACTIVE (4)
                        max_mtu:        invalid MTU (0)  < What is this ??>
                        active_mtu:     invalid MTU (0)
                        sm_lid:         1
                        port_lid:       3
                        port_lmc:       0x00


-Viswa



On 02 Sep 2005 16:02:44 -0400, Hal Rosenstock <[EMAIL PROTECTED]> wrote:
On Fri, 2005-09-02 at 15:39, Viswanath Krishnamurthy wrote:
> The patch failed to fix the panic..

Can you describe your setup ? Did you just run ucmpost without an SM/SA
running or is it a different scenario ?

Thanks.

-- Hal

Re: [Fwd: Re: [openib-general] kernel oops]

I am working on it. With the updated version of the code, it is slightly more difficult to reproduce.

-Viswa

On 9/2/05, Roland Dreier <[EMAIL PROTECTED]> wrote:
Not really related to the ib_at oops, since I don't know that code.
But have you made any progress in being able to post the code to
reproduce the other oops (at mthca_poll_cq)?

Thanks,
  Roland


Re: [Fwd: Re: [openib-general] kernel oops]