Hal Rosenstock wrote:

The one I see that might be related is the following:

commit 39798695b4bcc7b145f8910ca56195808d3a7637
Author: Roland Dreier <[EMAIL PROTECTED]>
Date:   Mon Nov 13 09:38:07 2006 -0800

   IB/mad: Fix race between cancel and receive completion

   When ib_cancel_mad() is called, it puts the canceled send on a list
   and schedules a "flushed" callback from process context.  However,
   this leaves a window where a receive completion could be processed
   before the send is fully flushed.

   This is fine, except that ib_find_send_mad() will find the MAD and
   return it to the receive processing, which results in the sender
   getting both a successful receive and a "flushed" send completion for
   the same request.  Understandably, this confuses the sender, which is
   expecting only one of these two callbacks, and leads to grief such as
   a use-after-free in IPoIB.

   Fix this by changing ib_find_send_mad() to return a send struct only
   if the status is still successful (and not "flushed").  The search of
   the send_list already had this check, so this patch just adds the same
   check to the search of the wait_list.

   Signed-off-by: Roland Dreier <[EMAIL PROTECTED]>
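
To make the described fix concrete, here is a minimal user-space sketch of the lookup, not the actual kernel code. The types and names (struct send_wr, struct mad_agent, find_send, search_list) are simplified stand-ins invented for illustration, and the sketch assumes transaction IDs are unique within an agent. The point is only that both lists apply the same "status is still successful" test, so a canceled send is never handed to receive processing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel types. */
enum wc_status { WC_SUCCESS, WC_FLUSH_ERR };

struct send_wr {
    unsigned long long tid;  /* transaction ID used to match a response */
    enum wc_status status;   /* becomes WC_FLUSH_ERR once the send is canceled */
    struct send_wr *next;    /* singly linked here for brevity */
};

struct mad_agent {
    struct send_wr *wait_list;  /* sends waiting for a response or timeout */
    struct send_wr *send_list;  /* sends still posted to the hardware */
};

/* Return the entry matching tid only if it has not been canceled. */
static struct send_wr *search_list(struct send_wr *head, unsigned long long tid)
{
    for (struct send_wr *wr = head; wr != NULL; wr = wr->next) {
        if (wr->tid == tid && wr->status == WC_SUCCESS)
            return wr;
    }
    return NULL;
}

/* Apply the same status check to the wait list and the send list. */
static struct send_wr *find_send(struct mad_agent *agent, unsigned long long tid)
{
    assert(agent != NULL);

    struct send_wr *wr = search_list(agent->wait_list, tid);
    if (wr != NULL)
        return wr;
    return search_list(agent->send_list, tid);
}
```

With this check in place, a response that arrives for a canceled request finds no matching send, so the sender sees only the "flushed" completion.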

My search was not exhaustive.
It looks like this may be the fix for the MAD send errors. Do you think this is also the cause of opensm not grabbing mastership from the other SM?


Are they incrementing? Which node is this? I think some of them would
increment on node reboot.
It looks like some counters (symbol errors, link downed) have already reached their maximum values.
This output was captured on node vortex3l-83, the one that runs opensm.
Do you want the perfquery output before and after some time interval?
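
For comparing two perfquery snapshots, a rough sketch of the arithmetic follows. It assumes the PortCounters field widths I believe the IB spec defines (SymbolErrorCounter 16 bits, LinkDownedCounter 8 bits) and that these counters stick at their maximum instead of wrapping. The names counter_diff and struct counter_delta are made up for illustration and are not part of any existing tool.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed maximum values for the two counters discussed above. */
#define SYMBOL_ERROR_MAX 0xFFFFu  /* 16-bit counter */
#define LINK_DOWNED_MAX  0xFFu    /* 8-bit counter */

struct counter_delta {
    uint32_t delta;    /* increase observed between the two snapshots */
    bool saturated;    /* true if the later snapshot is pegged at max */
};

/*
 * Compare two readings of one counter. If the later reading is smaller,
 * assume the counter was reset in between and report the later value.
 * A saturated counter can hide further errors, so its delta is only a
 * lower bound.
 */
static struct counter_delta counter_diff(uint32_t before, uint32_t after,
                                         uint32_t max)
{
    struct counter_delta d;

    d.saturated = (after >= max);
    d.delta = (after >= before) ? (after - before) : after;
    return d;
}
```

A counter that is already pegged in both snapshots reports a delta of zero with saturated set, which means it has to be reset before an interval comparison tells us anything.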

VBabu
_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
