Hal Rosenstock wrote:
The one I see that might be related is the following:
commit 39798695b4bcc7b145f8910ca56195808d3a7637
Author: Roland Dreier <[EMAIL PROTECTED]>
Date: Mon Nov 13 09:38:07 2006 -0800

    IB/mad: Fix race between cancel and receive completion

    When ib_cancel_mad() is called, it puts the canceled send on a list
    and schedules a "flushed" callback from process context. However,
    this leaves a window where a receive completion could be processed
    before the send is fully flushed.

    This is fine, except that ib_find_send_mad() will find the MAD and
    return it to the receive processing, which results in the sender
    getting both a successful receive and a "flushed" send completion for
    the same request. Understandably, this confuses the sender, which is
    expecting only one of these two callbacks, and leads to grief such as
    a use-after-free in IPoIB.

    Fix this by changing ib_find_send_mad() to return a send struct only
    if the status is still successful (and not "flushed"). The search of
    the send_list already had this check, so this patch just adds the same
    check to the search of the wait_list.

    Signed-off-by: Roland Dreier <[EMAIL PROTECTED]>

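To make the race described in the commit message concrete, here is a minimal user-space model in Python. It is not the kernel code: the class names, the string status values, and the check_wait_list_status switch are all assumptions made for illustration. It only mirrors the idea that a canceled send is marked "flushed" but can still be found on the wait_list until the flush callback runs.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Stand-ins for the kernel status codes (hypothetical string values).
STATUS_SUCCESS = "success"
STATUS_FLUSHED = "flushed"


@dataclass
class SendMad:
    tid: int                      # transaction ID used to match a response
    status: str = STATUS_SUCCESS


@dataclass
class Agent:
    wait_list: List[SendMad] = field(default_factory=list)
    send_list: List[SendMad] = field(default_factory=list)


def cancel_mad(agent: Agent, tid: int) -> None:
    """Mark a pending send as flushed; it stays listed until the flush runs."""
    for wr in agent.wait_list + agent.send_list:
        if wr.tid == tid:
            wr.status = STATUS_FLUSHED


def find_send_mad(agent: Agent, tid: int,
                  check_wait_list_status: bool = True) -> Optional[SendMad]:
    """Find the send matching a received response.

    With check_wait_list_status=False this models the behavior before the
    patch: a flushed entry on the wait_list is still returned.
    """
    for wr in agent.wait_list:
        if wr.tid == tid:
            if check_wait_list_status and wr.status != STATUS_SUCCESS:
                return None
            return wr
    # The send_list search already checked the status before the patch.
    for wr in agent.send_list:
        if wr.tid == tid and wr.status == STATUS_SUCCESS:
            return wr
    return None


agent = Agent(wait_list=[SendMad(tid=7)])
cancel_mad(agent, 7)
old_result = find_send_mad(agent, 7, check_wait_list_status=False)
new_result = find_send_mad(agent, 7)
```

In this model old_result is the canceled send, so the sender would see both a receive and a flushed completion, while new_result is None and the response is dropped.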
My search was not exhaustive.
It looks like this may be the fix for the MAD send errors. Do you
think this is also the reason opensm is not taking mastership from the
other SM?
Are they incrementing? Which node is this? I think some of them would
increment on node reboot.
It looks like some counters (symbol errors, link downed) have reached
their maximum values.
This output was captured on node vortex3l-83, the one that runs opensm.
Do you want the perfquery output before and after some time interval?
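For the before/after comparison, something like the following Python sketch would do. The dictionary keys and the sample values are hypothetical and would have to be filled in from the actual perfquery output. The limits assume the usual PortCounters field widths (16 bits for symbol errors, 8 bits for link downed). A counter already at its limit cannot show further increments, so it is reported as saturated rather than as a difference, and it would need to be reset before a second capture is useful.

```python
# Assumed field widths: 16-bit symbol error counter, 8-bit link downed counter.
COUNTER_MAX = {
    "SymbolErrorCounter": 0xFFFF,
    "LinkDownedCounter": 0xFF,
}


def compare_snapshots(before, after):
    """Return the per-counter increase, or 'saturated' if the limit was hit."""
    report = {}
    for name, limit in COUNTER_MAX.items():
        if after[name] >= limit:
            report[name] = "saturated"
        else:
            report[name] = after[name] - before[name]
    return report


# Hypothetical values, not taken from the captured output.
before = {"SymbolErrorCounter": 10, "LinkDownedCounter": 255}
after = {"SymbolErrorCounter": 25, "LinkDownedCounter": 255}
report = compare_snapshots(before, after)
```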
VBabu