On Mon, 04 Oct 2004 15:34:51 -0400 Hal Rosenstock <[EMAIL PROTECTED]> wrote:
> I am pretty sure there is a window here as follows:
> First, deregistration cancels the MAD removing it from the agent send
> list.
> ib_mad_complete_send_wr is invoked some time later and never checks for
> the send WR still being on the agent send list. It just assumes it is.
> It potentially makes a send callback.

The deregistration only removes the mad_send_wr from the agent send list
if its reference count is zero.  A reference is held on the mad_send_wr
from the time that a work request is posted to the port until a
completion is reported.  So you should never get a callback for a
mad_send_wr unless its reference count is at least one.

> Aren't some errors fine grained and pertain only to the WR supplied
> whereas other errors are coarser (like fatal and general) and might
> apply to something larger (perhaps the port but maybe the QP)? I wonder
> whether there is any assistance in the Mellanox documentation as to
> which errors should be treated how.

I was referring to errors that applied to a single work request only.
For fatal errors that we cannot recover from, we may need a way to
report such errors to the user to indicate that their mad_agent is no
longer operational.

> > It would help in this case for the port layer code
> > to just return completions for all queued work requests to the MAD
> > agents, and let the MAD agent code deal with the issue.
>
> True for most errors. Not sure about fatal and general errors yet.

I think it would depend on the error code that was reported in the
send_mad_wc.  If the return code is flushed, the mad_agent could just
repost the send.  If the return code is a fatal error, it should
complete the MAD to the client.

> > > 3. The final scenario is board (not currently possible) or module
> > > removal. My concern here is about potential send callbacks (indicating
> > > FLUSHED) to a potentially stale MAD agent. When the module is removed
> > > non-forcibly, the clients (upper layer modules) would need to be
> > > removed first, which should cause the proper deregistration (and these
> > > MADs would be cancelled so there would be none to clean up). I am not
> > > sure what the rules for proper behavior are on forcible module removal.
> > > Board removal would be similar to this (the forcible module removal
> > > case).
> >
> > Deregistration is a synchronous process, so will wait until all
> > send MADs have completed. If this isn't happening, then the
> > reference counting is off somewhere.
>
> I think deregistration is fine (short of issue 1 which I think is
> readily fixable). I was more asking about the asynchronous scenario here
> (forced module (or board) removal) where that isn't the case.

Unless there's a bug in the code, I don't believe that we can have send
callbacks to stale MAD agents.  If you're trying to have the code
deregister for a client, this would be impossible.  Clients should
receive some sort of removal notification event and would need to
deregister in response to that event.
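
To make the two rules above concrete (a reference is held from post until completion, and a completion is handled according to its status), here is a minimal Python model. It is only a sketch of the behavior described in this message, not the MAD layer code: the class names, the WcStatus values, and every method below are hypothetical and were chosen for illustration.

```python
from enum import Enum, auto


class WcStatus(Enum):
    """Hypothetical completion statuses; not the real verbs status codes."""
    SUCCESS = auto()
    FLUSHED = auto()
    FATAL = auto()


class MadSendWr:
    """Stand-in for a mad_send_wr; only the reference count is modeled."""

    def __init__(self, name):
        self.name = name
        self.refcount = 0


class MadAgent:
    """Stand-in for a MAD agent with a send list (illustrative only)."""

    def __init__(self):
        self.send_list = []
        self.reposted = []    # names of sends that were reposted
        self.completed = []   # (name, status) pairs reported to the client

    def queue(self, wr):
        # Place the send on the agent send list without posting it.
        self.send_list.append(wr)

    def post(self, wr):
        # Posting to the port takes a reference, held until completion.
        wr.refcount += 1

    def cancel_idle(self):
        # Deregistration-time sweep: remove only sends with no reference,
        # so a send with a completion still pending stays on the list.
        idle = [wr for wr in self.send_list if wr.refcount == 0]
        self.send_list = [wr for wr in self.send_list if wr.refcount > 0]
        return idle

    def complete_send(self, wr, status):
        # A completion is only ever reported for a send that holds a reference.
        assert wr.refcount >= 1
        wr.refcount -= 1
        if status is WcStatus.FLUSHED:
            # Flushed: the agent can simply repost the send.
            self.reposted.append(wr.name)
            self.post(wr)
            return
        # Success or fatal error: complete the MAD to the client.
        self.send_list.remove(wr)
        self.completed.append((wr.name, status))
```

In this model a send that has been posted survives cancel_idle() because its reference count is one, so the later completion always finds it on the send list. How a real agent should bound reposts after repeated flushes is left open here, as it is in the discussion above.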