At 07:46 AM 8/31/2005, James Lentini wrote:


On Tue, 30 Aug 2005, Roland Dreier wrote:

> I just committed this SRP fix, which should make sure we don't use a
> device after it's gone.  And it actually simplifies the code a teeny bit...

The device could still be used after it's gone. For example:

 - the user is configuring SRP via sysfs. The thread in
   srp_create_target() has just called ib_sa_path_rec_get()
   [srp.c line 1209] and is waiting for the path
   record query to complete in wait_for_completion()
 - the SA callback, srp_path_rec_completion(), is called. This
   callback thread will make several verb calls (ib_create_cq,
   ib_req_notify_cq, ib_create_qp, ...) without any coordination with
   the hotplug device removal callback, srp_remove_one().

Notice that if the SA client's hotplug removal function,
ib_sa_remove_one(), ensured that all callbacks had completed before
returning, the problem would be fixed. This would protect all ULPs from
having to deal with hotplug races in their SA callback function. The
fix belongs in the SA client (the core stack), not in SRP.
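
To illustrate the idea of a removal function that does not return until every callback has finished, here is a minimal userspace sketch using pthreads. It is not the SA client code and not a kernel API: struct sa_client_dev, sa_callback_enter(), sa_callback_exit() and sa_dev_remove() are hypothetical names invented for this example, and the real fix would presumably use kernel primitives instead of pthreads.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-device SA client state. */
struct sa_client_dev {
    pthread_mutex_t lock;
    pthread_cond_t  idle;      /* signalled when in_flight drops to zero */
    int             in_flight; /* consumer callbacks currently running */
    bool            removing;  /* set once removal has started */
};

void sa_dev_init(struct sa_client_dev *d)
{
    pthread_mutex_init(&d->lock, NULL);
    pthread_cond_init(&d->idle, NULL);
    d->in_flight = 0;
    d->removing = false;
}

/*
 * Called by the core before it invokes a ULP callback.
 * Returns false if removal has begun, in which case the
 * callback must not be run.
 */
bool sa_callback_enter(struct sa_client_dev *d)
{
    bool ok;

    pthread_mutex_lock(&d->lock);
    ok = !d->removing;
    if (ok)
        d->in_flight++;
    pthread_mutex_unlock(&d->lock);
    return ok;
}

/* Called by the core after the ULP callback returns. */
void sa_callback_exit(struct sa_client_dev *d)
{
    pthread_mutex_lock(&d->lock);
    assert(d->in_flight > 0);
    if (--d->in_flight == 0)
        pthread_cond_broadcast(&d->idle);
    pthread_mutex_unlock(&d->lock);
}

/*
 * Removal path: refuse new callbacks, then block until every
 * callback that already started has returned.
 */
void sa_dev_remove(struct sa_client_dev *d)
{
    pthread_mutex_lock(&d->lock);
    d->removing = true;
    while (d->in_flight > 0)
        pthread_cond_wait(&d->idle, &d->lock);
    pthread_mutex_unlock(&d->lock);
}
```

With this shape the ULP callback itself needs no hotplug logic: the core brackets each callback with enter/exit, and once sa_dev_remove() returns no callback for that device is running or can start.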

All the ULPs are deficient with respect to their hotplug
synchronization. Given that there is a common problem, doesn't it make
sense to try and solve it in a generic way instead of in each ULP?

There are two approaches to device removal to consider, and a credible solution requires both:

(1) Inform all entities that a planned device removal is to occur and allow them to close gracefully or migrate to alternatives.  Ideally, the OS comprehends whether the removal will result in the loss of any critical resources, and does not notify anyone or take action unless it knows the system can survive the removal.  Doing this requires each ULP to register its interest in a particular hardware resource with the OS.  This also allows the OS to construct a resource analysis tool that determines whether removing a device is a good idea.  This is really outside the scope of an RDMA infrastructure and should be done by the OS through an OS-defined API that is applicable to all types of hardware resources and subsystems.
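
As a rough sketch of what "register interest" plus a resource analysis check could look like, consider the following. Everything here is hypothetical: the interest_table type, register_interest(), unregister_interest() and removal_is_survivable() are names made up for illustration, not an existing OS API, and a real implementation would need locking and a richer notion of "critical".

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_INTERESTS 16

/* One ULP's declared dependency on a device (hypothetical). */
struct interest {
    int  device_id;
    bool critical;  /* true if the system cannot survive losing it */
    bool in_use;
};

struct interest_table {
    struct interest slots[MAX_INTERESTS];
};

/* Record that a ULP depends on device_id. Returns a handle, or -1 if full. */
int register_interest(struct interest_table *t, int device_id, bool critical)
{
    for (size_t i = 0; i < MAX_INTERESTS; i++) {
        if (!t->slots[i].in_use) {
            t->slots[i].device_id = device_id;
            t->slots[i].critical = critical;
            t->slots[i].in_use = true;
            return (int)i;
        }
    }
    return -1;
}

/* Drop a previously registered interest; bad handles are ignored. */
void unregister_interest(struct interest_table *t, int handle)
{
    if (handle >= 0 && handle < MAX_INTERESTS)
        t->slots[handle].in_use = false;
}

/*
 * The "resource analysis" step: a planned removal is considered
 * survivable only if no registered interest marks the device critical.
 */
bool removal_is_survivable(const struct interest_table *t, int device_id)
{
    for (size_t i = 0; i < MAX_INTERESTS; i++) {
        const struct interest *it = &t->slots[i];

        if (it->in_use && it->device_id == device_id && it->critical)
            return false;
    }
    return true;
}
```

The OS would consult removal_is_survivable() before notifying anyone of a planned removal, and only proceed to the notification phase when it returns true.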

(2) Design all ULPs to handle surprise removal, e.g. device failure, from the start and allow them to close gracefully or migrate to alternatives.  The OS would inform the device driver of the failure if the device driver has not already discovered the problem.  The OS would also inform interested parties of the device failure.  The device driver would simply error out all users of the device instance - there are already error codes defined for IB and iWARP for this purpose.  The associated verbs resources should be released as the ULP closes out its resources through the verbs API (we did define the verbs to clean up resources that the infrastructure may allocate on behalf of the ULP).  Resources such as listen entries would be released just as is done for Sockets, etc. today.
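
A small sketch of the surprise-removal flow described in (2): once the device is marked failed, every operation errors out, and the ULP still releases its resources through the normal destroy call. The fake_* types and functions are hypothetical stand-ins, not the verbs API, and the use of -ENODEV is an assumption made for this example; the actual IB and iWARP error codes are not reproduced here.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device instance tracked by the driver. */
struct fake_device {
    bool failed;          /* set on surprise removal or failure */
    int  open_resources;  /* resources still held by ULPs */
};

/* Hypothetical queue pair owned by a ULP. */
struct fake_qp {
    struct fake_device *dev;
    bool alive;
};

/* Driver or OS marks the device as gone. */
void fake_device_fail(struct fake_device *dev)
{
    dev->failed = true;
}

/* Creation is refused once the device has failed. */
int fake_create_qp(struct fake_device *dev, struct fake_qp *qp)
{
    if (dev->failed)
        return -ENODEV;
    qp->dev = dev;
    qp->alive = true;
    dev->open_resources++;
    return 0;
}

/* Any work request on a failed device is errored out. */
int fake_post_send(struct fake_qp *qp)
{
    if (!qp->alive)
        return -EINVAL;
    if (qp->dev->failed)
        return -ENODEV;
    return 0;
}

/*
 * Destroy must keep working after failure so the ULP can
 * close out its resources through the usual path.
 */
int fake_destroy_qp(struct fake_qp *qp)
{
    if (!qp->alive)
        return -EINVAL;
    qp->alive = false;
    qp->dev->open_resources--;
    return 0;
}
```

A ULP written this way only needs to treat the error from fake_post_send() like any other fatal connection error: stop issuing work, call fake_destroy_qp(), and then either give up or migrate to an alternative device.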


Device addition is simply a matter of informing the policy or service management component within the OS that determines what services should be available on a given device.  The device driver really does not need to do anything special.  One area to consider is whether a planned migration of a service needs to be supported.  This is generally best handled by the ULP, with only a small set of services required of the infrastructure, e.g. get / set of QP / LLP context, with the ULP then coordinating any other aspects with the appropriate SM or network services, such as updating address vectors or fabric management / configuration.

In general, ULPs should already be designed to handle the error condition; whether they support a managed / planned removal or migration is perhaps the only potential area of deficiency.

Mike
_______________________________________________
openib-general mailing list
openib-general@openib.org
http://openib.org/mailman/listinfo/openib-general
