> Userspace apps are supposed to release all ib device resources if
 > they receive a fatal async event (IBV_EVENT_DEVICE_FATAL).  However,
 > the app has no way of knowing when the device has come back up, except
 > to repeatedly attempt ibv_open_device() until it succeeds.
 > 
 > However, currently there is no protection against open succeeding when
 > the device is in the midst of the removal following the fatal event.
 > In this case, the open will succeed, but as a result the device waits
 > in the middle of its removal until the new app releases its ib resources
 >  -- and the new app will not do so, since the open succeeded at a point
 > following the fatal event generation.

Does mthca have this bug too?

 > --- a/include/linux/mlx4/device.h
 > +++ b/include/linux/mlx4/device.h
 > @@ -379,6 +379,7 @@ struct mlx4_dev {
 >      struct radix_tree_root  qp_table_tree;
 >      u32                     rev_id;
 >      char                    board_id[MLX4_BOARD_ID_LEN];
 > +    u32                     ib_active;
 >  };
 >  
 >  struct mlx4_init_port_param {

I don't much like this approach of putting an IB-driver-specific flag
into the generic mlx4 data structure.  Why can't the IB driver manage
this in its own data structure?

 > @@ -698,6 +713,8 @@ static void mlx4_ib_remove(struct mlx4_dev *dev, void 
 > *ibdev_ptr)
 >      struct mlx4_ib_dev *ibdev = ibdev_ptr;
 >      int p;
 >  
 > +    dev->ib_active = 0;
 > +
 >      mlx4_ib_mad_cleanup(ibdev);
 >      ib_unregister_device(&ibdev->ib_dev);

This just seems silly -- what is setting ib_active to 0 protecting
against?  The ib_unregister_device() call is going make sure we have no
clients left anyway, isn't it?

 - R.
_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general

To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general

Reply via email to