If oob:ud was disabled then there was no call to ibv_fork_init() anywhere
else, right? If so, then this is why the messages went away.

The calls to ibv_fork_init() through the opal common verbs code have been
pushed to master. One of the places where that call was added is oob:ud, but
if any memory-registration verb is called before that point, then the
ibv_fork_init() call in oob:ud will fail.
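
To illustrate the ordering problem, here is a minimal, self-contained C
sketch. It does not link against libibverbs: stub_fork_init() and
stub_reg_mr() are hypothetical stand-ins that only model the behavior
described in this thread (fork init fails once memory has been registered).
fork_test() and demo_order() are illustrative helpers, not the actual
opal_common_verbs_fork_test() code. The tri-state reading of
want_fork_support (negative: try but stay silent on failure, 0: skip,
positive: require) is my assumption about how the parameter is meant to
behave.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins for library state; this is NOT libibverbs. */
static bool memory_registered = false;
static bool fork_init_done = false;

/* Models ibv_fork_init() as described in this thread: returns 0 on success
 * and a nonzero errno-style value if memory was already registered. */
static int stub_fork_init(void)
{
    if (fork_init_done) {
        return 0;           /* already initialized: nothing to do */
    }
    if (memory_registered) {
        return EINVAL;      /* too late: a registration came first */
    }
    fork_init_done = true;
    return 0;
}

/* Models any memory-registering verb. */
static void stub_reg_mr(void)
{
    memory_registered = true;
}

/* Illustrative helper: returns 0 to continue, -1 to abort component init.
 * The meaning of want_fork_support is an assumption (see text above). */
static int fork_test(int want_fork_support)
{
    int rc;

    if (want_fork_support == 0) {
        return 0;           /* fork support not wanted: do not even try */
    }
    rc = stub_fork_init();
    if (rc == 0) {
        return 0;
    }
    if (want_fork_support > 0) {
        fprintf(stderr, "fork support requested but unavailable (rc=%d)\n", rc);
        return -1;
    }
    return 0;               /* negative: failure is tolerated silently */
}

/* Resets the model, optionally registers memory first, then runs the test. */
static int demo_order(bool register_first, int want_fork_support)
{
    int rc;

    memory_registered = false;
    fork_init_done = false;
    if (register_first) {
        stub_reg_mr();
    }
    rc = fork_test(want_fork_support);
    /* Invariant of the model: fork init never succeeds after registration. */
    assert(!(register_first && fork_init_done));
    return rc;
}
```

In this model, demo_order(false, 1) returns 0, while demo_order(true, 1)
prints the warning and returns -1. demo_order(true, -1) returns 0 without
printing anything, which is the effect a default of -1 is meant to have.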

On Thu, Mar 5, 2015 at 4:21 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com>
wrote:

> This is not a PSM issue -- I believe Paul said that when he disabled
> oob:ud, the messages went away.
>
> I'm sorry; I'm at the MPI Forum this week and not paying close attention
> to master commits.  Has your code that makes sure the opal common verbs
> ibv_fork_init() call happens been pushed to master yet?  If so, then
> ibv_fork_init() *should* be getting called first, and there's something
> else going on that needs to be understood.
>
>
>
> > On Mar 5, 2015, at 1:57 AM, Alina Sklarevich <ali...@dev.mellanox.co.il>
> wrote:
> >
> > Hi,
> >
> > I will change the default of the opal_common_verbs_want_fork_support to
> -1 in order to avoid these messages in case ibv_fork_init() fails.
> >
> > The reason why it is failing is that ibv_fork_init() is being called too
> late. To avoid this, every component should call ibv_fork_init() early in
> its init (in this case before oob/ud does) by calling the
> opal_common_verbs_fork_test() function, which does just that.
> >
> > Paul, can you please check if adding this call to psm fixes the issue?
> >
> > On Wed, Mar 4, 2015 at 11:40 PM, Dave Goodell (dgoodell) <
> dgood...@cisco.com> wrote:
> > On Mar 4, 2015, at 3:25 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
> >
> > > On Wed, Mar 4, 2015 at 1:04 PM, Dave Goodell (dgoodell) <
> dgood...@cisco.com> wrote:
> > > [...]
> > > > libibverbs: Warning: couldn't open config directory
> '/etc/libibverbs.d'.
> > > > libibverbs: Warning: no userspace device-specific driver found for
> /sys/class/infiniband_verbs/uverbs0
> > >
> > > I think that warning is printed by libibverbs itself.  Are you 100%
> sure there are no IB HCAs sitting in the head node?  If there are IB HCAs
> but you don't want them to be used, you might want to ensure that the
> various verbs kernel modules don't get loaded, which is one half of the
> mismatch which confuses libibverbs.
> > > [...]
> > >
> > > FWIW, I can confirm that these two lines are from libibverbs itself:
> > > $ strings /usr/lib64/libibverbs.a | grep -e 'no userspace' -e 'open
> config directory'
> > > libibverbs: Warning: no userspace device-specific driver found for %s
> > > libibverbs: Warning: couldn't open config directory '%s'.
> >
> > Yes, I think you'd also see the same message if you run "ibv_devices" or
> "ibv_devinfo" on the head node.
> >
> > > As it happens, the login node *does* have an HCA installed and the
> kernel modules appear to be loaded.  However, as the "17th node" in the
> cluster it was never cabled to the 16-port switch and the package(s) that
> should have created/populated /etc/libibverbs.d are *not* present
> (specifically the login node has libipathverbs-devel installed but not
> libipathverbs).
> > >
> > > So, Dave, are you saying that what I describe in the previous
> paragraph would be considered "misconfiguration"?  I am fine with dropping
> the discussion of those first two lines if there is agreement that Open MPI
> shouldn't be responsible for handling this case.
> >
> > I would consider that to be a lesser misconfiguration, which is only
> really an issue because of libibverbs deficiencies.  Either the hardware
> could be removed from the head node or the kernel modules could be unloaded
> / prevented from loading on the head node.
> >
> > > Now the ibv_fork_init() warnings are another issue entirely.  Since
> btl:verbs and mtl:psm both work (at least separately) perfectly fine on the
> compute nodes, I don't believe that there are any configuration issues
> there.
> >
> > Agreed, something needs to be improved there.  I assume that Mike D. or
> someone from his team will take a look.  I don't have any bandwidth to look
> at this myself.
> >
> > -Dave
> >
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> http://www.open-mpi.org/community/lists/devel/2015/03/17100.php
> >
>
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
