This is not a PSM issue -- I believe Paul said that when he disabled oob:ud, 
the messages went away.

I'm sorry; I'm at the MPI Forum this week and not paying close attention to 
master commits.  Has your code that ensures the opal common verbs
ibv_fork_init() stuff gets called been pushed to master yet?  If so, then
ibv_fork_init() *should* be getting called first, and there's something else
going on that needs to be understood.
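
To make the ordering point concrete, here is a rough sketch of the "call it once, before anything else touches verbs" pattern.  This is not the actual opal_common_verbs code: ensure_fork_init() and stub_fork_init() are hypothetical names used only for illustration, and the stub stands in for ibv_fork_init() so the snippet builds without libibverbs.  It assumes the documented ibv_fork_init() contract (returns 0 on success or an errno value on failure, and should be called before any other libibverbs function).

```c
#include <stdio.h>

/* Same shape as ibv_fork_init(): 0 on success, an errno value on failure. */
typedef int (*fork_init_fn)(void);

static int fork_init_done = 0;    /* has the init function been attempted? */
static int fork_init_result = 0;  /* cached return value of that attempt   */

/*
 * Hypothetical helper: every component calls this at the very top of its
 * own init, before it opens a device or registers memory.  The underlying
 * init function runs at most once; later callers get the cached result.
 * Not thread-safe; a real implementation would need a lock or a "once"
 * primitive.
 */
int ensure_fork_init(fork_init_fn init)
{
    if (!fork_init_done) {
        fork_init_result = init();
        fork_init_done = 1;
        if (fork_init_result != 0) {
            fprintf(stderr, "fork init failed (errno value %d)\n",
                    fork_init_result);
        }
    }
    return fork_init_result;
}

/* Stand-in for ibv_fork_init() that counts how often it is invoked. */
static int stub_calls = 0;

static int stub_fork_init(void)
{
    stub_calls++;
    return 0;
}
```

In a real build each component would pass ibv_fork_init (from <infiniband/verbs.h>) instead of the stub.  The point is only that whichever component initializes first, oob/ud or anything else, triggers the call before any verbs resources exist.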



> On Mar 5, 2015, at 1:57 AM, Alina Sklarevich <ali...@dev.mellanox.co.il> 
> wrote:
> 
> Hi,
> 
> I will change the default of the opal_common_verbs_want_fork_support to -1 in 
> order to avoid these messages in case ibv_fork_init() fails.
> 
> The reason why it is failing is that ibv_fork_init() is being called too late. 
> To avoid this, every component should call ibv_fork_init() early in the init 
> (in this case before oob/ud does) - call the opal_common_verbs_fork_test() 
> function which does just that.
> 
> Paul, can you please check if adding this call to psm fixes the issue?
> 
> On Wed, Mar 4, 2015 at 11:40 PM, Dave Goodell (dgoodell) <dgood...@cisco.com> 
> wrote:
> On Mar 4, 2015, at 3:25 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
> 
> > On Wed, Mar 4, 2015 at 1:04 PM, Dave Goodell (dgoodell) 
> > <dgood...@cisco.com> wrote:
> > [...]
> > > libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
> > > libibverbs: Warning: no userspace device-specific driver found for 
> > > /sys/class/infiniband_verbs/uverbs0
> >
> > I think that warning is printed by libibverbs itself.  Are you 100% sure 
> > there are no IB HCAs sitting in the head node?  If there are IB HCAs but 
> > you don't want them to be used, you might want to ensure that the various 
> > verbs kernel modules don't get loaded, which is one half of the mismatch 
> > which confuses libibverbs.
> > [...]
> >
> > FWIW, I can confirm that these two lines are from libibverbs itself:
> > > $ strings /usr/lib64/libibverbs.a | grep -e 'no userspace' -e 'open config directory'
> > libibverbs: Warning: no userspace device-specific driver found for %s
> > libibverbs: Warning: couldn't open config directory '%s'.
> 
> Yes, I think you'd also see the same message if you run "ibv_devices" or 
> "ibv_devinfo" on the head node.
> 
> > As it happens, the login node *does* have an HCA installed and the kernel 
> > modules appear to be loaded.  However, as the "17th node" in the cluster 
> > it was never cabled to the 16-port switch and the package(s) that should 
> > have created/populated /etc/libibverbs.d are *not* present (specifically 
> > the login node has libipathverbs-devel installed but not libipathverbs).
> >
> > So, Dave, are you saying that what I describe in the previous paragraph 
> > would be considered "misconfiguration"?  I am fine with dropping the 
> > discussion of those first two lines if there is agreement that Open MPI 
> > shouldn't be responsible for handling this case.
> 
> I would consider that to be a lesser misconfiguration, which is only really 
> an issue because of libibverbs deficiencies.  Either the hardware could be 
> removed from the head node or the kernel modules could be unloaded / 
> prevented from loading on the head node.
> 
> > Now the ibv_fork_init() warnings are another issue entirely.  Since 
> > btl:verbs and mtl:psm both work (at least separately) perfectly fine on the 
> > compute nodes, I don't believe that there are any configuration issues 
> > there.
> 
> Agreed, something needs to be improved there.  I assume that Mike D. or 
> someone from his team will take a look.  I don't have any bandwidth to look 
> at this myself.
> 
> -Dave
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/03/17100.php
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/03/17101.php


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
