If oob:ud was disabled then there was no call to ibv_fork_init() anywhere else, right? If so, then this is why the messages went away.
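To illustrate the ordering problem, here is a toy model. It is only a sketch and not the real libibverbs or Open MPI code: `ToyVerbs`, `fork_init`, `reg_mr` and `run` are hypothetical names. The `EINVAL` return value and the "second call after a successful one is harmless" behaviour are assumptions of the model. The only point is that fork protection has to be enabled before the first memory registration, otherwise the later call fails.

```python
import errno


class ToyVerbs:
    """Toy stand-in for the verbs library state (hypothetical, not libibverbs)."""

    def __init__(self):
        self.fork_protected = False  # set by a successful fork_init()
        self.mem_registered = False  # set once any memory is registered

    def fork_init(self):
        """Model of ibv_fork_init(): 0 on success, an errno value if too late."""
        if self.fork_protected:
            return 0  # assumption: repeating a successful call is harmless
        if self.mem_registered:
            return errno.EINVAL  # assumption: error code chosen for the model
        self.fork_protected = True
        return 0

    def reg_mr(self):
        """Model of any memory-registering verb."""
        self.mem_registered = True


def run(steps):
    """Apply the steps in order and collect the fork_init() return values."""
    verbs = ToyVerbs()
    results = []
    for step in steps:
        if step == "fork_init":
            results.append(verbs.fork_init())
        elif step == "reg_mr":
            verbs.reg_mr()
        else:
            raise ValueError(f"unknown step: {step}")
    return results


# fork_init first: the later call (think oob:ud) still succeeds.
early = run(["fork_init", "reg_mr", "fork_init"])
# Some component registers memory first: the only fork_init call fails.
late = run(["reg_mr", "fork_init"])
print("early:", early)
print("late: ", late)
```

In this model, the first ordering corresponds to every component calling the common fork test early in its init, and the second to a component registering memory before anyone has asked for fork support.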
The calls to ibv_fork_init() from the opal common verbs were pushed to master. One of the places where the call was added is oob:ud, but if any memory-registering verb is called before that point, the call in oob:ud will fail.

On Thu, Mar 5, 2015 at 4:21 PM, Jeff Squyres (jsquyres) <jsquy...@cisco.com> wrote:

> This is not a PSM issue -- I believe Paul said that when he disabled
> oob:ud, the messages went away.
>
> I'm sorry; I'm at the MPI Forum this week and not paying close attention
> to master commits. Has your code to ensure to call the opal common verbs
> ibv_fork_init() stuff been pushed to master yet? If so, then
> ibv_fork_init() *should* be getting called first, and there's something
> else going on that needs to be understood.
>
> On Mar 5, 2015, at 1:57 AM, Alina Sklarevich <ali...@dev.mellanox.co.il> wrote:
> >
> > Hi,
> >
> > I will change the default of the opal_common_verbs_want_fork_support to
> > -1 in order to avoid these messages in case ibv_fork_init() fails.
> >
> > The reason why it is failing is that ibv_fork_init() is being called too
> > late. To avoid this, every component should call ibv_fork_init() early in
> > the init (in this case before oob/ud does) - call the
> > opal_common_verbs_fork_test() function which does just that.
> >
> > Paul, can you please check if adding this call to psm fixes the issue?
> >
> > On Wed, Mar 4, 2015 at 11:40 PM, Dave Goodell (dgoodell) <dgood...@cisco.com> wrote:
> > On Mar 4, 2015, at 3:25 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
> >
> > > On Wed, Mar 4, 2015 at 1:04 PM, Dave Goodell (dgoodell) <dgood...@cisco.com> wrote:
> > > [...]
> > > > libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
> > > > libibverbs: Warning: no userspace device-specific driver found for /sys/class/infiniband_verbs/uverbs0
> > >
> > > I think that warning is printed by libibverbs itself. Are you 100%
> > > sure there are no IB HCAs sitting in the head node? If there are IB HCAs
> > > but you don't want them to be used, you might want to ensure that the
> > > various verbs kernel modules don't get loaded, which is one half of the
> > > mismatch which confuses libibverbs.
> > > [...]
> > >
> > > FWIW, I can confirm that these two lines are from libibverbs itself:
> > > $ strings /usr/lib64/libibverbs.a | grep -e 'no userspace' -e 'open config directory'
> > > libibverbs: Warning: no userspace device-specific driver found for %s
> > > libibverbs: Warning: couldn't open config directory '%s'.
> >
> > Yes, I think you'd also see the same message if you run "ibv_devices" or
> > "ibv_devinfo" on the head node.
> >
> > > As it happens, the login node *does* have an HCA installed and the
> > > kernel modules appear to be loaded. However, as the "17th node" in the
> > > cluster it was never cabled to the 16-port switch and the package(s) that
> > > should have created/populated /etc/libibverbs.d are *not* present
> > > (specifically the login node has libipathverbs-devel installed but not
> > > libipathverbs).
> > >
> > > So, Dave, are you saying that what I describe in the previous
> > > paragraph would be considered "misconfiguration"? I am fine with dropping
> > > the discussion of those first two lines if there is agreement that Open MPI
> > > shouldn't be responsible for handling this case.
> >
> > I would consider that to be a lesser misconfiguration, which is only
> > really an issue because of libibverbs deficiencies. Either the hardware
> > could be removed from the head node or the kernel modules could be unloaded
> > / prevented from loading on the head node.
> >
> > > Now the ibv_fork_init() warnings are another issue entirely. Since
> > > btl:verbs and mtl:psm both work (at least separately) perfectly fine on the
> > > compute nodes, I don't believe that there are any configuration issues
> > > there.
> >
> > Agreed, something needs to be improved there. I assume that Mike D. or
> > someone from his team will take a look. I don't have any bandwidth to look
> > at this myself.
> >
> > -Dave
> >
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post: http://www.open-mpi.org/community/lists/devel/2015/03/17100.php
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/