Apologies for the odd subject line, but I wasn't sure exactly what I'm asking here.

To wit: today I built Open MPI 5.0.0rc9 on our cluster, and I built it on a
compute node. All seemed to work okay, though I did get:

-- [borgh088:22776][:33:hmca_rcache_ucs_query]  UCS version mismatch. Libhcoll 
binary was compiled with UCS 1.8 while the runtime version of UCS is 1.10. UCS 
Rcache framework will be disabled. Performance of ZCOPY BCAST algorithm may be 
degraded. Add -x HCOLL_RCACHE=^ucs in order to suppress this message.

but for now I can turn that off with an environment variable.
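
For anyone else hitting this, that's just what the warning itself suggests; to
be explicit, I'm either exporting it in the job script or passing it through
mpirun (the application name here is just a placeholder):

$ export HCOLL_RCACHE=^ucs

or, per-run:

$ mpirun -x HCOLL_RCACHE=^ucs -np 4 ./my_app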

What I'm more concerned about is that, because I built it on a compute node 
where all the Mellanox libraries live, the executables now fail to find those 
libraries on the head nodes if I try to run, say, ompi_info:

$ ompi_info
ompi_info: error while loading shared libraries: libhcoll.so.1: cannot open 
shared object file: No such file or directory

(I get similar errors with ncdump, etc.)

Indeed, with ldd:

$ ldd /discover/swdev/gmao_SIteam/MPI/openmpi/5.0.0rc9/gcc-12.1.0/bin/ompi_info | grep found
     libhcoll.so.1 => not found
     libocoms.so.0 => not found
     libsharp_coll.so.5 => not found
     libsharp.so.6 => not found
     libhcoll.so.1 => not found
     libocoms.so.0 => not found
     libsharp_coll.so.5 => not found
     libsharp.so.6 => not found

Of course, these are all found on the compute node.

So, my question is two-fold.

1. Can I build Open MPI in such a way that things are a bit more portable 
on my system? I know I can pass `--with-hcoll=no`, but when I try that I get:

$ ldd /discover/swdev/gmao_SIteam/MPI/openmpi/5.0.0rc9/gcc-12.1.0-nohcoll/bin/ompi_info | grep found
     libhcoll.so.1 => not found
     libocoms.so.0 => not found
     libsharp_coll.so.5 => not found
     libsharp.so.6 => not found

so only half of the references went away, but the same four libraries are 
still being linked in? Is there a way to *really* not bring in hcoll?
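
For reference, the no-hcoll build was configured with essentially the same
options as the first build, just with the hcoll switch added (everything
else elided here):

$ ./configure --prefix=/discover/swdev/gmao_SIteam/MPI/openmpi/5.0.0rc9/gcc-12.1.0-nohcoll \
      --with-hcoll=no ...
$ make -j && make install

And I suppose I can narrow down where the remaining references come from by
running ldd against the library itself rather than the binary (on a compute
node, where everything resolves):

$ ldd /discover/swdev/gmao_SIteam/MPI/openmpi/5.0.0rc9/gcc-12.1.0-nohcoll/lib/libmpi.so | grep -iE 'hcoll|ocoms|sharp'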

2. What is the cost of doing so? I vaguely know that hcoll accelerates 
collectives like the MPI_All* calls, but is there more I'd be missing without it?
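
(If it helps to quantify that, I suppose I could compare the same application
with and without hcoll in the original build by disabling the component at
runtime rather than at configure time, e.g.:

$ mpirun --mca coll ^hcoll -np 128 ./my_benchmark

where my_benchmark is a placeholder for whatever collective-heavy code we
care about.)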

I'm mainly just trying to anticipate user support issues when someone is 
confused that ncdump doesn't work, etc.

Thanks,
Matt
--
Matt Thompson, SSAI, Ld Scientific Programmer/Analyst
NASA GSFC,    Global Modeling and Assimilation Office
Code 610.1,  8800 Greenbelt Rd,  Greenbelt,  MD 20771
Phone: 301-614-6712                 Fax: 301-614-6246
http://science.gsfc.nasa.gov/sed/bio/matthew.thompson
