Hi Pritchard,

thank you for replying. Nothing changed after adding the parameter you suggested. Could this be because I'm running v1.10.0rc7? It's a custom build, but we didn't modify any spml- or sm-related code.
2016-11-15 14:12 GMT+01:00 Pritchard Jr., Howard <[email protected]>:

> Hi Gianmario,
>
> Probably something went wrong at the spml layer.
> Could you also add --mca spml_base_verbose 10
> to the job launch line?
>
> Howard
>
> --
> Howard Pritchard
> HPC-DES
> Los Alamos National Laboratory
>
>
> From: devel <[email protected]> on behalf of Gianmario
> Pozzi <[email protected]>
> Reply-To: Open MPI Developers <[email protected]>
> Date: Tuesday, November 15, 2016 at 5:32 AM
> To: "[email protected]" <[email protected]>
> Subject: [OMPI devel] Failure while loading shmem module
>
> Hi everybody,
>
> I'm trying to run a sample program on two 16-core machines connected with
> IB (command: mpirun -np 20 -host *localhost*,*remotehost* --mca
> shmem_base_verbose 10 --mca btl self,sm,openib test).
>
> This command fails saying:
>
> [cn18:72296] mca: base: components_register: registering shmem components
> [cn18:72296] mca: base: components_open: opening shmem components
> [cn18:72296] shmem: base: runtime_query: Auto-selecting shmem components
> [cn18:72296] shmem: base: runtime_query: (shmem) No component selected!
> --------------------------------------------------------------------------
> It looks like opal_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during opal_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
> opal_shmem_base_select failed
> --> Returned value -1 instead of OPAL_SUCCESS
> --------------------------------------------------------------------------
>
> I dove into the code and found out that the loop contained in that
> function is never entered, which apparently means that no suitable
> component has even been found.
>
> Please note that a sample Hello world application using shared memory
> runs perfectly. Excluding sm from the command line doesn't solve the problem.
>
> Any hint? Has any of you ever experienced something similar?
>
> Thank you.
> --
> *Gianmario Pozzi*
> *M.Sc. @ Politecnico di Milano*
>
>
> _______________________________________________
> devel mailing list
> [email protected]
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>

--
*Gianmario Pozzi*
*M.Sc. @ Politecnico di Milano*
