Hi Pritchard, thank you for replying. Nothing changed after adding the parameter you suggested. Could this be because I'm running v1.10.0rc7? It's a custom version, but we didn't modify any spml- or sm-related code.
2016-11-15 14:12 GMT+01:00 Pritchard Jr., Howard <howa...@lanl.gov>:

> Hi Gianmario,
>
> Probably something went wrong at the spml layer.
> Could you also add --mca spml_base_verbose 10
> to the job launch line?
>
> Howard
>
> --
> Howard Pritchard
> HPC-DES
> Los Alamos National Laboratory
>
>
> From: devel <devel-boun...@lists.open-mpi.org> on behalf of Gianmario
> Pozzi <pozzigma...@gmail.com>
> Reply-To: Open MPI Developers <devel@lists.open-mpi.org>
> Date: Tuesday, November 15, 2016 at 5:32 AM
> To: "devel@lists.open-mpi.org" <devel@lists.open-mpi.org>
> Subject: [OMPI devel] Failure while loading shmem module
>
> Hi everybody,
>
> I'm trying to run a sample program on two 16-core machines connected with
> IB (command: mpirun -np 20 -host *localhost*,*remotehost* --mca
> shmem_base_verbose 10 --mca btl self,sm,openib test).
>
> This command fails saying:
>
> [cn18:72296] mca: base: components_register: registering shmem components
> [cn18:72296] mca: base: components_open: opening shmem components
> [cn18:72296] shmem: base: runtime_query: Auto-selecting shmem components
> [cn18:72296] shmem: base: runtime_query: (shmem) No component selected!
> --------------------------------------------------------------------------
> It looks like opal_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during opal_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   opal_shmem_base_select failed
>   --> Returned value -1 instead of OPAL_SUCCESS
> --------------------------------------------------------------------------
>
> I dove into the code and found out that the loop contained in that
> function is never entered, which apparently means that no suitable
> component has even been found.
>
> Please note that a sample Hello world application using shared memory
> runs perfectly. Excluding sm from the command line doesn't solve the problem.
>
> Any hint? Did any of y'all ever experience something similar?
>
> Thank you.
> --
> *Gianmario Pozzi*
> *M.Sc. @ Politecnico di Milano*
>
>
> _______________________________________________
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>

--
*Gianmario Pozzi*
*M.Sc. @ Politecnico di Milano*