Vincent,

Can you try a small program such as examples/ring_c.c?
Does your app use MPI_Comm_spawn and friends?
Can you post your mpirun command line?
Are you using a batch manager?

This error message is typical of unresolved libraries (e.g. "ssh host ldd orted" fails to resolve some libraries because LD_LIBRARY_PATH is not propagated to the remote node). We usually recommend configuring Open MPI with --enable-mpirun-prefix-by-default.

That being said, this does not match your report that the app ran for about 5 minutes before failing.

Cheers,

Gilles

On Thursday, April 13, 2017, Vincent Drach <vincent.dr...@plymouth.ac.uk> wrote:

> Dear mailing list,
>
> We are experiencing run-time failures on a small cluster with
> openmpi-2.0.2, built with gcc 6.3 and gcc 5.4.
> The job starts normally and a lot of communication is performed. After
> 5-10 minutes the connection to the hosts is closed and
> the following error message is reported:
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
>
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
>
> * the inability to write startup files into /tmp
>   (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to
>   use.
>
> * compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
>
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
>
> The issue does not seem to be due to the InfiniBand configuration, because
> the job also crashes when using the TCP protocol.
>
> Do you have any clue as to what the issue could be?
>
> Thanks a lot,
>
> Vincent
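
For reference, the "ssh host ldd orted" check mentioned in the reply can be sketched as a small shell helper. This is only an illustration under a few assumptions: the function names unresolved_libs and check_node and the host name node01 are made up for the example, and the remote login shell is assumed to be POSIX-compatible. It relies on ldd printing "=> not found" for each dependency it cannot resolve.

```shell
#!/bin/sh
# Sketch: list shared libraries that ldd could not resolve.
# Reads ldd output on stdin and prints the name of each unresolved entry.
unresolved_libs() {
    # ldd reports a missing dependency as "libfoo.so.N => not found"
    awk '/=> not found/ { print $1 }'
}

# Run the check on a remote node (hypothetical helper).
# The remote command is single-quoted so that it is evaluated on the node,
# in the same non-interactive ssh environment that mpirun relies on.
# If orted itself is not in the remote PATH, "orted" is reported as missing.
check_node() {
    ssh "$1" 'p=$(command -v orted) || { echo "orted => not found in PATH"; exit 0; }
              ldd "$p"' | unresolved_libs
}

# Example (node01 is a placeholder host name):
#   check_node node01
```

Empty output suggests that orted and its libraries resolve on that node over a non-interactive ssh session. Any name printed points to a PATH or LD_LIBRARY_PATH setting that is not being propagated, which is the situation --enable-mpirun-prefix-by-default is meant to address.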