Vincent,

Can you try a small program such as examples/ring_c.c?
Does your app use MPI_Comm_spawn and friends?
Can you post your mpirun command line? Are you using a batch manager?

This error message is typical of unresolved libraries
(e.g. "ssh host ldd orted" fails to resolve some libraries because
LD_LIBRARY_PATH is not propagated).
We usually recommend configuring with --enable-mpirun-prefix-by-default.
That said, this does not match your report that the app ran for about
5 minutes before failing.
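
As a rough sketch of the ldd check mentioned above, something like the
following lists the libraries that ldd cannot resolve. The helper name
unresolved_libs and the host name node01 are made up for this example, and
it assumes ldd reports missing libraries in the usual "name => not found"
form:

```shell
#!/bin/sh
# Sketch only: read ldd output on stdin and print the name of every
# library that ldd reported as unresolved.
# unresolved_libs is a hypothetical helper name, not part of Open MPI.
unresolved_libs() {
    # ldd prints lines such as "libfoo.so.1 => not found"
    awk '/=> not found/ { print $1 }'
}

# Example use (node01 is a placeholder host name). The remote,
# non-interactive shell must be able to find orted on its PATH;
# if it cannot, that is itself a hint that the environment is not set up:
#   ssh node01 'ldd "$(command -v orted)"' | unresolved_libs
```

An empty result means ldd resolved everything in the environment that ssh
provides, which is the environment the remote daemon is started in.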

Cheers,

Gilles

On Thursday, April 13, 2017, Vincent Drach <vincent.dr...@plymouth.ac.uk>
wrote:

>
> Dear mailing list,
>
> We are experiencing run-time failures on a small cluster with
> openmpi-2.0.2 and gcc 6.3 and gcc 5.4.
> The job starts normally and a lot of communication is performed. After
> 5-10 minutes the connection to the hosts is closed and
> the following error message is reported:
> --------------------------------------------------------------------------
> ORTE was unable to reliably start one or more daemons.
> This usually is caused by:
>
> * not finding the required libraries and/or binaries on
>   one or more nodes. Please check your PATH and LD_LIBRARY_PATH
>   settings, or configure OMPI with --enable-orterun-prefix-by-default
>
> * lack of authority to execute on one or more specified nodes.
>   Please verify your allocation and authorities.
>
> * the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
>   Please check with your sys admin to determine the correct location to use.
>
> *  compilation of the orted with dynamic libraries when static are required
>   (e.g., on Cray). Please check your configure cmd line and consider using
>   one of the contrib/platform definitions for your system type.
>
> * an inability to create a connection back to mpirun due to a
>   lack of common network interfaces and/or no route found between
>   them. Please check network connectivity (including firewalls
>   and network routing requirements).
>
>
>
> The issue does not seem to be due to the InfiniBand configuration, because
> the job also crashes when using the TCP protocol.
>
> Do you have any idea what the issue could be?
>
>
> Thanks a lot,
>
> Vincent
