Can you run anything under TM? Try running "hostname" directly from Torque to 
see if anything works at all.

The error message is telling you that the Torque daemon on the remote node 
reported a failure when trying to launch the OMPI daemon. It could be that 
Torque isn't set up to forward environments, so the OMPI daemon isn't finding 
required libs. You could run "printenv" directly under Torque to see how your 
remote environment is being set up.
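
If the full "printenv" output is too noisy, a small script can narrow it down to the variables that usually matter for finding libs. This is only a rough sketch: the list of variable names and the function names are my own assumptions, not anything OMPI or Torque defines. Run it on the remote node the same way you would run "printenv".

```python
import os

# Variables that typically affect whether a remotely launched daemon can find
# its executables and shared libraries. This list is an assumption; adjust it.
KEYS = ("PATH", "LD_LIBRARY_PATH", "TMPDIR", "HOME")


def summarize_env(env=None, keys=KEYS):
    """Return {name: value or None} for the selected environment variables."""
    env = os.environ if env is None else env
    return {key: env.get(key) for key in keys}


def format_summary(summary):
    """Render the summary one variable per line, flagging unset ones."""
    lines = []
    for key, value in summary.items():
        if value is None:
            lines.append(f"{key} is NOT SET")
        else:
            lines.append(f"{key}={value}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_summary(summarize_env()))
```

Comparing this output from an interactive login with the output from inside a Torque job should show whether something like LD_LIBRARY_PATH is being dropped along the way.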

It could also be that the tmp dir lacks the permissions a user needs to create 
the required directories. The OMPI daemon tries to create a session directory 
in the tmp dir, so failure to do so would indeed cause the launch to fail. You 
can specify the tmp dir with a cmd line option to mpirun. See "mpirun -h" for 
info.
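
One way to test the tmp dir theory is to try creating a nested directory there as the job's user, on the remote node. The sketch below does that and also reports free space. The function name and the nested layout are made up for illustration; they are not the actual session directory layout the OMPI daemon uses.

```python
import os
import shutil
import tempfile


def check_tmp_dir(tmp_root=None):
    """Try to create, then remove, a nested directory under tmp_root.

    Returns (ok, detail). The nested layout is only illustrative; it is not
    the exact session directory naming that the OMPI daemon uses.
    """
    tmp_root = tmp_root or tempfile.gettempdir()
    try:
        top = tempfile.mkdtemp(prefix="sessiondir-test-", dir=tmp_root)
    except OSError as exc:
        return False, f"cannot create a directory in {tmp_root}: {exc}"
    try:
        # A couple of nested levels, in case only the top level is creatable.
        os.makedirs(os.path.join(top, "0", "1"))
        free_mb = shutil.disk_usage(tmp_root).free // (1024 * 1024)
        return True, f"{tmp_root} is writable, {free_mb} MB free"
    except OSError as exc:
        return False, f"cannot create nested directories in {top}: {exc}"
    finally:
        # Clean up whatever was created, so the test leaves nothing behind.
        shutil.rmtree(top, ignore_errors=True)


if __name__ == "__main__":
    ok, detail = check_tmp_dir()
    print(("OK: " if ok else "FAILED: ") + detail)
```

If this fails under Torque but works from an rsh/ssh login, the tmp dir is a likely culprit.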


On Mar 21, 2011, at 12:21 AM, Randall Svancara wrote:

> I have a question about using OpenMPI and Torque on stateless nodes.
> I have compiled openmpi 1.4.3 with --with-tm=/usr/local
> --without-slurm using intel compiler version 11.1.075.
> 
> When I run a simple "hello world" mpi program, I am receiving the
> following error.
> 
> [node164:11193] plm:tm: failed to poll for a spawned daemon, return
> status = 17002
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
> launch so we are aborting.
> 
> There may be more information reported by the environment (see above).
> 
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpiexec noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpiexec was unable to cleanly terminate the daemons on the nodes shown
> below. Additional manual cleanup may be required - please refer to
> the "orte-clean" tool for assistance.
> --------------------------------------------------------------------------
>         node163 - daemon did not report back when launched
>         node159 - daemon did not report back when launched
>         node158 - daemon did not report back when launched
>         node157 - daemon did not report back when launched
>         node156 - daemon did not report back when launched
>         node155 - daemon did not report back when launched
>         node154 - daemon did not report back when launched
>         node152 - daemon did not report back when launched
>         node151 - daemon did not report back when launched
>         node150 - daemon did not report back when launched
>         node149 - daemon did not report back when launched
> 
> 
> But if I include:
> 
> -mca plm rsh
> 
> The job runs just fine.
> 
> I am not sure what the problem is with torque or openmpi that prevents
> the process from launching on remote nodes.  I have posted to the
> torque list and someone suggested that it may be temporary directory
> space that can be causing issues.  I have 100MB allocated to /tmp
> 
> Any ideas as to why I am having this problem would be appreciated.
> 
> 
> -- 
> Randall Svancara
> http://knowyourlinux.com/
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users

