When I set OPAL_OUTPUT_STDERR_FD=0, I receive a whole bunch of "mca_oob_tcp_message_recv_complete: invalid message type" errors, and the job just hangs even though all the nodes have fired off the MPI application.
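For reference, the variable can be exported to every node through mpirun's -x option, roughly like this (the process count and executable name are just placeholders for my actual test case):

    # export only for this launch; 0 redirects the error output to stdout
    mpirun -x OPAL_OUTPUT_STDERR_FD=0 -np 16 ./mpi_hello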
--
Bharath

On Thu, Feb 14, 2013 at 09:51:50AM -0800, Ralph Castain wrote:
> I don't think this is documented anywhere, but it is an available trick (not
> sure if it is in 1.6.1, but might be): if you set OPAL_OUTPUT_STDERR_FD=N in
> your environment, we will direct all our error outputs to that file
> descriptor. If it is "0", then it goes to stdout.
>
> Might be worth a try?
>
>
> On Feb 14, 2013, at 8:38 AM, Bharath Ramesh <bram...@vt.edu> wrote:
>
> > Is there any way to prevent the output of more than one node being
> > written to the same line? I tried setting --output-filename, which
> > didn't help; for some reason only stdout was written to the files,
> > making a nearly 6M output file a little hard to read.
> >
> > --
> > Bharath
> >
> > On Thu, Feb 14, 2013 at 07:35:02AM -0800, Ralph Castain wrote:
> >> Sounds like the orteds aren't reporting back to mpirun after launch. The
> >> MPI_proctable observation just means that the procs didn't launch in those
> >> cases where it is absent, which is something you already observed.
> >>
> >> Set "-mca plm_base_verbose 5" on your cmd line. You should see each orted
> >> report back to mpirun after it launches. If not, then it is likely that
> >> something is blocking it.
> >>
> >> You could also try updating to 1.6.3/4 in case there is some race
> >> condition in 1.6.1, though we haven't heard of it to date.
> >>
> >>
> >> On Feb 14, 2013, at 7:21 AM, Bharath Ramesh <bram...@vt.edu> wrote:
> >>
> >>> On our cluster we are noticing intermittent job launch failures when
> >>> using Open MPI. We are currently running OpenMPI-1.6.1, integrated with
> >>> Torque-4.1.3. It fails even for a simple MPI hello world application.
> >>> orted gets launched on all the nodes, but a number of nodes never launch
> >>> the actual MPI application. No errors are reported when the job gets
> >>> killed because the walltime expires, and enabling --debug-daemons doesn't
> >>> show any errors either. The only difference is that successful runs have
> >>> MPI_proctable listed, while for failures it is absent. Any help in
> >>> debugging this issue is greatly appreciated.
> >>>
> >>> --
> >>> Bharath
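Putting the suggestions above together, a diagnostic launch would look roughly like this (the process count, output prefix, and executable name are placeholders; under Torque the node list comes from the job's allocation, so no hostfile is given):

    mpirun -np 16 \
        -mca plm_base_verbose 5 \
        --debug-daemons \
        --output-filename /tmp/hello.out \
        ./mpi_hello

With plm_base_verbose at 5, each orted should be seen reporting back to mpirun shortly after launch, so a node whose daemon never reports in should stand out in the log.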