Ralph,

At least part of the problem is to do with error reporting: orte-ps is hitting the error case for a stale HNP at around line 258 and is trying to report the error via orte_show_help(), however that function makes an RPC into orterun, which is silently ignoring it for some reason.
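In case it's useful while that is being tracked down, something along these lines would at least make the message visible locally. This is only a rough sketch: the help file, topic name and function below are invented for illustration and are not the real orte-ps error path.

#include "orte/util/show_help.h"   /* orte_show_help() */
#include "opal/util/output.h"      /* opal_output() */

/* Sketch only: "help-orte-ps.txt", "stale-hnp" and this function are
 * made-up names, not the actual orte-ps code. */
static void report_stale_hnp(const char *hnp_uri)
{
    /* Relays the message to the HNP for printing - this is the path that
     * currently gets dropped silently. */
    orte_show_help("help-orte-ps.txt", "stale-hnp", true, hnp_uri);

    /* Also print locally so the user sees something even if the relayed
     * copy never appears. */
    opal_output(0, "orte-ps: the HNP contact information appears to be stale (%s)",
                hnp_uri);
}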
The failure itself seems to come from a timeout in comm.c:1114, where the client process isn't waiting long enough for orterun to reply and is returning ORTE_ERR_SILENT instead. I can't think of anything to suggest here other than increasing the timeout, or perhaps retrying the wait a few times; there's a rough sketch of what I mean at the very bottom of this mail.

Ashley,

On Mon, 2009-05-18 at 17:06 +0100, Ashley Pittman wrote:
> It's certainly helped and now runs for me, however if I run mpirun under
> valgrind and then ompi-ps in another window, valgrind reports errors and
> ompi-ps doesn't list the job, so there is clearly something still amiss.
> I'm trying to do some more diagnostics now.
>
> ==32362== Syscall param writev(vector[...]) points to uninitialised byte(s)
> ==32362==    at 0x41BF10C: writev (writev.c:46)
> ==32362==    by 0x4EAAD52: mca_oob_tcp_msg_send_handler (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/openmpi/mca_oob_tcp.so)
> ==32362==    by 0x4EAC505: mca_oob_tcp_peer_send (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/openmpi/mca_oob_tcp.so)
> ==32362==    by 0x4EAEF89: mca_oob_tcp_send_nb (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/openmpi/mca_oob_tcp.so)
> ==32362==    by 0x4EA20BE: orte_rml_oob_send (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/openmpi/mca_rml_oob.so)
> ==32362==    by 0x4EA2359: orte_rml_oob_send_buffer (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/openmpi/mca_rml_oob.so)
> ==32362==    by 0x4050738: process_commands (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-rte.so.0.0.0)
> ==32362==    by 0x405108C: orte_daemon_cmd_processor (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-rte.so.0.0.0)
> ==32362==    by 0x4260B57: opal_event_base_loop (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
> ==32362==    by 0x4260DF6: opal_event_loop (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
> ==32362==    by 0x4260E1D: opal_event_dispatch (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
> ==32362==    by 0x804B15F: orterun (orterun.c:757)
> ==32362==  Address 0x448507c is 20 bytes inside a block of size 512 alloc'd
> ==32362==    at 0x402613C: realloc (vg_replace_malloc.c:429)
> ==32362==    by 0x42556B7: opal_dss_buffer_extend (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
> ==32362==    by 0x4256C4F: opal_dss_pack_int32 (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
> ==32362==    by 0x42565C9: opal_dss_pack_buffer (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
> ==32362==    by 0x403A60D: orte_dt_pack_job (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-rte.so.0.0.0)
> ==32362==    by 0x42565C9: opal_dss_pack_buffer (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
> ==32362==    by 0x4256FFB: opal_dss_pack (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
> ==32362==    by 0x40506F7: process_commands (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-rte.so.0.0.0)
> ==32362==    by 0x405108C: orte_daemon_cmd_processor (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-rte.so.0.0.0)
> ==32362==    by 0x4260B57: opal_event_base_loop (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
> ==32362==    by 0x4260DF6: opal_event_loop (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
> ==32362==    by 0x4260E1D: opal_event_dispatch (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
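The uninitialised bytes flagged above look like the usual symptom of packing a field that was never set, which ties in with the num_nodes init mentioned further down. Here is a tiny standalone illustration that produces the same class of valgrind warning; nothing in it is Open MPI code, the struct and names are invented.

#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Invented stand-in for the job data that gets packed and sent. */
struct job_info {
    int32_t num_nodes;
    int32_t num_procs;
};

int main(void)
{
    uint8_t buf[sizeof(struct job_info)];

    struct job_info job;    /* deliberate bug: num_nodes is never set */
    job.num_procs = 3;

    /* "Packing" here is just a byte-wise copy into the send buffer, which
     * is what the opal_dss pack routines boil down to. */
    memcpy(buf, &job, sizeof job);

    /* Run this under valgrind and the write() below is flagged with
     * "points to uninitialised byte(s)", just like the writev() in the
     * trace above.  Zeroing the struct (memset) or initialising num_nodes
     * before the copy makes the warning go away. */
    (void)write(STDOUT_FILENO, buf, sizeof buf);
    return 0;
}

Presumably the r21249 change quoted below does the equivalent for num_nodes.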
>
> On Mon, 2009-05-18 at 08:22 -0600, Ralph Castain wrote:
> > Aha! Thanks for spotting the problem - I had to move that var init to
> > cover all cases, but it should be working now with r21249
> >
> > On May 18, 2009, at 8:08 AM, Ashley Pittman wrote:
> >
> > > Ralph,
> > >
> > > This patch fixed it, num_nodes was being used uninitialised and hence
> > > the client was getting a bogus value for the number of nodes.
> > >
> > > Ashley,
> > >
> > > On Mon, 2009-05-18 at 10:09 +0100, Ashley Pittman wrote:
> > >> No joy I'm afraid, now I get errors when I run it. This is a single
> > >> node job run with the command line "mpirun -n 3 ./a.out". I've
> > >> attached the strace output and gzipped /tmp files from the machine.
> > >> Valgrind on the ompi-ps process doesn't show anything interesting.
> > >>
> > >> [alpha:29942] [[35044,0],0] ORTE_ERROR_LOG: Data unpack would read past
> > >> end of buffer in file
> > >> /mnt/home/debian/ashley/code/OpenMPI/ompi-trunk-tes/trunk/orte/util/comm/comm.c
> > >> at line 242
> > >> [alpha:29942] [[35044,0],0] ORTE_ERROR_LOG: Data unpack would read past
> > >> end of buffer in file
> > >> /mnt/home/debian/ashley/code/OpenMPI/ompi-trunk-tes/trunk/orte/tools/orte-ps/orte-ps.c
> > >> at line 818
> > >>
> > >> Ashley.
> > >>
> > >> On Sat, 2009-05-16 at 08:15 -0600, Ralph Castain wrote:
> > >>> This is fixed now, Ashley - sorry for the problem.
> > >>>
> > >>> On May 15, 2009, at 4:47 AM, Ashley Pittman wrote:
> > >>>
> > >>>> On Thu, 2009-05-14 at 22:49 -0600, Ralph Castain wrote:
> > >>>>> It is definitely broken at the moment, Ashley. I have it pretty
> > >>>>> well fixed, but need/want to clean up some corner cases that have
> > >>>>> plagued us for a long time.
> > >>>>>
> > >>>>> Should have it for you sometime Friday.
> > >>>>
> > >>>> Ok, thanks. I might try switching to slurm in the meantime, I know
> > >>>> my code works with that.
> > >>>>
> > >>>> Can you let me know when it's fixed on or off list and I'll do an
> > >>>> update.
> > >>>>
> > >>>> Ashley,
> > >
> > > <ompi-ps.patch>
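As mentioned near the top, rather than a single longer timeout the wait for the reply could be retried a few times. The sketch below is standalone and entirely invented - wait_for_reply(), the constants and the polling bear no relation to the actual comm.c code - it just illustrates the shape of the idea.

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

/* Several bounded waits instead of one short one, so a slow orterun gets
 * roughly ten seconds in total to answer rather than failing after the
 * first two. */
#define REPLY_TIMEOUT_SECS 2
#define MAX_ATTEMPTS       5

/* Stand-in for "has the reply arrived yet?" - in the real code this would
 * be a flag set by the receive callback. */
static bool reply_arrived(void)
{
    return false;
}

/* Poll for up to 'secs' seconds for the reply flag. */
static bool wait_for_reply(int secs)
{
    time_t deadline = time(NULL) + secs;
    while (time(NULL) < deadline) {
        if (reply_arrived()) {
            return true;
        }
        struct timespec ts = { 0, 100 * 1000 * 1000 }; /* 100ms */
        nanosleep(&ts, NULL);
    }
    return false;
}

int main(void)
{
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
        if (wait_for_reply(REPLY_TIMEOUT_SECS)) {
            printf("got the reply on attempt %d\n", attempt);
            return 0;
        }
        fprintf(stderr, "no reply within %ds, retrying (%d/%d)\n",
                REPLY_TIMEOUT_SECS, attempt, MAX_ATTEMPTS);
    }
    fprintf(stderr, "giving up - this is where ORTE_ERR_SILENT would be returned\n");
    return 1;
}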