Ralph,

At least part of the problem is to do with error reporting: orte-ps is
hitting the error case for a stale HNP at around line 258 and tries to
report the error via orte_show_help(), but that function makes an RPC
into the orted-run, which is silently ignoring it for some reason.
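
For what it's worth, a purely local fallback message would at least make
the failure visible even when the relayed show_help is dropped.  A rough
sketch of what I mean (the helper, help file and topic names are made up
for illustration; this is not the actual orte-ps code):

#include "orte/constants.h"
#include "opal/util/output.h"
#include "orte/util/show_help.h"

/* Hypothetical helper, not the real orte-ps code: report a stale HNP
 * without relying solely on the relayed show_help message arriving. */
static void report_stale_hnp(const char *hnp_uri)
{
    /* "help-orte-ps.txt" and "stale-hnp" are illustrative names only. */
    orte_show_help("help-orte-ps.txt", "stale-hnp", true, hnp_uri);

    /* Also print locally, so the user sees something even if the relay
     * to the orted-run is silently dropped. */
    opal_output(0, "orte-ps: HNP at %s appears to be stale", hnp_uri);
}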

The failure itself seems to come from a timeout in comm.c:1114, where the
client process isn't waiting long enough for the orted-run to reply and
returns ORTE_ERR_SILENT instead.  I can't think of anything to suggest
here other than increasing the timeout?
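
To illustrate the failure mode I'm describing (this is a boiled-down
sketch, not the real comm.c code; reply_fd and timeout_ms are
placeholders):

#include <poll.h>
#include <errno.h>
#include "orte/constants.h"

/* Simplified illustration only: wait up to timeout_ms for the orted-run
 * to reply on reply_fd, giving up with ORTE_ERR_SILENT when the timer
 * expires.  Both arguments are placeholders. */
static int wait_for_reply(int reply_fd, int timeout_ms)
{
    struct pollfd pfd = { .fd = reply_fd, .events = POLLIN };
    int rc;

    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && EINTR == errno);

    if (0 == rc) {
        /* Timed out: this is the path that surfaces as ORTE_ERR_SILENT.
         * Raising timeout_ms would give a slow orted-run more time. */
        return ORTE_ERR_SILENT;
    }

    return (rc > 0) ? ORTE_SUCCESS : ORTE_ERR_SILENT;
}

Making the timeout an MCA parameter rather than a hard-coded value would
also let people bump it on slow or heavily loaded systems without a
rebuild.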

Ashley,

On Mon, 2009-05-18 at 17:06 +0100, Ashley Pittman wrote:
> It's certainly helped and now runs for me; however, if I run mpirun under
> valgrind and then ompi-ps in another window, Valgrind reports errors and
> ompi-ps doesn't list the job, so there is clearly something still amiss.
> I'm trying to do some more diagnostics now.
> 
> ==32362== Syscall param writev(vector[...]) points to uninitialised
> byte(s)
> ==32362==    at 0x41BF10C: writev (writev.c:46)
> ==32362==    by 0x4EAAD52: mca_oob_tcp_msg_send_handler
> (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/openmpi/mca_oob_tcp.so)
> ==32362==    by 0x4EAC505: mca_oob_tcp_peer_send
> (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/openmpi/mca_oob_tcp.so)
> ==32362==    by 0x4EAEF89: mca_oob_tcp_send_nb
> (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/openmpi/mca_oob_tcp.so)
> ==32362==    by 0x4EA20BE: orte_rml_oob_send
> (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/openmpi/mca_rml_oob.so)
> ==32362==    by 0x4EA2359: orte_rml_oob_send_buffer
> (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/openmpi/mca_rml_oob.so)
> ==32362==    by 0x4050738: process_commands
> (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-rte.so.0.0.0)
> ==32362==    by 0x405108C: orte_daemon_cmd_processor
> (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-rte.so.0.0.0)
> ==32362==    by 0x4260B57: opal_event_base_loop
> (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
> ==32362==    by 0x4260DF6: opal_event_loop
> (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
> ==32362==    by 0x4260E1D: opal_event_dispatch
> (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
> ==32362==    by 0x804B15F: orterun (orterun.c:757)
> ==32362==  Address 0x448507c is 20 bytes inside a block of size 512
> alloc'd
> ==32362==    at 0x402613C: realloc (vg_replace_malloc.c:429)
> ==32362==    by 0x42556B7: opal_dss_buffer_extend
> (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
> ==32362==    by 0x4256C4F: opal_dss_pack_int32
> (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
> ==32362==    by 0x42565C9: opal_dss_pack_buffer
> (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
> ==32362==    by 0x403A60D: orte_dt_pack_job
> (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-rte.so.0.0.0)
> ==32362==    by 0x42565C9: opal_dss_pack_buffer
> (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
> ==32362==    by 0x4256FFB: opal_dss_pack
> (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
> ==32362==    by 0x40506F7: process_commands
> (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-rte.so.0.0.0)
> ==32362==    by 0x405108C: orte_daemon_cmd_processor
> (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-rte.so.0.0.0)
> ==32362==    by 0x4260B57: opal_event_base_loop
> (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
> ==32362==    by 0x4260DF6: opal_event_loop
> (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
> ==32362==    by 0x4260E1D: opal_event_dispatch
> (in /mnt/home/debian/ashley/code/OpenMPI/install/lib/libopen-pal.so.0.0.0)
> 
> On Mon, 2009-05-18 at 08:22 -0600, Ralph Castain wrote:
> > Aha! Thanks for spotting the problem - I had to move that var init to  
> > cover all cases, but it should be working now with r21249
> > 
> > 
> > 
> > On May 18, 2009, at 8:08 AM, Ashley Pittman wrote:
> > 
> > >
> > > Ralph,
> > >
> > > This patch fixed it: num_nodes was being used uninitialised and hence
> > > the client was getting a bogus value for the number of nodes.
> > >
> > > Ashley,
> > >
> > > On Mon, 2009-05-18 at 10:09 +0100, Ashley Pittman wrote:
> > >> No joy I'm afraid; now I get errors when I run it.  This is a
> > >> single-node job run with the command line "mpirun -n 3 ./a.out".  I've
> > >> attached the strace output and gzipped /tmp files from the machine.
> > >> Valgrind on the ompi-ps process doesn't show anything interesting.
> > >>
> > >> [alpha:29942] [[35044,0],0] ORTE_ERROR_LOG: Data unpack would read past
> > >> end of buffer in file
> > >> /mnt/home/debian/ashley/code/OpenMPI/ompi-trunk-tes/trunk/orte/util/comm/comm.c
> > >> at line 242
> > >> [alpha:29942] [[35044,0],0] ORTE_ERROR_LOG: Data unpack would read past
> > >> end of buffer in file
> > >> /mnt/home/debian/ashley/code/OpenMPI/ompi-trunk-tes/trunk/orte/tools/orte-ps/orte-ps.c
> > >> at line 818
> > >>
> > >> Ashley.
> > >>
> > >> On Sat, 2009-05-16 at 08:15 -0600, Ralph Castain wrote:
> > >>> This is fixed now, Ashley - sorry for the problem.
> > >>>
> > >>>
> > >>> On May 15, 2009, at 4:47 AM, Ashley Pittman wrote:
> > >>>
> > >>>> On Thu, 2009-05-14 at 22:49 -0600, Ralph Castain wrote:
> > >>>>> It is definitely broken at the moment, Ashley. I have it pretty well
> > >>>>> fixed, but need/want to clean up some corner cases that have plagued
> > >>>>> us for a long time.
> > >>>>>
> > >>>>> Should have it for you sometime Friday.
> > >>>>
> > >>>> Ok, thanks.  I might try switching to slurm in the meantime; I know
> > >>>> my code works with that.
> > >>>>
> > >>>> Can you let me know when it's fixed on or off list and I'll do an
> > >>>> update.
> > >>>>
> > >>>> Ashley,
> > >>>>
> > > <ompi-ps.patch>
> > 
> 
