Tim,
I just tried this same code again with 1.0rc4, and I still see the same
symptom. The gdb stack trace for a hung process looks a bit different
this time, however:
(gdb) bt
#0 0x0000002a98a085e1 in mca_bml_r2_progress ()
from /opt/openmpi-1.0rc4-pgi-6.0/lib/openmpi/mca_bml_r2.so
#1 0x0000002a986c3080 in mca_pml_ob1_progress ()
from /opt/openmpi-1.0rc4-pgi-6.0/lib/openmpi/mca_pml_ob1.so
#2 0x0000002a95d8378c in opal_progress ()
from /opt/openmpi-1.0rc4-pgi-6.0/lib/libopal.so.0
#3 0x0000002a95a6d8a5 in opal_condition_wait ()
from /opt/openmpi-1.0rc4-pgi-6.0/lib/libmpi.so.0
#4 0x0000002a95a6de49 in ompi_request_wait_all ()
from /opt/openmpi-1.0rc4-pgi-6.0/lib/libmpi.so.0
#5 0x0000002a95937602 in PMPI_Waitall ()
from /opt/openmpi-1.0rc4-pgi-6.0/lib/libmpi.so.0
#6 0x00000000004092d4 in rt_waitscanlines (voidhandle=0x635e10)
at parallel.c:229
#7 0x000000000040b515 in renderscene (scene=0x6394d0) at render.c:285
#8 0x0000000000404f75 in rt_renderscene (voidscene=0x6394d0) at api.c:95
#9 0x0000000000418ac7 in main (argc=6, argv=0x7fbfffec38) at main.c:431
(gdb)
The process still appears to be stuck in the same MPI_Waitall call.
Any ideas? If you need any additional information from me, please let
me know.
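
For reference, here is a minimal sketch of the kind of nonblocking-receive
plus MPI_Waitall pattern that rt_waitscanlines appears to be blocked in.
The request count, buffer size, tag, and function name below are made up
for illustration, not taken from Tachyon's source; the point is just that
if even one matching send never arrives, MPI_Waitall spins in
opal_progress() indefinitely, which matches the symptom:

    /* Sketch only: illustrative sizes, not Tachyon's real values. */
    #include <mpi.h>

    #define NREQS          16     /* hypothetical: one request per worker   */
    #define SCANLINE_BYTES 4096   /* hypothetical scanline buffer size      */

    static char bufs[NREQS][SCANLINE_BYTES];

    static void wait_for_scanlines(MPI_Comm comm)
    {
        MPI_Request reqs[NREQS];
        int i;

        /* Post one nonblocking receive per expected scanline message. */
        for (i = 0; i < NREQS; i++)
            MPI_Irecv(bufs[i], SCANLINE_BYTES, MPI_BYTE, MPI_ANY_SOURCE,
                      0 /* tag */, comm, &reqs[i]);

        /* Blocks until *every* request completes; a single missing
         * message leaves this call waiting forever. */
        MPI_Waitall(NREQS, reqs, MPI_STATUSES_IGNORE);
    }
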
Thanks in advance,
+chris
--
Chris Parrott 5204 E. Ben White Blvd., M/S 628
Product Development Engineer Austin, TX 78741
Computational Products Group (512) 602-8710 / (512) 602-7745 (fax)
Advanced Micro Devices [email protected]
> -----Original Message-----
> From: Tim S. Woodall [mailto:[email protected]]
> Sent: Monday, October 17, 2005 2:01 PM
> To: Open MPI Users
> Cc: Parrott, Chris
> Subject: Re: [O-MPI users] OpenMPI hang issue
>
>
> Hello Chris,
>
> Please give the next release candidate a try. There was an
> issue w/ the GM port that was likely causing this.
>
> Thanks,
> Tim
>
>
> Parrott, Chris wrote:
> > Greetings,
> >
> > I have been testing OpenMPI 1.0rc3 on a rack of 8 2-processor (single
> > core) Opteron systems connected via both Gigabit Ethernet and Myrinet.
> > My testing has been mostly successful, although I have run into a
> > recurring issue on a few MPI applications. The symptom is that the
> > computation seems to progress nearly to completion, and then suddenly
> > just hangs without terminating. One code that demonstrates this is
> > the Tachyon parallel raytracer, available at:
> >
> > http://jedi.ks.uiuc.edu/~johns/raytracer/
> >
> > I am using PGI 6.0-5 to compile OpenMPI, so that may be part of the
> > root cause of this particular problem.
> >
> > I have attached the output of config.log to this message. Here is
> > the output from ompi_info:
> >
> > Open MPI: 1.0rc3r7730
> > Open MPI SVN revision: r7730
> > Open RTE: 1.0rc3r7730
> > Open RTE SVN revision: r7730
> > OPAL: 1.0rc3r7730
> > OPAL SVN revision: r7730
> > Prefix: /opt/openmpi-1.0rc3-pgi-6.0
> > Configured architecture: x86_64-unknown-linux-gnu
> > Configured by: root
> > Configured on: Mon Oct 17 10:10:28 PDT 2005
> > Configure host: castor00
> > Built by: root
> > Built on: Mon Oct 17 10:29:20 PDT 2005
> > Built host: castor00
> > C bindings: yes
> > C++ bindings: yes
> > Fortran77 bindings: yes (all)
> > Fortran90 bindings: yes
> > C compiler: pgcc
> > C compiler absolute: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgcc
> > C++ compiler: pgCC
> > C++ compiler absolute: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgCC
> > Fortran77 compiler: pgf77
> > Fortran77 compiler abs: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgf77
> > Fortran90 compiler: pgf90
> > Fortran90 compiler abs: /net/lisbon/opt/pgi-6.0-5/linux86-64/6.0/bin/pgf90
> > C profiling: yes
> > C++ profiling: yes
> > Fortran77 profiling: yes
> > Fortran90 profiling: yes
> > C++ exceptions: no
> > Thread support: posix (mpi: no, progress: no)
> > Internal debug support: no
> > MPI parameter check: runtime
> > Memory profiling support: no
> > Memory debugging support: no
> > libltdl support: 1
> > MCA memory: malloc_hooks (MCA v1.0, API v1.0, Component v1.0)
> > MCA paffinity: linux (MCA v1.0, API v1.0, Component v1.0)
> > MCA maffinity: first_use (MCA v1.0, API v1.0, Component v1.0)
> > MCA maffinity: libnuma (MCA v1.0, API v1.0, Component v1.0)
> > MCA timer: linux (MCA v1.0, API v1.0, Component v1.0)
> > MCA allocator: basic (MCA v1.0, API v1.0, Component v1.0)
> > MCA allocator: bucket (MCA v1.0, API v1.0, Component v1.0)
> > MCA coll: basic (MCA v1.0, API v1.0, Component v1.0)
> > MCA coll: self (MCA v1.0, API v1.0, Component v1.0)
> > MCA coll: sm (MCA v1.0, API v1.0, Component v1.0)
> > MCA io: romio (MCA v1.0, API v1.0, Component v1.0)
> > MCA mpool: gm (MCA v1.0, API v1.0, Component v1.0)
> > MCA mpool: sm (MCA v1.0, API v1.0, Component v1.0)
> > MCA pml: ob1 (MCA v1.0, API v1.0, Component v1.0)
> > MCA pml: teg (MCA v1.0, API v1.0, Component v1.0)
> > MCA pml: uniq (MCA v1.0, API v1.0, Component v1.0)
> > MCA ptl: gm (MCA v1.0, API v1.0, Component v1.0)
> > MCA ptl: self (MCA v1.0, API v1.0, Component v1.0)
> > MCA ptl: sm (MCA v1.0, API v1.0, Component v1.0)
> > MCA ptl: tcp (MCA v1.0, API v1.0, Component v1.0)
> > MCA btl: gm (MCA v1.0, API v1.0, Component v1.0)
> > MCA btl: self (MCA v1.0, API v1.0, Component v1.0)
> > MCA btl: sm (MCA v1.0, API v1.0, Component v1.0)
> > MCA btl: tcp (MCA v1.0, API v1.0, Component v1.0)
> > MCA topo: unity (MCA v1.0, API v1.0, Component v1.0)
> > MCA gpr: null (MCA v1.0, API v1.0, Component v1.0)
> > MCA gpr: proxy (MCA v1.0, API v1.0, Component v1.0)
> > MCA gpr: replica (MCA v1.0, API v1.0, Component v1.0)
> > MCA iof: proxy (MCA v1.0, API v1.0, Component v1.0)
> > MCA iof: svc (MCA v1.0, API v1.0, Component v1.0)
> > MCA ns: proxy (MCA v1.0, API v1.0, Component v1.0)
> > MCA ns: replica (MCA v1.0, API v1.0, Component v1.0)
> > MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
> > MCA ras: dash_host (MCA v1.0, API v1.0, Component v1.0)
> > MCA ras: hostfile (MCA v1.0, API v1.0, Component v1.0)
> > MCA ras: localhost (MCA v1.0, API v1.0, Component v1.0)
> > MCA ras: slurm (MCA v1.0, API v1.0, Component v1.0)
> > MCA rds: hostfile (MCA v1.0, API v1.0, Component v1.0)
> > MCA rds: resfile (MCA v1.0, API v1.0, Component v1.0)
> > MCA rmaps: round_robin (MCA v1.0, API v1.0, Component v1.0)
> > MCA rmgr: proxy (MCA v1.0, API v1.0, Component v1.0)
> > MCA rmgr: urm (MCA v1.0, API v1.0, Component v1.0)
> > MCA rml: oob (MCA v1.0, API v1.0, Component v1.0)
> > MCA pls: daemon (MCA v1.0, API v1.0, Component v1.0)
> > MCA pls: fork (MCA v1.0, API v1.0, Component v1.0)
> > MCA pls: proxy (MCA v1.0, API v1.0, Component v1.0)
> > MCA pls: rsh (MCA v1.0, API v1.0, Component v1.0)
> > MCA pls: slurm (MCA v1.0, API v1.0, Component v1.0)
> > MCA sds: env (MCA v1.0, API v1.0, Component v1.0)
> > MCA sds: pipe (MCA v1.0, API v1.0, Component v1.0)
> > MCA sds: seed (MCA v1.0, API v1.0, Component v1.0)
> > MCA sds: singleton (MCA v1.0, API v1.0, Component v1.0)
> > MCA sds: slurm (MCA v1.0, API v1.0, Component v1.0)
> >
> >
> > Here is the command-line I am using to invoke OpenMPI for my build
> > of Tachyon:
> >
> > /opt/openmpi-1.0rc3-pgi-6.0/bin/mpirun --prefix
> > /opt/openmpi-1.0rc3-pgi-6.0 --mca pls_rsh_agent rsh --hostfile
> > hostfile.gigeth -np 16 tachyon_base.mpi -o scene.tga scene.dat
> >
> > Attaching gdb to one of the hung processes, I get the following
> > stack trace:
> >
> > (gdb) bt
> > #0 0x0000002a95d6b87d in opal_sys_timer_get_cycles ()
> > from /opt/openmpi-1.0rc3-pgi-6.0/lib/libopal.so.0
> > #1 0x0000002a95d83509 in opal_timer_base_get_cycles ()
> > from /opt/openmpi-1.0rc3-pgi-6.0/lib/libopal.so.0
> > #2 0x0000002a95d8370c in opal_progress ()
> > from /opt/openmpi-1.0rc3-pgi-6.0/lib/libopal.so.0
> > #3 0x0000002a95a6d8a5 in opal_condition_wait ()
> > from /opt/openmpi-1.0rc3-pgi-6.0/lib/libmpi.so.0
> > #4 0x0000002a95a6de49 in ompi_request_wait_all ()
> > from /opt/openmpi-1.0rc3-pgi-6.0/lib/libmpi.so.0
> > #5 0x0000002a95937602 in PMPI_Waitall ()
> > from /opt/openmpi-1.0rc3-pgi-6.0/lib/libmpi.so.0
> > #6 0x00000000004092d4 in rt_waitscanlines (voidhandle=0x635a60)
> > at parallel.c:229
> > #7 0x000000000040b515 in renderscene (scene=0x6394d0) at render.c:285
> > #8 0x0000000000404f75 in rt_renderscene (voidscene=0x6394d0) at api.c:95
> > #9 0x0000000000418ac7 in main (argc=6, argv=0x7fbfffec38) at main.c:431
> > (gdb)
> >
> > So based on this stack trace, it appears that the application is
> > hanging on an MPI_Waitall call for some reason.
> >
> > Does anyone have any ideas as to why this might be happening? If
> > this is covered in the FAQ somewhere, then please accept my apologies
> > in advance.
> >
> > Many thanks,
> >
> > +chris
> >
> > --
> > Chris Parrott 5204 E. Ben White Blvd., M/S 628
> > Product Development Engineer Austin, TX 78741
> > Computational Products Group (512) 602-8710 / (512) 602-7745 (fax)
> > Advanced Micro Devices [email protected]
> >
> >
> >
>
>