Hi!

I've just tried out openmpi-1.2.3-rc1.  My client programs run
successfully when nproc < 16.  However, when the number of nodes is >= 16,
mpirun hangs (on the master node only) at the end of the execution, after
all processes (including the client program itself and the orted daemons)
have exited.

I then ran ps x on the master node and found that mpirun is the only
remaining entry.  It appears to be in the sleeping state (S+).
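
If a backtrace of the hung mpirun would help, I can attach gdb to it and
capture one, roughly along these lines (the PID is just whatever ps
reports for mpirun):

  gdb -p <pid-of-mpirun>
  (gdb) thread apply all bt
  (gdb) detach
  (gdb) quit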

So, does this give any more hints about what went wrong?

Thanks!



On 6/12/07, Ralph H Castain <r...@lanl.gov> wrote:
Hi there

Sorry for the delayed response - I was tied up this weekend and almost
completely away from the computer. Doesn't happen very often (probably not
often enough! ;-) )

I can only think of one thing you could try with 1.2.2. I note that you have
enabled MPI threads and progress threads. Do you really need the threading
capabilities? If you can possibly live without them, at least for a trial,
then I would re-configure with --without-threads --disable-mpi-threads
--disable-progress-threads.
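
Concretely, that would mean something along these lines, keeping the rest
of your original configure line as it was:

  ./configure CFLAGS="-g -pg -O3" \
      --prefix=/home/foo/490_research/490/src/mpi.optimized_profiling/ \
      --without-threads --disable-mpi-threads --disable-progress-threads \
      --enable-static --disable-shared --without-memory-manager \
      --without-libnuma --disable-mpi-f77 --disable-mpi-f90 \
      --disable-mpi-cxx --disable-mpi-cxx-seek --disable-dlopen
  make all
  make install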

Our threading support is really not that great just yet, so it is entirely
possible that you are hitting some kind of thread-lock condition.
Unfortunately, it is impossible to tell at this point, though we hopefully
will have some new diagnostics shortly that will help us developers debug
such situations.

I did recently introduce some major changes to the system that *might*
affect this behavior. However, those are only in our subversion trunk and
will never be moved to the 1.2 code series - they will be released with the
1.3 series sometime late this year/early next year. If you would like, you
can check out the trunk and try your code with that version to see if you get
some improved behavior.
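
Roughly (I'm writing the trunk URL from memory, so please double-check it
against the web site):

  svn checkout http://svn.open-mpi.org/svn/ompi/trunk ompi-trunk
  cd ompi-trunk
  ./autogen.sh
  ./configure --prefix=<your-install-prefix> ...
  make all install

Note that a subversion checkout needs the ./autogen.sh step (and
reasonably recent GNU autotools) before configure will exist.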

Hope that is of some help. Let me know what you see and I'll try to help you
out.

Ralph


On 6/11/07 4:02 AM, "Code Master" <cpp.codemas...@gmail.com> wrote:

> Hi Ralph,
>
> I'm using openmpi-1.2.2 to compile and run my client app.  After my
> app and the orted processes exit successfully on all master and slave
> nodes, mpirun hangs on the master node (mpirun has also exited
> successfully on all slave nodes).
>
> This only happens in openmpi-1.2.2.
>
> Can you see why this is happening?  (I've included my ./configure
> command in the quoted messages below.)  Would you please help me out?  I
> really need to get mpirun working in order to benchmark my parallel
> programs for my dissertation.
>
> Thanks!
>
> ---------- Forwarded message ----------
> From: Code Master <cpp.codemas...@gmail.com>
> Date: Jun 9, 2007 9:44 AM
> Subject: Re: [OMPI users] mpirun in openmpi-1.2.2 fails to exit after
> client program finishes
> To: Open MPI Users <us...@open-mpi.org>
>
>
> On 6/9/07, Jeff Squyres <jsquy...@cisco.com> wrote:
>> On Jun 8, 2007, at 9:29 AM, Code Master wrote:
>>
>>> I compiled openmpi-1.2.2 with:
>>>
>>> ./configure CFLAGS="-g -pg -O3" \
>>>   --prefix=/home/foo/490_research/490/src/mpi.optimized_profiling/ \
>>>   --enable-mpi-threads --enable-progress-threads --enable-static \
>>>   --disable-shared --without-memory-manager --without-libnuma \
>>>   --disable-mpi-f77 --disable-mpi-f90 --disable-mpi-cxx \
>>>   --disable-mpi-cxx-seek --disable-dlopen
>>>
>>> (Thanks Jeff, now I know that I have to add --without-memory-manager
>>> and --without-libnuma for static linking.)
>>
>> Good.
>>
>>> make all
>>> make install
>>>
>>> then I run my client app with:
>>>
>>> /home/foo/490_research/490/src/mpi.optimized_profiling/bin/mpirun \
>>>   --hostfile ../hostfile -n 32 raytrace -finputs/car.env
>>>
>>> The program runs well and each process completes successfully (I can
>>> tell because every process has generated its gmon.out, and a "ps aux"
>>> on the slave nodes (i.e. every node except the originating one) shows
>>> that my program has already exited there).  Therefore I think this may
>>> have something to do with mpirun, which hangs forever.
>>
>> Be aware that you may have problems with multiple processes writing
>> to the same gmon.out, unless you're running each instance in a
>> different directory (your command line doesn't indicate that you are,
>> but that doesn't necessarily prove anything).
>
> I am sure this is not happening, because in my program, right after the
> MPI initialization, main() invokes chdir() to switch to the process's
> own directory (named after its proc_id).  Therefore each process has its
> own directory to write to.
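>
> Roughly, that part of main() looks like this (a simplified sketch from
> memory; the real directory names are built from the rank/proc_id):
>
> #include <stdio.h>
> #include <unistd.h>
> #include <mpi.h>
>
> int main(int argc, char **argv)
> {
>     int rank;
>     char dirname[64];
>
>     MPI_Init(&argc, &argv);
>     MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>
>     /* each rank runs (and writes its gmon.out) in its own directory */
>     snprintf(dirname, sizeof(dirname), "%d", rank);
>     if (chdir(dirname) != 0) {
>         perror("chdir");
>         MPI_Abort(MPI_COMM_WORLD, 1);
>     }
>
>     /* ... the actual raytrace work ... */
>
>     MPI_Finalize();
>     return 0;
> }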
>
>>> Can you see anything wrong in my ./configure command which explains
>>> the mpirun hang at the end of the run?  How can I fix it?
>>
>> No, everything looks fine.
>>
>> So you confirm that all raytrace instances have exited and all orteds
>> have exited, leaving *only* mpirun running?
>
> Yes, I am sure that all raytrace instances as well as all MPI-related
> processes (including mpirun, the orteds, etc.) have exited on all slave
> nodes.  On the *master* node, all raytrace instances and all orteds
> have exited as well, leaving *only* mpirun running there:
>
> 14818 pts/0    S+     0:00
>     /home/foo/490_research/490/src/mpi.optimized_profiling/bin/mpirun
>     --hostfile ../hostfile -n 32 raytrace -finputs/car.env -s 1
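>
> For completeness, I also re-checked the slave nodes with something along
> these lines (assuming one host per line in the hostfile):
>
> for h in $(cat ../hostfile); do
>     ssh $h 'ps ax | egrep "[r]aytrace|[o]rted"'
> done
>
> and nothing shows up on any of them.
>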
>> There was a race condition about this at one point; Ralph -- can you
>> comment further?
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>>
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> http://www.open-mpi.org/mailman/listinfo.cgi/users
>>


