Rainer,

I was hoping my patch would solve the problem completely ... looks like that's not the case :( How exactly do you get the deadlock in the mpi_test_suite? Which configure options? Only --enable-progress-threads?

  Thanks,
    george.

On Jan 19, 2006, at 11:12 AM, Rainer Keller wrote:

Hello dear all,

George's patch (svn:open-mpi r8741) makes the deadlock that we experienced with the mpi_test_suite on a threaded build (compiled with --enable-progress-threads) go away in most, but not all, runs.

Previously, we would hang here:

rusraink@pcglap12:~/WORK/OPENMPI/ompi-tests/mpi_test_suite/COMPILE-clean-threads>
mpirun -np 2 ./mpi_test_suite -r FULL -c MPI_COMM_WORLD -d MPI_INT

P2P tests Ring (3/31), comm MPI_COMM_WORLD (1/1), type MPI_INT (6/1)
[... Tests snipped ...]
P2P tests Alltoall with MPI_Probe (MPI_ANY_SOURCE) (20/31), comm MPI_COMM_WORLD (1/1), type MPI_INT (6/1)
Collective tests Bcast (23/31), comm MPI_COMM_WORLD (1/1), type MPI_INT (6/1)
...
Here we used to always hang.

Now we get through most of the time (9 runs out of 10).
This is all without the below patch.

CU,
Rainer

On Wednesday 18 January 2006 22:39, Brian Barrett wrote:
Occurrences:
      ompi/class/ompi_free_list.h

This is ok as is, because the loop protecting against a spurious
wakeup is already there.  If two threads are waiting, both are woken
up, and there's only one request (or somehow, no requests), then
they'll try to remove from the list, get NULL, and continue through
the bigger while() loop.  So that works as expected.
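
To illustrate the reasoning above, here is a minimal sketch of the "re-check in a loop" pattern using plain pthreads. It is not the Open MPI free-list code: free_list_t, fl_try_remove, fl_wait and fl_return are hypothetical names made up for this example. The point is only that a waiter which wakes up and finds the list empty (spurious wake-up, or another thread got there first) sees NULL and simply waits again.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a free list; NOT the Open MPI data structure. */
typedef struct item { struct item *next; } item_t;

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    item_t         *head;
} free_list_t;

static void fl_init(free_list_t *fl)
{
    pthread_mutex_init(&fl->lock, NULL);
    pthread_cond_init(&fl->cond, NULL);
    fl->head = NULL;
}

/* Pop one item, or return NULL if the list is empty.
 * The caller is expected to hold fl->lock when threads are active. */
static item_t *fl_try_remove(free_list_t *fl)
{
    item_t *it = fl->head;
    if (it != NULL) {
        fl->head = it->next;
    }
    return it;
}

/* Block until an item is available. The while() loop is what makes
 * spurious or "too many" wake-ups harmless: a woken thread that gets
 * NULL goes back to waiting instead of using a NULL item. */
static item_t *fl_wait(free_list_t *fl)
{
    item_t *it;
    pthread_mutex_lock(&fl->lock);
    while ((it = fl_try_remove(fl)) == NULL) {
        pthread_cond_wait(&fl->cond, &fl->lock);
    }
    pthread_mutex_unlock(&fl->lock);
    assert(it != NULL);
    return it;
}

/* Return an item and wake every waiter; all but one will find the
 * list empty again and loop. */
static void fl_return(free_list_t *fl, item_t *it)
{
    pthread_mutex_lock(&fl->lock);
    it->next = fl->head;
    fl->head = it;
    pthread_cond_broadcast(&fl->cond);
    pthread_mutex_unlock(&fl->lock);
}
```

With an if() instead of the while() in fl_wait, the second of two woken threads could return NULL to its caller, which is exactly the case the existing loop guards against.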

      opal/class/opal_free_list.h

Same reasoning as ompi_free_list.

      ompi/request/req_wait.c          /* Two occurrences: not a must, but... */

I believe these are both correct.  The first is in a larger do { ...}
while loop that will handle the case of a wakeup with no requests
ready.  The second is in a tight while() loop already, so we're ok
there.

      orte/mca/gpr/proxy/gpr_proxy_compound_cmd.c

This one I'd like Ralph to look at, because I'm not sure I understand
the logic completely.  It looks like this is potentially a problem.
Only one thread will be woken up at a time, since the mutex has to be
re-acquired.  So the question becomes, will anyone give up the mutex
with component.compound_cmd_mode left set to true, and I think the
answer is yes.  This looks like it could be a possible bug if people
are using the compound command code when multiple threads are
active.  Thankfully, I don't think this happens very often.

      orte/mca/iof/base/iof_base_flush.c:108

This looks like it's wrapped in a larger while loop and is safe from
any restart wait conditions.

      orte/mca/pls/rsh/pls_rsh_module.c:892

This could be a bit of a problem, but I don't think spurious wake-ups
will cause any real problems.  The worst case is that possibly we end
up trying to concurrently start more processes than we really
intended.  But Tim might have more insight than I do.
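
For what it's worth, the "more processes than intended" case can be sketched as follows. This is a generic pthreads example under my own assumptions, not the code in pls_rsh_module.c: launch_gate_t, gate_enter and gate_leave are hypothetical names. If the wait were guarded by an if(), a spurious wake-up would let a thread increment the counter past the limit; re-checking the counter in a while() loop keeps the bound.

```c
#include <pthread.h>

/* Hypothetical throttle for concurrent launches; NOT the rsh PLS code. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             active;  /* launches currently in flight */
    int             limit;   /* maximum allowed in flight */
    int             peak;    /* highest value 'active' ever reached */
} launch_gate_t;

static void gate_init(launch_gate_t *g, int limit)
{
    pthread_mutex_init(&g->lock, NULL);
    pthread_cond_init(&g->cond, NULL);
    g->active = 0;
    g->limit  = limit;
    g->peak   = 0;
}

/* Wait for a free slot, then take it. Re-testing the counter after
 * every wake-up means a spurious wake-up cannot push 'active' past
 * 'limit'. */
static void gate_enter(launch_gate_t *g)
{
    pthread_mutex_lock(&g->lock);
    while (g->active >= g->limit) {
        pthread_cond_wait(&g->cond, &g->lock);
    }
    g->active++;
    if (g->active > g->peak) {
        g->peak = g->active;
    }
    pthread_mutex_unlock(&g->lock);
}

/* Release a slot and wake one waiter. */
static void gate_leave(launch_gate_t *g)
{
    pthread_mutex_lock(&g->lock);
    g->active--;
    pthread_cond_signal(&g->cond);
    pthread_mutex_unlock(&g->lock);
}
```

Each launcher would call gate_enter() before starting a process and gate_leave() once that process is up; 'peak' is only there so the bound can be checked afterwards.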


Just my $0.02

Brian

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
---------------------------------------------------------------------
Dipl.-Inf. Rainer Keller       email: kel...@hlrs.de
  High Performance Computing     Tel: ++49 (0)711-685 5858
    Center Stuttgart (HLRS)        Fax: ++49 (0)711-685 5832
  POSTAL:Nobelstrasse 19             http://www.hlrs.de/people/keller
  ACTUAL:Allmandring 30, R. O.030      AIM:rusraink
  70550 Stuttgart

"Half of what I say is meaningless; but I say it so that the other half may reach you"
                                  Kahlil Gibran

