On Thu, 2 Aug 2012 10:25:53 -0400
Jeff Squyres <jsquy...@cisco.com> wrote:

> On Aug 1, 2012, at 9:44 AM, Christopher Yeoh wrote:
> 
> I run one of my MTT configs with --enable-mpi-thread-multiple, but
> only run single-threaded apps (i.e., MPI_THREAD_SINGLE).  This just
> checks the bozo case.
> 
> > I'm seeing even very basic programs hang. If it is working for you,
> > what architecture are you running on? (may help me debug what is
> > going on with my setup). In contrast, 1.6 on the same machines work
> > fine for me (well as fine as MPI_THREAD_MULTIPLE has ever worked
> > anyway ;-) 
> 
> I wonder what broke on the trunk...

I don't know, but it's getting pretty frustrating trying to work out
what is going wrong. I've narrowed it down to a very simple test case 
(you don't need to explicitly spawn any threads).

Just need a program like:

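#include <err.h>
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
        char hostname[4096];
        int rank, size, provided;

        MPI_Init_thread( &argc, &argv, MPI_THREAD_MULTIPLE, &provided );

        if (provided != MPI_THREAD_MULTIPLE) {
                MPI_Finalize();
                errx(1, "MPI_Init_thread expected MPI_THREAD_MULTIPLE (%d), "
                         "got %d", MPI_THREAD_MULTIPLE, provided);
        }

        /* Fill in the values printed below. */
        gethostname(hostname, sizeof(hostname));
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        printf("%s(%d) of %d provided=(%d)\n", hostname, rank, size, provided);

        MPI_Barrier(MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
}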

If it's run with "--mpi-preconnect_mpi 1" then it hangs in MPI_Init_thread. If
not, then it hangs in MPI_Barrier. I get a backtrace that looks like this (with
the former):

(gdb) bt
#0  0x0000008039720d6c in .pthread_cond_wait () from /lib64/power6/libpthread.so.0
#1  0x00000400001299d8 in opal_condition_wait (c=0x400004763f8, m=0x40000476460)
    at ../../ompi-trunk.chris2/opal/threads/condition.h:79
#2  0x000004000012a08c in ompi_request_default_wait_all (count=2, requests=0xfffffa9db20,
    statuses=0x0) at ../../ompi-trunk.chris2/ompi/request/req_wait.c:281
#3  0x000004000012f56c in ompi_init_preconnect_mpi ()
    at ../../ompi-trunk.chris2/ompi/runtime/ompi_mpi_preconnect.c:72
#4  0x000004000012c738 in ompi_mpi_init (argc=1, argv=0xfffffa9f278, requested=3,
    provided=0xfffffa9edd8) at ../../ompi-trunk.chris2/ompi/runtime/ompi_mpi_init.c:800
#5  0x000004000017a064 in PMPI_Init_thread (argc=0xfffffa9ee20, argv=0xfffffa9ee28, required=3,
    provided=0xfffffa9edd8) at pinit_thread.c:84
#6  0x0000000010000ae4 in main (argc=1, argv=0xfffffa9f278) at test2.c:15

(Neither of the requests is received, so presumably messages are getting lost.)

In contrast, if you run against the exact same build of OMPI with pretty
much the same test program but call "MPI_Init(&argc, &argv)" instead, then it
works fine.

If anyone has any suggestions they'd be very welcome. I've resorted to staring
at log outputs (with openib and rml verbose), comparing runs with
MPI_THREAD_MULTIPLE and without, to try to work out where there might be a race.
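
Part of that log comparison could be automated. What follows is only a
minimal sketch in C. It assumes the two verbose logs have already been saved
to files and stripped of timestamps, PIDs and anything else that varies from
run to run. first_divergence is a hypothetical helper written for
illustration, not part of Open MPI:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Return the 1-based number of the first line at which the two streams
 * differ, or 0 if they are identical. A stream that ends early counts as
 * a difference at the line where it ends. */
long first_divergence(FILE *a, FILE *b)
{
        char la[4096], lb[4096];
        long line = 1;

        assert(a != NULL && b != NULL);

        for (;;) {
                char *ra = fgets(la, sizeof la, a);
                char *rb = fgets(lb, sizeof lb, b);

                if (ra == NULL && rb == NULL)
                        return 0;       /* both ended together */
                if (ra == NULL || rb == NULL)
                        return line;    /* one log is shorter */
                if (strcmp(la, lb) != 0)
                        return line;    /* contents differ */

                /* Lines longer than the buffer arrive in several chunks,
                 * so only count a line once its newline has been read. */
                if (strchr(la, '\n') != NULL)
                        line++;
        }
}
```

Called on the two normalised logs, this reports the first line where the
MPI_THREAD_MULTIPLE run departs from the plain MPI_Init run, which is at
least a place to start looking. Unlike diff, it does not resynchronise after
the first difference.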

Regards,

Chris
-- 
cy...@ozlabs.org
