I've replied in the ticket: https://svn.open-mpi.org/trac/ompi/ticket/2530#comment:2

Thanks!
John
On Mon, Aug 9, 2010 at 2:42 PM, Jeff Squyres <jsquy...@cisco.com> wrote:
> I've opened a ticket about this -- if it's an actual problem, it's a 1.5 blocker:
>
> https://svn.open-mpi.org/trac/ompi/ticket/2530
>
> What version of knem and Linux are you using?
>
> On Aug 9, 2010, at 4:50 PM, John Hsu wrote:
>
> > problem "fixed" by adding the --mca btl_sm_use_knem 0 option (with -npernode 11), so I proceeded to bump up -npernode to 12:
> >
> > $ ../openmpi_devel/bin/mpirun -hostfile hostfiles/hostfile.wgsgX -npernode 12 --mca btl_sm_use_knem 0 ./bin/mpi_test
> >
> > and the same error occurs:
> >
> > (gdb) bt
> > #0  0x00007fcca6ae5cf3 in epoll_wait () from /lib/libc.so.6
> > #1  0x00007fcca7e5ea4b in epoll_dispatch () from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #2  0x00007fcca7e665fa in opal_event_base_loop () from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #3  0x00007fcca7e37e69 in opal_progress () from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #4  0x00007fcca15b6e95 in mca_pml_ob1_recv () from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/openmpi/mca_pml_ob1.so
> > #5  0x00007fcca7dd635c in PMPI_Recv () from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #6  0x000000000040ae48 in MPI::Comm::Recv (this=0x612800, buf=0x7fff2a0d7e00, count=1, datatype=..., source=23, tag=100, status=...)
> >     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
> > #7  0x0000000000409a57 in main (argc=1, argv=0x7fff2a0d8028)
> >     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:30
> > (gdb)
> >
> > (gdb) bt
> > #0  0x00007f5dc31d2cf3 in epoll_wait () from /lib/libc.so.6
> > #1  0x00007f5dc454ba4b in epoll_dispatch () from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #2  0x00007f5dc45535fa in opal_event_base_loop () from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #3  0x00007f5dc4524e69 in opal_progress () from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #4  0x00007f5dbdca4b1d in mca_pml_ob1_send () from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/openmpi/mca_pml_ob1.so
> > #5  0x00007f5dc44c574f in PMPI_Send () from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > #6  0x000000000040adda in MPI::Comm::Send (this=0x612800, buf=0x7fff6e0c0790, count=1, datatype=..., dest=0, tag=100)
> >     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
> > #7  0x0000000000409b72 in main (argc=1, argv=0x7fff6e0c09b8)
> >     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:38
> > (gdb)
> >
> > On Mon, Aug 9, 2010 at 6:39 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
> >
> > > In your first mail, you mentioned that you are testing the new knem support.
> > > Can you try disabling knem and see if that fixes the problem? (i.e., run with "--mca btl_sm_use_knem 0") If it fixes the issue, that might mean we have a knem-based bug.
> > >
> > > On Aug 6, 2010, at 1:42 PM, John Hsu wrote:
> > >
> > > > Hi,
> > > >
> > > > sorry for the confusion, that was indeed the trunk version of things I was running.
> > > >
> > > > Here's the same problem using
> > > > http://www.open-mpi.org/software/ompi/v1.5/downloads/openmpi-1.5rc5.tar.bz2
> > > >
> > > > command-line:
> > > >
> > > > ../openmpi_devel/bin/mpirun -hostfile hostfiles/hostfile.wgsgX -npernode 11 ./bin/mpi_test
> > > >
> > > > back trace on the receiver (rank 0):
> > > >
> > > > (gdb) bt
> > > > #0  0x00007fa003bcacf3 in epoll_wait () from /lib/libc.so.6
> > > > #1  0x00007fa004f43a4b in epoll_dispatch () from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > > > #2  0x00007fa004f4b5fa in opal_event_base_loop () from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > > > #3  0x00007fa004f1ce69 in opal_progress () from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > > > #4  0x00007f9ffe69be95 in mca_pml_ob1_recv () from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/openmpi/mca_pml_ob1.so
> > > > #5  0x00007fa004ebb35c in PMPI_Recv () from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > > > #6  0x000000000040ae48 in MPI::Comm::Recv (this=0x612800, buf=0x7fff8f5cbb50, count=1, datatype=..., source=29, tag=100, status=...)
> > > >     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
> > > > #7  0x0000000000409a57 in main (argc=1, argv=0x7fff8f5cbd78)
> > > >     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:30
> > > > (gdb)
> > > >
> > > > back trace on the sender:
> > > >
> > > > (gdb) bt
> > > > #0  0x00007fcce1ba5cf3 in epoll_wait () from /lib/libc.so.6
> > > > #1  0x00007fcce2f1ea4b in epoll_dispatch () from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > > > #2  0x00007fcce2f265fa in opal_event_base_loop () from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > > > #3  0x00007fcce2ef7e69 in opal_progress () from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > > > #4  0x00007fccdc677b1d in mca_pml_ob1_send () from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/openmpi/mca_pml_ob1.so
> > > > #5  0x00007fcce2e9874f in PMPI_Send () from /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/lib/libmpi.so.0
> > > > #6  0x000000000040adda in MPI::Comm::Send (this=0x612800, buf=0x7fff3f18ad20, count=1, datatype=..., dest=0, tag=100)
> > > >     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
> > > > #7  0x0000000000409b72 in main (argc=1, argv=0x7fff3f18af48)
> > > >     at /wg/stor5/wgsim/hsu/projects/cturtle_mpi/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:38
> > > > (gdb)
> > > >
> > > > and attached is my mpi_test file for reference.
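The mpi_test.cpp attachment is not reproduced in this archive. For readers who do not have it handy, a minimal test of the kind described in this thread (every non-zero rank sends one integer to rank 0, which receives from each peer in turn and sums the values, using the MPI C++ bindings that appear in the backtraces) might look roughly like the sketch below. It is illustrative only, not the actual attachment; the value range and output formatting are assumptions.

    // Rough sketch (not the actual mpi_test.cpp attachment): ranks 1..n-1 each
    // send one int to rank 0, which receives from every peer in turn and sums them.
    #include <mpi.h>
    #include <cstdio>
    #include <cstdlib>

    int main(int argc, char **argv)
    {
        MPI::Init(argc, argv);
        const int rank = MPI::COMM_WORLD.Get_rank();
        const int size = MPI::COMM_WORLD.Get_size();
        const int tag  = 100;                      // same tag as in the traces

        if (rank == 0) {
            long sum = 0;
            for (int src = 1; src < size; ++src) {
                int value = 0;
                // Blocks until rank 'src' sends -- the call the receiver hangs in.
                MPI::COMM_WORLD.Recv(&value, 1, MPI::INT, src, tag);
                sum += value;
            }
            std::printf("sum of %d values: %ld\n", size - 1, sum);
        } else {
            int value = std::rand() % 1000;        // the "random number" each node contributes
            // May block until rank 0 posts the matching receive.
            MPI::COMM_WORLD.Send(&value, 1, MPI::INT, 0, tag);
        }

        MPI::Finalize();
        return 0;
    }

Built with Open MPI's C++ wrapper compiler (e.g. mpic++) and launched with the mpirun command lines quoted in the thread, a hang of the kind reported shows up exactly as in the traces: rank 0 blocked in Recv and one peer blocked in Send.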
> > > > thanks,
> > > > John
> > > >
> > > > On Fri, Aug 6, 2010 at 6:24 AM, Ralph Castain <r...@open-mpi.org> wrote:
> > > >
> > > > > You clearly have an issue with version confusion. The file cited in your warning:
> > > > >
> > > > > > [wgsg0:29074] Warning -- mutex was double locked from errmgr_hnp.c:772
> > > > >
> > > > > does not exist in 1.5rc5. It only exists in the developer's trunk at this time. Check to ensure you have the right paths set, blow away the install area (in case you have multiple versions installed on top of each other), etc.
> > > > >
> > > > > On Aug 5, 2010, at 5:16 PM, John Hsu wrote:
> > > > >
> > > > > > Hi All,
> > > > > >
> > > > > > I am new to Open MPI and have encountered an issue using pre-release 1.5rc5 with a simple MPI code (see attached). In this test, nodes 1 to n send out a random number to node 0, and node 0 sums all numbers received.
> > > > > >
> > > > > > This code works fine on 1 machine with any number of nodes, and on 3 machines running 10 nodes per machine, but when we try to run 11 nodes per machine this warning appears:
> > > > > >
> > > > > > [wgsg0:29074] Warning -- mutex was double locked from errmgr_hnp.c:772
> > > > > >
> > > > > > And node 0 (the master summing node) hangs on receiving, plus another random node hangs on sending, indefinitely. Below are the back traces:
> > > > > >
> > > > > > (gdb) bt
> > > > > > #0  0x00007f0c5f109cd3 in epoll_wait () from /lib/libc.so.6
> > > > > > #1  0x00007f0c6052db53 in epoll_dispatch (base=0x2310bf0, arg=0x22f91f0, tv=0x7fff90f623e0) at epoll.c:215
> > > > > > #2  0x00007f0c6053ae58 in opal_event_base_loop (base=0x2310bf0, flags=2) at event.c:838
> > > > > > #3  0x00007f0c6053ac27 in opal_event_loop (flags=2) at event.c:766
> > > > > > #4  0x00007f0c604ebb5a in opal_progress () at runtime/opal_progress.c:189
> > > > > > #5  0x00007f0c59b79cb1 in opal_condition_wait (c=0x7f0c608003a0, m=0x7f0c60800400) at ../../../../opal/threads/condition.h:99
> > > > > > #6  0x00007f0c59b79dff in ompi_request_wait_completion (req=0x2538d80) at ../../../../ompi/request/request.h:377
> > > > > > #7  0x00007f0c59b7a8d7 in mca_pml_ob1_recv (addr=0x7fff90f626a0, count=1, datatype=0x612600, src=45, tag=100, comm=0x7f0c607f2b40, status=0x7fff90f62668) at pml_ob1_irecv.c:104
> > > > > > #8  0x00007f0c60425dbc in PMPI_Recv (buf=0x7fff90f626a0, count=1, type=0x612600, source=45, tag=100, comm=0x7f0c607f2b40, status=0x7fff90f62668) at precv.c:78
> > > > > > #9  0x000000000040ae14 in MPI::Comm::Recv (this=0x612800, buf=0x7fff90f626a0, count=1, datatype=..., source=45, tag=100, status=...)
> > > > > >     at /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:36
> > > > > > #10 0x0000000000409a27 in main (argc=1, argv=0x7fff90f628c8)
> > > > > >     at /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:30
> > > > > > (gdb)
> > > > > >
> > > > > > and for the sender:
> > > > > >
> > > > > > (gdb) bt
> > > > > > #0  0x00007fedb919fcd3 in epoll_wait () from /lib/libc.so.6
> > > > > > #1  0x00007fedba5e0a93 in epoll_dispatch (base=0x2171880, arg=0x216c6e0, tv=0x7ffffa8a4130) at epoll.c:215
> > > > > > #2  0x00007fedba5edde0 in opal_event_base_loop (base=0x2171880, flags=2) at event.c:838
> > > > > > #3  0x00007fedba5edbaf in opal_event_loop (flags=2) at event.c:766
> > > > > > #4  0x00007fedba59c43a in opal_progress () at runtime/opal_progress.c:189
> > > > > > #5  0x00007fedb2796f97 in opal_condition_wait (c=0x7fedba8ba6e0, m=0x7fedba8ba740) at ../../../../opal/threads/condition.h:99
> > > > > > #6  0x00007fedb279742e in ompi_request_wait_completion (req=0x2392d80) at ../../../../ompi/request/request.h:377
> > > > > > #7  0x00007fedb2798e0c in mca_pml_ob1_send (buf=0x23b6210, count=100, datatype=0x612600, dst=0, tag=100, sendmode=MCA_PML_BASE_SEND_STANDARD, comm=0x7fedba8ace80) at pml_ob1_isend.c:125
> > > > > > #8  0x00007fedba4c9a08 in PMPI_Send (buf=0x23b6210, count=100, type=0x612600, dest=0, tag=100, comm=0x7fedba8ace80) at psend.c:75
> > > > > > #9  0x000000000040ae52 in MPI::Comm::Send (this=0x612800, buf=0x23b6210, count=100, datatype=..., dest=0, tag=100)
> > > > > >     at /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/openmpi_devel/include/openmpi/ompi/mpi/cxx/comm_inln.h:29
> > > > > > #10 0x0000000000409bec in main (argc=1, argv=0x7ffffa8a4658)
> > > > > >     at /wg/stor5/wgsim/hsu/projects/cturtle/wg-ros-pkg-unreleased/stacks/mpi/mpi_test/src/mpi_test.cpp:42
> > > > > > (gdb)
> > > > > >
> > > > > > The "deadlock" appears to be a machine-dependent race condition; different machines fail with different combinations of nodes per machine.
> > > > > >
> > > > > > Below is my command line for reference:
> > > > > >
> > > > > > $ ../openmpi_devel/bin/mpirun -x PATH -hostfile hostfiles/hostfile.wgsgX -npernode 11 -mca btl tcp,sm,self -mca orte_base_help_aggregate 0 -mca opal_debug_locks 1 ./bin/mpi_test
> > > > > >
> > > > > > The problem does not exist in release 1.4.2 or earlier. We are testing unreleased code for potential knem benefits, but can fall back to 1.4.2 if necessary.
> > > > > > My apologies in advance if I've missed something basic, thanks for any help :)
> > > > > >
> > > > > > regards,
> > > > > > John
> > > > > >
> > > > > > <test.cpp>
> > > >
> > > > <mpi_test.cpp>
> > >
> > > --
> > > Jeff Squyres
> > > jsquy...@cisco.com
> > > For corporate legal information go to:
> > > http://www.cisco.com/web/about/doing_business/legal/cri/
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/users
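Jeff's question above about the knem and Linux versions is most easily answered from a shell on the compute nodes (for example, uname for the kernel and modinfo for the knem module). The Open MPI build and kernel release can also be printed from inside an MPI program; the snippet below is only an illustrative sketch, not part of the thread, and relies on the OMPI_*_VERSION macros from Open MPI's mpi.h together with uname(2). The knem module version itself is not visible through the MPI API.

    // Illustrative sketch (not from the thread): print the kernel release and the
    // Open MPI version this program was built against.
    #include <mpi.h>
    #include <sys/utsname.h>
    #include <cstdio>

    int main(int argc, char **argv)
    {
        MPI::Init(argc, argv);

        if (MPI::COMM_WORLD.Get_rank() == 0) {
            struct utsname u;
            if (uname(&u) == 0)
                std::printf("kernel: %s %s\n", u.sysname, u.release);

            // OMPI_*_VERSION come from Open MPI's <mpi.h>; MPI_VERSION/MPI_SUBVERSION
            // are the standard level the library implements.
            std::printf("Open MPI %d.%d.%d (MPI standard %d.%d)\n",
                        OMPI_MAJOR_VERSION, OMPI_MINOR_VERSION, OMPI_RELEASE_VERSION,
                        MPI_VERSION, MPI_SUBVERSION);
        }

        MPI::Finalize();
        return 0;
    }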