Re: [OMPI devel] Infiniband always disabled for required threadlevel MPI_THREAD_MULTIPLE ?
On Tue, 2009-09-15 at 13:28 -0400, Jeff Squyres wrote:
> On Sep 15, 2009, at 1:22 PM, Kiril Dichev wrote:
> > Then I noticed some code and comments in
> > ompi/mca/btl/openib/btl_openib_component.c which seem to disable this
> > component when MPI_THREAD_MULTIPLE is used for the initialization (as is
> > the case with IMB). Is that intentional?
>
> Yes. The openib BTL is not [yet] thread safe. Sorry. :-(

Ah OK, thanks for the fast reply.

Regards,
Kiril
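[Editor's note: the effect described above is visible at initialization time, since an application can ask for MPI_THREAD_MULTIPLE and then inspect what the library actually granted. A minimal sketch (not taken from this thread) using the standard MPI_Init_thread() call:

#include <iostream>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int provided;
    // Request full thread support; the library may grant a lower level,
    // and (per this thread) requesting MPI_THREAD_MULTIPLE also causes
    // the openib BTL to be excluded from the available transports.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE)
        std::cout << "MPI_THREAD_MULTIPLE not available, got level "
                  << provided << std::endl;
    MPI_Finalize();
    return 0;
}
]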
Re: [OMPI devel] application hangs with multiple dup
there is a ticket on that topic already (#2009), and I just added some comments to that...

Jeff Squyres wrote:
> On Sep 10, 2009, at 7:12 PM, Edgar Gabriel wrote:
> > so I can confirm that I can reproduce the hang, and we (George, Rainer
> > and me) have looked into that and are continuing to dig. I hate to say
> > it, but it looks to us as if messages were 'lost' (the sender clearly
> > called send, but the data is not in any of the queues on the receiver
> > side), which seems to be consistent with two other bug reports currently
> > being discussed on the mailing list. I could reproduce the hang with
> > both sm and tcp, so it's probably not a BTL issue but somewhere higher.
>
> If this is, indeed, happening, someone please file a bug in trac. Thanks.

--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science, University of Houston
Philip G. Hoffman Hall, Room 524, Houston, TX 77204, USA
Tel: +1 (713) 743-3857    Fax: +1 (713) 743-3335
Re: [OMPI devel] Deadlock when creating too many communicators
All,

there didn't appear to be any discussion of the problem below, but I believe that this is just yet another manifestation of what you have already found as ticket #2009.

Best
 Wolfgang

> Howdy,
> here's a creative way to deadlock a program: create and destroy 65500 and
> some communicators and send a message on each of them:
>
> #include <iostream>
> #include <mpi.h>
>
> #define CHECK(a) \
>   { \
>     int err = (a); \
>     if (err != 0) std::cout << "Error in line " << __LINE__ << std::endl; \
>   }
>
> int main (int argc, char *argv[])
> {
>   int a=0, b;
>
>   MPI_Init (&argc, &argv);
>
>   for (int i=0; i<100000; ++i)
>     {
>       if (i % 100 == 0) std::cout << "Duplication event " << i << std::endl;
>
>       MPI_Comm dup;
>       CHECK(MPI_Comm_dup (MPI_COMM_WORLD, &dup));
>       CHECK(MPI_Allreduce(&a, &b, 1, MPI_INT, MPI_MIN, dup));
>       CHECK(MPI_Comm_free (&dup));
>     }
>
>   MPI_Finalize();
> }
> ---
> If you run this, for example, on two processors with OpenMPI 1.2.6 or
> 1.3.2, you'll see that the program runs until after it produces 65500 as
> output, and then just hangs -- on my system somewhere in the operating
> system poll(), running full steam.
>
> Since I take care of destroying the communicators again, I would have
> expected this to work. I use creating many communicators basically as a
> debugging tool: every object gets its own communicator to work on, to
> ensure that different objects don't communicate by accident with each
> other just because they all use MPI_COMM_WORLD. It would be nice if this
> mode of using MPI could be made to work.
>
> Best & thanks in advance!
>  Wolfgang

--
Wolfgang Bangerth          email: bange...@math.tamu.edu
                           www:   http://www.math.tamu.edu/~bangerth/
[OMPI devel] Dynamic languages, dlopen() issues, and symbol visibility of libtool ltdl API in current trunk
Hi all.. I have to contact you again about the issues related to dlopen()ing libmpi with RTLD_LOCAL, as many dynamic languages (Python in my case) do.

So far, I've been able to manage the issues (despite the "do nothing" policy from the Open MPI devs, which I understand) in a more or less portable manner by taking advantage of the availability of libtool ltdl symbols in the Open MPI libraries (specifically, in libopen-pal). For reference, all this hackery is here:

http://code.google.com/p/mpi4py/source/browse/trunk/src/compat/openmpi.h

However, I noticed that in the current trunk (v1.4, IIUC) things have changed and the libtool symbols are no longer externally available. Again, I understand the reason and acknowledge that this change is a really good thing. However, it has broken all my hackery for dlopen()ing libmpi before the call to MPI_Init().

Is there any chance that libopen-pal could provide some properly prefixed (let's say, using "opal_" as a prefix) wrapper calls to a small subset of the libtool ltdl API? The following set of wrapper calls is the minimum required to properly load libmpi in a portable manner and clean up resources (let me follow my previous suggestion and add the opal_ prefix):

opal_lt_dlinit()
opal_lt_dlexit()
opal_lt_dladvise_init(a)
opal_lt_dladvise_destroy(a)
opal_lt_dladvise_global(a)
opal_lt_dladvise_ext(a)
opal_lt_dlopenadvise(n,a)
opal_lt_dlclose(h)

Any chance this request could be considered? I would really like to have this before any Open MPI tarball gets released without the libtool symbols exposed...

--
Lisandro Dalcín
---
Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
PTLC - Güemes 3450, (3000) Santa Fe, Argentina
Tel/Fax: +54-(0)342-451.1594
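[Editor's note: the requested opal_lt_* calls would simply mirror the standard libtool ltdl API. As a rough illustration of the preload pattern they would have to support (this is not the actual mpi4py openmpi.h code; the "libmpi" name and the absent error reporting are simplifications):

#include <ltdl.h>   // libtool's portable dlopening API

// Sketch only: open libmpi with global symbol visibility before MPI_Init(),
// so that dynamically loaded Open MPI components can resolve its symbols
// even when the interpreter itself dlopen()ed the extension with RTLD_LOCAL.
static lt_dlhandle preload_libmpi_globally(void)
{
    lt_dladvise advise;
    lt_dlhandle handle = 0;

    if (lt_dlinit() != 0)             // initialize ltdl
        return 0;
    if (lt_dladvise_init(&advise) == 0) {
        lt_dladvise_global(&advise);  // ask for RTLD_GLOBAL-like semantics
        lt_dladvise_ext(&advise);     // try platform library extensions (.so, .dylib, ...)
        handle = lt_dlopenadvise("libmpi", advise);
        lt_dladvise_destroy(&advise);
    }
    return handle;                    // later: lt_dlclose(handle); lt_dlexit();
}

The proposal in the message above is just this sequence with each lt_* call replaced by an opal_lt_* wrapper exported from libopen-pal.]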
Re: [OMPI devel] application hangs with multiple dup
just wanted to give a heads-up that I *think* I know what the problem is. I should have a fix (with a description) either later today or tomorrow morning...

Thanks
Edgar

Edgar Gabriel wrote:

so I can confirm that I can reproduce the hang, and we (George, Rainer and me) have looked into that and are continuing to dig. I hate to say it, but it looks to us as if messages were 'lost' (the sender clearly called send, but the data is not in any of the queues on the receiver side), which seems to be consistent with two other bug reports currently being discussed on the mailing list. I could reproduce the hang with both sm and tcp, so it's probably not a BTL issue but somewhere higher.

Thanks
Edgar

Thomas Ropars wrote:

Edgar Gabriel wrote:
Two short questions: do you have any Open MPI mca parameters set in a file or at runtime?

No

And second, is there any difference if you disable the hierarch coll module (which does communicate additionally as well)? e.g. mpirun --mca coll ^hierarch -np 4 ./mytest

No, there is no difference.

I don't know if it can help, but: I first had the problem when launching bt.A.4 and sp.A.4 of the NAS Parallel Benchmarks (version 3.3).

Thomas

Thanks
Edgar

Thomas Ropars wrote:

Ashley Pittman wrote:
On Wed, 2009-09-09 at 17:44 +0200, Thomas Ropars wrote:
Thank you.

I think you missed the top three lines of the output, but that doesn't matter.

main() at ?:?
PMPI_Comm_dup() at pcomm_dup.c:62
ompi_comm_dup() at communicator/comm.c:661
- [0,2] (2 processes) -
ompi_comm_nextcid() at communicator/comm_cid.c:264
ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
ompi_request_default_wait_all() at request/req_wait.c:262
opal_condition_wait() at ../opal/threads/condition.h:99
- [1,3] (2 processes) -
ompi_comm_nextcid() at communicator/comm_cid.c:245
ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
ompi_request_default_wait_all() at request/req_wait.c:262
opal_condition_wait() at ../opal/threads/condition.h:99

Lines 264 and 245 of comm_cid.c are both in a for loop which calls allreduce() twice per iteration until a certain condition is met. As such, it's hard to tell from this trace whether processes [0,2] are "ahead" or [1,3] are "behind". Either way you look at it, however, the allreduce() should not deadlock like that, so from the trace it's as likely to be a bug in the allreduce as in ompi_comm_nextcid(). I assume all four processes are actually in the same call to comm_dup; re-compiling your program with -g and re-running padb would confirm this, as it would show the line numbers.

Yes, they are all in the second call to comm_dup.

Thomas

Ashley,

--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science, University of Houston
Philip G. Hoffman Hall, Room 524, Houston, TX 77204, USA
Tel: +1 (713) 743-3857    Fax: +1 (713) 743-3335
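[Editor's note: the loop Ashley describes around comm_cid.c:245/264 is a context-id agreement loop. The following is a heavily simplified sketch of that kind of algorithm, NOT the actual ompi_comm_nextcid() code; the function, the is_cid_free_locally() helper and the use of MPI_Allreduce are illustrative only, but it shows why two allreduce calls per iteration appear in the stack trace:

#include <mpi.h>

extern bool is_cid_free_locally(int cid);   // hypothetical helper

// Each process proposes the lowest context id free locally, the group agrees
// on the maximum proposal (1st allreduce), then checks that the agreed id is
// free on every process (2nd allreduce); otherwise it retries above that id.
int pick_next_cid(MPI_Comm comm, int start)
{
    int cid = start;
    for (;;) {
        int proposal = cid;
        while (!is_cid_free_locally(proposal))
            ++proposal;

        int agreed;
        MPI_Allreduce(&proposal, &agreed, 1, MPI_INT, MPI_MAX, comm);  // 1st allreduce

        int ok_local = is_cid_free_locally(agreed) ? 1 : 0;
        int ok_all;
        MPI_Allreduce(&ok_local, &ok_all, 1, MPI_INT, MPI_MIN, comm);  // 2nd allreduce

        if (ok_all)           // every process can use 'agreed'
            return agreed;
        cid = agreed + 1;     // otherwise retry above the rejected id
    }
}

If one process is stuck in the first allreduce of an iteration while its peers are in the second (as the [0,2] vs. [1,3] split at lines 264/245 suggests), the whole group deadlocks even though each process is following the protocol.]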
[OMPI devel] RFC: IPv6 support
WHAT: change the IPv6 configuration option to enable IPv6 if and only if specifically requested

WHY: IPv6 support is only marginally maintained, and is currently broken yet again. The current default setting is causing user systems to break if (a) their kernel has support for IPv6, but (b) the system administrator has not actually configured the interfaces to use IPv6.

TIMEOUT: end of Sept

SCOPE: OMPI trunk + 1.3.4

DETAIL: There appears to have been an unfortunate change in the way OMPI supports IPv6. Early on, we had collectively agreed to disable IPv6 support unless specifically instructed to build it. This was decided because IPv6 support was shaky, at best, and used by only a small portion of the community. Given the lack of committed resources to maintain it, we felt at that time that enabling it by default would cause an inordinate amount of trouble.

Unfortunately, at some point someone changed this default behavior. We now enable IPv6 support by default if the system has the required header files. This test is inadequate, as it in no way determines that the support is active. The current result of this test is to not only cause all the IPv6-related code to compile, but to actually require that every TCP interface provide an IPv6 socket. This latter requirement causes OMPI to abort on any system where the header files exist, but the system admin has not configured every TCP interface to have an IPv6 address... a situation which is proving fairly common.

The proposed change will heal the current breakage, and can be reversed at some future time if an adequate IPv6 maintenance commitment exists. In the meantime, it will allow me to quit the continual litany of telling users to manually --disable-ipv6, and allow OMPI to run out-of-the-box again.

Ralph
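[Editor's note: the DETAIL section distinguishes "headers are present" from "IPv6 is actually configured on the interfaces". The following rough sketch (not code that exists in OMPI) illustrates the difference using the standard getifaddrs() call to check whether any interface actually carries an IPv6 address:

#include <iostream>
#include <sys/socket.h>  // AF_INET6, sa_family
#include <ifaddrs.h>     // getifaddrs(), freeifaddrs()

// Sketch only: compiling against the IPv6 headers (which is all the current
// configure test effectively checks) says nothing about whether any interface
// has an IPv6 address; a runtime check has to walk the interface list.
static bool any_ipv6_interface_configured()
{
    struct ifaddrs *ifap = nullptr;
    if (getifaddrs(&ifap) != 0)
        return false;                       // cannot tell; assume no IPv6

    bool found = false;
    for (struct ifaddrs *ifa = ifap; ifa != nullptr; ifa = ifa->ifa_next) {
        if (ifa->ifa_addr != nullptr && ifa->ifa_addr->sa_family == AF_INET6) {
            found = true;
            break;
        }
    }
    freeifaddrs(ifap);
    return found;
}

int main()
{
    std::cout << (any_ipv6_interface_configured()
                      ? "at least one interface has an IPv6 address"
                      : "no IPv6 addresses configured")
              << std::endl;
    return 0;
}

A system where the headers exist but this check returns false is exactly the case described above, where the current default causes OMPI to abort.]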