Re: [OMPI devel] Infiniband always disabled for required thread level MPI_THREAD_MULTIPLE?

2009-09-16 Thread Kiril Dichev
On Tue, 2009-09-15 at 13:28 -0400, Jeff Squyres wrote:
> On Sep 15, 2009, at 1:22 PM, Kiril Dichev wrote:
> 
> > Then I noticed some code and comments in
> > ompi/mca/btl/openib/btl_openib_component.c which seem to disable this
> > component when MPI_THREAD_MULTIPLE is used for the initialization
> > (as is the case with IMB). Is that intentional?
> >
> 
> 
> Yes.  The openib BTL is not [yet] thread safe.  Sorry.  :-(

Ah OK, thanks for the fast reply.

Regards,
Kiril
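
For context, a minimal sketch of how an application requests MPI_THREAD_MULTIPLE and checks
the level the library actually grants (illustration only, not taken from the IMB sources; the
openib component then disables itself when MPI_THREAD_MULTIPLE is in use, as Jeff describes above):

#include <mpi.h>
#include <iostream>

int main (int argc, char *argv[])
{
  // Ask for full thread support; MPI reports what it can actually provide.
  int provided = MPI_THREAD_SINGLE;
  MPI_Init_thread (&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

  if (provided < MPI_THREAD_MULTIPLE)
    std::cout << "MPI_THREAD_MULTIPLE not granted, got level "
              << provided << std::endl;

  MPI_Finalize ();
  return 0;
}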




Re: [OMPI devel] application hangs with multiple dup

2009-09-16 Thread Edgar Gabriel
there is already a ticket on that topic (#2009), and I just added some
comments to it...


Jeff Squyres wrote:

On Sep 10, 2009, at 7:12 PM, Edgar Gabriel wrote:


so I can confirm that I can reproduce the hang, and we (George, Rainer
and I) have looked into it and are continuing to dig.

I hate to say it, but it looked to us as if messages were 'lost'
(the sender clearly called send, but the data is not in any of the queues
on the receiver side), which seems to be consistent with two other bug
reports currently being discussed on the mailing list. I could reproduce
the hang with both sm and tcp, so it's probably not a BTL issue but
something higher up.



If this is, indeed, happening, someone please file a bug in Trac.

Thanks.



--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science  University of Houston
Philip G. Hoffman Hall, Room 524    Houston, TX-77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335


Re: [OMPI devel] Deadlock when creating too many communicators

2009-09-16 Thread Wolfgang Bangerth

All,
there didn't appear to be any discussion of the problem below, but I believe that
it is just yet another manifestation of what you have already found as
ticket #2009.

Best
 Wolfgang

> Howdy,
> here's a creative way to deadlock a program: create and destroy some
> 65500 communicators and send a message on each of them:
> 
> #include <mpi.h>
> #include <iostream>
>
> #define CHECK(a) \
>   { \
>     int err = (a); \
>     if (err != 0) std::cout << "Error in line " << __LINE__ << std::endl; \
>   }
>
> int main (int argc, char *argv[])
> {
>   int a=0, b;
>
>   MPI_Init (&argc, &argv);
>
>   for (int i=0; i<100000; ++i)
>     {
>       if (i % 100 == 0) std::cout << "Duplication event " << i << std::endl;
>
>       MPI_Comm dup;
>       CHECK(MPI_Comm_dup (MPI_COMM_WORLD, &dup));
>       CHECK(MPI_Allreduce(&a, &b, 1, MPI_INT, MPI_MIN, dup));
>       CHECK(MPI_Comm_free (&dup));
>     }
>
>   MPI_Finalize();
> }
> ---
> If you run this, for example, on two processors with Open MPI 1.2.6 or
> 1.3.2, you'll see that the program runs until after it produces 65500 as
> output, and then just hangs -- on my system it spins at full steam
> somewhere in the operating system's poll().
>
> Since I take care of destroying the communicators again, I would have
> expected this to work. I create many communicators basically as a
> debugging tool: every object gets its own communicator to work on, to
> ensure that different objects don't accidentally communicate with each
> other just because they all use MPI_COMM_WORLD. It would be nice if this
> mode of using MPI could be made to work.
>
> Best & thanks in advance!
>  Wolfgang



-- 
-
Wolfgang Bangerth              email: bange...@math.tamu.edu
                               www:   http://www.math.tamu.edu/~bangerth/
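
The per-object-communicator pattern described above is typically wrapped up roughly like the
sketch below (illustration only; the class name is made up and this is not code from the
report), which is how a long run ends up cycling through tens of thousands of dup/free pairs:

#include <mpi.h>

// Each object owns a private communicator so that its traffic cannot
// collide with traffic on MPI_COMM_WORLD or on other objects' communicators.
class ScopedComm
{
public:
  explicit ScopedComm (MPI_Comm parent)
  {
    MPI_Comm_dup (parent, &comm_);     // new communication context for this object
  }

  ~ScopedComm ()
  {
    MPI_Comm_free (&comm_);            // hand the context back on destruction
  }

  MPI_Comm get () const { return comm_; }

private:
  MPI_Comm comm_;
};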



[OMPI devel] Dynamic languages, dlopen() issues, and symbol visibility of libtool ltdl API in current trunk

2009-09-16 Thread Lisandro Dalcin
Hi all. I have to contact you again about the issues related to
dlopen()ing libmpi with RTLD_LOCAL, as many dynamic languages (Python
in my case) do.

So far, I've been able to manage the issues (despite the "do nothing"
policy from Open MPI devs, which I understand) in a more or less
portable manner by taking advantage of the availability of libtool
ltdl symbols in the Open MPI libraries (specifically, in libopen-pal).
For reference, all this hackery is here:
http://code.google.com/p/mpi4py/source/browse/trunk/src/compat/openmpi.h

However, I noticed that in current trunk (v1.4, IIUC) things have
changed and libtool symbols are not externally available. Again, I
understand the reason and acknowledge that such change is a really
good thing. However, this change has broken all my hackery for
dlopen()ing libmpi before the call to MPI_Init().
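
The gist of that hackery, stripped down (the library name and flags below are assumptions; the
real openmpi.h linked above is considerably more elaborate), is to re-open libmpi with global
symbol visibility before MPI_Init() is ever called:

#include <dlfcn.h>

/* Sketch only: re-dlopen() the MPI library with RTLD_GLOBAL so that the
   components Open MPI itself dlopen()s later can resolve its symbols, even
   though the interpreter originally loaded the extension with RTLD_LOCAL.
   The handle is intentionally kept open for the lifetime of the process. */
static void preload_libmpi (void)
{
  void *handle = dlopen ("libmpi.so.0", RTLD_NOW | RTLD_GLOBAL);
  (void) handle;   /* if this fails, simply fall back to the old behavior */
}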

Is there any chance that libopen-pal could provide some properly
prefixed (let's say, using "opal_" as a prefix) wrapper calls to a small
subset of the libtool ltdl API? The following set of wrapper calls
would be the minimum required to properly load libmpi in a portable
manner and clean up resources (let me reuse my previous suggestion
and add the opal_ prefix):

opal_lt_dlinit()
opal_lt_dlexit()

opal_lt_dladvise_init(a)
opal_lt_dladvise_destroy(a)
opal_lt_dladvise_global(a)
opal_lt_dladvise_ext(a)

opal_lt_dlopenadvise(n,a)
opal_lt_dlclose(h)

Any chance this request could be considered? I would really like to
have this before any Open MPI tarball gets released without libtool
symbols exposed...
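
To make the intended usage concrete, here is a sketch of how the proposed wrappers would be
called (hypothetical: the opal_lt_* symbols and the handle/advise type names do not exist yet;
the sequence simply mirrors the plain libltdl advise interface):

/* Hypothetical usage sketch of the requested opal_lt_* wrappers. */
opal_lt_dladvise advise;
opal_lt_dlhandle handle = 0;

if (opal_lt_dlinit () == 0 && opal_lt_dladvise_init (&advise) == 0)
  {
    opal_lt_dladvise_global (&advise);   /* RTLD_GLOBAL-like symbol visibility   */
    opal_lt_dladvise_ext (&advise);      /* try the platform shared-lib extension */
    handle = opal_lt_dlopenadvise ("libmpi", advise);
    opal_lt_dladvise_destroy (&advise);
  }

/* ... run the MPI program ... */

if (handle)
  opal_lt_dlclose (handle);
opal_lt_dlexit ();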


-- 
Lisandro Dalcín
---
Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
PTLC - Güemes 3450, (3000) Santa Fe, Argentina
Tel/Fax: +54-(0)342-451.1594



Re: [OMPI devel] application hangs with multiple dup

2009-09-16 Thread Edgar Gabriel
just wanted to give a heads-up that I *think* I know what the problem 
is. I should have a fix (with a description) either later today or 
tomorrow morning...


Thanks
Edgar

Edgar Gabriel wrote:
so I can confirm that I can reproduce the hang, and we (George, Rainer
and I) have looked into it and are continuing to dig.


I hate to say it, but it looked to us as if messages were 'lost'
(the sender clearly called send, but the data is not in any of the queues
on the receiver side), which seems to be consistent with two other bug
reports currently being discussed on the mailing list. I could reproduce
the hang with both sm and tcp, so it's probably not a BTL issue but
something higher up.


Thanks
Edgar

Thomas Ropars wrote:

Edgar Gabriel wrote:
Two short questions: do you have any Open MPI MCA parameters set in a
file or at runtime?

No
And second, is there any difference if you disable the hierarch coll
module (which does additional communication as well), e.g.


mpirun --mca coll ^hierarch -np 4 ./mytest

No, there is no difference.

I don't know if it helps, but I first had the problem when launching
bt.A.4 and sp.A.4 of the NAS Parallel Benchmarks (version 3.3).


Thomas


Thanks
Edgar

Thomas Ropars wrote:

Ashley Pittman wrote:

On Wed, 2009-09-09 at 17:44 +0200, Thomas Ropars wrote:

Thank you.  I think you missed the top three lines of the output but
that doesn't matter.

 

main() at ?:?
  PMPI_Comm_dup() at pcomm_dup.c:62
    ompi_comm_dup() at communicator/comm.c:661
      -----------------
      [0,2] (2 processes)
      -----------------
      ompi_comm_nextcid() at communicator/comm_cid.c:264
        ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
          ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
            ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
              ompi_request_default_wait_all() at request/req_wait.c:262
                opal_condition_wait() at ../opal/threads/condition.h:99
      -----------------
      [1,3] (2 processes)
      -----------------
      ompi_comm_nextcid() at communicator/comm_cid.c:245
        ompi_comm_allreduce_intra() at communicator/comm_cid.c:619
          ompi_coll_tuned_allreduce_intra_dec_fixed() at coll_tuned_decision_fixed.c:61
            ompi_coll_tuned_allreduce_intra_recursivedoubling() at coll_tuned_allreduce.c:223
              ompi_request_default_wait_all() at request/req_wait.c:262
                opal_condition_wait() at ../opal/threads/condition.h:99



Lines 264 and 245 of comm_cid.c are both inside a loop which calls
allreduce() twice per iteration until a certain condition is met.  As such
it's hard to tell from this trace whether processes [0,2] are "ahead"
or [1,3] are "behind".  Either way you look at it, however, the
all_reduce() should not deadlock like that, so from the trace it's as
likely to be a bug in reduce as it is in ompi_comm_nextcid().

I assume all four processes are actually in the same call to comm_dup;
re-compiling your program with -g and re-running padb would confirm this,
as it would show the line numbers.
  

Yes they are all in the second call to comm_dup.

Thomas

Ashley,

  






--
Edgar Gabriel
Assistant Professor
Parallel Software Technologies Lab  http://pstl.cs.uh.edu
Department of Computer Science  University of Houston
Philip G. Hoffman Hall, Room 524    Houston, TX-77204, USA
Tel: +1 (713) 743-3857  Fax: +1 (713) 743-3335


[OMPI devel] RFC: IPv6 support

2009-09-16 Thread Ralph Castain
WHAT: change the IPv6 configuration option to enable IPv6 if and only  
if specifically requested


WHY: IPv6 support is only marginally maintained, and is currently  
broken yet again. The current default setting is causing user systems  
to break if (a) their kernel has support for IPv6, but (b) the system  
administrator has not actually configured the interfaces to use IPv6.


TIMEOUT: end of Sept

SCOPE: OMPI trunk + 1.3.4

DETAIL:
There appears to have been an unfortunate change in the way OMPI  
supports IPv6. Early on, we had collectively agreed to disable IPv6  
support unless specifically instructed to build it. This was decided  
because IPv6 support was shaky, at best, and used by only a small  
portion of the community. Given the lack of committed resources to  
maintain it, we felt at that time that enabling it by default would  
cause an inordinate amount of trouble.


Unfortunately, at some point someone changed this default behavior. We
now enable IPv6 support by default if the system has the required
header files. This test is inadequate, as it in no way determines that
the support is active. The current result of this test is not only to
cause all the IPv6-related code to compile, but to actually require
that every TCP interface provide an IPv6 socket.
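
For illustration only (this is not the configure test in question): telling "IPv6 headers
present" apart from "an interface is actually configured for IPv6" requires a runtime check,
for example something along these lines:

#include <ifaddrs.h>
#include <sys/socket.h>
#include <stddef.h>

/* Sketch: walk the interface list and report whether any interface
   actually carries an IPv6 address; header availability at compile
   time says nothing about this. */
static int have_ipv6_interface (void)
{
    struct ifaddrs *ifaddr, *ifa;
    int found = 0;

    if (getifaddrs (&ifaddr) != 0)
        return 0;
    for (ifa = ifaddr; ifa != NULL; ifa = ifa->ifa_next) {
        if (ifa->ifa_addr != NULL && ifa->ifa_addr->sa_family == AF_INET6) {
            found = 1;
            break;
        }
    }
    freeifaddrs (ifaddr);
    return found;
}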


This latter requirement causes OMPI to abort on any system where the
header files exist, but the system admin has not configured every TCP
interface to have an IPv6 address... a situation which is proving
fairly common.


The proposed change will heal the current breakage, and can be
reversed at some future time if adequate IPv6 maintenance commitment
exists. In the meantime, it will allow me to quit the continual litany
of telling users to manually --disable-ipv6, and allow OMPI to run
out-of-the-box again.


Ralph