Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-16 Thread Bryan Lally
Ashley Pittman wrote: Do you have a stack trace of your hung application to hand? In particular, when you say "All processes have made the same call to MPI_Allreduce. The processes are all in opal_progress, called (with intervening calls) by MPI_Allreduce.", do the intervening calls include
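(Not part of the original thread: a minimal sketch of one way to capture the per-rank stack traces Ashley asks about without a parallel debugger. It assumes glibc's <execinfo.h>; the helper name install_stack_dumper is hypothetical. Compile with -g and link with -rdynamic so frames such as opal_progress resolve to names, then send SIGUSR1 to a stuck rank.)

    #include <execinfo.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Best-effort dump of the current call stack to stderr, triggered by
     * "kill -USR1 <pid>" on a rank that appears stuck in MPI_Allreduce. */
    static void dump_stack(int sig)
    {
        void *frames[64];
        int n = backtrace(frames, 64);
        (void)sig;
        fprintf(stderr, "pid %d: %d stack frames\n", (int)getpid(), n);
        backtrace_symbols_fd(frames, n, 2);   /* symbol names to stderr */
    }

    /* Hypothetical helper: call once after MPI_Init() in the application under test. */
    void install_stack_dumper(void)
    {
        signal(SIGUSR1, dump_stack);
    }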

Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-16 Thread Bryan Lally
Ashley Pittman wrote: Whilst the fact that it appears to only happen on your machine implies it's not a general problem with OpenMPI, the fact that it happens in the same location/rep count every time does swing the blame back the other way. This sounds a _lot_ like the problem I was seeing,

Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-15 Thread Ashley Pittman
On Wed, 2009-06-10 at 09:07 -0600, Ralph Castain wrote: > Hi Ashley. Thanks! I would definitely be interested and will look at the tool. Meantime, I have filed a bunch of data on this in ticket #1944, so perhaps you might take a glance at that and offer some thoughts?

Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-12 Thread Sylvain Jeaugey
Hi Ralph, I managed to get a deadlock after a whole night, but not the same one you have: after a quick analysis, process 0 seems to be blocked in the very first send through shared memory. Still maybe a bug, but not the same as yours IMO. I also figured out that libnuma support was not in my
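(Not part of the original thread: a minimal sketch, assuming two ranks share a node and the sm BTL is selected, to check whether the very first shared-memory send that Sylvain describes gets stuck.)

    #include <mpi.h>
    #include <stdio.h>

    /* Illustrative first-message test: rank 0 sends a single int to rank 1.
     * If the hang is in the very first send through shared memory, this
     * blocks before "first send completed" is printed. */
    int main(int argc, char **argv)
    {
        int rank, value = 42;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            printf("first send completed\n");
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        MPI_Finalize();
        return 0;
    }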

Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Ralph Castain
Well, it would - except then -all- the procs would run real slow! :-) Still, might be a reasonable diagnostic step to try... will give it a shot. On Wed, Jun 10, 2009 at 1:12 PM, Bogdan Costescu <bogdan.coste...@iwr.uni-heidelberg.de> wrote:
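(Not from the thread itself: a Linux-only sketch of the diagnostic Ralph agrees to try, forcing every rank onto the same core so they time-share it. The helper name pin_to_core0 is hypothetical; _GNU_SOURCE is needed for CPU_SET.)

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling process to core 0.  Called by every rank right after
     * MPI_Init(), this makes all local procs share one core, as discussed above. */
    int pin_to_core0(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return -1;
        }
        return 0;
    }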

Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Bogdan Costescu
On Wed, 10 Jun 2009, Ralph Castain wrote: I appreciate the input and have captured it in the ticket. Since this appears to be a NUMA-related issue, the lack of NUMA support in your setup makes the test difficult to interpret. Based on this reasoning, disabling libnuma support in your OpenMPI

Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Ralph Castain
Much appreciated! Per some of my other comments on this thread and on the referenced ticket, can you tell me what kernel you have on that machine? I assume you have NUMA support enabled, given that chipset? Thanks! Ralph On Wed, Jun 10, 2009 at 10:29 AM, Sylvain Jeaugey
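(Not part of the original message: a small sketch of how the kernel-level NUMA support Ralph asks about could be checked from code, assuming libnuma is installed; link with -lnuma.)

    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        /* numa_available() returns -1 when the running kernel offers no NUMA
         * support (or it is disabled), the situation discussed above. */
        if (numa_available() < 0) {
            printf("NUMA support: not available in this kernel\n");
        } else {
            printf("NUMA support: available, %d node(s)\n", numa_max_node() + 1);
        }
        return 0;
    }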

Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Ralph Castain
I appreciate the input and have captured it in the ticket. Since this appears to be a NUMA-related issue, the lack of NUMA support in your setup makes the test difficult to interpret. I agree, though, that this is likely something peculiar to our particular setup. Of primary concern is that it

Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Bogdan Costescu
On Wed, 10 Jun 2009, Ralph Castain wrote: Meantime, I have filed a bunch of data on this in ticket #1944, so perhaps you might take a glance at that and offer some thoughts? https://svn.open-mpi.org/trac/ompi/ticket/1944 I wasn't able to reproduce this. I have run with the following setup: -

Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Sylvain Jeaugey
Hmm, very glad that padb works with Open MPI, I couldn't live without it. In my opinion it is the best debug tool for parallel applications and, more importantly, the only one that scales. About the issue: I couldn't reproduce it on my platform (tried 2 nodes with 2 to 8 processes each, nodes are

Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Ralph Castain
Hi Ashley, Thanks! I would definitely be interested and will look at the tool. Meantime, I have filed a bunch of data on this in ticket #1944, so perhaps you might take a glance at that and offer some thoughts? https://svn.open-mpi.org/trac/ompi/ticket/1944 Will be back after I look at the tool.

Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Ashley Pittman
Ralph, If I may say, this is exactly the type of problem the tool I have been working on recently aims to help with, and I'd be happy to help you through it. Firstly, I'd say of the three collectives you mention (MPI_Allgather, MPI_Reduce and MPI_Bcast), one exhibits a many-to-many, one a many-to-one
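(Illustrative only, not code from the thread: a small test that runs the three collectives Ashley classifies in separate loops, so a hang points at one communication pattern. Iteration counts and message sizes are arbitrary assumptions.)

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, size, i, val, sum, *gathered;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        gathered = malloc(size * sizeof(int));

        for (i = 0; i < 10000; i++) {            /* many-to-many */
            val = rank + i;
            MPI_Allgather(&val, 1, MPI_INT, gathered, 1, MPI_INT, MPI_COMM_WORLD);
        }
        if (rank == 0) printf("MPI_Allgather survived\n");

        for (i = 0; i < 10000; i++)              /* many-to-one */
            MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("MPI_Reduce survived\n");

        for (i = 0; i < 10000; i++)              /* one-to-many */
            MPI_Bcast(&val, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("MPI_Bcast survived\n");

        free(gathered);
        MPI_Finalize();
        return 0;
    }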

[OMPI devel] Hang in collectives involving shared memory

2009-06-09 Thread Ralph Castain
Hi folks, As mentioned in today's telecon, we at LANL are continuing to see hangs when running even small jobs that involve shared memory in collective operations. This has been the topic of discussion before, but I bring it up again because (a) the problem is beginning to become epidemic across