Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-16 Thread Bryan Lally
Ashley Pittman wrote: Do you have a stack trace of your hung application to hand? In particular, when you say "All processes have made the same call to MPI_Allreduce. The processes are all in opal_progress, called (with intervening calls) by MPI_Allreduce.", do the intervening calls include
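(Not part of the original thread: a minimal sketch of one way to capture the per-rank stack traces Ashley asks about without a parallel debugger. It assumes glibc's <execinfo.h>; the helper name install_stack_dumper is hypothetical. Compile with -g and link with -rdynamic so frames such as opal_progress resolve to names, then send SIGUSR1 to a stuck rank.)

    #include <execinfo.h>
    #include <signal.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Best-effort dump of the current call stack to stderr, triggered by
     * "kill -USR1 <pid>" on a rank that appears stuck in MPI_Allreduce. */
    static void dump_stack(int sig)
    {
        void *frames[64];
        int n = backtrace(frames, 64);
        (void)sig;
        fprintf(stderr, "pid %d: %d stack frames\n", (int)getpid(), n);
        backtrace_symbols_fd(frames, n, 2);   /* symbol names to stderr */
    }

    /* Hypothetical helper: call once after MPI_Init() in the application under test. */
    void install_stack_dumper(void)
    {
        signal(SIGUSR1, dump_stack);
    }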

Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-16 Thread Bryan Lally
Ashley Pittman wrote: Whilst the fact that it appears to only happen on your machine implies it's not a general problem with OpenMPI, the fact that it happens in the same location/rep count every time does swing the blame back the other way. This sounds a _lot_ like the problem I was seeing,

Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-15 Thread Ashley Pittman
On Wed, 2009-06-10 at 09:07 -0600, Ralph Castain wrote: > Hi Ashley. Thanks! I would definitely be interested and will look at the tool. Meantime, I have filed a bunch of data on this in ticket #1944, so perhaps you might take a glance at that and offer some thoughts?

Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-12 Thread Sylvain Jeaugey
Hi Ralph, I managed to get a deadlock after a whole night, but not the same one you have: after a quick analysis, process 0 seems to be blocked in the very first send through shared memory. Still maybe a bug, but not the same as yours IMO. I also figured out that libnuma support was not in my
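(Not part of the original thread: a minimal sketch, assuming two ranks share a node and the sm BTL is selected, to check whether the very first shared-memory send that Sylvain describes gets stuck.)

    #include <mpi.h>
    #include <stdio.h>

    /* Illustrative first-message test: rank 0 sends a single int to rank 1.
     * If the hang is in the very first send through shared memory, this
     * blocks before "first send completed" is printed. */
    int main(int argc, char **argv)
    {
        int rank, value = 42;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            printf("first send completed\n");
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        MPI_Finalize();
        return 0;
    }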

Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Ralph Castain
Well, it would - except then -all- the procs would run real slow! :-) Still, might be a reasonable diagnostic step to try... will give it a shot. On Wed, Jun 10, 2009 at 1:12 PM, Bogdan Costescu <bogdan.coste...@iwr.uni-heidelberg.de> wrote:
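(Not from the thread itself: a Linux-only sketch of the diagnostic Ralph agrees to try, forcing every rank onto the same core so they time-share it. The helper name pin_to_core0 is hypothetical; _GNU_SOURCE is needed for CPU_SET.)

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    /* Pin the calling process to core 0.  Called by every rank right after
     * MPI_Init(), this makes all local procs share one core, as discussed above. */
    int pin_to_core0(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return -1;
        }
        return 0;
    }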

Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Bogdan Costescu
On Wed, 10 Jun 2009, Ralph Castain wrote: I appreciate the input and have captured it in the ticket. Since this appears to be a NUMA-related issue, the lack of NUMA support in your setup makes the test difficult to interpret. Based on this reasoning, disabling libnuma support in your OpenMPI

Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Ralph Castain
Much appreciated! Per some of my other comments on this thread and on the referenced ticket, can you tell me what kernel you have on that machine? I assume you have NUMA support enabled, given that chipset? Thanks! Ralph On Wed, Jun 10, 2009 at 10:29 AM, Sylvain Jeaugey
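(Not part of the original message: a small sketch of how the kernel-level NUMA support Ralph asks about could be checked from code, assuming libnuma is installed; link with -lnuma.)

    #include <numa.h>
    #include <stdio.h>

    int main(void)
    {
        /* numa_available() returns -1 when the running kernel offers no NUMA
         * support (or it is disabled), the situation discussed above. */
        if (numa_available() < 0) {
            printf("NUMA support: not available in this kernel\n");
        } else {
            printf("NUMA support: available, %d node(s)\n", numa_max_node() + 1);
        }
        return 0;
    }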

Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Ralph Castain
I appreciate the input and have captured it in the ticket. Since this appears to be a NUMA-related issue, the lack of NUMA support in your setup makes the test difficult to interpret. I agree, though, that this is likely something peculiar to our particular setup. Of primary concern is that it

Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Bogdan Costescu
On Wed, 10 Jun 2009, Ralph Castain wrote: Meantime, I have filed a bunch of data on this in ticket #1944, so perhaps you might take a glance at that and offer some thoughts? https://svn.open-mpi.org/trac/ompi/ticket/1944 I wasn't able to reproduce this. I have run with the following setup: -

Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Sylvain Jeaugey
Hmm, very glad that padb works with Open MPI, I couldn't live without it. In my opinion it is the best debug tool for parallel applications and, more importantly, the only one that scales. About the issue: I couldn't reproduce it on my platform (tried 2 nodes with 2 to 8 processes each, nodes are

Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Ralph Castain
Hi Ashley, Thanks! I would definitely be interested and will look at the tool. Meantime, I have filed a bunch of data on this in ticket #1944, so perhaps you might take a glance at that and offer some thoughts? https://svn.open-mpi.org/trac/ompi/ticket/1944 Will be back after I look at the tool.

Re: [OMPI devel] Hang in collectives involving shared memory

2009-06-10 Thread Ashley Pittman
Ralph, If I may say, this is exactly the type of problem the tool I have been working on recently aims to help with, and I'd be happy to help you through it. Firstly, I'd say of the three collectives you mention (MPI_Allgather, MPI_Reduce and MPI_Bcast), one exhibits a many-to-many, one a many-to-one
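(Illustrative only, not code from the thread: a small test that runs the three collectives Ashley classifies in separate loops, so a hang points at one communication pattern. Iteration counts and message sizes are arbitrary assumptions.)

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int rank, size, i, val, sum, *gathered;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        gathered = malloc(size * sizeof(int));

        for (i = 0; i < 10000; i++) {            /* many-to-many */
            val = rank + i;
            MPI_Allgather(&val, 1, MPI_INT, gathered, 1, MPI_INT, MPI_COMM_WORLD);
        }
        if (rank == 0) printf("MPI_Allgather survived\n");

        for (i = 0; i < 10000; i++)              /* many-to-one */
            MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("MPI_Reduce survived\n");

        for (i = 0; i < 10000; i++)              /* one-to-many */
            MPI_Bcast(&val, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("MPI_Bcast survived\n");

        free(gathered);
        MPI_Finalize();
        return 0;
    }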

[OMPI devel] Hang in collectives involving shared memory

2009-06-09 Thread Ralph Castain
Hi folks, As mentioned in today's telecon, we at LANL are continuing to see hangs when running even small jobs that involve shared memory in collective operations. This has been the topic of discussion before, but I bring it up again because (a) the problem is beginning to become epidemic across