Ashley Pittman wrote:
Do you have a stack trace of your hung application to hand? In particular, when you say "All processes have made the same call to MPI_Allreduce. The processes are all in opal_progress, called (with intervening calls) by MPI_Allreduce.", do the intervening calls include
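For anyone wanting to capture that kind of per-rank stack trace without a parallel debugger, a minimal sketch in C is below: install a signal handler that dumps a backtrace, then send the signal to each hung rank. The choice of SIGUSR1 and the handler itself are illustrative assumptions, not how Ashley's tool works. Build with something like "mpicc -g -rdynamic trace.c" (the file name is just an example).

/* Minimal sketch: dump a backtrace from a hung rank on SIGUSR1.
 * Assumption: glibc's execinfo backtrace facilities are available. */
#include <execinfo.h>
#include <mpi.h>
#include <signal.h>
#include <unistd.h>

static void dump_backtrace(int sig)
{
    void *frames[64];
    int n = backtrace(frames, 64);

    /* Write symbolised frames straight to stderr; avoids calling
     * printf() inside a signal handler. */
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    (void)sig;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    signal(SIGUSR1, dump_backtrace);   /* later: kill -USR1 <pid> of each hung rank */

    /* ... application code that eventually hangs in MPI_Allreduce ... */

    MPI_Finalize();
    return 0;
}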
Ashley Pittman wrote:
Whilst the fact that it appears to only happen on your machine implies
it's not a general problem with OpenMPI, the fact that it happens at the
same location/rep count every time does swing the blame back the other
way.
This sounds a _lot_ like the problem I was seeing,
On Wed, 2009-06-10 at 09:07 -0600, Ralph Castain wrote:
> Hi Ashley
>
> Thanks! I would definitely be interested and will look at the tool.
> Meantime, I have filed a bunch of data on this in ticket #1944, so
> perhaps you might take a glance at that and offer some thoughts?
>
>
Hi Ralph,
I managed to get a deadlock after a whole night, but not the same one you
have: after a quick analysis, process 0 seems to be blocked in the very
first send through shared memory. Still maybe a bug, but not the same as
yours IMO.
I also figured out that libnuma support was not in my
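A minimal two-rank ping, with both ranks placed on the same node so the shared-memory path is exercised, would show whether that very first sm send completes at all. This is only a sketch; the launch line (e.g. "mpirun -np 2 -host <node> ./ping") and the file name are assumptions, not the test Sylvain ran.

/* Minimal sketch: does the first shared-memory send complete? */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, token = 42;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* first send over sm */
        printf("rank 0: first send returned\n");
    } else if (rank == 1) {
        MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1: received %d\n", token);
    }

    MPI_Finalize();
    return 0;
}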
Well, it would - except then -all- the procs would run real slow! :-)
Still, might be a reasonable diagnostic step to try...will give it a shot.
On Wed, Jun 10, 2009 at 1:12 PM, Bogdan Costescu <bogdan.coste...@iwr.uni-heidelberg.de> wrote:
> On Wed, 10 Jun 2009, Ralph Castain wrote:
>
> I
On Wed, 10 Jun 2009, Ralph Castain wrote:
I appreciate the input and have captured it in the ticket. Since
this appears to be a NUMA-related issue, the lack of NUMA support in
your setup makes the test difficult to interpret.
Based on this reasoning, disabling libnuma support in your OpenMPI
Much appreciated!
Per some of my other comments on this thread and on the referenced ticket,
can you tell me what kernel you have on that machine? I assume you have NUMA
support enabled, given that chipset?
Thanks!
Ralph
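On the NUMA question above: a quick way to see whether the kernel and libnuma actually expose NUMA support on a node is a small probe against libnuma, sketched below. Build with "gcc numa_check.c -lnuma"; the file name is just an example, and the libnuma development headers must be installed.

/* Minimal sketch: probe NUMA availability via libnuma. */
#include <numa.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        printf("NUMA is not available on this system\n");
        return 1;
    }
    /* numa_max_node() returns the highest node number, so a two-socket
     * NUMA machine would typically print 1 here. */
    printf("NUMA available, highest node: %d\n", numa_max_node());
    return 0;
}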
On Wed, Jun 10, 2009 at 10:29 AM, Sylvain Jeaugey
I appreciate the input and have captured it in the ticket. Since this
appears to be a NUMA-related issue, the lack of NUMA support in your setup
makes the test difficult to interpret.
I agree, though, that this is likely something peculiar to our particular
setup. Of primary concern is that it
On Wed, 10 Jun 2009, Ralph Castain wrote:
Meantime, I have filed a bunch of data on this in ticket #1944, so perhaps
you might take a glance at that and offer some thoughts?
https://svn.open-mpi.org/trac/ompi/ticket/1944
I wasn't able to reproduce this. I have run with the following setup:
-
Hmm, I'm very glad that padb works with Open MPI; I couldn't live without it.
In my opinion, it is the best debugging tool for parallel applications and,
more importantly, the only one that scales.
As for the issue, I couldn't reproduce it on my platform (I tried 2 nodes
with 2 to 8 processes each; nodes are
Hi Ashley
Thanks! I would definitely be interested and will look at the tool.
Meantime, I have filed a bunch of data on this in ticket #1944, so perhaps
you might take a glance at that and offer some thoughts?
https://svn.open-mpi.org/trac/ompi/ticket/1944
Will be back after I look at the tool.
Ralph,
If I may say so, this is exactly the type of problem the tool I have been
working on recently aims to help with, and I'd be happy to help you
through it.
Firstly, I'd say that of the three collectives you mention, MPI_Allgather,
MPI_Reduce and MPI_Bcast, one exhibits a many-to-many pattern, one a many-to-one
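Presumably the third pattern is one-to-many, for MPI_Bcast. For reference, here is a minimal sketch of the three collectives and the pattern each one exercises; the buffer sizes and the choice of rank 0 as root are arbitrary.

/* Sketch: the three collectives and their communication patterns. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size, mine, sum, bval = 0;
    int *gathered;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    mine = rank;
    gathered = malloc(size * sizeof(int));

    /* many-to-many: every rank receives every rank's contribution */
    MPI_Allgather(&mine, 1, MPI_INT, gathered, 1, MPI_INT, MPI_COMM_WORLD);

    /* many-to-one: contributions are combined at the root (rank 0) */
    MPI_Reduce(&mine, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    /* one-to-many: the root's value is distributed to every rank */
    if (rank == 0) bval = sum;
    MPI_Bcast(&bval, 1, MPI_INT, 0, MPI_COMM_WORLD);

    free(gathered);
    MPI_Finalize();
    return 0;
}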
Hi folks
As mentioned in today's telecon, we at LANL are continuing to see hangs when
running even small jobs that involve shared memory in collective operations.
This has been the topic of discussion before, but I bring it up again
because (a) the problem is beginning to become epidemic across
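For readers who want to poke at this, a minimal reproducer sketch is below: it loops a small MPI_Allreduce and prints the iteration, so a hang shows up at a specific rep count. The iteration count and the idea of placing all ranks on one node (so the shared-memory path is used) are assumptions, not the actual LANL test case.

/* Minimal reproducer sketch: repeated MPI_Allreduce with progress output. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i, in, out;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < 100000; i++) {
        in = i;
        MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        if (rank == 0 && i % 1000 == 0)
            printf("completed iteration %d\n", i);
    }

    MPI_Finalize();
    return 0;
}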