Matthew MacManes wrote:
I would be happy to help troubleshoot, but I am not enough of a programmer to
know how. The hang is reproducible, and -mca btl ^sm is about 15% faster.
If you want to shoot me some instructions off-list, I can give it a go.
The application that I am working with, primarily, is ABySS:
http://www
Matthew MacManes wrote:
On my system, mpirun -np 8 -mca btl_sm_num_fifos 7 is much slower (and
appeared to hang after several thousand iterations) than -mca btl ^sm
If the hang is reproducible, we should perhaps have a look. Also, the
fact that it's much slower is interesting. Can you c
On my system, mpirun -np 8 -mca btl_sm_num_fifos 7 is much slower (and
appeared to hang after several thousand iterations) than -mca btl ^sm.
Is there another, better way I should be modifying the FIFOs to get better
performance?
Matt
Some additional data:
Without threads it still hangs, with behavior similar to before.
All of the tests were run on a system running FC11 with X5550 processors.
I just reran on a node of a RHEL 5.3 cluster with E5530 processors (dual
Nehalem):
- openmpi 1.3.4 and gcc 4.1.2
- No issues: connectivity
Gus Correa wrote:
Why wouldn't shared memory work right on Nehalem?
We don't know exactly what is driving this problem, but the issue
appears to be related to memory fences. Messages have to be posted to a
receiver's queue, and by default (since OMPI 1.3.2) each receiving
process has only one such queue, which all senders share.
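For concreteness, both workarounds discussed in this thread are plain mpirun
command-line settings; ./a.out below is just a placeholder for the real
application:

    # Give each receiving process one FIFO per peer (np - 1 of them)
    # instead of the single shared queue:
    mpirun -np 8 -mca btl_sm_num_fifos 7 ./a.out

    # Or take the sm BTL out of the picture entirely:
    mpirun -np 8 -mca btl ^sm ./a.out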
On Dec 10, 2009, at 5:53 PM, Gus Correa wrote:
> How does the efficiency of loopback
> (let's say, over TCP and over IB) compare with "sm"?
Definitely not as good; that's why we have sm. :-) I don't have any
quantification of that assertion, though (i.e., no numbers to back that up).
Hi Jeff
Thanks for jumping in! :)
And for your clarifications too, of course.
How does the efficiency of loopback
(let's say, over TCP and over IB) compare with "sm"?
FYI, I do NOT see the problem reported by Matthew et al.
on our AMD Opteron Shanghai dual-socket quad-core.
They run a quite ou
Jeff Squyres wrote:
Why wouldn't shared memory work right on Nehalem?
(That is probably distressing for Mark, Matthew, and other Nehalem owners.)
To be clear, we don't know that this is a Nehalem-specific problem.
I have definitely had this problem on Harpertown cores.
- Jonathan
--
Hi Matthew, Mark, Mattijs
Great news that a solution was found, actually two,
which seem to have been around for a while.
Thanks, Mark and Mattijs, for posting the solutions.
Much better that all can be solved by software,
with a single mca parameter.
A pity that it took a while for the actual
nature
On Dec 10, 2009, at 5:01 PM, Gus Correa wrote:
> A couple of questions to the OpenMPI pros:
> If shared memory ("sm") is turned off on a standalone computer,
> which mechanism is used for MPI communication?
> TCP via loopback port? Other?
Whatever device supports node-local loopback. TCP is one
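As an aside (my example, not from the thread): you can make the fallback
explicit by listing the BTLs yourself, e.g. forcing TCP plus the "self"
component, which handles a rank sending to itself:

    mpirun -np 8 -mca btl tcp,self ./a.out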
Hi All,
I agree that the issue is troublesome. It apparently has been reported, and
there is an active bug report, with some technical discussion of the underlying
problems, found here: https://svn.open-mpi.org/trac/ompi/ticket/2043
For now, it is OK, but it is an issue that hopefully will be
Hi Mark, Matthew, list
Oh well, Mark's direct experience on a Nehalem
is a game changer, and his recommendation to turn off the shared
memory feature may be the way to go for Matthew, at least to have
things working.
Thank you Mark, your interjection sheds new light on the awkward
situation repor
Mark,
Exciting... SOLVED! There is an open ticket, #2043, regarding the
Nehalem/OpenMPI hang problem (https://svn.open-mpi.org/trac/ompi/ticket/2043).
It seems like the problem might be specific to gcc 4.4.x and OMPI >= 1.3.2. It
seems like there is a group of us with dual-socket Nehalems trying to use o
Just a quick interjection, I also have a dual-quad Nehalem system, HT on,
24GB ram, hand compiled 1.3.4 with options: --enable-mpi-threads
--enable-mpi-f77=no --with-openib=no
With v1.3.4 I see roughly the same behavior: hello and ring work, but
connectivity fails randomly with np >= 8. Turning on -v inc
Hi Matthew
Barring any misinterpretation I may have made of the code:
Hello_c has no real communication, except for a final Barrier
synchronization.
Each process prints "hello world" and that's it.
Ring probes a little more, with processes Send(ing) and
Recv(eiving) messages.
Ring just passes a m
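For readers following along, here is a minimal sketch of that token-passing
pattern. This is my own illustration, not the actual ring_c shipped in the
OpenMPI examples directory, and it assumes at least two processes:

    /* ring.c: pass an integer token around all ranks once. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, token;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {
            token = 42;  /* arbitrary payload */
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(&token, 1, MPI_INT, (rank + 1) % size, 0, MPI_COMM_WORLD);
        }
        printf("rank %d of %d saw token %d\n", rank, size, token);
        MPI_Finalize();
        return 0;
    }

Compile with mpicc ring.c -o ring and run with, say, mpirun -np 8 ./ring.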
Hi Gus and List,
First of all, Gus, I want to say thanks. You have been a huge help, and when I
get this fixed, I owe you big time!
However, the problems continue...
I formatted the hard drive and reinstalled the OS to make sure that I was
working from scratch. I did your step A, which seemed to go fine:
macma
Hi Matthew
There is no point in trying to troubleshoot MrBayes and ABySS
if not even the OpenMPI test programs run properly.
You must straighten them out first.
**
Suggestions:
**
A) While you are at OpenMPI, do yourself a favor
and install it from source in a separate directory.
Who knows i
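For what it's worth, a from-source build into its own prefix usually looks
like this; the prefix path is only an example:

    ./configure --prefix=/opt/openmpi-1.3.4
    make -j4
    make install    # use sudo if the prefix is not writable
    # then put /opt/openmpi-1.3.4/bin first in PATH and
    # /opt/openmpi-1.3.4/lib in LD_LIBRARY_PATH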
Hi Gus,
Interestingly, the results for the connectivity_c test: it works fine with -np
<8. For -np >=8 it works some of the time; other times it HANGS. I have got to
believe that this is a big clue! Also, when it hangs, sometimes I get the
message "mpirun was unable to cleanly terminate the daemo
Hi Matthew
Please see comments/answers inline below.
Matthew MacManes wrote:
Hi Gus,
Thanks for your ideas.. I have a few questions, and will try to answer
yours in hopes of solving this!!
A simple way to test OpenMPI on your system is to run the
test programs that come with the OpenMPI source
Hi Gus,
Thanks for your ideas. I have a few questions, and will try to answer yours in
hopes of solving this!
Should I worry about setting things like --num-cores or --bind-to-cores? This,
I think, gets at your questions about processor affinity. Am I right? I could
not exactly figure out th
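For reference, on OpenMPI of this vintage the simplest knob for affinity is an
MCA parameter rather than those mpirun options; this is my sketch, so check
what your build actually supports:

    # Bind each rank to its own core (sensible when the node is not shared):
    mpirun -np 8 -mca mpi_paffinity_alone 1 ./a.out

    # List the affinity-related parameters your build understands:
    ompi_info --param mpi all | grep -i affinity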