Re: [OMPI users] Debugging Runtime/Ethernet Problems

2013-09-20 Thread Jeff Squyres (jsquyres)
On Sep 20, 2013, at 1:00 PM, Lloyd Brown wrote:
> It is interesting to me, though, that I need to explicitly exclude
> lo/127.0.0.1 in this case, but when I'm on an Ethernet-only node, and I
> just do the plain "mpirun ./appname", I don't have to exclude anything,
> and it figures out to use em1,

Re: [OMPI users] intermittent node file error running with torque/maui integration

2013-09-20 Thread Gus Correa
On 09/20/2013 12:48 PM, Noam Bernstein wrote: On Sep 20, 2013, at 11:52 AM, Gus Correa wrote: Hi Noam Could it be that Torque, or probably more likely NFS, is too slow to create/make available the PBS_NODEFILE? What if you insert a "sleep 2", or whatever number of seconds you want, before t

Re: [OMPI users] Debugging Runtime/Ethernet Problems

2013-09-20 Thread Lloyd Brown
1 - How do I check the BTLs available? Something like "ompi_info | grep -i btl"? If so, here's the list:
> MCA btl: ofud (MCA v2.0, API v2.0, Component v1.6.3)
> MCA btl: openib (MCA v2.0, API v2.0, Component v1.6.3)
> MCA btl: self (MCA v2.0, A

Re: [OMPI users] intermittent node file error running with torque/maui integration

2013-09-20 Thread Noam Bernstein
On Sep 20, 2013, at 11:52 AM, Gus Correa wrote:
> Hi Noam
>
> Could it be that Torque, or probably more likely NFS,
> is too slow to create/make available the PBS_NODEFILE?
>
> What if you insert a "sleep 2",
> or whatever number of seconds you want,
> before the mpiexec command line?
> Or may

Re: [OMPI users] Debugging Runtime/Ethernet Problems

2013-09-20 Thread Jeff Squyres (jsquyres)
On Sep 20, 2013, at 12:27 PM, Lloyd Brown wrote:
> Interesting. I was taking the approach of "only exclude what you're
> certain you don't want" (the native IB and TCP/IPoIB stuff) since I
> wasn't confident enough in my knowledge of the OpenMPI internals, to
> know what I should explicitly incl

Re: [OMPI users] Debugging Runtime/Ethernet Problems

2013-09-20 Thread Lloyd Brown
Interesting. I was taking the approach of "only exclude what you're certain you don't want" (the native IB and TCP/IPoIB stuff) since I wasn't confident enough in my knowledge of the OpenMPI internals, to know what I should explicitly include. However, taking Jeff's suggestion, this does seem to

Re: [OMPI users] error building openmpi-1.7.3a1r29213 on Solaris

2013-09-20 Thread Jeff Squyres (jsquyres)
Looks like Ralph noticed that we fixed this on the trunk and forgot to bring it over to v1.7. I just committed it on v1.7 in r29215. Give it a whirl in tonight's v1.7 nightly tarball.
On Sep 20, 2013, at 7:00 AM, Siegmar Gross wrote:
> Hi,
>
> I tried to install openmpi-1.7.3a1r29213 on "

Re: [OMPI users] intermittent node file error running with torque/maui integration

2013-09-20 Thread Gus Correa
Hi Noam Could it be that Torque, or probably more likely NFS, is too slow to create/make available the PBS_NODEFILE? What if you insert a "sleep 2", or whatever number of seconds you want, before the mpiexec command line? Or maybe better, a "ls -l $PBS_NODEFILE; cat $PBS_NODEFILE", just to make
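The check suggested here could be sketched as a small helper for the job script, polling for the node file instead of sleeping a fixed time. This is only a sketch under assumptions: the function name `wait_for_nodefile`, the default 10-second timeout, and the `./appname` placeholder are hypothetical and not from the thread. Only `$PBS_NODEFILE`, `ls -l`, `cat`, and `mpiexec` come from the messages.

```shell
# Hypothetical helper: wait up to $1 seconds (default 10) for the
# Torque node file to exist and be non-empty, then show it.
wait_for_nodefile() {
    timeout="${1:-10}"
    waited=0
    while [ ! -s "${PBS_NODEFILE:-}" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "node file '${PBS_NODEFILE:-}' still missing after ${timeout}s" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    # Same diagnostics as suggested in the message: list and print the file.
    ls -l "$PBS_NODEFILE"
    cat "$PBS_NODEFILE"
}

# Intended use in a job script (not run here; ./appname is a placeholder):
#   wait_for_nodefile 10 && mpiexec ./appname
```

If the helper times out, the message on stderr at least records that the file was missing at launch time, which would separate a slow-to-appear node file from one that is never created.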

Re: [OMPI users] compilation aborted for Handler.cpp (code 2)

2013-09-20 Thread Jeff Squyres (jsquyres)
Sorry for the delay replying -- I actually replied on the original thread yesterday, but it got hung up in my outbox and I didn't notice that it didn't actually go out until a few moments ago. :-( I'm *guessing* that this is a problem with your local icpc installation. Can you compile / run ot

Re: [OMPI users] Debugging Runtime/Ethernet Problems

2013-09-20 Thread Jeff Squyres (jsquyres)
Correct -- it doesn't make sense to specify both include *and* exclude: by specifying one, you're implicitly (but exactly/precisely) specifying the other. My suggestion would be to use positive notation, not negative notation. For example: mpirun --mca btl tcp,self --mca btl_tcp_if_include eth
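To make the include/exclude contrast concrete, here is a dry-run sketch that only prints the two command lines rather than launching anything. The interface names `eth0`, `ib0`, and `lo` and the `./appname` placeholder are assumptions borrowed from other messages in this thread; substitute whatever your nodes actually report. The `^openib` form is just one illustrative way of writing an exclusion list.

```shell
# Dry run: build the mpirun command lines as strings and print them.
# Assumed names: eth0 (Ethernet), ib0 (IPoIB), lo (loopback), ./appname.
iface="eth0"
app="./appname"

# Positive notation: name the BTLs and the single TCP interface to use.
include_cmd="mpirun --mca btl tcp,self --mca btl_tcp_if_include $iface $app"

# Negative notation, for comparison: exclude the native IB BTL and the
# interfaces you do not want. Setting the exclude list replaces the
# default, so loopback has to be listed explicitly.
exclude_cmd="mpirun --mca btl ^openib --mca btl_tcp_if_exclude lo,ib0 $app"

echo "$include_cmd"
echo "$exclude_cmd"
```

Use one form or the other, never both `btl_tcp_if_include` and `btl_tcp_if_exclude` on the same command line.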

Re: [OMPI users] compilation aborted for Handler.cpp (code 2)

2013-09-20 Thread Jeff Squyres (jsquyres)
I can't tell if this is a busted compiler installation or not. The first error is:
-
/usr/include/c++/4.6.3/bits/stl_algobase.h(573): error: type name is not allowed
const bool __simple = (__is_trivial(_ValueType1)
^
detected duri

Re: [OMPI users] Debugging Runtime/Ethernet Problems

2013-09-20 Thread Ralph Castain
I don't think you are allowed to specify both include and exclude options at the same time as they conflict - you should either exclude ib0 or include eth0 (or whatever). My guess is that the various nodes are trying to communicate across disjoint networks. We've seen that before when, for exam

Re: [OMPI users] Debugging Runtime/Ethernet Problems

2013-09-20 Thread Elken, Tom
> The trouble is when I try to add some "--mca" parameters to force it to
> use TCP/Ethernet, the program seems to hang. I get the headers of the
> "osu_bw" output, but no results, even on the first case (1 byte payload
> per packet). This is occurring on both the IB-enabled nodes, and on the
> E

[OMPI users] Debugging Runtime/Ethernet Problems

2013-09-20 Thread Lloyd Brown
Hi, all. We've got a couple of clusters running RHEL 6.2, and have several centrally-installed versions/compilations of OpenMPI. Some of the nodes have 4xQDR Infiniband, and all the nodes have 1 gigabit ethernet. I was gathering some bandwidth and latency numbers using the OSU/OMB tests, and not

Re: [OMPI users] intermittent node file error running with torque/maui integration

2013-09-20 Thread Noam Bernstein
On Sep 20, 2013, at 10:36 AM, Noam Bernstein wrote:
>
> On Sep 20, 2013, at 10:22 AM, Reuti wrote:
>
>>
>> Is the location for the spool directory local or shared by NFS? Disk full?
>
> No - locally mounted, and far from full on all the nodes.
Another new observation, which may shift the

Re: [OMPI users] intermittent node file error running with torque/maui integration

2013-09-20 Thread Noam Bernstein
On Sep 20, 2013, at 10:22 AM, Reuti wrote:
>
> Is the location for the spool directory local or shared by NFS? Disk full?
No - locally mounted, and far from full on all the nodes.
Noam

Re: [OMPI users] intermittent node file error running with torque/maui integration

2013-09-20 Thread Reuti
Hi,
On 20.09.2013 at 16:12, Noam Bernstein wrote:
> On Sep 20, 2013, at 10:04 AM, Noam Bernstein
> wrote:
>
>> Never mind - I was sure that my earlier tests showed that the $PBS_NODEFILE
>> was there, but now it seems like every time the job fails it's because this
>> file really is missing.

Re: [OMPI users] intermittent node file error running with torque/maui integration

2013-09-20 Thread Noam Bernstein
On Sep 20, 2013, at 10:04 AM, Noam Bernstein wrote:
>
> Never mind - I was sure that my earlier tests showed that the $PBS_NODEFILE
> was there, but now it seems like every time the job fails it's because this
> file really is missing. Time to check why torque isn't always creating
> the nodef

Re: [OMPI users] intermittent node file error running with torque/maui integration

2013-09-20 Thread Noam Bernstein
On Sep 20, 2013, at 9:55 AM, Noam Bernstein wrote:
>
> This is completely unrepeatable - resubmitting the same job almost
> always works the second time around. The line appears to be
> associated with looking for the torque/maui generated node file,
> and when I do something like
> echo $PBS_

[OMPI users] intermittent node file error running with torque/maui integration

2013-09-20 Thread Noam Bernstein
Hi - we've been using openmpi for a while, but only for the last few months with torque/maui. Intermittently (maybe 1/10 jobs), we get mpi jobs that fail with the error:
[compute-2-4:32448] [[52041,0],0] ORTE_ERROR_LOG: File open failure in file ras_tm_module.c at line 142
[compute-2-4:32448] [

[OMPI users] error building openmpi-1.7.3a1r29213 on Solaris

2013-09-20 Thread Siegmar Gross
Hi, I tried to install openmpi-1.7.3a1r29213 on "openSuSE Linux 12.1", "Solaris 10 x86_64", and "Solaris 10 sparc" with "Sun C 5.12" and gcc-4.8.0 in 64-bit mode. Unfortunately "make" breaks with the same error for both compilers on both Solaris platforms. tyr openmpi-1.7.3a1r29213-SunOS.sparc.6

Re: [OMPI users] Fwd: compilation aborted for Handler.cpp (code 2)

2013-09-20 Thread Syed Ahsan Ali
Output of make V=1 is attached. Again same error. If intel compiler is using C++ headers from gfortran then how can we avoid this.
On Fri, Sep 20, 2013 at 11:07 AM, Bert Wesarg wrote:
> Hi,
>
> On Fri, Sep 20, 2013 at 4:49 AM, Syed Ahsan Ali wrote:
>> I am trying to compile openmpi-1.6.5 on fc16

Re: [OMPI users] Fwd: compilation aborted for Handler.cpp (code 2)

2013-09-20 Thread Bert Wesarg
Hi,
On Fri, Sep 20, 2013 at 4:49 AM, Syed Ahsan Ali wrote:
> I am trying to compile openmpi-1.6.5 on fc16.x86_64 with icc and ifort
> but getting the subject error. config.out and make.out is attached.
> Following command was used for configure
>
> ./configure CC=icc CXX=icpc FC=ifort F77=ifort