Re: [OMPI users] Question on run-time error "ORTE was unable to reliably start"

2016-07-28 Thread Ralph Castain
What kind of system was this on? ssh, slurm, ...? > On Jul 28, 2016, at 1:55 PM, Blosch, Edwin L wrote: > > I am running cases that are starting just fine and running for a few hours, > then they die with a message that seems like a startup type of failure. >

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Ralph Castain
This is all on one node, yes? Try adding the following: -mca odls_base_verbose 5 -mca state_base_verbose 5 -mca errmgr_base_verbose 5 Lot of garbage, but should tell us what is going on. On Aug 18, 2014, at 9:36 AM, Maxime Boissonneault wrote: > Here it

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Ralph Castain
of > mpirun -np 4 --mca plm_base_verbose 10 -mca odls_base_verbose 5 -mca > state_base_verbose 5 -mca errmgr_base_verbose 5 ring_c |& tee > output_ringc_verbose.txt > > > Maxime > > On 2014-08-18 12:48, Ralph Castain wrote: >> This is all on one node,

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Ralph Castain
, Maxime Boissonneault <maxime.boissonnea...@calculquebec.ca> wrote: > Indeed, that makes sense now. > > Why isn't OpenMPI attempting to connect with the local loop for same node ? > This used to work with 1.6.5. > > Maxime > > Le 2014-08-18 13:11, Ralph Castain a é

Re: [OMPI users] Segmentation fault in OpenMPI 1.8.1

2014-08-18 Thread Ralph Castain
oes see ib0 (despite not seeing > lo), but does not attempt to use it. > > > On the compute nodes, we have eth0 (management), ib0 and lo, and it works. I > am unsure why it does work on the compute nodes and not on the login nodes. > The only difference is the presence of a public interf

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-20 Thread Ralph Castain
Just to clarify: OMPI will bind the process to *all* N cores, not just to one. On Aug 20, 2014, at 4:26 AM, tmish...@jcity.maeda.co.jp wrote: > Reuti, > > If you want to allocate 10 procs with N threads, the Torque > script below should work for you: > > qsub -l nodes=10:ppn=N > mpirun
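
A minimal Torque sketch of the approach under discussion, for concreteness: it assumes 8 cores per node and a placeholder executable name, and uses the 1.8-series --map-by <obj>:pe=N syntax so that each rank owns, and is bound to, all N of its cores.
```
#PBS -l nodes=10:ppn=8           # hypothetical: 10 nodes, 8 cores per node
export OMP_NUM_THREADS=8         # one OpenMP thread per core owned by each rank
mpirun --map-by slot:pe=8 ./hybrid_app   # each rank is bound to all 8 of its cores
```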

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-20 Thread Ralph Castain
On Aug 20, 2014, at 6:58 AM, Reuti wrote: > Hi, > > Am 20.08.2014 um 13:26 schrieb tmish...@jcity.maeda.co.jp: > >> Reuti, >> >> If you want to allocate 10 procs with N threads, the Torque >> script below should work for you: >> >> qsub -l nodes=10:ppn=N >>

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-20 Thread Ralph Castain
On Aug 20, 2014, at 9:04 AM, Reuti <re...@staff.uni-marburg.de> wrote: > Am 20.08.2014 um 16:26 schrieb Ralph Castain: > >> On Aug 20, 2014, at 6:58 AM, Reuti <re...@staff.uni-marburg.de> wrote: >> >>> Hi, >>> >>> Am 20.08.2014 um

Re: [OMPI users] ORTE daemon has unexpectedly failed after launch

2014-08-20 Thread Ralph Castain
It was not yet fixed - but should be now. On Aug 20, 2014, at 6:39 AM, Timur Ismagilov wrote: > Hello! > > As i can see, the bug is fixed, but in Open MPI v1.9a1r32516 i still have > the problem > > a) > $ mpirun -np 1 ./hello_c > >

Re: [OMPI users] ORTE daemon has unexpectedly failed after launch

2014-08-20 Thread Ralph Castain
yes, i know - it is cmr'd On Aug 20, 2014, at 10:26 AM, Mike Dubman <mi...@dev.mellanox.co.il> wrote: > btw, we get same error in v1.8 branch as well. > > > On Wed, Aug 20, 2014 at 8:06 PM, Ralph Castain <r...@open-mpi.org> wrote: > It was not yet fixed - but sh

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-20 Thread Ralph Castain
On Aug 20, 2014, at 11:16 AM, Reuti <re...@staff.uni-marburg.de> wrote: > Am 20.08.2014 um 19:05 schrieb Ralph Castain: > >>> >>> Aha, this is quite interesting - how do you do this: scanning the >>> /proc//status or alike? What happens i

Re: [OMPI users] Clarification about OpenMPI, slurm and PMI interface

2014-08-20 Thread Ralph Castain
Or you can add "-nolocal|--nolocal (Do not run any MPI applications on the local node)" to your mpirun command line and we won't run any application procs on the node where mpirun is executing On Aug 20, 2014, at 4:28 PM, Joshua Ladd wrote: > Hi, Filippo > > When
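
As a concrete command line (hostfile name and rank count invented for illustration):
```
mpirun --nolocal --hostfile myhosts -np 32 ./a.out   # no application ranks on the node running mpirun
```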

Re: [OMPI users] ORTE daemon has unexpectedly failed after launch

2014-08-21 Thread Ralph Castain
> > Wed, 20 Aug 2014 10:48:38 -0700 from Ralph Castain <r...@open-mpi.org>: > yes, i know - it is cmr'd > > On Aug 20, 2014, at 10:26 AM, Mike Dubman <mi...@dev.mellanox.co.il> wrote: > >> btw, we get same error in v1.8 branch as well. >> >> >>

Re: [OMPI users] Clarification about OpenMPI, slurm and PMI interface

2014-08-21 Thread Ralph Castain
l burden. Of course > login node and front-end node can be two separated hosts but I am looking for > a way to keep our setup as-it-is without introducing structural changes. > > > Hi Ralph, > > On Aug 21, 2014, at 12:36 AM, Ralph Castain <r...@open-mpi.org> wrote:

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-21 Thread Ralph Castain
On Aug 21, 2014, at 2:51 AM, Reuti <re...@staff.uni-marburg.de> wrote: > Am 20.08.2014 um 23:16 schrieb Ralph Castain: > >> >> On Aug 20, 2014, at 11:16 AM, Reuti <re...@staff.uni-marburg.de> wrote: >> >>> Am 20.08.2014 um 19:05 schrieb Ralph

Re: [OMPI users] Running a hybrid MPI+openMP program

2014-08-21 Thread Ralph Castain
On Aug 21, 2014, at 6:54 AM, Reuti <re...@staff.uni-marburg.de> wrote: > Am 21.08.2014 um 15:45 schrieb Ralph Castain: > >> On Aug 21, 2014, at 2:51 AM, Reuti <re...@staff.uni-marburg.de> wrote: >> >>> Am 20.08.2014 um 23:16 schrieb Ralph Castain: >&

Re: [OMPI users] OpenMPI 1.8.1 to 1.8.2rc4

2014-08-21 Thread Ralph Castain
Should not be required (unless they are statically built) as we do strive to maintain ABI within a series On Aug 21, 2014, at 9:39 AM, Maxime Boissonneault wrote: > Hi, > Would you say that softwares compiled using OpenMPI 1.8.1 need to be > recompiled

Re: [OMPI users] Clarification about OpenMPI, slurm and PMI interface

2014-08-21 Thread Ralph Castain
On Aug 21, 2014, at 10:58 AM, Filippo Spiga <spiga.fili...@gmail.com> wrote: > Dear Ralph > > On Aug 21, 2014, at 2:30 PM, Ralph Castain <r...@open-mpi.org> wrote: >> I'm afraid that none of the mapping or binding options would be available >> under srun as th

Re: [OMPI users] building openmpi 1.8.1 with intel 14.0.1

2014-08-21 Thread Ralph Castain
FWIW: I just tried on my Mac with the Intel 14.0 compilers, and it configured and built just fine. However, that was with the current state of the 1.8 branch (the upcoming 1.8.2 release), so you might want to try that in case there is a difference. On Aug 21, 2014, at 12:59 PM, Gus Correa

Re: [OMPI users] long initialization

2014-08-22 Thread Ralph Castain
MPI > semenov@compiler-2 Distribution, ident: 1.9a1r32570, repo rev: r32570, Aug > 21, 2014 (nightly snapshot tarball), 146) > > > Thu, 21 Aug 2014 06:26:13 -0700 from Ralph Castain <r...@open-mpi.org>: > Not sure I understand. The problem has been fixed in both the trunk and

Re: [OMPI users] openmpi-1.8.1 Unable to compile on CentOS6.5

2014-08-26 Thread Ralph Castain
Looks like there is something wrong with your gfortran install: *** Fortran compiler checking for gfortran... gfortran checking whether we are using the GNU Fortran compiler... yes checking whether gfortran accepts -g... yes checking whether ln -s works... yes checking if Fortran compiler

Re: [OMPI users] OpenMPI Remote Execution Problem (Application does not start)

2014-08-26 Thread Ralph Castain
Add --enable-debug to your configure, and then re-run the --host test and add "--leave-session-attached -mca plm_base_verbose 5 -mca oob_base_verbose 5" and let's see what's going on On Aug 26, 2014, at 7:31 AM, Benjamin Giehle wrote: > Hello, > > i have a problem with
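
Spelled out as commands, with an illustrative install prefix and hostname (not taken from the thread):
```
./configure --enable-debug --prefix=/opt/openmpi-debug
make -j4 all install

mpirun --host remotenode -np 1 \
       --leave-session-attached \
       -mca plm_base_verbose 5 -mca oob_base_verbose 5 \
       ./hello_c
```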

Re: [OMPI users] long initialization

2014-08-26 Thread Ralph Castain
all), 146) > > real 1m3.932s > user 0m0.035s > sys 0m0.072s > > > > > Tue, 26 Aug 2014 07:03:58 -0700 from Ralph Castain <r...@open-mpi.org>: > hmmm...what is your allocation like? do you have a large hostfile, for > example? > > if you add a --hos

Re: [OMPI users] long initialization

2014-08-27 Thread Ralph Castain
lo Link encap:Local Loopback > inet addr:127.0.0.1 Mask:255.0.0.0 > UP LOOPBACK RUNNING MTU:16436 Metric:1 > RX packets:31235107 errors:0 dropped:0 overruns:0 frame:0 > TX packets:31235107 errors:0 dropped:0 overruns:0 carrier:0 > collisions:0 txqueuelen:0 > RX bytes:1327509

Re: [OMPI users] How does binding option affect network traffic?

2014-08-28 Thread Ralph Castain
On Aug 28, 2014, at 11:50 AM, McGrattan, Kevin B. Dr. wrote: > My institute recently purchased a linux cluster with 20 nodes; 2 sockets per > node; 6 cores per socket. OpenMPI v 1.8.1 is installed. I want to run 15 > jobs. Each job requires 16 MPI processes. For

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-28 Thread Ralph Castain
I'm unaware of any changes to the Slurm integration between rc4 and final release. It sounds like this might be something else going on - try adding "--leave-session-attached --debug-daemons" to your 1.8.2 command line and let's see if any errors get reported. On Aug 28, 2014, at 12:20 PM,

Re: [OMPI users] Weird error with OMPI 1.6.3

2014-08-29 Thread Ralph Castain
No, it isn't - but we aren't really maintaining the 1.6 series any more. You might try updating to 1.6.5 and see if it remains there On Aug 29, 2014, at 9:12 AM, Maxime Boissonneault wrote: > It looks like > -npersocket 1 > > cannot be used alone. If I

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-29 Thread Ralph Castain
ED AT 2014-08-29T07:16:20 WITH > SIGNAL 9 *** > srun.slurm: error: borg01x144: task 1: Exited with exit code 213 > slurmd[borg01x144]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH > SIGNAL 9 *** > slurmd[borg01x144]: *** STEP 2332583.4 KILLED AT 2014-08-29T07:16:20 WITH > S

Re: [OMPI users] Weird error with OMPI 1.6.3

2014-08-29 Thread Ralph Castain
wrote: > It is still there in 1.6.5 (we also have it). > > I am just wondering if there is something wrong in our installation that > makes MPI unabled to detect that there are two sockets per node if we do not > include a npernode directive. > > Maxime > > Le 2014-0

Re: [OMPI users] How does binding option affect network traffic?

2014-08-29 Thread Ralph Castain
un a maximum of 6 mpirun's at a time across a given set of nodes. So you'd need to stage your allocations correctly to make it work. > > ------ > > Date: Thu, 28 Aug 2014 13:27:12 -0700 > From: Ralph Castain <r...@open-mpi.org> > To: Open MPI

Re: [OMPI users] How does binding option affect network traffic?

2014-08-29 Thread Ralph Castain
B/B/B/B/B] > > [burn008:07256] MCW rank 2 bound to socket 0[core 0[hwt 0]], socket 0[core > 1[hwt 0]], socket 0[core 2[hwt 0]], socket 0[core 3[hwt 0]], socket 0[core > 4[hwt 0]], socket 0[core 5[hwt 0]]: [B/B/B/B/B/B][./././././.] and so on. > > > > *From:* users [mailto:

Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots (updated findings)

2014-08-30 Thread Ralph Castain
hwloc requires the numactl-devel package in addition to the numactl one. If I understand the email thread correctly, it sounds like you have at least some nodes in your system that have fewer cores than others - is that correct? >> Here are the definitions of the two parallel environments tested

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-08-31 Thread Ralph Castain
closed[borg01w063:03815] mca: base: close: unloading component tcp On Fri, Aug 29, 2014 at 3:18 PM, Ralph Castain <r...@open-mpi.org> wrote: Rats - I also need "-mca plm_base_verbose 5" on there so I can see the cmd line being executed. Can you add it? On Aug 29, 2014, at 11:16 A

Re: [OMPI users] core dump on MPI_Finalize in child process.

2014-09-01 Thread Ralph Castain
You need to disconnect the parent/child from each other prior to finalizing - see the attached example simple_spawn.c Description: Binary data On Aug 31, 2014, at 9:44 PM, Roy wrote: > Hi all, > > I'm using MPI_Comm_spawn to start new child process. > I found that
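
A minimal sketch of the pattern being described (not the attached simple_spawn.c): the spawned child looks up its parent intercommunicator and disconnects from it before MPI_Finalize; the parent would do the same with the intercommunicator returned by MPI_Comm_spawn.
```c
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Comm parent;

    MPI_Init(&argc, &argv);
    MPI_Comm_get_parent(&parent);
    if (parent != MPI_COMM_NULL) {
        /* ... exchange whatever data the application needs over 'parent' ... */
        MPI_Comm_disconnect(&parent);   /* sever the parent/child link first */
    }
    MPI_Finalize();                     /* now safe to finalize */
    return 0;
}
```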

Re: [OMPI users] core dump on MPI_Finalize in child process.

2014-09-01 Thread Ralph Castain
call > MPI_Finalize. To be more precise, the disconnect has a single role to > redivide the application in separated groups of connected processes in order > to prevent error propagation (such as MPI_Abort). > > George. > > > > On Mon, Sep 1, 2014 at 12:58 AM

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-01 Thread Ralph Castain
ess 6 of 8 is on borg01w218 > > I'll ask the admin to apply the patch locally...and wait for 1.8.3, I suppose. > > Thanks, > Matt > > On Sun, Aug 31, 2014 at 10:08 AM, Ralph Castain <r...@open-mpi.org> wrote: > Hmmm...I may see the problem. Would you be so kind

Re: [OMPI users] same problems and bus error with openmpi-1.9a1r32657 and gcc

2014-09-02 Thread Ralph Castain
Would you please try r32662? I believe I finally found and fixed this problem. On Sep 2, 2014, at 6:12 AM, Siegmar Gross wrote: > Hi, > > yesterday I installed openmpi-1.9a1r32657 on my machines (Solaris > 10 Sparc (tyr), Solaris 10 x86_64 (sunpc0), and

Re: [OMPI users] SIGSEGV with Java, openmpi-1.8.2, and Sun C and gcc-4.9.0

2014-09-02 Thread Ralph Castain
I believe this was fixed in the trunk and is now scheduled to come across to 1.8.3 On Sep 2, 2014, at 4:21 AM, Siegmar Gross wrote: > Hi, > > yesterday I installed openmpi-1.8.2 on my machines (Solaris 10 Sparc > (tyr), Solaris 10 x86_64 (sunpc0), and

Re: [OMPI users] problems and bus error with openmpi-1.9a1r32657

2014-09-02 Thread Ralph Castain
Hi Siegmar Could you please configure this OMPI install with --enable-debug so that gdb will provide line numbers where the error is occurring? Otherwise, I'm having a hard time chasing this problem down. Thanks Ralph On Sep 2, 2014, at 6:01 AM, Siegmar Gross

Re: [OMPI users] problems and bus error with openmpi-1.9a1r32657

2014-09-02 Thread Ralph Castain
I don't see any line numbers on the errors I flagged - all I see are the usual memory offsets in bytes, which is of little help. I'm afraid I don't know what you'd have to do under SunOS to get line numbers, but I can't do much without it On Sep 2, 2014, at 10:26 AM, Siegmar Gross

Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots (updated findings)

2014-09-02 Thread Ralph Castain
__________ > From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain > [r...@open-mpi.org] > Sent: Saturday, August 30, 2014 7:15 AM > To: Open MPI Users > Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots > (u

Re: [OMPI users] problems and bus error with openmpi-1.9a1r32657

2014-09-02 Thread Ralph Castain
The difficulty here is that you have bundled several errors again into a single message, making it hard to keep the conversation from getting terribly confused. I was trying to address the segfault errors on cleanup, which have nothing to do with the accept being rejected. It looks like those

Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots (updated findings)

2014-09-02 Thread Ralph Castain
ce 8/28/14) all occurred after we upgraded our cluster > to OpenMPI 1.8.2 on . Maybe I should've created a new thread rather > than tacking on these issues to my existing thread. > > -Bill Lane > > > From: users [users-boun...@open-mpi.o

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-02 Thread Ralph Castain
; Thanks. > > > > On Sep 2, 2014, at 9:36 AM, Matt Thompson <fort...@gmail.com> wrote: > >> On that machine, it would be SLES 11 SP1. I think it's soon transitioning to >> SLES 11 SP3. >> >> I also use Open MPI on an RHEL 6.5 box (possibly soon to be RH

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-03 Thread Ralph Castain
Thanks Matt - that does indeed resolve the "how" question :-) We'll talk internally about how best to resolve the issue. We could, of course, add a flag to indicate "we are using a shellscript version of srun" so we know to quote things, but it would mean another thing that the user would have

Re: [OMPI users] `return EXIT_FAILURE;` triggers error message

2014-09-03 Thread Ralph Castain
Exiting with a non-zero status is considered as indicating a failure that needs reporting. On Sep 3, 2014, at 1:48 PM, Nico Schlömer wrote: > Hi all, > > with OpenMPI 1.6.5 (default Debian/Ubuntu), I'm running the program > > ``` > #include > #include > > int
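
A reconstruction of the kind of test case being discussed; the headers are assumed, since the quoted listing does not show them. The program completes normally, but mpirun still reports a failure because of the non-zero exit code.
```c
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Finalize();
    return EXIT_FAILURE;   /* non-zero exit status: mpirun flags the run as failed */
}
```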

Re: [OMPI users] SGE and openMPI

2014-09-04 Thread Ralph Castain
Just to help separate out the issues, you might try running the hello_c program in the OMPI examples directory - this will verify whether the problem is in the mpirun command or in your program On Sep 4, 2014, at 6:26 AM, Donato Pera wrote: > Hi, > > the text was

Re: [OMPI users] Issues with OpenMPI 1.8.2, GCC 4.9.1, and SLURM Interactive Jobs

2014-09-04 Thread Ralph Castain
Still begs the bigger question, though, as others have used script wrappers before - and I'm not sure we (OMPI) want to be in the business of dictating the scripting language they can use. :-) Jeff and I will argue that one out On Sep 4, 2014, at 7:38 AM, Jeff Squyres (jsquyres)

Re: [OMPI users] new overcommitment warning?

2014-09-06 Thread Ralph Castain
On Sep 5, 2014, at 3:34 PM, Allin Cottrell wrote: > I suspect there is a new (to openmpi 1.8.N?) warning with respect to > requesting a number of MPI processes greater than the number of "real" cores > on a given machine. I can provide a good deal more information if that's
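
If the oversubscription is intentional, mpirun can be told so explicitly; a sketch, assuming the 1.8-series --oversubscribe flag (check mpirun --help on your install if unsure):
```
mpirun --oversubscribe -np 8 ./a.out   # e.g. 8 ranks on a 4-core machine, on purpose
```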

Re: [OMPI users] How does binding option affect network traffic?

2014-09-06 Thread Ralph Castain
On Sep 5, 2014, at 10:44 AM, McGrattan, Kevin B. Dr. wrote: > I am testing a new cluster that we just bought, which is why I am loading > things this way. I am deliberately increasing network traffic. But in > general, we submit jobs intermittently with various

Re: [OMPI users] new overcommitment warning?

2014-09-06 Thread Ralph Castain
On Sep 6, 2014, at 7:52 AM, Allin Cottrell <cottr...@wfu.edu> wrote: > On Fri, 5 Sep 2014, Ralph Castain wrote: > >> On Sep 5, 2014, at 3:34 PM, Allin Cottrell <cottr...@wfu.edu> wrote: >> >>> I suspect there is a new (to openmpi 1.8.N?) warning with res

Re: [OMPI users] new overcommitment warning?

2014-09-06 Thread Ralph Castain
On Sep 6, 2014, at 11:00 AM, Allin Cottrell <cottr...@wfu.edu> wrote: > On Sat, 6 Sep 2014, Ralph Castain wrote: > >> On Sep 6, 2014, at 7:52 AM, Allin Cottrell <cottr...@wfu.edu> wrote: >> >>> On Fri, 5 Sep 2014, Ralph Castain wrote: >>> >&g

Re: [OMPI users] launch openmpi programs in Docker containers

2014-09-09 Thread Ralph Castain
If you assign unique IP addresses to each container, you can then create a hostfile that contains the IP addresses. Feed that to mpirun and it will work just fine. If you really want to do it under slurm, then slurm is going to need the list of those IP addresses anyway. We read the slurm
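
A sketch of what that looks like in practice; the container addresses and slot counts below are invented for illustration.
```
# hostfile: one line per container, using the IP address OMPI should reach it on
172.17.0.2 slots=4
172.17.0.3 slots=4

# then point mpirun at it
mpirun --hostfile hostfile -np 8 ./a.out
```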

Re: [OMPI users] [Error running] OpenMPI after the installation of Torque (PBS)

2014-09-10 Thread Ralph Castain
What OMPI version? On Sep 10, 2014, at 1:53 AM, Red Red wrote: > Hi, > > > after the installation of a Torque PBS when I start a simple program with > mpirun I get this result (i have already installed again): > > [oxygen1:04280] [[INVALID],INVALID] ORTE_ERROR_LOG: Not

Re: [OMPI users] still SIGSEGV for Java and openmpi-1.8.3a1r32692 on Solaris

2014-09-10 Thread Ralph Castain
Working on the memory alignment issues in the trunk, and they are being scheduled to come across as we go. On Sep 10, 2014, at 9:08 AM, Siegmar Gross wrote: > Hi, > > today I installed openmpi-1.8.3a1r32692 on my machines (Solaris > 10 Sparc (tyr),

Re: [OMPI users] Multiple threads for an mpi process

2014-09-12 Thread Ralph Castain
Hmmm...well, the info is there. There is an envar OMPI_COMM_WORLD_LOCAL_SIZE which tells you how many procs are on this node. If you tell your proc how many cores (or hwthreads) to use, it would be a simple division to get what you want. You could also detect the number of cores or hwthreads
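
A self-contained sketch of that division; it assumes the OS-reported online processor count is an acceptable stand-in for the core/hwthread count (hwloc would give a more precise answer).
```c
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const char *s = getenv("OMPI_COMM_WORLD_LOCAL_SIZE");
    int local_ranks = s ? atoi(s) : 1;           /* MPI procs OMPI placed on this node */
    long cpus = sysconf(_SC_NPROCESSORS_ONLN);   /* hwthreads visible to the OS */
    int threads = (int)(cpus / local_ranks);
    if (threads < 1) threads = 1;
    printf("local ranks = %d, cpus = %ld -> %d threads per rank\n",
           local_ranks, cpus, threads);
    return 0;
}
```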

Re: [OMPI users] launch openmpi programs in Docker containers

2014-09-13 Thread Ralph Castain
registry.hub.docker.com/search?q=openmpi > > I haven’t tested it to see whats in there, but the guys on the ompi mailing > list might want to check that out. > > -Ben > > On Sep 9, 2014, at 3:54 PM, Ralph Castain <r...@open-mpi.org> wrote: > If you assign un

Re: [OMPI users] removed maffinity, paffinity in 1.7+

2014-09-15 Thread Ralph Castain
Not really "removed" - say rather "renamed". The PLPA system was replaced by HWLOC starting with the 1.7 series. The binding directives were replaced with --bind-to options as they became much more fine-grained than before - you can bind all the way down to the hardware thread level. If you don't
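
For anyone translating old paffinity/maffinity settings, the 1.7+ equivalents look like this (illustrative command lines, not taken from the thread):
```
mpirun --bind-to core     ./a.out   # roughly the old bind-to-core behaviour
mpirun --bind-to socket   ./a.out
mpirun --bind-to hwthread ./a.out   # finer-grained than the old options allowed
mpirun --report-bindings --bind-to core ./a.out   # print where each rank actually landed
```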

Re: [OMPI users] launch openmpi programs in Docker containers

2014-09-15 Thread Ralph Castain
h private to hosts and > dynamic. But it's a start. > > Thank you > > > > On Wed, Sep 10, 2014 at 12:54 AM, Ralph Castain <r...@open-mpi.org> wrote: >> If you assign unique IP addresses to each container, you can then create a >> hostfile that contains the IP

Re: [OMPI users] --prefix, segfaulting

2014-09-17 Thread Ralph Castain
You should check that your path would also hit /usr/bin/mpiexec and not some other version of it On Sep 17, 2014, at 4:01 PM, Nico Schlömer wrote: > Hi all! > > Today, I observed a really funky behavior of my stock > ``` > $ mpiexec --version > mpiexec (OpenRTE)

Re: [OMPI users] --prefix, segfaulting

2014-09-17 Thread Ralph Castain
uld also hit /usr/bin/mpiexec and not some >> other version of it > > ``` > $ which mpiexec > /usr/bin/mpiexec > ``` > Is this what you mean? > > –Nico > > On Thu, Sep 18, 2014 at 1:04 AM, Ralph Castain <r...@open-mpi.org> wrote: >> You

Re: [OMPI users] Process is hanging

2014-09-21 Thread Ralph Castain
Can you please tell us what version of OMPI you are using? On Sep 21, 2014, at 6:08 AM, Lee-Ping Wang wrote: > Hi there, > > I’m running into an issue where mpirun isn’t terminating when my executable > has a nonzero exit status – instead it’s hanging indefinitely.

Re: [OMPI users] Process is hanging

2014-09-21 Thread Ralph Castain
:users-boun...@open-mpi.org] On Behalf Of Ralph Castain > Sent: Sunday, September 21, 2014 6:56 AM > To: Open MPI Users > Subject: Re: [OMPI users] Process is hanging > > Can you please tell us what version of OMPI you are using? > > > On Sep 21, 2014, at 6:08 AM, Lee

Re: [OMPI users] Process is hanging

2014-09-21 Thread Ralph Castain
gt; existing. > > Thanks, > > - Lee-Ping > > From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain > Sent: Sunday, September 21, 2014 8:54 AM > To: Open MPI Users > Subject: Re: [OMPI users] Process is hanging > > Just to be clear

Re: [OMPI users] Process is hanging

2014-09-22 Thread Ralph Castain
Lee-Ping > > From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain > Sent: Sunday, September 21, 2014 11:49 AM > To: Open MPI Users > Subject: Re: [OMPI users] Process is hanging > > Thanks - I asked because the output you sent shows a bunch of se

Re: [OMPI users] Process is hanging

2014-09-22 Thread Ralph Castain
> > From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain > Sent: Monday, September 22, 2014 8:09 AM > To: Open MPI Users > Subject: Re: [OMPI users] Process is hanging > > Could you try using the nightly 1.8 tarball? I know there was a problem

Re: [OMPI users] Strange affinity messages with 1.8 and torque 5

2014-09-23 Thread Ralph Castain
FWIW: that warning has been removed from the upcoming 1.8.3 release On Sep 23, 2014, at 11:45 AM, Reuti wrote: > -BEGIN PGP SIGNED MESSAGE- > Hash: SHA1 > > Am 23.09.2014 um 19:53 schrieb Brock Palen: > >> I found a fun head scratcher, with openmpi 1.8.2

Re: [OMPI users] Strange affinity messages with 1.8 and torque 5

2014-09-24 Thread Ralph Castain
>> On 2014-09-23 15:05, Brock Palen wrote: >>> Yes the request to torque was procs=64, >>> >>> We are using cpusets. >>> >>> the mpirun without -np 64 creates 64 spawned hostnames. >>> >>> Brock Palen >>

Re: [OMPI users] Running program on a cluster

2014-09-24 Thread Ralph Castain
No, it doesn't matter at all for OMPI - any order is fine. The issue I see is that your mpiexec isn't the OMPI one, but is from someone else. I have no idea whose mpiexec you are using On Sep 24, 2014, at 6:38 PM, XingFENG wrote: > I have found the solution. The

Re: [OMPI users] Running program on a cluster

2014-09-25 Thread Ralph Castain
ocumentation claims > that two mpi are installed, namely, OpenMPI and MPICH2. > > On Thu, Sep 25, 2014 at 11:45 AM, Ralph Castain <r...@open-mpi.org> wrote: > No, it doesn't matter at all for OMPI - any order is fine. The issue I see is > that your mpiexec isn't the OMPI on

Re: [OMPI users] Running program on a cluster

2014-09-25 Thread Ralph Castain
CH version. On Sep 25, 2014, at 4:33 AM, XingFENG <xingf...@cse.unsw.edu.au> wrote: > It returns /usr/bin/mpiexec. > > On Thu, Sep 25, 2014 at 8:57 PM, Ralph Castain <r...@open-mpi.org> wrote: > Do "which mpiexec" and look at the path. The options you show are f

Re: [OMPI users] OpenMPI 1.8.2 segfaults while 1.6.5 works?

2014-09-29 Thread Ralph Castain
Can you pass us the actual mpirun command line being executed? Especially need to see the argv being passed to your application. On Sep 27, 2014, at 7:09 PM, Amos Anderson wrote: > FWIW, I've confirmed that the segfault also happens with OpenMPI 1.7.5. Also, > I

Re: [OMPI users] --prefix, segfaulting

2014-09-29 Thread Ralph Castain
I'm not seeing this with 1.8.3 - can you try with it? On Sep 17, 2014, at 4:38 PM, Ralph Castain <r...@open-mpi.org> wrote: > Yeah, just wanted to make sure you were seeing the same mpiexec in both > cases. There shouldn't be any issue with providing the complete path, though

Re: [OMPI users] OpenMPI 1.8.2 segfaults while 1.6.5 works?

2014-09-29 Thread Ralph Castain
led with OpenMPI > 1.6 ? > > > > On Sep 29, 2014, at 10:28 AM, Ralph Castain <r...@open-mpi.org> wrote: > >> Can you pass us the actual mpirun command line being executed? Especially >> need to see the argv being passed to your application. >> &

Re: [OMPI users] OpenMPI 1.8.2 segfaults while 1.6.5 works?

2014-09-29 Thread Ralph Castain
tested your scenario. On Sep 29, 2014, at 10:55 AM, Ralph Castain <r...@open-mpi.org> wrote: > Okay, so regression-test.py is calling MPI_Init as a singleton, correct? Just > trying to fully understand the scenario > > Singletons are certainly allowed, if that's the scenario &g

Re: [OMPI users] OpenMPI 1.8.2 segfaults while 1.6.5 works?

2014-09-29 Thread Ralph Castain
st/regression/regression-jobs" > (gdb) print argv[2] > $13 = 0x20 > (gdb) > > > > > On Sep 29, 2014, at 11:48 AM, Dave Goodell (dgoodell) <dgood...@cisco.com> > wrote: > >> Looks like boost::mpi and/or your python "mpi" module might be

Re: [OMPI users] General question about running single-node jobs.

2014-09-30 Thread Ralph Castain
I don't know anything about your application, or what the functions in your code are doing. I imagine it's possible that you are trying to open statically defined ports, which means that running the job again too soon could leave the OS thinking the socket is already busy. It takes a while for

Re: [OMPI users] OpenMPI 1.8.3 build without BTL

2014-09-30 Thread Ralph Castain
ompi_info is just the first executable that gets built, and so it is always the place where we find missing library issues. It looks like someone has left incorrect configure logic in the system such that we always attempt to build Infiniband-related code, but without linking against the

Re: [OMPI users] still SIGSEGV for Java in openmpi-1.9a1r32807 on Solaris

2014-09-30 Thread Ralph Castain
Don't know about the segfault itself, but I did find and fix the classpath logic so the app is found. Might help you get a little further. On Sep 29, 2014, at 10:58 PM, Siegmar Gross wrote: > Hi, > > yesterday I installed openmpi-1.9a1r32807 on my

Re: [OMPI users] General question about running single-node jobs.

2014-09-30 Thread Ralph Castain
to, and if that code knows how to handle arbitrary connections You might check about those warnings - could be that QCLOCALSCR and QCREF need to be set for the code to work. > > - Lee-Ping > > On Sep 29, 2014, at 8:45 PM, Ralph Castain <r...@open-mpi.org> wrote: > >&

Re: [OMPI users] General question about running single-node jobs.

2014-09-30 Thread Ralph Castain
four different > clusters (where I don't set these environment variables either), it's only > broken on the Blue Waters compute node. Also, the calculation runs without > any problems the first time it's executed on the BW compute node - it's only > subsequent executions that give the error

Re: [OMPI users] About valgrind and OpenMPI

2014-10-02 Thread Ralph Castain
Hmmm...I would guess you should talk to the Hadoop folks as the problem seems to be a conflict between valgrind and HDFS. Does valgrind even support Java programs? I honestly have never tried to do that before. On Oct 2, 2014, at 4:40 AM, XingFENG wrote: > Hi

Re: [OMPI users] still SIGSEGV with Java in openmpi-1.9.0a1git99c3999 on Solaris

2014-10-05 Thread Ralph Castain
We've talked about this a lot over the last few weeks, trying to come up with some way to maintain the Solaris support - but have come up empty. None of us have access to such a system, and it appears to be very difficult to avoid regularly breaking it. I may, as time permits, try playing with

Re: [OMPI users] Update/patch to check/opal_check_pmi.m4

2014-10-06 Thread Ralph Castain
I've looked at your patch, and it isn't quite right as it only looks for libpmi and not libpmi2. We need to look for each of them as we could have either or both. I'll poke a bit at this tonight and see if I can make this a little simpler - the nesting is getting a little deep. On Mon, Oct 6,

Re: [OMPI users] Open MPI was unable to obtain the username

2014-10-10 Thread Ralph Castain
Sorry about delay - was on travel. Yes, that will avoid the issue. On Oct 10, 2014, at 1:17 PM, Gary Jackson wrote: > > To answer my own question: > > Configure with --disable-getpwuid. > > On 10/10/14, 12:04 AM, Gary Jackson wrote: >> >> I'd like to run MPI on a node to

Re: [OMPI users] Open MPI 1.8.3 openmpi-mca-params.conf: old and new parameters

2014-10-14 Thread Ralph Castain
On Oct 14, 2014, at 5:32 PM, Gus Correa wrote: > Dear Open MPI fans and experts > > This is just a note in case other people run into the same problem. > > I just built Open MPI 1.8.3. > As usual I put my old settings on openmpi-mca-params.conf, > with no further

Re: [OMPI users] Hybrid OpenMPI/OpenMP leading to deadlocks?

2014-10-16 Thread Ralph Castain
If you only have one thread doing MPI calls, then single and funneled are indeed the same. If this is only happening after long run times, I'd suspect resource exhaustion. You might check your memory footprint to see if you are running into leak issues (could be in our library as well as your
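
For reference, "funneled" is what a hybrid code requests when only the main thread makes MPI calls; a minimal init sketch (checking the returned 'provided' level is the step people usually skip):
```c
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    if (provided < MPI_THREAD_FUNNELED) {
        /* the library cannot support even funneled threading */
        MPI_Abort(MPI_COMM_WORLD, EXIT_FAILURE);
    }
    /* ... OpenMP regions may run here, but only the main thread calls MPI ... */
    MPI_Finalize();
    return 0;
}
```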

Re: [OMPI users] Open MPI 1.8.3 openmpi-mca-params.conf: old and new parameters

2014-10-16 Thread Ralph Castain
f Squyres (jsquyres) wrote: >> We talked off-list -- fixed this on master and just filed >> https://github.com/open-mpi/ompi-release/pull/33 to get this into the v1.8 >> branch. >> >> >> On Oct 14, 2014, at 7:39 PM, Ralph Castain <r...@open-mpi.org> wr

Re: [OMPI users] Open MPI on Cray xc30 and getpwuid

2014-10-16 Thread Ralph Castain
Add --disable-getpwuid to configure On Oct 16, 2014, at 12:36 AM, Aurélien Bouteiller wrote: > I am building trunk on the Cray xc30. > I get the following warning during link (static link) > ../../../orte/.libs/libopen-rte.a(session_dir.o): In function >
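
In configure terms (the prefix is only an example):
```
./configure --disable-getpwuid --prefix=/opt/openmpi-cray
make -j8 all install
```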

Re: [OMPI users] Open MPI 1.8.3 openmpi-mca-params.conf: old and new parameters

2014-10-16 Thread Ralph Castain
to need an update. > That is probably the first place people look for information > about runtime features. > For instance, the process placement examples still > use deprecated parameters and mpiexec options: > -bind-to-core, rmaps_base_schedule_policy, orte_process_binding, etc. On my

Re: [OMPI users] knem in Open MPI 1.8.3

2014-10-16 Thread Ralph Castain
FWIW: vader is the default in 1.8 On Oct 16, 2014, at 1:40 PM, Aurélien Bouteiller wrote: > Are you sure you are not using the vader BTL ? > > Setting mca_btl_base_verbose and/or sm_verbose should spit out some knem > initialization info. > > The CMA linux system

Re: [OMPI users] knem in Open MPI 1.8.3

2014-10-16 Thread Ralph Castain
r the benefit of mere mortals like me >> who don't share the dark or the bright side of the force, >> and just need to keep their MPI applications running in production mode, >> hopefully with Open MPI 1.8, >> can somebody explain more clearly what "vader" is

Re: [OMPI users] Open MPI 1.8.3 openmpi-mca-params.conf: old and new parameters

2014-10-17 Thread Ralph Castain
parameters and mpiexec options: > -bind-to-core, rmaps_base_schedule_policy, orte_process_binding, etc. > > Thank you, > Gus Correa > > On 10/15/2014 11:10 PM, Ralph Castain wrote: >> >> On Oct 15, 2014, at 11:46 AM, Gus Correa <g...@ldeo.columbia.edu >

Re: [OMPI users] Open MPI 1.8: link problem when Fortran+C+Platform LSF

2014-10-17 Thread Ralph Castain
point, an undefined symbol reference from another dynamic >library. --no-as-needed restores the default behaviour. > > > > -- > Dipl.-Inform. Paul Kapinos - High Performance Computing, > RWTH Aachen University, IT Center > Seffenter Weg 23, D 52074 Aa

Re: [OMPI users] Open MPI 1.8.3 openmpi-mca-params.conf: old and new parameters

2014-10-17 Thread Ralph Castain
l model, along with its syntax > and examples. Yeah, I need to do that. LAMA was an alternative implementation of the current map/rank/bind system. It hasn’t been fully maintained since it was introduced, and so I’m not sure how much of it is functional. I need to create an equivalent for the c

Re: [OMPI users] knem in Open MPI 1.8.3

2014-10-17 Thread Ralph Castain
> On Oct 17, 2014, at 12:06 PM, Gus Correa wrote: > > Hi Jeff > > Many thanks for looking into this and filing a bug report at 11:16PM! > > Thanks to Aurelien, Ralph and Nathan for their help and clarifications > also. > > ** > > Related suggestion: > > Add a note

Re: [OMPI users] large memory usage and hangs when preconnecting beyond 1000 cpus

2014-10-18 Thread Ralph Castain
> On Oct 17, 2014, at 3:37 AM, Marshall Ward wrote: > > I currently have a numerical model that, for reasons unknown, requires > preconnection to avoid hanging on an initial MPI_Allreduce call. That is indeed odd - it might take a while for all the connections to form,
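
For context, preconnection in this series is normally requested through an MCA parameter rather than in code; something along these lines, with the parameter name quoted from memory (verify with ompi_info --param mpi all on the install in question):
```
mpirun --mca mpi_preconnect_mpi 1 -np 1024 ./model
```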

Re: [OMPI users] low CPU utilization with OpenMPI

2014-10-23 Thread Ralph Castain
From your error message, I gather you are not running an MPI program, but rather an OSHMEM one? Otherwise, I find the message strange as it only would be emitted from an OSHMEM program. What version of OMPI are you trying to use? > On Oct 22, 2014, at 7:12 PM, Vinson Leung

Re: [OMPI users] Problem with Yosemite

2014-10-24 Thread Ralph Castain
I was able to build and run the trunk without problem on Yosemite with: gcc (MacPorts gcc49 4.9.1_0) 4.9.1 GNU Fortran (MacPorts gcc49 4.9.1_0) 4.9.1 Will test 1.8 branch now, though I believe the fortran support in 1.8 is up-to-date > On Oct 24, 2014, at 6:46 AM, Guillaume Houzeaux

Re: [OMPI users] Problem with Yosemite

2014-10-24 Thread Ralph Castain
http://lists.gnu.org/archive/html/libtool-patches/2014-09/msg2.html > > On Fri, Oct 24, 2014 at 6:09 PM, Ralph Castain <r...@open-mpi.org> wrote: >> I was able to build and run the trunk without problem on Yosemite with: >> >> gcc (MacPorts gcc49 4.9.1_0)
