Re: [OMPI users] False positive from valgrind in sec_basic.c

2014-05-21 Thread Ralph Castain
Wow, that is hilarious. I see no problem with adding the extra characters :-) Scheduled it for 1.8.2 (copied you on ticket) On May 21, 2014, at 3:29 PM, W Spector wrote: > Hi, > > When running under valgrind, I get warnings from each MPI process at MPI_Init > time. The warnings come from fu

Re: [OMPI users] pinning processes by default

2014-05-23 Thread Ralph Castain
Note that the lama mapper described in those slides may not work as it hasn't been maintained in a while. However, you can use the map-by and bind-to options to do the same things. If you want to disable binding, you can do so by adding "--bind-to none" to the cmd line, or via the MCA param "hw
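
A minimal sketch of the 1.8-series options mentioned above, assuming a hypothetical executable ./a.out:

  # map one process per socket and bind each process to its socket
  mpirun -np 4 --map-by socket --bind-to socket ./a.out

  # disable binding entirely
  mpirun -np 4 --bind-to none ./a.out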

Re: [OMPI users] MPI_Finalize() maintains load at 100%.

2014-05-23 Thread Ralph Castain
Hmmm...that is a bit of a problem. I've added a note to see if we can turn down the aggressiveness of the MPI layer once we hit finalize, but that won't solve your immediate problem. Our usual suggestion is that you have each proc call finalize before going on to do other things. This avoids th

Re: [OMPI users] pinning processes by default

2014-05-23 Thread Ralph Castain
environment, so I'm assuming there is some other limitation in play here? If so, you could always put the MCA param in the default mca param file - we'll pick it up from there. > > Albert > > On 23/05/14 14:32, Ralph Castain wrote: >> Note that the lama mapper describ

Re: [OMPI users] pinning processes by default

2014-05-23 Thread Ralph Castain
_base_binding_policy = none HTH Ralph > > Thanks, > Albert > > On 23/05/14 15:02, Ralph Castain wrote: >> >> On May 23, 2014, at 6:58 AM, Albert Solernou >> wrote: >> >>> Hi, >>> thanks a lot for your quick answers, and I see my error,
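
The preview above truncates the parameter name; in the 1.8 series it is hwloc_base_binding_policy. A sketch of the default-param-file approach, assuming a standard prefix install (the per-user file ~/.openmpi/mca-params.conf works as well):

  # $prefix/etc/openmpi-mca-params.conf
  hwloc_base_binding_policy = none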

Re: [OMPI users] MPI_Finalize() maintains load at 100%.

2014-05-23 Thread Ralph Castain
On May 23, 2014, at 7:21 AM, Iván Cores González wrote: > Hi Ralph, > Thanks for your response. > I see your point, I try to change the algorithm but some processes finish > while the others are still calling MPI functions. I can't avoid this > behaviour. > The ideal behavior is the processe

Re: [OMPI users] MPI_Finalize() maintains load at 100%.

2014-05-23 Thread Ralph Castain
waiting until it receives some data >> from other processes? > > This solution was my first idea, but I can't do it. I use spawned processes > and > different communicators for manage "groups" of processes, so the ideal > behaviour > is that processes finished and

Re: [OMPI users] MPI_Finalize() maintains load at 100%.

2014-05-23 Thread Ralph Castain
em is that Finalize invokes a barrier, and some of the procs aren't there any more to participate. On May 23, 2014, at 12:03 PM, Ralph Castain wrote: > I'll check to see - should be working > > On May 23, 2014, at 8:07 AM, Iván Cores González wrote: > >>> I as

Re: [OMPI users] False positive from valgrind in sec_basic.c

2014-05-24 Thread Ralph Castain
is merely a band-aid ... >> >> More info @ https://bugs.launchpad.net/ubuntu/+source/valgrind/+bug/852760. >> A better fix for this issue will be to add "-fno-builtin-strdup" to >> your CFLAGS when compiling Open MPI. >> >> Ge

Re: [OMPI users] divide-by-zero in mca_btl_openib_add_procs

2014-05-26 Thread Ralph Castain
Strange - I note that you are running these as singletons. Can you try running it under mpirun? mpirun -n 1 ./a.out just to see if it is the singleton that is causing the problem, or something in the openib btl itself. On May 26, 2014, at 6:59 AM, Alain Miniussi wrote: > > Hi, > > I have

Re: [OMPI users] divide-by-zero in mca_btl_openib_add_procs

2014-05-26 Thread Ralph Castain
noticed that process rank 0 with PID 5123 on node tagir exited on > signal 13 (Broken pipe). > -- > [alainm@tagir mpi]$ > > > do you want me to try a gcc build ? > > Alain > > On 26/05/2014 1

Re: [OMPI users] divide-by-zero in mca_btl_openib_add_procs

2014-05-27 Thread Ralph Castain
4/libnsl.so.1 (0x003beb00) >libutil.so.1 => /lib64/libutil.so.1 (0x003bea00) >libm.so.6 => /lib64/libm.so.6 (0x003bd9a0) >/lib64/ld-linux-x86-64.so.2 (0x003bd8e0) > [alainm@gurney mpi]$ ./a.out > [alainm@gurney mpi]$ > > So it seems to

Re: [OMPI users] Deadly warning "Epoll ADD(4) on fd 2 failed." ?

2014-05-27 Thread Ralph Castain
I'm unaware of any OMPI error message like that - might be caused by something in libevent as that could be using epoll, so it could be caused by us. However, I'm a little concerned about the use of the prerelease version of Slurm as we know that PMI is having some problems over there. So out o

Re: [OMPI users] divide-by-zero in mca_btl_openib_add_procs

2014-05-27 Thread Ralph Castain
o the attention of the folks who maintain that component and see if they can grok the problem. Thanks! Ralph > > Alain > > On 27/05/2014 17:30, Ralph Castain wrote: >> Ah, good. On the setup that fails, could you use gdb to find the line number >> where it is dividing

Re: [OMPI users] Deadly warning "Epoll ADD(4) on fd 2 failed." ?

2014-05-28 Thread Ralph Castain
If it is a slurm PMI issue, this should resolve it. On May 28, 2014, at 12:03 AM, Filippo Spiga wrote: > Dear Ralph, > > On May 27, 2014, at 6:31 PM, Ralph Castain wrote: >> So out of curiosity - how was this job launched? Via mpirun or directly >> using srun? >

Re: [OMPI users] intel compiler and openmpi 1.8.1

2014-05-29 Thread Ralph Castain
Are you sure you have /Users/lorenzodona/Documents/openmpi-1.8.1/bin at the *beginning* of your PATH? Reason: most common cause of what you are showing is that you are picking up some other version of mpif90 On May 29, 2014, at 4:11 AM, Lorenzo Donà wrote: > I compiled openmpi 1.8.1 with int
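
A quick way to verify which wrapper is being picked up, reusing the install path quoted above:

  export PATH=/Users/lorenzodona/Documents/openmpi-1.8.1/bin:$PATH
  which mpif90       # should resolve inside the openmpi-1.8.1/bin directory
  mpif90 --showme    # prints the underlying compiler and flags the wrapper invokes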

Re: [OMPI users] [warn] Epoll ADD(1) on fd 0 failed

2014-05-30 Thread Ralph Castain
Can you pass along the cmd line that generated that output, and how OMPI was configured? On May 30, 2014, at 5:11 AM, Тимур Исмагилов wrote: > Hello! > > I am using Open MPI v1.8.1 and slurm 2.5.6. > > I got this messages when i try to run example (hello_oshmem.cpp) program: > > [warn] Epoll

Re: [OMPI users] divide-by-zero in mca_btl_openib_add_procs

2014-06-03 Thread Ralph Castain
ssi wrote: > Please note that I had the problem with 13.1.0 but not with the 13.1.1 > > > On 28/05/2014 00:47, Ralph Castain wrote: >> On May 27, 2014, at 3:32 PM, Alain Miniussi wrote: >> >>> Unfortunately, the debug library works like a charm (which make the &

Re: [OMPI users] can't preload binary to remote machine

2014-06-03 Thread Ralph Castain
Sorry for delayed response - been a little hectic here. I suspect the problem is that we really need a passwordless ssh connection in order to preload the file for 1.6.5. This isn't required in the 1.8 series, so you might want to try it with 1.8.1. Otherwise, resolve the password issue and it

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-03 Thread Ralph Castain
Sounds odd - can you configure OMPI --enable-debug and run it again? If it fails and you can get a core dump, could you tell us the line number where it is failing? On Jun 3, 2014, at 9:58 AM, Fischer, Greg A. wrote: > Apologies – I forgot to add some of the information requested by the FAQ:

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-04 Thread Ralph Castain
--- > mpirun noticed that process rank 3 with PID 5845 on node 112 exited on > signal 11 (Segmentation fault). > -- > > Does any of that help? > > Greg > > From: users

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-04 Thread Ralph Castain
i-1.8.1/ompi/runtime/ompi_mpi_init.c:464 > #13 0x2b48f1760a37 in PMPI_Init (argc=0x2b48f6300020, > argv=0x2b48f63000b8) at pinit.c:84 > #14 0x004024ef in main (argc=1, argv=0x7fffebf0d1f8) at ring_c.c:19 > > From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph C

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-04 Thread Ralph Castain
ing_c > > This may not fix the problem, > but have you tried to add the shared memory btl to your mca parameter? > > mpirun -np 2 --mca btl openib,sm,self ring_c > > As far as I know, sm is the preferred transport layer for intra-node > communication. > > Gus Correa &

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-04 Thread Ralph Castain
" on your cmd line On Jun 4, 2014, at 9:58 AM, Ralph Castain wrote: > He isn't getting that far - he's failing in MPI_Init when the RTE attempts to > connect to the local daemon > > > On Jun 4, 2014, at 9:53 AM, Gus Correa wrote: > >> Hi Greg >> >&

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-04 Thread Ralph Castain
c:1645 > #10 0x2b82b16f8763 in orte_progress_thread_engine (obj=0x2b82b5300020) at > ../../../../openmpi-1.8.1/orte/mca/ess/base/ess_base_std_app.c:456 > #11 0x2b82b0f1c7b6 in start_thread () from /lib64/libpthread.so.0 > #12 0x2b82b1410d6d in clone () from /lib64/libc.so.6 &g

Re: [OMPI users] intermittent segfaults with openib on ring_c.c

2014-06-04 Thread Ralph Castain
users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain > Sent: Wednesday, June 04, 2014 4:48 PM > To: Open MPI Users > Subject: Re: [OMPI users] intermittent segfaults with openib on ring_c.c > > Urggg...unfortunately, the people who know the most about that code

Re: [OMPI users] [warn] Epoll ADD(1) on fd 0 failed

2014-06-05 Thread Ralph Castain
rt > --prefix=/mnt/data/users/dm2/vol3/semenov/_scratch/openmpi-1.8.1_mxm-3.0 > --with-mxm=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/ --with- > slurm --with-platform=contrib/platform/mellanox/optimized > > > Fri, 30 May 2014 07:09:54 -0700 от Ralph Ca

Re: [OMPI users] Compiling OpenMPI 1.8.1 for Cray XC30

2014-06-05 Thread Ralph Castain
I know Nathan has it running on the XC30, but I don't see a platform file specifically for it in the repo. Did you try the cray_xe6 platform files - I think he may have just augmented those to handle the XC30 case Look in contrib/platform/lanl/cray_xe6 On Jun 5, 2014, at 9:00 AM, Hammond, Simo
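
A hedged sketch of the configure invocation implied here, reusing the --with-platform mechanism that appears elsewhere in this archive; the install prefix is an assumption, and whether the xe6 platform file covers the XC30 unmodified is exactly the open question in this thread:

  ./configure --with-platform=contrib/platform/lanl/cray_xe6 \
              --prefix=$HOME/sw/openmpi-1.8.1
  make all install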

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-05 Thread Ralph Castain
On Jun 5, 2014, at 2:13 PM, Dan Dietz wrote: > Hello all, > > I'd like to bind 8 cores to a single MPI rank for hybrid MPI/OpenMP > codes. In OMPI 1.6.3, I can do: > > $ mpirun -np 2 -cpus-per-rank 8 -machinefile ./nodes ./hello > > I get one rank bound to procs 0-7 and the other bound to 8-
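
For reference, the 1.6-era -cpus-per-rank option maps onto the pe modifier in the 1.8 series (Ralph gives the syntax later in this archive); a sketch with the same machinefile:

  mpirun -np 2 --map-by slot:pe=8 -machinefile ./nodes ./hello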

Re: [OMPI users] openib segfaults with Torque

2014-06-05 Thread Ralph Castain
Hmmm...I'm not sure how that is going to run with only one proc (I don't know if the program is protected against that scenario). If you run with -np 2 -mca btl openib,sm,self, is it happy? On Jun 5, 2014, at 2:16 PM, Fischer, Greg A. wrote: > Here’s the command I’m invoking and the terminal

Re: [OMPI users] [warn] Epoll ADD(1) on fd 0 failed

2014-06-06 Thread Ralph Castain
due to the above mentioned problem? > > > Thu, 5 Jun 2014 07:45:01 -0700 от Ralph Castain : > FWIW: support for the --resv-ports option was deprecated and removed on the > OMPI side a long time ago. > > I'm not familiar enough with "oshrun" to know if it is do

Re: [OMPI users] openib segfaults with Torque

2014-06-06 Thread Ralph Castain
nf316:21583] [16] /lib64/libc.so.6(__libc_start_main+0xe6)[0x7f3b58301c36] > [binf316:21583] [17] ring_c[0x400889] > [binf316:21583] *** End of error message *** > -- > mpirun noticed that process rank 0 with PID 21583 on node 316 exited on > signal 6 (Abo

Re: [OMPI users] Determining what parameters a scheduler passes to OpenMPI

2014-06-06 Thread Ralph Castain
We currently only get the node and slots/node info from PBS - we don't get any task placement info at all. We then use the mpirun cmd options and built-in mappers to map the tasks to the nodes. I suppose we could do more integration in that regard, but haven't really seen a reason to do so - th

Re: [OMPI users] openib segfaults with Torque

2014-06-06 Thread Ralph Castain
Process 0 exiting > Process 1 exiting > > From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain > Sent: Friday, June 06, 2014 10:34 AM > To: Open MPI Users > Subject: Re: [OMPI users] openib segfaults with Torque > > Huh - how strange. I can't imag

Re: [OMPI users] Determining what parameters a scheduler passes to OpenMPI

2014-06-06 Thread Ralph Castain
irst >> 16 being host node0001 and the last 8 being node0002), it appears that 24 >> MPI tasks try to start on node0001 instead of getting distributed as 16 on >> node0001 and 8 on node0002. Hence, I am curious what is being passed by >> PBS. >> >> --john &g

Re: [OMPI users] Determining what parameters a scheduler passes to OpenMPI

2014-06-06 Thread Ralph Castain
On Jun 6, 2014, at 10:24 AM, Gus Correa wrote: > On 06/06/2014 01:05 PM, Ralph Castain wrote: >> You can always add --display-allocation to the cmd line to see what we >> thought we received. >> >> If you configure OMPI with --enable-debug, you can set --mca >&g
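
A sketch of the diagnostic flags mentioned above, with a hypothetical 24-rank job:

  # prints the allocation mpirun received from PBS, then the process-to-node map
  mpirun -np 24 --display-allocation --display-map ./a.out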

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-06 Thread Ralph Castain
04048f4 in main (argc=6, argv=0x7fff5bb2a3a8) at main.c:13 > > ddietz@conte-a009:/scratch/conte/d/ddietz/hello$ cat nodes > conte-a009 > conte-a009 > conte-a055 > conte-a055 > ddietz@conte-a009:/scratch/conte/d/ddietz/hello$ uname -r > 2.6.32-358.14.1.el6.x86_64 > > On

Re: [OMPI users] Determining what parameters a scheduler passes to OpenMPI

2014-06-06 Thread Ralph Castain
cheduler is passing along the correct slot count #s (16 and 8, resp). > > Am I running into a bug w/ OpenMPI 1.6? > > --john > > > > -Original Message- > From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain > Sent: Friday, June 06, 2014 1:30

Re: [OMPI users] --preload-binary does not work

2014-06-06 Thread Ralph Castain
Supposed to, yes - but I don't know how much testing it has seen. I can try to take a look On Jun 6, 2014, at 12:02 PM, E.O. wrote: > Hello > I am using OpenMPI ver 1.8.1 on a cluster of 4 machines. > One Redhat 6.2 and three busybox machine. They are all 64bit environment. > > I want to use -
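
A sketch of the intended usage with hypothetical host names; --preload-binary takes no argument and pushes the executable named on the command line out to the remote nodes before launch:

  mpirun -np 4 --host rhhost,bb1,bb2,bb3 --preload-binary ./mpi_hello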

Re: [OMPI users] Determining what parameters a scheduler passes to OpenMPI

2014-06-06 Thread Ralph Castain
You might want to update to 1.6.5, if you can - I'll see what I can find On Jun 6, 2014, at 12:07 PM, Sasso, John (GE Power & Water, Non-GE) wrote: > Version 1.6 (i.e. prior to 1.6.1) > > -Original Message- > From: users [mailto:users-boun...@open-mpi.org] On Be

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-06 Thread Ralph Castain
> [conte-a009:55685] Failing at address: 0x4c > [conte-a009:55685] [ 0] /lib64/libpthread.so.0[0x327f80f500] > [conte-a009:55685] [ 1] > /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib/libopen-rte.so.7(orte_plm_base_complete_setup+0x951)[0x2b5b069a50e1] > [conte-a009:55685

Re: [OMPI users] --preload-binary does not work

2014-06-06 Thread Ralph Castain
Yeah, it doesn't require ssh any more - but I haven't tested it in a bit, and so it's possible something crept in there. On Jun 6, 2014, at 12:27 PM, Reuti wrote: > Am 06.06.2014 um 21:04 schrieb Ralph Castain: > >> Supposed to, yes - but I don't know how muc

Re: [OMPI users] --preload-binary does not work

2014-06-06 Thread Ralph Castain
Okay, I found the problem and think I have a fix that I posted (copied EO on it). You are welcome to download the patch and try it. Scheduled for release in 1.8.2 Thanks Ralph On Jun 6, 2014, at 1:01 PM, Ralph Castain wrote: > Yeah, it doesn't require ssh any more - but I haven

Re: [OMPI users] --preload-binary does not work

2014-06-06 Thread Ralph Castain
y to each node beforehand of course). I guess this is a different > issue? > > Eiichi > > > eiichi > > > > On Fri, Jun 6, 2014 at 5:35 PM, Ralph Castain wrote: > Okay, I found the problem and think I have a fix that I posted (copied EO on > it). You are w

Re: [OMPI users] --preload-binary does not work

2014-06-06 Thread Ralph Castain
Looks like there is some strange interaction there, but I doubt I'll get around to fixing it soon unless someone has a burning reason to not use tree spawn when preloading binaries. I'll mark it down as something to look at as time permits. On Jun 6, 2014, at 4:28 PM, Ralph Cast

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-06 Thread Ralph Castain
0],1] >> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] >> plm:base:orted_report_launch from daemon [[24164,0],1] on node >> conte-a055 >> [conte-a009.rcac.purdue.edu:55685] [[24164,0],0] RECEIVED TOPOLOGY >> FROM NODE conte-a055 >> [conte-a009.rcac.purdue.edu:55685] [[24164

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-08 Thread Ralph Castain
f possible, thus providing better performance. Scheduled this for 1.8.2, asking Tetsuya to review. On Jun 6, 2014, at 6:25 PM, Ralph Castain wrote: > HmmmTetsuya is quite correct. Afraid I got distracted by the segfault > (still investigating that one). Our default policy for 2 proces

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-08 Thread Ralph Castain
1] > /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib/libopen-rte.so.7(orte_plm_base_complete_setup+0x951)[0x2b5b069a50e1] > [conte-a009:55685] [ 2] > /scratch/conte/d/ddietz/openmpi-1.8.1-debug/intel-14.0.2.144/lib/libopen-pal.so.6(opal_libevent2021_event_base_loop+0xa05)[0x2b

Re: [OMPI users] Determining what parameters a scheduler passes to OpenMPI

2014-06-08 Thread Ralph Castain
Sorry about the comment re cpus-per-proc - confused this momentarily with another user also using Torque. I confirmed that this works fine with 1.6.5, and would guess you are hitting some bug in 1.6.0. Can you update? On Jun 6, 2014, at 12:20 PM, Ralph Castain wrote: > You might want

Re: [OMPI users] orted 1.6.4 and 1.8.1 segv with bonded Cisco P81E

2014-06-09 Thread Ralph Castain
On Jun 9, 2014, at 2:41 PM, Vineet Rawat wrote: > Hi, > > We've deployed OpenMPI on a small cluster but get a SEGV in orted. Debug > information is very limited as the cluster is at a remote customer site. They > have a network card with which I'm not familiar (Cisco Systems Inc VIC P81E > P

Re: [OMPI users] orted 1.6.4 and 1.8.1 segv with bonded Cisco P81E

2014-06-09 Thread Ralph Castain
There is one new "feature" in 1.8 - it now checks to see if the version on the backend matches the version on the frontend. In other words, mpirun checks to see if the orted connecting to it is from the same version - if not, the orted will die. Shouldn't segfault, though - just abort. You cou
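
A quick way to check for the front-end/back-end mismatch described above, assuming a hypothetical remote node name and install path:

  ompi_info | grep "Open MPI:"                              # version mpirun will use
  ssh node01 /opt/openmpi/bin/ompi_info | grep "Open MPI:"  # version the remote orted will use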

Re: [OMPI users] [warn] Epoll ADD(1) on fd 0 failed

2014-06-10 Thread Ralph Castain
read change > was %d; write change was %d]", > > > > On Fri, Jun 6, 2014 at 3:38 PM, Ralph Castain wrote: > Possible - honestly don't know > > On Jun 6, 2014, at 12:16 AM, Timur Ismagilov wrote: > >> Sometimes, after termination of the

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-11 Thread Ralph Castain
te-a084:51113] [ 6] mpirun[0x404719] > [conte-a084:51113] *** End of error message *** > Segmentation fault (core dumped) > > On Sun, Jun 8, 2014 at 4:54 PM, Ralph Castain wrote: >> I'm having no luck poking at this segfault issue. For some strange reason, >> we s

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-11 Thread Ralph Castain
44 AM, Dan Dietz wrote: > Sorry - it crashes with both torque and rsh launchers. The output from > a gdb backtrace on the core files looks identical. > > Dan > > On Wed, Jun 11, 2014 at 9:37 AM, Ralph Castain wrote: >> Afraid I'm a little confused now - are you say

Re: [OMPI users] openib segfaults with Torque

2014-06-11 Thread Ralph Castain
Yeah, I think we've seen that somewhere before too... On Jun 11, 2014, at 2:59 PM, Joshua Ladd wrote: > Agreed. The problem is not with UDCM. I don't think something is wrong with > the system. I think his Torque is imposing major constraints on the maximum > size that can be locked into memo

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-12 Thread Ralph Castain
it solves the problem? On Jun 11, 2014, at 2:15 PM, Ralph Castain wrote: > Okay, let me poke around some more. It is clearly tied to the coprocessors, > but I'm not yet sure just why. > > One thing you might do is try the nightly 1.8.2 tarball - there have been a > numb

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-12 Thread Ralph Castain
for release isn't far off. Maybe not end June as last-minute things always surface in the RC process, but likely early July. On Jun 12, 2014, at 8:11 AM, Bennet Fauber wrote: > On Thu, Jun 12, 2014 at 10:56 AM, Ralph Castain wrote: >> I've poked and prodded, and the 1.8.2 tar

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-12 Thread Ralph Castain
rs to be crashing in a similar > fashion. :-( I used the latest snapshot 1.8.2a1r31981. > > Dan > > On Thu, Jun 12, 2014 at 10:56 AM, Ralph Castain wrote: >> I've poked and prodded, and the 1.8.2 tarball seems to be handling this >> situation just fine. I don't h

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-12 Thread Ralph Castain
Kewl - thanks! I'm a Purdue alum, if that helps :-) On Jun 12, 2014, at 9:04 AM, Dan Dietz wrote: > That shouldn't be a problem. Let me figure out the process and I'll > get back to you. > > Dan > > On Thu, Jun 12, 2014 at 11:50 AM, Ralph Castain wrote: >

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-16 Thread Ralph Castain
wrote: > That shouldn't be a problem. Let me figure out the process and I'll > get back to you. > > Dan > > On Thu, Jun 12, 2014 at 11:50 AM, Ralph Castain wrote: >> Arggh - is there any way I can get access to this beast so I can debug this? >> I can't

Re: [OMPI users] how to get mpirun to scale from 16 to 64 cores

2014-06-16 Thread Ralph Castain
Well, for one, there is never any guarantee of linear scaling with the number of procs - that is very application dependent. You can actually see performance decrease with number of procs if the application doesn't know how to exploit them. One thing that stands out is your mapping and binding

Re: [OMPI users] how to get mpirun to scale from 16 to 64 cores

2014-06-17 Thread Ralph Castain
Well, yes and no. Besides the real question would be if this app, which this person didn't write, was written as a threaded application. On Jun 16, 2014, at 8:32 PM, Zehan Cui wrote: > Hi Yuping, > > Maybe using multi-threads inside a socket, and MPI among sockets is better > choice for such

Re: [OMPI users] how to get mpirun to scale from 16 to 64 cores

2014-06-17 Thread Ralph Castain
e_timestep_loop --animation_freq -1 > > I run above command, still do not improve. Would you give me a detailed > command with options? > Thank you. > > Best regards, > > Yuping > > > -------- > On Tue, 6/17/14, Ralph Castain w

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-18 Thread Ralph Castain
The default binding option depends on the number of procs - it is bind-to core for np=2, and bind-to socket for np > 2. You never said, but should I assume you ran 4 ranks? If so, then we should be trying to bind-to socket. I'm not sure what your cpuset is telling us - are you binding us to a so
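
One way to see what was actually done is --report-bindings (the "MCW rank ... is not bound" lines quoted later in this thread come from that flag); a sketch with four ranks:

  mpirun -np 4 --report-bindings ./a.out
  # with no cpuset in the way, np=2 should bind to cores and np>2 to sockets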

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-19 Thread Ralph Castain
on this node. > > Brock Palen > www.umich.edu/~brockp > CAEN Advanced Computing > XSEDE Campus Champion > bro...@umich.edu > (734)936-1985 > > > > On Jun 18, 2014, at 11:10 PM, Ralph Castain wrote: > >> The default binding option depends on the number of procs -

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-20 Thread Ralph Castain
e > ether/or depending on the job. > > Brock Palen > www.umich.edu/~brockp > CAEN Advanced Computing > XSEDE Campus Champion > bro...@umich.edu > (734)936-1985 > > > > On Jun 19, 2014, at 2:44 PM, Ralph Castain wrote: > >> Sorry, I should have bee

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-20 Thread Ralph Castain
all >> available processors) >> [nyx5775.engin.umich.edu:27951] MCW rank 16 is not bound (or bound to all >> available processors) >> [nyx5597.engin.umich.edu:68073] MCW rank 29 is not bound (or bound to all >> available processors) >> [nyx5597.engin.umich.edu:

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-20 Thread Ralph Castain
ig file. How can I make > this the default, that then a user can override? > > Brock Palen > www.umich.edu/~brockp > CAEN Advanced Computing > XSEDE Campus Champion > bro...@umich.edu > (734)936-1985 > > > > On Jun 20, 2014, at 1:21 PM, Ralph Castain wrote: >

Re: [OMPI users] btl_openib_connect_oob.c:867:rml_recv_cb error after Infini-band stack update.

2014-06-20 Thread Ralph Castain
What was updated? If the OS, did you remember to set the memory registration limits to max? On Jun 20, 2014, at 11:25 AM, Ivanov, Aleksandar (INR) wrote: > > Dear Sir or Madam, > > I am using the openmpi 1.6.5 library compiled with IFORT / ICC 13.1.5. Since > a recent update of our machi

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-24 Thread Ralph Castain
ssonnea...@calculquebec.ca> wrote: > Hi, > I've been following this thread because it may be relevant to our setup. > > Is there a drawback of having orte_hetero_nodes=1 as default MCA parameter > ? Is there a reason why the most generic case is not assumed ? > > Maxim

Re: [OMPI users] affinity issues under cpuset torque 1.8.1

2014-06-24 Thread Ralph Castain
ger. And yes we don't have the PHI stuff > installed on all nodes, strange that 'all all' is now very short, > ompi_info -a still works though. > > > > Brock Palen > www.umich.edu/~brockp > CAEN Advanced Computing > XSEDE Campus Champion > bro...@umich.edu

Re: [OMPI users] Compiling OpenMPI for Intel Xeon Phi/MIC

2014-06-26 Thread Ralph Castain
I'm on the road today, but will be back tomorrow afternoon (US Pacific time) and can forward my notes on this again. In the interim, just go to our user mailing list archives and search for "phi" and you'll see the conversations. Basically, you have to cross-compile OMPI to run on the Phi. I've be

Re: [OMPI users] OpenMPI 1.8.1 runs more OpenMP Threads on the same core

2014-06-27 Thread Ralph Castain
You should add this to your cmd line: --map-by core:pe=4 This will bind each process to 4 cores Sent from my iPhone > On Jun 27, 2014, at 5:22 AM, Luigi Santangelo > wrote: > > Hi all, > My system is a 64 core, with Debian 3.2.57 64 bit, GNU gcc 4.7, kernel Linux > 3.2.0 and OpenMPI 1.8.1.

Re: [OMPI users] Problem moving from 1.4 to 1.6

2014-06-27 Thread Ralph Castain
Let me steer you on a different course. Can you run "ompi_info" and paste the output here? It looks to me like someone installed a version that includes uDAPL support, so you may have to disable some additional things to get it to run. On Jun 27, 2014, at 9:53 AM, Jeffrey A Cummings wrote:

Re: [OMPI users] Missing -enable-crdebug option in configure step

2014-06-30 Thread Ralph Castain
I don't recall ever seeing such an option in Open MPI - what makes you believe it should exist? On Jun 29, 2014, at 9:25 PM, Đỗ Mai Anh Tú wrote: > Hi all, > > I am trying to run the checkpoint/restart enabled debugging code in Open > MPI. This requires configure this option at the set up step

Re: [OMPI users] Problem moving from 1.4 to 1.6

2014-06-30 Thread Ralph Castain
0, API v2.0, Component v1.6.2) > MCA ess: tool (MCA v2.0, API v2.0, Component v1.6.2) > MCA grpcomm: bad (MCA v2.0, API v2.0, Component v1.6.2) > MCA grpcomm: basic (MCA v2.0, API v2.0, Component v1.6.2) > MCA notifier: command (MCA v2.0,

Re: [OMPI users] Problem moving from 1.4 to 1.6

2014-06-30 Thread Ralph Castain
> jeffrey.a.cummi...@aero.org > > > > From:Ralph Castain > To:Open MPI Users , > Date:06/30/2014 02:13 PM > Subject:Re: [OMPI users] Problem moving from 1.4 to 1.6 > Sent by:"users" > > > > Yeah, this

Re: [OMPI users] openMP and mpi problem

2014-07-02 Thread Ralph Castain
OMPI started binding by default during the 1.7 series. You should add the following to your cmd line: --map-by :pe=$OMP_NUM_THREADS This will give you a dedicated core for each thread. Alternatively, you could instead add --bind-to socket OMPI 1.5.5 doesn't bind at all unless directed to do s
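
A sketch of the two alternatives, assuming four OpenMP threads per rank and a hypothetical hybrid binary:

  export OMP_NUM_THREADS=4
  # one dedicated core per thread
  mpirun -np 2 --map-by slot:pe=$OMP_NUM_THREADS ./hybrid_app
  # or, more coarsely, give each rank a whole socket
  mpirun -np 2 --bind-to socket ./hybrid_app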

Re: [OMPI users] mpi prorg fails (big data)

2014-07-02 Thread Ralph Castain
I would suggest having him look at the core file with a debugger and see where it fails. Sounds like he has a memory corruption problem. On Jun 24, 2014, at 3:31 AM, Dr.Peer-Joachim Koch wrote: > Hi, > > one of our cluster users reported a problem with openmpi. > He created a short sample (ju

Re: [OMPI users] openMP and mpi problem

2014-07-02 Thread Ralph Castain
d 0.021 sec > Do i have utility similar to the 'top' with sbatch? > > Also, every time, i got the message in ompi 1.9: > mca: base: components_register: component sbgp / ibnet register function > failed > Is it bad? > > Regards, > Timur > > Wed, 2 J

Re: [OMPI users] openMP and mpi problem

2014-07-03 Thread Ralph Castain
f the available mappers was able to perform the requested > mapping operation. This can happen if you request a map type > (e.g., loadbalance) and the corresponding mapper was not built. > ... > > > > Wed, 2 Jul 2014 07:36:48 -0700 от Ralph Castain : > Let's keep this on

Re: [OMPI users] mpi_comm_spawn question

2014-07-03 Thread Ralph Castain
Unfortunately, that has never been supported. The problem is that the embedded mpirun picks up all those MCA params that were provided to the original application process, and gets hopelessly confused. We have tried in the past to figure out a solution, but it has proved difficult to separate th

Re: [OMPI users] openMP and mpi problem

2014-07-04 Thread Ralph Castain
component sbgp / ibnet > register function failed > Main 21.366504 secs total /1 > Computation 21.048671 secs total /1000 > [node1-128-29:21569] mca: base: close: unloading component lama > [node1-128-29:21569] mca: base: close: component mindist closed > [node1-128-29:21569] mca: bas

Re: [OMPI users] openMP and mpi problem

2014-07-04 Thread Ralph Castain
'{print $2" slots="$1}' > $HOSTFILE || { rm > -f $HOSTFILE; exit 255; } > LD_PRELOAD=/mnt/data/users/dm2/vol3/semenov/_scratch/mxm/mxm-3.0/lib/libmxm.so > mpirun -x LD_PRELOAD -x MXM_SHM_KCOPY_MODE=off --map-by slot:pe=8 --mca > rmaps_base_verbose 20 --hostfile $HOS

Re: [OMPI users] All MPI processes have rank 0

2014-07-07 Thread Ralph Castain
With that little info, no - you haven't told us anything. How are you running this "rank test", how was OMPI configured, etc? On Jul 7, 2014, at 6:21 AM, Alexander Frolov wrote: > Hi! > > I am running MPI rank test using srun and all processes think that they are > rank 0. > > * slurm 14.11

Re: [OMPI users] All MPI processes have rank 0

2014-07-07 Thread Ralph Castain
olo/local/openmpi-1.8.1-gcc-4.8.2 --with-openib > --enable-mpi-thread-multiple CC=/local/usr/local/bin/gcc > CXX=/local/usr/local/bin/g++ > --- > slurm configured as follows: > > ./configure --prefix=/home/frolo/local/slurm > > (I'm running it as a user) > --- >

Re: [OMPI users] Can't run with more than two nodes in the hostfile

2014-07-14 Thread Ralph Castain
During the 1.7 series and for all follow-on series, OMPI changed to a mode where it launches a daemon on all allocated nodes at the startup of mpirun. This allows us to determine the hardware topology of the nodes and take that into account when mapping. You can override that behavior by either

Re: [OMPI users] Can't run with more than two nodes in the hostfile

2014-07-14 Thread Ralph Castain
gt; > and while > > /opt/openmpi/bin/mpirun -host host18 ompi_info > > works > > /opt/openmpi/bin/mpirun --mca plm_rsh_no_tree_spawn 1 -host host18 ompi_info > > hangs, is there some condition in the use of this parameter. > > Yours truly > > Ricardo

Re: [OMPI users] Can't run with more than two nodes in the hostfile

2014-07-14 Thread Ralph Castain
Hmmm...no, it worked just fine for me. It sounds like something else is going on. Try configuring OMPI with --enable-debug, and then add -mca plm_base_verbose 10 to get a better sense of what is going on. On Jul 14, 2014, at 10:27 AM, Ralph Castain wrote: > I confess I haven'
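
Putting the two suggestions together, a sketch assuming a debug build prefix and the three-host hostfile that triggers the hang:

  ./configure --enable-debug --prefix=$HOME/ompi-debug && make all install
  mpirun -np 3 --hostfile ./hosts -mca plm_base_verbose 10 ompi_info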

Re: [OMPI users] Can't run with more than two nodes in the hostfile

2014-07-15 Thread Ralph Castain
> [nexus10.nlroc:27438] mca:base:select:( plm) Skipping component [slurm]. > Query failed to return a module > [nexus10.nlroc:27438] mca:base:select:( plm) Selected component [rsh] > [nexus10.nlroc:27438] mca: base: close: component isolated closed > [nexus10.nlroc:27438] mca: base:

Re: [OMPI users] Can't run with more than two nodes in the hostfile

2014-07-15 Thread Ralph Castain
Forgive me, but I am now fully confused - case 1 and case 3 appear identical to me, except for the debug-daemons flag on case 3. On Jul 15, 2014, at 7:56 AM, Ricardo Fernández-Perea wrote: > What I mean with "another mpi process". > I have 4 nodes where there is process that use mpi and whe

Re: [OMPI users] latest stable and win7/msvc2013

2014-07-16 Thread Ralph Castain
Given that a number of Windows components and #if protections were removed in the 1.7/1.8 series, I very much doubt this will build or work. Are you intending to try and recreate that code? Otherwise, there is a port to cygwin available from that community. On Jul 16, 2014, at 8:52 AM, MM wrot

Re: [OMPI users] latest stable and win7/msvc2013

2014-07-17 Thread Ralph Castain
FWIW: now that I have access to the Intel compiler suite, including for Windows, I've been toying with creating a more stable support solution for OMPI on Windows. It's a low-priority task for me because it isn't clear that we have very many Windows users in HPC land, and the cygwin port already

Re: [OMPI users] latest stable and win7/msvc2013

2014-07-17 Thread Ralph Castain
Guess we just haven't seen that much activity on the OMPI list, even when we did support Windows. I'll see what I can do - I think this new method will allow it to be stable and require far less maintenance than what we had before. On Jul 17, 2014, at 10:42 AM, Jed Brown wrote: > Rob Latham

Re: [OMPI users] latest stable and win7/msvc2013

2014-07-17 Thread Ralph Castain
Yeah, but I'm cheap and get the Intel compilers for free :-) On Jul 17, 2014, at 11:38 AM, Rob Latham wrote: > > > On 07/17/2014 01:19 PM, Jed Brown wrote: >> Damien writes: >> >>> Is this something that could be funded by Microsoft, and is it time to >>> approach them perhaps? MS MPI is b

Re: [OMPI users] latest stable and win7/msvc2013

2014-07-17 Thread Ralph Castain
this stage. On Jul 17, 2014, at 11:52 AM, Jed Brown wrote: > Ralph Castain writes: > >> Yeah, but I'm cheap and get the Intel compilers for free :-) > > Fine for you, but not for the people trying to integrate your library > in a stack developed using MSVC.

Re: [OMPI users] Incorrect escaping of OMPI_MCA environment variables with spaces (for rsh?)

2014-07-18 Thread Ralph Castain
I'm not exactly sure how to fix what you described. The semicolon is escaped because otherwise the cmd line would think it had been separated - the orted cmd line is ssh'd to the remote node and cannot include an unescaped terminator. The space isn't a "special character" in that sense, and actu

Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots

2014-07-19 Thread Ralph Castain
That's a pretty old OMPI version, and we don't really support it any longer. However, I can provide some advice: * have you tried running the simple "hello_c" example we provide? This would at least tell you if the problem is in your app, which is what I'd expect given your description * try u
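
The hello_c example ships in the examples/ directory of the tarball; a minimal sketch, with the slot count chosen to cross the 28-slot threshold and the hostfile path assumed:

  cd openmpi-1.5.4/examples && make hello_c   # or: mpicc hello_c.c -o hello_c
  mpirun -np 32 --hostfile ./hosts ./hello_c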

Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots

2014-07-19 Thread Ralph Castain
er > values > be necessary? > > Thank you for your help. > > -Bill Lane > > From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain > [r...@open-mpi.org] > Sent: Saturday, July 19, 2014 8:07 AM > To: Open MPI Users > Subject: Re: [OMPI users] Mpir

Re: [OMPI users] after mpi_finalize(Err)

2014-07-20 Thread Ralph Castain
On Jul 20, 2014, at 7:11 AM, Diego Avesani wrote: > Dear all, > I have a question about mpi_finalize. > > After mpi_finalize the program returs single core, Have I understand > correctly? No - we don't kill any processes. We just tear down the MPI system. All your processes continue to execu
