Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-16 Thread Ralph Castain
Just to wrap this up for the user list: this has now been fixed and added to 1.8.2 in the nightly tarball. The problem proved to be an edge case in which a partial allocation combined with the presence of coprocessors hit a slightly different code path. On Jun 12, 2014, at 9:04 AM, Dan Dietz

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-12 Thread Ralph Castain
Kewl - thanks! I'm a Purdue alum, if that helps :-) On Jun 12, 2014, at 9:04 AM, Dan Dietz wrote: > That shouldn't be a problem. Let me figure out the process and I'll > get back to you. > > Dan > > On Thu, Jun 12, 2014 at 11:50 AM, Ralph Castain wrote:

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-12 Thread Dan Dietz
That shouldn't be a problem. Let me figure out the process and I'll get back to you. Dan On Thu, Jun 12, 2014 at 11:50 AM, Ralph Castain wrote: > Arggh - is there any way I can get access to this beast so I can debug this? > I can't figure out what in the world is going on,

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-12 Thread Ralph Castain
Arggh - is there any way I can get access to this beast so I can debug this? I can't figure out what in the world is going on, but it seems to be something triggered by your specific setup. On Jun 12, 2014, at 8:48 AM, Dan Dietz wrote: > Unfortunately, the nightly tarball

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-12 Thread Dan Dietz
Unfortunately, the nightly tarball appears to be crashing in a similar fashion. :-( I used the latest snapshot 1.8.2a1r31981. Dan On Thu, Jun 12, 2014 at 10:56 AM, Ralph Castain wrote: > I've poked and prodded, and the 1.8.2 tarball seems to be handling this > situation just

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-12 Thread Ralph Castain
It isn't a development tarball - it's the current state of the release branch and is therefore managed much more strictly than the developer trunk. We are preparing it now for a release candidate. I have about a dozen CMRs waiting for final review before moving across to 1.8.2, and then we'll

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-12 Thread Bennet Fauber
On Thu, Jun 12, 2014 at 10:56 AM, Ralph Castain wrote:
> I've poked and prodded, and the 1.8.2 tarball seems to be handling this
> situation

Ralph,

That's still the development tarball, right? 1.8.2 remains unreleased? Is the ETA for 1.8.2 the end of this month?

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-12 Thread Ralph Castain
I've poked and prodded, and the 1.8.2 tarball seems to be handling this situation just fine. I don't have access to a Torque machine, but I did set everything to follow the same code path, added faux coprocessors, etc. - and it ran just fine. Can you try the 1.8.2 tarball and see if it solves

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-11 Thread Ralph Castain
Okay, let me poke around some more. It is clearly tied to the coprocessors, but I'm not yet sure just why. One thing you might do is try the nightly 1.8.2 tarball - there have been a number of fixes, and this may well have been caught there. Worth taking a look. On Jun 11, 2014, at 6:44 AM,
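
For anyone who wants to try the same thing, fetching and building a nightly snapshot is the standard configure/make sequence. A minimal sketch, assuming the v1.8 nightly area on open-mpi.org and a placeholder snapshot name (the actual tarball name changes every night):

  $ wget http://www.open-mpi.org/nightly/v1.8/openmpi-1.8.2a1rNNNNN.tar.bz2
  $ tar xjf openmpi-1.8.2a1rNNNNN.tar.bz2
  $ cd openmpi-1.8.2a1rNNNNN
  $ ./configure --prefix=$HOME/ompi-nightly
  $ make -j 8 install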

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-11 Thread Dan Dietz
Sorry - it crashes with both the Torque and rsh launchers. The output from a gdb backtrace on the core files looks identical. Dan On Wed, Jun 11, 2014 at 9:37 AM, Ralph Castain wrote: > Afraid I'm a little confused now - are you saying it works fine under Torque, > but segfaults
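
For reference, the backtrace itself comes from pointing gdb at the binary and a core file; a minimal sketch, assuming the core was dumped by ./hello and carries the usual pid suffix (the actual filename will differ):

  $ gdb ./hello core.51113
  (gdb) bt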

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-11 Thread Ralph Castain
Afraid I'm a little confused now - are you saying it works fine under Torque, but segfaults under rsh? Could you please clarify your current situation? On Jun 11, 2014, at 6:27 AM, Dan Dietz wrote: > It looks like it is still segfaulting with the rsh launcher: > >

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-11 Thread Dan Dietz
It looks like it is still segfaulting with the rsh launcher:

ddietz@conte-a084:/scratch/conte/d/ddietz/hello$ mpirun -mca plm rsh -np 4 -machinefile ./nodes ./hello
[conte-a084:51113] *** Process received signal ***
[conte-a084:51113] Signal: Segmentation fault (11)
[conte-a084:51113] Signal

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-09 Thread Dan Dietz
Ack - that was my fault. Too early on a Monday morning. This seems to work perfectly when I correctly submit a job! Thanks! Dan On Mon, Jun 9, 2014 at 9:34 AM, Dan Dietz wrote: > Yes, you're exactly right - this system has 2 Phi cards per node. I > believe the "PCI 8086"

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-09 Thread Dan Dietz
Yes, you're exactly right - this system has 2 Phi cards per node. I believe the "PCI 8086" devices in the lstopo output are the Phi cards. Possibly related, we've observed a weird bug with Torque and the allocation it provides when you request the Phis. When requesting them, you get a nodefile with only 1

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-08 Thread tmishima
Providing a default setting for the pe modifier is a good idea. Okay, I can take a look and review it, but I'm a bit busy now, so please give me a few days. Regards, Tetsuya > Okay, I revised the command line option to be a little more user-friendly. You can now specify the equivalent of the old

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-08 Thread Ralph Castain
I'm having no luck poking at this segfault issue. For some strange reason, we seem to think there are coprocessors on those remote nodes - e.g., a Phi card. Yet your lstopo output doesn't seem to show it. Out of curiosity, can you try running this with "-mca plm rsh"? This will substitute the

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-08 Thread Ralph Castain
Okay, I revised the command line option to be a little more user-friendly. You can now specify the equivalent of the old --cpus-per-proc as just "--map-by :pe=N", leaving the mapping policy set as the default. We will default to NUMA so the cpus will all be in the same NUMA region, if possible,
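
In other words, for the two-rank example from earlier in the thread, the old and new spellings would look roughly like this (a sketch; ./nodes and ./hello are the files from the original post):

  # 1.6 series
  $ mpirun -np 2 -cpus-per-proc 8 -machinefile ./nodes ./hello
  # 1.8 series, default mapping policy
  $ mpirun -np 2 --map-by :pe=8 -machinefile ./nodes ./hello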

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-06 Thread Ralph Castain
Hmmm... Tetsuya is quite correct. Afraid I got distracted by the segfault (still investigating that one). Our default policy for 2 processes is to map-by core, and that would indeed fail when cpus-per-proc > 1. However, that seems like a non-intuitive requirement, so let me see if I can make

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-06 Thread tmishima
Hi Dan,

Please try:

mpirun -np 2 --map-by socket:pe=8 ./hello

or

mpirun -np 2 --map-by slot:pe=8 ./hello

You cannot bind 8 cpus to the object "core", which has only one cpu. This limitation started with the 1.8 series. The object "socket" has 8 cores in your case, so you can do it. And, the
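
Adding --report-bindings is an easy way to confirm the result; a sketch using the first suggestion (each rank should then report a binding that covers all 8 cores of its socket):

  $ mpirun -np 2 --map-by socket:pe=8 --report-bindings ./hello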

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-06 Thread Ralph Castain
Okay, I'll poke into this - thanks!

On Jun 6, 2014, at 12:48 PM, Dan Dietz wrote:

> No problem -
>
> These are model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz chips.
> 2 per node, 8 cores each. No threading enabled.
>
> $ lstopo
> Machine (64GB)
>   NUMANode L#0 (P#0

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-06 Thread Dan Dietz
No problem -

These are model name : Intel(R) Xeon(R) CPU E5-2670 0 @ 2.60GHz chips. 2 per node, 8 cores each. No threading enabled.

$ lstopo
Machine (64GB)
  NUMANode L#0 (P#0 32GB)
    Socket L#0 + L3 L#0 (20MB)
      L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-06 Thread Ralph Castain
Sorry to pester with questions, but I'm trying to narrow down the issue.

* What kind of chips are on these machines?
* If they have h/w threads, are they enabled?
* You might have lstopo on one of those machines - could you pass along its output? Otherwise, you can run a simple "mpirun -n 1

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-06 Thread Dan Dietz
Thanks for the reply. I tried out the --display-allocation option with several different combinations and have attached the output. I see this behavior on RHEL 6.4, RHEL 6.5, and RHEL 5.10 clusters. Here's debugging info on the segfault. Does that help? FWIW this does not seem to crash on the
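
For anyone reproducing this, --display-allocation prints the node list mpirun believes it was given before anything launches, which makes it easy to compare the Torque and rsh cases; a sketch combining it with the binding options from this thread:

  $ mpirun --display-allocation -np 2 --map-by socket:pe=8 ./hello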

Re: [OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-05 Thread Ralph Castain
On Jun 5, 2014, at 2:13 PM, Dan Dietz wrote:

> Hello all,
>
> I'd like to bind 8 cores to a single MPI rank for hybrid MPI/OpenMP
> codes. In OMPI 1.6.3, I can do:
>
> $ mpirun -np 2 -cpus-per-rank 8 -machinefile ./nodes ./hello
>
> I get one rank bound to procs 0-7 and

[OMPI users] Bind multiple cores to rank - OpenMPI 1.8.1

2014-06-05 Thread Dan Dietz
Hello all,

I'd like to bind 8 cores to a single MPI rank for hybrid MPI/OpenMP codes. In OMPI 1.6.3, I can do:

$ mpirun -np 2 -cpus-per-rank 8 -machinefile ./nodes ./hello

I get one rank bound to procs 0-7 and the other bound to 8-15. Great! But I'm having some difficulties doing this with
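
For context, the ./hello binary used throughout the thread is never shown in the archive; a minimal hybrid MPI/OpenMP stand-in that also reveals where each thread lands might look like the sketch below (hypothetical code; sched_getcpu() is Linux-specific):

  #define _GNU_SOURCE  /* for sched_getcpu() */
  #include <mpi.h>
  #include <omp.h>
  #include <sched.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);

      int rank;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* Each OpenMP thread prints the CPU it is currently running on,
         which makes the effect of the binding options easy to see. */
      #pragma omp parallel
      {
          printf("rank %d thread %d on cpu %d\n",
                 rank, omp_get_thread_num(), sched_getcpu());
      }

      MPI_Finalize();
      return 0;
  }

Built with something like "mpicc -fopenmp hello.c -o hello", a correctly bound run should show each rank's threads confined to that rank's 8 assigned cores.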