Re: [OMPI devel] 1.8.3 and PSM errors

2014-11-10 Thread Friedley, Andrew
It's Intel Fabric Suite, a software distribution for True Scale that contains drivers, an InfiniBand stack, PSM, and more: https://downloadcenter.intel.com/Detail_Desc.aspx?DwnldID=24129&lang=eng Andrew > -Original Message- > From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Adr

Re: [OMPI devel] mpirun does not honor rankfile

2014-11-10 Thread Tom Wurgler
Sure it would help. I'll test it whenever you're ready. thanks! From: devel on behalf of Ralph Castain Sent: Monday, November 10, 2014 4:15 PM To: Open MPI Developers Subject: Re: [OMPI devel] mpirun does not honor rankfile Here’s what I can do. It looks like

Re: [OMPI devel] 1.8.3 and PSM errors

2014-11-10 Thread Adrian Reber
What is IFS? On Mon, Nov 10, 2014 at 09:12:41PM +, Friedley, Andrew wrote: > Hi Adrian, > > Yes, I suggest trying either RH support or Intel's support at > ibsupp...@intel.com. They might have seen this problem before. Since you're > running the RHEL versions of PSM and related software,

Re: [OMPI devel] mpirun does not honor rankfile

2014-11-10 Thread Ralph Castain
Here’s what I can do. It looks like LSF is providing PU numbers instead of core numbers, probably to get around the uniqueness issue. This works fine for me, as I can just look up the correct PU number and then bind you to the core that includes that PU. It means you have to provide the physical r
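
The lookup described in this message (take a physical PU number, find the core that contains it) can be sketched roughly as follows. This is an illustration only, not Open MPI's code: the TOPOLOGY table and the core_for_pu helper are hypothetical names, the three core/PU pairs are taken from the lstopo excerpt quoted elsewhere in this thread, and placing them on socket 0 is an assumption.

```python
# Hypothetical topology table: (socket, core) -> physical PU indexes on that core.
# The core/PU pairs follow the lstopo excerpt in this thread (Core P#0 + PU P#0,
# Core P#1 + PU P#4, Core P#2 + PU P#8); socket 0 is an assumption. A real tool
# would query hwloc instead of hard-coding this table.
TOPOLOGY = {
    (0, 0): [0],
    (0, 1): [4],
    (0, 2): [8],
}


def core_for_pu(pu, topology=TOPOLOGY):
    """Return the (socket, core) pair whose PU list contains the given PU."""
    for core, pus in topology.items():
        if pu in pus:
            return core
    raise LookupError(f"PU P#{pu} not found in topology")


if __name__ == "__main__":
    # A rankfile entry with slot=4 would resolve to core 1 on socket 0 here.
    print(core_for_pu(4))
```

Keying the table by (socket, core) rather than by core alone keeps the result unambiguous on machines where the same physical core index appears on more than one socket.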

Re: [OMPI devel] 1.8.3 and PSM errors

2014-11-10 Thread Friedley, Andrew
Hi Adrian, Yes, I suggest trying either RH support or Intel's support at ibsupp...@intel.com. They might have seen this problem before. Since you're running the RHEL versions of PSM and related software, one thing you could try is IFS. I think I was running IFS 7.3.0, so that's a difference

Re: [OMPI devel] 1.8.3 and PSM errors

2014-11-10 Thread Adrian Reber
Andrew, thanks for looking into this. I was able to reproduce this error on RHEL 7 with PSM provided by RHEL: infinipath-psm-3.2-2_ga8c3e3e_open.2.el7.x86_64 infinipath-psm-devel-3.2-2_ga8c3e3e_open.2.el7.x86_64 $ mpirun -np 32 mpi_test_suite -t "environment" mpi_test_suite:4877 terminated wit

Re: [OMPI devel] mpirun does not honor rankfile

2014-11-10 Thread Tom Wurgler
On all but the 2 machines with the newer bios (just the first socket): mach1:~ # lstopo -p --of console NUMANode P#0 (12GB) + L3 (5118KB) L2 (512KB) + L1 (64KB) + Core P#0 + PU P#0 L2 (512KB) + L1 (64KB) + Core P#1 + PU P#4 L2 (512KB) + L1 (64KB) + Core P#2 + PU P#8 L2

Re: [OMPI devel] mpirun does not honor rankfile

2014-11-10 Thread Ralph Castain
So a key point here is that PU in lstopo output equates to hyperthread when hyperthreads are enabled, and those are always uniquely numbered. On my (admittedly puny by comparison) dual-socket Nehalem box, I get this for physical: $ lstopo -p --of console Machine (16GB) NUMANode P#0 (8127MB) +

Re: [OMPI devel] mpirun does not honor rankfile

2014-11-10 Thread Tom Wurgler
If we run > lstopo --output-format fig we get a diagram of the socket/NUMA/core layouts, and all but those 2 give "PU P#0", "PU P#4", "PU P#8" in the smallest box, and in the lower left corner it says "physical". If we then add an option > lstopo --logical --output-format fig we get PU L#0,

Re: [OMPI devel] mpirun does not honor rankfile

2014-11-10 Thread Ralph Castain
Hmmm….and those are, of course, intended to be physical core numbers. I wonder how they are numbering them? The OS index won’t be unique, which is what is causing us trouble, so they must have some way of translating them to provide a unique number. > On Nov 10, 2014, at 10:42 AM, Tom Wurgler

Re: [OMPI devel] mpirun does not honor rankfile

2014-11-10 Thread Tom Wurgler
LSF gives this, for example, over which we (LSF users) have no control. rank 0=mach1 slot=0 rank 1=mach1 slot=4 rank 2=mach1 slot=8 rank 3=mach1 slot=12 rank 4=mach1 slot=16 rank 5=mach1 slot=20 rank 6=mach1 slot=24 rank 7=mach1 slot=28 rank 8=mach1 slot=32 rank 9=mach1 slot=36 rank 10=mach1 slot
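
For readers who want to inspect a rankfile like the one quoted above, here is a minimal parsing sketch. It assumes one "rank N=host slot=S" entry per line, as in the quoted example. The parse_rankfile name is hypothetical, and skipping blank lines and lines starting with "#" is an assumption of this sketch rather than a statement about how mpirun reads the file.

```python
import re

# One entry per line, in the form shown in the message: "rank 0=mach1 slot=0".
# The slot value is kept as a string because it may not be a plain integer.
RANK_LINE = re.compile(r"^rank\s+(\d+)=(\S+)\s+slot=(\S+)$")


def parse_rankfile(text):
    """Return a list of (rank, host, slot) tuples parsed from rankfile text."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # assumption: ignore blank lines and comment lines
        match = RANK_LINE.match(line)
        if match is None:
            raise ValueError(f"unrecognized rankfile line: {line!r}")
        entries.append((int(match.group(1)), match.group(2), match.group(3)))
    return entries


if __name__ == "__main__":
    sample = "rank 0=mach1 slot=0\nrank 1=mach1 slot=4\nrank 2=mach1 slot=8\n"
    for rank, host, slot in parse_rankfile(sample):
        print(rank, host, slot)
```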

Re: [OMPI devel] mpirun does not honor rankfile

2014-11-10 Thread Ralph Castain
I’ve been taking a look at this, and I believe I can get something implemented shortly. However, one problem I’ve encountered is that physical core indexes are NOT unique in many systems, e.g., x86 when hyperthreads are enabled. So you would have to specify socket:core in order to get a unique l
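
The uniqueness problem mentioned here can be illustrated with a small sketch. The dual-socket layout below is invented for the example and does not come from any machine in this thread, and ambiguous_core_indexes is a hypothetical helper name.

```python
from collections import Counter

# Hypothetical dual-socket layout: each tuple is (socket, physical core index).
# The same core index appears once on each socket.
CORES = [(0, 0), (0, 1), (1, 0), (1, 1)]


def ambiguous_core_indexes(cores):
    """Return the core indexes that occur on more than one socket."""
    counts = Counter(core for _socket, core in cores)
    return sorted(index for index, seen in counts.items() if seen > 1)


if __name__ == "__main__":
    # A bare core index is ambiguous in this layout...
    print(ambiguous_core_indexes(CORES))
    # ...while every (socket, core) pair is distinct.
    print(len(set(CORES)) == len(CORES))
```

In this layout a bare core index cannot identify a single location, which is why a socket:core pair is needed.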

Re: [OMPI devel] MTT diligence

2014-11-10 Thread Joshua Ladd
Ralph, Unfortunately, for competitive reasons, we are unable, at this time, to push our regression testing results to the public database. We are investigating ways to sanitize this data to the satisfaction of our legal department, however, we don't anticipate being able to share this data anytime