It's the Intel Fabric Suite, a software distribution for True Scale that contains
drivers, the InfiniBand stack, PSM, and more:
https://downloadcenter.intel.com/Detail_Desc.aspx?DwnldID=24129&lang=eng
Andrew
> -----Original Message-----
> From: devel [mailto:devel-boun...@open-mpi.org] On Behalf Of Adr
Sure, it would help. I'll test it whenever you're ready.
Thanks!
From: devel on behalf of Ralph Castain
Sent: Monday, November 10, 2014 4:15 PM
To: Open MPI Developers
Subject: Re: [OMPI devel] mpirun does not honor rankfile
Here’s what I can do. It looks like
What is IFS?
On Mon, Nov 10, 2014 at 09:12:41PM +, Friedley, Andrew wrote:
> Hi Adrian,
>
> Yes, I suggest trying either RH support or Intel's support at
> ibsupp...@intel.com. They might have seen this problem before. Since you're
> running the RHEL versions of PSM and related software,
Here’s what I can do. It looks like LSF is providing PU numbers instead of core
numbers, probably to get around the uniqueness issue. This works fine for me, as
I can just look up the correct PU number and then bind you to the core that
includes that PU. It means you have to provide the physical r
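For reference, a minimal sketch of the PU-to-core lookup described above, using
the hwloc C API (the hwloc calls are real; the pu_index parameter standing in
for the number LSF hands us is my assumption):

  #include <hwloc.h>

  /* Given a PU's physical (OS) index -- assumed here to be the number LSF
   * provides -- bind the current process to the core containing that PU. */
  int bind_to_core_of_pu(unsigned pu_index)
  {
      hwloc_topology_t topo;
      hwloc_topology_init(&topo);
      hwloc_topology_load(topo);

      /* PU OS indices are unique machine-wide, so this lookup is unambiguous. */
      hwloc_obj_t pu = hwloc_get_pu_obj_by_os_index(topo, pu_index);
      /* Walk up to the core that contains this PU. */
      hwloc_obj_t core =
          pu ? hwloc_get_ancestor_obj_by_type(topo, HWLOC_OBJ_CORE, pu) : NULL;

      int rc = -1;
      if (core)
          /* Bind to the whole core, i.e., all of its hyperthreads. */
          rc = hwloc_set_cpubind(topo, core->cpuset, HWLOC_CPUBIND_PROCESS);
      hwloc_topology_destroy(topo);
      return rc;
  }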
Hi Adrian,
Yes, I suggest trying either RH support or Intel's support at
ibsupp...@intel.com. They might have seen this problem before. Since you're
running the RHEL versions of PSM and related software, one thing you could try
is IFS. I think I was running IFS 7.3.0, so that's a difference
Andrew,
thanks for looking into this. I was able to reproduce this error on RHEL 7
with PSM provided by RHEL:
infinipath-psm-3.2-2_ga8c3e3e_open.2.el7.x86_64
infinipath-psm-devel-3.2-2_ga8c3e3e_open.2.el7.x86_64
$ mpirun -np 32 mpi_test_suite -t "environment"
mpi_test_suite:4877 terminated wit
On all but the two machines with the newer BIOS (showing just the first socket):
mach1:~ # lstopo -p --of console
NUMANode P#0 (12GB) + L3 (5118KB)
L2 (512KB) + L1 (64KB) + Core P#0 + PU P#0
L2 (512KB) + L1 (64KB) + Core P#1 + PU P#4
L2 (512KB) + L1 (64KB) + Core P#2 + PU P#8
L2
So a key point here is that a PU in lstopo output equates to a hyperthread when
hyperthreads are enabled, and those are always uniquely numbered. On my
(admittedly puny by comparison) dual-socket Nehalem box, I get this for the
physical numbering:
$ lstopo -p --of console
Machine (16GB)
NUMANode P#0 (8127MB) +
If we run
> lstopo --output-format fig
we get a diagram of the socket/NUMA/core layouts, and all but those 2 give "PU
P#0", "PU P#4", "PU P#8" in the smallest boxes, and in the lower left corner it
says "physical".
If we then add an option
> lstopo --logical --output-format fig
we get PU L#0,
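To see the logical/physical distinction outside of lstopo, here is a small
standalone hwloc sketch (standard hwloc API; nothing beyond that is assumed)
that prints each PU's logical index (the L# shown with --logical) and its OS
index (the P# shown with -p), plus the physical index of its parent core:

  #include <hwloc.h>
  #include <stdio.h>

  int main(void)
  {
      hwloc_topology_t topo;
      hwloc_topology_init(&topo);
      hwloc_topology_load(topo);

      int n = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);
      for (int i = 0; i < n; i++) {
          hwloc_obj_t pu = hwloc_get_obj_by_type(topo, HWLOC_OBJ_PU, i);
          hwloc_obj_t core =
              hwloc_get_ancestor_obj_by_type(topo, HWLOC_OBJ_CORE, pu);
          if (!core)
              continue;
          /* logical_index is lstopo's L#; os_index is the P# from lstopo -p.
           * PU P#s are unique machine-wide; core P#s repeat across sockets. */
          printf("PU L#%u P#%u on Core P#%u\n",
                 pu->logical_index, pu->os_index, core->os_index);
      }

      hwloc_topology_destroy(topo);
      return 0;
  }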
Hmmm… and those are, of course, intended to be physical core numbers. I wonder
how they are numbering them? The OS index won’t be unique, which is what is
causing us trouble, so they must have some way of translating them to provide a
unique number.
> On Nov 10, 2014, at 10:42 AM, Tom Wurgler
LSF gives this, for example, and we (LSF users) have no control over it.
rank 0=mach1 slot=0
rank 1=mach1 slot=4
rank 2=mach1 slot=8
rank 3=mach1 slot=12
rank 4=mach1 slot=16
rank 5=mach1 slot=20
rank 6=mach1 slot=24
rank 7=mach1 slot=28
rank 8=mach1 slot=32
rank 9=mach1 slot=36
rank 10=mach1 slot
I’ve been taking a look at this, and I believe I can get something implemented
shortly. However, one problem I’ve encountered is that physical core indexes
are NOT unique in many systems, e.g., x86 when hyperthreads are enabled. So you
would have to specify socket:core in order to get a unique l
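For what it's worth, slot=<socket>:<core> is already valid Open MPI rankfile
syntax, so a socket-qualified version of the rankfile above might look like
this (the mapping shown is purely illustrative, not taken from the thread):

rank 0=mach1 slot=0:0
rank 1=mach1 slot=0:1
rank 2=mach1 slot=1:0
rank 3=mach1 slot=1:1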
Ralph,
Unfortunately, for competitive reasons, we are unable at this time to push our
regression testing results to the public database. We are investigating ways to
sanitize this data to the satisfaction of our legal department; however, we
don't anticipate being able to share this data anytime