Hi Ralph,
Andrej - what version of Slurm are you using here?
It's Slurm 20.11.3, i.e. the latest release AFAIK.
But Gilles is correct; the proposed test failed:
andrej@terra:~/system/tests/MPI$ salloc -N 2 -n 2
salloc: Granted job allocation 838
andrej@terra:~/system/tests/MPI$ srun hostname
Hi Gilles,
Here is what you can try
$ salloc -N 4 -n 384
/* and then from the allocation */
$ srun -n 1 orted
/* that should fail, but the error message can be helpful */
$ /usr/local/bin/mpirun --mca plm slurm --mca plm_base_verbose 10 true
andrej@terra:~/system/tests/MPI$ salloc -N 4 -n 384
try (and send the logs if that fails)
$ salloc -N 4 -n 384
and once you get the allocation
$ env | grep ^SLURM_
$ mpirun --mca plm_base_verbose 10 --mca plm slurm true
Cheers,
Gilles
On Tue, Feb 2, 2021 at 9:27 AM Andrej Prsa via devel wrote:
Hi Gilles,
I can reproduce this behavior ... when running outside of a slurm allocation.
I just tried from slurm (sbatch run.sh) and I get the exact same error.
What does
$ env | grep ^SLURM_
report?
Empty; no environment variables have been defined.
Thanks,
Andrej
Hi Ralph, Gilles,
I fail to understand why you continue to think that PMI has anything to do with
this problem. I see no indication of a PMIx-related issue in anything you have
provided to date.
Oh, I went off the traceback that yelled about pmix, and slurm not being
able to find it until I
The saga continues.
I managed to build slurm with pmix by first applying this patch and
manually building the plugin:
https://bugs.schedmd.com/show_bug.cgi?id=10683
Now srun shows pmix as an option:
andrej@terra:~/system/tests/MPI$ srun --mpi=list
srun: MPI types are...
srun: cra
Hi Gilles,
srun -N 1 -n 1 orted
that is expected to fail, but it should at least find all its
dependencies and start
This was quite illuminating!
andrej@terra:~/system/tests/MPI$ srun -N 1 -n 1 orted
srun: /usr/local/lib/slurm/switch_generic.so: Incompatible Slurm plugin
version (20.02.6)
--
This was the exact problem that prompted me to try to upgrade from
4.0.3 to 4.1.0. Openmpi 4.1.0 (in debug mode, with internal pmix) is now
installed on the head node and on all compute nodes.
I'd appreciate any ideas on what to try to overcome this.
Cheers,
Andrej
On 2/1/2
that's odd, there should be a mca_pmix_pmix3x.so (assuming you built
with the internal pmix)
what was your exact configure command line?
fwiw, in your build tree, there should be a
opal/mca/pmix/pmix3x/.libs/mca_pmix_pmix3x.so
if it's there, try running
sudo make install
once more and see if it helps
Cheers,
Gilles
On Mon, Feb 1
Hi Gilles,
that's odd, there should be a mca_pmix_pmix3x.so (assuming you built
with the internal pmix)
Ah, I didn't -- I linked against the latest git pmix; here's the
configure line:
./configure --prefix=/usr/local --with-pmix=/usr/local --with-slurm
--without-tm --without-moab --without
Hi Gilles,
it seems only flux is a PMIx option, which is very suspicious.
can you check whether other components are available?
ls -l /usr/local/lib/openmpi/mca_pmix_*.so
andrej@terra:~/system/tests/MPI$ ls -l /usr/local/lib/openmpi/mca_pmix_*.so
-rwxr-xr-x 1 root root 97488 Feb  1 08:20
/usr/local
Hi Gilles,
what is your mpirun command line?
is mpirun invoked from a batch allocation?
I call mpirun directly; here's a full output:
andrej@terra:~/system/tests/MPI$ mpirun --mca ess_base_verbose 10 --mca
pmix_base_verbose 10 -np 4 python testmpi.py
[terra:203257] mca: base: components_regi
Hi Gilles,
I invite you to do some cleanup
sudo rm -rf /usr/local/lib/openmpi /usr/local/lib/pmix
and then
sudo make install
and try again.
Good catch! Alright, I deleted /usr/local/lib/openmpi and
/usr/local/lib/pmix, then I rebuilt (make clean; make) and installed
pmix from the latest master.
Hi Ralph,
Just trying to understand - why are you saying this is a pmix problem?
Obviously, something to do with mpirun is failing, but I don't see any
indication here that it has to do with pmix.
No -- 4.0.3 had the pmix problem -- whenever I tried to submit jobs
across multiple nodes usin
Hello list,
I just upgraded openmpi from 4.0.3 to 4.1.0 to see if it would solve a
weird openpmix problem we've been having; I configured it using:
./configure --prefix=/usr/local --with-pmix=internal --with-slurm
--without-tm --without-moab --without-singularity --without-fca
--without-hcoll
Hi Gilles,
Thanks for your reply!
> by "running on the head node", shall i understand you mean
> "running mpirun command *and* all mpi tasks on the head node" ?
Precisely.
> by "running on the compute node", shall i understand you mean
> "running mpirun on the compute node *and* all mpi tasks o
Hi everyone,
We have a small cluster of 6 identical 48-core nodes for astrophysical
research. We are struggling to get openmpi to run efficiently on
the nodes. The head node is running ubuntu and openmpi-1.6.5 on a local
disk. All worker nodes boot from an NFS-exported root that resides
on
Hi Ralph,
> I don't know what version of OMPI you're working with, so I can't
> precisely pinpoint the line in question. However, it looks likely to
> be an error caused by not finding the PBS nodefile.
This is openmpi 1.6.5.
> We look in the environment for PBS_NODEFILE to find the directory
Hi all,
I asked this question on the torque mailing list, and I found several
similar issues on the web, but no definitive solutions. When we run our
MPI programs via torque/maui, at random times, in ~50-70% of all cases,
the job will fail with the following error message:
[node1:51074] [[36074,0
Hi Jeff,
My apologies for the delay in replying; I was flying back from the UK
to the States, but now I'm here and can provide a more timely
response.
> I confirm that the hwloc message you sent (and your posts to the
> hwloc-users list) indicate that hwloc is getting confused by a buggy
> BIOS
Hi again,
I generated a video that demonstrates the problem; for brevity I did
not run a full process, but I'm providing the timing below. If you'd
like me to record a full process, just let me know -- but as I said in
my previous email, 32 procs drop to 1 after about a minute and the
computation
Hi Ralph, Chris,
You guys are both correct:
(1) The output that I passed along /is/ exemplary of only 32 processors
running (provided htop reports things correctly). The job I
submitted is the exact same process called 48 times (well, np
times), so all procs should take about the same
> Hate to keep bothering you, but could you ensure this is a debug
> build (i.e., was configured with --enable-debug), and then set -mca
> odls_base_verbose 5 --leave-session-attached on the cmd line?
>
> It'll be a little noisy, but should tell us why the other 16 procs
> aren't getting launched
>
>
> On Aug 21, 2014, a
> Hate to keep bothering you, but could you ensure this is a debug
> build (i.e., was configured with --enable-debug), and then set -mca
> odls_base_verbose 5 --leave-session-attached on the cmd line?
No bother at all -- would love to help. I recompiled 1.8.2rc4 with
debug and issued:
/usr/local/
> How odd - can you run it with --display-devel-map and send that
> along? It will give us a detailed statement of where it thinks
> everything should run.
Sure thing -- please find it attached.
Cheers,
Andrej
Attachment: test.std.bz2 (application/bzip)
> > Starting early in the 1.7 series, we began to bind procs by default
> > to cores when -np <= 2, and to sockets if np > 2. Is it possible
> > this is what you are seeing?
> >
> >
> > On Aug 21, 2014, at 12:45 PM, Andrej Prsa wrote:
> >
> >> D
Dear devels,
I have been trying out the 1.8.2 rcs recently and found a show-stopping
problem on our cluster. Running any job with more than 32 processes
will always employ only 32 cores per node (our nodes have 48 cores).
We are seeing identical behavior with 1.8.2rc4,
1.8.2rc2, and 1.