Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
Hi Ralph, Andrej - what version of Slurm are you using here? It's slurm 20.11.3, i.e. the latest release afaik. But Gilles is correct; the proposed test failed: andrej@terra:~/system/tests/MPI$ salloc -N 2 -n 2 salloc: Granted job allocation 838 andrej@terra:~/system/tests/MPI$ srun

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
Hi Gilles, Here is what you can try $ salloc -N 4 -n 384 /* and then from the allocation */ $ srun -n 1 orted /* that should fail, but the error message can be helpful */ $ /usr/local/bin/mpirun --mca plm slurm --mca plm_base_verbose 10 true andrej@terra:~/system/tests/MPI$ salloc -N 4 -n
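
A minimal sketch of the sequence Gilles suggests here, with sizes taken from the thread (adjust -N/-n to your cluster; run the last two commands from inside the allocation):

  salloc -N 4 -n 384
  # expected to fail, but the error message tells you whether orted finds its dependencies
  srun -n 1 orted
  # exercise mpirun's slurm launcher with verbose output
  /usr/local/bin/mpirun --mca plm slurm --mca plm_base_verbose 10 true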

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
can try (and send the logs if that fails) $ salloc -N 4 -n 384 and once you get the allocation $ env | grep ^SLURM_ $ mpirun --mca plm_base_verbose 10 --mca plm slurm true Cheers, Gilles On Tue, Feb 2, 2021 at 9:27 AM Andrej Prsa via devel wrote: Hi Gilles, I can reproduce this behavior ... whe
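
Spelled out, the environment check asked for in this message (run from inside the salloc shell; if the grep prints nothing, mpirun's slurm launcher has nothing to work with):

  env | grep ^SLURM_
  mpirun --mca plm_base_verbose 10 --mca plm slurm true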

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
Hi Gilles, I can reproduce this behavior ... when running outside of a slurm allocation. I just tried from slurm (sbatch run.sh) and I get the exact same error. What does $ env | grep ^SLURM_ reports? Empty; no environment variables have been defined. Thanks, Andrej

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
Hi Ralph, Gilles, I fail to understand why you continue to think that PMI has anything to do with this problem. I see no indication of a PMIx-related issue in anything you have provided to date. Oh, I went off the traceback that yelled about pmix, and slurm not being able to find it until

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
The saga continues. I managed to build slurm with pmix by first patching slurm using this patch and manually building the plugin: https://bugs.schedmd.com/show_bug.cgi?id=10683 Now srun shows pmix as an option: andrej@terra:~/system/tests/MPI$ srun --mpi=list srun: MPI types are... srun:
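
A quick sketch of how to confirm Slurm now sees the PMIx plugin, assuming the rebuilt plugin is in place:

  srun --mpi=list                  # 'pmix' should appear among the listed MPI types
  srun --mpi=pmix -n 2 hostname    # hypothetical smoke test: a trivial launch under the pmix plugin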

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
Hi Gilles, srun -N 1 -n 1 orted that is expected to fail, but it should at least find all its dependencies and start This was quite illuminating! andrej@terra:~/system/tests/MPI$ srun -N 1 -n 1 orted srun: /usr/local/lib/slurm/switch_generic.so: Incompatible Slurm plugin version (20.02.6)

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
ESS -- This was the exact problem that prompted me to try and upgrade from 4.0.3 to 4.1.0. Openmpi 4.1.0 (in debug mode, with internal pmix) is now installed on the head and on all compute nodes. I'd appreciate any ideas on what to try to overcome this. Cheers, Andrej On 2/1/21 9:57 AM, Andrej Prsa wrote:

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
built with the internal pmix) what was your exact configure command line? fwiw, in your build tree, there should be a opal/mca/pmix/pmix3x/.libs/mca_pmix_pmix3x.so if it's there, try running sudo make install once more and see if it helps Cheers, Gilles On Mon, Feb 1, 2021 at 11:05 PM Andrej
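
The check Gilles describes, as a sketch run from the top of the Open MPI build tree:

  ls opal/mca/pmix/pmix3x/.libs/mca_pmix_pmix3x.so   # present only when built with the internal pmix
  sudo make install                                  # re-install in case the component was missed
  ls -l /usr/local/lib/openmpi/mca_pmix_*.so         # confirm it landed in the install tree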

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
Hi Gilles, that's odd, there should be a mca_pmix_pmix3x.so (assuming you built with the internal pmix) Ah, I didn't -- I linked against the latest git pmix; here's the configure line: ./configure --prefix=/usr/local --with-pmix=/usr/local --with-slurm --without-tm --without-moab
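
A sketch of that external-PMIx configuration; the preview cuts the line off after --without-moab, so any further options are not shown here, and the build/install step is assumed:

  ./configure --prefix=/usr/local --with-pmix=/usr/local \
              --with-slurm --without-tm --without-moab
  make -j"$(nproc)" && sudo make install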

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
Hi Gilles, it seems only flux is a PMIx option, which is very suspicious. can you check other components are available? ls -l /usr/local/lib/openmpi/mca_pmix_*.so andrej@terra:~/system/tests/MPI$ ls -l /usr/local/lib/openmpi/mca_pmix_*.so -rwxr-xr-x 1 root root 97488 Feb  1 08:20

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
Hi Gilles, what is your mpirun command line? is mpirun invoked from a batch allocation? I call mpirun directly; here's a full output: andrej@terra:~/system/tests/MPI$ mpirun --mca ess_base_verbose 10 --mca pmix_base_verbose 10 -np 4 python testmpi.py [terra:203257] mca: base:
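
As a sketch, the verbose invocation from this message, with the python script swapped for a trivial command first (a hypothetical simplification to separate launcher problems from application problems):

  mpirun --mca ess_base_verbose 10 --mca pmix_base_verbose 10 -np 4 hostname
  mpirun --mca ess_base_verbose 10 --mca pmix_base_verbose 10 -np 4 python testmpi.py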

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
Hi Gilles, I invite you to do some cleanup sudo rm -rf /usr/local/lib/openmpi /usr/local/lib/pmix and then sudo make install and try again. Good catch! Alright, I deleted /usr/local/lib/openmpi and /usr/local/lib/pmix, then I rebuilt (make clean; make) and installed pmix from the latest
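
The cleanup-and-reinstall sequence, sketched with the paths given in the message (each make is run from the corresponding build tree):

  sudo rm -rf /usr/local/lib/openmpi /usr/local/lib/pmix   # remove stale components from earlier installs
  # in the PMIx build tree, then again in the Open MPI build tree:
  make clean && make && sudo make install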

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-01-31 Thread Andrej Prsa via devel
Hi Ralph, Just trying to understand - why are you saying this is a pmix problem? Obviously, something to do with mpirun is failing, but I don't see any indication here that it has to do with pmix. No -- 4.0.3 had the pmix problem -- whenever I tried to submit jobs across multiple nodes

[OMPI devel] mpirun 4.1.0 segmentation fault

2021-01-31 Thread Andrej Prsa via devel
Hello list, I just upgraded openmpi from 4.0.3 to 4.1.0 to see if it would solve a weird openpmix problem we've been having; I configured it using: ./configure --prefix=/usr/local --with-pmix=internal --with-slurm --without-tm --without-moab --without-singularity --without-fca
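
For reference, the configure line quoted above, reflowed as a sketch (the preview truncates after --without-fca, so any later options are unknown; the build step is assumed):

  ./configure --prefix=/usr/local --with-pmix=internal --with-slurm \
              --without-tm --without-moab --without-singularity --without-fca
  make -j"$(nproc)" && sudo make install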

Re: [OMPI devel] Problem running openmpi on nodes connected via eth

2015-10-21 Thread Andrej Prsa
Hi Gilles, Thanks for your reply! > by "running on the head node", shall i understand you mean > "running mpirun command *and* all mpi tasks on the head node" ? Precisely. > by "running on the compute node", shall i understand you mean > "running mpirun on the compute node *and* all mpi tasks

[OMPI devel] Problem running openmpi on nodes connected via eth

2015-10-20 Thread Andrej Prsa
Hi everyone, We have a small cluster of 6 identical 48-core nodes for astrophysical research. We are struggling on getting openmpi to run efficiently on the nodes. The head node is running ubuntu and openmpi-1.6.5 on a local disk. All worker nodes are booting from NFS exported root that resides

Re: [OMPI devel] Intermittent MPI issues with torque/maui

2014-08-26 Thread Andrej Prsa
Hi Ralph, > I don't know what version of OMPI you're working with, so I can't > precisely pinpoint the line in question. However, it looks likely to > be an error caused by not finding the PBS nodefile. This is openmpi 1.6.5. > We look in the environment for PBS_NODEFILE to find the directory >
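
A sketch of a quick sanity check from inside a Torque job, assuming the suspicion about the nodefile is right:

  echo "$PBS_NODEFILE"    # should name this job's nodefile
  cat "$PBS_NODEFILE"     # should list the allocated nodes; an error here would explain the failure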

[OMPI devel] Intermittent MPI issues with torque/maui

2014-08-26 Thread Andrej Prsa
Hi all, I asked this question on the torque mailing list, and I found several similar issues on the web, but no definitive solutions. When we run our MPI programs via torque/maui, at random times, in ~50-70% of all cases, the job will fail with the following error message: [node1:51074]

Re: [OMPI devel] 1.8.2rc4 problem: only 32 out of 48 cores are working

2014-08-25 Thread Andrej Prsa
Hi Jeff, My apologies for the delay in replying, I was flying back from the UK to the States, but now I'm here and I can provide a more timely response. > I confirm that the hwloc message you sent (and your posts to the > hwloc-users list) indicate that hwloc is getting confused by a buggy >

Re: [OMPI devel] 1.8.2rc4 problem: only 32 out of 48 cores are working

2014-08-22 Thread Andrej Prsa
Hi again, I generated a video that demonstrates the problem; for brevity I did not run a full process, but I'm providing the timing below. If you'd like me to record a full process, just let me know -- but as I said in my previous email, 32 procs drop to 1 after about a minute and the computation

Re: [OMPI devel] 1.8.2rc4 problem: only 32 out of 48 cores are working

2014-08-22 Thread Andrej Prsa
Hi Ralph, Chris, You guys are both correct: (1) The output that I passed along /is/ exemplary of only 32 processors running (provided htop reports things correctly). The job I submitted is the exact same process called 48 times (well, np times), so all procs should take about the

Re: [OMPI devel] 1.8.2rc4 problem: only 32 out of 48 cores are working

2014-08-21 Thread Andrej Prsa
> build (i.e., was configured with --enable-debug), and then set -mca > odls_base_verbose 5 --leave-session-attached on the cmd line? > > It'll be a little noisy, but should tell us why the other 16 procs > aren't getting launched > > > On Aug 21, 2014, at 3:27 PM, A

Re: [OMPI devel] 1.8.2rc4 problem: only 32 out of 48 cores are working

2014-08-21 Thread Andrej Prsa
> Hate to keep bothering you, but could you ensure this is a debug > build (i.e., was configured with --enable-debug), and then set -mca > odls_base_verbose 5 --leave-session-attached on the cmd line? No bother at all -- would love to help. I recompiled 1.8.2rc4 with debug and issued:
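
As a sketch, the debug rebuild and verbose relaunch Ralph asks for; the application name and -np value are placeholders:

  ./configure --enable-debug --prefix=/usr/local    # other configure options as in the original build
  make -j"$(nproc)" && sudo make install
  mpirun -np 48 -mca odls_base_verbose 5 --leave-session-attached ./my_mpi_app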

Re: [OMPI devel] 1.8.2rc4 problem: only 32 out of 48 cores are working

2014-08-21 Thread Andrej Prsa
> How odd - can you run it with --display-devel-map and send that > along? It will give us a detailed statement of where it thinks > everything should run. Sure thing -- please find it attached. Cheers, Andrej test.std.bz2 Description: application/bzip
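
The corresponding run, sketched with a placeholder application name; presumably something like this produced the attached test.std:

  mpirun -np 48 --display-devel-map ./my_mpi_app > test.std 2>&1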

Re: [OMPI devel] 1.8.2rc4 problem: only 32 out of 48 cores are working

2014-08-21 Thread Andrej Prsa
> > > Starting early in the 1.7 series, we began to bind procs by default > > to cores when -np <= 2, and to sockets if np > 2. Is it possible > > this is what you are seeing? > > > > > > On Aug 21, 2014, at 12:45 PM, Andrej Prsa <aprs...@gmail.com>

[OMPI devel] 1.8.2rc4 problem: only 32 out of 48 cores are working

2014-08-21 Thread Andrej Prsa
Dear devels, I have been trying out 1.8.2rcs recently and found a show-stopping problem on our cluster. Running any job with any number of processors larger than 32 will always employ only 32 cores per node (our nodes have 48 cores). We are seeing identical behavior with 1.8.2rc4, 1.8.2rc2, and