Re: [OMPI users] Error using rankfile to bind multiple cores on the same node for threaded OpenMPI application

2022-02-03 Thread Ralph Castain via users
hen consider cpus? From the manpages I thought this is the default behaviour. By the way, if I'll manage to understand everything correctly, I can also contribute to fix these inconsistencies in the manpages. I'd be more than happy to help where I can. On 03.02

Re: [OMPI users] Error using rankfile to bind multiple cores on the same node for threaded OpenMPI application

2022-02-03 Thread Ralph Castain via users
nkfile and the allocation.txt file). > > I was wondering if somehow mpirun cannot find all the hosts sometimes (but > sometimes it can, so it's a mystery to me)? > > Just wanted to point that out. Now I'll get in touch with the cluster support > to see if it's po

Re: [OMPI users] Error using rankfile to bind multiple cores on the same node for threaded OpenMPI application

2022-02-02 Thread Ralph Castain via users
Are you willing to try this with OMPI master? Asking because it would be hard to push changes all the way back to 4.0.x every time we want to see if we fixed something. Also, few of us have any access to LSF, though I doubt that has much impact here as it sounds like the issue is in the rank_fi

Re: [OMPI users] Error using rankfile to bind multiple cores on the same node for threaded OpenMPI application

2022-02-02 Thread Ralph Castain via users
Errr...what version OMPI are you using? > On Feb 2, 2022, at 3:03 PM, David Perozzi via users > wrote: > > Hello, > > I'm trying to run a code implemented with OpenMPI and OpenMP (for threading) > on a large cluster that uses LSF for the job scheduling and dispatch. The > problem with LSF is

Re: [OMPI users] RES: OpenMPI - Intel MPI

2022-01-27 Thread Ralph Castain via users
ed natively by Open MPI or abstraction layers) and/or with an uncommon topology (for which collective communications are not fully optimized by Open MPI). In the latter case, using the system/vendor MPI is the best option performance wise. Cheers, Gilles On Fri, Jan 28, 2022 at 2:23 AM Ralph Casta

Re: [OMPI users] RES: OpenMPI - Intel MPI

2022-01-27 Thread Ralph Castain via users
share.net/rcastain/pmix-bridging-the-container-boundary <https://www.slideshare.net/rcastain/pmix-bridging-the-container-boundary> [video]  https://www.sylabs.io/2019/04/sug-talk-intels-ralph-castain-on-bridging-the-container-boundary-with-pmix/ <https://www.sylabs.io/2019/04/sug-talk-

Re: [OMPI users] RES: OpenMPI - Intel MPI

2022-01-27 Thread Ralph Castain via users
Just to complete this - there is always a lingering question regarding shared memory support. There are two ways to resolve that one: * run one container per physical node, launching multiple procs in each container. The procs can then utilize shared memory _inside_ the container. This is the c

Re: [OMPI users] RES: OpenMPI - Intel MPI

2022-01-27 Thread Ralph Castain via users
> Fair enough Ralph! I was implicitly assuming a "build once / run everywhere" > use case, my bad for not making my assumption clear. > If the container is built to run on a specific host, there are indeed other > options to achieve near native performances. > Err...that isn't actually what I m

Re: [OMPI users] RES: OpenMPI - Intel MPI

2022-01-26 Thread Ralph Castain via users
gine the past issues I'd experienced were just due to the PMI differences in the different MPI implementations at the time.  I owe you a beer or something at the next in-person SC conference!   Cheers,   - Brian On Wed, Jan 26, 2022 at 4:54 PM Ralph Castain via users mailto:users@lists.open

Re: [OMPI users] RES: OpenMPI - Intel MPI

2022-01-26 Thread Ralph Castain via users
il.com> > wrote: Hi Ralph,  My singularity image has OpenMPI, but my host doesn't (Intel MPI). And I am not sure if the system would work with Intel + OpenMPI.  Luis  Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986>  for Windows  From: Ralph Castain via users <ma

Re: [OMPI users] RES: OpenMPI - Intel MPI

2022-01-26 Thread Ralph Castain via users
 My singularity image has OpenMPI, but my host doesn't (Intel MPI). And I am not sure if the system would work with Intel + OpenMPI.  Luis  Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986>  for Windows  From: Ralph Castain via users <mailto:users@lists.open-mpi.org> En

Re: [OMPI users] OpenMPI - Intel MPI

2022-01-26 Thread Ralph Castain via users
Err...the whole point of a container is to put all the library dependencies _inside_ it. So why don't you just install OMPI in your singularity image? On Jan 26, 2022, at 6:42 AM, Luis Alfredo Pires Barbosa via users mailto:users@lists.open-mpi.org> > wrote: Hello all, I have Intel MPI in my

Re: [OMPI users] Creating An MPI Job from Procs Launched by a Different Launcher

2022-01-25 Thread Ralph Castain via users
Short answer is yes, but it is a bit complicated to do. On Jan 25, 2022, at 12:28 PM, Saliya Ekanayake via users mailto:users@lists.open-mpi.org> > wrote: Hi, I am trying to run an MPI program on a platform that launches the processes using a custom launcher (not mpiexec). This will end up spa

Re: [OMPI users] Open MPI + Slurm + lmod

2022-01-25 Thread Ralph Castain via users
ons: >> > - use mpirun >> > - rebuild Open MPI with PMI support as Ralph previously explained >> > - use SLURM PMIx: >> > srun --mpi=list >> > will list the PMI flavors provided by SLURM >> > a) if

Re: [OMPI users] Open MPI + Slurm + lmod

2022-01-24 Thread Ralph Castain via users
-fPIC -c99 >> -tp p7-64' 'CXXFLAGS=-O1 -fPIC -tp p7-64' 'FCFLAGS=-O1 -fPIC -tp p7-64' >> 'LD=ld' '--enable-shared' '--enable-static' '--without-tm' >> '--enable-mpi-cxx' '--disable-wrapper-runpath' &g

Re: [OMPI users] Open MPI + Slurm + lmod

2022-01-24 Thread Ralph Castain via users
If you look at your configure line, you forgot to include --with-pmi=. We don't build the Slurm PMI support by default due to the GPL licensing issues - you have to point at it. > On Jan 24, 2022, at 6:41 AM, Matthias Leopold via users > wrote: > > Hi, > > we have 2 DGX A100 machines and I'
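A minimal sketch of the configure line being suggested, assuming Slurm's PMI headers and libraries live under /opt/slurm (a hypothetical prefix; point --with-pmi at wherever pmi.h/pmi2.h were installed):

  ./configure --prefix=/opt/openmpi --with-pmi=/opt/slurm
  make -j8 && make install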

Re: [OMPI users] OMPI_COMM_WORLD_LOCAL_SIZE problem between PBS and MLNX_OFED

2022-01-18 Thread Ralph Castain via users
ile generated by > PBS, before passing it to mpirun), but it's a pity that tm support is > not included in these pre-built OpenMPI installations. > > On Tue, Jan 18, 2022 at 11:56 PM Ralph Castain via users > wrote: >> >> Hostfile isn't being ignored - it is doin

Re: [OMPI users] OMPI_COMM_WORLD_LOCAL_SIZE problem between PBS and MLNX_OFED

2022-01-18 Thread Ralph Castain via users
foo > one should use: > mpirun -n 2 --host node1,node2 ./foo > > Rather strange, but it's important that it works somehow. Thanks for your > help! > > On Tue, Jan 18, 2022 at 10:54 PM Ralph Castain via users > wrote: >> >> Are you launching the

Re: [OMPI users] OMPI_COMM_WORLD_LOCAL_SIZE problem between PBS and MLNX_OFED

2022-01-18 Thread Ralph Castain via users
; wrote: > > I have one process per node, here is corresponding line from my job > submission script (with compute nodes named "node1" and "node2"): > > #PBS -l select=1:ncpus=1:mpiprocs=1:host=node1+1:ncpus=1:mpiprocs=1:host=node2 > > On Tue, Jan

Re: [OMPI users] OMPI_COMM_WORLD_LOCAL_SIZE problem between PBS and MLNX_OFED

2022-01-18 Thread Ralph Castain via users
Afraid I can't understand your scenario - when you say you "submit a job" to run on two nodes, how many processes are you running on each node?? > On Jan 18, 2022, at 1:07 PM, Crni Gorac via users > wrote: > > Using OpenMPI 4.1.2 from MLNX_OFED_LINUX-5.5-1.0.3.2 distribution, and > have PBS 1

Re: [OMPI users] stdout scrambled in file

2021-12-18 Thread Ralph Castain via users
FWIW: this has been "fixed" in PMIx/PRRTE and should make it into OMPI v5 if the OMPI community accepts it. The default behavior has been changed to output a full line-at-a-time so that the output from different ranks doesn't get mixed together. The negative to this, of course, is that we now in

Re: [OMPI users] stdout scrambled in file

2021-12-05 Thread Ralph Castain via users
There are several output-controlling options - e.g., you could redirect the output from each process to its own file or directory. However, it makes little sense to me for someone to write convergence data into a file and then parse it. Typically, convergence data results from all procs reachin

Re: [OMPI users] Reserving slots and filling them after job launch with MPI_Comm_spawn

2021-11-05 Thread Ralph Castain via users
argv_, 1, info, rank_, MPI_COMM_SELF, &intercom, error_codes);  From: users mailto:users-boun...@lists.open-mpi.org> > On Behalf Of Ralph Castain via users Sent: Friday, November 5, 2021 9:50 AM To: Open MPI Users mailto:users@lists.open-mpi.org> > Cc: Ralph Castain mailto:r...@open-

Re: [OMPI users] Reserving slots and filling them after job launch with MPI_Comm_spawn

2021-11-05 Thread Ralph Castain via users
--map-by node \     -np 21  \     -wdir ${work_dir}  …  Here is my qsub command for the program “Needles”.  qsub -V -j oe -e $tmpdir_stdio -o $tmpdir_stdio -f -X -N Needles -l nodes=21:ppn=9  RunNeedles.bash;   From: users mailto:users-boun...@lists.open-mpi.org> > On Behalf Of 

Re: [OMPI users] Reserving slots and filling them after job launch with MPI_Comm_spawn

2021-11-03 Thread Ralph Castain via users
th-tm.   I tried Gilles workaround but the failure still occurred.    What do I need to provide you so that you can investigate this possible bug?  Thanks, Kurt  From: users mailto:users-boun...@lists.open-mpi.org> > On Behalf Of Ralph Castain via users Sent: Wednesday, November 3, 2021 8:45

Re: [OMPI users] Reserving slots and filling them after job launch with MPI_Comm_spawn

2021-11-03 Thread Ralph Castain via users
Sounds like a bug to me - regardless of configuration, if the hostfile contains an entry for each slot on a node, OMPI should have added those up. On Nov 3, 2021, at 2:49 AM, Gilles Gouaillardet via users mailto:users@lists.open-mpi.org> > wrote: Kurt, Assuming you built Open MPI with tm supp

Re: [OMPI users] [External] Re: cpu binding of mpirun to follow slurm setting

2021-10-11 Thread Ralph Castain via users
d that? Thanks. >Ray > > > From: users on behalf of Ralph Castain via > users > Sent: Monday, October 11, 2021 1:49 PM > To: Open MPI Users > Cc: Ralph Castain > Subject: Re: [OMPI users] [External] Re: cpu bi

Re: [OMPI users] [External] Re: cpu binding of mpirun to follow slurm setting

2021-10-11 Thread Ralph Castain via users
via users mailto:users@lists.open-mpi.org> > wrote: OK thank you. Seems that srun is a better option for normal users. Chang On 10/11/21 1:23 PM, Ralph Castain via users wrote: Sorry, your output wasn't clear about cores vs hwthreads. Apparently, your Slurm config is setup to use hwt

Re: [OMPI users] [External] Re: cpu binding of mpirun to follow slurm setting

2021-10-11 Thread Ralph Castain via users
al cores, so two processes sharing a physical core. I guess there is a way to do that by playing with mapping. I just want to know if this is a bug in mpirun, or this feature for interacting with slurm was never implemented. Chang On 10/11/21 10:07 AM, Ralph Castain via users wrote: You just n

Re: [OMPI users] [External] Re: cpu binding of mpirun to follow slurm setting

2021-10-11 Thread Ralph Castain via users
You just need to tell mpirun that you want your procs to be bound to cores, not socket (which is the default). Add "--bind-to core" to your mpirun cmd line On Oct 10, 2021, at 11:17 PM, Chang Liu via users mailto:users@lists.open-mpi.org> > wrote: Yes they are. This is an interactive job from
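A minimal sketch of the suggested change, assuming a 16-rank job and a hypothetical executable ./app:

  mpirun -np 16 --bind-to core ./app
  # add --report-bindings to print the resulting binding of each rank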

Re: [OMPI users] cpu binding of mpirun to follow slurm setting

2021-10-10 Thread Ralph Castain via users
Could you please include (a) what version of OMPI you are talking about, and (b) the binding patterns you observed from both srun and mpirun? > On Oct 9, 2021, at 6:41 PM, Chang Liu via users > wrote: > > Hi, > > I wonder if mpirun can follow the cpu binding settings from slurm, when > runn

Re: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There was an error initializing an OpenFabrics device."

2021-08-11 Thread Ralph Castain via users
> ||_// the State| Ryan Novosielski - novos...@rutgers.edu > || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus > || \\of NJ| Office of Advanced Research Computing - MSB C630, > Newark > `' > >&g

Re: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There was an error initializing an OpenFabrics device."

2021-07-29 Thread Ralph Castain via users
ob step aborted: Waiting up to 32 seconds for job step to finish. > srun: error: gpu004: tasks 0-1: Exited with exit code 1 > > -- > #BlackLivesMatter > ____ > || \\UTGERS, > |---*O*--- > ||_// the State|

Re: [OMPI users] OpenMPI 4.0.6 w/GCC 8.5 on CentOS 7.9; "WARNING: There was an error initializing an OpenFabrics device."

2021-07-29 Thread Ralph Castain via users
Ryan - I suspect what Sergey was trying to say was that you need to ensure OMPI doesn't try to use the OpenIB driver, or at least that it doesn't attempt to initialize it. Try adding OMPI_MCA_pml=ucx to your environment. On Jul 29, 2021, at 1:56 AM, Sergey Oblomov via users mailto:users@lists
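A sketch of the suggested setting, assuming a bash environment (the executable name is hypothetical):

  export OMPI_MCA_pml=ucx   # select the UCX PML so the openib code path is not used
  mpirun -np 2 ./app        # or the equivalent srun launch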

Re: [OMPI users] How to set parameters to utilize multiple network interfaces?

2021-06-11 Thread Ralph Castain via users
You can still use "map-by" to get what you want since you know there are four interfaces per node - just do "--map-by ppr:8:node". Note that you definitely do NOT want to list those multiple IP addresses in your hostfile - all you are doing is causing extra work for mpirun as it has to DNS resol
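A sketch of the suggested approach, assuming each node appears exactly once in the hostfile and eight ranks are wanted per node (hostfile and executable names are hypothetical):

  cat hosts
  node01
  node02
  mpirun --hostfile hosts --map-by ppr:8:node ./app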

Re: [OMPI users] [Help] Must orted exit after all spawned proecesses exit

2021-05-19 Thread Ralph Castain via users
To answer your specific questions: The backend daemons (orted) will not exit until all locally spawned procs exit. This is not configurable - for one thing, OMPI procs will suicide if they see the daemon depart, so it makes no sense to have the daemon fail if a proc terminates. The logic behind

Re: [OMPI users] unable to launch a job on a system with OmniPath

2021-05-19 Thread Ralph Castain via users
The original configure line is correct ("--without-orte") - just a typo in the later text. You may be running into some issues with Slurm's built-in support for OMPI. Try running it with OMPI's "mpirun" instead and see if you get better performance. You'll have to reconfigure to remove the "--w

Re: [OMPI users] How do I launch workers by our private protocol?

2021-04-21 Thread Ralph Castain via users
I'm not sure we support what you are wanting to do. You can direct mpiexec to use a specified script to launch its daemons on remote nodes. The daemons will need to connect back via TCP to mpiexec. The daemons are responsible for fork/exec'ing the local MPI application procs on each node. Those

Re: [OMPI users] Dynamic process allocation hangs

2021-03-25 Thread Ralph Castain via users
Hmmm...disturbing. The changes I made have somehow been lost. I'll have to redo it - will get back to you when it is restored. On Mar 25, 2021, at 2:54 PM, L Lutret mailto:lu.lut...@gmail.com> > wrote: Hi Ralph, Thanks for your response. I tried with the master branch a very simple spawn from

Re: [OMPI users] Dynamic process allocation hangs

2021-03-24 Thread Ralph Castain via users
Apologies for the very long delay in response. This has been verified fixed in OMPI's master branch that is to be released as v5.0 in the near future. Unfortunately, there are no plans to backport that fix to earlier release series. We therefore recommend that you upgrade to v5.0 if you retain in

Re: [OMPI users] building openshem on opa

2021-03-22 Thread Ralph Castain via users
You did everything right - the OSHMEM implementation in OMPI only supports UCX as it is essentially a Mellanox offering. I think the main impediment to broadening it is simply interest and priority on the part of the non-UCX developers. > On Mar 22, 2021, at 7:51 AM, Michael Di Domenico via use

Re: [OMPI users] How do you change ports used? [EXT]

2021-03-19 Thread Ralph Castain via users
or available ports, but is it checking those ports are also available on all the other hosts it’s going to run on? On 18 Mar 2021, at 15:57, Ralph Castain via users mailto:users@lists.open-mpi.org> > wrote: Hmmm...then you have something else going on. By default, OMPI will ask the lo

Re: [OMPI users] How do you change ports used? [EXT]

2021-03-18 Thread Ralph Castain via users
(pure default), it just doesn’t function (I’m guessing because it chose “bad” or in-use ports). On 18 Mar 2021, at 14:11, Ralph Castain via users mailto:users@lists.open-mpi.org> > wrote: Hard to say - unless there is some reason, why not make it large enough to not be an issue? You

Re: [OMPI users] How do you change ports used? [EXT]

2021-03-18 Thread Ralph Castain via users
hat range resulted in the issue I posted about here before, where mpirun just does nothing for 5mins and then terminates itself, without any error messages.) Cheers, Sendu. On 17 Mar 2021, at 13:25, Ralph Castain via users mailto:users@lists.open-mpi.org> > wrote: What you are miss

Re: [OMPI users] How do you change ports used?

2021-03-17 Thread Ralph Castain via users
What you are missing is that there are _two_ messaging layers in the system. You told the btl/tcp layer to use the specified ports, but left the oob/tcp one unspecified. You need to add oob_tcp_dynamic_ipv4_ports = 46207-46239 or whatever range you want to specify Note that if you want the btl
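A sketch of restricting both messaging layers to the quoted range, e.g. in $HOME/.openmpi/mca-params.conf (the btl_tcp_* parameters are the usual ones for the BTL side and are an assumption about how the original range was set):

  btl_tcp_port_min_v4 = 46207
  btl_tcp_port_range_v4 = 33                 # 46207-46239 inclusive
  oob_tcp_dynamic_ipv4_ports = 46207-46239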

Re: [OMPI users] Stable and performant openMPI version for Ubuntu20.04 ?

2021-03-04 Thread Ralph Castain via users
Excuse me, but would you please ensure that you do not send mail to a mailing list containing this label: [AMD Official Use Only - Internal Distribution Only] Thank you Ralph On Mar 4, 2021, at 4:55 AM, Raut, S Biplab via users mailto:users@lists.open-mpi.org> > wrote: [AMD Official Use Only

Re: [OMPI users] Mapping, binding and ranking

2021-03-01 Thread Ralph Castain via users
other policies. I have also tried with --cpu-set with identical results. Probably rankfile is my only option too. On 28/02/2021 22:44, Ralph Castain via users wrote: The only way I know of to do what you want is --map-by ppr:32:socket --bind-to core --cpu-list 0,2,4,6,... whe

Re: [OMPI users] Mapping, binding and ranking

2021-03-01 Thread Ralph Castain via users
So then I need a rankfile listing all the hosts? John On 3/1/21 10:26 AM, Ralph Castain via users wrote: I'm afraid not - you have simply told us that all cpus are available. I don't know of any way to accomplish what John wants other than with a rankfile. On Mar 1, 2021,

Re: [OMPI users] Mapping, binding and ranking

2021-03-01 Thread Ralph Castain via users
one bound to one core, and the second bound to all the rest, with no use of hyperthreads. Would this be --map-by ppr:2:node --bind-to core --cpu-list 0,1-31 ? Thx On 2/28/21 5:44 PM, Ralph Castain via users wrote: The only way I know of to do what you want is --map-by pp

Re: [OMPI users] Mapping, binding and ranking

2021-02-28 Thread Ralph Castain via users
[../BB/../..][..] (truncated binding map: the process is bound to both hwthreads of a single core; all other cores are unbound) On 28/02/2021 16:24, Ralph Castain via users wrote: Did you read the documentation on rankfile? The "slot=N" directive says to "put this proc on

Re: [OMPI users] Mapping, binding and ranking

2021-02-28 Thread Ralph Castain via users
../../../../../../../..] And this is still different from the output produced using the rankfile. Cheers, Luis On 28/02/2021 14:06, Ralph Castain via users wrote: Your command line is incorrect: --map-by ppr:32:socket:PE=4 --bind-to hwthread

Re: [OMPI users] Mapping, binding and ranking

2021-02-28 Thread Ralph Castain via users
Your command line is incorrect: --map-by ppr:32:socket:PE=4 --bind-to hwthread should be --map-by ppr:32:socket:PE=2 --bind-to core On Feb 28, 2021, at 5:57 AM, Luis Cebamanos via users mailto:users@lists.open-mpi.org> > wrote: I should have said, "I would like to run 128 MPI processes on 2
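A sketch of the corrected invocation for 128 ranks across two dual-socket nodes with two cores per rank (the executable name ./app is a placeholder):

  mpirun -np 128 --map-by ppr:32:socket:PE=2 --bind-to core --report-bindings ./app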

Re: [OMPI users] Binding blocks of processes in round-robin fashion

2021-01-29 Thread Ralph Castain via users
Okay, I can't promise when I'll get to it, but I'll try to have it in time for OMPI v5. On Jan 29, 2021, at 1:30 AM, Luis Cebamanos via users mailto:users@lists.open-mpi.org> > wrote: Hi Ralph, It would be great to have it for load balancing issues. Ideally one could do something like --

Re: [OMPI users] Binding blocks of processes in round-robin fashion

2021-01-28 Thread Ralph Castain via users
t work, and all the app-contexts wind up in MPI_COMM_WORLD. On Jan 28, 2021, at 3:18 PM, Luis Cebamanos via users mailto:users@lists.open-mpi.org> > wrote: That's right Ralph! On 28/01/2021 23:13, Ralph Castain via users wrote: Trying to wrap my head around this, so let me try a 2-nod

Re: [OMPI users] Binding blocks of processes in round-robin fashion

2021-01-28 Thread Ralph Castain via users
Trying to wrap my head around this, so let me try a 2-node example. You want (each rank bound to a single core): ranks 0-3 to be mapped onto node1 ranks 4-7 to be mapped onto node2 ranks 8-11 to be mapped onto node1 ranks 12-15 to be mapped onto node2 etc.etc. Correct? > On Jan 28, 2021, at 3:0

Re: [OMPI users] MCA parameter "orte_base_help_aggregate"

2021-01-25 Thread Ralph Castain via users
There should have been an error message right above that - all this is saying is that the same error message was output by 7 more processes besides the one that was output. It then indicates that process 3 (which has pid 0?) was killed. Looking at the help message tag, it looks like no NICs were

Re: [OMPI users] OpenMPI 4.0.5 error with Omni-path

2021-01-25 Thread Ralph Castain via users
I think you mean add "--mca mtl ofi" to the mpirun cmd line > On Jan 25, 2021, at 10:18 AM, Heinz, Michael William via users > wrote: > > What happens if you specify -mtl ofi ? > > -Original Message- > From: users On Behalf Of Patrick Begou via > users > Sent: Monday, January 25, 20

Re: [OMPI users] MPMD hostfile: executables on same hosts

2020-12-21 Thread Ralph Castain via users
You want to use the "sequential" mapper and then specify each proc's location, like this for your hostfile: host1 host1 host2 host2 host3 host3 host1 host2 host3 and then add "--mca rmaps seq" to your mpirun cmd line. Ralph On Dec 21, 2020, at 5:22 AM, Vineet Soni via users mailto:users@lists
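A sketch of how the pieces fit together, assuming the nine-line hostfile above is saved as myhosts and the two MPMD executables are ./exe1 and ./exe2 (hypothetical names); with the sequential mapper, rank k is placed on line k of the file:

  mpirun --hostfile myhosts --mca rmaps seq -np 6 ./exe1 : -np 3 ./exe2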

Re: [OMPI users] pmi.h/pmi2.h found but libpmi/libpmi missing

2020-12-20 Thread Ralph Castain via users
Did you remember to build the Slurm pmi and pmi2 libraries? They aren't built by default - IIRC, you have to manually go into a subdirectory and do a "make install" to have them built and installed. You might check the Slurm documentation for details. You also might need to add a --with-pmi-li

Re: [OMPI users] MPI-IO on Lustre - OMPIO or ROMIO?

2020-12-02 Thread Ralph Castain via users
Just a point to consider. OMPI does _not_ want to get in the mode of modifying imported software packages. That is a blackhole of effort we simply cannot afford. The correct thing to do would be to flag Rob Latham on that PR and ask that he upstream the fix into ROMIO so we can absorb it. We sh

Re: [OMPI users] PRRTE DVM: how to tell prun to not share nodes among prun jobs?

2020-11-14 Thread Ralph Castain via users
That would be very kind of you and most welcome! > On Nov 14, 2020, at 12:38 PM, Alexei Colin wrote: > > On Sat, Nov 14, 2020 at 08:07:47PM +0000, Ralph Castain via users wrote: >> IIRC, the correct syntax is: >> >> prun -host +e ... >> >> This tells P

Re: [OMPI users] PRRTE DVM: how to tell prun to not share nodes among prun jobs?

2020-11-14 Thread Ralph Castain via users
IIRC, the correct syntax is: prun -host +e ... This tells PRRTE that you want empty nodes for this application. You can even specify how many empty nodes you want: prun -host +e:2 ... I haven't tested that in a bit, so please let us know if it works or not so we can fix it if necessary. As f
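A sketch of the two forms against a running DVM (the executable name is hypothetical):

  prun -host +e   -np 4 ./app    # place this job only on nodes that are currently empty
  prun -host +e:2 -np 4 ./app    # ask for two empty nodes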

Re: [OMPI users] [External] Re: mpi/pmix: ERROR: Error handler invoked: status = -25: No such file or directory (2)

2020-11-12 Thread Ralph Castain via users
be > expected. I just want to make sure that this was the case, and the error > below wasn't a sign of another issue with the job. > > Prentice > > On 11/11/20 5:47 PM, Ralph Castain via users wrote: >> Looks like it is coming from the Slurm PMIx plugin, not OM

Re: [OMPI users] mpi/pmix: ERROR: Error handler invoked: status = -25: No such file or directory (2)

2020-11-11 Thread Ralph Castain via users
Looks like it is coming from the Slurm PMIx plugin, not OMPI. Artem - any ideas? Ralph > On Nov 11, 2020, at 10:03 AM, Prentice Bisbal via users > wrote: > > One of my users recently reported a failed job that was using OpenMPI 4.0.4 > compiled with PGI 20.4. There two different errors repo

Re: [OMPI users] Starting a mixed fortran python MPMD application

2020-11-04 Thread Ralph Castain via users
Afraid I would have no idea - all I could tell them is that there was a bug and it has been fixed On Nov 2, 2020, at 12:18 AM, Andrea Piacentini via users mailto:users@lists.open-mpi.org> > wrote: I installed version 4.0.5 and the problem appears to be fixed. Can you please help us explaini

Re: [OMPI users] Starting a mixed fortran python MPMD application

2020-10-28 Thread Ralph Castain via users
Could you please tell us what version of OMPI you are using? On Oct 28, 2020, at 11:16 AM, Andrea Piacentini via users mailto:users@lists.open-mpi.org> > wrote: Good morning we need to launch a MPMD application with two fortran excutables and one interpreted python (mpi4py) application.

Re: [OMPI users] Limiting IP addresses used by OpenMPI

2020-09-30 Thread Ralph Castain via users
I'm not sure where you are looking, but those params are indeed present in the opal/mca/btl/tcp component: /*  *  Called by MCA framework to open the component, registers  *  component parameters.  */ static int mca_btl_tcp_component_register(void) {     char* message;     /* register TCP compo

Re: [OMPI users] OMPI 4.0.4 how to use mpirun properly in numa architecture

2020-08-20 Thread Ralph Castain via users
m a chemist and not a sysadmin (I miss a lot a specialized sysadmin in our Department!). Carlo Il giorno gio 20 ago 2020 alle ore 18:45 Ralph Castain via users mailto:users@lists.open-mpi.org> > ha scritto: Your use-case sounds more like a workflow than an application - in which case, yo

Re: [OMPI users] OMPI 4.0.4 how to use mpirun properly in numa architecture

2020-08-20 Thread Ralph Castain via users
Your use-case sounds more like a workflow than an application - in which case, you probably should be using PRRTE to execute it instead of "mpirun" as PRRTE will "remember" the multiple jobs and avoid the overload scenario you describe. This link will walk you thru how to get and build it:  http

Re: [OMPI users] Issues with MPI_Comm_Spawn

2020-08-12 Thread Ralph Castain via users
., 12 ago. 2020 18:29, Ralph Castain via users mailto:users@lists.open-mpi.org> > escribió: Setting aside the known issue with comm_spawn in v4.0.4, how are you planning to forward stdin without the use of "mpirun"? Something has to collect stdin of the terminal and distribute it to

Re: [OMPI users] Issues with MPI_Comm_Spawn

2020-08-12 Thread Ralph Castain via users
Setting aside the known issue with comm_spawn in v4.0.4, how are you planning to forward stdin without the use of "mpirun"? Something has to collect stdin of the terminal and distribute it to the stdin of the processes > On Aug 12, 2020, at 9:20 AM, Alvaro Payero Pinto via users > wrote: > >

Re: [OMPI users] OMPI 4.0.4 crashes (or hangs) with dynamically processes allocation. OMPI 4.0.1 don't.

2020-08-11 Thread Ralph Castain via users
Howard - if there is a problem in PMIx that is causing this problem, then we really could use a report on it ASAP as we are getting ready to release v3.1.6 and I doubt we have addressed anything relevant to what is being discussed here. On Aug 11, 2020, at 4:35 PM, Martín Morales via users mai

Re: [OMPI users] ORTE HNP Daemon Error - Generated by Tweaking MTU

2020-08-10 Thread Ralph Castain via users
My apologies - I should have included "--debug-daemons" for the mpirun cmd line so that the stderr of the backend daemons would be output. > On Aug 10, 2020, at 10:28 AM, John Duffy via users > wrote: > > Thanks Ralph > > I will do all of that. Much appreciated.

Re: [OMPI users] ORTE HNP Daemon Error - Generated by Tweaking MTU

2020-08-10 Thread Ralph Castain via users
Well, we aren't really that picky :-) While I agree with Gilles that we are unlikely to be able to help you resolve the problem, we can give you a couple of ideas on how to chase it down First, be sure to build OMPI with "--enable-debug" and then try adding "--mca oob_base_verbose 100" to you

Re: [OMPI users] MPI is still dominant paradigm?

2020-08-07 Thread Ralph Castain via users
The Java bindings were added specifically to support the Spark/Hadoop communities, so I see no reason why you couldn't use them for Akka or whatever. Note that there are also Python wrappers for MPI at mpi4py that you could build upon. There is plenty of evidence out there for a general migrati

Re: [OMPI users] Correct mpirun Options for Hybrid OpenMPI/OpenMP

2020-08-03 Thread Ralph Castain via users
By default, OMPI will bind your procs to a single core. You probably want to at least bind to socket (for NUMA reasons), or not bind at all if you want to use all the cores on the node. So either add "--bind-to socket" or "--bind-to none" to your cmd line. On Aug 3, 2020, at 1:33 AM, John Duff
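A sketch of the two suggested alternatives for a hybrid MPI+OpenMP run, assuming 4 ranks with 8 threads each (sizes and executable name are hypothetical):

  export OMP_NUM_THREADS=8
  mpirun -np 4 --bind-to socket ./hybrid_app   # keep each rank's threads on one socket
  mpirun -np 4 --bind-to none   ./hybrid_app   # no binding; threads may use any core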

Re: [OMPI users] Running with Intel Omni-Path

2020-08-01 Thread Ralph Castain via users
Add "--mca pml cm" to your cmd line On Jul 31, 2020, at 9:54 PM, Supun Kamburugamuve via users mailto:users@lists.open-mpi.org> > wrote: Hi all, I'm trying to setup OpenMPI on a cluster with the Omni-Path network. When I try the following command it gives an error. mpirun -n 2 --hostfile nod

Re: [OMPI users] Moving an installation

2020-07-24 Thread Ralph Castain via users
While possible, it is highly unlikely that your desktop version is going to be binary compatible with your cluster... On Jul 24, 2020, at 9:55 AM, Lana Deere via users mailto:users@lists.open-mpi.org> > wrote: I have open-mpi 4.0.4 installed on my desktop and my small test programs are working.

Re: [OMPI users] Any reason why I can't start an mpirun job from within an mpi process?

2020-07-11 Thread Ralph Castain via users
You cannot cascade mpirun cmds like that - the child mpirun picks up envars that causes it to break. You'd have to either use comm_spawn to start the child job, or do a fork/exec where you can set the environment to be some pristine set of values. > On Jul 11, 2020, at 1:12 PM, John Retterer v

Re: [OMPI users] slot number calculation when no config files?

2020-06-08 Thread Ralph Castain via users
Note that you can also resolve it by adding --use-hwthread-cpus to your cmd line - it instructs mpirun to treat the HWTs as independent cpus so you would have 4 slots in this case. > On Jun 8, 2020, at 11:28 AM, Collin Strassburger via users > wrote: > > Hello David, > > The slot calculatio
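A sketch of the suggested flag on a node with two cores and four hardware threads (the case described in the thread; executable name hypothetical):

  mpirun --use-hwthread-cpus -np 4 ./app   # each hwthread counts as a slot, so 4 slots are available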

Re: [OMPI users] Running mpirun with grid

2020-06-01 Thread Ralph Castain via users
Afraid I have no real ideas here. Best I can suggest is taking the qrsh cmd line from the prior debug output and try running it manually. This might give you a chance to manipulate it and see if you can identify what is causing it an issue, if anything. Without mpirun executing, the daemons will

Re: [OMPI users] Running mpirun with grid

2020-06-01 Thread Ralph Castain via users
mpdir_base). > Please check with your sys admin to determine the correct location to use. > > * compilation of the orted with dynamic libraries when static are required > (e.g., on Cray). Please check your configure cmd line and consider using > one of the contrib/platform definitions for your system type. > > * an inability to create a connection back to mpirun due to a > lack of comm

Re: [OMPI users] Running mpirun with grid

2020-05-31 Thread Ralph Castain via users
The messages about the daemons is coming from two different sources. Grid is saying it was able to spawn the orted - then the orted is saying it doesn't know how to communicate and fails. I think the root of the problem lies in the plm output that shows the qrsh it will use to start the job. Fo

Re: [OMPI users] I can't build openmpi 4.0.X using PMIx 3.1.5 to use with Slurm

2020-05-12 Thread Ralph Castain via users
Try adding --without-psm2 to the PMIx configure line - sounds like you have that library installed on your machine, even though you don't have omnipath. On May 12, 2020, at 4:42 AM, Leandro via users mailto:users@lists.open-mpi.org> > wrote: HI,  I compile it statically to make sure compilers

Re: [OMPI users] I can't build openmpi 4.0.X using PMIx 3.1.5 to use with Slurm

2020-05-11 Thread Ralph Castain via users
I'm not sure I understand why you are trying to build CentOS rpms for PMIx, Slurm, or OMPI - all three are readily available online. Is there some particular reason you are trying to do this yourself? I ask because it is non-trivial to do and requires significant familiarity with both the intri

Re: [OMPI users] can't open /dev/ipath, network down (err=26)

2020-05-08 Thread Ralph Castain via users
I fear those cards are past end-of-life so far as support is concerned. I'm not sure if anyone can really advise you on them. It sounds like the fabric is experiencing failures, but that's just a guess. On May 8, 2020, at 12:56 PM, Prentice Bisbal via users mailto:users@lists.open-mpi.org> > w

Re: [OMPI users] [External] Re: Can't start jobs with srun.

2020-05-06 Thread Ralph Castain via users
The following (from what you posted earlier): $ srun --mpi=list srun: MPI types are... srun: none srun: pmix_v3 srun: pmi2 srun: openmpi srun: pmix would indicate that Slurm was built against a PMIx v3.x release. Using OMPI v4.0.3 with pmix=internal should be just fine so long as you set --mpi=p
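A sketch of the launch this implies, assuming a two-task job (executable name hypothetical):

  srun --mpi=pmix_v3 -n 2 ./app
  # 'srun --mpi=list' should include pmix_v3, as in the output quoted above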

Re: [OMPI users] Can't start jobs with srun.

2020-04-26 Thread Ralph Castain via users
PMIx: $ srun --mpi=list srun: MPI types are... srun: none srun: pmi2 srun: openmpi I did launch the job with srun --mpi=pmi2 Does OpenMPI 4 need PMIx specifically? On 4/23/20 10:23 AM, Ralph Castain via users wrote: Is Slurm built with PMIx support? Did you tell srun to use it? On Apr 23

Re: [OMPI users] [External] Re: Can't start jobs with srun.

2020-04-23 Thread Ralph Castain via users
ls. Why is that? Can I not trust the output > of --mpi=list? > > Prentice > > On 4/23/20 10:43 AM, Ralph Castain via users wrote: >> No, but you do have to explicitly build OMPI with non-PMIx support if that >> is what you are going to use. In this case, you need to

Re: [OMPI users] [External] Re: Can't start jobs with srun.

2020-04-23 Thread Ralph Castain via users
--mpi=pmi2 > > Does OpenMPI 4 need PMIx specifically? > > > On 4/23/20 10:23 AM, Ralph Castain via users wrote: >> Is Slurm built with PMIx support? Did you tell srun to use it? >> >> >>> On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users >>

Re: [OMPI users] Can't start jobs with srun.

2020-04-23 Thread Ralph Castain via users
Is Slurm built with PMIx support? Did you tell srun to use it? > On Apr 23, 2020, at 7:00 AM, Prentice Bisbal via users > wrote: > > I'm using OpenMPI 4.0.3 with Slurm 19.05.5 I'm testing the software with a > very simple hello, world MPI program that I've used reliably for years. When > I

Re: [OMPI users] Meaning of mpiexec error flags

2020-04-14 Thread Ralph Castain via users
the difference between the working node flag (0x11) and the non-working nodes’ flags (0x13) is the flag PRRTE_NODE_FLAG_LOC_VERIFIED.    What does that imply?   The location of the daemon has NOT been verified?  Kurt  From: users mailto:users-boun...@lists.open-mpi.org> > On Behalf Of Ralph C

Re: [OMPI users] Meaning of mpiexec error flags

2020-04-13 Thread Ralph Castain via users
I updated the message to explain the flags (instead of a numerical value) for OMPI v5. In brief: #define PRRTE_NODE_FLAG_DAEMON_LAUNCHED    0x01   // whether or not the daemon on this node has been launched #define PRRTE_NODE_FLAG_LOC_VERIFIED               0x02   // whether or not the location
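A worked decode of the two values discussed in the thread, using the definitions above (only the 0x01 and 0x02 bits appear in the snippet; the 0x10 bit belongs to a later flag in the same series):

  0x11 = 0x10 | 0x01          -> DAEMON_LAUNCHED set, LOC_VERIFIED clear
  0x13 = 0x10 | 0x02 | 0x01   -> DAEMON_LAUNCHED and LOC_VERIFIED both set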

Re: [OMPI users] Clean termination after receiving multiple SIGINT

2020-04-06 Thread Ralph Castain via users
mailto:moritz.kreut...@siemens.com> www.sw.siemens.com <http://www.sw.siemens.com/>   From: users mailto:users-boun...@lists.open-mpi.org> > On Behalf Of Ralph Castain via users Sent: Montag, 6. April 2020 16:32 To: Open MPI Users mailto:users@lists.open-mpi.org> > Cc: Ralph Castain mail

Re: [OMPI users] Clean termination after receiving multiple SIGINT

2020-04-06 Thread Ralph Castain via users
Currently, mpirun takes that second SIGINT to mean "you seem to be stuck trying to cleanly abort - just die", which means mpirun exits immediately without doing any cleanup. The individual procs all commit suicide when they see their daemons go away, which is why you don't get zombies left behin

Re: [OMPI users] mpirun CLI parsing

2020-03-30 Thread Ralph Castain via users
I'm afraid the short answer is "no" - there is no way to do that today. > On Mar 30, 2020, at 1:45 PM, Jean-Baptiste Skutnik via users > wrote: > > Hello, > > I am writing a wrapper around `mpirun` which requires pre-processing of the > user's program. To achieve this, I need to isolate the

Re: [OMPI users] MPI_Comm_spawn: no allocated resources for the application ...

2020-03-16 Thread Ralph Castain via users
Sorry for the incredibly late reply. Hopefully, you have already managed to find the answer. I'm not sure what your comm_spawn command looks like, but it appears you specified the host in it using the "dash_host" info-key, yes? The problem is that this is interpreted the same way as the "-host

Re: [OMPI users] Interpreting the output of --display-map and --display-allocation

2020-03-16 Thread Ralph Castain via users
FWIW: I have replaced those flags in the display option output with their string equivalent to make interpretation easier. This is available in OMPI master and will be included in the v5 release. > On Nov 21, 2019, at 2:08 AM, Peter Kjellström via users > wrote: > > On Mon, 18 Nov 2019 17:4

Re: [OMPI users] Propagating SIGINT instead of SIGTERM to children processes

2020-03-16 Thread Ralph Castain via users
Hi Nathan Sorry for the long, long delay in responding - no reasonable excuse (just busy, switching over support areas, etc.). Hopefully, you already found the solution. You can specify the signals to forward to children using an MCA parameter: OMPI_MCA_ess_base_forward_signals=SIGINT should d
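A sketch of the suggested setting, assuming a bash environment around mpirun (executable name hypothetical):

  export OMPI_MCA_ess_base_forward_signals=SIGINT
  mpirun -np 4 ./app   # SIGINT will now be forwarded to the child processes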

Re: [OMPI users] [EXTERNAL] Shmem errors on Mac OS Catalina

2020-02-06 Thread Ralph Castain via users
It is also wise to create a "tmp" directory under your home directory, and reset TMPDIR to point there. Avoiding use of the system tmpdir is highly advisable under Mac OS, especially Catalina. On Feb 6, 2020, at 4:09 PM, Gutierrez, Samuel K. via users mailto:users@lists.open-mpi.org> > wrote:
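A sketch of the suggested workaround on macOS (the application name is a placeholder):

  mkdir -p $HOME/tmp
  export TMPDIR=$HOME/tmp
  mpirun -np 2 ./app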
