Re: [OMPI users] Using OpenMPI / ORTE as cluster aware GNU Parallel
Hi, "r...@open-mpi.org" writes: > You might want to try using the DVM (distributed virtual machine) > mode in ORTE. You can start it on an allocation using the “orte-dvm” > cmd, and then submit jobs to it with “mpirun --hnp ”, where foo > is either the contact info printed out by orte-dvm, or the name of > the file you told orte-dvm to put that info in. You’ll need to take > it from OMPI master at this point. this question looked interesting so I gave it a try. In a cluster with Slurm I had no problem submitting a job which launched an orte-dvm -report-uri ... and then use that file to launch jobs onto that virtual machine via orte-submit. To be useful to us at this point, I should be able to start executing jobs if there are cores available and just hold them in a queue if the cores are already filled. At this point this is not happenning, and if I try to submit a second job while the previous one has not finished, I get a message like: , | DVM ready | -- | All nodes which are allocated for this job are already filled. | -- ` With the DVM, is it possible to keep these jobs in some sort of queue, so that they will be executed when the cores get free? Thanks, -- Ángel de Vicente http://www.iac.es/galeria/angelv/ - ADVERTENCIA: Sobre la privacidad y cumplimiento de la Ley de Protección de Datos, acceda a http://www.iac.es/disclaimer.php WARNING: For more information on privacy and fulfilment of the Law concerning the Protection of Data, consult http://www.iac.es/disclaimer.php?lang=en ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] Using OpenMPI / ORTE as cluster aware GNU Parallel
> On Feb 27, 2017, at 4:58 AM, Angel de Vicente wrote:
>
> […]
>
> With the DVM, is it possible to keep these jobs in some sort of queue,
> so that they will be executed when the cores get free?

It wouldn’t be hard to do so - as long as it was just a simple FIFO
scheduler. I wouldn’t want it to get too complex.
Re: [OMPI users] Using OpenMPI / ORTE as cluster aware GNU Parallel
Hi, "r...@open-mpi.org" writes: >> With the DVM, is it possible to keep these jobs in some sort of queue, >> so that they will be executed when the cores get free? > > It wouldn’t be hard to do so - as long as it was just a simple FIFO > scheduler. I wouldn’t want it to get too complex. a simple FIFO should be probably enough. This can be useful as a simple way to make a multi-core machine accessible to a small group of (friendly) users, making sure that they don't oversubscribe the machine, but without going the full route of installing/maintaining a full resource manager. Cheers, -- Ángel de Vicente http://www.iac.es/galeria/angelv/ - ADVERTENCIA: Sobre la privacidad y cumplimiento de la Ley de Protección de Datos, acceda a http://www.iac.es/disclaimer.php WARNING: For more information on privacy and fulfilment of the Law concerning the Protection of Data, consult http://www.iac.es/disclaimer.php?lang=en ___ users mailing list users@lists.open-mpi.org https://rfd.newmexicoconsortium.org/mailman/listinfo/users
Re: [OMPI users] Using OpenMPI / ORTE as cluster aware GNU Parallel
Hi,

> On 27.02.2017, at 14:33, Angel de Vicente wrote:
>
> A simple FIFO should probably be enough. This could be useful as a simple
> way to make a multi-core machine accessible to a small group of (friendly)
> users, making sure that they don't oversubscribe the machine, but without
> going the full route of installing/maintaining a full resource manager.

At first I thought you wanted to run a queuing system inside a queuing system,
but it looks like you want to replace the resource manager instead.

Under which user account will the DVM daemons run? Are all users using the
same account?

-- Reuti
[OMPI users] fatal error with openmpi-2.1.0rc1 on Linux with Sun C
Hi,

I tried to install openmpi-2.1.0rc1 on my "SUSE Linux Enterprise Server 12.2
(x86_64)" with Sun C 5.14. Unfortunately, "make" breaks with the following
error. I had reported the same problem for openmpi-master-201702150209-404fe32,
and Gilles was able to solve it (https://github.com/pmix/master/pull/309).

...
  CC       src/dstore/pmix_esh.lo
"/export2/src/openmpi-2.1.0/openmpi-2.1.0rc1/opal/include/opal/sys/amd64/atomic.h", line 159: warning: parameter in inline asm statement unused: %3
"/export2/src/openmpi-2.1.0/openmpi-2.1.0rc1/opal/include/opal/sys/amd64/atomic.h", line 205: warning: parameter in inline asm statement unused: %2
"/export2/src/openmpi-2.1.0/openmpi-2.1.0rc1/opal/include/opal/sys/amd64/atomic.h", line 226: warning: parameter in inline asm statement unused: %2
"/export2/src/openmpi-2.1.0/openmpi-2.1.0rc1/opal/include/opal/sys/amd64/atomic.h", line 247: warning: parameter in inline asm statement unused: %2
"/export2/src/openmpi-2.1.0/openmpi-2.1.0rc1/opal/include/opal/sys/amd64/atomic.h", line 268: warning: parameter in inline asm statement unused: %2
cc: Fatal error in /opt/sun/developerstudio12.5/lib/compilers/bin/acomp : Signal number = 139
Makefile:1329: recipe for target 'src/dstore/pmix_esh.lo' failed
make[4]: *** [src/dstore/pmix_esh.lo] Error 1
make[4]: Leaving directory '/export2/src/openmpi-2.1.0/openmpi-2.1.0rc1-Linux.x86_64.64_cc/opal/mca/pmix/pmix112/pmix'
Makefile:1596: recipe for target 'all-recursive' failed
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory '/export2/src/openmpi-2.1.0/openmpi-2.1.0rc1-Linux.x86_64.64_cc/opal/mca/pmix/pmix112/pmix'
Makefile:1941: recipe for target 'all-recursive' failed
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory '/export2/src/openmpi-2.1.0/openmpi-2.1.0rc1-Linux.x86_64.64_cc/opal/mca/pmix/pmix112'
Makefile:2307: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory '/export2/src/openmpi-2.1.0/openmpi-2.1.0rc1-Linux.x86_64.64_cc/opal'
Makefile:1806: recipe for target 'all-recursive' failed
make: *** [all-recursive] Error 1
loki openmpi-2.1.0rc1-Linux.x86_64.64_cc 129

Gilles, I would be grateful if you could fix the problem for openmpi-2.1.0rc1
as well. Thank you very much in advance for your help.

Kind regards

Siegmar
[OMPI users] Fwd: [OMPI USERS] Fault Tolerance and migration
Hi,

I am interested in using OpenMPI to manage the distribution on a MicroZed
cluster. These MicroZed boards come with a Zynq device, which has a dual-core
ARM Cortex-A9. One of the objectives of the project I am working on is
resilience, so I am truly interested in the fault tolerance provided by
OpenMPI.

What I want to know is whether there is any implementation for run-time
migration. For instance, if I have an octa-MicroZed cluster running an MPI job
and I unplug the Ethernet cable of one of them, or I reboot another one, is
there any support in OpenMPI to detect these failures and migrate the ranks to
other processors at run time?

Thank you in advance,
Alberto.
Re: [OMPI users] Using OpenMPI / ORTE as cluster aware GNU Parallel
> On 27.02.2017, at 18:24, Angel de Vicente wrote:
>
> […]
>
> For a small group of users, if the DVM can run with my user and there is no
> restriction on who can use it, or if I can somehow authorize others to use
> it (via an authority file or similar), that should be enough.

AFAICS there is no user authorization at all. Everyone can hijack a running
DVM once he knows the URI. The only problem might be that all processes are
running under the account of the user who started the DVM, i.e. output files
have to go to the home directory of this user, as no other user can write to
his own directory any longer this way.

Running the DVM under root might help, but there would be a high risk that a
faulty script writes to a place where sensitive system information is stored
and leaves the machine unusable afterwards.

My first attempts at using the DVM often led to a terminated DVM once a
process returned with a non-zero exit code. But once the DVM is gone, I fear
the queued jobs are lost too. I would wish that the DVM were more forgiving
(or that this behaviour were adjustable: what to do in case of a non-zero
exit code).

-- Reuti
Re: [OMPI users] Using OpenMPI / ORTE as cluster aware GNU Parallel
Hi,

Reuti writes:
> At first I thought you wanted to run a queuing system inside a queuing
> system, but it looks like you want to replace the resource manager instead.

Yes, if this could work reasonably well, we could do without the resource
manager.

> Under which user account will the DVM daemons run? Are all users using the
> same account?

Well, even if this worked only for one user it could still be useful, as I
could use it the way I now use GNU Parallel or a private Condor system, where
I can submit hundreds of jobs but make sure they get executed without
oversubscribing the machine.

For a small group of users, if the DVM can run with my user and there is no
restriction on who can use it, or if I can somehow authorize others to use it
(via an authority file or similar), that should be enough.

Thanks,
--
Ángel de Vicente
http://www.iac.es/galeria/angelv/
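A hypothetical usage sketch of that "hundreds of jobs" pattern against a
running DVM - the option spellings and the assumption that orte-submit blocks
until its job finishes come from this thread and are not verified here:

    URI=file:/home/user/dvm.uri

    for input in run_*.cfg; do
        # submit sequentially: each orte-submit is assumed to return when its
        # job completes, so the DVM is never asked for more slots than it has
        orte-submit --hnp "$URI" -np 4 ./my_model "$input"
    done
    # with the FIFO queue discussed above, these could instead be submitted
    # all at once and left to the DVM to schedule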
Re: [OMPI users] Using OpenMPI / ORTE as cluster aware GNU Parallel
> On Feb 27, 2017, at 9:39 AM, Reuti wrote:
>
> AFAICS there is no user authorization at all. Everyone can hijack a running
> DVM once he knows the URI. The only problem might be that all processes are
> running under the account of the user who started the DVM, i.e. output files
> have to go to the home directory of this user, as no other user can write to
> his own directory any longer this way.

We can add some authorization protection, at least at the user/group level.
One can resolve the directory issue by creating some place that has group
authorities, and then requesting that to be the working directory.

> Running the DVM under root might help, but there would be a high risk that a
> faulty script writes to a place where sensitive system information is stored
> and leaves the machine unusable afterwards.

I would advise against that.

> My first attempts at using the DVM often led to a terminated DVM once a
> process returned with a non-zero exit code. But once the DVM is gone, I fear
> the queued jobs are lost too. I would wish that the DVM were more forgiving
> (or that this behaviour were adjustable: what to do in case of a non-zero
> exit code).

We just fixed that issue the other day :-)
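A sketch of the group-writable working directory suggested above - the
directory, group name, and the -wdir option (assumed here to be accepted by
orte-submit the same way mpirun accepts it) are illustrative assumptions:

    # one-time setup by an administrator
    mkdir -p /scratch/dvm-shared
    chgrp dvmusers /scratch/dvm-shared      # group containing all DVM users
    chmod 2775 /scratch/dvm-shared          # setgid: new files keep the group

    # submit a job that runs, and writes its output, in the shared directory
    orte-submit --hnp file:/home/user/dvm.uri -np 4 \
                -wdir /scratch/dvm-shared ./my_app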
Re: [OMPI users] fatal error with openmpi-2.1.0rc1 on Linux with Sun C
Drat! Thanks for letting us know. That fix was missed when we swept through to
create PMIx v1.2.1 - which triggered the OMPI v2.1.0rc1. Sorry about that :(

Jeff filed an Issue to track this here:
  https://github.com/open-mpi/ompi/issues/3048

I've filed a PR against PMIx to bring it into the next PMIx v1.2.2 here:
  https://github.com/pmix/master/pull/322

We'll follow up on the Issue with the resolution tomorrow morning during the
OMPI developer's teleconf.

On Mon, Feb 27, 2017 at 8:05 AM, Siegmar Gross
<siegmar.gr...@informatik.hs-fulda.de> wrote:

> Hi,
>
> I tried to install openmpi-2.1.0rc1 on my "SUSE Linux Enterprise Server 12.2
> (x86_64)" with Sun C 5.14. Unfortunately, "make" breaks with the following
> error. I had reported the same problem for
> openmpi-master-201702150209-404fe32. Gilles was able to solve the problem
> (https://github.com/pmix/master/pull/309).
>
> […]

--
Josh Hursey
IBM Spectrum MPI Developer
[OMPI users] Does MPI_Iallreduce work with CUDA-Aware in OpenMPI-2.0.2?
Hi list,

I would like to know whether MPI_Iallreduce is supported with CUDA-aware
transfers in OpenMPI-2.0.2.

The page https://www.open-mpi.org/faq/?category=runcuda, updated on 06/2016,
says this is not supported (as of openmpi-1.8.5). Any updates on this?

Thank you
Junjie Qian
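As a quick compile-time check (this only reports whether the installed Open
MPI build has CUDA support at all, not whether a specific non-blocking
collective such as MPI_Iallreduce handles GPU buffers):

    # query the MCA parameter the CUDA FAQ documents for this purpose
    ompi_info --parsable --all | grep mpi_built_with_cuda_support:value
    # expected output for a CUDA-aware build:
    #   mca:mpi:base:param:mpi_built_with_cuda_support:value:true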
Re: [OMPI users] Fwd: [OMPI USERS] Fault Tolerance and migration
Alberto,

In the current master there is no such support (we had support for migration a
while back, but we have stripped it out). However, at UTK we developed a fork
of Open MPI, called ULFM, which provides fault management capabilities. This
fork provides support to detect failures, and support for handling the fault
in the MPI layer. I suggest you look at fault-tolerance.org for more info.

George.

On Mon, Feb 27, 2017 at 11:23 AM, Alberto Ortiz wrote:

> Hi,
> I am interested in using OpenMPI to manage the distribution on a MicroZed
> cluster. […]
>
> What I want to know is whether there is any implementation for run-time
> migration. For instance, if I have an octa-MicroZed cluster running an MPI
> job and I unplug the Ethernet cable of one of them, or I reboot another one,
> is there any support in OpenMPI to detect these failures and migrate the
> ranks to other processors at run time?
>
> Thank you in advance,
> Alberto.
[OMPI users] Issues with different IB adapters and openmpi 2.0.2
We have a couple of nodes with different IB adapters in them:

font1/var/log/lspci:03:00.0 InfiniBand [0c06]: Mellanox Technologies MT25204 [InfiniHost III Lx HCA] [15b3:6274] (rev 20)
font2/var/log/lspci:03:00.0 InfiniBand [0c06]: QLogic Corp. IBA7220 InfiniBand HCA [1077:7220] (rev 02)
font3/var/log/lspci:03:00.0 InfiniBand [0c06]: QLogic Corp. IBA7220 InfiniBand HCA [1077:7220] (rev 02)

With 1.10.3 we saw the following errors with mpirun:

[font2.cora.nwra.com:13982] [[23220,1],10] selected pml cm, but peer [[23220,1],0] on font1 selected pml ob1

which crashed MPI_Init. We worked around this by passing "--mca pml ob1".

I notice now that with openmpi 2.0.2 without that option I no longer see the
error, but the MPI program hangs shortly after startup. Re-adding the option
makes it work, so I'm assuming the underlying problem is still the same, but
openmpi appears to have stopped alerting me to the issue.

Thoughts?

--
Orion Poplawski
Technical Manager              720-772-5637
NWRA, Boulder/CoRA Office      FAX: 303-415-9702
3380 Mitchell Lane             or...@nwra.com
Boulder, CO 80301              http://www.nwra.com
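For reference, the workaround in use - the application, host file, and the
idea of making the setting persistent in an MCA parameter file are
illustrative; only "--mca pml ob1" itself comes from this thread:

    # force the ob1 PML for a single run
    mpirun --mca pml ob1 -np 16 --hostfile hosts ./my_app

    # or make it the default for one user so it does not have to be repeated
    echo "pml = ob1" >> ~/.openmpi/mca-params.conf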
Re: [OMPI users] Issues with different IB adapters and openmpi 2.0.2
Hi Orion,

Does the problem occur if you only use font2 and font3? Do you have MXM
installed on the font1 node?

The 2.x series is using PMIx, and it could be that this is impacting the PML
sanity check.

Howard

Orion Poplawski wrote on Mon, 27 Feb 2017 at 14:50:

> We have a couple of nodes with different IB adapters in them:
>
> […]
>
> We worked around this by passing "--mca pml ob1". I notice now that with
> openmpi 2.0.2 without that option I no longer see the error, but the MPI
> program hangs shortly after startup. Re-adding the option makes it work, so
> I'm assuming the underlying problem is still the same, but openmpi appears
> to have stopped alerting me to the issue.
>
> Thoughts?