Re: [OMPI users] Using OpenMPI / ORTE as cluster aware GNU Parallel

2017-02-27 Thread Angel de Vicente
Hi,

"r...@open-mpi.org"  writes:
> You might want to try using the DVM (distributed virtual machine)
> mode in ORTE. You can start it on an allocation using the “orte-dvm”
> cmd, and then submit jobs to it with “mpirun --hnp <foo>”, where foo
> is either the contact info printed out by orte-dvm, or the name of
> the file you told orte-dvm to put that info in. You’ll need to take
> it from OMPI master at this point.

this question looked interesting, so I gave it a try. On a cluster with
Slurm I had no problem submitting a job which launched an orte-dvm
-report-uri ... and then used that file to launch jobs onto that virtual
machine via orte-submit.
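
For reference, the rough workflow was along these lines (just a sketch, not
my exact commands; the file and program names are made up, and I'm assuming
orte-dvm's --report-uri and orte-submit's --hnp options behave as described
above):

  # start_dvm.sh (the batch job just starts the DVM and keeps it running):
  #!/bin/bash
  orte-dvm --report-uri $HOME/dvm_uri.txt

  $ sbatch -N 2 start_dvm.sh
  # once "DVM ready" appears in the job output, launch work onto the DVM:
  $ orte-submit --hnp file:$HOME/dvm_uri.txt -np 4 ./my_app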

To be useful to us, I should be able to start executing jobs if there
are cores available and just hold them in a queue if the cores are
already filled. At the moment this is not happening, and if I try to
submit a second job while the previous one has not finished, I get a
message like:

,----
| DVM ready
| --------------------------------------------------------------------------
| All nodes which are allocated for this job are already filled.
| --------------------------------------------------------------------------
`----

With the DVM, is it possible to keep these jobs in some sort of queue,
so that they will be executed when the cores get free?

Thanks,
-- 
Ángel de Vicente
http://www.iac.es/galeria/angelv/  


Re: [OMPI users] Using OpenMPI / ORTE as cluster aware GNU Parallel

2017-02-27 Thread r...@open-mpi.org

> On Feb 27, 2017, at 4:58 AM, Angel de Vicente  wrote:
> 
> Hi,
> 
> "r...@open-mpi.org"  writes:
>> You might want to try using the DVM (distributed virtual machine)
>> mode in ORTE. You can start it on an allocation using the “orte-dvm”
>> cmd, and then submit jobs to it with “mpirun --hnp <foo>”, where foo
>> is either the contact info printed out by orte-dvm, or the name of
>> the file you told orte-dvm to put that info in. You’ll need to take
>> it from OMPI master at this point.
> 
> this question looked interesting, so I gave it a try. On a cluster with
> Slurm I had no problem submitting a job which launched an orte-dvm
> -report-uri ... and then used that file to launch jobs onto that virtual
> machine via orte-submit.
> 
> To be useful to us, I should be able to start executing jobs if there
> are cores available and just hold them in a queue if the cores are
> already filled. At the moment this is not happening, and if I try to
> submit a second job while the previous one has not finished, I get a
> message like:
> 
> ,----
> | DVM ready
> | --------------------------------------------------------------------------
> | All nodes which are allocated for this job are already filled.
> | --------------------------------------------------------------------------
> `----
> 
> With the DVM, is it possible to keep these jobs in some sort of queue,
> so that they will be executed when the cores get free?

It wouldn’t be hard to do so - as long as it was just a simple FIFO scheduler. 
I wouldn’t want it to get too complex.

> 
> Thanks,
> -- 
> Ángel de Vicente
> http://www.iac.es/galeria/angelv/  
> 


Re: [OMPI users] Using OpenMPI / ORTE as cluster aware GNU Parallel

2017-02-27 Thread Angel de Vicente
Hi,

"r...@open-mpi.org"  writes:
>> With the DVM, is it possible to keep these jobs in some sort of queue,
>> so that they will be executed when the cores get free?
>
> It wouldn’t be hard to do so - as long as it was just a simple FIFO 
> scheduler. I wouldn’t want it to get too complex.

a simple FIFO should probably be enough. This can be useful as a simple
way to make a multi-core machine accessible to a small group of (friendly)
users, making sure that they don't oversubscribe the machine, but
without going the full route of installing/maintaining a full resource
manager.
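
Concretely, what I have in mind is something like this (just a sketch under
the assumption that the URI file can be placed somewhere shared; the paths
and program names are made up):

  # started once on the shared machine by whoever "owns" the DVM:
  $ orte-dvm --report-uri /shared/dvm_uri.txt &
  # each user then funnels work through the DVM instead of running mpirun directly:
  $ orte-submit --hnp file:/shared/dvm_uri.txt -np 8 ./task_A
  $ orte-submit --hnp file:/shared/dvm_uri.txt -np 8 ./task_B   # with a FIFO this would wait for free cores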

Cheers,
-- 
Ángel de Vicente
http://www.iac.es/galeria/angelv/  


Re: [OMPI users] Using OpenMPI / ORTE as cluster aware GNU Parallel

2017-02-27 Thread Reuti
Hi,

> On 27.02.2017 at 14:33, Angel de Vicente wrote:
> 
> Hi,
> 
> "r...@open-mpi.org"  writes:
>>> With the DVM, is it possible to keep these jobs in some sort of queue,
>>> so that they will be executed when the cores get free?
>> 
>> It wouldn’t be hard to do so - as long as it was just a simple FIFO 
>> scheduler. I wouldn’t want it to get too complex.
> 
> a simple FIFO should probably be enough. This can be useful as a simple
> way to make a multi-core machine accessible to a small group of (friendly)
> users, making sure that they don't oversubscribe the machine, but
> without going the full route of installing/maintaining a full resource
> manager.

At first I thought you wanted to run a queuing system inside a queuing system,
but it looks like you want to replace the resource manager.

Under which user account will the DVM daemons run? Are all users using the same
account?

-- Reuti



[OMPI users] fatal error with openmpi-2.1.0rc1 on Linux with Sun C

2017-02-27 Thread Siegmar Gross

Hi,

I tried to install openmpi-2.1.0rc1 on my "SUSE Linux Enterprise
Server 12.2 (x86_64)" with Sun C 5.14. Unfortunately, "make"
breaks with the following error. I had reported the same problem
for openmpi-master-201702150209-404fe32. Gilles was able to solve
the problem (https://github.com/pmix/master/pull/309).

...
  CC   src/dstore/pmix_esh.lo
"/export2/src/openmpi-2.1.0/openmpi-2.1.0rc1/opal/include/opal/sys/amd64/atomic.h",
 line 159: warning: parameter in inline asm statement unused: %3
"/export2/src/openmpi-2.1.0/openmpi-2.1.0rc1/opal/include/opal/sys/amd64/atomic.h",
 line 205: warning: parameter in inline asm statement unused: %2
"/export2/src/openmpi-2.1.0/openmpi-2.1.0rc1/opal/include/opal/sys/amd64/atomic.h",
 line 226: warning: parameter in inline asm statement unused: %2
"/export2/src/openmpi-2.1.0/openmpi-2.1.0rc1/opal/include/opal/sys/amd64/atomic.h",
 line 247: warning: parameter in inline asm statement unused: %2
"/export2/src/openmpi-2.1.0/openmpi-2.1.0rc1/opal/include/opal/sys/amd64/atomic.h",
 line 268: warning: parameter in inline asm statement unused: %2
cc: Fatal error in /opt/sun/developerstudio12.5/lib/compilers/bin/acomp : 
Signal number = 139
Makefile:1329: recipe for target 'src/dstore/pmix_esh.lo' failed
make[4]: *** [src/dstore/pmix_esh.lo] Error 1
make[4]: Leaving directory 
'/export2/src/openmpi-2.1.0/openmpi-2.1.0rc1-Linux.x86_64.64_cc/opal/mca/pmix/pmix112/pmix'
Makefile:1596: recipe for target 'all-recursive' failed
make[3]: *** [all-recursive] Error 1
make[3]: Leaving directory 
'/export2/src/openmpi-2.1.0/openmpi-2.1.0rc1-Linux.x86_64.64_cc/opal/mca/pmix/pmix112/pmix'
Makefile:1941: recipe for target 'all-recursive' failed
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory 
'/export2/src/openmpi-2.1.0/openmpi-2.1.0rc1-Linux.x86_64.64_cc/opal/mca/pmix/pmix112'
Makefile:2307: recipe for target 'all-recursive' failed
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory 
'/export2/src/openmpi-2.1.0/openmpi-2.1.0rc1-Linux.x86_64.64_cc/opal'
Makefile:1806: recipe for target 'all-recursive' failed
make: *** [all-recursive] Error 1
loki openmpi-2.1.0rc1-Linux.x86_64.64_cc 129


Gilles, I would be grateful if you could fix the problem for
openmpi-2.1.0rc1 as well. Thank you very much in advance for your
help.


Kind regards

Siegmar


[OMPI users] Fwd: [OMPI USERS] Fault Tolerance and migration

2017-02-27 Thread Alberto Ortiz
Hi,
I am interested in using OpenMPI to manage the distribution of work on a
MicroZed cluster. These MicroZed boards come with a Zynq device, which has a
dual-core ARM Cortex-A9. One of the objectives of the project I am working
on is resilience, so I am truly interested in the fault tolerance provided
by OpenMPI.

What I want to know is whether there is any implementation of run-time
migration. For instance, if I have an octa-MicroZed cluster running an MPI
job and I unplug the Ethernet cable of one of the boards or reboot another,
is there any support in OpenMPI to detect these failures and migrate the
ranks to other processors at run time?

Thank you in advance,
Alberto.

Re: [OMPI users] Using OpenMPI / ORTE as cluster aware GNU Parallel

2017-02-27 Thread Reuti

> On 27.02.2017 at 18:24, Angel de Vicente wrote:
> 
> […]
> 
> For a small group of users, if the DVM can run under my account and there is
> no restriction on who can use it, or if I can somehow authorize others to
> use it (via an authority file or similar), that should be enough.

AFAICS there is no user authorization at all. Everyone can hijack a running DVM
once they know the URI. The only problem might be that all processes are
running under the account of the user who started the DVM, i.e. output files
have to go to a directory writable by that user, since the jobs can no longer
write into the submitting user's own directory this way.

Running the DVM under root might help, but this carries a high risk: any
faulty script might write to a place where sensitive system information is
stored and may leave the machine unusable afterwards.

My first attempts at using the DVM often led to the DVM terminating once a
process returned with a non-zero exit code. But once the DVM is gone, the
queued jobs might be lost too, I fear. I would wish the DVM to be more
forgiving (or that its behaviour on a non-zero exit code were adjustable).

-- Reuti



Re: [OMPI users] Using OpenMPI / ORTE as cluster aware GNU Parallel

2017-02-27 Thread Angel de Vicente
Hi,

Reuti  writes:
> At first I thought you wanted to run a queuing system inside a queuing
> system, but it looks like you want to replace the resource manager.

yes, if this could work reasonably well, we could do without the
resource manager.

> Under which user account will the DVM daemons run? Are all users using the
> same account?

Well, even if this worked only for one user it could still be useful,
since I could use it the way I now use GNU Parallel or a private Condor
system, where I can submit hundreds of jobs but make sure they get
executed without oversubscribing.

For a small group of users, if the DVM can run under my account and there is
no restriction on who can use it, or if I can somehow authorize others to
use it (via an authority file or similar), that should be enough.

Thanks,
-- 
Ángel de Vicente
http://www.iac.es/galeria/angelv/  


Re: [OMPI users] Using OpenMPI / ORTE as cluster aware GNU Parallel

2017-02-27 Thread r...@open-mpi.org

> On Feb 27, 2017, at 9:39 AM, Reuti  wrote:
> 
> 
>> On 27.02.2017 at 18:24, Angel de Vicente wrote:
>> 
>> […]
>> 
>> For a small group of users, if the DVM can run under my account and there is
>> no restriction on who can use it, or if I can somehow authorize others to
>> use it (via an authority file or similar), that should be enough.
> 
> AFAICS there is no user authorization at all. Everyone can hijack a running
> DVM once they know the URI. The only problem might be that all processes are
> running under the account of the user who started the DVM, i.e. output files
> have to go to a directory writable by that user, since the jobs can no longer
> write into the submitting user's own directory this way.

We can add some authorization protection, at least at the user/group level. One
can resolve the directory issue by creating some place that has group write
permissions, and then requesting that as the working directory.
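
Roughly something like the following (a sketch only; the group name and paths
are made up, and I'm assuming the usual --wdir option of orte-submit/mpirun):

  # create a group-writable working area (setgid so new files keep the group):
  $ sudo mkdir -p /scratch/dvm-shared
  $ sudo chgrp dvmusers /scratch/dvm-shared
  $ sudo chmod 2770 /scratch/dvm-shared
  # then point submitted jobs at it:
  $ orte-submit --hnp file:dvm_uri.txt --wdir /scratch/dvm-shared -np 4 ./my_app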

> 
> Running the DVM under root might help, but this carries a high risk: any
> faulty script might write to a place where sensitive system information is
> stored and may leave the machine unusable afterwards.
> 

I would advise against that

> My first attempts at using the DVM often led to the DVM terminating once a
> process returned with a non-zero exit code. But once the DVM is gone, the
> queued jobs might be lost too, I fear. I would wish the DVM to be more
> forgiving (or that its behaviour on a non-zero exit code were adjustable).

We just fixed that issue the other day :-)

> 
> -- Reuti


Re: [OMPI users] fatal error with openmpi-2.1.0rc1 on Linux with Sun C

2017-02-27 Thread Josh Hursey
Drat! Thanks for letting us know. That fix was missed when we swept through
to create the PMIx v1.2.1 - which triggered the OMPI v2.1.0rc1. Sorry about
that :(

Jeff filed an Issue to track this here:
  https://github.com/open-mpi/ompi/issues/3048

I've filed a PR against PMIx to bring it into the next PMIx v1.2.2 here:
  https://github.com/pmix/master/pull/322

We'll follow up on the Issue with the resolution tomorrow morning during the
OMPI developers' teleconf.


On Mon, Feb 27, 2017 at 8:05 AM, Siegmar Gross <
siegmar.gr...@informatik.hs-fulda.de> wrote:

> Hi,
>
> I tried to install openmpi-2.1.0rc1 on my "SUSE Linux Enterprise
> Server 12.2 (x86_64)" with Sun C 5.14. Unfortunately, "make"
> breaks with the following error. I had reported the same problem
> for openmpi-master-201702150209-404fe32. Gilles was able to solve
> the problem (https://github.com/pmix/master/pull/309).
>
> ...
>   CC   src/dstore/pmix_esh.lo
> "/export2/src/openmpi-2.1.0/openmpi-2.1.0rc1/opal/include/opal/sys/amd64/atomic.h",
> line 159: warning: parameter in inline asm statement unused: %3
> "/export2/src/openmpi-2.1.0/openmpi-2.1.0rc1/opal/include/opal/sys/amd64/atomic.h",
> line 205: warning: parameter in inline asm statement unused: %2
> "/export2/src/openmpi-2.1.0/openmpi-2.1.0rc1/opal/include/opal/sys/amd64/atomic.h",
> line 226: warning: parameter in inline asm statement unused: %2
> "/export2/src/openmpi-2.1.0/openmpi-2.1.0rc1/opal/include/opal/sys/amd64/atomic.h",
> line 247: warning: parameter in inline asm statement unused: %2
> "/export2/src/openmpi-2.1.0/openmpi-2.1.0rc1/opal/include/opal/sys/amd64/atomic.h",
> line 268: warning: parameter in inline asm statement unused: %2
> cc: Fatal error in /opt/sun/developerstudio12.5/lib/compilers/bin/acomp :
> Signal number = 139
> Makefile:1329: recipe for target 'src/dstore/pmix_esh.lo' failed
> make[4]: *** [src/dstore/pmix_esh.lo] Error 1
> make[4]: Leaving directory '/export2/src/openmpi-2.1.0/op
> enmpi-2.1.0rc1-Linux.x86_64.64_cc/opal/mca/pmix/pmix112/pmix'
> Makefile:1596: recipe for target 'all-recursive' failed
> make[3]: *** [all-recursive] Error 1
> make[3]: Leaving directory '/export2/src/openmpi-2.1.0/op
> enmpi-2.1.0rc1-Linux.x86_64.64_cc/opal/mca/pmix/pmix112/pmix'
> Makefile:1941: recipe for target 'all-recursive' failed
> make[2]: *** [all-recursive] Error 1
> make[2]: Leaving directory '/export2/src/openmpi-2.1.0/op
> enmpi-2.1.0rc1-Linux.x86_64.64_cc/opal/mca/pmix/pmix112'
> Makefile:2307: recipe for target 'all-recursive' failed
> make[1]: *** [all-recursive] Error 1
> make[1]: Leaving directory '/export2/src/openmpi-2.1.0/op
> enmpi-2.1.0rc1-Linux.x86_64.64_cc/opal'
> Makefile:1806: recipe for target 'all-recursive' failed
> make: *** [all-recursive] Error 1
> loki openmpi-2.1.0rc1-Linux.x86_64.64_cc 129
>
>
> Gilles, I would be grateful if you could fix the problem for
> openmpi-2.1.0rc1 as well. Thank you very much in advance for your
> help.
>
>
> Kind regards
>
> Siegmar
>



-- 
Josh Hursey
IBM Spectrum MPI Developer

[OMPI users] Does MPI_Iallreduce work with CUDA-Aware in OpenMPI-2.0.2?

2017-02-27 Thread Junjie Qian
Hi list,

I would like to know whether MPI_Iallreduce is supported with CUDA-aware
transfers in openMPI-2.0.2.
The page https://www.open-mpi.org/faq/?category=runcuda, updated on 06/2016,
says it was not supported up to openmpi-1.8.5.

Any updates on this?
Thank you
Junjie Qian
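
P.S. For what it's worth, one way to at least confirm that a given build was
compiled with CUDA support at all (which by itself says nothing about
non-blocking collectives), as far as I can tell, is:

  $ ompi_info --parsable --all | grep mpi_built_with_cuda_support:value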



Re: [OMPI users] Fwd: [OMPI USERS] Fault Tolerance and migration

2017-02-27 Thread George Bosilca
Alberto,

In the master there is no such support (we had support for migration a
while back, but we have stripped it out). However, at UTK we developed a
fork of Open MPI, called ULFM, which provides fault-management
capabilities. This fork provides support for detecting failures and for
handling the fault at the MPI layer.

I suggest you look at fault-tolerance.org for more info.

  George.


On Mon, Feb 27, 2017 at 11:23 AM, Alberto Ortiz 
wrote:

> Hi,
> I am interested in using OpenMPI to manage the distribution of work on a
> MicroZed cluster. These MicroZed boards come with a Zynq device, which has a
> dual-core ARM Cortex-A9. One of the objectives of the project I am working
> on is resilience, so I am truly interested in the fault tolerance provided
> by OpenMPI.
>
> What I want to know is whether there is any implementation of run-time
> migration. For instance, if I have an octa-MicroZed cluster running an MPI
> job and I unplug the Ethernet cable of one of the boards or reboot another,
> is there any support in OpenMPI to detect these failures and migrate the
> ranks to other processors at run time?
>
> Thank you in advance,
> Alberto.
>
>
>

[OMPI users] Issues with different IB adapters and openmpi 2.0.2

2017-02-27 Thread Orion Poplawski
We have a couple nodes with different IB adapters in them:

font1/var/log/lspci:03:00.0 InfiniBand [0c06]: Mellanox Technologies MT25204
[InfiniHost III Lx HCA] [15b3:6274] (rev 20)
font2/var/log/lspci:03:00.0 InfiniBand [0c06]: QLogic Corp. IBA7220 InfiniBand
HCA [1077:7220] (rev 02)
font3/var/log/lspci:03:00.0 InfiniBand [0c06]: QLogic Corp. IBA7220 InfiniBand
HCA [1077:7220] (rev 02)

With 1.10.3 we saw the following errors with mpirun:

[font2.cora.nwra.com:13982] [[23220,1],10] selected pml cm, but peer
[[23220,1],0] on font1 selected pml ob1

which crashed MPI_Init.

We worked around this by passing "--mca pml ob1". I notice that now, with
openmpi 2.0.2, without that option I no longer see errors, but the MPI
program hangs shortly after startup. Re-adding the option makes it work,
so I'm assuming the underlying problem is still the same, but openmpi
appears to have stopped alerting me to it.
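
In case it is useful, this is roughly how we invoke it (a sketch; ./my_app
stands in for our real binary, and the verbose run assumes the standard
pml_base_verbose MCA parameter):

  # known-good workaround:
  $ mpirun --mca pml ob1 -np 2 --host font1,font2 ./my_app
  # to see which PML each side actually selects:
  $ mpirun --mca pml_base_verbose 10 -np 2 --host font1,font2 ./my_app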

Thoughts?

-- 
Orion Poplawski
Technical Manager  720-772-5637
NWRA, Boulder/CoRA Office FAX: 303-415-9702
3380 Mitchell Lane   or...@nwra.com
Boulder, CO 80301   http://www.nwra.com


Re: [OMPI users] Issues with different IB adapters and openmpi 2.0.2

2017-02-27 Thread Howard Pritchard
Hi Orion

Does the problem occur if you only use font2 and font3? Do you have MXM
installed on the font1 node?

The 2.x series uses PMIx, and it could be that this is impacting the PML
sanity check.
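
One quick thing you could check on each node is whether any MXM components
were even built into your Open MPI install, e.g. (just a suggestion):

  $ ompi_info | grep -i mxm    # lists mxm-related components, if present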

Howard


Orion Poplawski wrote on Mon, 27 Feb 2017 at 14:50:

> We have a couple nodes with different IB adapters in them:
>
> font1/var/log/lspci:03:00.0 InfiniBand [0c06]: Mellanox Technologies
> MT25204
> [InfiniHost III Lx HCA] [15b3:6274] (rev 20)
> font2/var/log/lspci:03:00.0 InfiniBand [0c06]: QLogic Corp. IBA7220
> InfiniBand
> HCA [1077:7220] (rev 02)
> font3/var/log/lspci:03:00.0 InfiniBand [0c06]: QLogic Corp. IBA7220
> InfiniBand
> HCA [1077:7220] (rev 02)
>
> With 1.10.3 we saw the following errors with mpirun:
>
> [font2.cora.nwra.com:13982] [[23220,1],10] selected pml cm, but peer
> [[23220,1],0] on font1 selected pml ob1
>
> which crashed MPI_Init.
>
> We worked around this by passing "--mca pml ob1". I notice that now, with
> openmpi 2.0.2, without that option I no longer see errors, but the MPI
> program hangs shortly after startup. Re-adding the option makes it work,
> so I'm assuming the underlying problem is still the same, but openmpi
> appears to have stopped alerting me to it.
>
> Thoughts?
>
> --
> Orion Poplawski
> Technical Manager  720-772-5637
> NWRA, Boulder/CoRA Office FAX: 303-415-9702
> 3380 Mitchell Lane   or...@nwra.com
> Boulder, CO 80301   http://www.nwra.com
>