[OMPI users] Issue about cm PML

2016-03-16 Thread dpchoudh .
Hello all, I have a simple test setup consisting of two Dell workstation nodes with similar hardware profiles. Both nodes have identical: 1. QLogic 4x DDR InfiniBand 2. Chelsio C310 iWARP Ethernet. Both of these cards are connected back to back, without a switch. With this setup, I can run

Re: [OMPI users] running OpenMPI jobs (either 1.10.1 or 1.8.7) on SoGE more problems

2016-03-16 Thread Ralph Castain
That’s an SGE error message - looks like your tmp file system on one of the remote nodes is full. We don’t control where SGE puts its files, but it might be that your backend nodes are having issues with us doing a tree-based launch (i.e., where each backend daemon launches more daemons along

[OMPI users] running OpenMPI jobs (either 1.10.1 or 1.8.7) on SoGE more problems

2016-03-16 Thread Lane, William
I'm getting an error message early on:
[csclprd3-0-11:17355] [[36373,0],17] plm:rsh: using "/opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose" for launching
unable to write to file /tmp/285019.1.verylong.q/qrsh_error: No space left on device
[csclprd3-6-10:18352] [[36373,0],21] plm:rsh:
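The "No space left on device" failure points at /tmp filling up on a backend node. A hedged sketch of a quick check (standard coreutils only; no SGE-specific assumptions):

```shell
# Report free space on the filesystem backing /tmp; if "Avail" is 0,
# qrsh cannot write its session/error files there, which produces
# exactly this "No space left on device" failure.
AVAIL=$(df -P /tmp | awk 'NR==2 {print $4}')
echo "/tmp has ${AVAIL} 1K-blocks available"
```

To check the remote nodes rather than the submit host, run the same command through the scheduler on each suspect node.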

Re: [OMPI users] locked memory and queue pairs

2016-03-16 Thread Cabral, Matias A
I didn't go into the code to see what is actually emitting this error message, but I suspect this may be a generic "out of memory" kind of error and not specific to the queue pair. To confirm, please add -mca pml_base_verbose 100 and -mca mtl_base_verbose 100 to see what is being
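A sketch of the suggested diagnostic run; the rank count and ./my_app are placeholders, while the two verbosity knobs are the ones named in the reply:

```shell
# pml/mtl verbosity makes Open MPI print which components are
# selected (and why others are rejected) during startup.
CMD="mpirun -np 4 -mca pml_base_verbose 100 -mca mtl_base_verbose 100 ./my_app"
echo "$CMD"
# Only attempt the run where Open MPI is actually installed.
command -v mpirun >/dev/null 2>&1 && $CMD || true
```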

Re: [OMPI users] locked memory and queue pairs

2016-03-16 Thread Michael Di Domenico
On Wed, Mar 16, 2016 at 3:37 PM, Cabral, Matias A wrote: > Hi Michael, > > I may be missing some context; if you are using the QLogic cards you will > always want to use the PSM MTL (-mca pml cm -mca mtl psm) and not the openib BTL. > As Tom suggests, confirm the limits

Re: [OMPI users] locked memory and queue pairs

2016-03-16 Thread Cabral, Matias A
Hi Michael, I may be missing some context; if you are using the QLogic cards you will always want to use the PSM MTL (-mca pml cm -mca mtl psm) and not the openib BTL. As Tom suggests, confirm the limits are set up on every node: could it be the alltoall is reaching a node that "others" are not?
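In concrete terms, the selection Matias recommends looks like this (hostfile name, rank count, and binary are placeholders):

```shell
# Force the cm PML with the PSM MTL for QLogic HCAs instead of the
# openib BTL, per the advice in the thread.
CMD="mpirun -np 32 -hostfile hosts -mca pml cm -mca mtl psm ./a.out"
echo "$CMD"
# Only attempt the run where Open MPI is actually installed.
command -v mpirun >/dev/null 2>&1 && $CMD || true
```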

Re: [OMPI users] locked memory and queue pairs

2016-03-16 Thread Michael Di Domenico
On Wed, Mar 16, 2016 at 12:12 PM, Elken, Tom wrote:
> Hi Mike,
> In this file,
> $ cat /etc/security/limits.conf
> ...
> do you see at the end:
> * hard memlock unlimited
> * soft memlock unlimited
> # -- All InfiniBand Settings End here --
> ?
Yes. I double

Re: [OMPI users] locked memory and queue pairs

2016-03-16 Thread Elken, Tom
Hi Mike, In this file:
$ cat /etc/security/limits.conf
...
do you see at the end:
* hard memlock unlimited
* soft memlock unlimited
# -- All InfiniBand Settings End here --
?
-Tom
> -----Original Message-----
> From: users [mailto:users-boun...@open-mpi.org] On Behalf Of Michael Di >
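For reference, the memlock stanza Tom is asking about typically sits at the tail of /etc/security/limits.conf and looks like this (a sketch; the surrounding comment text varies by site):

```
* hard memlock unlimited
* soft memlock unlimited
# -- All InfiniBand Settings End here --
```

Note that changes to limits.conf only take effect for new login sessions, and daemon-launched processes (e.g. under sshd or the SGE execd) only pick them up after the daemon itself is restarted.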

Re: [OMPI users] locked memory and queue pairs

2016-03-16 Thread Michael Di Domenico
On Thu, Mar 10, 2016 at 11:54 AM, Michael Di Domenico wrote: > When I try to run an Open MPI job with >128 ranks (16 ranks per node) > using alltoall or alltoallv, I'm getting an error that the process was > unable to get a queue pair. > > I've checked the max locked memory
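The limit that matters is the one the launched processes actually inherit, which can differ from what limits.conf says. A minimal sketch of the check:

```shell
# Print the max locked memory (in KiB, or "unlimited") in force
# for the current shell.
LIMIT=$(ulimit -l)
echo "max locked memory: $LIMIT"
```

To see what the MPI ranks themselves inherit, run it through the launcher on every node, e.g. `mpirun --hostfile hosts sh -c 'hostname; ulimit -l'` (hostfile name is a placeholder).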

Re: [OMPI users] Error with MPI_Register_datarep

2016-03-16 Thread Edgar Gabriel
On 3/16/2016 7:06 AM, Éric Chamberland wrote: On 2016-03-14 15:07, Rob Latham wrote: On mpich's discussion list the point was made that libraries like HDF5 and (Parallel-)NetCDF provide not only the sort of platform portability Eric desires, but also provide a self-describing file format.

Re: [OMPI users] Error with MPI_Register_datarep

2016-03-16 Thread Éric Chamberland
On 2016-03-14 15:07, Rob Latham wrote: On mpich's discussion list the point was made that libraries like HDF5 and (Parallel-)NetCDF provide not only the sort of platform portability Eric desires, but also provide a self-describing file format. ==rob But I do not agree with that. If

Re: [OMPI users] Fault tolerant feature in Open MPI

2016-03-16 Thread Husen R
In the case of an MPI application (not Gromacs), how do I relocate the MPI application from one node to another node while it is running? I'm sorry, but as far as I know the *ompi-restart* command is used to restart an application, based on a checkpoint file, once the application has already terminated (no longer
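For the record, the v1.6-era checkpoint/restart workflow being described looked roughly like this (a sketch; it requires an Open MPI build with BLCR support, and the PID and snapshot name are illustrative):

```shell
# Launch with C/R enabled, snapshot the job by the PID of mpirun,
# then restart from the snapshot (possibly on different nodes).
RUN="mpirun -np 4 -am ft-enable-cr ./my_app"
CKPT="ompi-checkpoint 12345"                        # 12345 = PID of mpirun
RESTART="ompi-restart ompi_global_snapshot_12345.ckpt"
echo "$RUN"; echo "$CKPT"; echo "$RESTART"
```

This matches Husen's point: the restart happens from a saved snapshot, not on a still-running job, so it relocates rather than live-migrates the application.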

Re: [OMPI users] Fault tolerant feature in Open MPI

2016-03-16 Thread Jeff Hammond
Just checkpoint-restart the app to relocate it. The overhead will be lower than trying to do it with MPI. Jeff On Wednesday, March 16, 2016, Husen R wrote: > Hi Jeff, > > Thanks for the reply. > > After consulting the Gromacs docs, as you suggested, Gromacs already > supports

Re: [OMPI users] Fault tolerant feature in Open MPI

2016-03-16 Thread Husen R
Hi Jeff, Thanks for the reply. After consulting the Gromacs docs, as you suggested, I found that Gromacs already supports checkpoint/restart. Thanks for the suggestion. Previously, I asked about checkpoint/restart in Open MPI because I want to checkpoint an MPI application and restart/migrate it while it is

Re: [OMPI users] Open SHMEM Error

2016-03-16 Thread Gilles Gouaillardet
Ray, from the shmem_ptr man page:

RETURN VALUES
shmem_ptr returns a pointer to the data object on the specified remote PE. If target is not remotely accessible, a NULL pointer is returned.

Since you are running your application on two hosts with one task per host, the target is not
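A hedged illustration of Gilles' point at the launcher level (hostnames and binary are placeholders): with one PE per host the target is not remotely accessible and shmem_ptr returns NULL, whereas packing both PEs onto one host makes the target's memory locally reachable.

```shell
# Two PEs on two hosts: shmem_ptr(target, other_pe) returns NULL.
CROSS_HOST="oshrun -np 2 -H nodeA,nodeB ./shmem_app"
# Two PEs on the same host: the other PE's symmetric heap can be
# mapped locally, so shmem_ptr can return a usable pointer.
SAME_HOST="oshrun -np 2 -H nodeA,nodeA ./shmem_app"
echo "$CROSS_HOST"; echo "$SAME_HOST"
```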

Re: [OMPI users] Open SHMEM Error

2016-03-16 Thread RYAN RAY
Dear Gilles, I have attached the source code and the hostfile. Regards, Ryan From: Gilles Gouaillardet gilles.gouaillar...@gmail.com Sent: Tue, 15 Mar 2016 15:44:48 To: Open MPI Users us...@open-mpi.org Subject: Re: [OMPI users] Open SHMEM Error Ryan, can you please post your source code and

Re: [OMPI users] Fault tolerant feature in Open MPI

2016-03-16 Thread Jeff Hammond
Why do you need OpenMPI to do this? Molecular dynamics trajectories are trivial to checkpoint and restart at the application level. I'm sure Gromacs already supports this. Please consult the Gromacs docs or user support for details. Jeff On Tuesday, March 15, 2016, Husen R
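Jeff's suggestion in concrete terms; Gromacs' mdrun writes checkpoints natively (a sketch with flag names per the Gromacs docs; the interval, input file, and checkpoint name are illustrative):

```shell
# Write a checkpoint every 15 minutes (-cpt); on restart, -cpi resumes
# from state.cpt, which is also how the job moves to different nodes.
RUN="gmx mdrun -s topol.tpr -cpt 15"
RESUME="gmx mdrun -s topol.tpr -cpi state.cpt"
echo "$RUN"; echo "$RESUME"
```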

[OMPI users] Fault tolerant feature in Open MPI

2016-03-16 Thread Husen R
Dear Open MPI users, Does the current stable release of Open MPI (the v1.10 series) support fault tolerance features? I got the information from the Open MPI FAQ that the checkpoint/restart support was last released as part of the v1.6 series. I just want to make sure about this. And by the way, does