Hello all
I have a simple test setup consisting of two Dell workstation nodes with
similar hardware profiles.
Both nodes have identical
1. QLogic 4x DDR InfiniBand
2. Chelsio C310 iWARP Ethernet.
Both of these cards are connected back to back, without a switch.
With this setup, I can run
That’s an SGE error message - looks like your tmp file system on one of the
remote nodes is full. We don’t control where SGE puts its files, but it might
be that your backend nodes are having issues with us doing a tree-based launch
(i.e., where each backend daemon launches more daemons along
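A quick way to confirm whether /tmp really is full on the failing node (a sketch; run it on whichever host appears in the error, and the commented hostnames are placeholders):

```shell
# Check free space on the filesystem where SGE's qrsh writes its error file.
df -h /tmp

# To sweep all backend nodes in one go (hostnames are placeholders):
#   for host in csclprd3-0-11 csclprd3-6-10; do ssh "$host" df -h /tmp; done
```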
I'm getting an error message early on:
[csclprd3-0-11:17355] [[36373,0],17] plm:rsh: using "/opt/sge/bin/lx-amd64/qrsh -inherit -nostdin -V -verbose" for launching
unable to write to file /tmp/285019.1.verylong.q/qrsh_error: No space left on device
[csclprd3-6-10:18352] [[36373,0],21] plm:rsh:
I didn't go into the code to see what is actually emitting this error
message, but I suspect this may be a generic "out of memory" kind of error
and not specific to the queue pair. To confirm, please add -mca
pml_base_verbose 100 and -mca mtl_base_verbose 100 to see what is being
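For example (a sketch; the rank count and binary name are placeholders):

```shell
# Re-run the failing job with PML/MTL selection logged at high verbosity:
mpirun -np 144 -mca pml_base_verbose 100 -mca mtl_base_verbose 100 ./a.out
```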
On Wed, Mar 16, 2016 at 3:37 PM, Cabral, Matias A
wrote:
Hi Michael,
I may be missing some context: if you are using the QLogic cards you will
always want to use the PSM MTL (-mca pml cm -mca mtl psm) and not the openib
BTL. As Tom suggests, confirm the limits are set up on every node: could it
be that the alltoall is reaching a node that the "others" are not?
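A minimal sketch of forcing the PSM path (the hostfile and binary names are placeholders):

```shell
# Use the CM PML with the PSM MTL for the QLogic HCAs instead of openib:
mpirun -np 144 -hostfile hosts -mca pml cm -mca mtl psm ./alltoall_test
```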
On Wed, Mar 16, 2016 at 12:12 PM, Elken, Tom wrote:
Yes. I double
Hi Mike,
In this file,
$ cat /etc/security/limits.conf
...
< do you see at the end ... >
* hard memlock unlimited
* soft memlock unlimited
# -- All InfiniBand Settings End here --
?
-Tom
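Note that limits.conf is applied at login, so the check below must be done in a fresh (non-inherited) shell on each node; a sketch:

```shell
# Verify that the memlock settings from limits.conf took effect:
ulimit -l                            # "unlimited" on a correctly set up node
ulimit -a | grep -i 'locked memory'  # show the full limit line
```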
On Thu, Mar 10, 2016 at 11:54 AM, Michael Di Domenico
wrote:
> When I try to run an Open MPI job with >128 ranks (16 ranks per node)
> using alltoall or alltoallv, I'm getting an error that the process was
> unable to get a queue pair.
>
> I've checked the max locked memory
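It is worth confirming the limit on every node the job actually lands on, not just the head node; a sketch (the hostfile path is a placeholder, and passwordless ssh is assumed):

```shell
# Print the memlock limit as seen by a login shell on each node:
for host in $(awk '{print $1}' hostfile | sort -u); do
    printf '%s: ' "$host"
    ssh "$host" 'ulimit -l'
done
```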
On 3/16/2016 7:06 AM, Éric Chamberland wrote:
On 2016-03-14 15:07, Rob Latham wrote:
On mpich's discussion list the point was made that libraries like HDF5
and (Parallel-)NetCDF provide not only the sort of platform
portability Eric desires, but also provide a self-describing file format.
==rob
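"Self-describing" here means the file carries its own schema, so it can be inspected without the program that wrote it; for instance (filenames are placeholders, assuming the HDF5/NetCDF tools are installed):

```shell
# Dump only the metadata (datasets, types, dimensions), not the data:
h5dump -H results.h5
ncdump -h results.nc
```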
But I do not agree with that.
If
In the case of an MPI application (not Gromacs), how do I relocate the MPI
application from one node to another while it is running?
I'm sorry, but as far as I know the *ompi-restart* command is used to restart
an application, based on a checkpoint file, once the application has already
terminated (no longer
Just checkpoint-restart the app to relocate it. The overhead will be lower
than trying to do it with MPI.
Jeff
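For reference, a sketch of the old (v1.6-era, BLCR-based) Open MPI checkpoint/restart cycle; the rank count, binary, PID, and snapshot name are illustrative:

```shell
# Start the job with checkpoint/restart support enabled:
mpirun -np 16 -am ft-enable-cr ./my_app &

# Take a global snapshot and terminate the job (use mpirun's PID):
ompi-checkpoint --term <pid_of_mpirun>

# Restart from the snapshot, possibly on different nodes:
ompi-restart ompi_global_snapshot_<pid_of_mpirun>.ckpt
```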
On Wednesday, March 16, 2016, Husen R wrote:
Hi Jeff,
Thanks for the reply.
After consulting the Gromacs docs, as you suggested, I found that Gromacs
already supports checkpoint/restart. Thanks for the suggestion.
Previously, I asked about checkpoint/restart in Open MPI because I want to
checkpoint an MPI application and restart/migrate it while it is
Ray,
From the shmem_ptr man page:
RETURN VALUES
shmem_ptr returns a pointer to the data object on the specified
remote PE. If target is not remotely accessible, a NULL pointer is returned.
Since you are running your application on two hosts and one task per
host, the target is not
Dear Gilles
I have attached the source code and the hostfile.
Regards
Ryan
From: Gilles Gouaillardet gilles.gouaillar...@gmail.com
Sent: Tue, 15 Mar 2016 15:44:48
To: Open MPI Users us...@open-mpi.org
Subject: Re: [OMPI users] Open SHMEM Error
Ryan,
can you please post your source code and
Why do you need Open MPI to do this? Molecular dynamics trajectories are
trivial to checkpoint and restart at the application level. I'm sure
Gromacs already supports this. Please consult the Gromacs docs or user
support for details.
Jeff
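At the application level this is just two mdrun flags; a sketch per the Gromacs docs (filenames and the interval are placeholders):

```shell
# Write a checkpoint file every 15 minutes while the run progresses:
mdrun -s topol.tpr -cpt 15 -cpo state.cpt

# Later, continue the run from the last checkpoint (e.g. on another node):
mdrun -s topol.tpr -cpi state.cpt
```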
On Tuesday, March 15, 2016, Husen R
Dear Open MPI Users,
Does the current stable release of Open MPI (the v1.10 series) support fault
tolerance features?
I got the information from the Open MPI FAQ that checkpoint/restart support
was last released as part of the v1.6 series.
I just want to make sure about this.
And by the way, does