Re: [OMPI users] UC EXTERNAL: Re: How to set up state-less node /tmp for OpenMPI usage

2011-11-04 Thread David Turner

Indeed, my terminology is inexact.  I believe you are correct; our
diskless nodes use tmpfs, not ramdisk.  Thanks for the clarification!

On 11/4/11 11:00 AM, Rushton Martin wrote:

There appears to be some confusion about ramdisks and tmpfs.  A ramdisk
sets aside a fixed amount of memory for its exclusive use, so that a
file being written to ramdisk goes first to the cache, then to ramdisk,
and may exist in both for some time.  tmpfs however opens up the cache
to programs so that a file being written goes to cache and stays there.
The "size" of a tmpfs pseudo-disk is the maximum it can grow to (which
according to the mount man page defaults to 50% of memory).  Hence only
enough memory to hold the data is actually used which ties up with David
Turner's figures.
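
The point above can be checked on a running node. The following is a minimal Python sketch, not anything taken from the thread: the helper names are hypothetical, and the 50% figure is the default quoted above from the mount man page. It computes the default tmpfs cap from /proc/meminfo-style text and reports how much of a filesystem is actually in use.

```python
import os


def default_tmpfs_cap_kib(meminfo_text):
    """Return half of MemTotal in KiB (the default tmpfs cap quoted above)."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # Line format: "MemTotal:  <number> kB"
            return int(line.split()[1]) // 2
    raise ValueError("MemTotal not found in meminfo text")


def fs_usage(path):
    """Return the size limit and the space in use, in bytes, for the
    filesystem holding `path`. For tmpfs, "total" is only the cap;
    memory is consumed only for the "used" part."""
    st = os.statvfs(path)
    total = st.f_frsize * st.f_blocks
    used = total - st.f_frsize * st.f_bfree
    return {"total": total, "used": used}
```

On a Linux node one would pass the contents of /proc/meminfo to default_tmpfs_cap_kib() and call fs_usage("/tmp"); a large "total" next to a small "used" is what a mostly empty tmpfs should show.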

You can easily tell which method is in use from df.  A traditional
ramdisk will appear as /dev/ramN (N = 0, 1 ...) whereas a tmpfs device
will be a simple name, often tmpfs.  I would guess that the single "-"
in David's df command is precisely this.  On our diskless nodes root
shows as device compute_x86_64, whilst /tmp, /dev/shm and /var/tmp show
as "none".
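
As a rough illustration of the df check, here is a short Python sketch (function names are hypothetical) that classifies mounts from /proc/mounts-style text. Because the device name of a tmpfs mount is arbitrary ("none", "tmpfs", or anything else), the sketch relies on the filesystem-type field for tmpfs and on the /dev/ramN device name for ramdisks.

```python
import re

RAMDISK_DEVICE = re.compile(r"^/dev/ram\d+$")


def classify(device, fstype):
    """Classify one mount as 'ramdisk', 'tmpfs', or 'other'."""
    if RAMDISK_DEVICE.match(device):
        return "ramdisk"
    if fstype == "tmpfs":
        return "tmpfs"
    return "other"


def classify_mounts(mounts_text):
    """Map each mount point to its classification.

    Expects /proc/mounts format: device, mount point, fstype, options, ...
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        device, mount_point, fstype = fields[:3]
        result[mount_point] = classify(device, fstype)
    return result
```

The sample device names used to try this out are made up; on a real node the input would be the contents of /proc/mounts.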

HTH,

Martin Rushton
HPC System Manager, Weapons Technologies
-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Blosch, Edwin L
Sent: 04 November 2011 16:19
To: Open MPI Users
Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node
/tmp for OpenMPI usage

OK, I wouldn't have guessed that the space for /tmp isn't actually in
RAM until it's needed.  That's the key piece of knowledge I was missing;
I really appreciate it.  So you can allow /tmp to be reasonably sized,
but if you aren't actually using it, then it doesn't take up 11 GB of
RAM.  And you prevent users from crashing the node by setting mem limit
to 4 GB less than the available memory. Got it.

I agree with your earlier comment:  these are fairly common systems now.
We have program- and owner-specific disks where I work, and after the
program ends, the disks are archived or destroyed.  Before the stateless
configuration option, the entire computer, nodes and switches as well as
disks, were archived or destroyed after each program.  Not too
cost-effective.

Is this a reasonable final summary?  OpenMPI uses temporary files in
such a way that it is performance-critical for these so-called session
files, used for shared-memory communications, to be "local".  For
state-less clusters, this means the node image must include a /tmp or
/wrk partition, intelligently sized so as not to enable an application
to exhaust the physical memory of the node, and care must be taken not
to mask this in-memory /tmp with an NFS mounted filesystem.  It is not
uncommon for cluster enablers to exclude /tmp from a typical base Linux
filesystem image or mount it over NFS, as a means of providing users
with a larger-sized /tmp that is not limited to a fraction of the node's
physical memory, or to avoid garbage accumulation in /tmp taking up the
physical RAM.  But not having /tmp or mounting it over NFS is not a
viable stateless-node configuration option if you intend to run OpenMPI.
Instead you could have a /bigtmp which is NFS-mounted and a /tmp which
is local, for example. Starting in OpenMPI 1.7.x, shared-memory
communication will no longer go through memory-mapped files, and
vendors/users will no longer need to be vigilant concerning this OpenMPI
performance requirement on stateless node configuration.
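
One way to guard against the NFS masking problem described above is to check the filesystem type behind the directory before launching a job. This is only a sketch under stated assumptions: the function names, the list of network filesystem types, and the example paths are illustrative and are not part of Open MPI.

```python
NETWORK_FSTYPES = {"nfs", "nfs4", "cifs", "lustre"}  # illustrative, not exhaustive


def fstype_of(path, mounts_text):
    """Return the fstype of the mount that holds `path`.

    `mounts_text` is /proc/mounts-style text. The longest matching mount
    point wins; on a tie, the later entry (mounted on top) wins.
    """
    best_len, best_type = -1, None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue
        mount_point, fstype = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or path.startswith(prefix):
            if len(mount_point) >= best_len:
                best_len, best_type = len(mount_point), fstype
    return best_type


def is_node_local(path, mounts_text):
    """True if `path` is on a mounted filesystem that is not a network one."""
    fstype = fstype_of(path, mounts_text)
    return fstype is not None and fstype not in NETWORK_FSTYPES
```

With the /bigtmp and /tmp layout suggested above, is_node_local() would be expected to return True for a directory under /tmp and False for one under /bigtmp.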


Is that a reasonable summary?

If so, would it be helpful to include this as an FAQ entry under General
category?  Or the "shared memory" category?  Or the "troubleshooting"
category?


Thanks



-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of David Turner
Sent: Friday, November 04, 2011 1:38 AM
To: Open MPI Users
Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node
/tmp for OpenMPI usage

% df /tmp
Filesystem   1K-blocks    Used Available Use% Mounted on
-             12330084  822848  11507236   7% /
% df /
Filesystem   1K-blocks    Used Available Use% Mounted on
-             12330084  822848  11507236   7% /

That works out to 11GB.  But...

The compute nodes have 24GB.  Freshly booted, about 3.2GB is consumed by
the kernel, various services, and the root file system.
At this time, usage of /tmp is essentially nil.

We set user memory limits to 20GB.
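
The figures above can be checked with a little arithmetic. The Python sketch below uses hypothetical helper names, treats the thread's GB figures as nominal (1 GB = 1000 MB, an assumption made only to keep the sums exact), and assumes df rounds Use% up.

```python
import math

KIB_PER_GIB = 1024 ** 2


def kib_to_gib(kib):
    """Convert a df 1K-blocks figure to GiB."""
    return kib / KIB_PER_GIB


def df_use_percent(used_kib, avail_kib):
    """Use% as used / (used + available), rounded up (assumed df behaviour)."""
    return math.ceil(100 * used_kib / (used_kib + avail_kib))


def headroom_mb(total_mb, boot_mb, user_limit_mb):
    """Memory left for /tmp and everything else once the boot-time
    footprint and the user memory limit are both subtracted."""
    return total_mb - boot_mb - user_limit_mb


# Figures quoted in this thread.
assert 12330084 - 822848 == 11507236          # Used + Available = 1K-blocks
available_gib = kib_to_gib(11507236)          # roughly 11
use_percent = df_use_percent(822848, 11507236)
spare_mb = headroom_mb(24000, 3200, 20000)    # 24GB node, 3.2GB at boot, 20GB limit
```

Under these assumptions the 20GB limit leaves well under 1GB beyond the boot-time footprint, which is workable only because the session directories stay in the 10s of MBs.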

I would imagine that the size of the session directories depends on a
number of factors; perhaps the developers can comment on that.  I have
only seen total sizes in the 10s of MBs on our 8-node, 24GB nodes.

As long as they're removed after each job, they don't really compete
with the application for available memory.


On 11/3/11 8:40 PM, Ed Blosch wrote:
> Thanks very much, exactly what I wanted to hear. How big is /tmp?
>
> -----Original Message-----
> From: users-boun...@open-mpi.org