There appears to be some confusion about ramdisks and tmpfs. A ramdisk sets aside a fixed amount of memory for its exclusive use, so that a file being written to ramdisk goes first to the cache, then to ramdisk, and may exist in both for some time. tmpfs, however, opens up the cache to programs, so that a file being written goes to cache and stays there. The "size" of a tmpfs pseudo-disk is the maximum it can grow to (which, according to the mount man page, defaults to 50% of memory). Hence only enough memory to hold the data is actually used, which ties up with David Turner's figures.
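To see the difference between the ceiling and the memory actually used on a running node, here is a minimal Python sketch. It is not from the thread: the function names are hypothetical, and it assumes a Linux node that exposes /proc/mounts. For a tmpfs mount, the total reported by statvfs is the "size" ceiling, while the used figure is what currently occupies memory.

```python
import os


def tmpfs_mounts(mounts_text):
    """Return the mount points whose filesystem type is tmpfs.

    mounts_text is the content of /proc/mounts: one mount per line,
    with fields device, mount point, fstype, options, dump, pass.
    """
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] == "tmpfs":
            points.append(fields[1])
    return points


def usage_kib(path):
    """Return (ceiling_kib, used_kib) for the filesystem holding path."""
    st = os.statvfs(path)
    total = st.f_blocks * st.f_frsize // 1024
    used = (st.f_blocks - st.f_bfree) * st.f_frsize // 1024
    return total, used


def report(mounts_path="/proc/mounts"):
    """Print ceiling and current usage for every tmpfs mount (Linux only)."""
    with open(mounts_path) as f:
        mounts = f.read()
    for mount_point in tmpfs_mounts(mounts):
        try:
            total, used = usage_kib(mount_point)
        except OSError:
            continue  # mount point not accessible to this user
        print(f"{mount_point}: ceiling {total} KiB, in use {used} KiB")
```

Calling report() on a compute node should show a large ceiling next to a small "in use" figure for an idle /tmp, if /tmp is indeed tmpfs.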
You can easily tell which method is in use from df. A traditional ramdisk will appear as /dev/ramN (N = 0, 1, ...), whereas a tmpfs device will be a simple name, often tmpfs. I would guess that the single "-" in David's df command is precisely this. On our diskless nodes root shows as device compute_x86_64, whilst /tmp, /dev/shm and /var/tmp show as "none".

HTH,

Martin Rushton
HPC System Manager, Weapons Technologies
Tel: 01959 514777, Mobile: 07939 219057
email: jmrush...@qinetiq.com
www.QinetiQ.com

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Blosch, Edwin L
Sent: 04 November 2011 16:19
To: Open MPI Users
Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for OpenMPI usage

OK, I wouldn't have guessed that the space for /tmp isn't actually in RAM until it's needed. That's the key piece of knowledge I was missing; I really appreciate it. So you can allow /tmp to be reasonably sized, but if you aren't actually using it, then it doesn't take up 11 GB of RAM. And you prevent users from crashing the node by setting mem limit to 4 GB less than the available memory. Got it.

I agree with your earlier comment: these are fairly common systems now. We have program- and owner-specific disks where I work, and after the program ends, the disks are archived or destroyed. Before the stateless configuration option, the entire computer, nodes and switches as well as disks, were archived or destroyed after each program. Not too cost-effective.

Is this a reasonable final summary?

OpenMPI uses temporary files in such a way that it is performance-critical that these so-called session files, used for shared-memory communications, be "local".
For state-less clusters, this means the node image must include a /tmp or /wrk partition, intelligently sized so as not to enable an application to exhaust the physical memory of the node, and care must be taken not to mask this in-memory /tmp with an NFS mounted filesystem. It is not uncommon for cluster enablers to exclude /tmp from a typical base Linux filesystem image or mount it over NFS, as a means of providing users with a larger-sized /tmp that is not limited to a fraction of the node's physical memory, or to avoid garbage accumulation in /tmp taking up the physical RAM. But not having /tmp or mounting it over NFS is not a viable stateless-node configuration option if you intend to run OpenMPI. Instead you could have a /bigtmp which is NFS-mounted and a /tmp which is local, for example.

Starting in OpenMPI 1.7.x, shared-memory communication will no longer go through memory-mapped files, and vendors/users will no longer need to be vigilant concerning this OpenMPI performance requirement on stateless node configuration.

Is that a reasonable summary? If so, would it be helpful to include this as an FAQ entry under General category? Or the "shared memory" category? Or the "troubleshooting" category?

Thanks

-----Original Message-----
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of David Turner
Sent: Friday, November 04, 2011 1:38 AM
To: Open MPI Users
Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for OpenMPI usage

% df /tmp
Filesystem     1K-blocks    Used  Available  Use%  Mounted on
-               12330084  822848   11507236    7%  /
% df /
Filesystem     1K-blocks    Used  Available  Use%  Mounted on
-               12330084  822848   11507236    7%  /

That works out to 11GB. But...

The compute nodes have 24GB. Freshly booted, about 3.2GB is consumed by the kernel, various services, and the root file system. At this time, usage of /tmp is essentially nil.

We set user memory limits to 20GB.
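As a rough check of the figures above, here is a small Python sketch. It is an illustration only, not something posted in the thread: the variable and function names are hypothetical, and it assumes that df's 1K-blocks are KiB and that the quoted "GB" values are loose round numbers.

```python
KIB_PER_GIB = 1024 ** 2


def kib_to_gib(kib):
    """Convert a df 1K-block count to GiB."""
    return kib / KIB_PER_GIB


# Values copied from the df output for the root file system.
root_total_kib = 12330084
root_used_kib = 822848
root_avail_kib = 11507236

# Values quoted in the message, treated here as plain GB figures.
node_ram_gb = 24.0
boot_footprint_gb = 3.2
user_limit_gb = 20.0

# Headroom left between the user memory limit and physical memory.
headroom_gb = node_ram_gb - user_limit_gb

# Memory free on a freshly booted node, before any job runs.
free_after_boot_gb = node_ram_gb - boot_footprint_gb

print(f"root fs ceiling:   {kib_to_gib(root_total_kib):.2f} GiB")
print(f"root fs in use:    {kib_to_gib(root_used_kib):.2f} GiB")
print(f"root fs available: {kib_to_gib(root_avail_kib):.2f} GiB")
print(f"headroom above user limit: {headroom_gb:.1f} GB")
print(f"free after boot:           {free_after_boot_gb:.1f} GB")
```

The point of the comparison is that the file system ceiling is only a limit: the memory it actually occupies is the "in use" figure, which is why a 20GB user limit can coexist with an 11GB ceiling on a 24GB node.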
I would imagine that the size of the session directories depends on a number of factors; perhaps the developers can comment on that. I have only seen total sizes in the 10s of MBs on our 8-node, 24GB nodes.

As long as they're removed after each job, they don't really compete with the application for available memory.

On 11/3/11 8:40 PM, Ed Blosch wrote:
> Thanks very much, exactly what I wanted to hear. How big is /tmp?
>
> -----Original Message-----
> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of David Turner
> Sent: Thursday, November 03, 2011 6:36 PM
> To: us...@open-mpi.org
> Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for OpenMPI usage
>
> I'm not a systems guy, but I'll pitch in anyway. On our cluster, all the compute nodes are completely diskless. The root file system, including /tmp, resides in memory (ramdisk). OpenMPI puts these session directories therein. All our jobs run through a batch system (torque). At the conclusion of each batch job, an epilogue process runs that removes all files belonging to the owner of the current batch job from /tmp (and also looks for and kills orphan processes belonging to the user). This epilogue had to be written by our systems staff.
>
> I believe this is a fairly common configuration for diskless clusters.
>
> On 11/3/11 4:09 PM, Blosch, Edwin L wrote:
>> Thanks for the help. A couple follow-up-questions, maybe this starts to go outside OpenMPI:
>>
>> What's wrong with using /dev/shm? I think you said earlier in this thread that this was not a safe place.
>>
>> If the NFS-mount point is moved from /tmp to /work, would a /tmp magically appear in the filesystem for a stateless node? How big would it be, given that there is no local disk, right? That may be something I have to ask the vendor, which I've tried, but they don't quite seem to get the question.
>>
>> Thanks
>>
>> -----Original Message-----
>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Ralph Castain
>> Sent: Thursday, November 03, 2011 5:22 PM
>> To: Open MPI Users
>> Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for OpenMPI usage
>>
>> On Nov 3, 2011, at 2:55 PM, Blosch, Edwin L wrote:
>>
>>> I might be missing something here. Is there a side-effect or performance loss if you don't use the sm btl? Why would it exist if there is a wholly equivalent alternative? What happens to traffic that is intended for another process on the same node?
>>
>> There is a definite performance impact, and we wouldn't recommend doing what Eugene suggested if you care about performance.
>>
>> The correct solution here is to get your sys admin to make /tmp local. Making /tmp NFS mounted across multiple nodes is a major "faux pas" in the Linux world - it should never be done, for the reasons stated by Jeff.
>>
>>> Thanks
>>>
>>> -----Original Message-----
>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Eugene Loh
>>> Sent: Thursday, November 03, 2011 1:23 PM
>>> To: us...@open-mpi.org
>>> Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for OpenMPI usage
>>>
>>> Right. Actually "--mca btl ^sm". (Was missing "btl".)
>>>
>>> On 11/3/2011 11:19 AM, Blosch, Edwin L wrote:
>>>> I don't tell OpenMPI what BTLs to use. The default uses sm and puts a session file on /tmp, which is NFS-mounted and thus not a good choice.
>>>>
>>>> Are you suggesting something like --mca ^sm?
>>>>
>>>> -----Original Message-----
>>>> From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On Behalf Of Eugene Loh
>>>> Sent: Thursday, November 03, 2011 12:54 PM
>>>> To: us...@open-mpi.org
>>>> Subject: Re: [OMPI users] EXTERNAL: Re: How to set up state-less node /tmp for OpenMPI usage
>>>>
>>>> I've not been following closely. Why must one use shared-memory communications? How about using other BTLs in a "loopback" fashion?
>>>> _______________________________________________
>>>> users mailing list
>>>> us...@open-mpi.org
>>>> http://www.open-mpi.org/mailman/listinfo.cgi/users

--
Best regards,
David Turner
User Services Group        email: dptur...@lbl.gov
NERSC Division             phone: (510) 486-4027
Lawrence Berkeley Lab      fax: (510) 486-4316