On Nov 7, 2011, at 12:12 PM, Blosch, Edwin L wrote:

> Thanks for the valuable input. I'll change to a wait-and-watch approach.
> 
> The FAQ on tuning sm says "If the session directory is located on a network 
> filesystem, the shared memory BTL latency will be extremely high."  And the 
> title is 'Why am I seeing incredibly poor performance...'.  So I made the 
> leap that this configuration must be avoided at all costs...

(sorry for jumping in late; it's the week before SC, and lots of deadlines are 
approaching!)

This is definitely true: if OMPI's mmap files are located on a network 
filesystem (such as if /tmp is NFS-mounted), your latencies will be higher.  I 
don't claim to know all the exact reasons why, but I have personally seen 
enough empirical evidence to believe it.  Perhaps newer versions of 
Linux/NFS/whatever have made the issue better.  But I'm quite sure that it was 
happening; that's why we put in that warning.
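
As a rough way to check whether a given directory sits on a network
filesystem, here is a minimal sketch. It is not anything OMPI itself does. It
assumes Linux-style mount table text (the format of /proc/mounts), and the
function names and the set of "network" filesystem types are illustrative
assumptions, not a complete list.

```python
import os

# Assumed, incomplete set of filesystem types treated as "networked" here.
NETWORK_FS_TYPES = {"nfs", "nfs4", "cifs", "smb3", "lustre", "gpfs", "panfs", "afs", "ceph"}


def parse_mounts(mounts_text):
    """Return a list of (mount_point, fs_type) from /proc/mounts-style text."""
    entries = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Expected fields: device, mount point, fs type, options, dump, pass.
        if len(fields) >= 3:
            entries.append((fields[1], fields[2]))
    return entries


def fs_type_for(path, mounts_text):
    """Return the fs type of the longest mount point containing path, or None."""
    path = os.path.abspath(path)
    best = None
    for mount_point, fs_type in parse_mounts(mounts_text):
        prefix = mount_point.rstrip("/") + "/"
        if path == mount_point or (path + "/").startswith(prefix):
            # ">=" lets a later mount on the same mount point win.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type)
    return best[1] if best else None


def is_network_fs(path, mounts_text):
    """True if path appears to live on one of the assumed network fs types."""
    return fs_type_for(path, mounts_text) in NETWORK_FS_TYPES
```

On a Linux node, one could pass the contents of /proc/mounts as mounts_text
and call is_network_fs("/tmp", ...). The sketch ignores escaped characters in
mount points (for example, spaces encoded as \040), so treat the answer as a
hint rather than a guarantee.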

Here are a few points to add to this discussion, in no particular order:

1. Keep in mind the difference between the session directory and the shared 
memory backing files: the session directory contains some metadata that OMPI 
processes need.  In general, most of that data is not performance-critical, 
so general MPI performance will not be affected if it's on a networked 
filesystem.  In 1.4.x and 1.5.x, the shared memory mmap files are also located 
in the session directory, and as described above, we have definitely seen a 
negative impact on MPI latency when these files are on a networked 
filesystem.

2. In the upcoming OMPI v1.7, we revamped the shared memory backing system such 
that mmap does not have to be used, and therefore OMPI will not care if /tmp is 
on a networked filesystem.

3. I don't know whether /tmp on a networked filesystem is 100% "proper" or 
not.  I know that some people do it, but there are uniqueness requirements that 
can definitely be violated in various other tools in this case.  OMPI may not 
be the only software package that can run into problems here, even if the 
problems are rare and difficult to track down (e.g., because two processes with 
the same PID on different machines tried to use the same filename in /tmp, or 
attempts to use file locking, etc.).
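
To make the PID example in point 3 concrete, here is a small sketch. It is an
illustration only: the naming schemes, hostnames, and PIDs are hypothetical
and are not how OMPI or any particular tool names its files.

```python
import os
import socket


def pid_only_name(pid):
    """Hypothetical scheme: unique per node, but not across nodes."""
    return "app-%d.lock" % pid


def host_qualified_name(hostname, pid):
    """Hypothetical scheme: adding the hostname separates the nodes."""
    return "app-%s-%d.lock" % (hostname, pid)


def name_for_this_process():
    """Host-qualified name for the calling process."""
    return host_qualified_name(socket.gethostname(), os.getpid())


# Two made-up processes on different nodes that happen to share a PID.
procs = [("node01", 4242), ("node02", 4242)]

# With a shared /tmp, both processes ask for the same PID-only filename.
pid_names = {pid_only_name(pid) for _, pid in procs}

# Host-qualified names stay distinct even though the PIDs match.
host_names = {host_qualified_name(host, pid) for host, pid in procs}
```

Here pid_names collapses to a single filename while host_names keeps two.
When /tmp is local to each node the PID-only scheme is harmless, which is why
such collisions only show up once /tmp is shared.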

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

