Can you also check that there is no CPU binding issue (several MPI tasks and/or OpenMP threads, if any, bound to the same core and time sharing it)? A simple way to check is to log into a compute node, run top, and then press 1, f, j. If some cores show much higher usage than others, you are likely time sharing. Another option is to disable CPU binding (both Open MPI's and OpenMP's, if any) and see whether things improve. (This is suboptimal, but still better than time sharing.)
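For example (a sketch assuming Open MPI 1.8's mpirun; ./my_app is a placeholder for your real executable):

    # Show where each rank is bound; overlapping masks mean time sharing:
    mpirun -n 8 --report-bindings ./my_app

    # Disable Open MPI's binding entirely (suboptimal, but avoids overlap):
    mpirun -n 8 --bind-to none ./my_app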
"Jeff Squyres (jsquyres)" <jsquy...@cisco.com> wrote: >- Is /tmp on that machine on NFS or local? > >- Have you looked at the text of the help message that came out before the "9 >more processes have sent help message help-opal-shmem-mmap.txt / mmap on nfs" >message? It should contain details about what the problematic NFS directory >is. > >- Do you know that it's MPI that is causing this low CPU utilization? > >- You mentioned other MPI implementations; have you tested with them to see if >they get better CPU utilization? > >- What happens if you run this application on a single machine, with no >network messaging? > >- Do you know what specifically in your application is slow? I.e., have you >done any instrumentation to see what steps / API calls are running slowly, and >then tried to figure out why? > >- Do you have blocking message patterns that might operate well in shared >memory, but expose the inefficiencies of its algorithms/design when it moves >to higher-latency transports? > >- How long does your application run for? > >I ask these questions because MPI applications tend to be quite complicated. >Sometimes it's the application itself that is the cause of slowdown / >inefficiencies. > > > >On Oct 23, 2014, at 9:29 PM, Vinson Leung <lwhvinson1...@gmail.com> wrote: > >> Later I change another machine and set the TMPDIR to default /tmp, but the >> problem (low CPU utilization under 20%) still occur :< >> >> Vincent >> >> On Thu, Oct 23, 2014 at 10:38 PM, Jeff Squyres (jsquyres) >> <jsquy...@cisco.com> wrote: >> If normal users can't write to /tmp (or if /tmp is an NFS-mounted >> filesystem), that's the underlying problem. >> >> @Vinson -- you should probably try to get that fixed. >> >> >> >> On Oct 23, 2014, at 10:35 AM, Joshua Ladd <jladd.m...@gmail.com> wrote: >> >> > It's not coming from OSHMEM but from the OPAL "shmem" framework. You are >> > going to get terrible performance - possibly slowing to a crawl having all >> > processes open their backing files for mmap on NSF. I think that's the >> > error that he's getting. >> > >> > >> > Josh >> > >> > On Thu, Oct 23, 2014 at 6:06 AM, Vinson Leung <lwhvinson1...@gmail.com> >> > wrote: >> > HI, Thanks for your reply:) >> > I really run an MPI program (compile with OpenMPI and run with "mpirun -n >> > 8 ......"). My OpenMPI version is 1.8.3 and my program is Gromacs. BTW, >> > what is OSHMEM ? >> > >> > Best >> > Vincent >> > >> > On Thu, Oct 23, 2014 at 12:21 PM, Ralph Castain <r...@open-mpi.org> wrote: >> > From your error message, I gather you are not running an MPI program, but >> > rather an OSHMEM one? Otherwise, I find the message strange as it only >> > would be emitted from an OSHMEM program. >> > >> > What version of OMPI are you trying to use? >> > >> >> On Oct 22, 2014, at 7:12 PM, Vinson Leung <lwhvinson1...@gmail.com> wrote: >> >> >> >> Thanks for your reply:) >> >> Follow your advice I tried to set the TMPDIR to /var/tmp and /dev/shm and >> >> even reset to /tmp (I get the system permission), the problem still occur >> >> (CPU utilization still lower than 20%). I have no idea why and ready to >> >> give up OpenMPI instead of using other MPI library. 
>>>>
>>>> --------Old Message-------------
>>>>
>>>> Date: Tue, 21 Oct 2014 22:21:31 -0400
>>>> From: Brock Palen <bro...@umich.edu>
>>>> To: Open MPI Users <us...@open-mpi.org>
>>>> Subject: Re: [OMPI users] low CPU utilization with OpenMPI
>>>> Message-ID: <cc54135d-0cfe-440a-8df2-06b587e17...@umich.edu>
>>>> Content-Type: text/plain; charset=us-ascii
>>>>
>>>> Doing special files on NFS can be weird; try the other /tmp/ locations:
>>>>
>>>> /var/tmp/
>>>> /dev/shm (ram disk -- careful!)
>>>>
>>>> Brock Palen
>>>> www.umich.edu/~brockp
>>>> CAEN Advanced Computing
>>>> XSEDE Campus Champion
>>>> bro...@umich.edu
>>>> (734) 936-1985
>>>>
>>>>> On Oct 21, 2014, at 10:18 PM, Vinson Leung <lwhvinson1...@gmail.com> wrote:
>>>>>
>>>>> For permission reasons (OpenMPI could not write its temporary files to
>>>>> the default /tmp directory), I changed TMPDIR to my local directory
>>>>> (export TMPDIR=/home/user/tmp), and then the MPI program could run. But
>>>>> the CPU utilization is very low, under 20% (8 MPI ranks running on an
>>>>> 8-core Intel Xeon CPU).
>>>>>
>>>>> I also got this message when running with OpenMPI:
>>>>>
>>>>> [cn3:28072] 9 more processes have sent help message
>>>>> help-opal-shmem-mmap.txt / mmap on nfs
>>>>> [cn3:28072] Set MCA parameter "orte_base_help_aggregate" to 0 to see
>>>>> all help / error messages
>>>>>
>>>>> Any idea?
>>>>> Thanks
>>>>>
>>>>> Vincent
>
> --
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
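Also worth trying before giving up: leave TMPDIR alone and point just Open MPI's session directory (where the shmem backing files are created) at a node-local filesystem. A sketch, assuming Open MPI 1.8.x, where this location is controlled by the orte_tmpdir_base MCA parameter, with ./my_app again a placeholder:

    # Keep the shmem backing files off NFS by putting the session
    # directory on a local ram disk:
    mpirun -n 8 --mca orte_tmpdir_base /dev/shm ./my_app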