Yes, there is a second HPC Sun Grid Engine cluster on which I've run this Open MPI
test code dozens of times on upwards of 400 slots through SGE (using qsub and
qrsh), but that was with a much older version of Open MPI (1.3.3, I believe).
On that particular cluster the open-files hard and soft limits were an issue.
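
As an aside, a quick way to see the limits each rank actually inherits is to print
them from inside the MPI program itself. Below is a minimal, illustrative C sketch
(not part of the original test code) that reports the open-files soft and hard
limits per rank; compile it with mpicc and launch it the same way as the real job:

/* Illustrative sketch: print the open-files (RLIMIT_NOFILE) soft and
 * hard limits seen by each MPI rank. Assumes a POSIX system. */
#include <mpi.h>
#include <stdio.h>
#include <sys/resource.h>

int main(int argc, char **argv)
{
    int rank;
    struct rlimit rl;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (getrlimit(RLIMIT_NOFILE, &rl) == 0) {
        printf("rank %d: open files soft=%llu hard=%llu\n", rank,
               (unsigned long long) rl.rlim_cur,
               (unsigned long long) rl.rlim_max);
    }

    MPI_Finalize();
    return 0;
}

Ranks launched through SGE can end up with different limits than an interactive
shell, so comparing this output across the two clusters would show whether the
limits are a factor here.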

I have noticed that there is a recently reported (as of July 2014) CentOS Open MPI
bug that occurs when CentOS is upgraded from 6.2 to 6.3. I'm not sure whether that
bug applies to this situation, though.

This particular problem occurs whether I submit jobs through SGE (via qrsh or qsub)
or run them outside of SGE, which leads me to believe it is an Open MPI and/or
CentOS issue.

-Bill Lane

________________________________
From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain 
[r...@open-mpi.org]
Sent: Saturday, July 19, 2014 3:21 PM
To: Open MPI Users
Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots

Not for this test case size. You should be just fine with the default values.

If I understand you correctly, you've run this app at scale before on another
cluster without problems?

On Jul 19, 2014, at 1:34 PM, Lane, William <william.l...@cshs.org> wrote:

Ralph,

It's hard to imagine the problem is in the Open MPI test code, because I've tested
this code extensively on another cluster with 400 nodes and never had any problems.
But I'll try the hello_c example in any case. Is it still recommended to raise the
open-files soft and hard limits to 4096, or would even larger values be necessary?

Thank you for your help.

-Bill Lane

________________________________
From: users [users-boun...@open-mpi.org] on behalf of Ralph Castain [r...@open-mpi.org]
Sent: Saturday, July 19, 2014 8:07 AM
To: Open MPI Users
Subject: Re: [OMPI users] Mpirun 1.5.4 problems when request > 28 slots

That's a pretty old OMPI version, and we don't really support it any longer. 
However, I can provide some advice:

* have you tried running the simple "hello_c" example we provide? This would at
least tell you if the problem is in your app, which is what I'd expect given your
description (a minimal sketch along those lines is included after this list)

* try using gdb (or pick your debugger) to look at the corefile and see where 
it is failing
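
For reference, a minimal program along the lines of the hello_c example shipped in
Open MPI's examples/ directory looks roughly like the sketch below (an illustrative
reconstruction, not the exact file); each rank reports its rank, the communicator
size, and the host it runs on:

/* Minimal "hello"-style MPI program, in the spirit of Open MPI's
 * hello_c example (illustrative sketch, not the exact source). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    printf("Hello from rank %d of %d on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}

If something this small also segfaults past 28 slots, the problem is in the runtime
or system configuration rather than in the application.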

I'd also suggest updating OMPI to the 1.6.5 or 1.8.1 versions, but I doubt 
that's the issue behind this problem.


On Jul 19, 2014, at 1:05 AM, Lane, William <william.l...@cshs.org> wrote:

I'm getting consistent errors of the form:

"mpirun noticed that process rank 3 with PID 802 on node csclprd3-0-8 exited on 
signal 11 (Segmentation fault)."

whenever I request more than 28 slots. These errors even occur when I run mpirun
locally on a compute node that has 32 slots (8 cores, 16 with hyperthreading).

When I request fewer than 28 slots, I have no problems whatsoever.
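
One generic way to get a stack trace from a crashing rank without relying on core
files is to install a SIGSEGV handler in the test program. The sketch below is a
generic illustration (not part of the original test code) and assumes glibc's
execinfo.h; it dumps a backtrace to stderr before exiting:

/* Illustrative debugging aid: dump a backtrace on SIGSEGV.
 * The deliberate NULL dereference in main() only demonstrates the
 * handler; in the real test program you would just install the
 * handler (e.g. right after MPI_Init()) and remove the crash. */
#include <execinfo.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

static void segv_handler(int sig)
{
    void *frames[32];
    int n = backtrace(frames, 32);

    /* backtrace_symbols_fd() writes straight to the fd, no malloc */
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    _exit(128 + sig);
}

int main(void)
{
    signal(SIGSEGV, segv_handler);

    volatile int *p = NULL;   /* force a segfault for demonstration */
    return *p;
}

The printed addresses can then be resolved with addr2line against the binary, which
complements the gdb-on-corefile approach mentioned elsewhere in this thread.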

OS:
CentOS release 6.3 (Final)

openMPI information:
                 Package: Open MPI mockbu...@c6b8.bsys.dev.centos.org Distribution
                Open MPI: 1.5.4
   Open MPI SVN revision: r25060
   Open MPI release date: Aug 18, 2011
                Open RTE: 1.5.4
   Open RTE SVN revision: r25060
   Open RTE release date: Aug 18, 2011
                    OPAL: 1.5.4
       OPAL SVN revision: r25060
       OPAL release date: Aug 18, 2011
            Ident string: 1.5.4
                  Prefix: /usr/lib64/openmpi
 Configured architecture: x86_64-unknown-linux-gnu
          Configure host: c6b8.bsys.dev.centos.org
           Configured by: mockbuild
           Configured on: Fri Jun 22 06:42:03 UTC 2012
          Configure host: c6b8.bsys.dev.centos.org
                Built by: mockbuild
                Built on: Fri Jun 22 06:46:48 UTC 2012
              Built host: c6b8.bsys.dev.centos.org
              C bindings: yes
            C++ bindings: yes
      Fortran77 bindings: yes (all)
      Fortran90 bindings: yes
 Fortran90 bindings size: small
              C compiler: gcc
     C compiler absolute: /usr/bin/gcc
  C compiler family name: GNU
      C compiler version: 4.4.6
            C++ compiler: g++
   C++ compiler absolute: /usr/bin/g++
      Fortran77 compiler: gfortran
  Fortran77 compiler abs: /usr/bin/gfortran
      Fortran90 compiler: gfortran
  Fortran90 compiler abs: /usr/bin/gfortran
             C profiling: yes
           C++ profiling: yes
     Fortran77 profiling: yes
     Fortran90 profiling: yes
          C++ exceptions: no
          Thread support: posix (MPI_THREAD_MULTIPLE: no, progress: no)
           Sparse Groups: no
  Internal debug support: no
  MPI interface warnings: no
     MPI parameter check: runtime
Memory profiling support: no
Memory debugging support: no
         libltdl support: yes
   Heterogeneous support: no
 mpirun default --prefix: no
         MPI I/O support: yes
       MPI_WTIME support: gettimeofday
     Symbol vis. support: yes
          MPI extensions: affinity example
   FT Checkpoint support: no (checkpoint thread: no)
  MPI_MAX_PROCESSOR_NAME: 256
    MPI_MAX_ERROR_STRING: 256
     MPI_MAX_OBJECT_NAME: 64
        MPI_MAX_INFO_KEY: 36
        MPI_MAX_INFO_VAL: 256
       MPI_MAX_PORT_NAME: 1024
  MPI_MAX_DATAREP_STRING: 128
           MCA backtrace: execinfo (MCA v2.0, API v2.0, Component v1.5.4)
          MCA memchecker: valgrind (MCA v2.0, API v2.0, Component v1.5.4)
              MCA memory: linux (MCA v2.0, API v2.0, Component v1.5.4)
           MCA paffinity: hwloc (MCA v2.0, API v2.0, Component v1.5.4)
               MCA carto: auto_detect (MCA v2.0, API v2.0, Component v1.5.4)
               MCA carto: file (MCA v2.0, API v2.0, Component v1.5.4)
           MCA maffinity: first_use (MCA v2.0, API v2.0, Component v1.5.4)
           MCA maffinity: libnuma (MCA v2.0, API v2.0, Component v1.5.4)
               MCA timer: linux (MCA v2.0, API v2.0, Component v1.5.4)
         MCA installdirs: env (MCA v2.0, API v2.0, Component v1.5.4)
         MCA installdirs: config (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA dpm: orte (MCA v2.0, API v2.0, Component v1.5.4)
              MCA pubsub: orte (MCA v2.0, API v2.0, Component v1.5.4)
           MCA allocator: basic (MCA v2.0, API v2.0, Component v1.5.4)
           MCA allocator: bucket (MCA v2.0, API v2.0, Component v1.5.4)
                MCA coll: basic (MCA v2.0, API v2.0, Component v1.5.4)
                MCA coll: hierarch (MCA v2.0, API v2.0, Component v1.5.4)
                MCA coll: inter (MCA v2.0, API v2.0, Component v1.5.4)
                MCA coll: self (MCA v2.0, API v2.0, Component v1.5.4)
                MCA coll: sm (MCA v2.0, API v2.0, Component v1.5.4)
                MCA coll: sync (MCA v2.0, API v2.0, Component v1.5.4)
                MCA coll: tuned (MCA v2.0, API v2.0, Component v1.5.4)
               MCA mpool: fake (MCA v2.0, API v2.0, Component v1.5.4)
               MCA mpool: rdma (MCA v2.0, API v2.0, Component v1.5.4)
               MCA mpool: sm (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA pml: bfo (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA pml: csum (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA pml: ob1 (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA pml: v (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA bml: r2 (MCA v2.0, API v2.0, Component v1.5.4)
              MCA rcache: vma (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA btl: ofud (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA btl: openib (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA btl: self (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA btl: sm (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA btl: tcp (MCA v2.0, API v2.0, Component v1.5.4)
                MCA topo: unity (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA osc: pt2pt (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA osc: rdma (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA iof: hnp (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA iof: orted (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA iof: tool (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA oob: tcp (MCA v2.0, API v2.0, Component v1.5.4)
                MCA odls: default (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA ras: cm (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA ras: gridengine (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA ras: loadleveler (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA ras: slurm (MCA v2.0, API v2.0, Component v1.5.4)
               MCA rmaps: load_balance (MCA v2.0, API v2.0, Component v1.5.4)
               MCA rmaps: rank_file (MCA v2.0, API v2.0, Component v1.5.4)
               MCA rmaps: resilient (MCA v2.0, API v2.0, Component v1.5.4)
               MCA rmaps: round_robin (MCA v2.0, API v2.0, Component v1.5.4)
               MCA rmaps: seq (MCA v2.0, API v2.0, Component v1.5.4)
               MCA rmaps: topo (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA rml: oob (MCA v2.0, API v2.0, Component v1.5.4)
              MCA routed: binomial (MCA v2.0, API v2.0, Component v1.5.4)
              MCA routed: cm (MCA v2.0, API v2.0, Component v1.5.4)
              MCA routed: direct (MCA v2.0, API v2.0, Component v1.5.4)
              MCA routed: linear (MCA v2.0, API v2.0, Component v1.5.4)
              MCA routed: radix (MCA v2.0, API v2.0, Component v1.5.4)
              MCA routed: slave (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA plm: rsh (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA plm: rshd (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA plm: slurm (MCA v2.0, API v2.0, Component v1.5.4)
               MCA filem: rsh (MCA v2.0, API v2.0, Component v1.5.4)
              MCA errmgr: default (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA ess: env (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA ess: hnp (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA ess: singleton (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA ess: slave (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA ess: slurm (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA ess: slurmd (MCA v2.0, API v2.0, Component v1.5.4)
                 MCA ess: tool (MCA v2.0, API v2.0, Component v1.5.4)
             MCA grpcomm: bad (MCA v2.0, API v2.0, Component v1.5.4)
             MCA grpcomm: basic (MCA v2.0, API v2.0, Component v1.5.4)
             MCA grpcomm: hier (MCA v2.0, API v2.0, Component v1.5.4)
            MCA notifier: command (MCA v2.0, API v1.0, Component v1.5.4)
            MCA notifier: smtp (MCA v2.0, API v1.0, Component v1.5.4)
            MCA notifier: syslog (MCA v2.0, API v1.0, Component v1.5.4)
