If you're looking for true fault tolerance, OMPI doesn't have it yet. An audit of the code base to ensure that errors are continuable has been discussed, but it is not currently on the roadmap.

The FT-MPI guys can comment on their timetable for bringing that technology in...


On Mar 18, 2007, at 9:47 PM, Mohammad Huwaidi wrote:

Thanks Jeff.

The kind of faults I was trying to trap are application/node failures. I literally kill the application on another node in the hope of trapping the failure and reacting accordingly. This is similar to FT-MPI shrinking the communicator size, etc.

If you can suggest a different implementation that will allow me to trap these failures, please let me know.
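[Editor's note: for completeness, the standard MPI way to keep errors from immediately aborting the job is to replace the default error handler with MPI_ERRORS_RETURN and check the ierr code after each call. A minimal sketch follows; note that, as discussed above, Open MPI at this point may still not deliver a continuable error when a peer process is killed, so this is only the portable starting point:]

```fortran
      program trap_errors
c     Sketch: switch MPI_COMM_WORLD from the default
c     MPI_ERRORS_ARE_FATAL handler to MPI_ERRORS_RETURN, so that
c     failed calls report an error code instead of aborting.
      implicit none
      include 'mpif.h'
      integer ierr, rank, buf
      call mpi_init(ierr)
      call mpi_errhandler_set(MPI_COMM_WORLD, MPI_ERRORS_RETURN, ierr)
      call mpi_comm_rank(MPI_COMM_WORLD, rank, ierr)
      buf = rank
c     Subsequent calls now report failure through ierr; whether a
c     killed peer actually produces a returnable error is up to the
c     implementation.
      call mpi_bcast(buf, 1, MPI_INTEGER, 0, MPI_COMM_WORLD, ierr)
      if (ierr .ne. MPI_SUCCESS) then
         write (*,*) 'rank', rank, ': MPI call failed, ierr =', ierr
      end if
      call mpi_finalize(ierr)
      end
```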

Regards,
Mohammad Huwaidi

users-requ...@open-mpi.org wrote:
Send users mailing list submissions to
        us...@open-mpi.org
To subscribe or unsubscribe via the World Wide Web, visit
        http://www.open-mpi.org/mailman/listinfo.cgi/users
or, via email, send a message with subject or body 'help' to
        users-requ...@open-mpi.org
You can reach the person managing the list at
        users-ow...@open-mpi.org
When replying, please edit your Subject line so it is more specific
than "Re: Contents of users digest..."
Today's Topics:
   1. open-mpi 1.2 build failure under Mac OS X 10.3.9
      (Marius Schamschula)
   2. Re: OpenMPI 1.2 bug: segmentation violation in mpi_pack
      (Jeff Squyres)
   3. Re: Fault Tolerance (Jeff Squyres)
   4. Re: Signal 13 (Ralph Castain)
----------------------------------------------------------------------
Message: 1
Date: Fri, 16 Mar 2007 18:42:22 -0500
From: Marius Schamschula <mar...@physics.aamu.edu>
Subject: [OMPI users] open-mpi 1.2 build failure under Mac OS X 10.3.9
To: us...@open-mpi.org
Message-ID: <82367db0-ebc6-4438-bbc2-d78963186...@physics.aamu.edu>
Content-Type: text/plain; charset="us-ascii"
Hi all,
I was building open-mpi 1.2 on my G4 running Mac OS X 10.3.9 and had a build failure with the following:

depbase=`echo runtime/ompi_mpi_preconnect.lo | sed 's|[^/]*$|.deps/&|;s|\.lo$||'`; \
if /bin/sh ../libtool --tag=CC --mode=compile gcc -DHAVE_CONFIG_H -I. -I. -I../opal/include -I../orte/include -I../ompi/include -I../ompi/include -I.. -D_REENTRANT -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -MT runtime/ompi_mpi_preconnect.lo -MD -MP -MF "$depbase.Tpo" -c -o runtime/ompi_mpi_preconnect.lo runtime/ompi_mpi_preconnect.c; \
then mv -f "$depbase.Tpo" "$depbase.Plo"; else rm -f "$depbase.Tpo"; exit 1; fi
libtool: compile: gcc -DHAVE_CONFIG_H -I. -I. -I../opal/include -I../orte/include -I../ompi/include -I../ompi/include -I.. -D_REENTRANT -O3 -DNDEBUG -finline-functions -fno-strict-aliasing -MT runtime/ompi_mpi_preconnect.lo -MD -MP -MF runtime/.deps/ompi_mpi_preconnect.Tpo -c runtime/ompi_mpi_preconnect.c -fno-common -DPIC -o runtime/.libs/ompi_mpi_preconnect.o
runtime/ompi_mpi_preconnect.c: In function `ompi_init_do_oob_preconnect':
runtime/ompi_mpi_preconnect.c:74: error: storage size of `msg' isn't known
make[2]: *** [runtime/ompi_mpi_preconnect.lo] Error 1
make[1]: *** [all-recursive] Error 1
make: *** [all-recursive] Error 1
$ gcc -v
Reading specs from /usr/libexec/gcc/darwin/ppc/3.3/specs
Thread model: posix
gcc version 3.3 20030304 (Apple Computer, Inc. build 1495)
$ g77 -v
Reading specs from /usr/local/lib/gcc/powerpc-apple-darwin7.3.0/3.5.0/specs
Configured with: ../gcc/configure --enable-threads=posix --enable-languages=f77
Thread model: posix
gcc version 3.5.0 20040429 (experimental)
(g77 from hpc.sf.net)
Note: I had no such problem under Mac OS X 10.4.9 with my ppc and x86 builds. However, I did notice that the configure script did not detect g95 from g95.org correctly:
*** Fortran 90/95 compiler
checking for gfortran... no
checking for f95... no
checking for fort... no
checking for xlf95... no
checking for ifort... no
checking for ifc... no
checking for efc... no
checking for pgf95... no
checking for lf95... no
checking for f90... no
checking for xlf90... no
checking for pgf90... no
checking for epcf90... no
checking whether we are using the GNU Fortran compiler... no
configure --help doesn't give any hint about specifying F95.
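[Editor's note: assuming Open MPI's configure honors the usual Fortran environment variables (FC for Fortran 90/95 and F77 for Fortran 77, which it does in this era), the compilers can be named explicitly on the configure line. The prefix path here is purely illustrative:]

```shell
# Tell configure which Fortran compilers to use instead of relying
# on autodetection; adjust paths and prefix for your installation.
./configure FC=g95 F77=g77 --prefix=/usr/local/openmpi-1.2
```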
TIA,
Marius
--
Marius Schamschula,  Alabama A & M University, Department of Physics
     The Center for Hydrology Soil Climatology and Remote Sensing
    http://optics.physics.aamu.edu/ - http://www.physics.aamu.edu/
           http://wx.aamu.edu/ - http://www.aamu.edu/hscars/
------------------------------
Message: 2
Date: Fri, 16 Mar 2007 19:46:39 -0400
From: Jeff Squyres <jsquy...@cisco.com>
Subject: Re: [OMPI users] OpenMPI 1.2 bug: segmentation violation in
        mpi_pack
To: Open MPI Users <us...@open-mpi.org>
Message-ID: <045dabac-1369-4e45-8e0c-fd9fba13c...@cisco.com>
Content-Type: text/plain; charset=US-ASCII; delsp=yes; format=flowed
The problem with both the f77 and f90 programs is that you forgot to put "ierr" as the last argument to MPI_PACK. This causes a segv because neither of them is a correct MPI program. But it's always good to hear that we can deliver a smaller corefile in v1.2! :-)
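[Editor's note: in the Fortran binding, MPI_PACK takes the error status as its eighth and final argument (inbuf, incount, datatype, outbuf, outsize, position, comm, ierror); the calls in the programs quoted below omit it. A corrected call, in the fixed-form style of the first program, looks like:]

```fortran
c     ierr added as the final (error status) argument
      call mpi_pack (cmd, 1, MPI_INTEGER,
     *     cmdbuf, lcmdbuf, bufpos, MPI_COMM_WORLD, ierr)
```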
On Mar 16, 2007, at 7:25 PM, Erik Deumens wrote:
I have a small program in F77 that crashes with a SEGV and
a 130 MB core file. It is true that the crash is much cleaner
in OpenMPI 1.2; nice improvement! The core file is 500 MB with
OpenMPI 1.1.

I am running on CentOS 4.4 with the latest patches.

mpif77 -g -o bug bug.f
mpirun -np 2 ./bug

I also have a bug.f90 (which I made first) and it crashes
too with the Intel ifort compiler 9.1.039.

--
Dr. Erik Deumens
Interim Director
Quantum Theory Project
New Physics Building 2334                    deum...@qtp.ufl.edu
University of Florida            http://www.qtp.ufl.edu/~deumens
Gainesville, Florida 32611-8435                    (352)392-6980

      program mainf
c     mpif77 -g -o bug bug.f
c     mpirun -np 2 ./bug
      implicit none
      include 'mpif.h'
      character*80 inpfile
      integer l
      integer i
      integer stat
      integer cmdbuf(4)
      integer lcmdbuf
      integer ierr
      integer ntasks
      integer taskid
      integer bufpos
      integer cmd
      integer ldata
      character*(mpi_max_processor_name) hostnm
      integer iuinp
      integer iuout
      integer lnam
      real*8 bcaststart
      iuinp = 5
      iuout = 6
      lcmdbuf = 16
      i = 0
      call mpi_init(ierr)
      call mpi_comm_size (mpi_comm_world, ntasks, ierr)
      call mpi_comm_rank (mpi_comm_world, taskid, ierr)
      hostnm = ' '
      call mpi_get_processor_name (hostnm, lnam, ierr)
      write (iuout,*) 'task',taskid,'of',ntasks,'on ',hostnm(1:lnam)
      if (taskid == 0) then
        inpfile = ' '
        do
          write (iuout,*) 'Enter .inp filename:'
          read (iuinp,*) inpfile
          if (inpfile /= ' ') exit
        end do
        l = len_trim(inpfile)
        write (iuout,*) 'task',taskid,inpfile(1:l)
        bufpos = 0
        cmd = 1099
        ldata = 7
        write (iuout,*) 'task',taskid,cmdbuf,bufpos
        write (iuout,*) 'task',taskid,cmd,lcmdbuf
        call mpi_pack (cmd, 1, MPI_INTEGER,
     *       cmdbuf, lcmdbuf, bufpos, MPI_COMM_WORLD)
        write (iuout,*) 'task',taskid,cmdbuf,bufpos
        write (iuout,*) 'task',taskid,ldata,lcmdbuf
        call mpi_pack (ldata, 1, MPI_INTEGER,
     *       cmdbuf, lcmdbuf, bufpos, MPI_COMM_WORLD)
        bcaststart = mpi_wtime()
        write (iuout,*) 'task',taskid,cmdbuf,bufpos
        write (iuout,*) 'task',taskid,bcaststart,lcmdbuf
        call mpi_pack (bcaststart, 1, MPI_DOUBLE_PRECISION,
     *       cmdbuf, lcmdbuf, bufpos, MPI_COMM_WORLD)
        write (iuout,*) 'task',taskid,cmdbuf,bufpos
      end if
      call mpi_bcast (cmdbuf, lcmdbuf, MPI_PACKED,
     *     0, MPI_COMM_WORLD, ierr)
      call mpi_finalize(ierr)
      stop
      end program mainf

program mainf
  ! ifort -g -I /share/local/lib/ompi/include -o bug bug.f90
  !       -L /share/local/lib/ompi/lib -lmpi_f77 -lmpi
  ! mpirun -np 2 ./bug
  implicit none
  include 'mpif.h'
  character(len=80) :: inpfile
  character(len=1), dimension(80) :: cinpfile
  integer :: l
  integer :: i
  integer :: stat
  integer, dimension(4) :: cmdbuf
  integer :: lcmdbuf
  integer :: ierr
  integer :: ntasks
  integer :: taskid
  integer :: bufpos
  integer :: cmd
  integer :: ldata
  character(len=mpi_max_processor_name) :: hostnm
  integer :: iuinp = 5
  integer :: iuout = 6
  integer :: lnam
  real(8) :: bcaststart
  lcmdbuf = 16
  i = 0
  call mpi_init(ierr)
  call mpi_comm_size (mpi_comm_world, ntasks, ierr)
  call mpi_comm_rank (mpi_comm_world, taskid, ierr)
  hostnm = ' '
  call mpi_get_processor_name (hostnm, lnam, ierr)
  write (iuout,*) 'task',taskid,'of',ntasks,'on ',hostnm(1:lnam)
  if (taskid == 0) then
     inpfile = ' '
     do
        write (iuout,*) 'Enter .inp filename:'
        read (iuinp,*) inpfile
        if (inpfile /= ' ') exit
     end do
     l = len_trim(inpfile)
     do i=1,l
        cinpfile(i) = inpfile(i:i)
     end do
     cinpfile(l+1) = char(0)
     write (iuout,*) 'task',taskid,inpfile(1:l)
     bufpos = 0
     cmd = 1099
     ldata = 7
     write (iuout,*) 'task',taskid,cmdbuf,bufpos
     ! The next two lines exhibit the bug
     ! Uncomment the first and the program works
     ! Uncomment the second and the program dies in mpi_pack
     ! and produces a 137 MB core file.
     write (iuout,*) 'task',taskid,cmd,lcmdbuf
!     write (iuout,*) 'task',taskid,cmd
     call mpi_pack (cmd, 1, MPI_INTEGER, &
          cmdbuf, lcmdbuf, bufpos, MPI_COMM_WORLD)
     write (iuout,*) 'task',taskid,cmdbuf,bufpos
     write (iuout,*) 'task',taskid,ldata,lcmdbuf
     call mpi_pack (ldata, 1, MPI_INTEGER, &
          cmdbuf, lcmdbuf, bufpos, MPI_COMM_WORLD)
     bcaststart = mpi_wtime()
     write (iuout,*) 'task',taskid,cmdbuf,bufpos
     write (iuout,*) 'task',taskid,bcaststart,lcmdbuf
     call mpi_pack (bcaststart, 1, MPI_DOUBLE_PRECISION, &
          cmdbuf, lcmdbuf, bufpos, MPI_COMM_WORLD)
     write (iuout,*) 'task',taskid,cmdbuf,bufpos
  end if
  call mpi_bcast (cmdbuf, lcmdbuf, MPI_PACKED, &
       0, MPI_COMM_WORLD, ierr)
  call mpi_finalize(ierr)
  stop
end program mainf

_______________________________________________
users mailing list
us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/users

--

Regards,
Mohammad Huwaidi

We can't resolve problems by using the same kind of thinking we used
when we created them.
                                                --Albert Einstein


--
Jeff Squyres
Cisco Systems
