Hi Jeff,

1). I ran it on 4 processors and got:
 node           0 :Hello, world
 node           2 :Hello, world
 node           1 :Hello, world
 node           3 :Hello, world

2). The AZTEC code was compiled with openmpi.

3). Enclosed please find the Fortran routine:
    az_tutorial_with_MPI.f

4). The Aztec code is open source. Its last version was released in 2001, and it is now part of the larger Trilinos package. I have not consulted the Aztec authors/maintainers about any known problems running Aztec with Open MPI on Debian.

5). I will recompile AZTEC and the Fortran routine with the -g flag and let you know the results.
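
For example (a sketch based on the compile/link commands used elsewhere in
this thread; paths and flags may need adjusting):

 mpif77 -g -O -I../lib -c -o az_tutorial_with_MPI.o az_tutorial_with_MPI.f
 mpif77 -g az_tutorial_with_MPI.o -O -L../lib -laztec -o sample

with -g also added to the compiler flags in the AZTEC library Makefile.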

Thanks again,
Rachel

On Thu, 2 Sep 2010, Jeff Squyres wrote:

On Sep 2, 2010, at 9:28 AM, Rachel Gordon wrote:

Concerning 1.: I just ran the simple MPI Fortran program hello.f, which uses:
     call MPI_INIT(ierror)
     call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
     call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)

The program ran with no problem.

Did you print the rank and size, and did they print appropriate values?
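
Something like the following minimal sketch would do (not necessarily what
your hello.f contains):

       program hello
       implicit none
       include 'mpif.h'
       integer rank, size, ierror
       call MPI_INIT(ierror)
       call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
       call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
C      print both values so a bogus rank or size is easy to spot
       print *, 'node ', rank, ' of ', size, ' :Hello, world'
       call MPI_FINALIZE(ierror)
       end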

Some more information:

The AZTEC test case I am trying to run is running with no problem on my old PC 
cluster (Redhat operating system) using gcc, g77 and:
LIB_LINUX       = /usr/lib/gcc-lib/i386-redhat-linux/2.96/libg2c.a

Concerning 2. Can you instruct me how to perform the check?

Were Aztec and your test program compiled with -g?  If so, you should be able to:

gdb name_of_your_test_program name_of_corefile

That is, if you run the aztec test program and it dumps a corefile, then load up that corefile in gdb.  This will allow 
gdb to present a snapshot of information from that process when it died.  You can "bt" at the gdb prompt to 
get a full back trace of exactly where it died.  Hopefully, it'll include file and line numbers in the Aztec source (if 
Aztec was compiled and linked with -g).  Then you can check the Aztec source to see if there's some problem with how 
they're calling MPI_COMM_SIZE.  You should also be able to print variable values and see if they're passing in some 
obviously-bogus value to MPI_COMM_SIZE (e.g., "p variable_name" shows the value of that variable name).  Be 
aware that Fortran passes all parameters by reference, so they show up as pointers in C.  So "p 
variable_name" may show a pointer value.  Therefore, "p *variable_name" will show the dereferenced 
value, which is likely what you want.
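
For example, assuming the corefile is named "core" and the executable is
"sample":

 gdb sample core
 (gdb) bt
 (gdb) p variable_name
 (gdb) p *variable_name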

However, just by looking at the abbreviated stack trace that OMPI output, it 
*looks* like they cross the Fortran->C barrier at az_set_proc_config_() -- 
meaning that you

call az_set_proc_config()

in Fortran, and the fortran version of that function turns around and calls a C 
function AZ_set_proc_config().  This function then calls another C function, 
parallel_info(), which then calls the C binding for MPI_COMM_SIZE.  So I don't 
think you need to worry about Fortran variables down where Aztec is calling 
MPI_COMM_SIZE.

Just to make sure -- was Aztec compiled with Open MPI?  (not mpich or some 
other MPI)

Beyond this, I'm not much help because I don't know anything about Aztec.  Have 
you pinged the Aztec authors/maintainers to see if there are any known problems 
with Aztec and Open MPI on Debian?

FWIW, I haven't used g77 in years.  Is there any reason you're not using 
gfortran?  I was under the impression (perhaps mistakenly) that gfortran was 
the next generation after g77 and g77 was long-since dead...?


Rachel




On Thu, 2 Sep 2010, Jeff Squyres wrote:

I'm afraid I have no insight into Aztec itself; I don't know anything about it. 
 Two questions:

1. Can you run simple MPI fortran programs that call MPI_Comm_size with 
MPI_COMM_WORLD?

2. Can you get any more information than the stack trace?  I.e., can you gdb a 
corefile to see exactly where in Aztec it's failing and confirm that it's not 
actually a bug in Aztec? I'm not trying to finger point, but if something is 
failing right away in the beginning with a call to MPI_COMM_SIZE, it's 
*usually* an application error of some sort (we haven't even gotten to anything 
complicated yet like MPI_SEND, etc.).  For example:

- The fact that it got through the parameter error checking in
MPI_COMM_SIZE is a good thing, but it doesn't necessarily mean that the 
communicator it passed was valid (for example).
- Did they leave off the ierr argument?  (unlikely, but always possible)
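
For reference, the Fortran binding takes an extra error argument, roughly:

       integer size, ierr
       call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)

so a missing ierr would make the library write through a bogus address.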



On Sep 2, 2010, at 8:06 AM, Rachel Gordon wrote:

Dear Jeff,

The cluster has only the openmpi version of MPI and the mpi.h file is installed 
in /shared/include/mpi.h

Anyhow, I omitted the COMM size parameter and recompiled/linked the case using:

mpif77 -O   -I../lib  -c -o az_tutorial_with_MPI.o az_tutorial_with_MPI.f
mpif77 az_tutorial_with_MPI.o -O -L../lib -laztec      -o sample

But when I try running 'sample' I get the same:

[cluster:00377] *** Process received signal ***
[cluster:00377] Signal: Segmentation fault (11)
[cluster:00377] Signal code: Address not mapped (1)
[cluster:00377] Failing at address: 0x100000098
[cluster:00377] [ 0] /lib/libpthread.so.0 [0x7f6b55040a80]
[cluster:00377] [ 1] /shared/lib/libmpi.so.0(MPI_Comm_size+0x6e) 
[0x7f6b564d834e]
[cluster:00377] [ 2] sample(parallel_info+0x24) [0x41d2ba]
[cluster:00377] [ 3] sample(AZ_set_proc_config+0x2d) [0x408417]
[cluster:00377] [ 4] sample(az_set_proc_config_+0xc) [0x407b85]
[cluster:00377] [ 5] sample(MAIN__+0x54) [0x407662]
[cluster:00377] [ 6] sample(main+0x2c) [0x44e8ec]
[cluster:00377] [ 7] /lib/libc.so.6(__libc_start_main+0xe6) [0x7f6b54cfd1a6]
[cluster:00377] [ 8] sample [0x407459]
[cluster:00377] *** End of error message ***
--------------------------------------------------------------------------

Rachel



On Thu, 2 Sep 2010, Jeff Squyres (jsquyres) wrote:

If you're segv'ing in comm size, this usually means you are using the wrong 
mpi.h.  Ensure you are using ompi's mpi.h so that you get the right values for 
all the MPI constants.
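If you want to double check which mpi.h a wrapper picks up, Open MPI's
wrapper compilers can print the underlying command line they would use,
e.g. "mpicc --showme:compile" or "mpif77 --showme".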

Sent from my PDA. No type good.

On Sep 2, 2010, at 7:35 AM, Rachel Gordon <rgor...@techunix.technion.ac.il> 
wrote:

Dear Manuel,

Sorry, it didn't help.

The cluster I am trying to run on has only the openmpi MPI version. So, mpif77 
is equivalent to mpif77.openmpi and mpicc is equivalent to mpicc.openmpi

I changed the Makefile, replacing gfortran by mpif77 and gcc by mpicc.
The compilation and linkage stage ran with no problem:


mpif77 -O   -I../lib -DMAX_MEM_SIZE=16731136 -DCOMM_BUFF_SIZE=200000 
-DMAX_CHUNK_SIZE=200000  -c -o az_tutorial_with_MPI.o az_tutorial_with_MPI.f
mpif77 az_tutorial_with_MPI.o -O -L../lib -laztec      -o sample


But again when I try to run 'sample' I get:

mpirun -np 1 sample


[cluster:24989] *** Process received signal ***
[cluster:24989] Signal: Segmentation fault (11)
[cluster:24989] Signal code: Address not mapped (1)
[cluster:24989] Failing at address: 0x100000098
[cluster:24989] [ 0] /lib/libpthread.so.0 [0x7f5058036a80]
[cluster:24989] [ 1] /shared/lib/libmpi.so.0(MPI_Comm_size+0x6e) 
[0x7f50594ce34e]
[cluster:24989] [ 2] sample(parallel_info+0x24) [0x41d2ba]
[cluster:24989] [ 3] sample(AZ_set_proc_config+0x2d) [0x408417]
[cluster:24989] [ 4] sample(az_set_proc_config_+0xc) [0x407b85]
[cluster:24989] [ 5] sample(MAIN__+0x54) [0x407662]
[cluster:24989] [ 6] sample(main+0x2c) [0x44e8ec]
[cluster:24989] [ 7] /lib/libc.so.6(__libc_start_main+0xe6) [0x7f5057cf31a6]
[cluster:24989] [ 8] sample [0x407459]
[cluster:24989] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 24989 on node cluster exited on 
signal 11 (Segmentation fault).
--------------------------------------------------------------------------

Thanks for your help and cooperation,
Sincerely,
Rachel



On Wed, 1 Sep 2010, Manuel Prinz wrote:

Hi Rachel,

I'm not very familiar with Fortran, so I'm most likely not of too much
help here. I added Jeff to CC; maybe he can shed some light on this.

On Monday, 09.08.2010, at 12:59 +0300, Rachel Gordon wrote:
package:  openmpi

dpkg --search openmpi
gromacs-openmpi: /usr/share/doc/gromacs-openmpi/copyright
gromacs-dev: /usr/lib/libmd_mpi_openmpi.la
gromacs-dev: /usr/lib/libgmx_mpi_d_openmpi.la
gromacs-openmpi: /usr/share/lintian/overrides/gromacs-openmpi
gromacs-openmpi: /usr/lib/libmd_mpi_openmpi.so.5
gromacs-openmpi: /usr/lib/libmd_mpi_d_openmpi.so.5.0.0
gromacs-dev: /usr/lib/libmd_mpi_openmpi.so
gromacs-dev: /usr/lib/libgmx_mpi_d_openmpi.so
gromacs-openmpi: /usr/lib/libmd_mpi_openmpi.so.5.0.0
gromacs-openmpi: /usr/bin/mdrun_mpi_d.openmpi
gromacs-openmpi: /usr/lib/libgmx_mpi_d_openmpi.so.5.0.0
gromacs-openmpi: /usr/share/doc/gromacs-openmpi/README.Debian
gromacs-dev: /usr/lib/libgmx_mpi_d_openmpi.a
gromacs-openmpi: /usr/bin/mdrun_mpi.openmpi
gromacs-openmpi: /usr/share/doc/gromacs-openmpi/changelog.Debian.gz
gromacs-dev: /usr/lib/libmd_mpi_d_openmpi.la
gromacs-openmpi: /usr/share/man/man1/mdrun_mpi_d.openmpi.1.gz
gromacs-dev: /usr/lib/libgmx_mpi_openmpi.a
gromacs-openmpi: /usr/lib/libgmx_mpi_openmpi.so.5.0.0
gromacs-dev: /usr/lib/libmd_mpi_d_openmpi.so
gromacs-openmpi: /usr/lib/libmd_mpi_d_openmpi.so.5
gromacs-dev: /usr/lib/libgmx_mpi_openmpi.la
gromacs-openmpi: /usr/share/man/man1/mdrun_mpi.openmpi.1.gz
gromacs-openmpi: /usr/share/doc/gromacs-openmpi
gromacs-dev: /usr/lib/libmd_mpi_openmpi.a
gromacs-dev: /usr/lib/libgmx_mpi_openmpi.so
gromacs-openmpi: /usr/lib/libgmx_mpi_openmpi.so.5
gromacs-openmpi: /usr/lib/libgmx_mpi_d_openmpi.so.5
gromacs-dev: /usr/lib/libmd_mpi_d_openmpi.a


Dear support,
I am trying to run a test case of the AZTEC library named
az_tutorial_with_MPI.f. The example uses gfortran + MPI. The
compilation and linkage stage goes O.K., generating an executable
'sample'. But when I try to run sample (on 1 or more
processors) the run crashes immediately.

The compilation and linkage stage is done as follows:

gfortran -O  -I/shared/include -I/shared/include/openmpi/ompi/mpi/cxx
-I../lib -DMAX_MEM_SIZE=16731136
-DCOMM_BUFF_SIZE=200000 -DMAX_CHUNK_SIZE=200000  -c -o
az_tutorial_with_MPI.o az_tutorial_with_MPI.f
gfortran az_tutorial_with_MPI.o -O -L../lib -laztec  -lm -L/shared/lib
-lgfortran -lmpi -lmpi_f77 -o sample

Generally, when compiling programs for use with MPI, you should use the
compiler wrappers, which do all the magic. In Debian's case these are
mpif77.openmpi and mpif90.openmpi, respectively. Could you give that a
try?
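
For example, something along these lines (a sketch reusing the flags from
your current Makefile):

mpif77.openmpi -O -I../lib -DMAX_MEM_SIZE=16731136 -DCOMM_BUFF_SIZE=200000 -DMAX_CHUNK_SIZE=200000 -c -o az_tutorial_with_MPI.o az_tutorial_with_MPI.f
mpif77.openmpi az_tutorial_with_MPI.o -O -L../lib -laztec -o sample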

The run:
/shared/home/gordon/Aztec_lib.dir/app>mpirun -np 1 sample

[cluster:12046] *** Process received signal ***
[cluster:12046] Signal: Segmentation fault (11)
[cluster:12046] Signal code: Address not mapped (1)
[cluster:12046] Failing at address: 0x100000098
[cluster:12046] [ 0] /lib/libc.so.6 [0x7fd4a2fa8f60]
[cluster:12046] [ 1] /shared/lib/libmpi.so.0(MPI_Comm_size+0x6e)
[0x7fd4a376c34e]
[cluster:12046] [ 2] sample [0x4178aa]
[cluster:12046] [ 3] sample [0x402a07]
[cluster:12046] [ 4] sample [0x402175]
[cluster:12046] [ 5] sample [0x401c52]
[cluster:12046] [ 6] sample [0x448edc]
[cluster:12046] [ 7] /lib/libc.so.6(__libc_start_main+0xe6)
[0x7fd4a2f951a6]
[cluster:12046] [ 8] sample [0x401a49]
[cluster:12046] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 12046 on node cluster exited
on signal 11 (Segmentation fault).

Here is some information about the machine:

uname -a
Linux cluster 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64
GNU/Linux


lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 5.0.5 (lenny)
Release:        5.0.5
Codename:       lenny

gcc --version
gcc (Debian 4.3.2-1.1) 4.3.2

gfortran --version
GNU Fortran (Debian 4.3.2-1.1) 4.3.2

ldd sample
      linux-vdso.so.1 =>  (0x00007fffffffe000)
      libgfortran.so.3 => /usr/lib/libgfortran.so.3 (0x00007fd29db16000)
      libm.so.6 => /lib/libm.so.6 (0x00007fd29d893000)
      libmpi.so.0 => /shared/lib/libmpi.so.0 (0x00007fd29d5e7000)
      libmpi_f77.so.0 => /shared/lib/libmpi_f77.so.0
(0x00007fd29d3af000)
      libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007fd29d198000)
      libc.so.6 => /lib/libc.so.6 (0x00007fd29ce45000)
      libopen-rte.so.0 => /shared/lib/libopen-rte.so.0
(0x00007fd29cbf8000)
      libopen-pal.so.0 => /shared/lib/libopen-pal.so.0
(0x00007fd29c9a2000)
      libdl.so.2 => /lib/libdl.so.2 (0x00007fd29c79e000)
      libnsl.so.1 => /lib/libnsl.so.1 (0x00007fd29c586000)
      libutil.so.1 => /lib/libutil.so.1 (0x00007fd29c383000)
      libpthread.so.0 => /lib/libpthread.so.0 (0x00007fd29c167000)
      /lib64/ld-linux-x86-64.so.2 (0x00007fd29ddf1000)


Let me just mention that the C+MPI test case of the AZTEC library
'az_tutorial.c' runs with no problem.
Also, az_tutorial_with_MPI.f runs O.K. on my 32-bit Linux cluster running
gcc, g77 and MPICH, and on my 16-processor SGI Itanium 64-bit machine.

The IA64 architecture is supported by Open MPI, so this should be OK.

Thank you for your help,

Best regards,
Manuel






--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/





C====================================================================
C ------------------------
C | CVS File Information |
C ------------------------
C
C $RCSfile: az_tutorial_with_MPI.f,v $
C
C $Author: tuminaro $
C
C $Date: 1999/10/07 00:07:43 $
C
C $Revision: 1.4 $
C
C $Name:  $
C====================================================================*/

C
C***********************************************************************
C      Copyright 1995, Sandia Corporation.  The United States Government
C      retains a nonexclusive license in this software as prescribed in
C      AL 88-1 and AL 91-7.  Export of this program may require a
C      license from the United States Government.
C***********************************************************************
C
C
C
       program main
C
C---------------------------------------------------------------
C      Set up a 2D Poisson test problem and solve it with AZTEC.
C      Author:   Ray Tuminaro, Div 1422, Sandia National Labs
C      date:     11/10/94
C
       implicit none
       include "az_aztecf.h"
       include "mpif.h"
C
       integer   n, nrow
       common    /global/ n
C           POISSON EQUATION WILL BE SOLVED ON an n x n GRID.
C           NOTE: n should be odd for rhs to be properly set.
C
       double precision b(0:1024),x(0:1024)
C           rhs and approximate solution
       integer    i
C
C             See Aztec User's Guide for the variables that follow:
C
       integer proc_config(0:AZ_PROC_SIZE), options(0:AZ_OPTIONS_SIZE)
       double precision params(0:AZ_PARAMS_SIZE)
       integer data_org(0:1024)
       double precision status(0:AZ_STATUS_SIZE)
       integer update(0:1024), external(0:1024)
       integer update_index(0:1024), extern_index(0:1024)
       integer bindx(0:1024)
       double  precision val(0:1024)
       integer N_update,ierror
C
C
       call MPI_INIT(ierror)
C
C           # of unknowns updated on this node
        n = 6
C
C      get number of processors and the name of this processor
C
       call AZ_set_proc_config(proc_config, MPI_COMM_WORLD)
C
C      Define partitioning: matrix rows in ascending order assigned
C      to this node
C
       nrow = n*n
       call AZ_read_update(N_update,update,proc_config,nrow,1,0)
C
C      create the matrix: each processor creates only rows
C      appearing in update[] (using global col. numbers).
C
       bindx(0) = N_update+1
       do 250 i = 0, N_update-1
          call create_matrix_row_5pt(update(i),i,val,bindx)
250    continue
C
C      convert matrix to a local distributed matrix */
C
       call AZ_transform(proc_config,external,bindx,val,update,
     $                   update_index,extern_index,data_org,
     $                   N_update,0,0,0,0,AZ_MSR_MATRIX)
C
C      initialize AZTEC options
C
       call AZ_defaults(options, params)
C
C      Set rhs (delta function at grid center) and initialize guess
C
       do 350 i = 0, N_update-1
          x(update_index(i)) = 0.0
          b(update_index(i)) = 0.0
          if (update(i) .eq. 0) b(update_index(i)) = 1.0
350    continue
C
C      solve the system of equations using b  as the right hand side
C
       call AZ_solve(x,b, options, params, 0,bindx,0,0,
     $               0,val, data_org, status, proc_config)

C
       call MPI_FINALIZE(ierror)
C
       stop
       end

C*********************************************************************
C*********************************************************************
C
       subroutine create_matrix_row_5pt(row,location,val,bindx)
C
       integer row,location,bindx(0:*)
       double precision val(0:*)
       integer   n
       common    /global/ n
C
C Add one row to an MSR matrix corresponding to a 5pt discrete
C approximation to the 2D Poisson operator on an n x n square.
C
C  Parameters:
C     row          == global row number of the new row to be added.
C     location     == local row where diagonal of the new row will be stored.
C     val,bindx    == (see user's guide). On output, val[] and bindx[]
C                     are appended such that the new row has been added.
C
       integer k
C
C      check neighbors in each direction and add nonzero if neighbor exists
C
       k = bindx(location)
       bindx(k)  = row + 1
       if (mod(row,n) .ne. n-1) then
          val(k) = -1.
          k = k + 1
       endif
       bindx(k)  = row - 1
       if (mod(row,n) .ne.   0) then
          val(k) = -1.
          k = k + 1
       endif
       bindx(k)  = row + n
       if (mod(row/n,n) .ne. n-1) then
          val(k) = -1.
          k = k + 1
       endif
       bindx(k)  = row - n
       if (mod(row/n,n) .ne.   0) then
          val(k) = -1.
          k = k + 1
       endif

       bindx(location+1) = k
       val(location)   = 4.
       return
       end
