On Sep 2, 2010, at 9:28 AM, Rachel Gordon wrote:

> Concerning 1.: I just ran the simple MPI Fortran program hello.f, which uses:
>
>     call MPI_INIT(ierror)
>     call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
>     call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
>
> The program ran with no problem.
Did you print the rank and size, and did they print appropriate values?

> Some more information:
>
> The AZTEC test case I am trying to run is running with no problem on my old
> PC cluster (Redhat operating system) using gcc, g77 and:
>
>     LIB_LINUX = /usr/lib/gcc-lib/i386-redhat-linux/2.96/libg2c.a
>
> Concerning 2.: Can you instruct me how to perform the check?

Were Aztec and your test program compiled with -g? If so, you should be able to:

    gdb name_of_your_test_program name_of_corefile

That is, if you run the Aztec test program and it dumps a corefile, load that corefile into gdb. This allows gdb to present a snapshot of information from that process when it died. You can type "bt" at the gdb prompt to get a full backtrace of exactly where it died. Hopefully, it'll include file and line numbers in the Aztec source (if Aztec was compiled and linked with -g). Then you can check the Aztec source to see if there's some problem with how they're calling MPI_COMM_SIZE.

You should also be able to print variable values and see if they're passing in some obviously bogus value to MPI_COMM_SIZE (e.g., "p variable_name" shows the value of that variable). Be aware that Fortran passes all parameters by reference, so they show up as pointers in C. So "p variable_name" may show a pointer value; "p *variable_name" will show the dereferenced value, which is likely what you want.

However, just by looking at the abbreviated stack trace that OMPI output, it *looks* like they cross the Fortran->C barrier at az_set_proc_config_() -- meaning that you call az_set_proc_config() in Fortran, and the Fortran version of that function turns around and calls a C function, AZ_set_proc_config(). This function then calls another C function, parallel_info(), which then calls the C binding for MPI_COMM_SIZE. So I don't think you need to worry about Fortran variables down where Aztec is calling MPI_COMM_SIZE.

Just to make sure -- was Aztec compiled with Open MPI?
(not mpich or some other MPI)

Beyond this, I'm not much help, because I don't know anything about Aztec. Have you pinged the Aztec authors/maintainers to see if there are any known problems with Aztec and Open MPI on Debian?

FWIW, I haven't used g77 in years. Is there any reason you're not using gfortran? I was under the impression (perhaps mistakenly) that gfortran was the next generation after g77 and that g77 was long since dead...?

> Rachel
>
> On Thu, 2 Sep 2010, Jeff Squyres wrote:
>
>> I'm afraid I have no insight into Aztec itself; I don't know anything
>> about it. Two questions:
>>
>> 1. Can you run simple MPI Fortran programs that call MPI_Comm_size with
>> MPI_COMM_WORLD?
>>
>> 2. Can you get any more information than the stack trace? I.e., can you
>> gdb a corefile to see exactly where in Aztec it's failing and confirm
>> that it's not actually a bug in Aztec? I'm not trying to finger-point,
>> but if something is failing right away in the beginning with a call to
>> MPI_COMM_SIZE, it's *usually* an application error of some sort (we
>> haven't even gotten to anything complicated yet, like MPI_SEND, etc.).
>> For example:
>>
>> - The fact that it got through the parameter error checking in
>>   MPI_COMM_SIZE is a good thing, but it doesn't necessarily mean that
>>   the communicator it passed was valid (for example).
>> - Did they leave off the ierr argument?
>>   (unlikely, but always possible)
>>
>> On Sep 2, 2010, at 8:06 AM, Rachel Gordon wrote:
>>
>>> Dear Jeff,
>>>
>>> The cluster has only the openmpi version of MPI, and the mpi.h file is
>>> installed in /shared/include/mpi.h
>>>
>>> Anyhow, I omitted the COMM size parameter and recompiled/linked the
>>> case using:
>>>
>>>     mpif77 -O -I../lib -c -o az_tutorial_with_MPI.o az_tutorial_with_MPI.f
>>>     mpif77 az_tutorial_with_MPI.o -O -L../lib -laztec -o sample
>>>
>>> But when I try running 'sample' I get the same:
>>>
>>>     [cluster:00377] *** Process received signal ***
>>>     [cluster:00377] Signal: Segmentation fault (11)
>>>     [cluster:00377] Signal code: Address not mapped (1)
>>>     [cluster:00377] Failing at address: 0x100000098
>>>     [cluster:00377] [ 0] /lib/libpthread.so.0 [0x7f6b55040a80]
>>>     [cluster:00377] [ 1] /shared/lib/libmpi.so.0(MPI_Comm_size+0x6e) [0x7f6b564d834e]
>>>     [cluster:00377] [ 2] sample(parallel_info+0x24) [0x41d2ba]
>>>     [cluster:00377] [ 3] sample(AZ_set_proc_config+0x2d) [0x408417]
>>>     [cluster:00377] [ 4] sample(az_set_proc_config_+0xc) [0x407b85]
>>>     [cluster:00377] [ 5] sample(MAIN__+0x54) [0x407662]
>>>     [cluster:00377] [ 6] sample(main+0x2c) [0x44e8ec]
>>>     [cluster:00377] [ 7] /lib/libc.so.6(__libc_start_main+0xe6) [0x7f6b54cfd1a6]
>>>     [cluster:00377] [ 8] sample [0x407459]
>>>     [cluster:00377] *** End of error message ***
>>>     --------------------------------------------------------------------------
>>>
>>> Rachel
>>>
>>> On Thu, 2 Sep 2010, Jeff Squyres (jsquyres) wrote:
>>>
>>>> If you're segv'ing in comm size, this usually means you are using the
>>>> wrong mpi.h. Ensure you are using ompi's mpi.h so that you get the
>>>> right values for all the MPI constants.
>>>>
>>>> Sent from my PDA. No type good.
>>>>
>>>> On Sep 2, 2010, at 7:35 AM, Rachel Gordon
>>>> <[email protected]> wrote:
>>>>
>>>>> Dear Manuel,
>>>>>
>>>>> Sorry, it didn't help.
>>>>>
>>>>> The cluster I am trying to run on has only the openmpi MPI version.
>>>>> So mpif77 is equivalent to mpif77.openmpi, and mpicc is equivalent
>>>>> to mpicc.openmpi.
>>>>>
>>>>> I changed the Makefile, replacing gfortran by mpif77 and gcc by
>>>>> mpicc. The compilation and linkage stage ran with no problem:
>>>>>
>>>>>     mpif77 -O -I../lib -DMAX_MEM_SIZE=16731136 -DCOMM_BUFF_SIZE=200000
>>>>>         -DMAX_CHUNK_SIZE=200000 -c -o az_tutorial_with_MPI.o az_tutorial_with_MPI.f
>>>>>     mpif77 az_tutorial_with_MPI.o -O -L../lib -laztec -o sample
>>>>>
>>>>> But again, when I try to run 'sample' I get:
>>>>>
>>>>>     mpirun -np 1 sample
>>>>>
>>>>>     [cluster:24989] *** Process received signal ***
>>>>>     [cluster:24989] Signal: Segmentation fault (11)
>>>>>     [cluster:24989] Signal code: Address not mapped (1)
>>>>>     [cluster:24989] Failing at address: 0x100000098
>>>>>     [cluster:24989] [ 0] /lib/libpthread.so.0 [0x7f5058036a80]
>>>>>     [cluster:24989] [ 1] /shared/lib/libmpi.so.0(MPI_Comm_size+0x6e) [0x7f50594ce34e]
>>>>>     [cluster:24989] [ 2] sample(parallel_info+0x24) [0x41d2ba]
>>>>>     [cluster:24989] [ 3] sample(AZ_set_proc_config+0x2d) [0x408417]
>>>>>     [cluster:24989] [ 4] sample(az_set_proc_config_+0xc) [0x407b85]
>>>>>     [cluster:24989] [ 5] sample(MAIN__+0x54) [0x407662]
>>>>>     [cluster:24989] [ 6] sample(main+0x2c) [0x44e8ec]
>>>>>     [cluster:24989] [ 7] /lib/libc.so.6(__libc_start_main+0xe6) [0x7f5057cf31a6]
>>>>>     [cluster:24989] [ 8] sample [0x407459]
>>>>>     [cluster:24989] *** End of error message ***
>>>>>     --------------------------------------------------------------------------
>>>>>     mpirun noticed that process rank 0 with PID 24989 on node cluster
>>>>>     exited on signal 11 (Segmentation fault).
>>>>>     --------------------------------------------------------------------------
>>>>>
>>>>> Thanks for your help and cooperation,
>>>>> Sincerely,
>>>>> Rachel
>>>>>
>>>>> On Wed, 1 Sep 2010, Manuel Prinz wrote:
>>>>>
>>>>>> Hi Rachel,
>>>>>>
>>>>>> I'm not very familiar with Fortran, so I'm most likely not of too
>>>>>> much help here. I added Jeff to CC; maybe he can shed some light on
>>>>>> this.
>>>>>>
>>>>>> On Monday, 09.08.2010, at 12:59 +0300, Rachel Gordon wrote:
>>>>>>> package: openmpi
>>>>>>>
>>>>>>> dpkg --search openmpi
>>>>>>> gromacs-openmpi: /usr/share/doc/gromacs-openmpi/copyright
>>>>>>> gromacs-dev: /usr/lib/libmd_mpi_openmpi.la
>>>>>>> gromacs-dev: /usr/lib/libgmx_mpi_d_openmpi.la
>>>>>>> gromacs-openmpi: /usr/share/lintian/overrides/gromacs-openmpi
>>>>>>> gromacs-openmpi: /usr/lib/libmd_mpi_openmpi.so.5
>>>>>>> gromacs-openmpi: /usr/lib/libmd_mpi_d_openmpi.so.5.0.0
>>>>>>> gromacs-dev: /usr/lib/libmd_mpi_openmpi.so
>>>>>>> gromacs-dev: /usr/lib/libgmx_mpi_d_openmpi.so
>>>>>>> gromacs-openmpi: /usr/lib/libmd_mpi_openmpi.so.5.0.0
>>>>>>> gromacs-openmpi: /usr/bin/mdrun_mpi_d.openmpi
>>>>>>> gromacs-openmpi: /usr/lib/libgmx_mpi_d_openmpi.so.5.0.0
>>>>>>> gromacs-openmpi: /usr/share/doc/gromacs-openmpi/README.Debian
>>>>>>> gromacs-dev: /usr/lib/libgmx_mpi_d_openmpi.a
>>>>>>> gromacs-openmpi: /usr/bin/mdrun_mpi.openmpi
>>>>>>> gromacs-openmpi: /usr/share/doc/gromacs-openmpi/changelog.Debian.gz
>>>>>>> gromacs-dev: /usr/lib/libmd_mpi_d_openmpi.la
>>>>>>> gromacs-openmpi: /usr/share/man/man1/mdrun_mpi_d.openmpi.1.gz
>>>>>>> gromacs-dev: /usr/lib/libgmx_mpi_openmpi.a
>>>>>>> gromacs-openmpi: /usr/lib/libgmx_mpi_openmpi.so.5.0.0
>>>>>>> gromacs-dev: /usr/lib/libmd_mpi_d_openmpi.so
>>>>>>> gromacs-openmpi: /usr/lib/libmd_mpi_d_openmpi.so.5
>>>>>>> gromacs-dev: /usr/lib/libgmx_mpi_openmpi.la
>>>>>>> gromacs-openmpi: /usr/share/man/man1/mdrun_mpi.openmpi.1.gz
>>>>>>> gromacs-openmpi: /usr/share/doc/gromacs-openmpi
>>>>>>> gromacs-dev: /usr/lib/libmd_mpi_openmpi.a
>>>>>>> gromacs-dev: /usr/lib/libgmx_mpi_openmpi.so
>>>>>>> gromacs-openmpi: /usr/lib/libgmx_mpi_openmpi.so.5
>>>>>>> gromacs-openmpi: /usr/lib/libgmx_mpi_d_openmpi.so.5
>>>>>>> gromacs-dev: /usr/lib/libmd_mpi_d_openmpi.a
>>>>>>>
>>>>>>> Dear support,
>>>>>>>
>>>>>>> I am trying to run a test case of the AZTEC library named
>>>>>>> az_tutorial_with_MPI.f. The example uses gfortran + MPI. The
>>>>>>> compilation and linkage stage goes O.K., generating an executable
>>>>>>> 'sample'. But when I try to run sample (on 1 or more processors)
>>>>>>> the run crashes immediately.
>>>>>>>
>>>>>>> The compilation and linkage stage is done as follows:
>>>>>>>
>>>>>>>     gfortran -O -I/shared/include -I/shared/include/openmpi/ompi/mpi/cxx
>>>>>>>         -I../lib -DMAX_MEM_SIZE=16731136 -DCOMM_BUFF_SIZE=200000
>>>>>>>         -DMAX_CHUNK_SIZE=200000 -c -o az_tutorial_with_MPI.o az_tutorial_with_MPI.f
>>>>>>>     gfortran az_tutorial_with_MPI.o -O -L../lib -laztec -lm -L/shared/lib
>>>>>>>         -lgfortran -lmpi -lmpi_f77 -o sample
>>>>>>
>>>>>> Generally, when compiling programs for use with MPI, you should use
>>>>>> the compiler wrappers, which do all the magic. In Debian's case these
>>>>>> are mpif77.openmpi and mpif90.openmpi, respectively. Could you give
>>>>>> that a try?
>>>>>>
>>>>>>> The run:
>>>>>>>
>>>>>>>     /shared/home/gordon/Aztec_lib.dir/app> mpirun -np 1 sample
>>>>>>>
>>>>>>>     [cluster:12046] *** Process received signal ***
>>>>>>>     [cluster:12046] Signal: Segmentation fault (11)
>>>>>>>     [cluster:12046] Signal code: Address not mapped (1)
>>>>>>>     [cluster:12046] Failing at address: 0x100000098
>>>>>>>     [cluster:12046] [ 0] /lib/libc.so.6 [0x7fd4a2fa8f60]
>>>>>>>     [cluster:12046] [ 1] /shared/lib/libmpi.so.0(MPI_Comm_size+0x6e) [0x7fd4a376c34e]
>>>>>>>     [cluster:12046] [ 2] sample [0x4178aa]
>>>>>>>     [cluster:12046] [ 3] sample [0x402a07]
>>>>>>>     [cluster:12046] [ 4] sample [0x402175]
>>>>>>>     [cluster:12046] [ 5] sample [0x401c52]
>>>>>>>     [cluster:12046] [ 6] sample [0x448edc]
>>>>>>>     [cluster:12046] [ 7] /lib/libc.so.6(__libc_start_main+0xe6) [0x7fd4a2f951a6]
>>>>>>>     [cluster:12046] [ 8] sample [0x401a49]
>>>>>>>     [cluster:12046] *** End of error message ***
>>>>>>>     --------------------------------------------------------------------------
>>>>>>>     mpirun noticed that process rank 0 with PID 12046 on node cluster
>>>>>>>     exited on signal 11 (Segmentation fault).
>>>>>>>
>>>>>>> Here is some information about the machine:
>>>>>>>
>>>>>>>     uname -a
>>>>>>>     Linux cluster 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010
>>>>>>>     x86_64 GNU/Linux
>>>>>>>
>>>>>>>     lsb_release -a
>>>>>>>     No LSB modules are available.
>>>>>>>     Distributor ID: Debian
>>>>>>>     Description:    Debian GNU/Linux 5.0.5 (lenny)
>>>>>>>     Release:        5.0.5
>>>>>>>     Codename:       lenny
>>>>>>>
>>>>>>>     gcc --version
>>>>>>>     gcc (Debian 4.3.2-1.1) 4.3.2
>>>>>>>
>>>>>>>     gfortran --version
>>>>>>>     GNU Fortran (Debian 4.3.2-1.1) 4.3.2
>>>>>>>
>>>>>>>     ldd sample
>>>>>>>     linux-vdso.so.1 => (0x00007fffffffe000)
>>>>>>>     libgfortran.so.3 => /usr/lib/libgfortran.so.3 (0x00007fd29db16000)
>>>>>>>     libm.so.6 => /lib/libm.so.6 (0x00007fd29d893000)
>>>>>>>     libmpi.so.0 => /shared/lib/libmpi.so.0 (0x00007fd29d5e7000)
>>>>>>>     libmpi_f77.so.0 => /shared/lib/libmpi_f77.so.0 (0x00007fd29d3af000)
>>>>>>>     libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007fd29d198000)
>>>>>>>     libc.so.6 => /lib/libc.so.6 (0x00007fd29ce45000)
>>>>>>>     libopen-rte.so.0 => /shared/lib/libopen-rte.so.0 (0x00007fd29cbf8000)
>>>>>>>     libopen-pal.so.0 => /shared/lib/libopen-pal.so.0 (0x00007fd29c9a2000)
>>>>>>>     libdl.so.2 => /lib/libdl.so.2 (0x00007fd29c79e000)
>>>>>>>     libnsl.so.1 => /lib/libnsl.so.1 (0x00007fd29c586000)
>>>>>>>     libutil.so.1 => /lib/libutil.so.1 (0x00007fd29c383000)
>>>>>>>     libpthread.so.0 => /lib/libpthread.so.0 (0x00007fd29c167000)
>>>>>>>     /lib64/ld-linux-x86-64.so.2 (0x00007fd29ddf1000)
>>>>>>>
>>>>>>> Let me just mention that the C+MPI test case of the AZTEC library,
>>>>>>> 'az_tutorial.c', runs with no problem.
>>>>>>> Also, az_tutorial_with_MPI.f runs O.K. on my 32-bit Linux cluster
>>>>>>> running gcc, g77 and MPICH, and on my 16-processor SGI Itanium
>>>>>>> 64-bit machine.
>>>>>>
>>>>>> The IA64 architecture is supported by Open MPI, so this should be OK.
>>>>>>
>>>>>>> Thank you for your help,
>>>>>>
>>>>>> Best regards,
>>>>>> Manuel

-- 
Jeff Squyres
[email protected]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

-- 
To UNSUBSCRIBE, email to [email protected]
with a subject of "unsubscribe". Trouble? Contact [email protected]

