On Sep 2, 2010, at 9:28 AM, Rachel Gordon wrote:

> Concerning 1. : I just ran the simple MPI fortran program hello.f which uses:
>      call MPI_INIT(ierror)
>      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
>      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
> 
> The program ran with no problem.

Did you print the rank and size, and did they print appropriate values?
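
For reference, a minimal self-contained test along these lines (just a sketch, not necessarily your hello.f) would show whether both values come back sane:

      program hello
      implicit none
      include 'mpif.h'
      integer ierror, rank, size
      call MPI_INIT(ierror)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
      print *, 'rank', rank, 'of', size
      call MPI_FINALIZE(ierror)
      end

If size doesn't match the number of processes you launched with mpirun, that's already a hint that the wrong mpif.h / MPI library is being picked up.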

> Some more information:
> 
> The AZTEC test case I am trying to run is running with no problem on my old 
> PC cluster (Redhat operating system) using gcc, g77 and:
> LIB_LINUX       = /usr/lib/gcc-lib/i386-redhat-linux/2.96/libg2c.a
> 
> Concerning 2. Can you instruct me how to perform the check?

Were Aztec and your test program compiled with -g?  If so, you should be able to:

gdb name_of_your_test_program name_of_corefile

That is, if you run the aztec test program and it dumps a corefile, then load 
up that corefile in gdb.  This will allow gdb to present a snapshot of 
information from that process when it died.  You can "bt" at the gdb prompt to 
get a full back trace of exactly where it died.  Hopefully, it'll include file 
and line numbers in the Aztec source (if Aztec was compiled and linked with 
-g).  Then you can check the Aztec source to see if there's some problem with 
how they're calling MPI_COMM_SIZE.  You should also be able to print variable
values and see if they're passing in some obviously-bogus value to 
MPI_COMM_SIZE (e.g., "p variable_name" shows the value of that variable name).  
Be aware that Fortran passes all parameters by reference, so they show up as 
pointers in C.  So "p variable_name" may show a pointer value.  Therefore, "p 
*variable_name" will show the dereferenced value, which is likely what you want.

However, just by looking at the abbreviated stack trace that OMPI output, it 
*looks* like they cross the Fortran->C barrier at az_set_proc_config_() -- 
meaning that you 

call az_set_proc_config()

in Fortran, and the Fortran version of that function turns around and calls a C
function AZ_set_proc_config().  This function then calls another C function, 
parallel_info(), which then calls the C binding for MPI_COMM_SIZE.  So I don't 
think you need to worry about Fortran variables down where Aztec is calling 
MPI_COMM_SIZE.  

Just to make sure -- was Aztec compiled with Open MPI?  (not mpich or some 
other MPI)
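
One quick way to rule that out (a suggestion only -- I don't know Aztec's build system) is to rebuild libaztec itself with the Open MPI wrapper compilers, e.g.:

  make clean
  make CC=mpicc F77=mpif77

(or whatever the equivalent knobs are in Aztec's Makefile), so that both the library and your test program are guaranteed to see the same mpi.h and libmpi.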

Beyond this, I'm not much help because I don't know anything about Aztec.  Have 
you pinged the Aztec authors/maintainers to see if there are any known problems 
with Aztec and Open MPI on Debian?

FWIW, I haven't used g77 in years.  Is there any reason you're not using 
gfortran?  I was under the impression (perhaps mistakenly) that gfortran is 
the next generation after g77 and that g77 is long since dead...?


> Rachel
> 
> 
> 
> 
> On Thu, 2 Sep 2010, Jeff Squyres wrote:
> 
>> I'm afraid I have no insight into Aztec itself; I don't know anything about 
>> it.  Two questions:
>> 
>> 1. Can you run simple MPI fortran programs that call MPI_Comm_size with 
>> MPI_COMM_WORLD?
>> 
>> 2. Can you get any more information than the stack trace?  I.e., can you gdb 
>> a corefile to see exactly where in Aztec it's failing and confirm that it's 
>> not actually a bug in Aztec? I'm not trying to finger point, but if 
>> something is failing right away in the beginning with a call to 
>> MPI_COMM_SIZE, it's *usually* an application error of some sort (we haven't 
>> even gotten to anything complicated yet like MPI_SEND, etc.).  For example:
>> 
>> - The fact that it got through the parameter error checking in 
>> MPI_COMM_SIZE is a good thing, but it doesn't necessarily mean that the 
>> communicator it passed was valid (for example).
>> - Did they leave off the ierr argument?  (unlikely, but always possible)
>> 
>> 
>> 
>> On Sep 2, 2010, at 8:06 AM, Rachel Gordon wrote:
>> 
>>> Dear Jeff,
>>> 
>>> The cluster has only the openmpi version of MPI and the mpi.h file is 
>>> installed in /shared/include/mpi.h
>>> 
>>> Anyhow, I omitted the COMM size parameter and recompiled/linked the case 
>>> using:
>>> 
>>> mpif77 -O   -I../lib  -c -o az_tutorial_with_MPI.o az_tutorial_with_MPI.f
>>> mpif77 az_tutorial_with_MPI.o -O -L../lib -laztec      -o sample
>>> 
>>> But when I try running 'sample' I get the same:
>>> 
>>> [cluster:00377] *** Process received signal ***
>>> [cluster:00377] Signal: Segmentation fault (11)
>>> [cluster:00377] Signal code: Address not mapped (1)
>>> [cluster:00377] Failing at address: 0x100000098
>>> [cluster:00377] [ 0] /lib/libpthread.so.0 [0x7f6b55040a80]
>>> [cluster:00377] [ 1] /shared/lib/libmpi.so.0(MPI_Comm_size+0x6e) 
>>> [0x7f6b564d834e]
>>> [cluster:00377] [ 2] sample(parallel_info+0x24) [0x41d2ba]
>>> [cluster:00377] [ 3] sample(AZ_set_proc_config+0x2d) [0x408417]
>>> [cluster:00377] [ 4] sample(az_set_proc_config_+0xc) [0x407b85]
>>> [cluster:00377] [ 5] sample(MAIN__+0x54) [0x407662]
>>> [cluster:00377] [ 6] sample(main+0x2c) [0x44e8ec]
>>> [cluster:00377] [ 7] /lib/libc.so.6(__libc_start_main+0xe6) [0x7f6b54cfd1a6]
>>> [cluster:00377] [ 8] sample [0x407459]
>>> [cluster:00377] *** End of error message ***
>>> --------------------------------------------------------------------------
>>> 
>>> Rachel
>>> 
>>> 
>>> 
>>> On Thu, 2 Sep 2010, Jeff Squyres (jsquyres) wrote:
>>> 
>>>> If you're segv'ing in comm size, this usually means you are using the 
>>>> wrong mpi.h.  Ensure you are using ompi's mpi.h so that you get the right 
>>>> values for all the MPI constants.
>>>> 
>>>> Sent from my PDA. No type good.
>>>> 
>>>> On Sep 2, 2010, at 7:35 AM, Rachel Gordon 
>>>> <[email protected]> wrote:
>>>> 
>>>>> Dear Manuel,
>>>>> 
>>>>> Sorry, it didn't help.
>>>>> 
>>>>> The cluster I am trying to run on has only the openmpi MPI version. So, 
>>>>> mpif77 is equivalent to mpif77.openmpi and mpicc is equivalent to 
>>>>> mpicc.openmpi
>>>>> 
>>>>> I changed the Makefile, replacing gfortran by mpif77 and gcc by mpicc.
>>>>> The compilation and linkage stage ran with no problem:
>>>>> 
>>>>> 
>>>>> mpif77 -O   -I../lib -DMAX_MEM_SIZE=16731136 -DCOMM_BUFF_SIZE=200000 
>>>>> -DMAX_CHUNK_SIZE=200000  -c -o az_tutorial_with_MPI.o 
>>>>> az_tutorial_with_MPI.f
>>>>> mpif77 az_tutorial_with_MPI.o -O -L../lib -laztec      -o sample
>>>>> 
>>>>> 
>>>>> But again when I try to run 'sample' I get:
>>>>> 
>>>>> mpirun -np 1 sample
>>>>> 
>>>>> 
>>>>> [cluster:24989] *** Process received signal ***
>>>>> [cluster:24989] Signal: Segmentation fault (11)
>>>>> [cluster:24989] Signal code: Address not mapped (1)
>>>>> [cluster:24989] Failing at address: 0x100000098
>>>>> [cluster:24989] [ 0] /lib/libpthread.so.0 [0x7f5058036a80]
>>>>> [cluster:24989] [ 1] /shared/lib/libmpi.so.0(MPI_Comm_size+0x6e) 
>>>>> [0x7f50594ce34e]
>>>>> [cluster:24989] [ 2] sample(parallel_info+0x24) [0x41d2ba]
>>>>> [cluster:24989] [ 3] sample(AZ_set_proc_config+0x2d) [0x408417]
>>>>> [cluster:24989] [ 4] sample(az_set_proc_config_+0xc) [0x407b85]
>>>>> [cluster:24989] [ 5] sample(MAIN__+0x54) [0x407662]
>>>>> [cluster:24989] [ 6] sample(main+0x2c) [0x44e8ec]
>>>>> [cluster:24989] [ 7] /lib/libc.so.6(__libc_start_main+0xe6) 
>>>>> [0x7f5057cf31a6]
>>>>> [cluster:24989] [ 8] sample [0x407459]
>>>>> [cluster:24989] *** End of error message ***
>>>>> --------------------------------------------------------------------------
>>>>> mpirun noticed that process rank 0 with PID 24989 on node cluster exited 
>>>>> on signal 11 (Segmentation fault).
>>>>> --------------------------------------------------------------------------
>>>>> 
>>>>> Thanks for your help and cooperation,
>>>>> Sincerely,
>>>>> Rachel
>>>>> 
>>>>> 
>>>>> 
>>>>> On Wed, 1 Sep 2010, Manuel Prinz wrote:
>>>>> 
>>>>>> Hi Rachel,
>>>>>> 
>>>>>> I'm not very familiar with Fortran, so I'm most likely not of too much
>>>>>> help here. I added Jeff to CC; maybe he can shed some light on this.
>>>>>> 
>>>>>> On Monday, 09.08.2010, at 12:59 +0300, Rachel Gordon wrote:
>>>>>>> package:  openmpi
>>>>>>> 
>>>>>>> dpkg --search openmpi
>>>>>>> gromacs-openmpi: /usr/share/doc/gromacs-openmpi/copyright
>>>>>>> gromacs-dev: /usr/lib/libmd_mpi_openmpi.la
>>>>>>> gromacs-dev: /usr/lib/libgmx_mpi_d_openmpi.la
>>>>>>> gromacs-openmpi: /usr/share/lintian/overrides/gromacs-openmpi
>>>>>>> gromacs-openmpi: /usr/lib/libmd_mpi_openmpi.so.5
>>>>>>> gromacs-openmpi: /usr/lib/libmd_mpi_d_openmpi.so.5.0.0
>>>>>>> gromacs-dev: /usr/lib/libmd_mpi_openmpi.so
>>>>>>> gromacs-dev: /usr/lib/libgmx_mpi_d_openmpi.so
>>>>>>> gromacs-openmpi: /usr/lib/libmd_mpi_openmpi.so.5.0.0
>>>>>>> gromacs-openmpi: /usr/bin/mdrun_mpi_d.openmpi
>>>>>>> gromacs-openmpi: /usr/lib/libgmx_mpi_d_openmpi.so.5.0.0
>>>>>>> gromacs-openmpi: /usr/share/doc/gromacs-openmpi/README.Debian
>>>>>>> gromacs-dev: /usr/lib/libgmx_mpi_d_openmpi.a
>>>>>>> gromacs-openmpi: /usr/bin/mdrun_mpi.openmpi
>>>>>>> gromacs-openmpi: /usr/share/doc/gromacs-openmpi/changelog.Debian.gz
>>>>>>> gromacs-dev: /usr/lib/libmd_mpi_d_openmpi.la
>>>>>>> gromacs-openmpi: /usr/share/man/man1/mdrun_mpi_d.openmpi.1.gz
>>>>>>> gromacs-dev: /usr/lib/libgmx_mpi_openmpi.a
>>>>>>> gromacs-openmpi: /usr/lib/libgmx_mpi_openmpi.so.5.0.0
>>>>>>> gromacs-dev: /usr/lib/libmd_mpi_d_openmpi.so
>>>>>>> gromacs-openmpi: /usr/lib/libmd_mpi_d_openmpi.so.5
>>>>>>> gromacs-dev: /usr/lib/libgmx_mpi_openmpi.la
>>>>>>> gromacs-openmpi: /usr/share/man/man1/mdrun_mpi.openmpi.1.gz
>>>>>>> gromacs-openmpi: /usr/share/doc/gromacs-openmpi
>>>>>>> gromacs-dev: /usr/lib/libmd_mpi_openmpi.a
>>>>>>> gromacs-dev: /usr/lib/libgmx_mpi_openmpi.so
>>>>>>> gromacs-openmpi: /usr/lib/libgmx_mpi_openmpi.so.5
>>>>>>> gromacs-openmpi: /usr/lib/libgmx_mpi_d_openmpi.so.5
>>>>>>> gromacs-dev: /usr/lib/libmd_mpi_d_openmpi.a
>>>>>>> 
>>>>>>> 
>>>>>>> Dear support,
>>>>>>> I am trying to run a test case of AZTEC library named
>>>>>>> az_tutorial_with_MPI.f . The example uses gfortran + MPI. The
>>>>>>> compilation and linkage stage goes O.K., generating an executable
>>>>>>> 'sample'. But when I try to run sample (on 1 or more
>>>>>>> processors) the run crashes immediately.
>>>>>>> 
>>>>>>> The compilation and linkage stage is done as follows:
>>>>>>> 
>>>>>>> gfortran -O  -I/shared/include -I/shared/include/openmpi/ompi/mpi/cxx
>>>>>>> -I../lib -DMAX_MEM_SIZE=16731136
>>>>>>> -DCOMM_BUFF_SIZE=200000 -DMAX_CHUNK_SIZE=200000  -c -o
>>>>>>> az_tutorial_with_MPI.o az_tutorial_with_MPI.f
>>>>>>> gfortran az_tutorial_with_MPI.o -O -L../lib -laztec  -lm -L/shared/lib
>>>>>>> -lgfortran -lmpi -lmpi_f77 -o sample
>>>>>> 
>>>>>> Generally, when compiling programs for use with MPI, you should use the
>>>>>> compiler wrappers which do all the magic. In Debian's case this is
>>>>>> mpif77.openmpi and mpif90.openmpi, respectively. Could you give that a
>>>>>> try?
>>>>>> 
>>>>>>> The run:
>>>>>>> /shared/home/gordon/Aztec_lib.dir/app>mpirun -np 1 sample
>>>>>>> 
>>>>>>> [cluster:12046] *** Process received signal ***
>>>>>>> [cluster:12046] Signal: Segmentation fault (11)
>>>>>>> [cluster:12046] Signal code: Address not mapped (1)
>>>>>>> [cluster:12046] Failing at address: 0x100000098
>>>>>>> [cluster:12046] [ 0] /lib/libc.so.6 [0x7fd4a2fa8f60]
>>>>>>> [cluster:12046] [ 1] /shared/lib/libmpi.so.0(MPI_Comm_size+0x6e)
>>>>>>> [0x7fd4a376c34e]
>>>>>>> [cluster:12046] [ 2] sample [0x4178aa]
>>>>>>> [cluster:12046] [ 3] sample [0x402a07]
>>>>>>> [cluster:12046] [ 4] sample [0x402175]
>>>>>>> [cluster:12046] [ 5] sample [0x401c52]
>>>>>>> [cluster:12046] [ 6] sample [0x448edc]
>>>>>>> [cluster:12046] [ 7] /lib/libc.so.6(__libc_start_main+0xe6)
>>>>>>> [0x7fd4a2f951a6]
>>>>>>> [cluster:12046] [ 8] sample [0x401a49]
>>>>>>> [cluster:12046] *** End of error message ***
>>>>>>> --------------------------------------------------------------------------
>>>>>>> mpirun noticed that process rank 0 with PID 12046 on node cluster exited
>>>>>>> on signal 11 (Segmentation fault).
>>>>>>> 
>>>>>>> Here is some information about the machine:
>>>>>>> 
>>>>>>> uname -a
>>>>>>> Linux cluster 2.6.26-2-amd64 #1 SMP Sun Jun 20 20:16:30 UTC 2010 x86_64
>>>>>>> GNU/Linux
>>>>>>> 
>>>>>>> 
>>>>>>> lsb_release -a
>>>>>>> No LSB modules are available.
>>>>>>> Distributor ID: Debian
>>>>>>> Description:    Debian GNU/Linux 5.0.5 (lenny)
>>>>>>> Release:        5.0.5
>>>>>>> Codename:       lenny
>>>>>>> 
>>>>>>> gcc --version
>>>>>>> gcc (Debian 4.3.2-1.1) 4.3.2
>>>>>>> 
>>>>>>> gfortran --version
>>>>>>> GNU Fortran (Debian 4.3.2-1.1) 4.3.2
>>>>>>> 
>>>>>>> ldd sample
>>>>>>>       linux-vdso.so.1 =>  (0x00007fffffffe000)
>>>>>>>       libgfortran.so.3 => /usr/lib/libgfortran.so.3 (0x00007fd29db16000)
>>>>>>>       libm.so.6 => /lib/libm.so.6 (0x00007fd29d893000)
>>>>>>>       libmpi.so.0 => /shared/lib/libmpi.so.0 (0x00007fd29d5e7000)
>>>>>>>       libmpi_f77.so.0 => /shared/lib/libmpi_f77.so.0
>>>>>>> (0x00007fd29d3af000)
>>>>>>>       libgcc_s.so.1 => /lib/libgcc_s.so.1 (0x00007fd29d198000)
>>>>>>>       libc.so.6 => /lib/libc.so.6 (0x00007fd29ce45000)
>>>>>>>       libopen-rte.so.0 => /shared/lib/libopen-rte.so.0
>>>>>>> (0x00007fd29cbf8000)
>>>>>>>       libopen-pal.so.0 => /shared/lib/libopen-pal.so.0
>>>>>>> (0x00007fd29c9a2000)
>>>>>>>       libdl.so.2 => /lib/libdl.so.2 (0x00007fd29c79e000)
>>>>>>>       libnsl.so.1 => /lib/libnsl.so.1 (0x00007fd29c586000)
>>>>>>>       libutil.so.1 => /lib/libutil.so.1 (0x00007fd29c383000)
>>>>>>>       libpthread.so.0 => /lib/libpthread.so.0 (0x00007fd29c167000)
>>>>>>>       /lib64/ld-linux-x86-64.so.2 (0x00007fd29ddf1000)
>>>>>>> 
>>>>>>> 
>>>>>>> Let me just mention that the C+MPI test case of the AZTEC library
>>>>>>> 'az_tutorial.c' runs with no problem.
>>>>>>> Also, az_tutorial_with_MPI.f runs O.K. on my 32-bit Linux cluster running
>>>>>>> gcc, g77 and MPICH, and on my 16-processor SGI
>>>>>>> Itanium 64-bit machine.
>>>>>> 
>>>>>> The IA64 architecture is supported by Open MPI, so this should be OK.
>>>>>> 
>>>>>>> Thank you for your help,
>>>>>> 
>>>>>> Best regards,
>>>>>> Manuel
>>>>>> 
>>>>>> 
>>>>>> 
>>>> 
>> 
>> 
>> -- 
>> Jeff Squyres
>> [email protected]
>> For corporate legal information go to:
>> http://www.cisco.com/web/about/doing_business/legal/cri/
>> 
>> 


-- 
Jeff Squyres
[email protected]
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/



