Re: [OMPI users] OpenMPI 1.2.8 on Solaris: configure problems

2008-10-17 Thread Ethan Mallove
On Fri, Oct/17/2008 05:53:07PM, Paul Kapinos wrote:
> Hi guys,
>
> did you test OpenMPI 1.2.8 on Solaris at all?!

We built 1.2.8 on Solaris successfully a few days ago:

  http://www.open-mpi.org/mtt/index.php?do_redir=869

But due to hardware/software/man-hour resource limitations,
there are often combinations of configure options, mpirun
options, etc. that end up going untested. E.g., I see you're
using some configure options we haven't tried:

 * --enable-ltdl-convenience 
 * --no-create 
 * --no-recursion
 * GCC on Solaris 


> We tried to compile OpenMPI 1.2.8 on Solaris on Sparc and on Opteron for 
> both GCC and SUN Studio compiler, in 32bit and 64bit versions, 
> 2*2*2=8 versions in all, in the very same manner we have installed the 
> 1.2.5 and 1.2.6 versions.
>
>
> The configure step runs through, but when "gmake all" is called, it seems 
> that the configure stage restarts or is resumed:
>
> ..
> orte/mca/smr/bproc/Makefile.am:47: Libtool library used but `LIBTOOL' is 
> undefined
> orte/mca/smr/bproc/Makefile.am:47:   The usual way to define `LIBTOOL' is 
> to add `AC_PROG_LIBTOOL'
> orte/mca/smr/bproc/Makefile.am:47:   to `configure.ac' and run `aclocal' 
> and `autoconf' again.
> orte/mca/smr/bproc/Makefile.am:47:   If `AC_PROG_LIBTOOL' is in 
> `configure.ac', make sure
> orte/mca/smr/bproc/Makefile.am:47:   its definition is in aclocal's search 
> path.
> test/support/Makefile.am:29: library used but `RANLIB' is undefined
> test/support/Makefile.am:29:   The usual way to define `RANLIB' is to add 
> `AC_PROG_RANLIB'
> test/support/Makefile.am:29:   to `configure.ac' and run `autoconf' again.
>
> . and breaks.

I'm confused why aclocal (or are these automake errors?) is
getting invoked in "gmake all". Did you try running
"aclocal" and "autoconf" in the top-level directory? (You
shouldn't have to do that, but it might resolve this
problem.) Make sure "ranlib" is in your PATH, mine's at
/usr/ccs/bin/ranlib.

(Also, we don't have a sys/bproc.h file on our lab machine,
so the above might be an untested scenario.)

>
> If we run "gmake all" again, we also see error messages like:
>
> *** Fortran 77 compiler
> checking for gfortran... gfortran
> checking whether we are using the GNU Fortran 77 compiler... yes
> checking whether gfortran accepts -g... yes
> checking if Fortran 77 compiler works... yes
> checking gfortran external symbol convention... ./configure: line 26340: 
> ./conftest.o: Permission denied
> ./configure: line 26342: ./conftest.o: Permission denied
> ./configure: line 26344: ./conftest.o: Permission denied
> ./configure: line 26346: ./conftest.o: Permission denied
> ./configure: line 26348: ./conftest.o: Permission denied
> configure: error: Could not determine Fortran naming convention.
>

We didn't test 1.2.8 with GCC/Solaris. Let me see if we can
reproduce this, and get back to you.

>
> Looking at the configure script, we see these lines in ./configure:
>
> if $NM conftest.o | grep foo_bar__ >/dev/null 2>&1 ; then
>   ompi_cv_f77_external_symbol="double underscore"
> elif $NM conftest.o | grep foo_bar_ >/dev/null 2>&1 ; then
> ompi_cv_f77_external_symbol="single underscore"
> elif $NM conftest.o | grep FOO_bar >/dev/null 2>&1 ; then
> ompi_cv_f77_external_symbol="mixed case"
> elif $NM conftest.o | grep foo_bar >/dev/null 2>&1 ; then
> ompi_cv_f77_external_symbol="no underscore"
> elif $NM conftest.o | grep FOO_BAR >/dev/null 2>&1 ; then
> ompi_cv_f77_external_symbol="upper case"
> else
> $NM conftest.o >conftest.out 2>&1
>
> and searching through ./configure shows us that $NM is never set 
> (neither in ./configure nor in our environment)
>

Is "nm" in your path? I have this in my config.log file:

  NM='/usr/ccs/bin/nm -p'
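
A minimal sketch of why an empty NM would be consistent with the
"Permission denied" messages quoted above: when NM expands to nothing,
the shell is left with the object file name as the command word and tries
to execute it. The file here is a plain-text stand-in, not a real object
file, and the temporary directory is an assumption made for the demo; this
has not been checked against the actual 1.2.8 configure script.

```shell
#!/bin/sh
# Work in a scratch directory so nothing in the build tree is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in for conftest.o: a regular file without execute permission.
printf 'not an executable\n' > conftest.o
chmod 644 conftest.o

# NM is empty, as reported in the original mail.
NM=

# "$NM ./conftest.o" collapses to "./conftest.o", so the shell tries to run it.
result=$($NM ./conftest.o 2>&1)
status=$?

echo "exit status: $status"
echo "message: $result"

# Clean up the scratch directory.
cd / && rm -rf "$workdir"
```

An exit status of 126 means the command was found but could not be
executed. Checking "command -v nm" before running configure, or passing
NM explicitly on the configure command line, would rule this out.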

Thanks,
Ethan


>
> So, we think that something is not OK with the ./configure script. Note 
> that we were able to install 1.2.5 and 1.2.6 some time ago on the same 
> boxes without problems.
>
> Or maybe we are doing something wrong?
>

> best regards,
> Paul Kapinos
> HPC Group RZ RWTH Aachen
>
> P.S. Folks, has anybody compiled OpenMPI 1.2.8 on any Solaris 
> successfully?
>
>
> This file contains any messages produced by compilers while
> running configure, to aid debugging if configure makes a mistake.
> 
> It was created by Open MPI configure 1.2.8, which was
> generated by GNU Autoconf 2.61.  Invocation command line was
> 
>   $ ./configure --with-devel-headers CFLAGS=-O2 -m64 CXXFLAGS=-O2 -m64 
> FFLAGS=-O2 -m64 FCFLAGS=-O2 -m64 LDFLAGS=-O2 -m64 
> --prefix=/rwthfs/rz/SW/MPI/openmpi-1.2.8/solx8664/gcc CC=gcc CXX=g++ 
> FC=gfortran --enable-ltdl-convenience --no-create --no-recursion
> 
> ## - ##
> ## Platform. ##
> ## - ##
> 
> hostname = sunoc63.rz.RWTH-Aachen.DE
> uname -m = i86pc
> uname -r = 5.10
> uname -s = SunOS
> uname -v = Generic_137112-06
> 
> /usr/bin/uname -p = i386
> /bin/uname -X = System = SunOS
> Node = 

Re: [OMPI users] The --with-sge option

2008-10-17 Thread Jeff Squyres

On Oct 16, 2008, at 12:06 PM, Mike Hanby wrote:

I’m compiling 1.2.8 on a system with SGE 6.1u4 and came across the  
“--with-sge” option on a Grid Engine posting.


A couple questions:
1.  I don’t see --with-sge mentioned in the “./configure --help"  
output, nor can I find much reference to it on the open-mpi site, is  
this option really implemented? What does it do?


Sorry -- this is an option for OMPI v1.3 and later; it doesn't exist  
in the v1.2 series.


[8:31] svbu-mpi:~/svn/ompi4 % ./configure --help |& grep sge
  --with-sge  Build SGE or Grid Engine support (default: no)


So in the v1.3 series, using --without-sge will disable OMPI from  
understanding SGE host lists, etc.


2.  After compiling openmpi providing the --with-sge switch I ran  
the ompi_info binary and grep’d for sge in the output, there isn’t  
any reference, should there be if the option was successfully passed  
to configure?


From your second mail:


I did find the following in ompi_info:

MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.7)

However I see that in an ompi_info built without using the --with-sge  
switch.


Per above, that should be ok in the 1.2 series.

Also, since I'm building 1.2.8, shouldn't those versions after  
Component reflect 1.2.8?


Yes, actually, they should...  That's somewhat concerning.

I set the PATH and LD_LIBRARY_PATH to point to the temp location of  
my new build and it still reports 1.2.7.



You might want to double check your setup.  Since OMPI uses plugins,  
it can be easy to accidentally mix versions by installing one over  
another, etc.
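
One way to look for such a mix is to list the distinct component versions
that ompi_info reports. The sketch below assumes the "Component vX.Y.Z)"
line format shown earlier in this mail; the function name
component_versions and the third sample line are hypothetical and are there
only to show what a mixed install might look like.

```shell
#!/bin/sh
# Print the distinct "Component vX.Y.Z" versions found in ompi_info-style
# text read from stdin.
component_versions() {
    sed -n 's/.*Component v\([0-9][0-9.]*\)).*/\1/p' | sort -u
}

# Sample input: the first two lines are quoted from the mail above; the
# third line is hypothetical and simulates a component from another build.
sample='MCA ras: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
MCA pls: gridengine (MCA v1.0, API v1.3, Component v1.2.7)
MCA btl: tcp (MCA v1.0, API v1.0.1, Component v1.2.8)'

versions=$(printf '%s\n' "$sample" | component_versions)
echo "$versions"

# Split the result into positional parameters and count them.
set -- $versions
if [ "$#" -gt 1 ]; then
    echo "more than one component version found: possible mixed install"
fi
```

On a real installation the same function can be fed directly, e.g.
"ompi_info | component_versions"; a single line of output suggests that all
plugins come from the same build.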


Note that the output from configure will also indicate whether it's  
going to build SGE support, as well.  Look in the stdout of configure  
and search for "gridengine".


--
Jeff Squyres
Cisco Systems




Re: [OMPI users] Debian MPI -- mpirun missing

2008-10-17 Thread Terry Frankcombe
Er, shouldn't this be in the Debian support list?  A correctly installed
OpenMPI will give you mpirun.  If their openmpi-bin package doesn't,
then surely it's broken?  (Or is there a straight openmpi package?)



On Sat, 2008-10-18 at 00:16 +0900, Raymond Wan wrote:
> Hi all,
> 
> I'm very new to MPI and am trying to install it on to a Debian Etch 
> system.  I did have mpich installed and I believe that is causing me 
> problems.  I completely uninstalled it and then ran:
> 
> update-alternatives --remove-all mpicc
> 
> Then, I installed the following packages:
> 
> libibverbs1 openmpi-bin openmpi-common openmpi-libs0 openmpi-dbg openmpi-dev
> 
> And it now says:
> 
>  >> update-alternatives --display mpicc
> mpicc - status is auto.
>  link currently points to /usr/bin/mpicc.openmpi
> /usr/bin/mpicc.openmpi - priority 40
>  slave mpif90: /usr/bin/mpif90.openmpi
>  slave mpiCC: /usr/bin/mpic++.openmpi
>  slave mpic++: /usr/bin/mpic++.openmpi
>  slave mpif77: /usr/bin/mpif77.openmpi
>  slave mpicxx: /usr/bin/mpic++.openmpi
> Current `best' version is /usr/bin/mpicc.openmpi.
> 
> which seems ok to me...  So, I tried to compile something (I had sample 
> code from a book I purchased a while back, but for mpich), however, I 
> can run the program as-is, but I think I should be running it with 
> mpirun -- the FAQ suggests there is one?  But, there is no mpirun 
> anywhere.  It's not in /usr/bin.  I updated the filename database 
> (updatedb) and tried a "locate mpirun", and I get only one hit:
> 
> /usr/include/openmpi/ompi/runtime/mpiruntime.h
> 
> Is there a package that I neglected to install?  I did an "aptitude 
> search openmpi" and installed everything listed...  :-)  Or perhaps I 
> haven't removed all trace of mpich?
> 
> Thank you in advance!
> 
> Ray
> 
> 



[OMPI users] Debian MPI -- mpirun missing

2008-10-17 Thread Raymond Wan


Hi all,

I'm very new to MPI and am trying to install it on to a Debian Etch 
system.  I did have mpich installed and I believe that is causing me 
problems.  I completely uninstalled it and then ran:


update-alternatives --remove-all mpicc

Then, I installed the following packages:

libibverbs1 openmpi-bin openmpi-common openmpi-libs0 openmpi-dbg openmpi-dev

And it now says:

>> update-alternatives --display mpicc
mpicc - status is auto.
link currently points to /usr/bin/mpicc.openmpi
/usr/bin/mpicc.openmpi - priority 40
slave mpif90: /usr/bin/mpif90.openmpi
slave mpiCC: /usr/bin/mpic++.openmpi
slave mpic++: /usr/bin/mpic++.openmpi
slave mpif77: /usr/bin/mpif77.openmpi
slave mpicxx: /usr/bin/mpic++.openmpi
Current `best' version is /usr/bin/mpicc.openmpi.

which seems ok to me...  So, I tried to compile something (I had sample 
code from a book I purchased a while back, but for mpich), however, I 
can run the program as-is, but I think I should be running it with 
mpirun -- the FAQ suggests there is one?  But, there is no mpirun 
anywhere.  It's not in /usr/bin.  I updated the filename database 
(updatedb) and tried a "locate mpirun", and I get only one hit:


/usr/include/openmpi/ompi/runtime/mpiruntime.h

Is there a package that I neglected to install?  I did an "aptitude 
search openmpi" and installed everything listed...  :-)  Or perhaps I 
haven't removed all trace of mpich?


Thank you in advance!

Ray




Re: [OMPI users] OpenMPI portability problems: debug info isn't helpful

2008-10-17 Thread Mike Hanby
Some further clarification, I read a post over on the SGE mailing list
that said the --with-sge is part of ompi 1.3, not 1.2.x.

-Original Message-
From: users-boun...@open-mpi.org [mailto:users-boun...@open-mpi.org] On
Behalf Of Aleksej Saushev
Sent: Thursday, October 16, 2008 12:39 PM
To: Open MPI Users
Subject: Re: [OMPI users] OpenMPI portability problems: debug info
isn't helpful

Jeff Squyres  writes:

> On Oct 11, 2008, at 10:20 AM, Aleksej Saushev wrote:
>
>> $ ompi_info | grep oob
>> MCA oob: tcp (MCA v1.0, API v1.0, Component v1.0)
>> MCA rml: oob (MCA v1.0, API v1.0, Component v1.2.7)
>
> Good!
>
>>> $ mpirun --mca rml_base_debug 100 -np 2 skosfile
>> [asau.local:09060] mca: base: components_open: Looking for rml
>> components
>> [asau.local:09060] mca: base: components_open: distilling rml
>> components
>> [asau.local:09060] mca: base: components_open: accepting all
>> rml  components
>> [asau.local:09060] mca: base: components_open: opening rml components
>> [asau.local:09060] mca: base: components_open: found loaded
>> component oob
>> [asau.local:09060] mca: base: components_open: component oob
>> open  function successful
>> [asau.local:09060] orte_rml_base_select: initializing rml
>> component  oob
>> [asau.local:09060] orte_rml_base_select: init returned failure
>
> Ah ha -- this is progress.  For some reason, your "oob" RML
> plugin is  declining to run.  I see that its
> query/initialization function is  actually quite short:
>
> if(mca_oob_base_init() != ORTE_SUCCESS)
> return NULL;
> *priority = 1;
> return _rml_oob_module;
>
> So it must be failing the mca_oob_base_init() function -- this
> is what  initializes the underlying "OOB" (out of band)
> communications subsystem.
>
> Of course, this doesn't fail often, so we don't have any
> run-time  switches to enable the debugging output.  :-(  Edit
> orte/mca/oob/base/ oob_base_open.c line 43 and change the value
> of mca_oob_base_output  from -1 to 0.  Let's see that output --
> I'm particularly interested in  the output from querying the tcp
> oob component.  I suspect that it's  declining to run as well.
>
> I wonder if this is going to end up being an opal_if() issue --
> where  we are traversing all the IP network interfaces from the
> kernel...   I'll bet even money that it is.

[asau.local:04648] opal_ifinit: ioctl(SIOCGIFFLAGS) failed with errno=6
[asau.local:04648] [NO-NAME] ORTE_ERROR_LOG: Not found in file
runtime/orte_init_stage1.c at line 182

--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_rml_base_select failed
  --> Returned value -13 instead of ORTE_SUCCESS


--
[asau.local:04648] [NO-NAME] ORTE_ERROR_LOG: Not found in file
runtime/orte_system_init.c at line 42
[asau.local:04648] [NO-NAME] ORTE_ERROR_LOG: Not found in file
runtime/orte_init.c at line 52

--
Open RTE was unable to initialize properly.  The error occured while
attempting to orte_init().  Returned value -13 instead of ORTE_SUCCESS.

--

Why don't you use strerror(3) to print an explanation of the errno value?

>From :
#define ENXIO   6   /* Device not configured */

It seems that I have to debug network interface probing,
how should I use *_output subroutines so that they do print?
I tried these changes but in vain:

--- opal/util/if.c.orig 2008-08-25 23:16:50.0 +0400
+++ opal/util/if.c  2008-10-15 23:55:07.0 +0400
@@ -242,6 +242,8 @@
 if(ifr->ifr_addr.sa_family != AF_INET)
 continue;

+   opal_output(0, "opal_ifinit: checking netif %s", ifr->ifr_name);
+   /* HERE IT FAILS!! */
 if(ioctl(sd, SIOCGIFFLAGS, ifr) < 0) {
 opal_output(0, "opal_ifinit: ioctl(SIOCGIFFLAGS) failed
with errno=%d", errno);
 continue;
--- opal/util/if.c.orig 2008-08-25 23:16:50.0 +0400
+++ opal/util/if.c  2008-10-15 23:55:07.0 +0400
@@ -242,6 +242,8 @@
 if(ifr->ifr_addr.sa_family != AF_INET)
 continue;

+   fprintf(stderr, "opal_ifinit: checking netif %s\n",
ifr->ifr_name);
+   /* HERE IT FAILS!! */
 if(ioctl(sd, SIOCGIFFLAGS, ifr) < 0) {
 opal_output(0, "opal_ifinit: ioctl(SIOCGIFFLAGS) failed
with errno=%d", errno);
 continue;
--- opal/util/output.c.orig 2008-08-25 23:16:50.0 +0400
+++ opal/util/output.c  2008-10-16 

[OMPI users] Problems with OpenMPI running with Rmpi

2008-10-17 Thread Simone Giannerini
Dear all,

I managed to install successfully Rmpi 0.5-5 on a quad opteron machine (8
cores overall) running on OpenSUSE 11.0 and Open MPI 1.5.2.

this is what I get

> library(Rmpi)
[gauss:24207] mca: base: component_find: unable to open osc pt2pt: file not
found (ignored)
libibverbs: Fatal: couldn't read uverbs ABI version.
--
[0,0,0]: OpenIB on host gauss was unable to find any HCAs.
Another transport will be used instead, although this may result in
lower performance.
--
--

WARNING: Failed to open "OpenIB-cma"
[DAT_PROVIDER_NOT_FOUND:DAT_NAME_NOT_REGISTERED].
This may be a real error or it may be an invalid entry in the uDAPL
Registry which is contained in the dat.conf file. Contact your local
System Administrator to confirm the availability of the interfaces in
the dat.conf file.
--
--

WARNING: Failed to open "OpenIB-cma-1"
[DAT_PROVIDER_NOT_FOUND:DAT_NAME_NOT_REGISTERED].
This may be a real error or it may be an invalid entry in the uDAPL
Registry which is contained in the dat.conf file. Contact your local
System Administrator to confirm the availability of the interfaces in
the dat.conf file.
--
--

WARNING: Failed to open "OpenIB-cma-2"
[DAT_PROVIDER_NOT_FOUND:DAT_NAME_NOT_REGISTERED].
This may be a real error or it may be an invalid entry in the uDAPL
Registry which is contained in the dat.conf file. Contact your local
System Administrator to confirm the availability of the interfaces in
the dat.conf file.
--
--

WARNING: Failed to open "OpenIB-cma-3"
[DAT_PROVIDER_NOT_FOUND:DAT_NAME_NOT_REGISTERED].
This may be a real error or it may be an invalid entry in the uDAPL
Registry which is contained in the dat.conf file. Contact your local
System Administrator to confirm the availability of the interfaces in
the dat.conf file.
--
--

WARNING: Failed to open "OpenIB-bond"
[DAT_PROVIDER_NOT_FOUND:DAT_NAME_NOT_REGISTERED].
This may be a real error or it may be an invalid entry in the uDAPL
Registry which is contained in the dat.conf file. Contact your local
System Administrator to confirm the availability of the interfaces in
the dat.conf file.
--
--

WARNING: Failed to open "ofa-v2-ib0"
[DAT_PROVIDER_NOT_FOUND:DAT_NAME_NOT_REGISTERED].
This may be a real error or it may be an invalid entry in the uDAPL
Registry which is contained in the dat.conf file. Contact your local
System Administrator to confirm the availability of the interfaces in
the dat.conf file.
--
--

WARNING: Failed to open "ofa-v2-ib1"
[DAT_PROVIDER_NOT_FOUND:DAT_NAME_NOT_REGISTERED].
This may be a real error or it may be an invalid entry in the uDAPL
Registry which is contained in the dat.conf file. Contact your local
System Administrator to confirm the availability of the interfaces in
the dat.conf file.
--
--

WARNING: Failed to open "ofa-v2-ib2"
[DAT_PROVIDER_NOT_FOUND:DAT_NAME_NOT_REGISTERED].
This may be a real error or it may be an invalid entry in the uDAPL
Registry which is contained in the dat.conf file. Contact your local
System Administrator to confirm the availability of the interfaces in
the dat.conf file.
--
--
[0,0,0]: uDAPL on host gauss was unable to find any NICs.
Another transport will be used instead, although this may result in
lower performance.
--
> mpi.spawn.Rslaves()
1 slaves are spawned successfully. 0 failed.
master (rank 0, comm 1) of size 2 is running on: gauss
slave1 (rank 1, comm 1) of size 2 is running on: gauss

as you can see, just 1 cpu per session (2 cores) is recognized and used.

and this is the content of my etc/dat.conf

OpenIB-cma u1.2 nonthreadsafe default libdaplcma.so.1 

[OMPI users] OPAL_PREFIX is not passed to remote node in pls_rsh_module.c

2008-10-17 Thread Teng Lin

Hi All,

We have bundled Open MPI with our product and shipped it to the  
customer. According to http://www.open-mpi.org/faq/?category=building#installdirs 
,


Below is the command we used to launch the MPI program:
env OPAL_PREFIX=/path/to/openmpi \
/path/to/openmpi/bin//orterun --prefix /path/to/openmpi -x PATH -x  
LD_LIBRARY_PATH -x OPAL_PREFIX -np 2 --host host1,host2 ring_c


The interesting fact is that it always works on csh/tcsh. But quite a  
few users told us that they run into the errors below:


[compute-28-1.local:11174] [NO-NAME] ORTE_ERROR_LOG: Not found in file
runtime/orte_init_stage1.c at line 182

--
Sorry!  You were supposed to get help about:
  orte_init:startup:internal-failure
from the file:
  help-orte-runtime
But I couldn't find any file matching that name.  Sorry!

--
[compute-28-1.local:11174] [NO-NAME] ORTE_ERROR_LOG: Not found in file
runtime/orte_system_init.c at line 42
[compute-28-1.local:11174] [NO-NAME] ORTE_ERROR_LOG: Not found in file
runtime/orte_init.c at line 52

--
Sorry!  You were supposed to get help about:
  orted:init-failure
from the file:
  help-orted.txt
But I couldn't find any file matching that name.  Sorry!


Jeff did mention in http://www.open-mpi.org/community/lists/users/2008/09/6582.php 
 that OPAL_PREFIX was propagated for him automatically. I bet Jeff  
uses csh/tcsh.

Anyway, it can be traced back to how the daemon is launched.

sh/bash:

[x:25369] pls:rsh: executing: (//usr/bin/ssh) /usr/bin/ssh x
OPAL_PREFIX=/opt/openmpi-1.2.4 ;
PATH=/opt/openmpi-1.2.4/bin:$PATH
; export PATH ;
LD_LIBRARY_PATH=/opt/openmpi-1.2.4/lib:$LD_LIBRARY_PATH ; export  
LD_LIBRARY_PATH ;


csh/tcsh:
[x:09886] pls:rsh: executing: (//usr/bin/ssh) /usr/bin/ssh x
setenv OPAL_PREFIX /opt/openmpi-1.2.4 ;


It seems to work after I patched pls_rsh_module.c


--- pls_rsh_module.c.orig   2008-10-16 17:15:32.0 -0400
+++ pls_rsh_module.c        2008-10-16 17:15:51.0 -0400
@@ -989,7 +989,7 @@
  "%s/%s/%s",
  (opal_prefix != NULL ? "OPAL_PREFIX=" : ""),
  (opal_prefix != NULL ? opal_prefix : ""),
- (opal_prefix != NULL ? " ;" : ""),
+ (opal_prefix != NULL ? " ; export OPAL_PREFIX ; " : ""),
  prefix_dir, bin_base,
  prefix_dir, lib_base,
  prefix_dir, bin_base,

Another workaround is to add
export OPAL_PREFIX
into $HOME/.bashrc.

Jeff, is this a bug in the code? Or is there a reason that  
OPAL_PREFIX is not exported for sh/bash?
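
To illustrate the difference in isolation: in sh/bash a plain assignment
creates a shell variable that child processes do not inherit, while csh's
setenv always puts the variable into the environment. A minimal sketch,
using a hypothetical variable name DEMO_PREFIX in place of OPAL_PREFIX so
that a real installation is not disturbed:

```shell
#!/bin/sh
# Start from a clean state so an inherited value cannot mask the effect.
unset DEMO_PREFIX

# Assignment only: DEMO_PREFIX is a shell variable, not an environment variable.
DEMO_PREFIX=/opt/openmpi-1.2.4
without_export=$(sh -c 'echo "${DEMO_PREFIX:-unset}"')

# After export, child processes inherit the value.
export DEMO_PREFIX
with_export=$(sh -c 'echo "${DEMO_PREFIX:-unset}"')

echo "without export: $without_export"
echo "with export:    $with_export"
```

The first child reports "unset" and the second reports the path, which
matches the behaviour seen with the sh/bash command line above, where
OPAL_PREFIX is assigned but never exported before the daemon is started.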


Teng


Re: [OMPI users] OPENMPI 1.2.7 & PGI compilers: configure option --disable-ptmalloc2-opt-sbrk

2008-10-17 Thread Francesco Iannone
Hi Jeff

Sorry to disturb you

I am sending you the stack frame captured with TotalView.

The example program "callocrash" crashes with a segmentation violation in
the sYSMALLOc function:

set_head(remainder, remainder_size | PREV_INUSE);


The Stack frame is

Function "sYSMALLOc":
  nb:0x00025216d050 (9967161424)
  av:0x2a95c1ef00 (_arena) -> (struct
malloc_state)
Local variables:
  old_top:   0x0b8bc110 -> (struct malloc_chunk)
  old_size:  0x00020ef0 (134896)
  old_end:   0x0b8dd000 -> ""
  size:  0x00025218def0 (9967296240)
  correction:0x (0)
  brk:   0x0b8dd000 -> ""
  snd_brk:   0x -> 
  front_misalign:0x (0)
  end_misalign:  0x0b8dd000 (193843200)
  aligned_brk:   0x00507000 -> ""
  p: 0x0b8bc110 -> (struct malloc_chunk)
  remainder: 0x25da29160 -> 
(struct malloc_chunk)
  remainder_size:0x00020ea0 (134816)
  sum:   0x3828b000 (942190592)
  pagemask:  0x0fff (4095)


On 16/10/08 14:05, "Francesco Iannone" 
wrote:

> Hi Jeff
> I used the configure option:
> 
> --enable-ptmalloc2-opt-sbrk
> 
> To solve a segmentation fault in memory allocation with openmpi.1.2.x and
> PGI 7.1-4 and 7.2.
> 
> I have a simple source code (Callocrash.c) as example of this (see below).
> 
> Could you test this code on a node with 8 Gbyte of RAM and RedHat enterprise
> 4+ openmpi 1.2.x, PGI 7.1-4.
> 
> I compiled it with:
> 
>  pgcc -o Callocrash Callocrash.c   (it's ok)
>  gnu4 -o Callocrash Callocrash.c   (it's ok)
>  mpicc -o Callocrash Callocrash.c   (Segmentation fault in sYSMALLOc when
> it has to allocate 622947588 elements of 16 bytes)
> 
> However thanks in advance
> 
> greetings
> 
> 
> Callocrash.c
> 
> 
> #include <stdio.h>
> #include <stdlib.h>
> 
> int main(int argc, char *argv[])
> {
>     /*
>      *  memory allocations simulation for ~50M nonzeros:
>      *  nd=180 md=350 mdy=420
>      *
>      *  if this program crashes, there is a compiler problem
>      */
>     printf("memory allocations simulation for ~50M nonzeros:  nd=180 md=350 mdy=420\n");
>     printf("if this program crashes, check your compiler/environment configuration\n");
> 
>     /* sizeof yields a size_t, so it is printed with %zu rather than %d */
>     printf("sizeof(int)    %zu\n", sizeof(int));
>     printf("sizeof(int*)   %zu\n", sizeof(int*));
>     printf("sizeof(size_t) %zu\n", sizeof(size_t));
> 
>     if (sizeof(size_t) < 8 || sizeof(int*) < 8)
>     {
>         printf("please compile this program for a 64 bit environment!\n");
>         return -1;
>     }
> 
>     /* p is overwritten without free(), so every block stays allocated */
>     int *p;
> 
>     printf("allocation 1/4..\n");
>     p = calloc(47109185, 16);
>     if (!p) printf("..failed.\n");
>     printf("allocation 2/4..\n");
>     p = calloc(47109185, 4);
>     if (!p) printf("..failed.\n");
>     printf("allocation 3/4..\n");
>     p = calloc(47109185, 4);
>     if (!p) printf("..failed.\n");
>     printf("allocation 4/4..\n");
> 
>     p = calloc(622947588, 16);
>     if (!p) printf("..failed.\n");
>     if (!p) return -1;
> 
>     printf("allocations test passed (no crash)\n");
>     return 0;
> }
> 
> 
> On 15/10/08 19:42, "Jeff Squyres"  wrote:
> 
>> On Oct 15, 2008, at 9:35 AM, Francesco Iannone wrote:
>> 
>>> I have a cluster of 16 nodes DualCPU DualCore AMD  RAM 16 GB with
>>> InfiniBand
>>> CISCO HCA and switch InfiniBand.
>>> It uses Linux RH Enterprise 4  64 bit , OpenMPI 1.2.7, PGI 7.1-4 and
>>> openib-1.2-7.
>>> 
>>> Hence it means that the option --disable-ptmalloc2 is catastrophic in
>>> the above configuration.
>> 
>> Actually, I notice that in your original message, you said
>> "--disable-ptmalloc2-opt-sbrk", but here you said "--disable-ptmalloc2".
>> The former is:
>> 
>>    Only trigger callbacks when sbrk is used for small
>>    allocations, rather than every call to malloc/free.
>>    (default: enabled)
>> 
>> So it should be fine to disable; it shouldn't affect overall MPI
>> performance too much.
>> 
>> The latter disables ptmalloc2 entirely (and you'll likely get lower
>> benchmark bandwidth for large messages).
>> 
>> I'm unaware of either of these options leading to problems with the
>> PGI compiler suite; I have tested OMPI v1.2.x with several versions of
>> the PGI compiler without problem (although my latest version is PGI
>> 7.1-4).
> 
> Dr. Francesco Iannone
> Associazione EURATOM-ENEA sulla Fusione
> C.R. ENEA Frascati
> Via E. Fermi 45
> 00044 Frascati (Roma) Italy
> phone 00-39-06-9400-5124
> fax 00-39-06-9400-5524
> mailto:francesco.iann...@frascati.enea.it
>