Re: [OMPI users] ORTE_ERROR_LOG: Timeout in file

2009-04-29 Thread Hugh Dickinson
The remote node starts the following process when mpirun is executed  
on the local node:


25734 ?  Ss  0:00 /usr/lib/openmpi/1.2.5-gcc/bin/orted --bootproxy 1 --


I checked and it was not running before mpirun was executed.

I'll look into installing a more recent version of Open MPI.

Hugh

On 29 Apr 2009, at 00:11, Ralph Castain wrote:

Best I can tell, the remote orted never got executed - it looks to  
me like there is something that blocks the ssh from working. Can you  
get into another window and ssh to the remote node? If so, can you  
do a ps and verify that the orted is actually running there?
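For example, something along these lines (the host name is just a placeholder):

   ssh remote-node
   ps -ef | grep orted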


mpirun is using the same shell on the remote end as you are using  
when you start it. One thing I see that is strange is your entire  
environment is being sent along - I'll have to look at the 1.2.x  
code as I didn't think we were doing that (been a long time since I  
looked though).



On Apr 28, 2009, at 4:57 PM, Hugh Dickinson wrote:

As far as I can tell, both the PATH and LD_LIBRARY_PATH are set  
correctly. I've tried with the full path to the mpirun executable  
and using the --prefix command line option. Neither works. The  
debug output seems to contain a lot of system-specific information  
(IPs, usernames and such), which I'm a little reluctant to share on  
an open mailing list, so I've censored that information.  
Hopefully the rest is still of use. One thing I did notice is that  
Open MPI seems to want to use sh instead of bash (which is the  
shell I use). Is that what's meant by the following lines?


[gamma2.censored_domain:22554] pls:rsh: local csh: 0, local sh: 1
[gamma2.censored_domain:22554] pls:rsh: assuming same remote shell  
as local shell

[gamma2.censored_domain:22554] pls:rsh: remote csh: 0, remote sh: 1

If so is there a way to make it use bash?

Cheers,

Hugh


On 28 Apr 2009, at 22:30, Ralph Castain wrote:

Okay, that's one small step forward. You can lock that in by  
setting the appropriate MCA parameter in one of two ways:


1. add the following to your default mca parameter file:  btl =  
tcp,sm,self (I added the shared memory subsystem as this will help  
with performance). You can see how to do this here:


http://www.open-mpi.org/faq/?category=tuning#setting-mca-params

2. add OMPI_MCA_btl=tcp,sm,self to the environment in your .cshrc  
(or equivalent) file
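For example, the two approaches might look like this (a sketch; the per-user parameter file is assumed to live at $HOME/.openmpi/mca-params.conf, as described in the FAQ link above):

   # in $HOME/.openmpi/mca-params.conf
   btl = tcp,sm,self

   # or, for a csh-style shell, in your .cshrc
   setenv OMPI_MCA_btl tcp,sm,self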


Next, have you looked at the following FAQ:

http://www.open-mpi.org/faq/?category=running#missing-prereqs

Are those things all okay? Have you tried providing a complete  
absolute path when running mpirun (e.g., using /usr/local/openmpi/bin/mpirun  
instead of just mpirun on the cmd line)?


Another thing to try: add --debug-devel to the cmd line and send  
us the (probably verbose) output.



On Apr 28, 2009, at 3:19 PM, Hugh Dickinson wrote:


Hi,

Yes I'm using ethernet connections. Doing as you suggest removes  
the errors generated by running the small test program, but still  
doesn't allow programs (including the small test program) to  
execute on any node other than the one launching mpirun. If I try  
to do that, the command hangs until I interrupt it, whereupon it  
gives the same timeout errors. It seems that there must be some  
problem with the setup of my Open MPI installation. Do you agree,  
and do you have any idea what it is? Also, is there a global  
settings file so I can instruct Open MPI to always only try  
ethernet?


Cheers,

Hugh

On 28 Apr 2009, at 20:12, Ralph Castain wrote:

In this instance, OMPI is complaining that you are attempting to  
use Infiniband, but no suitable devices are found.


I assume you have Ethernet between your nodes? Can you run this  
with the following added to your mpirun cmd line:


-mca btl tcp,self

That will cause OMPI to ignore the Infiniband subsystem and  
attempt to run via TCP over any available Ethernet.
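For example (a sketch only; the process count, host file, and program name are placeholders):

   mpirun -np 4 --hostfile my_hosts -mca btl tcp,self ./my_program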




On Tue, Apr 28, 2009 at 12:16 PM, Hugh Dickinson wrote:

Many thanks for your help nonetheless.

Hugh


On 28 Apr 2009, at 17:23, jody wrote:

Hi Hugh

I'm sorry, but I must admit that I have never encountered these  
messages, and I don't know exactly what causes them.

Perhaps one of the developers can give an explanation?

Jody

On Tue, Apr 28, 2009 at 5:52 PM, Hugh Dickinson wrote:
Hi again,

I tried a simple MPI C++ program:

--
#include <mpi.h>
#include <iostream>

using namespace MPI;
using namespace std;

int main(int argc, char* argv[]) {
 int rank,size;
 Init(argc,argv);
 rank=COMM_WORLD.Get_rank();
 size=COMM_WORLD.Get_size();
 cout << "P:" << rank << " out of " << size << endl;
 Finalize();
}
--
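For reference, a sketch of how such a program might be built and launched (the source file and host file names are placeholders):

   mpiCC test_mpi.cc -o test_mpi
   mpirun -np 2 --hostfile my_hosts ./test_mpi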
It didn't work over all the nodes, again same problem - the system seems to
hang. However, by forcing mpirun to use only the node on which I'm
launching mpirun I get some more error messages:

--
libibverbs: Fatal: couldn't read uverbs ABI version.
libibverbs: Fatal: couldn't read uverbs ABI version.
--
[0,1,0]: OpenIB on host gamma2 was unable to find

Re: [OMPI users] Problem with running openMPI program

2009-04-29 Thread Ankush Kaul
Are there any applications that I can implement on a small scale, in a lab or
something?

Also, what should I do for clustering web servers?


On Wed, Apr 29, 2009 at 2:46 AM, Gus Correa  wrote:

> Hi Ankush
>
> Glad to hear that your MPI and cluster project were successful.
>
> I don't know if you would call these "mathematical computation"
> or "real life applications" of MPI and clusters, but here are a
> few samples I am familiar with (Earth Science):
>
> Weather forecast:
> http://www.wrf-model.org/index.php
> http://www.mmm.ucar.edu/mm5/
>
> Climate, Atmosphere and Ocean circulation modeling:
> http://www.ccsm.ucar.edu/models/ccsm3.0/
> http://www.jamstec.go.jp/esc/index.en.html
> http://www.metoffice.gov.uk/climatechange/
> http://www.gfdl.noaa.gov/fms
> http://www.nemo-ocean.eu/
>
> Earthquakes, computational seismology, and solid Earth dynamics:
> http://www.gps.caltech.edu/~jtromp/research/index.html
> http://www-esd.lbl.gov/GG/CCS/
>
> A couple of other areas:
>
> Computational Fluid Dynamics, Finite Element Method, etc:
> http://www.foamcfd.org/
> http://www.cimec.org.ar/twiki/bin/view/Cimec/PETScFEM
>
> Computational Chemistry, molecular dynamics, etc:
> http://www.tddft.org/programs/octopus/wiki/index.php/Main_Page
> http://classic.chem.msu.su/gran/gamess/
> http://ambermd.org/
> http://www.gromacs.org/
> http://www.charmm.org/
>
> Gus Correa
>
>
> Ankush Kaul wrote:
>
>> Thanks everyone (especially Gus and Jeff) for the support and guidance. We are
>> on the verge of completing our project, which would not have been
>> possible without all of you.
>>
>> I would like to know one more thing: what are real-life applications that
>> I can use the cluster for (other than mathematical computation)? Can I use it
>> for my web server, and if so, how?
>>
>>
>>
>> On Fri, Apr 24, 2009 at 12:01 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
>>
>>Excellent answer.  One addendum -- we had a really nice FAQ entry
>>about this kind of stuff on the LAM/MPI web site, which I was
>>horrified to see that we had not copied to the Open MPI site.  So I
>>copied it over this morning.  :-)
>>
>>Have a look at these 3 FAQ (brand new) entries:
>>
>>
>> http://www.open-mpi.org/faq/?category=building#overwrite-pre-installed-ompi
>>   http://www.open-mpi.org/faq/?category=building#where-to-install
>>
>> http://www.open-mpi.org/faq/?category=running#do-i-need-a-common-filesystem
>>
>>Hope that helps.
>>
>>
>>
>>
>>On Apr 23, 2009, at 10:34 AM, Gus Correa wrote:
>>
>>Hi Ankush
>>
>>Jeff already sent clarifications about image processing,
>>and the portable API nature of OpenMPI (and other MPI
>>implementations).
>>
>>As for "mpicc: command not found" this is again a problem with your
>>PATH.
>>Remember the "locate" command?  :)
>>Find where mpicc is installed, and put that directory on your PATH.
>>
>>In any case, I would suggest that you choose a central NFS mounted
>>file system on your cluster master node, and install OpenMPI there,
>>configuring and building it from source (not from yum).
>>If this directory is mounted on all nodes, the same OpenMPI will be
>>available on all nodes.
>>This will give you a single standard version of OpenMPI across
>>the board.
>>
>>Clustering can become a very confusing and tricky business if you
>>have heterogeneous nodes, with different OS/Linux versions,
>>different MPI versions etc, software installed in different
>>locations
>>on each node, etc, regardless of whether you use mpiselector or
>>you set the PATH variable on each node, or you use environment
>>modules
>>package, or any other technique to setup your environment.
>>Installing less software, rather than more software,
>>and doing so in a standardized homogeneous way across all
>>cluster nodes,
>>will give you a cleaner environment, which is easier to understand,
>>control, upgrade, and update.
>>
>>A relatively simple way to install a homogeneous cluster is
>>to use the Rocks Clusters "rolls" suite,
>>which is free and based on CentOS.
>>It will probably give you some extra work in the beginning,
>>but may be worthwhile in the long run.
>>See this:
>>http://www.rocksclusters.org/wordpress/
>>
>>
>>My two cents.
>>
>>Gus Correa
>>
>>  -
>>Gustavo Correa
>>Lamont-Doherty Earth Observatory - Columbia University
>>Palisades, NY, 10964-8000 - USA
>>
>>  -
>>
>>Ankush Kaul wrote:
>> > @Gus, Eugene
>> > I read all your mails and even followed the same procedure, it
>> > was blas

Re: [OMPI users] Problem with running openMPI program

2009-04-29 Thread Ankush Kaul
@Gus

The applications in the links you have sent are really high level, and I believe
really expensive too, as I will have to have a physical apparatus for various
measurements along with the cluster. Am I right?


Re: [OMPI users] running problem on Dell blade server, confirm 2d21ce3ce8be64d8104b3ad71b8c59e2514a72eb

2009-04-29 Thread Jeff Squyres

On Apr 25, 2009, at 11:59 AM, Anton Starikov wrote:


I can confirm that I have exactly the same problem, also on a Dell
system, even with the latest Open MPI.

Our system is:

Dell M905
OpenSUSE 11.1
kernel: 2.6.27.21-0.1-default
ofed-1.4-21.12 from SUSE repositories.
OpenMPI-1.3.2


What I can also add is that it does not only affect Open MPI. If these messages
are triggered after mpirun:
[node032][[9340,1],11][btl_openib_component.c:3002:poll_device] error
polling HP CQ with -2 errno says Success

Then the IB stack hangs. You cannot even reload it; you have to reboot the node.




Something that severe should not be able to be caused by Open MPI.   
Specifically: Open MPI should not be able to hang the OFED stack.   
Have you run layer 0 diagnostics to know that your fabric is clean?   
You might want to contact your IB vendor to find out how to do that.


--
Jeff Squyres
Cisco Systems



Re: [OMPI users] Problem with running openMPI program

2009-04-29 Thread Gus Correa

Hi Ankush

You can run the MITgcm ocean model test cases and the CAM3 atmospheric
model test with two processors only, but the codes scale well to
any number of processors.
They are "real life" applications, but not too hard to get to work.
It will take some reading of their README and INSTALL files,
and perhaps of their User Guides to understand how they work, though.

You can even run them on a single processor, but if you want to make
the point that OpenMPI works on your cluster, you will also want to use more
than one processor.


You can download the tarballs from these links:

http://mitgcm.org/download/
http://www.ccsm.ucar.edu/models/atm-cam/download/

CAM3 will require the NetCDF package, which is easy to install also:
http://www.unidata.ucar.edu/downloads/netcdf/netcdf-3_6_3/index.jsp

You can even get the NetCDF package with yum, if you prefer.
(Try "yum info netcdf".)

However, the MITgcm can work even without NetCDF (although it can
benefit from NetCDF also).

Of course there are simpler MPI programs out there, but they may be
what you called "mathematical computations" as opposed to "real life 
applications".  :)


Somebody already sent you this link before.
It has some simpler MPI programs:

http://www.pdc.kth.se/training/Tutor/MPI/Templates/index-frame.html

These (online) books may have some MPI program examples:

Ian Foster's (online) book (Ch. 8 is on MPI):

http://www.wotug.org/parallel/books/addison-wesley/dbpp/text/book.html

Peter Pacheco's book (a short version is online):

http://www.cs.usfca.edu/mpi/

Here are other MPI program examples (not all are guaranteed to work):

http://www2.cs.uh.edu/~johnson2/labs.html
http://www.redbooks.ibm.com/redbooks/SG245380.html

See more links to MPI tutorials, etc, here:
http://fats-raid.ldeo.columbia.edu/pages/parallel_programming.html#mpi

I hope this helps.

Gus Correa
-
Gustavo Correa
Lamont-Doherty Earth Observatory - Columbia University
Palisades, NY, 10964-8000 - USA
-

Ankush Kaul wrote:

@Gus

The applications in the links you have sent are really high level, and I
believe really expensive too, as I will have to have a physical apparatus
for various measurements along with the cluster. Am I right?










Re: [OMPI users] Purify found bugs inside open-mpi library

2009-04-29 Thread Brian Blank
To Whom This May Concern:

I've been trying to dig a little deeper into this problem and found some
additional information.

First, the stack trace for the ABR and ABW were different. The ABR problem
occurred in datatype_pack.h while the ABW problem occurred in
datatype_unpack.h.  The problem appears to be the same still.  Both errors
are occurring during a call to MEMCPY_CSUM().

I also found there are two different variables playing into this bug.  They
are _copy_blength and _copy_count.  At the top of the function, both of these
variables are set to 2 bytes for MPI_SHORT, 4 bytes for MPI_LONG, and 8
bytes for MPI_DOUBLE.  Then, these variables are multiplied together to get
the size of the memcpy().  Unfortunately, the correct size is either of
these variables before they are squared.  There also appears to be an
optimization where, if two fields are adjacent, they are sometimes
converted into an MPI_BYTE whose size also incorrectly takes these
squared values into consideration.

I wrote a small test program to illustrate the problem and attached it to
this email.  First, I configured openmpi 1.3.2 with the following options:

./configure --prefix=/myworkingdirectory/openmpi-1.3.2.local
--disable-mpi-f77 --disable-mpi-f90 --enable-debug --enable-mem-debug
--enable-mem-profile

I then modified datatype_pack.h & datatype_unpack.h located in
openmpi-1.3.2/ompi/datatype directory in order to produce additional
debugging output.  The new versions are attached to this email.

Then, I executed "make all install"

Then, I wrote the attached test.c program.  You can find the output below.
The problems appear in red.

0: sizes '3'  '4'  '8'  '2'
0: offsets   '0'  '4'  '8'  '16'
0: addresses '134510640' '134510644' '134510648' '134510656'
0: name='MPI_CHAR'  _copy_blength='3'  orig_copy_blength='1'
_copy_count='3'  _source='134510640'
0: name='MPI_LONG'  _copy_blength='16'  orig_copy_blength='4'
_copy_count='4'  _source='134510644'
0: name='MPI_DOUBLE'  _copy_blength='64'  orig_copy_blength='8'
_copy_count='8'  _source='134510648'
0: name='MPI_SHORT'  _copy_blength='4'  orig_copy_blength='2'
_copy_count='2'  _source='134510656'
0: one='22'  two='222'  three='33.30'  four='44'
1: sizes '3'  '4'  '8'  '2'
1: offsets   '0'  '4'  '8'  '16'
1: addresses '134510640' '134510644' '134510648' '134510656'
1: name='MPI_CHAR'  _copy_blength='3'  orig_copy_blength='1'
_copy_count='3'  _destination='134510640'
1: name='MPI_LONG'  _copy_blength='16'  orig_copy_blength='4'
_copy_count='4'  _destination='134510644'
1: name='MPI_DOUBLE'  _copy_blength='64'  orig_copy_blength='8'
_copy_count='8'  _destination='134510648'
1: name='MPI_SHORT'  _copy_blength='4'  orig_copy_blength='2'
_copy_count='2'  _destination='134510656'
1: one='22'  two='222'  three='33.30'  four='44'

You can see from the output that the MPI_Send & MPI_Recv functions are
reading or writing too much data from my structure, causing an overflow
condition to occur.  I believe this is causing my application to crash.
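For readers without the attachment, a minimal hypothetical sketch of the kind of program involved follows. This is not the attached test.c; the struct layout, field names, and counts are only inferred from the sizes, offsets, and values printed above (run with at least two processes):

--
#include <stdio.h>
#include <string.h>
#include <stddef.h>
#include <mpi.h>

/* Hypothetical struct: the field names, sizes (3, 4, 8, 2 bytes) and
   offsets (0, 4, 8, 16) are inferred from the debug output above. */
struct sample {
    char   one[3];
    long   two;      /* 4 bytes on this (apparently 32-bit) platform */
    double three;
    short  four;
};

int main(int argc, char *argv[])
{
    struct sample s;
    int rank;
    int          blocklens[4] = { 3, 1, 1, 1 };
    MPI_Datatype types[4] = { MPI_CHAR, MPI_LONG, MPI_DOUBLE, MPI_SHORT };
    MPI_Aint     offsets[4];
    MPI_Datatype stype;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    offsets[0] = offsetof(struct sample, one);
    offsets[1] = offsetof(struct sample, two);
    offsets[2] = offsetof(struct sample, three);
    offsets[3] = offsetof(struct sample, four);

    /* Build a derived datatype describing the struct, then send/recv it. */
    MPI_Type_create_struct(4, blocklens, offsets, types, &stype);
    MPI_Type_commit(&stype);

    if (rank == 0) {
        strcpy(s.one, "22"); s.two = 222; s.three = 33.30; s.four = 44;
        MPI_Send(&s, 1, stype, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&s, 1, stype, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("one='%s'  two='%ld'  three='%.2f'  four='%d'\n",
               s.one, s.two, s.three, (int) s.four);
    }

    MPI_Type_free(&stype);
    MPI_Finalize();
    return 0;
}
--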

Any help on this problem would be appreciated.  If there is anything else
that you need from me, just let me know.

Thanks,
Brian



On Tue, Apr 28, 2009 at 9:58 PM, Brian Blank  wrote:

> To Whom This May Concern:
>
> I am having problems with an OpenMPI application I am writing on the
> Solaris/Intel AMD64 platform.  I am using OpenMPI 1.3.2 which I compiled
> myself using the Solaris C/C++ compiler.
>
> My application was crashing (signal 11) inside a call to malloc() which my
> code was running.  I thought it might be a memory overflow error that was
> causing this, so I fired up Purify.  Purify found several problems inside
> the OpenMPI library.  I think one of the errors is serious and might be
> causing the problems I was looking for.
>
> The serious error is an Array Bounds Write (ABW) which occurred 824 times
> through 312 calls to MPI_Recv().  This error means that something inside the
> OpenMPI library is writing to memory where it shouldn't be.  Here
> is the stack trace at the time of this error:
>
> Stack Trace 1 (Occurred 596 times)
>
> memcpy rtlib.o
> unpack_predefined_data [datatype_unpack.h:41]
>  MEMCPY_CSUM( _destination, *(SOURCE), _copy_blength, (CONVERTOR) );
> ompi_generic_simple_unpack [datatype_unpack.c:419]
> ompi_convertor_unpack [convertor.c:314]
> mca_pml_ob1_recv_frag_callback_match [pml_ob1_recvfrag.c:218]
> mca_btl_sm_component_progress [btl_sm_component.c:427]
> opal_progress [opal_progress.c:207]
> opal_condition_wait [condition.h:99]
> 
>  bytes.>
>
> Stack Trace 2 (Occurred 228 times)
>
> memcpy rtlib.o
> unpack_predefined_data [datatype_unpack.h:41]
>  MEMCPY_CSUM( _destination, *(SOURCE), _copy_blength, (CONVERTOR) );
> ompi_generic_simple_unpack [datatype_unpack.c:419]
> ompi_convertor_unpack [convertor.c:314]
> mca_pml_ob1_recv_request_progress_match [pml_ob1_recvreq.c:624]
> mca_pml_ob1_Recv_req_start [pml_ob1_recvreq.c:1008]
> mca_pml_ob1_recv [pml_ob1_ir

Re: [OMPI users] Bogus memcpy or bogus valgrind record

2009-04-29 Thread Jeff Squyres

On Apr 22, 2009, at 7:35 PM, François PELLEGRINI wrote:


I have had no answers regarding the trouble (an OpenMPI bug?)
I observed when combining OpenMPI and valgrind.



Sorry for the delay in getting back to you; there are so many mails  
and only so many hours in the day...  :-(



I tried it with a newer version of OpenMPI, and the problems
persist, with new, even more worrying, error messages being  
displayed:


==32142== Warning: client syscall munmap tried to modify addresses  
0x-0xFFE


(but this happens for all the programs I tried)

The original error messages, which are still here, were the
following :

==32143== Source and destination overlap in memcpy(0x4A73DA8,  
0x4A73DB0, 16)

==32143==at 0x40236C9: memcpy (mc_replace_strmem.c:402)
==32143==by 0x407C9DC: ompi_ddt_copy_content_same_ddt (dt_copy.c:171)

==32143==by 0x512EA61: ompi_coll_tuned_allgather_intra_bruck
(coll_tuned_allgather.c:193)
==32143==by 0x5126D90: ompi_coll_tuned_allgather_intra_dec_fixed
(coll_tuned_decision_fixed.c:562)
==32143==by 0x408986A: PMPI_Allgather (pallgather.c:101)
==32143==by 0x80487D7: main (in /tmp/brol)

I do not get this "memcpy" messages when running on 2 processors.
I therefore assume it is a rounding problem wrt the number of procs.



Good.  This is possibly related to a post from last night:

http://www.open-mpi.org/community/lists/users/2009/04/9138.php.

Some of the valgrind warnings are unavoidable, unfortunately -- e.g.,  
those within system calls.  Note that you *can* avoid the valgrind  
warnings in PLPA (the linux paffinity component) if you configure  
OMPI --with-valgrind.  This will programmatically tell valgrind that the  
memory access that PLPA is doing "is ok" (i.e., it's specifically  
intended, and only looks like an error, for long/complicated reasons).
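For example, the relevant configure invocation might look like this (a sketch; the install prefix is just a placeholder):

   ./configure --with-valgrind --prefix=/path/to/openmpi-install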


But I'm able to replicate your error (but shouldn't the 2nd buffer be  
the 1st + size (not 2)?) -- let me dig into it a bit... we definitely  
shouldn't be getting invalid writes in the convertor, etc.


I've filed ticket #1903 about this issue:

https://svn.open-mpi.org/trac/ompi/ticket/1903

--
Jeff Squyres
Cisco Systems




Re: [OMPI users] Purify found bugs inside open-mpi library

2009-04-29 Thread Jeff Squyres

Per your mail on the devel list, we'll follow up there.  Many thanks!

On Apr 29, 2009, at 1:09 PM, Brian Blank wrote:


To Whom This May Concern:

I've been trying to dig a little deeper into this problem and found  
some additional information.


First, the stack trace for the ABR and ABW were different. The ABR  
problem occurred in datatype_pack.h while the ABW problem occurred  
in datatype_unpack.h.  The problem appears to be the same still.   
Both errors are occurring during a call to MEMCPY_CSUM().


I also found there are two different variables playing into this  
bug.  They are _copy_blength and _copy_count.  At the top of the  
function, both of these variables are set to 2 bytes for MPI_SHORT,  
4 bytes for MPI_LONG, and 8 bytes for MPI_DOUBLE.  Then, these  
variables are multiplied together to get the size of the memcpy().   
Unfortunately, the correct size is either of these variables before  
they are squared.  There also appears to be an optimization  
where, if two fields are adjacent, they are sometimes  
converted into an MPI_BYTE whose size also incorrectly takes  
these squared values into consideration.


I wrote a small test program to illustrate the problem and attached  
it to this email.  First, I configured openmpi 1.3.2 with the  
following options:


./configure --prefix=/myworkingdirectory/openmpi-1.3.2.local -- 
disable-mpi-f77 --disable-mpi-f90 --enable-debug --enable-mem-debug  
--enable-mem-profile


I then modified datatype_pack.h & datatype_unpack.h located in  
openmpi-1.3.2/ompi/datatype directory in order to produce additional  
debugging output.  The new versions are attached to this email.


Then, I executed "make all install"

Then, I wrote the attached test.c program.  You can find the output  
below.  The problems appear in red.


0: sizes '3'  '4'  '8'  '2'
0: offsets   '0'  '4'  '8'  '16'
0: addresses '134510640' '134510644' '134510648' '134510656'
0: name='MPI_CHAR'  _copy_blength='3'  orig_copy_blength='1'   
_copy_count='3'  _source='134510640'
0: name='MPI_LONG'  _copy_blength='16'  orig_copy_blength='4'   
_copy_count='4'  _source='134510644'
0: name='MPI_DOUBLE'  _copy_blength='64'  orig_copy_blength='8'   
_copy_count='8'  _source='134510648'
0: name='MPI_SHORT'  _copy_blength='4'  orig_copy_blength='2'   
_copy_count='2'  _source='134510656'

0: one='22'  two='222'  three='33.30'  four='44'
1: sizes '3'  '4'  '8'  '2'
1: offsets   '0'  '4'  '8'  '16'
1: addresses '134510640' '134510644' '134510648' '134510656'
1: name='MPI_CHAR'  _copy_blength='3'  orig_copy_blength='1'   
_copy_count='3'  _destination='134510640'
1: name='MPI_LONG'  _copy_blength='16'  orig_copy_blength='4'   
_copy_count='4'  _destination='134510644'
1: name='MPI_DOUBLE'  _copy_blength='64'  orig_copy_blength='8'   
_copy_count='8'  _destination='134510648'
1: name='MPI_SHORT'  _copy_blength='4'  orig_copy_blength='2'   
_copy_count='2'  _destination='134510656'

1: one='22'  two='222'  three='33.30'  four='44'

You can see from the output that the MPI_Send & MPI_Recv functions  
are reading or writing too much data from my structure, causing an  
overflow condition to occur.  I believe this is causing my  
application to crash.


Any help on this problem would be appreciated.  If there is anything  
else that you need from me, just let me know.


Thanks,
Brian



On Tue, Apr 28, 2009 at 9:58 PM, Brian Blank   
wrote:

To Whom This May Concern:

I am having problems with an OpenMPI application I am writing on the  
Solaris/Intel AMD64 platform.  I am using OpenMPI 1.3.2 which I  
compiled myself using the Solaris C/C++ compiler.


My application was crashing (signal 11) inside a call to malloc()  
which my code was running.  I thought it might be a memory overflow  
error that was causing this, so I fired up Purify.  Purify found  
several problems inside the OpenMPI library.  I think one of the  
errors is serious and might be causing the problems I was looking for.


The serious error is an Array Bounds Write (ABW) which occurred 824  
times through 312 calls to MPI_Recv().  This error means that  
something inside the OpenMPI library is writing to memory where it  
shouldn't be.  Here is the stack trace at the time of  
this error:


Stack Trace 1 (Occurred 596 times)

memcpy  rtlib.o
unpack_predefined_data  [datatype_unpack.h:41]
MEMCPY_CSUM( _destination, *(SOURCE), _copy_blength, (CONVERTOR) );
ompi_generic_simple_unpack [datatype_unpack.c:419]
ompi_convertor_unpack [convertor.c:314]
mca_pml_ob1_recv_frag_callback_match [pml_ob1_recvfrag.c:218]
mca_btl_sm_component_progress [btl_sm_component.c:427]
opal_progress [opal_progress.c:207]
opal_condition_wait [condition.h:99]
illegal).>
of 664 bytes.>


Stack Trace 2 (Occurred 228 times)

memcpy  rtlib.o
unpack_predefined_data  [datatype_unpack.h:41]
MEMCPY_CSUM( _destination, *(SOURCE), _copy_blength, (CONVERTOR) );
ompi_generic_simple_unpack [datatype_unpack.c:419]
ompi_conv

Re: [OMPI users] Bogus memcpy or bogus valgrind record

2009-04-29 Thread Jed Brown
Jeff Squyres wrote:

> But I'm able to replicate your error (but shouldn't the 2nd buffer be
> the 1st + size (not 2)?) -- let me dig into it a bit... we definitely
> shouldn't be getting invalid writes in the convertor, etc.

As Eugene pointed out earlier, it is fine.

  dataloctab = malloc (2 * (procglbnbr + 1) * sizeof (int));
  dataglbtab = dataloctab + 2;

dataloctab is the 2-element send buffer, dataglbtab is the receive
buffer of length 2*procglbnbr.
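In other words, the call presumably looks something like the following sketch (not the actual program source; the local values and the communicator are assumptions):

--
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int *dataloctab, *dataglbtab;
    int  procglbnbr;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &procglbnbr);

    /* dataloctab holds the 2-element send buffer; dataglbtab starts
       right after it and receives 2 ints from each process, so the
       two regions do not actually overlap. */
    dataloctab = malloc(2 * (procglbnbr + 1) * sizeof(int));
    dataglbtab = dataloctab + 2;

    dataloctab[0] = 1;   /* placeholder local values */
    dataloctab[1] = 2;

    MPI_Allgather(dataloctab, 2, MPI_INT,
                  dataglbtab, 2, MPI_INT, MPI_COMM_WORLD);

    free(dataloctab);
    MPI_Finalize();
    return 0;
}
--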

Jed





Re: [OMPI users] Bogus memcpy or bogus valgrind record

2009-04-29 Thread Jeff Squyres

On Apr 29, 2009, at 3:40 PM, Jed Brown wrote:

> But I'm able to replicate your error (but shouldn't the 2nd buffer be
> the 1st + size (not 2)?) -- let me dig into it a bit... we definitely
> shouldn't be getting invalid writes in the convertor, etc.

As Eugene pointed out earlier, it is fine.

  dataloctab = malloc (2 * (procglbnbr + 1) * sizeof (int));
  dataglbtab = dataloctab + 2;

dataloctab is the 2-element send buffer, dataglbtab is the receive
buffer of length 2*procglbnbr.




You're absolutely right -- sorry for not paying attention.

Regardless, there still is a definite problem.  I'm digging...

--
Jeff Squyres
Cisco Systems