Re: [O-MPI devel] mpif.h problems
I was able to get the code to compile and run by including only these 3 constants. Thanks for fixing this.

Chip

On Fri, Sep 23, 2005 at 11:35:14AM -0400, Jeff Squyres wrote:
> On Sep 21, 2005, at 3:37 PM, David R. (Chip) Kent IV wrote:
>
> > I managed to find a number of problems with the mpif.h when I tried
> > it on a big application. It looks like a lot of key constants are not
> > defined in this file. So far, MPI_SEEK_SET, MPI_MODE_CREATE, and
> > MPI_MODE_WRONLY have broken the build. I've added them to mpif.h as I
> > find them so that I can get the build to go, but I assume there are
> > many more values still missing.
>
> (Sorry I didn't reply earlier; I was traveling with limited
> connectivity when you sent this.)
>
> Yoinks. Looks like we forgot to add the MPI-2 IO constants into
> mpif.h; thanks for finding this.
>
> I'll go add them and take a swipe through the file to see if I can
> find any others that we're missing. Did you find any others that are
> missing?
>
> --
> {+} Jeff Squyres
> {+} The Open MPI Project
> {+} http://www.open-mpi.org/
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
David R. "Chip" Kent IV
Parallel Tools Team
High Performance Computing Environments Group (CCN-8)
Los Alamos National Laboratory
(505) 665-5021
drk...@lanl.gov

This message is "Technical data or Software Publicly Available" or "Correspondence".
Re: [O-MPI devel] mpif.h problems
Ok, thanks. I apologize -- I made this fix several days ago and forgot to commit it. :-\

We try not to commit changes to the configure/build system during the US work day, so I'll commit this stuff tonight (i.e., it should be in tomorrow's tarball).

On Sep 26, 2005, at 8:24 AM, David R. (Chip) Kent IV wrote:

> I was able to get the code to compile and run by including only these
> 3 constants. Thanks for fixing this.
>
> Chip
>
> [snip -- earlier quoted thread]

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/
Re: [O-MPI devel] ompi_info Seg Fault, missing component -- linux (fwd)
On Fri, 2005-09-16 at 11:35 -0500, Brian Barrett wrote:
> On Sep 16, 2005, at 8:44 AM, Ferris McCormick wrote:
>
> > ==
> > fmccor@polylepis util [235]% ./opal_timer
> > --> frequency: 9
> > --> cycle count
> > Slept approximately 903151189 cycles, or 1003501 us
> > --> usecs
> > Slept approximately 18446744073289684648 us
> > ==
>
> That last value means that I'm munging the upper 32 bits of the tick
> register (it's 64 bits long). So we're not quite there yet, but
> getting closer. I should be able to get to that today.
>
> The other problem is very odd. Since you're compiling in 32-bit mode,
> I'd expect us to see it on our PowerPC machines, but I haven't run
> into that one yet. I'll try to compile without debugging and see what
> I can see.
>
> Brian

Here's a little more information on the SegFault when trying OBJ_DESTRUCT(&verbose); in opal/util/output.c:

First, verbose is of type opal_output_stream_t, which is not an opal_object_t, so OBJ_DESTRUCT is calling opal_obj_run_destructors with an object of the wrong type (although ompi might be forcing storage allocation so that this call should work; I haven't worked it out).

Second, on my system at least, when OBJ_DESTRUCT(&verbose) gets called, verbose looks like this. (I added a debug fprintf to look at a bit of the verbose structure; the corresponding fprintf I put after OBJ_CONSTRUCT(&verbose, opal_output_stream_t); is fine.)

Program received signal SIGSEGV, Segmentation fault.
0x7014f7d4 in opal_output_close (output_id=1883966264) at output.c:287
287         fprintf(stderr, "Destroying verbose, depth=%d\n",
                (/*(opal_object_t*)&*/verbose.super).obj_class->cls_depth);
Current language:  auto; currently c
(gdb) print verbose
$1 = {super = {obj_class = 0x0, obj_reference_count = 1},
  lds_is_debugging = false, lds_verbose_level = 0,
  lds_want_syslog = false, lds_syslog_priority = 0,
  lds_syslog_ident = 0x0, lds_prefix = 0x0,
  lds_want_stdout = false, lds_want_stderr = true,
  lds_want_file = false, lds_want_file_append = false,
  lds_file_suffix = 0x0}

So verbose.super.obj_class has been set to NULL, and no matter how it is supposed to work, the opal_obj_run_destructors loop:

    cls = object->obj_class;
    for (i = 0; i < cls->cls_depth; i++) { ...

is going to be working on garbage, because nothing in verbose has a useful obj_class element.

Hope this helps,
Regards,
Ferris
--
Ferris McCormick (P44646, MI)
Developer, Gentoo Linux (Sparc, Devrel)
Re: [O-MPI devel] ompi_info Seg Fault, missing component -- linux (fwd)
On Mon, 2005-09-26 at 14:59 +, Ferris McCormick wrote:
> Here's a little more information on the SegFault when trying
> OBJ_DESTRUCT(&verbose); in opal/util/output.c:
>
> First, verbose is of type opal_output_stream_t, which is not an
> opal_object_t, so OBJ_DESTRUCT is calling opal_obj_run_destructors
> with an object of the wrong type (although ompi might be forcing
> storage allocation so that this call should work; I haven't worked it
> out).
>
> [snip -- gdb output quoted from the previous message]
>
> So verbose.super.obj_class has been set to NULL, and no matter how it
> is supposed to work, the opal_obj_run_destructors loop is going to be
> working on garbage, because nothing in verbose has a useful obj_class
> element.

I've looked at the structures, and I see that opal_output_stream_t is set up so that (opal_object_t *)(&verbose) should resolve correctly, and thus my first concern is gone.

Now, for the second: if built with --enable-debug, then when the program finally reaches OBJ_DESTRUCT(&verbose), the obj_class pointer is correct. Without --enable-debug, it is NULL. I'll keep looking at it, but so far I don't see what is going wrong.

Regards,
--
Ferris McCormick (P44646, MI)
Developer, Gentoo Linux (Sparc, Devrel)
[O-MPI devel] [PATCH] Update Open MPI for new libibverbs API
[It's somewhat annoying to have to subscribe to de...@open-mpi.org just to be able to send patches, but oh well...]

This patch updates Open MPI for the new ibv_create_cq() API.

Signed-off-by: Roland Dreier

--- ompi/mca/btl/openib/btl_openib.c	(revision 7507)
+++ ompi/mca/btl/openib/btl_openib.c	(working copy)
@@ -656,7 +656,8 @@ int mca_btl_openib_module_init(mca_btl_o
     }
 
     /* Create the low and high priority queue pairs */
-    openib_btl->ib_cq_low = ibv_create_cq(ctx, mca_btl_openib_component.ib_cq_size, NULL);
+    openib_btl->ib_cq_low = ibv_create_cq(ctx, mca_btl_openib_component.ib_cq_size,
+                                          NULL, NULL, 0);
 
     if(NULL == openib_btl->ib_cq_low) {
         BTL_ERROR(("error creating low priority cq for %s errno says %s\n",
@@ -665,7 +666,8 @@ int mca_btl_openib_module_init(mca_btl_o
         return OMPI_ERROR;
     }
 
-    openib_btl->ib_cq_high = ibv_create_cq(ctx, mca_btl_openib_component.ib_cq_size, NULL);
+    openib_btl->ib_cq_high = ibv_create_cq(ctx, mca_btl_openib_component.ib_cq_size,
+                                           NULL, NULL, 0);
 
     if(NULL == openib_btl->ib_cq_high) {
         BTL_ERROR(("error creating high priority cq for %s errno says %s\n",
Re: [O-MPI devel] p2p linpack ---
Well, looks like we found the problem. The memory free callback was incorrect: we were just looking up the base address of the free in the tree. Here is why this didn't work (this probably wouldn't compile, but it works as an explanation):

    buf = malloc(40*1024);  /* malloc 10 pages */

    /* Send the second half of the buffer to the peer.  Note that
       leave_pinned will register and cache only what it sees in the
       send. */
    MPI_Send(buf + 5*4*1024, 5*4*1024, ...);

    /* Free the buffer.  MPI will try to find the registration in the
       tree based on the address, buf, but won't find it, so the
       registration remains. */
    free(buf);

Since the registration is left in the tree, a future malloc may obtain a virtual address that is within the base and bound of that registration. When this memory is later freed, we try to deregister the entire registration, part of which might be in use by another buffer; it could even be in the middle of an RDMA operation.

Anyway, I have modified the code and we are now passing a smaller linpack run with leave_pinned and the mem hooks enabled, without using any mallopt trickiness.

Thanks,
Galen

On Sep 25, 2005, at 10:58 AM, Galen M. Shipman wrote:

> Well, after adding a bunch of debugging output, I have found the
> following. With both leave_pinned and use_mem_hook enabled on a
> linpack run, we get the assertion error on the memory callback. That
> is to say, there is a free occurring in the middle of a registration.
> At the point of the assert we have NOT resized any registrations. The
> existing registrations in the tree are:
>
>     Base         Bound        Length
>     241615360    244841607    3226248
>     244841608    246428807    1587200
>     246428808    248016007    1587200
>     248019648    251245895    3226248
>
> Trying to free 247917216, from:
>
>     Base         Bound
>     246428808    248016007
>
> When we get the assert, we are trying to free 247917216, which is in
> the middle of the registration. Note we have NOT resized any
> registrations, so I am confident there is not an issue with either the
> tree or the resize, at least as far as linpack is concerned.
>
> Here is the callstack:
>
> #0  0x002a95f079c9 in raise () from /lib/libc.so.6
> #1  0x002a95f08e6e in abort () from /lib/libc.so.6
> #2  0x002a95f01690 in __assert_fail () from /lib/libc.so.6
> #3  0x002a9571b200 in mca_mpool_base_mem_cb (base=0xec6eaa0, size=31624, cbdata=0x0) at mpool_base_mem_cb.c:53
> #4  0x002a9587fe0d in opal_mem_free_release_hook (buf=0xec6eaa0, length=31624) at memory.c:121
> #5  0x002a9588bd12 in opal_mem_free_free_hook (ptr=0xec6eaa0, caller=0x42b052) at memory_malloc_hooks.c:66
> #6  0x0042b052 in ATL_dmmIJK ()
> #7  0x0064f9b1 in ATL_dgemmNN ()
> #8  0x0057722b in ATL_dgemmNN_RB ()
> #9  0x00577fc3 in ATL_rtrsmRUN ()
> #10 0x0042c63c in ATL_dtrsm ()
> #11 0x00423c1e in atl_f77wrap_dtrsm__ ()
> #12 0x00423a94 in dtrsm_ ()
> #13 0x00411192 in HPL_dtrsm (ORDER=17933, SIDE=17933, UPLO=8, TRANS=4294967295, DIAG=0, M=23458672, N=0, ALPHA=1, A=0x7fbfffefa0, LDA=0, B=0x202, LDB=0) at HPL_dtrsm.c:949
> #14 0x0040cfb6 in HPL_pdupdateTT (PBCST=0x0, IFLAG=0x0, PANEL=0x165f040, NN=-1) at HPL_pdupdateTT.c:362
> #15 0x0041936f in HPL_pdgesvK2 (GRID=0x7fb4a0, ALGO=0x7fb460, A=0x7fb260) at HPL_pdgesvK2.c:178
> #16 0x0040d6f7 in HPL_pdgesv (GRID=0x7fb4a0, ALGO=0x460d, A=0x7fb260) at HPL_pdgesv.c:107
> #17 0x00405b10 in HPL_pdtest (TEST=0x7fb430, GRID=0x7fb4a0, ALGO=0x7fb460, N=1, NB=80) at HPL_pdtest.c:193
> #18 0x00401840 in main (ARGC=1, ARGV=0x7fb928) at HPL_pddriver.c:223
>
> Note that the free occurs in the ATLAS libraries; I will look into
> rebuilding linpack with another BLAS library to see what happens. Any
> other suggestions?
>
> Thanks,
> Galen