Re: [O-MPI devel] mpif.h problems
I was able to get the code to compile and run by including only these 3 constants. Thanks for fixing this.

Chip

On Fri, Sep 23, 2005 at 11:35:14AM -0400, Jeff Squyres wrote:
> On Sep 21, 2005, at 3:37 PM, David R. (Chip) Kent IV wrote:
>
> > I managed to find a number of problems with the mpif.h when I tried
> > it on a big application. It looks like a lot of key constants are not
> > defined in this file. So far, MPI_SEEK_SET, MPI_MODE_CREATE, and
> > MPI_MODE_WRONLY have broken the build. I've added them to mpif.h as I
> > find them so that I can get the build to go, but I assume there are
> > many more values still missing.
>
> (Sorry I didn't reply earlier; I was traveling with limited
> connectivity when you sent this.)
>
> Yoinks. Looks like we forgot to add the MPI-2 IO constants into
> mpif.h; thanks for finding this.
>
> I'll go add them and take a swipe through the file to see if I can
> find any others that we're missing. Did you find any others that are
> missing?
>
> --
> {+} Jeff Squyres
> {+} The Open MPI Project
> {+} http://www.open-mpi.org/
>
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
David R. "Chip" Kent IV
Parallel Tools Team
High Performance Computing Environments Group (CCN-8)
Los Alamos National Laboratory
(505) 665-5021
drk...@lanl.gov

This message is "Technical data or Software Publicly Available" or "Correspondence".
Re: [O-MPI devel] mpif.h problems
Ok, thanks. I apologize -- I made this fix several days ago and forgot to commit it. :-\

We try not to commit changes to the configure/build system during the US work day, so I'll commit this stuff tonight (i.e., it should be in tomorrow's tarball).

On Sep 26, 2005, at 8:24 AM, David R. (Chip) Kent IV wrote:

> I was able to get the code to compile and run by including only these
> 3 constants. Thanks for fixing this.
>
> Chip
>
> [snip -- earlier quoted thread]

--
{+} Jeff Squyres
{+} The Open MPI Project
{+} http://www.open-mpi.org/
Re: [O-MPI devel] ompi_info Seg Fault, missing component -- linux (fwd)
On Fri, 2005-09-16 at 11:35 -0500, Brian Barrett wrote:
> On Sep 16, 2005, at 8:44 AM, Ferris McCormick wrote:
>
> > ==
> > fmccor@polylepis util [235]% ./opal_timer
> > --> frequency: 9
> > --> cycle count
> > Slept approximately 903151189 cycles, or 1003501 us
> > --> usecs
> > Slept approximately 18446744073289684648 us
> > ==
>
> That last value means that I'm munging the upper 32 bits of the tick
> register (it's 64 bits long). So we're not quite there yet, but
> getting closer. I should be able to get to that today.
>
> The other problem is very odd. Since you're compiling in 32-bit mode,
> I'd expect us to see it on our PowerPC machines, but I haven't run
> into that one yet. I'll try to compile without debugging and see what
> I can see.
>
> Brian

Here's a little more information on the SegFault when trying OBJ_DESTRUCT(&verbose); in opal/util/output.c:

First, verbose is of type opal_output_stream_t, which is not an opal_object_t, so OBJ_DESTRUCT is calling opal_obj_run_destructors with an object of the wrong type (although ompi might be forcing storage allocation so that this call should work; I haven't worked it out).

Second, on my system at least, when OBJ_DESTRUCT(&verbose) gets called, verbose looks like this. (I added a debug fprintf to look at a bit of the verbose structure; the corresponding fprintf I put after OBJ_CONSTRUCT(&verbose, opal_output_stream_t); is fine.)

Program received signal SIGSEGV, Segmentation fault.
0x7014f7d4 in opal_output_close (output_id=1883966264) at output.c:287
287         fprintf(stderr, "Destroying verbose, depth=%d\n",
                (/*(opal_object_t*)&*/verbose.super).obj_class->cls_depth);
Current language:  auto; currently c
(gdb) print verbose
$1 = {super = {obj_class = 0x0, obj_reference_count = 1},
  lds_is_debugging = false, lds_verbose_level = 0,
  lds_want_syslog = false, lds_syslog_priority = 0,
  lds_syslog_ident = 0x0, lds_prefix = 0x0,
  lds_want_stdout = false, lds_want_stderr = true,
  lds_want_file = false, lds_want_file_append = false,
  lds_file_suffix = 0x0}

So verbose.super.obj_class has been set to NULL, and no matter how it is supposed to work, the opal_obj_run_destructors loop:

    cls = object->obj_class;
    for (i = 0; i < cls->cls_depth; i++) { ...

is going to be working on garbage, because nothing in verbose has a useful obj_class element.

Hope this helps,
Regards,
Ferris
--
Ferris McCormick (P44646, MI)
Developer, Gentoo Linux (Sparc, Devrel)
Re: [O-MPI devel] ompi_info Seg Fault, missing component -- linux (fwd)
On Mon, 2005-09-26 at 14:59 +, Ferris McCormick wrote:
> Here's a little more information on the SegFault when trying
> OBJ_DESTRUCT(&verbose); in opal/util/output.c:
>
> First, verbose is of type opal_output_stream_t, which is not an
> opal_object_t, so OBJ_DESTRUCT is calling opal_obj_run_destructors
> with an object of the wrong type (although ompi might be forcing
> storage allocation so that this call should work; I haven't worked it
> out).
>
> [snip -- gdb output quoted from the previous message]
>
> So verbose.super.obj_class has been set to NULL, and no matter how it
> is supposed to work, the opal_obj_run_destructors loop is going to be
> working on garbage, because nothing in verbose has a useful obj_class
> element.

I've looked at the structures, and I see that opal_output_stream_t is set up so that (opal_object_t *)(&verbose) should resolve correctly, and thus my first concern is gone.

Now, for the second: if built with --enable-debug, then when the program finally reaches OBJ_DESTRUCT(&verbose), the obj_class pointer is correct. Without --enable-debug, it is NULL. I'll keep looking at it, but so far I don't see what is going wrong.

Regards,
--
Ferris McCormick (P44646, MI)
Developer, Gentoo Linux (Sparc, Devrel)
[O-MPI devel] [PATCH] Update Open MPI for new libibverbs API
[It's somewhat annoying to have to subscribe to de...@open-mpi.org just to be able to send patches, but oh well...]

This patch updates Open MPI for the new ibv_create_cq() API.

Signed-off-by: Roland Dreier

--- ompi/mca/btl/openib/btl_openib.c	(revision 7507)
+++ ompi/mca/btl/openib/btl_openib.c	(working copy)
@@ -656,7 +656,8 @@ int mca_btl_openib_module_init(mca_btl_o
     }
 
     /* Create the low and high priority queue pairs */
-    openib_btl->ib_cq_low = ibv_create_cq(ctx, mca_btl_openib_component.ib_cq_size, NULL);
+    openib_btl->ib_cq_low = ibv_create_cq(ctx, mca_btl_openib_component.ib_cq_size,
+                                          NULL, NULL, 0);
 
     if(NULL == openib_btl->ib_cq_low) {
         BTL_ERROR(("error creating low priority cq for %s errno says %s\n",
@@ -665,7 +666,8 @@ int mca_btl_openib_module_init(mca_btl_o
         return OMPI_ERROR;
     }
 
-    openib_btl->ib_cq_high = ibv_create_cq(ctx, mca_btl_openib_component.ib_cq_size, NULL);
+    openib_btl->ib_cq_high = ibv_create_cq(ctx, mca_btl_openib_component.ib_cq_size,
+                                           NULL, NULL, 0);
 
     if(NULL == openib_btl->ib_cq_high) {
         BTL_ERROR(("error creating high priority cq for %s errno says %s\n",
Re: [O-MPI devel] p2p linpack ---
Well, looks like we found the problem. The memory free callback was incorrect: we were just looking up the base address of the free in the tree. Here is why this didn't work (this probably wouldn't compile, but it works as an explanation):

    buf = malloc(40*1024);  /* malloc 10 pages */

    /* Send the second half of the buffer to the peer.  Note that
       leave_pinned will register and cache only what it sees in the
       send. */
    MPI_Send(buf + 5*4*1024, 5*4*1024, ...);

    /* Free the buffer.  MPI will try to find the registration in the
       tree based on the address, buf, but won't find it, so the
       registration remains. */
    free(buf);

Since the registration is left in the tree, a future malloc may obtain a virtual address that is within the base and bound of that registration. When this memory is later freed, we try to deregister the entire registration, part of which might be in use by another buffer; it could even be in the middle of an RDMA operation.

Anyway, I have modified the code and we are now passing a smaller linpack run with leave_pinned and the mem hooks enabled, without using any mallopt trickiness.

Thanks,
Galen

On Sep 25, 2005, at 10:58 AM, Galen M. Shipman wrote:

> Well, after adding a bunch of debugging output, I have found the
> following. With both leave_pinned and use_mem_hook enabled on a
> linpack run, we get the assertion error on the memory callback. That
> is to say, there is a free occurring in the middle of a registration.
> At the point of the assert we have NOT resized any registrations. The
> existing registrations in the tree are:
>
>     Base         Bound        Length
>     241615360    244841607    3226248
>     244841608    246428807    1587200
>     246428808    248016007    1587200
>     248019648    251245895    3226248
>
> Trying to free 247917216, from:
>
>     Base         Bound
>     246428808    248016007
>
> When we get the assert, we are trying to free 247917216, which is in
> the middle of the registration. Note we have NOT resized any
> registrations, so I am confident there is not an issue with either the
> tree or the resize, at least as far as linpack is concerned.
>
> Here is the callstack:
>
> #0  0x002a95f079c9 in raise () from /lib/libc.so.6
> #1  0x002a95f08e6e in abort () from /lib/libc.so.6
> #2  0x002a95f01690 in __assert_fail () from /lib/libc.so.6
> #3  0x002a9571b200 in mca_mpool_base_mem_cb (base=0xec6eaa0, size=31624, cbdata=0x0) at mpool_base_mem_cb.c:53
> #4  0x002a9587fe0d in opal_mem_free_release_hook (buf=0xec6eaa0, length=31624) at memory.c:121
> #5  0x002a9588bd12 in opal_mem_free_free_hook (ptr=0xec6eaa0, caller=0x42b052) at memory_malloc_hooks.c:66
> #6  0x0042b052 in ATL_dmmIJK ()
> #7  0x0064f9b1 in ATL_dgemmNN ()
> #8  0x0057722b in ATL_dgemmNN_RB ()
> #9  0x00577fc3 in ATL_rtrsmRUN ()
> #10 0x0042c63c in ATL_dtrsm ()
> #11 0x00423c1e in atl_f77wrap_dtrsm__ ()
> #12 0x00423a94 in dtrsm_ ()
> #13 0x00411192 in HPL_dtrsm (ORDER=17933, SIDE=17933, UPLO=8, TRANS=4294967295, DIAG=0, M=23458672, N=0, ALPHA=1, A=0x7fbfffefa0, LDA=0, B=0x202, LDB=0) at HPL_dtrsm.c:949
> #14 0x0040cfb6 in HPL_pdupdateTT (PBCST=0x0, IFLAG=0x0, PANEL=0x165f040, NN=-1) at HPL_pdupdateTT.c:362
> #15 0x0041936f in HPL_pdgesvK2 (GRID=0x7fb4a0, ALGO=0x7fb460, A=0x7fb260) at HPL_pdgesvK2.c:178
> #16 0x0040d6f7 in HPL_pdgesv (GRID=0x7fb4a0, ALGO=0x460d, A=0x7fb260) at HPL_pdgesv.c:107
> #17 0x00405b10 in HPL_pdtest (TEST=0x7fb430, GRID=0x7fb4a0, ALGO=0x7fb460, N=1, NB=80) at HPL_pdtest.c:193
> #18 0x00401840 in main (ARGC=1, ARGV=0x7fb928) at HPL_pddriver.c:223
>
> Note that the free occurs in the ATLAS libraries; I will look into
> rebuilding linpack with another BLAS library to see what happens. Any
> other suggestions?
>
> Thanks,
> Galen