Re: [OMPI devel] Segment Faults in MPI_INIT

2006-10-14 Thread Jeff Squyres

Karl --

Yikes.  This looks like an alignment or memory write ordering kind of  
error; I have a dim recollection about doing some fixes for this, but  
am on a plane at the moment and cannot check the SVN logs.
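For context, BUS_ADRALN (seen in the output below) is the si_code for a bus error caused by an unaligned access.  A contrived, non-OMPI illustration of the kind of access that can produce it -- whether it actually traps depends on the CPU and on the access type:

  /* Deliberately misaligned 64-bit access; may be fixed up silently
     or may be delivered as SIGBUS with si_code BUS_ADRALN. */
  #include <stdint.h>

  int main(void)
  {
      char buf[sizeof(uint64_t) + 1];
      uint64_t *p = (uint64_t *)(buf + 1);   /* not 8-byte aligned */
      *p = 42;                               /* unaligned store */
      return (int)*p;                        /* unaligned load  */
  }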


Could you try the latest 1.1.2 RC and see if the problem still occurs  
for you?  It's available on the general download page on the web site.


Thanks!


On Oct 7, 2006, at 7:34 PM, Karl Dockendorf wrote:

I just (yesterday) made the move from LAM/MPI to Open MPI.  The
configure / compile / install went smoothly (version 1.1.1).
However, after recompiling my source, executing it usually crashes in
MPI_INIT.  The crash seems to come from the same place most of the
time, and usually spits out a message something like this:


Signal:10 info.si_errno:0(Unknown error: 0) si_code:1(BUS_ADRALN)
Failing at addr:0xfdff8018
*** End of error message ***
Signal:10 info.si_errno:0(Unknown error: 0) si_code:1(BUS_ADRALN)
Failing at addr:0x2807000
*** End of error message ***

The test system (before moving back to the cluster) is a G4 PowerBook
with OS X 10.4.8 (not using Xgrid at the moment).  I'm oversubscribing
it (2 processes; Open MPI knows there is only one CPU).  Attached is
the config info from the install, and listed below is what appears to
be the crash point, in the mca_bml_r2_progress function.  Any help is
much appreciated.


Karl

CRASH 1:
Command:   nm
Path:      /Users/karl/programs/nm/build/Release/nm
Parent:    orted [830]

Version:   ??? (???)

PID:       834
Thread:    0

Exception: EXC_BAD_ACCESS (0x0001)
Codes:     KERN_INVALID_ADDRESS (0x0001) at 0xfdff8018

Thread 0 Crashed:
0   mca_btl_sm.so   0x003abbec mca_btl_sm_component_progress + 3164
1   mca_bml_r2.so   0x003a0d38 mca_bml_r2_progress + 88
2   libopal.0.dylib 0x0032309c opal_progress + 236
3   mca_oob_tcp.so  0x00024f14 mca_oob_tcp_msg_wait + 52
4   mca_oob_tcp.so  0x0002a0a8 mca_oob_tcp_recv + 1128
5   liborte.0.dylib 0x002f07b0 mca_oob_recv_packed + 80
6   mca_gpr_proxy.so    0x00059bd4 orte_gpr_proxy_put + 804
7   liborte.0.dylib 0x00304318 orte_soh_base_set_proc_soh + 968
8   libmpi.0.dylib  0x00222d88 ompi_mpi_init + 1816
9   libmpi.0.dylib  0x00248b50 MPI_Init + 240
10  nm  0x2e60 init_model + 48
11  nm  0x2c70 main + 48
12  nm  0x2494 _start + 340 (crt.c:272)
13  nm  0x233c start + 60

Thread 0 crashed with PPC Thread State 64:
  srr0: 0x003abbec  srr1: 0x0200f930  vrsave: 0x
    cr: 0x28004222   xer: 0x0004        lr: 0x003aafa0   ctr: 0x003aaf90
    r0: 0x           r1: 0xbfffe8d0     r2: 0xfdff8000    r3: 0x0001
    r4: 0x00049814   r5: 0xbfffe888     r6: 0x            r7: 0xfdff8000
    r8: 0x0004       r9: 0x004177e0    r10: 0x0004       r11: 0x
   r12: 0x003aaf90  r13: 0xfffe        r14: 0x003ad004   r15: 0x003441e8
   r16: 0x003ad8c4  r17: 0x0004        r18: 0x           r19: 0x
   r20: 0x0014      r21: 0x            r22: 0x003ae0c4   r23: 0x0001
   r24: 0x          r25: 0x0004        r26: 0x00029c50   r27: 0x
   r28: 0x          r29: 0x0001        r30: 0x           r31: 0x003aafa0




CRASH 2:
Command:   nm
Path:      /Users/karl/programs/nm/build/Release/nm
Parent:    orted [830]

Version:   ??? (???)

PID:       832
Thread:    0

Exception: EXC_BAD_ACCESS (0x0001)
Codes:     KERN_PROTECTION_FAILURE (0x0002) at 0x

Thread 0 Crashed:
0   <<>>0x 0 + 0
1   mca_bml_r2.so   0x003a0d38 mca_bml_r2_progress + 88
2   libopal.0.dylib 0x0032309c opal_progress + 236
3   mca_oob_tcp.so  0x00024f14 mca_oob_tcp_msg_wait + 52
4   mca_oob_tcp.so  0x0002a0a8 mca_oob_tcp_recv + 1128
5   liborte.0.dylib 0x002f07b0 mca_oob_recv_packed + 80
6   mca_gpr_proxy.so    0x00059bd4 orte_gpr_proxy_put + 804
7   liborte.0.dylib 0x00304318 orte_soh_base_set_proc_soh + 968
8   libmpi.0.dylib  0x00222d88 ompi_mpi_init + 1816
9   libmpi.0.dylib  0x00248b50 MPI_Init + 240
10  nm  0x2e60 init_model + 48
11  nm  0x2c70 main + 48
12  nm  0x2494 _start + 340 (crt.c:272)
13  nm  0x233c start + 60

Thread 0 crashed with PPC Thread State 64:
  srr0: 0x          srr1: 0x4000d930  vrsave: 0x
    cr: 0x28004222   xer: 0x0004        lr: 0x003abe5c   ctr: 0x
    r0: 0x           r1: 0xbfffe8d0     r2: 0x02008000    r3: 0x003ad864
    r4: 0x           r5: 0x02008000     r6: 0x            r7: 0x02008000
    r8: 0x000

Re: [OMPI devel] MPI_XXX_{get|set}_errhandler in general, and for files in particular

2006-10-14 Thread Jeff Squyres

On Oct 9, 2006, at 8:41 AM, Lisandro Dalcin wrote:


Looking at the MPI-2 errata document,
http://www.mpi-forum.org/docs/errata-20-2.html, it says:

Page 61, after line 36. Add the following (paralleling the errata  
to MPI-1.1):


MPI_{COMM,WIN,FILE}_GET_ERRHANDLER behave as if a new error handler
object is created. That is, once the error handler is no longer
needed, MPI_ERRHANDLER_FREE should be called with the error handler
returned from MPI_ERRHANDLER_GET or MPI_{COMM,WIN,FILE}_GET_ERRHANDLER
to mark the error handler for deallocation. This provides behavior
similar to that of MPI_COMM_GROUP and MPI_GROUP_FREE.

Well, it seems that OMPI does not currently follow this specification.
Are there any plans to change this? Or will it not go in?


I'm not sure what you mean here -- OMPI currently increases the
reference count on the errhandlers returned by
COMM|WIN|FILE_GET_ERRHANDLER (ERRHANDLER_GET is a synonym for
COMM_GET_ERRHANDLER).  So when you call ERRHANDLER_FREE, it decreases
the refcount, and if the refcount reaches 0, it actually frees the
error handler (the user's handle is always set to ERRHANDLER_NULL,
regardless of whether the reference count went to 0 or not).
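
A minimal sketch of the get/free pattern being described (ordinary MPI
user code, not OMPI internals; the variable names are just for
illustration):

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_Errhandler eh;

      MPI_Init(&argc, &argv);

      /* Behaves as if a new errhandler object were created, so ... */
      MPI_Comm_get_errhandler(MPI_COMM_WORLD, &eh);

      /* ... the handle must eventually be released.  This sets eh to
         MPI_ERRHANDLER_NULL; the back-end object is only destroyed
         once its reference count drops to zero. */
      MPI_Errhandler_free(&eh);

      MPI_Finalize();
      return 0;
  }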


Remember, too, that all communications increase the refcount on the  
associated communicator's errhandler.  So even if you ERRHANDLER_FREE  
an errhandler, if it's still associated with an ongoing  
communication, the back-end object won't be freed right away.


Can you cite a specific example of what you're trying to do and how  
OMPI is doing it wrong?



Additionally, I've noted that MPI_File_get_errhandler fails with
MPI_ERR_FILE if the passed file handle is MPI_FILE_NULL. However, I
understand (per the standard) that this is the handle to use to
get/set/reset the default error handler for new files... I think
MPI_File_{get|set}_errhandler should accept the MPI_FILE_NULL handle.
Am I right?


By MPI-2:9.7, you are exactly correct.  OMPI currently allows  
MPI_FILE_SET_ERRHANDLER(MPI_FILE_NULL, ...) (there's even an explicit  
reference to MPI-2:9.7 in a comment in the source), but it looks like  
an oversight that we don't allow
MPI_FILE_GET_ERRHANDLER(MPI_FILE_NULL, ...).  I will fix.
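
A minimal sketch of the MPI-2:9.7 usage in question (the SET call
already works in OMPI; the GET call is the one that currently fails):

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      MPI_Errhandler eh;

      MPI_Init(&argc, &argv);

      /* Set the default error handler inherited by files opened later. */
      MPI_File_set_errhandler(MPI_FILE_NULL, MPI_ERRORS_RETURN);

      /* Per the standard this should return that default handler;
         OMPI 1.1.x instead rejects MPI_FILE_NULL with MPI_ERR_FILE. */
      MPI_File_get_errhandler(MPI_FILE_NULL, &eh);
      MPI_Errhandler_free(&eh);

      MPI_Finalize();
      return 0;
  }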


Thanks!

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems



Re: [OMPI devel] Something broken using Persistent Requests

2006-10-14 Thread Jeff Squyres
Please do not feel bad about reporting problems -- despite the fact
that it creates more work for us, it makes OMPI a better package.
So keep 'em coming!


Is there a way that you can share your code so that we can see what  
is happening?  I looked through the code for MPI_WAIT and  
MPI_STARTALL and they seem to be doing the Right Things, at least in  
terms of the persistent requests.


If you're getting error -105, it looks like we're not converting this  
to a proper MPI error code before returning it to you (-105 ==  
OMPI_ERR_REQUEST, but it should be converted to MPI_ERR_REQUEST  
before it is returned).  I'll poke around to see what's happening here.
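
For reference, this is roughly how a caller would normally decode an
error return -- and why a raw internal code such as -105, which lies
outside [MPI_SUCCESS, MPI_ERR_LASTCODE], is confusing (a sketch only,
with an assumed helper name):

  #include <stdio.h>
  #include <mpi.h>

  void report_mpi_error(int rc)
  {
      char msg[MPI_MAX_ERROR_STRING];
      int len, err_class;

      if (rc == MPI_SUCCESS)
          return;
      MPI_Error_class(rc, &err_class);   /* e.g. MPI_ERR_REQUEST  */
      MPI_Error_string(rc, msg, &len);   /* human-readable text   */
      fprintf(stderr, "MPI error %d (class %d): %s\n", rc, err_class, msg);
  }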




On Oct 12, 2006, at 8:33 PM, Lisandro Dalcin wrote:


I am getting errors using persistent communications (OMPI 1.1.1). I am
trying to implement (in Python) example 2.32 from page 107 of MPI - The
Complete Reference (Vol. 1, 2nd edition).

I think the problem is not in my wrappers (my script works fine with
MPICH2). Below are the two issues:

1 - MPI_Startall fails (returning a negative error code, -105, which
in fact seems to be out of the range [MPI_SUCCESS...MPI_ERR_LASTCODE]).
However, doing 'for r in reqlist: r.Start()' works.

2 - And then, after calling MPI_Waitall (or even iterating over the
request array and calling MPI_Wait), the requests seem to be
deallocated (I get MPI_REQUEST_NULL upon return), so I cannot start
them again. I understand this is wrong: the request handles should be
marked as inactive, but not marked for deallocation.
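
A minimal C sketch of the pattern being exercised (the actual report is
against Python wrappers over these calls; the buffers and ring partners
are illustrative):

  #include <mpi.h>

  int main(int argc, char **argv)
  {
      int rank, size, iter;
      double sendbuf = 0.0, recvbuf = 0.0;
      MPI_Request reqs[2];

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      /* Persistent send/recv with the next/previous rank in a ring. */
      MPI_Send_init(&sendbuf, 1, MPI_DOUBLE, (rank + 1) % size, 0,
                    MPI_COMM_WORLD, &reqs[0]);
      MPI_Recv_init(&recvbuf, 1, MPI_DOUBLE, (rank + size - 1) % size, 0,
                    MPI_COMM_WORLD, &reqs[1]);

      for (iter = 0; iter < 3; iter++) {
          sendbuf = rank + iter;
          MPI_Startall(2, reqs);                      /* issue 1: returns -105   */
          MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* issue 2: requests come
                                                         back as MPI_REQUEST_NULL
                                                         instead of inactive     */
      }

      MPI_Request_free(&reqs[0]);
      MPI_Request_free(&reqs[1]);
      MPI_Finalize();
      return 0;
  }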

Please ignore me if this was already reported. I am really busy and
have not found the time to navigate the OMPI sources and get familiar
with their internals, so I am always reporting problems, and never
sending patches. Sorry!

--
Lisandro Dalcín
---
Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
PTLC - Güemes 3450, (3000) Santa Fe, Argentina
Tel/Fax: +54-(0)342-451.1594

___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems