Re: [OMPI devel] Segment Faults in MPI_INIT
Karl --

Yikes. This looks like an alignment or memory write ordering kind of error; I have a dim recollection of doing some fixes for this, but I am on a plane at the moment and cannot check the SVN logs. Could you try the latest 1.1.2 RC and see if the problem still occurs for you? It's available on the general download page on the web site.

Thanks!

On Oct 7, 2006, at 7:34 PM, Karl Dockendorf wrote:

I just (yesterday) made the move from LAM/MPI to Open MPI. The configure / compile / install went smoothly (version 1.1.1). However, after recompiling my source, executing it usually crashes in MPI_INIT. It seems to be coming from the same place most of the time, and usually spits out a message something like this:

Signal:10 info.si_errno:0(Unknown error: 0) si_code:1(BUS_ADRALN)
Failing at addr:0xfdff8018
*** End of error message ***
Signal:10 info.si_errno:0(Unknown error: 0) si_code:1(BUS_ADRALN)
Failing at addr:0x2807000
*** End of error message ***

The test system (before moving back to the cluster) is a G4 PowerBook with OS 10.4.8 (not using Xgrid at the moment). I'm oversubscribing it (2 processes; it knows there is only one). Attached is the config info from the install, and listed below is what appears to be the crash point, in the mca_bml_r2_progress function. Any help is much appreciated.

Karl

CRASH 1:

Command: nm
Path:    /Users/karl/programs/nm/build/Release/nm
Parent:  orted [830]
Version: ??? (???)
PID:     834
Thread:  0

Exception: EXC_BAD_ACCESS (0x0001)
Codes:     KERN_INVALID_ADDRESS (0x0001) at 0xfdff8018

Thread 0 Crashed:
 0  mca_btl_sm.so     0x003abbec  mca_btl_sm_component_progress + 3164
 1  mca_bml_r2.so     0x003a0d38  mca_bml_r2_progress + 88
 2  libopal.0.dylib   0x0032309c  opal_progress + 236
 3  mca_oob_tcp.so    0x00024f14  mca_oob_tcp_msg_wait + 52
 4  mca_oob_tcp.so    0x0002a0a8  mca_oob_tcp_recv + 1128
 5  liborte.0.dylib   0x002f07b0  mca_oob_recv_packed + 80
 6  mca_gpr_proxy.so  0x00059bd4  orte_gpr_proxy_put + 804
 7  liborte.0.dylib   0x00304318  orte_soh_base_set_proc_soh + 968
 8  libmpi.0.dylib    0x00222d88  ompi_mpi_init + 1816
 9  libmpi.0.dylib    0x00248b50  MPI_Init + 240
10  nm                0x2e60      init_model + 48
11  nm                0x2c70      main + 48
12  nm                0x2494      _start + 340 (crt.c:272)
13  nm                0x233c      start + 60

Thread 0 crashed with PPC Thread State 64:
  srr0: 0x003abbec  srr1: 0x0200f930  vrsave: 0x
    cr: 0x28004222   xer: 0x0004       lr: 0x003aafa0   ctr: 0x003aaf90
    r0: 0x           r1: 0xbfffe8d0    r2: 0xfdff8000   r3: 0x0001
    r4: 0x00049814   r5: 0xbfffe888    r6: 0x           r7: 0xfdff8000
    r8: 0x0004       r9: 0x004177e0   r10: 0x0004      r11: 0x
   r12: 0x003aaf90  r13: 0xfffe       r14: 0x003ad004  r15: 0x003441e8
   r16: 0x003ad8c4  r17: 0x0004       r18: 0x          r19: 0x
   r20: 0x0014      r21: 0x           r22: 0x003ae0c4  r23: 0x0001
   r24: 0x          r25: 0x0004       r26: 0x00029c50  r27: 0x
   r28: 0x          r29: 0x0001       r30: 0x          r31: 0x003aafa0

CRASH 2:

Command: nm
Path:    /Users/karl/programs/nm/build/Release/nm
Parent:  orted [830]
Version: ??? (???)
PID:     832
Thread:  0

Exception: EXC_BAD_ACCESS (0x0001)
Codes:     KERN_PROTECTION_FAILURE (0x0002) at 0x

Thread 0 Crashed:
 0  <<>>              0x          0 + 0
 1  mca_bml_r2.so     0x003a0d38  mca_bml_r2_progress + 88
 2  libopal.0.dylib   0x0032309c  opal_progress + 236
 3  mca_oob_tcp.so    0x00024f14  mca_oob_tcp_msg_wait + 52
 4  mca_oob_tcp.so    0x0002a0a8  mca_oob_tcp_recv + 1128
 5  liborte.0.dylib   0x002f07b0  mca_oob_recv_packed + 80
 6  mca_gpr_proxy.so  0x00059bd4  orte_gpr_proxy_put + 804
 7  liborte.0.dylib   0x00304318  orte_soh_base_set_proc_soh + 968
 8  libmpi.0.dylib    0x00222d88  ompi_mpi_init + 1816
 9  libmpi.0.dylib    0x00248b50  MPI_Init + 240
10  nm                0x2e60      init_model + 48
11  nm                0x2c70      main + 48
12  nm                0x2494      _start + 340 (crt.c:272)
13  nm                0x233c      start + 60

Thread 0 crashed with PPC Thread State 64:
  srr0: 0x          srr1: 0x4000d930  vrsave: 0x
    cr: 0x28004222   xer: 0x0004       lr: 0x003abe5c   ctr: 0x
    r0: 0x           r1: 0xbfffe8d0    r2: 0x02008000   r3: 0x003ad864
    r4: 0x           r5: 0x02008000    r6: 0x           r7: 0x02008000
    r8: 0x000
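(For context: si_code BUS_ADRALN is the SIGBUS code for "invalid address alignment". A minimal sketch of the class of bug suspected above -- not OMPI code; the function names are made up for illustration. On strict-alignment targets, which includes some PPC access paths, a load through a pointer that is not naturally aligned for its type can fault exactly like this:)

#include <stdint.h>
#include <string.h>

/* Unsafe: if buf + 1 is not 8-byte aligned, this load can raise
 * SIGBUS (BUS_ADRALN) on strict-alignment hardware, even though the
 * same code runs silently on x86. */
uint64_t read_u64_unsafe(const char *buf)
{
    return *(const uint64_t *)(buf + 1);
}

/* Portable fix: memcpy obliges the compiler to emit alignment-safe
 * accesses, at no cost on architectures that tolerate the cast. */
uint64_t read_u64_safe(const char *buf)
{
    uint64_t v;
    memcpy(&v, buf + 1, sizeof(v));
    return v;
}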
Re: [OMPI devel] MPI_XXX_{get|set}_errhandler in general, and for files in particular
On Oct 9, 2006, at 8:41 AM, Lisandro Dalcin wrote:

Looking at the MPI-2 errata document, http://www.mpi-forum.org/docs/errata-20-2.html, it says:

Page 61, after line 36. Add the following (paralleling the errata to MPI-1.1): MPI_{COMM,WIN,FILE}_GET_ERRHANDLER behave as if a new error handler object is created. That is, once the error handler is no longer needed, MPI_ERRHANDLER_FREE should be called with the error handler returned from MPI_ERRHANDLER_GET or MPI_{COMM,WIN,FILE}_GET_ERRHANDLER to mark the error handler for deallocation. This provides behavior similar to that of MPI_COMM_GROUP and MPI_GROUP_FREE.

Well, it seems that OMPI does not currently follow this specification. Any plans to change this? Or will it not go in?

I'm not sure what you mean here -- OMPI currently increases the reference count on the errhandlers returned by COMM|WIN|FILE_GET_ERRHANDLER (ERRHANDLER_GET is a synonym for COMM_GET_ERRHANDLER). So when you call ERRHANDLER_FREE, it decreases the refcount, and if the refcount is 0, it actually frees the error handler (the user's handle is always set to ERRHANDLER_NULL, regardless of whether the reference count went to 0 or not).

Remember, too, that all communications increase the refcount on the associated communicator's errhandler. So even if you ERRHANDLER_FREE an errhandler, if it's still associated with an ongoing communication, the back-end object won't be freed right away.

Can you cite a specific example of what you're trying to do and how OMPI is doing it wrong?

Additionally, I've noted that MPI_File_get_errhandler fails with MPI_ERR_FILE if the passed file handle is MPI_FILE_NULL. However, I understand (according to the standard) that this is the handle to query to get/set/reset the default error handler for new files... I think MPI_File_{get|set}_errhandler should accept the MPI_FILE_NULL handle. Am I right?

By MPI-2:9.7, you are exactly correct. OMPI currently allows MPI_FILE_SET_ERRHANDLER(MPI_FILE_NULL, ...) (there's even an explicit reference to MPI-2:9.7 in a comment in the source), but it looks like an oversight that we don't allow MPI_FILE_GET_ERRHANDLER(MPI_FILE_NULL, ...). I will fix. Thanks!

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
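(For reference, a minimal sketch of the usage pattern the errata mandates, plus the MPI_FILE_NULL case discussed above; standard MPI-2 calls only, nothing OMPI-specific:)

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Errhandler eh;

    MPI_Init(&argc, &argv);

    /* Errata: GET_ERRHANDLER behaves as if a new error handler object
     * were created, so the caller must free it -- analogous to
     * MPI_Comm_group / MPI_Group_free. */
    MPI_Comm_get_errhandler(MPI_COMM_WORLD, &eh);
    /* ... use eh ... */
    MPI_Errhandler_free(&eh);   /* eh becomes MPI_ERRHANDLER_NULL; the
                                 * back-end object is freed once its
                                 * refcount reaches zero */

    /* MPI-2:9.7: MPI_FILE_NULL is the handle through which the default
     * error handler for newly opened files is set and queried. */
    MPI_File_set_errhandler(MPI_FILE_NULL, MPI_ERRORS_RETURN);
    MPI_File_get_errhandler(MPI_FILE_NULL, &eh);  /* the call OMPI 1.1
                                                   * wrongly rejects */
    MPI_Errhandler_free(&eh);

    MPI_Finalize();
    return 0;
}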
Re: [OMPI devel] Something broken using Persistent Requests
Please do not feel bad about reporting problems -- despite the fact that it creates more work for us, it makes OMPI a better package. So keep 'em coming!

Is there a way that you can share your code so that we can see what is happening? I looked through the code for MPI_WAIT and MPI_STARTALL and they seem to be doing the Right Things, at least in terms of the persistent requests.

If you're getting error -105, it looks like we're not converting it to a proper MPI error code before returning it to you (-105 == OMPI_ERR_REQUEST, but it should be converted to MPI_ERR_REQUEST before it is returned). I'll poke around to see what's happening here.

On Oct 12, 2006, at 8:33 PM, Lisandro Dalcin wrote:

I am getting errors using persistent communications (OMPI 1.1.1). I am trying to implement (in Python) example 2.32 from page 107 of MPI - The Complete Reference (Vol. 1, 2nd edition). I think the problem is not in my wrappers (my script works fine with MPICH2). Below are the two issues:

1 - MPI_Startall fails, returning a negative error code, -105, which seems to be out of the valid range [MPI_SUCCESS...MPI_ERR_LASTCODE]. However, doing 'for r in reqlist: r.Start()' works.

2 - And then, calling MPI_Waitall (or even iterating over the request array and calling MPI_Wait), the requests seem to be deallocated (I get MPI_REQUEST_NULL upon return), so I cannot start them again. I understand this is wrong: the request handles should be marked as inactive, but not for deallocation.

Please ignore me if this was already reported. I am really busy and have not found the time to navigate the OMPI sources to get in touch with their internals, so I am always reporting problems and never patches. Sorry!

--
Lisandro Dalcín
Centro Internacional de Métodos Computacionales en Ingeniería (CIMEC)
Instituto de Desarrollo Tecnológico para la Industria Química (INTEC)
Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET)
PTLC - Güemes 3450, (3000) Santa Fe, Argentina
Tel/Fax: +54-(0)342-451.1594

--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems
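(For reference, a minimal C rendering of the pattern in question -- persistent requests started with MPI_Startall, completed with MPI_Waitall, then restarted in a loop -- with a check for the out-of-range return code. This is a sketch of the behavior the standard requires, not a reproduction of the Python script:)

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, sbuf = 42, rbuf = 0;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) {            /* run with exactly 2 processes */
        MPI_Finalize();
        return 1;
    }

    /* Return error codes instead of aborting, so a failing
     * MPI_Startall is observable. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int peer = 1 - rank;
    MPI_Send_init(&sbuf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Recv_init(&rbuf, 1, MPI_INT, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    for (int i = 0; i < 3; ++i) {
        int rc = MPI_Startall(2, reqs);
        if (rc != MPI_SUCCESS) {
            /* A conforming implementation returns a code in
             * [MPI_SUCCESS, MPI_ERR_LASTCODE]; the -105 reported
             * above is OMPI's internal OMPI_ERR_REQUEST leaking out
             * unconverted. */
            fprintf(stderr, "MPI_Startall returned %d\n", rc);
            break;
        }
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
        /* Completion marks persistent requests INACTIVE; the handles
         * must NOT become MPI_REQUEST_NULL here, or the next
         * MPI_Startall has nothing to start. */
    }

    MPI_Request_free(&reqs[0]);
    MPI_Request_free(&reqs[1]);
    MPI_Finalize();
    return 0;
}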