Re: [OMPI devel] SM init failures
Jeff Squyres wrote:

> FWIW, George found what looks like a race condition in the sm init code
> today -- it looks like we don't call maffinity anywhere in the sm btl
> startup, so we're not actually guaranteed that the memory is local to any
> particular process(or) (!).  This race shouldn't cause segvs, though; it
> should only mean that memory is potentially farther away than we intended.

Is this that business that came up recently on one of these mail lists about setting the memory node to -1 rather than using the value we know it should be?  In mca_mpool_sm_alloc(), I do see a call to opal_maffinity_base_bind().

The central question is: does "first touch" mean both read and write?  I.e., is the first process that either reads *or* writes to a given location considered to have "first touched" it?  Or is it only the first write?

So, maybe the strategy is to create the shared area, have each process initialize its portion (FIFOs and free lists), have all processes sync, and then move on.  That way, all memory will have been written by its appropriate owner before it is read by anyone else; first-touch ownership will be correct and we won't depend on zero-filled pages.

The big question in my mind remains that we don't seem to know how to reproduce the failure (segv) that we're trying to fix.  I, personally, am reluctant to stick fixes into the code for problems I can't observe.
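To make the proposed sequence concrete, here is a minimal, self-contained sketch of that pattern (this is not Open MPI code; the process count, slice size, file-backed segment, and naive spin-wait sync are all assumptions made purely for the example):

    /* first_touch_sketch.c -- illustrative only, not Open MPI code.
     * Each of NPROCS processes mmaps the same file, writes (first-touches)
     * only its own slice, and only then signals readiness; nobody reads a
     * peer's slice until every slice has been written by its owner. */
    #include <fcntl.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define NPROCS     4
    #define SLICE_SIZE (1 << 20)        /* 1 MiB per process (arbitrary) */

    int first_touch_init(const char *segment_path, int my_rank)
    {
        size_t total = (size_t)NPROCS * SLICE_SIZE + NPROCS * sizeof(uint32_t);
        int fd = open(segment_path, O_RDWR | O_CREAT, 0600);
        if (fd < 0) return -1;
        if (ftruncate(fd, (off_t)total) != 0) { close(fd); return -1; }

        char *base = mmap(NULL, total, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        if (base == MAP_FAILED) return -1;

        char *my_slice = base + (size_t)my_rank * SLICE_SIZE;
        volatile uint32_t *ready =
            (volatile uint32_t *)(base + (size_t)NPROCS * SLICE_SIZE);

        /* First touch: the owner writes every page of its own slice. */
        memset(my_slice, 0, SLICE_SIZE);

        __sync_synchronize();           /* make the writes visible ...   */
        ready[my_rank] = 1;             /* ... before publishing "done"  */

        /* Crude sync: wait until every slice has been touched by its owner. */
        for (int r = 0; r < NPROCS; ++r)
            while (ready[r] == 0)
                ;                       /* spin; a real implementation would back off */

        /* Only now is it safe to read peers' slices. */
        return 0;
    }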
Re: [OMPI devel] SM init failures
Sorry to continue off-topic, but going to System V shm would, for me, be like going back in the past.  System V shared memory used to be the main way to do shared memory in MPICH and, from my (little) experience, it was truly painful:

 - Cleanup issues: does shmctl(IPC_RMID) solve _all_ cases?  (even kill -9?)
 - Naming issues: shm segments are identified by a 32-bit key, potentially
   causing conflicts between applications, or between layers of the same
   application, on one node.
 - Space issues: the total shm size on a system is bounded by
   /proc/sys/kernel/shmmax, needing admin configuration and causing
   conflicts between MPI applications running on the same node.

Mmap'ed files can have a meaningful name, preventing naming issues.  If we are on Linux, they can be allocated in /dev/shm to prevent filesystem traffic, and space is not limited in the same way.

Sylvain

On Mon, 30 Mar 2009, Tim Mattox wrote:

> I've been lurking on this conversation, and I am again left with the
> impression that the underlying shared memory configuration based on
> sharing a file is flawed.  Why not use a System V shared memory segment
> without a backing file, as I described in ticket #1320?
>
> On Mon, Mar 30, 2009 at 1:34 PM, George Bosilca wrote:
>
>> Then it looks like the safest solution is to use either ftruncate or the
>> lseek method and then touch the first byte of every memory page.
>> Unfortunately, I see two problems with this.  First, there is a clear
>> performance hit on startup time.  And second, we will have to find a
>> pretty smart way to do this or we will completely break the memory
>> affinity stuff.
>>
>> george.
>>
>> On Mar 30, 2009, at 13:24, Iain Bason wrote:
>>
>>> On Mar 30, 2009, at 12:05 PM, Jeff Squyres wrote:
>>>
>>>> But don't we need the whole area to be zero filled?
>>>
>>> It will be zero-filled on demand using the lseek/touch method.  However,
>>> the OS may not reserve space for the skipped pages or disk blocks.  Thus
>>> one could still get out-of-memory or file-system-full errors at
>>> arbitrary points.  Presumably one could also get segfaults from an
>>> mmap'ed segment whose pages couldn't be allocated when the demand came.
>>>
>>> Iain
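As a concrete illustration of the two sizing approaches being compared (ftruncate alone vs. ftruncate plus touching each page), here is a hedged sketch; the file name and sizes are invented, and this is not what the sm mpool actually does:

    /* mmap_devshm_sketch.c -- illustrative only.  Creates a shared backing
     * file on Linux, sizes it with ftruncate(), and optionally touches one
     * byte per page so every page is really allocated up front. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static void *create_shared_file(const char *path, size_t size, int touch_pages)
    {
        int fd = open(path, O_RDWR | O_CREAT, 0600);  /* e.g. "/dev/shm/ompi-example" */
        if (fd < 0) return NULL;

        /* Reserve the logical size.  With ftruncate alone the pages are
         * allocated lazily, so a later page fault can still fail if space
         * runs out -- that is the risk Iain describes. */
        if (ftruncate(fd, (off_t)size) != 0) { close(fd); return NULL; }

        void *base = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        if (base == MAP_FAILED) return NULL;

        if (touch_pages) {
            /* Touch the first byte of every page so allocation failures show
             * up here, at startup, rather than at some arbitrary later point.
             * This is the startup cost George mentions -- and if a single
             * process touches every page, it also becomes the first-toucher
             * of all of them, which is exactly the memory-affinity concern. */
            long page = sysconf(_SC_PAGESIZE);
            for (size_t off = 0; off < size; off += (size_t)page)
                ((volatile char *)base)[off] = 0;
        }
        return base;
    }

    int main(void)
    {
        void *seg = create_shared_file("/dev/shm/ompi-example-seg", 1 << 22, 1);
        printf("segment at %p\n", seg);
        return seg == NULL;
    }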
Re: [OMPI devel] SM init failures
On Mar 31, 2009, at 1:46 AM, Eugene Loh wrote:

>> FWIW, George found what looks like a race condition in the sm init code
>> today -- it looks like we don't call maffinity anywhere in the sm btl
>> startup, so we're not actually guaranteed that the memory is local to any
>> particular process(or) (!).  This race shouldn't cause segvs, though; it
>> should only mean that memory is potentially farther away than we intended.
>
> Is this that business that came up recently on one of these mail lists
> about setting the memory node to -1 rather than using the value we know it
> should be?  In mca_mpool_sm_alloc(), I do see a call to
> opal_maffinity_base_bind().

No, it was a different thing -- but we missed the call to maffinity in mpool sm.  So that might make George's point moot (I see he still hasn't chimed in yet on this thread; perhaps that's why ;-) ).

To throw a little flame on the fire -- I notice the following from an MTT run last night:

    [svbu-mpi004:17172] *** Process received signal ***
    [svbu-mpi004:17172] Signal: Segmentation fault (11)
    [svbu-mpi004:17172] Signal code: Invalid permissions (2)
    [svbu-mpi004:17172] Failing at address: 0x2a98a3f080
    [svbu-mpi004:17172] [ 0] /lib64/tls/libpthread.so.0 [0x2a960695b0]
    [svbu-mpi004:17172] [ 1] /home/jsquyres/bogus/lib/openmpi/mca_btl_sm.so [0x2a97f22619]
    [svbu-mpi004:17172] [ 2] /home/jsquyres/bogus/lib/openmpi/mca_btl_sm.so [0x2a97f225ee]
    [svbu-mpi004:17172] [ 3] /home/jsquyres/bogus/lib/openmpi/mca_btl_sm.so [0x2a97f22946]
    [svbu-mpi004:17172] [ 4] /home/jsquyres/bogus/lib/libopen-pal.so.0(opal_progress+0xa9) [0x2a95bbc078]
    [svbu-mpi004:17172] [ 5] /home/jsquyres/bogus/lib/libmpi.so.0 [0x2a95831324]
    [svbu-mpi004:17172] [ 6] /home/jsquyres/bogus/lib/libmpi.so.0 [0x2a9583185b]
    [svbu-mpi004:17172] [ 7] /home/jsquyres/bogus/lib/openmpi/mca_coll_tuned.so [0x2a987e45be]
    [svbu-mpi004:17172] [ 8] /home/jsquyres/bogus/lib/openmpi/mca_coll_tuned.so [0x2a987f160b]
    [svbu-mpi004:17172] [ 9] /home/jsquyres/bogus/lib/openmpi/mca_coll_tuned.so [0x2a987e4c2e]
    [svbu-mpi004:17172] [10] /home/jsquyres/bogus/lib/libmpi.so.0(PMPI_Barrier+0xd7) [0x2a9585987f]
    [svbu-mpi004:17172] [11] src/MPI_Type_extent_types_c(main+0xa20) [0x402f88]
    [svbu-mpi004:17172] [12] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x2a9618e3fb]
    [svbu-mpi004:17172] [13] src/MPI_Type_extent_types_c [0x4024da]
    [svbu-mpi004:17172] *** End of error message ***

Notice the "invalid permissions" message.  I hadn't noticed that before, but perhaps I wasn't looking.  I also see that this is under coll_tuned, not coll_hierarch (i.e., *not* during MPI_INIT -- it's in a barrier).

> The central question is: does "first touch" mean both read and write?
> I.e., is the first process that either reads *or* writes to a given
> location considered "first touch"?  Or is it only the first write?
>
> So, maybe the strategy is to create the shared area, have each process
> initialize its portion (FIFOs and free lists), have all processes sync,
> and then move on.  That way, you know all memory will be written by the
> appropriate owner before it's read by anyone else.  First-touch ownership
> will be proper and we won't be dependent on zero-filled pages.

That was what George was getting at yesterday -- there's a section in the btl sm startup where you're setting up your own FIFOs, but then there's a section later where you're looking at your peers' FIFOs.  There's no synchronization between these two points -- when you're looking at your peers' FIFOs, you can tell whether they're set up yet by whether the peer's FIFO pointer is NULL.  If it's NULL, you loop and try again (until it's not NULL).  This is what George thought might be "bad" from a maffinity standpoint -- but perhaps this is moot if mpool sm is calling maffinity bind.

> The big question in my mind remains that we don't seem to know how to
> reproduce the failure (segv) that we're trying to fix.  I, personally, am
> reluctant to stick fixes into the code for problems I can't observe.

Well, we *can* observe them -- I can reproduce them at a very low rate in my MTT runs.  We just don't understand the problem well enough yet to know how to reproduce it manually.

To be clear: I'm violently agreeing with you: I want to fix the problem, but it would be much mo' betta to *know* that we fixed the problem rather than "well, it doesn't seem to be happening anymore."  :-)

-- 
Jeff Squyres
Cisco Systems
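For readers who don't have the sm BTL source at hand, the publish/poll pattern described above looks roughly like the following sketch (the structure, field names, and peer count are hypothetical, not the actual sm BTL data structures):

    /* fifo_poll_sketch.c -- illustrative only, not the actual sm BTL code. */
    #include <stddef.h>

    typedef struct {
        void * volatile fifo[64];   /* one slot per peer; hypothetical layout */
    } shared_ctrl_t;

    /* Writer side: set up my FIFO in the shared segment, then publish it. */
    void publish_my_fifo(shared_ctrl_t *ctrl, int my_rank, void *my_fifo)
    {
        /* ... initialize the FIFO and free lists inside my_fifo here ... */
        ctrl->fifo[my_rank] = my_fifo;  /* peers key off "non-NULL" */
    }

    /* Reader side: spin until the peer's FIFO pointer becomes non-NULL. */
    void *wait_for_peer_fifo(shared_ctrl_t *ctrl, int peer_rank)
    {
        while (ctrl->fifo[peer_rank] == NULL) {
            ;   /* loop and try again, as described above */
        }
        return ctrl->fifo[peer_rank];
    }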
Re: [OMPI devel] SM init failures
On Mar 31, 2009, at 3:45 AM, Sylvain Jeaugey wrote:

> Sorry to continue off-topic, but going to System V shm would, for me, be
> like going back in the past.  System V shared memory used to be the main
> way to do shared memory in MPICH and, from my (little) experience, it was
> truly painful:
>
>  - Cleanup issues: does shmctl(IPC_RMID) solve _all_ cases?  (even kill -9?)
>  - Naming issues: shm segments are identified by a 32-bit key, potentially
>    causing conflicts between applications, or between layers of the same
>    application, on one node.
>  - Space issues: the total shm size on a system is bounded by
>    /proc/sys/kernel/shmmax, needing admin configuration and causing
>    conflicts between MPI applications running on the same node.

Indeed.  The one saving grace here is that the cleanup issue apparently can be solved on Linux with a special flag that means "automatically remove this shmem when all processes attaching to it have died."  That was really the impetus for [re-]investigating sysv shm.  I, too, remember the sysv pain, because we used it in LAM as well...

-- 
Jeff Squyres
Cisco Systems
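A guess at what that refers to (hedged; the thread doesn't spell it out): the common Linux idiom of marking a System V segment for removal with shmctl(IPC_RMID) once everyone has attached, so the kernel reclaims it when the last attached process exits, even after kill -9.  A minimal sketch, not a statement about what Open MPI actually does:

    /* sysv_cleanup_sketch.c -- illustrative only.  Create a System V
     * segment, attach, and mark it for removal so the kernel reclaims it
     * automatically when the last attached process goes away. */
    #include <stdio.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    int main(void)
    {
        size_t size = 1 << 20;

        /* IPC_PRIVATE avoids the 32-bit key collision problem, but the
         * shmid then has to be passed to other processes out of band. */
        int shmid = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
        if (shmid < 0) { perror("shmget"); return 1; }

        void *base = shmat(shmid, NULL, 0);
        if (base == (void *)-1) { perror("shmat"); return 1; }

        /* Once every process that needs the segment has attached, mark it
         * for removal.  It stays usable by current attachers and disappears
         * when the last one detaches or dies. */
        if (shmctl(shmid, IPC_RMID, NULL) != 0) { perror("shmctl"); return 1; }

        printf("segment %d attached at %p and marked for auto-removal\n",
               shmid, base);
        return 0;
    }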
[OMPI devel] custom btl
Hi all,

I am developing a btl module for a custom interconnect board (we call it apelink; it's an academic project), and I am porting the module from 1.2 (where it used to work) to the 1.3 branch.  Two issues:

1) The use of pls_rsh_agent is said to be deprecated.  How do I spawn the jobs using rsh, then?

2) Although compilation is fine, I get

   [gozer1:18640] mca: base: component_find: "mca_btl_apelink" does not appear to be a valid btl MCA dynamic component (ignored)

   already with an ompi_info command.  Probably something changed in the 1.3 branch regarding DSOs, which I should handle in my btl.  Any hint?

Thanks,
roberto

-- 
Roberto Ammendola
INFN - Roma II - APE group
tel: +39-0672594504
email: roberto.ammend...@roma2.infn.it
Via della Ricerca Scientifica 1 - 00133 Roma
Re: [OMPI devel] custom btl
On Mar 31, 2009, at 11:15 AM, Roberto Ammendola wrote:

> Hi all, I am developing a btl module for a custom interconnect board (we
> call it apelink; it's an academic project), and I am porting the module
> from 1.2 (where it used to work) to the 1.3 branch.  Two issues:
>
> 1) The use of pls_rsh_agent is said to be deprecated.  How do I spawn the
> jobs using rsh, then?

The "pls" framework was replaced by the "plm" framework, so "plm_rsh_agent" should work.  It defaults to "ssh : rsh", meaning that it'll look for ssh in your path; if it finds it, it will use it.  If not, it'll look for rsh in your path; if it finds it, it will use it.  If not, it'll fail.

> 2) Although compilation is fine, I get
>
>    [gozer1:18640] mca: base: component_find: "mca_btl_apelink" does not
>    appear to be a valid btl MCA dynamic component (ignored)
>
> already with an ompi_info command.  Probably something changed in the 1.3
> branch regarding DSOs, which I should handle in my btl.  Any hint?

This is likely due to dlopen failing with your component -- the most common reason for this is a missing/unresolvable symbol.  There's unfortunately a bug in libtool that doesn't show you the exact symbol that is unresolvable -- it may instead give a misleading error such as "file not found".  :-(

The way I have gotten around it before is to edit libltdl and add a printf.  :-(  Try this patch -- it compiles for me, but I haven't tested it:

    --- opal/libltdl/loaders/dlopen.c.~1~  2009-03-27 08:06:52.0 -0400
    +++ opal/libltdl/loaders/dlopen.c      2009-03-31 11:50:05.0 -0400
    @@ -195,6 +195,9 @@
       if (!module)
         {
    +      const char *error;
    +      LT__GETERROR(error);
    +      fprintf(stderr, "Can't dlopen %s: %s\n", filename, error);
           DL__SETERROR (CANNOT_OPEN);
         }

-- 
Jeff Squyres
Cisco Systems
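An alternative that avoids patching libltdl (my illustration, not something from the thread or the Open MPI tree) is to dlopen the component shared object directly in a tiny test program and print dlerror(), which usually names the unresolved symbol.  The component path in the usage comment is just an example:

    /* dlcheck.c -- illustrative helper, not part of Open MPI.
     * Build:  cc -o dlcheck dlcheck.c -ldl
     * Usage:  ./dlcheck /path/to/openmpi/mca_btl_apelink.so
     * Note: a component may legitimately reference symbols that live in
     * libmpi/libopen-pal, so expect those names here too unless you preload
     * those libraries first. */
    #include <dlfcn.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <component.so>\n", argv[0]);
            return 2;
        }

        /* RTLD_NOW forces all symbols to be resolved immediately, so a
         * missing symbol shows up right here in the error message. */
        void *handle = dlopen(argv[1], RTLD_NOW | RTLD_GLOBAL);
        if (handle == NULL) {
            fprintf(stderr, "dlopen failed: %s\n", dlerror());
            return 1;
        }

        printf("%s loaded fine\n", argv[1]);
        dlclose(handle);
        return 0;
    }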
Re: [OMPI devel] SM init failures
Jeff Squyres wrote:

> On Mar 31, 2009, at 1:46 AM, Eugene Loh wrote:
>
>>> FWIW, George found what looks like a race condition in the sm init code
>>> today -- it looks like we don't call maffinity anywhere in the sm btl
>>> startup, so we're not actually guaranteed that the memory is local to
>>> any particular process(or) (!).  This race shouldn't cause segvs,
>>> though; it should only mean that memory is potentially farther away than
>>> we intended.
>>
>> Is this that business that came up recently on one of these mail lists
>> about setting the memory node to -1 rather than using the value we know
>> it should be?  In mca_mpool_sm_alloc(), I do see a call to
>> opal_maffinity_base_bind().
>
> No, it was a different thing -- but we missed the call to maffinity in
> mpool sm.  So that might make George's point moot (I see he still hasn't
> chimed in yet on this thread; perhaps that's why ;-) ).
>
> To throw a little flame on the fire -- I notice the following from an MTT
> run last night:
>
>     [svbu-mpi004:17172] *** Process received signal ***
>     [svbu-mpi004:17172] Signal: Segmentation fault (11)
>     [svbu-mpi004:17172] Signal code: Invalid permissions (2)
>     [svbu-mpi004:17172] Failing at address: 0x2a98a3f080
>     [svbu-mpi004:17172] [ 0] /lib64/tls/libpthread.so.0 [0x2a960695b0]
>     [svbu-mpi004:17172] [ 1] /home/jsquyres/bogus/lib/openmpi/mca_btl_sm.so [0x2a97f22619]
>     [svbu-mpi004:17172] [ 2] /home/jsquyres/bogus/lib/openmpi/mca_btl_sm.so [0x2a97f225ee]
>     [svbu-mpi004:17172] [ 3] /home/jsquyres/bogus/lib/openmpi/mca_btl_sm.so [0x2a97f22946]
>     [svbu-mpi004:17172] [ 4] /home/jsquyres/bogus/lib/libopen-pal.so.0(opal_progress+0xa9) [0x2a95bbc078]
>     [svbu-mpi004:17172] [ 5] /home/jsquyres/bogus/lib/libmpi.so.0 [0x2a95831324]
>     [svbu-mpi004:17172] [ 6] /home/jsquyres/bogus/lib/libmpi.so.0 [0x2a9583185b]
>     [svbu-mpi004:17172] [ 7] /home/jsquyres/bogus/lib/openmpi/mca_coll_tuned.so [0x2a987e45be]
>     [svbu-mpi004:17172] [ 8] /home/jsquyres/bogus/lib/openmpi/mca_coll_tuned.so [0x2a987f160b]
>     [svbu-mpi004:17172] [ 9] /home/jsquyres/bogus/lib/openmpi/mca_coll_tuned.so [0x2a987e4c2e]
>     [svbu-mpi004:17172] [10] /home/jsquyres/bogus/lib/libmpi.so.0(PMPI_Barrier+0xd7) [0x2a9585987f]
>     [svbu-mpi004:17172] [11] src/MPI_Type_extent_types_c(main+0xa20) [0x402f88]
>     [svbu-mpi004:17172] [12] /lib64/tls/libc.so.6(__libc_start_main+0xdb) [0x2a9618e3fb]
>     [svbu-mpi004:17172] [13] src/MPI_Type_extent_types_c [0x4024da]
>     [svbu-mpi004:17172] *** End of error message ***
>
> Notice the "invalid permissions" message.  I hadn't noticed that before,
> but perhaps I wasn't looking.  I also see that this is under coll_tuned,
> not coll_hierarch (i.e., *not* during MPI_INIT -- it's in a barrier).

Yes, actually these happen "a lot".  (I've been spending time looking at IU_Sif/r20880 MTT stack traces.)  If the stack trace has MPI_Init in it, it seems to go through mca_coll_hierarch.  Otherwise, the segfault is in a collective call, as you note -- it could be MPI_Allgather, Barrier, Bcast, and I imagine there are others -- then mca_coll_tuned, and eventually down into the sm BTL.  There are also quite a few orphaned(?) stack traces: just a segfault and a single-level stack a la "[ 0] /lib/libpthread.so".

>> The central question is: does "first touch" mean both read and write?
>> I.e., is the first process that either reads *or* writes to a given
>> location considered "first touch"?  Or is it only the first write?
>>
>> So, maybe the strategy is to create the shared area, have each process
>> initialize its portion (FIFOs and free lists), have all processes sync,
>> and then move on.  That way, you know all memory will be written by the
>> appropriate owner before it's read by anyone else.  First-touch ownership
>> will be proper and we won't be dependent on zero-filled pages.
>
> That was what George was getting at yesterday -- there's a section in the
> btl sm startup where you're setting up your own FIFOs, but then there's a
> section later where you're looking at your peers' FIFOs.  There's no
> synchronization between these two points -- when you're looking at your
> peers' FIFOs, you can tell whether they're set up yet by whether the
> peer's FIFO pointer is NULL.  If it's NULL, you loop and try again (until
> it's not NULL).  This is what George thought might be "bad" from a
> maffinity standpoint -- but perhaps this is moot if mpool sm is calling
> maffinity bind.

The thing I was wondering about was memory barriers.  E.g., you initialize stuff and then post the FIFO pointer.  The other guy sees the FIFO pointer before the initialized memory.

>> The big question in my mind remains that we don't seem to know how to
>> reproduce the failure (segv) that we're trying to fix.  I, personally, am
>> reluctant to stick fixes into the code for problems I can't observe.
>
> Well, we *can* observe them -- I can reproduce them at a very low rate in
> my MTT runs.  We just don't understand the problem well enough yet to know
> how to reproduce it manually.
>
> To be clear: I'm violently agreeing with you: I want to fix the problem,
> but it would be much mo' betta to *know* that we fixed the problem rather
> than "well, it doesn't seem to be happening anymore."  :-)
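To spell out the concern: in the publish/poll pattern sketched earlier, the writer needs a write barrier between initializing the FIFO and posting its pointer, and the reader needs a read barrier between seeing the non-NULL pointer and reading the FIFO contents.  A hedged sketch using a GCC full-barrier builtin (Open MPI has its own barrier macros, e.g. opal_atomic_wmb(); nothing below is the actual sm BTL code, and the names are hypothetical):

    /* fifo_barrier_sketch.c -- illustrative only; shows where memory
     * barriers would be needed to rule out the reordering described above. */
    #include <stddef.h>

    void publish_my_fifo_with_barrier(void * volatile *slot, void *my_fifo)
    {
        /* ... initialize the FIFO contents in my_fifo ... */

        __sync_synchronize();   /* write barrier: the FIFO init must be
                                   globally visible BEFORE the pointer is
                                   posted */
        *slot = my_fifo;
    }

    void *wait_for_peer_fifo_with_barrier(void * volatile *slot)
    {
        void *fifo;
        while ((fifo = *slot) == NULL)
            ;                   /* spin until the peer has posted */

        __sync_synchronize();   /* read barrier: don't let reads of the FIFO
                                   contents be speculated ahead of seeing the
                                   non-NULL pointer */
        return fifo;
    }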
[OMPI devel] mallopt fixes
OK, I've done a bunch of development and testing on the hg branch with all the mallopt fixes, etc., and I'm fairly confident that it's working properly.  I plan to put this stuff back into the trunk tomorrow by noonish US Eastern if no one finds any problems with it:

    http://www.open-mpi.org/hg/hgwebdir.cgi/jsquyres/mallopt/

-- 
Jeff Squyres
Cisco Systems
Re: [OMPI devel] SM init failures
On Mar 31, 2009, at 3:06 PM, Eugene Loh wrote:

> The thing I was wondering about was memory barriers.  E.g., you initialize
> stuff and then post the FIFO pointer.  The other guy sees the FIFO pointer
> before the initialized memory.

We do do memory barriers during that SM startup sequence.  I haven't checked in a while, but I thought we were doing the right kinds of barriers in the right order...

But George mentioned on the call today that they may have found the issue; they're still testing it.  He didn't explain what the issue was, in case he was wrong.  ;-)

-- 
Jeff Squyres
Cisco Systems
Re: [OMPI devel] SM init failures
Jeff Squyres wrote:

> On Mar 31, 2009, at 3:06 PM, Eugene Loh wrote:
>
>> The thing I was wondering about was memory barriers.  E.g., you
>> initialize stuff and then post the FIFO pointer.  The other guy sees the
>> FIFO pointer before the initialized memory.
>
> We do do memory barriers during that SM startup sequence.  I haven't
> checked in a while, but I thought we were doing the right kinds of
> barriers in the right order...

There are certainly *some* barriers.  The particular scenario I asked about didn't seem to be protected against (IMHO), but I certainly don't understand these issues well and remain cautious about any guesses I make until I can demonstrate the problem and a solution.

Regarding "demonstrating the problem": I see the Sun MTT logs show some number of init errors without mca_coll_hierarch involved.  I'll try rerunning on the same machines and see if I can trigger the problem.

> But George mentioned on the call today that they may have found the issue;
> they're still testing it.  He didn't explain what the issue was, in case
> he was wrong.  ;-)

'kay.