Re: [OMPI devel] RFC: revamp topo framework
On 30 Oct 2009, at 20:28, Jeff Squyres wrote:

What George is describing is the Right answer, but it may take you a little time. FWIW: the complexity of a topo component is actually pretty low. It's essentially a bunch of glue code (that I can probably mostly provide) plus your mapping algorithms for how to reorder the communicator ranks. To be clear: topo components are *ONLY* about re-ordering ranks in a communicator -- the back-end of MPI_CART_CREATE and friends.

The BTL components that George is talking about are Byte Transfer Layer components; essentially the brains behind MPI_SEND and friends. Open MPI has a per-device list of BTLs that can service each peer MPI process. Hence, if you're sending to another MPI process on the same host, the first BTL in the list will be the shared memory BTL. If you're sending to an MPI process on a different server that you're connected to via ethernet, the TCP BTL may be at the top of the list. And so on.

It sounds like you actually want to make *two* components:

- topo: for reordering ranks during MPI_CART_CREATE and friends
- btl: use the underlying network primitives for sending when possible

As George indicated, the BTL module in each MPI process can determine during startup which MPI process peers it can talk to. It can then tell the upper-layer routing algorithm "I can talk to peer processes X, Y, and Z -- I cannot talk to peer processes A, B, and C". The upper-layer router (the PML module) will then put your BTL at the top of the list for peer processes X, Y, and Z, and will not put your BTL on the list for peer processes A, B, and C. For A, B, and C, other BTLs will be used (e.g., TCP). Does that make sense?

To answer your question from a prior mail: the unity topo component is used for the remapping of ranks in MPI_CART_CREATE. Look in ompi/mca/topo/unity/.

Thanks to everybody for the clarifications.
The function I was looking for is mca_topo_base_cart_create() in ompi/mca/topo/base/topo_base_cart_create.c, and more precisely I needed the loop:

    p = topo_data->mtc_dims_or_index;
    coords = topo_data->mtc_coords;
    dummy_rank = *new_rank;
    for (i = 0; (i < topo_data->mtc_ndims_or_nnodes && i < ndims); ++i, ++p) {
        dim = *p;
        nprocs /= dim;
        *coords++ = dummy_rank / nprocs;
        dummy_rank %= nprocs;
    }

This defines the precise relation between ranks and coordinates. Once I know this, I do not even need to write a topo component, because I can define the ranks of my computing nodes in a rankfile so that they get the coordinates that they need physically.

A different issue is the BTL component. This is actually where my approaches 1 and 2 differ (my previous distinction was confusing, due to my lack of understanding of the distinction between topo and btl components). In the 1st approach I would redefine some crucial (for my code) MPI functions so that they call the low-level torus primitives when the communication occurs between nearest neighbors, and fall back to the Open MPI functions otherwise. The 2nd approach would be to develop our torus BTL. The fact that one can choose a "priority list of networks" is definitely great and dispels my worries about the feasibility of the 2nd approach in my case. The only remaining question is whether I can get familiar with BTL stuff fast enough. What do you suggest I read in order to learn quickly how to create a BTL component?

Many thanks and best regards, Luigi
[OMPI devel] Adding (3rd party?) MCA modules to the build system
Good morning, I am trying to compile a kind of 3rd-party BTL module into my Open MPI. I got the 1.3.3 release tarball, and I can now successfully call autogen.sh, configure and build after downgrading autoconf and friends to the exact versions suggested on the hacking site (I had the most recent versions installed before, which would cause make to fail when autogen.sh was called first).

The BTL module directory I have here contains Makefile.am and Makefile.in whose contents look very similar to those in the mca/btl/tcp directory, for example. Among other things, the Makefile.am contains an automake "if OMPI_BUILD_mybtlmodule_DSO" (it looks exactly the same in the tcp module directory). Copying my module directory into ompi/mca/btl and running autogen.sh would just ignore it.

I am absolutely lost in all this autoconf and automake build chaos (as it seems to me), but trying to analyse autogen.sh I figured (from the process_framework() function) that an mca subdir has to contain one of configure.in, configure.params and configure.ac to be recognized. I copied configure.params as-is from the tcp directory, as it seems fitting (it contains just one single line: PARAM_CONFIG_FILES="Makefile"). Now running autogen.sh does indeed recognize the directory containing my BTL module. It fails, however, with the line "ompi/mca/btl/mybtlmodule/Makefile.am:40: OMPI_BUILD_btl_mymodule_DSO does not appear in AM_CONDITIONAL". I vaguely know what that means, and I was half expecting something like this, but I cannot find where those AM_CONDITIONALs are defined. Since the Makefile.am in the tcp subdir does contain a line practically identical to the one failing above, I tried a recursive grep for "OMPI_BUILD_btl_tcp_DSO" from the root build directory. This didn't really turn up anything useful, though (mainly lots of occurrences in .cache files, which I think I can ignore, and the one in Makefile.am).
Since I feel pretty stuck here after several hours of trying to grok what's going on in this huge build system, I would very much appreciate some hints :). Thanks for your patience and your help, Yours, Christian.
Re: [OMPI devel] RFC: revamp topo framework
On Nov 3, 2009, at 3:40 AM, Luigi Scorzato wrote:

This defines the precise relation between ranks and coordinates. Once I know this, I do not even need to write a topo component, because I can define the ranks of my computing nodes in a rankfile so that they get the coordinates that they need physically.

Fair enough. A topo component would make it unnecessary to lay out your processes in a specific order because it could (hypothetically) understand your physical topology and re-order the ranks accordingly.

A different issue is the BTL component. This is actually where my approaches 1 and 2 differ (my previous distinction was confusing, due to my lack of understanding of the distinction between topo and btl components). In the 1st approach I would redefine some crucial (for my code) MPI functions so that they call the low-level torus primitives when the communication occurs between nearest neighbors, and fall back to the Open MPI functions otherwise. The 2nd approach would be to develop our torus BTL. The fact that one can choose a "priority list of networks" is definitely great and dispels my worries about the feasibility of the 2nd approach in my case. The only remaining question is whether I can get familiar with BTL stuff fast enough. What do you suggest I read in order to learn quickly how to create a BTL component?

The BTL is a bit more complicated than topo -- topo is actually pretty straightforward. A BTL is a dumb byte-pusher that is controlled by an upper-level framework: the Point-to-point Messaging Layer (PML). The PML effects the semantics of the MPI point-to-point communications; PML components are the back-ends to MPI_SEND and friends. The PML initializes BTLs during MPI_INIT and builds up the priority lists of networks, etc. Then during MPI_SEND (etc.), the PML uses this information to decide what to do with messages -- fragment them over multiple BTLs, etc. It then calls the BTL modules in question to actually do the send.
On receive, the BTLs make upcalls to the PML saying "here's a fragment; you handle it". Hence, in this way, the BTLs are dumb byte pushers -- they simply send and receive to individual peers (without any MPI semantics at all) and give all the fragments they receive to the PML, who then effects all the MPI semantics. Read ompi/mca/btl/btl.h and ompi/mca/pml/pml.h for the details of the interfaces. Are the network primitives of your network like TCP (reads and writes can partially complete), or are they like Myrinet / IB (messages are read and written discretely, potentially also starting reads and writes and later receiving completion calls indicating that they finished)? -- Jeff Squyres jsquy...@cisco.com
[OMPI devel] orte_rml_base_select failed
Hi, I am using Open MPI version 1.3.2 on a SLES 11 machine. I have built it simply like ./configure => make => make install. I am facing the following error with mpirun on some machines:

Root # mpirun -np 2 ls
[NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at line 182
--
It looks like orte_init failed for some reason; your parallel process is likely to abort. There are many reasons that a parallel process can fail during orte_init; some of which are due to configuration or environment problems. This failure appears to be an internal failure; here's some additional information (which may only be relevant to an Open MPI developer):

orte_rml_base_select failed
--> Returned value -13 instead of ORTE_SUCCESS
--
[host-desktop1:09127] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_system_init.c at line 42
[host-desktop1:09127] [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 52
--
Open RTE was unable to initialize properly. The error occured while attempting to orte_init(). Returned value -13 instead of ORTE_SUCCESS.
--

Can you please guide me to resolve this issue? Is there any run-time environment variable that can be set to get rid of this issue?

Thanks in Advance, Amit

Please do not print this email unless it is absolutely necessary. The information contained in this electronic message and any attachments to this message are intended for the exclusive use of the addressee(s) and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately and destroy all copies of this message and any attachments. WARNING: Computer viruses can be transmitted via email. The recipient should check this email and any attachments for the presence of viruses. The company accepts no liability for any damage caused by any virus transmitted by this email. www.wipro.com
Re: [OMPI devel] Memory corruption with mpool
Hi, hm, I did not set any threading related options in configure, so I guess threading was not disabled. I compiled it again with the following configure options, --enable-debug --enable-memchecker --enable-mem-debug --disable-ft-thread --disable-progress-threads --disable-mpi-threads and the behavior did change, although it still does not work completely. I will investigate further. Thanks so far, Mondrian Jeff Squyres wrote: > Note that the problems Chris is talking about *should* only occur if you > have compiled Open MPI with multi-threaded support. Did you do that, > perchance? > > On Nov 2, 2009, at 9:26 AM, Mondrian Nuessle wrote: > >> Hi Christopher, >> >> >> Do you have any suggestions how to investigate this situation? >> > >> > Have you got OMPI_ENABLE_DEBUG defined? The symptoms of what you are >> > seeing sound like what might happen if debug is off and you trigger an >> > issue I posted about here related to thread safety of mpool. >> unfortunately, I have debug turned on (i.e. OMPI_ENABLE_DEBUG is >> defined in include/openmpi/opal_config.h). >> >> Regards, >> Mondrian >> >> -- >> Dr. Mondrian Nuessle >> Phone: +49 621 181 2717 University of Heidelberg >> Fax: +49 621 181 2713 Computer Architecture Group >> mailto:nues...@uni-hd.de http://ra.ziti.uni-heidelberg.de >> ___ >> devel mailing list >> de...@open-mpi.org >> http://www.open-mpi.org/mailman/listinfo.cgi/devel >> > > -- Dr. Mondrian Nuessle Phone: +49 621 181 2717 University of Heidelberg Fax: +49 621 181 2713 Computer Architecture Group mailto:nues...@uni-hd.de http://ra.ziti.uni-heidelberg.de
Re: [OMPI devel] Memory corruption with mpool
On Nov 3, 2009, at 10:02 AM, Mondrian Nuessle wrote:

hm, I did not set any threading related options in configure, so I guess threading was not disabled. I compiled it again with the following configure options, --enable-debug --enable-memchecker --enable-mem-debug --disable-ft-thread --disable-progress-threads --disable-mpi-threads and the behavior did change, although it still does not work completely. I will investigate further.

FWIW: if you're building from a tarball, the various "debug" options are not the default (it defaults to an optimized build). All the thread stuff is disabled by default, however. So your specifying them should not change anything (it's not harmful to specify disabling them; it should just be exactly the same as not specifying them). -- Jeff Squyres jsquy...@cisco.com
Re: [OMPI devel] orte_rml_base_select failed
No parameter will help - the issue is that we couldn't find a TCP interface to use for wiring up the job. First thing you might check is that you have a TCP interface alive and active - can be the loopback interface, but you need at least something. If you do have an interface, then you might rebuild OMPI with --enable-debug so you can get some diagnostics. Then run the job again with -mca rml_base_verbose 10 -mca oob_base_verbose 10 and see what diagnostic error messages emerge. On Tue, Nov 3, 2009 at 4:42 AM, Amit Sharma wrote: > > > Hi, > > I am using open-mpi version 1.3.2. on SLES 11 machine. I have built it > simply like ./configure => make => make install. > > I am facing the following error with mpirun on some machines. > > Root # mpirun -np 2 ls > > [NO-NAME] ORTE_ERROR_LOG: Not found in file runtime/orte_init_stage1.c at > line 182 > -- > It looks like orte_init failed for some reason; your parallel process is > likely to abort. There are many reasons that a parallel process can fail > during orte_init; some of which are due to configuration or environment > problems. This failure appears to be an internal failure; here's some > additional information (which may only be relevant to an Open MPI > developer): > > orte_rml_base_select failed > --> Returned value -13 instead of ORTE_SUCCESS > > -- > [host-desktop1:09127] [NO-NAME] ORTE_ERROR_LOG: Not found in file > runtime/orte_system_init.c at line 42 [host-desktop1:09127] [NO-NAME] > ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 52 > -- > Open RTE was unable to initialize properly. The error occured while > attempting to orte_init(). Returned value -13 instead of ORTE_SUCCESS. > -- > > Can you please guide me to resolve this issue. Is there any run time > environmental variable be set to get rid of this issue? > > > Thanks in Advance, > Amit > > > > > Please do not print this email unless it is absolutely necessary. 
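[Editorial aside: concretely, the diagnostic steps Ralph describes might look like the following; the build prefix and interface-listing command are examples, only the two MCA verbosity parameters come from the advice above.]

```shell
# 1. Verify at least one TCP interface (loopback is enough) is up:
ip addr show            # or: /sbin/ifconfig -a

# 2. Rebuild Open MPI with debug support so diagnostics are available:
./configure --enable-debug && make && make install

# 3. Re-run with verbose RML/OOB selection output and look for the failure:
mpirun -mca rml_base_verbose 10 -mca oob_base_verbose 10 -np 2 ls
```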
Re: [OMPI devel] Adding (3rd party?) MCA modules to the build system
On Nov 3, 2009, at 4:14 AM, Christian Bendele wrote:

i am trying to compile some kind of 3rd party btl module into my openmpi. I got the 1.3.3 release tarball, and i can now successfully call autogen.sh, configure and build after downgrading autoconf and friends to the exact versions suggested on the hacking site (i had the most recent versions installed before, which would cause make to fail when autogen.sh was called before).

I believe the fixes for that (needing to downgrade) will be in 1.3.4, if you care. :-) The SVN development trunk definitely works with the most recent autotools.

The btl module directory I have here contains Makefile.am and Makefile.in whose contents look very similar to those in the mca/btl/tcp directory for example. Among other things the Makefile.am contains an automake "if OMPI_BUILD_mybtlmodule_DSO" (looks exactly the same in the tcp module directory).

Check out this wiki page for making new components in the OMPI tree: https://svn.open-mpi.org/trac/ompi/wiki/devel/CreateComponent

Copying my module directory into ompi/mca/btl and running autogen.sh would just ignore it. I am absolutely lost in all this autoconf and automake build chaos (as it seems to me), but trying to analyse autogen.sh i figured (from the process_framework() function) that an mca subdir has to contain one of configure.in, configure.params and configure.ac to be recognized. I copied configure.params as it is from the tcp directory, as it seems fitting (containing just one single line: PARAM_CONFIG_FILES="Makefile").

Correct. Hopefully the wiki page explains this better than needing to analyze autogen.sh...

Now running autogen.sh does indeed recognize the directory containing my btl module. It fails, however, with the line "ompi/mca/btl/mybtlmodule/Makefile.am:40: OMPI_BUILD_btl_mymodule_DSO does not appear in AM_CONDITIONAL". I vaguely know what that means, and i was half expecting something like this, but I can not find where those AM_CONDITIONALs are defined.
You shouldn't need to define these -- autogen.sh should define all the relevant AM_CONDITIONALs. Check that wiki page and see if it answers your questions. Ping back here if not. -- Jeff Squyres jsquy...@cisco.com
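[Editorial aside: a minimal skeleton of what a new component directory might contain, for reference. The component name mybtl is hypothetical and the Makefile.am fragment follows the shape of the tcp component's; it is a sketch, not verified against 1.3.3. Note also that in the error quoted above the directory is mybtlmodule while the failing conditional is OMPI_BUILD_btl_mymodule_DSO -- since autogen.sh generates the conditionals from the framework and directory names, that name mismatch may itself be the problem.]

```makefile
# ompi/mca/btl/mybtl/configure.params -- makes autogen.sh pick the
# directory up and generate the matching AM_CONDITIONAL
PARAM_CONFIG_FILES="Makefile"

# ompi/mca/btl/mybtl/Makefile.am (fragment) -- the conditional name must
# match the directory name: OMPI_BUILD_<framework>_<component>_DSO
if OMPI_BUILD_btl_mybtl_DSO
component_noinst =
component_install = mca_btl_mybtl.la
else
component_noinst = libmca_btl_mybtl.la
component_install =
endif
```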
[OMPI devel] 1.3.4 blocker
Avneesh/QLogic just pointed out to me that we have a binary compatibility issue (he's going to file a blocker ticket shortly). When we changed the .so version numbers right before 1.3.4rc3, we made it so that apps compiled with 1.3.3 will not run against 1.3.4, because the apps are dependent upon libopen-rte and libopen-pal. This is likely because the wrapper compilers explicitly -lopen-rte and -lopen-pal, rather than letting them get pulled in implicitly (I *think* that's why -- but it's late and I haven't tried it).

Simple test: mpicc hello.c -o hello against a 1.3.3 install. Then change your LD_LIBRARY_PATH to point to a 1.3.4rc3 install (presumably in a different tree). Run ldd on hello; it'll show libopen-rte.so and libopen-pal.so as not found.

It's too late to fix this for the 1.3/1.4 series; perhaps we can fix it in the 1.5 series properly. But I think we might have to lie about the .so version numbers for 1.3.4 to make the binary compatibility work. George / Brad -- we should chat on the phone tomorrow to figure this out (when I have more brain power to think about this properly). :-( -- Jeff Squyres jsquy...@cisco.com
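[Editorial aside: Jeff's reproduction recipe, spelled out as commands. The install prefixes are hypothetical examples; the grep at the end reflects his description that ldd reports libopen-rte.so and libopen-pal.so as not found once the .so versions (and hence sonames) change.]

```shell
# Build against a 1.3.3 install:
/opt/ompi-1.3.3/bin/mpicc hello.c -o hello

# Point the dynamic linker at a 1.3.4rc3 install instead:
export LD_LIBRARY_PATH=/opt/ompi-1.3.4rc3/lib

# The bumped .so versions mean the sonames recorded in the binary
# no longer exist in the new tree:
ldd hello | grep 'not found'    # expect libopen-rte / libopen-pal here
```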
[OMPI devel] Another 1.3.4 blocker: mpi_test_suite failing
MTT is finding (and I have confirmed by hand) that 1.3.4 is failing the mpi_test_suite P2P tests Ring Persistent (15/47). It does not fail on the trunk. Can someone confirm whether this is a real problem? Here's a snippet of the output:

P2P tests Ring Persistent (15/47), comm MPI_COMM_WORLD (1/13), type MPI_TYPE_MIX_ARRAY (28/27)
P2P tests Ring Persistent (15/47), comm MPI_COMM_WORLD (1/13), type MPI_TYPE_MIX_LB_UB (29/27)
P2P tests Ring Persistent (15/47), comm MPI_COMM_SELF (3/13), type MPI_CHAR (1/27)
statuses[1].MPI_SOURCE:-1 instead of -2 (MPI_ANY_SOURCE:-1 MPI_PROC_NULL:-2)
statuses[1].MPI_SOURCE:-1 instead of -2 (MPI_ANY_SOURCE:-1 MPI_PROC_NULL:-2)
statuses[1].MPI_SOURCE:-1 instead of -2 (MPI_ANY_SOURCE:-1 MPI_PROC_NULL:-2)
statuses[1].MPI_SOURCE:-1 instead of -2 (MPI_ANY_SOURCE:-1 MPI_PROC_NULL:-2)
statuses[1].MPI_SOURCE:-1 instead of -2 (MPI_ANY_SOURCE:-1 MPI_PROC_NULL:-2)
statuses[1].MPI_SOURCE:-1 instead of -2 (MPI_ANY_SOURCE:-1 MPI_PROC_NULL:-2)
statuses[1].MPI_SOURCE:-1 instead of -2 (MPI_ANY_SOURCE:-1 MPI_PROC_NULL:-2)
statuses[1].MPI_SOURCE:-1 instead of -2 (MPI_ANY_SOURCE:-1 MPI_PROC_NULL:-2)
statuses[1].MPI_SOURCE:-1 instead of -2 (MPI_ANY_SOURCE:-1 MPI_PROC_NULL:-2)
(p2p/tst_p2p_simple_ring_persistent.c:135) ERROR: Error in statuses; Invalid argument(22)
(p2p/tst_p2p_simple_ring_persistent.c:135) ERROR: Error in statuses; Invalid argument(22)

Here's how I'm running the test suite:

mpirun -np 16 mpi_test_suite -x relaxed -d 'All,!MPI_SHORT_INT,!MPI_TYPE_MIX'

across 4 nodes, 4ppn. -- Jeff Squyres jsquy...@cisco.com
Re: [OMPI devel] Another 1.3.4 blocker: mpi_test_suite failing
Hmm. Looks like this fails in 1.3.0 as well. So -- not a regression. But it still does seem weird. I'll file a ticket for 1.4.

On Nov 3, 2009, at 9:22 PM, Jeff Squyres (jsquyres) wrote:

MTT is finding (and I have confirmed by hand) that 1.3.4 is failing the mpi_test_suite P2P tests Ring Persistent (15/47). It does not fail on the trunk. Can someone confirm if this is a real problem? Here's a snippet of the output:

P2P tests Ring Persistent (15/47), comm MPI_COMM_WORLD (1/13), type MPI_TYPE_MIX_ARRAY (28/27)
P2P tests Ring Persistent (15/47), comm MPI_COMM_WORLD (1/13), type MPI_TYPE_MIX_LB_UB (29/27)
P2P tests Ring Persistent (15/47), comm MPI_COMM_SELF (3/13), type MPI_CHAR (1/27)
statuses[1].MPI_SOURCE:-1 instead of -2 (MPI_ANY_SOURCE:-1 MPI_PROC_NULL:-2)
statuses[1].MPI_SOURCE:-1 instead of -2 (MPI_ANY_SOURCE:-1 MPI_PROC_NULL:-2)
statuses[1].MPI_SOURCE:-1 instead of -2 (MPI_ANY_SOURCE:-1 MPI_PROC_NULL:-2)
statuses[1].MPI_SOURCE:-1 instead of -2 (MPI_ANY_SOURCE:-1 MPI_PROC_NULL:-2)
statuses[1].MPI_SOURCE:-1 instead of -2 (MPI_ANY_SOURCE:-1 MPI_PROC_NULL:-2)
statuses[1].MPI_SOURCE:-1 instead of -2 (MPI_ANY_SOURCE:-1 MPI_PROC_NULL:-2)
statuses[1].MPI_SOURCE:-1 instead of -2 (MPI_ANY_SOURCE:-1 MPI_PROC_NULL:-2)
statuses[1].MPI_SOURCE:-1 instead of -2 (MPI_ANY_SOURCE:-1 MPI_PROC_NULL:-2)
statuses[1].MPI_SOURCE:-1 instead of -2 (MPI_ANY_SOURCE:-1 MPI_PROC_NULL:-2)
(p2p/tst_p2p_simple_ring_persistent.c:135) ERROR: Error in statuses; Invalid argument(22)
(p2p/tst_p2p_simple_ring_persistent.c:135) ERROR: Error in statuses; Invalid argument(22)

Here's how I'm running the test suite:

mpirun -np 16 mpi_test_suite -x relaxed -d 'All,!MPI_SHORT_INT,!MPI_TYPE_MIX'

across 4 nodes, 4ppn. -- Jeff Squyres jsquy...@cisco.com

-- Jeff Squyres jsquy...@cisco.com
Re: [OMPI devel] === CREATE FAILURE (v1.5) ===
Mea culpa. Should be fixed now. If you can, please restart the testing process. Thanks, george.

On Nov 3, 2009, at 21:35, MPI Team wrote:

ERROR: Command returned a non-zero exit status (v1.5): ./configure --enable-dist
Start time: Tue Nov 3 21:30:07 EST 2009
End time: Tue Nov 3 21:35:17 EST 2009

checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking for a thread-safe mkdir -p... /bin/mkdir -p
checking for gawk... gawk
checking whether make sets $(MAKE)... yes
checking how to create a ustar tar archive... gnutar

== Configuring Open MPI ==

*** Checking versions
checking Open MPI version... 1.5a1r22187
checking Open MPI release date... Unreleased developer copy
checking Open MPI Subversion repository version... r22187
checking Open Run-Time Environment version... 1.5a1r22187
checking Open Run-Time Environment release date... Unreleased developer copy
checking Open Run-Time Environment Subversion repository version... r22187
checking Open Portable Access Layer version... 1.5a1r22187
checking Open Portable Access Layer release date... Unreleased developer copy
checking Open Portable Access Layer Subversion repository version... r22187
./configure: line 5910: syntax error near unexpected token `<<<'
./configure: line 5910: `<<< .working'

Your friendly daemon, Cyrador ___ testing mailing list test...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/testing