Re: [OMPI devel] rankfile questions
Hi,

> -----Original Message-----
> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On
> Behalf Of Ralph Castain
> Sent: Wednesday, March 19, 2008 3:19 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] rankfile questions
>
> Not trying to pile on here...but I do have a question.
>
> This commit inserted a bunch of affinity-specific code in ompi_mpi_init.c.
> Was this truly necessary?
>
> It seems to me this violates our code architecture. Affinity-specific code
> belongs in the opal_p[m]affinity functions. Why aren't we just calling a
> "opal_paffinity_set_my_processor" function (or whatever name you like) in
> mpi_init, and doing all this paffinity stuff there?

This is the only place where this code is used. These functions process the
info from the ODLS and set paffinity appropriately. Moving this code to OPAL
would cause unnecessary changes to the paffinity base API.

> It would make mpi_init a lot cleaner, and preserve the code standards we
> have had since the beginning.
>
> In addition, the code that has been added returns ORTE error and success
> codes. Given the location, it should be OMPI error and success codes - if we
> move it to where I think it belongs (in OPAL), then those codes should
> obviously be OPAL codes.

Will be cleaned up, thanks.

> If I'm missing some reason why these things can't be done, please enlighten
> me. Otherwise, it would be nice if this could be cleaned up.
>
> Thanks
> Ralph
>
> On 3/18/08 8:39 AM, "Jeff Squyres" wrote:
>
> > On Mar 18, 2008, at 9:32 AM, Jeff Squyres wrote:
> >
> >> I notice that rankfile didn't compile properly on some platforms and
> >> issued warnings on other platforms. Thanks to Ralph for cleaning it
> >> up...
> >>
> >> 1. I see a getenv("slot_list") in the MPI side of the code; it looks
> >> like $slot_list is set by the odls for the MPI process. Why isn't it
> >> an MCA parameter? That's what all other values passed by the orted to
> >> the MPI process appear to be.

"slot_list" consists of the socket:core pair that the rank is to be bound to.
This info changes according to the rankfile and is different for each node and
rank, therefore it cannot be passed via an MCA parameter.

> >> 2. I see that ompi_mpi_params.c is now registering 2 rmaps-level MCA
> >> parameters. Why? Shouldn't these be in ORTE somewhere?

If you mean paffinity_alone and rank_file_debug, then:
1. paffinity_alone was there before.
2. After getting some answers from Ralph about orte_debug in ompi_mpi_init, I
intend to introduce an ompi_debug MCA parameter that will be used in this
library, and rank_file_debug will be removed.

> > A few more notes:
> >
> > 3. Most of the files in orte/mca/rmaps/rankfile do not obey the prefix
> > rule. I think that they should be renamed.

The rank_file component was copied from round_robin; I thought it would be
strange if it looked different.

> > 4. A quick look through rankfile_lex.l seems to show that there are
> > global variables that are not protected by the prefix rule (or
> > static). Ditto in rmaps_rf.c. These should be fixed.

What do you mean?

> > 5. rank_file_done was instantiated in both rankfile_lex.l and
> > rmaps_rf.c (causing a duplicate symbol linker error on OS X). I
> > removed it from rmaps_rf.c (it was declared "extern" in
> > rankfile_lex.h, assumedly to indicate that it is "owned" by the lex.l
> > file...?).

Thanks.

> > 6. svn:ignore was not set in the new rankfile directory.

Will be fixed.
I guess due to the heavy network traffic nowadays, all these comments came now
and not 2 weeks ago when I sent the code for review :) :) :).

Best Regards,
Lenny.
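For reference, the pattern being discussed looks roughly like the sketch
below. This is illustrative only: the "slot_list" name and the socket:core
format come from the answer above, but the function names and parsing details
are assumptions rather than the actual odls/rankfile code.

    /* Illustrative sketch only -- not the actual odls/rankfile code. */
    #include <stdio.h>
    #include <stdlib.h>

    /* odls side: export the per-rank binding before exec'ing the MPI process.
     * An OMPI_MCA_-prefixed name would make this an MCA parameter instead. */
    static void export_slot_list(int socket, int core)
    {
        char buf[32];
        snprintf(buf, sizeof(buf), "%d:%d", socket, core);
        setenv("slot_list", buf, 1);
    }

    /* ompi_mpi_init() side: read and parse it, then hand off to paffinity. */
    static int get_slot_list(int *socket, int *core)
    {
        const char *s = getenv("slot_list");
        if (NULL == s) {
            return -1;                 /* no binding requested for this rank */
        }
        if (2 != sscanf(s, "%d:%d", socket, core)) {
            return -1;                 /* malformed value */
        }
        return 0;
    }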
Re: [OMPI devel] rankfile questions
Yes, you're right -- we should have reviewed this code 2 weeks ago when you
asked. Sorry about that. :-\

Per adding lots of affinity code in ompi_mpi_init.c: perhaps that code belongs
down in the paffinity (or rmaps?) base. It doesn't have to become part of any
specific paffinity component (because it can be used with any paffinity
component). This makes it callable by anyone (including orte) and keeps the
abstraction barriers clean.

On Mar 19, 2008, at 5:36 AM, Lenny Verkhovsky wrote:

> > 1. I see a getenv("slot_list") in the MPI side of the code; it looks
> > like $slot_list is set by the odls for the MPI process. Why isn't it
> > an MCA parameter? That's what all other values passed by the orted to
> > the MPI process appear to be.
>
> "slot_list" consists of the socket:core pair that the rank is to be bound
> to. This info changes according to the rankfile and is different for each
> node and rank, therefore it cannot be passed via an MCA parameter.

I don't follow the logic here. MCA parameters can certainly be unique per MPI
process...

Remember that MCA parameters can be environment variables. The advantage of
using MCA params as env variables is that we enforce a common prefix to ensure
that we don't collide with other environment variables. There are functions to
get the environment variable names of MCA parameters, for example, so that you
can setenv them to pass them to another process (e.g., in the odls). Then you
use the normal MCA parameter lookup functions to retrieve them in the
target/receiver process.

> > 2. I see that ompi_mpi_params.c is now registering 2 rmaps-level MCA
> > parameters. Why? Shouldn't these be in ORTE somewhere?
>
> If you mean paffinity_alone and rank_file_debug, then:
> 1. paffinity_alone was there before.
> 2. After getting some answers from Ralph about orte_debug in ompi_mpi_init,
> I intend to introduce an ompi_debug MCA parameter that will be used in this
> library, and rank_file_debug will be removed.

rmaps_rank_file_path and rmaps_rank_file_debug. These have no place being
registered in the OMPI layer.

It looks like rank_file_path is only registered in ompi_mpi_init.c as an error
check. Why isn't this done in the rmaps rankfile component itself? This would
execute in mpirun and avoid launching at all if an error is detected (vs.
detecting the error in each MPI process and aborting each one).

> > A few more notes:
> >
> > 3. Most of the files in orte/mca/rmaps/rankfile do not obey the prefix
> > rule. I think that they should be renamed.
>
> The rank_file component was copied from round_robin; I thought it would be
> strange if it looked different.

Blah -- it looks like round robin's files don't adhere to the prefix rule. In
fairness, those files *may* be so old that they predate the prefix rule...?
Regardless, I think the rankfile files should be named in accordance with the
rest of the code base and adhere to the prefix rule. Round robin should
probably be fixed as well.

> > 4. A quick look through rankfile_lex.l seems to show that there are
> > global variables that are not protected by the prefix rule (or
> > static). Ditto in rmaps_rf.c. These should be fixed.
>
> What do you mean?

From lex.l:

    int rank_file_line=1;
    rank_file_value_t rank_file_value;
    bool rank_file_done = false;

These are neither static nor do they adhere to the prefix rule (obviously, if
a symbol is static, it doesn't have to adhere to the prefix rule). Ditto for
"rank_file_path" and "rankmap" in rmaps_rf.c. There may be others; that's all
I looked through (e.g., I didn't check other files or check function symbols).

--
Jeff Squyres
Cisco Systems
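For illustration, the kind of fix being asked for could look like the sketch
below; the orte_rmaps_rank_file_ prefix is only a plausible choice here, not
necessarily what was committed.

    #include <stdbool.h>

    /* Illustrative sketch only -- the actual names chosen may differ. */

    /* A global used only inside rankfile_lex.l can simply be made static,
     * which keeps it out of the exported symbol space: */
    static int rank_file_line = 1;

    /* A global that must stay visible to other files (e.g. rank_file_done,
     * which rankfile_lex.h declares extern) should instead carry the
     * component prefix so that it obeys the prefix rule: */
    bool orte_rmaps_rank_file_done = false;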
Re: [OMPI devel] RFC: libevent update
I re-merged down to the libevent-merge branch (to include r17872) and a new
tarball has been uploaded to http://www.open-mpi.org/~jsquyres/unofficial/

On Mar 18, 2008, at 10:11 PM, George Bosilca wrote:

Commit 17872 is the one you're looking for.

https://svn.open-mpi.org/trac/ompi/changeset/17872

  george.

On Mar 18, 2008, at 9:12 PM, Jeff Squyres wrote:

When did you fix it? I merged the trunk down to the libevent-merge branch
late this afternoon (r17869).

On Mar 18, 2008, at 7:29 PM, George Bosilca wrote:

This has been fixed in the trunk, but not yet merged in the branch.

  george.

On Mar 18, 2008, at 7:17 PM, Josh Hursey wrote:

I found another problem with the libevent branch. If I set "-mca btl
tcp,self" on the command line then I get a segfault when sending messages > 16
KB. I can try to make a smaller repeater, but if you use the "progress" or
"simple" tests in ompi-tests below:

https://svn.open-mpi.org/svn/ompi-tests/trunk/iu/ft/correctness

To build:
  shell$ make
To run with failure:
  shell$ mpirun -np 2 -mca btl tcp,self progress -s 16 -v 1
To run without failure:
  shell$ mpirun -np 2 -mca btl tcp,self progress -s 15 -v 1

This program will display the message "Checkpoint at any time...". If you
send mpirun SIGUSR2 it will progress to the next stage of the test. The
failure occurs on the first message, before this becomes an issue though.

I was using Odin, and if I do not specify the btls then the test will pass as
normal. The backtrace is below:

--------------------------------------------------------------------------
...
Core was generated by `progress -s 16 -v 1'.
Program terminated with signal 11, Segmentation fault.
#0  0x002a9793318b in mca_bml_base_free (bml_btl=0x736275705f61636d,
    des=0x559700) at ../../../../ompi/mca/bml/bml.h:267
267         bml_btl->btl_free( bml_btl->btl, des );
(gdb) bt
#0  0x002a9793318b in mca_bml_base_free (bml_btl=0x736275705f61636d,
    des=0x559700) at ../../../../ompi/mca/bml/bml.h:267
#1  0x002a9793304d in mca_pml_ob1_put_completion (btl=0x5598c0, ep=0x0,
    des=0x559700, status=0) at pml_ob1_recvreq.c:190
#2  0x002a97930069 in mca_pml_ob1_recv_frag_callback (btl=0x5598c0,
    tag=64 '@', des=0x2a989d2b00, cbdata=0x0) at pml_ob1_recvfrag.c:149
#3  0x002a97d5f3e0 in mca_btl_tcp_endpoint_recv_handler (sd=10, flags=2,
    user=0x5a5df0) at btl_tcp_endpoint.c:696
#4  0x002a95a0ab93 in event_process_active (base=0x508c80) at event.c:591
#5  0x002a95a0af59 in opal_event_base_loop (base=0x508c80, flags=2)
    at event.c:763
#6  0x002a95a0ad2b in opal_event_loop (flags=2) at event.c:670
#7  0x002a959fadf8 in opal_progress () at runtime/opal_progress.c:169
#8  0x002a9792caae in opal_condition_wait (c=0x2a9587d940, m=0x2a9587d9c0)
    at ../../../../opal/threads/condition.h:93
#9  0x002a9792c9dd in ompi_request_wait_completion (req=0x5a5380)
    at ../../../../ompi/request/request.h:381
#10 0x002a9792c920 in mca_pml_ob1_recv (addr=0x5baf70, count=16384,
    datatype=0x503770, src=1, tag=1001, comm=0x5039a0, status=0x0)
    at pml_ob1_irecv.c:104
#11 0x002a956f1f00 in PMPI_Recv (buf=0x5baf70, count=16384, type=0x503770,
    source=1, tag=1001, comm=0x5039a0, status=0x0) at precv.c:75
#12 0x0040211f in exchange_stage1 (ckpt_num=1) at progress.c:414
#13 0x00401295 in main (argc=5, argv=0x7fbfffe668) at progress.c:131
(gdb) p bml_btl
$1 = (mca_bml_base_btl_t *) 0x736275705f61636d
(gdb) p *bml_btl
Cannot access memory at address 0x736275705f61636d
--------------------------------------------------------------------------

--
Josh
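For what it's worth, a minimal stand-alone repeater for the >16 KB send/recv
pattern could look like the sketch below. This is only an illustration -- the
backtrace above came from the ompi-tests "progress" test, not from this code.

    /* Hypothetical minimal repeater: sends a message well past 16 KB between
     * two ranks; run over the tcp/self BTLs to match the report above. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, i;
        const int count = 16 * 1024;   /* 16K ints (64 KB), past the 16 KB mark */
        int *buf;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        buf = malloc(count * sizeof(int));
        for (i = 0; i < count; ++i) {
            buf[i] = rank;
        }

        if (0 == rank) {
            MPI_Send(buf, count, MPI_INT, 1, 1001, MPI_COMM_WORLD);
        } else if (1 == rank) {
            MPI_Recv(buf, count, MPI_INT, 0, 1001, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("received %d ints\n", count);
        }

        free(buf);
        MPI_Finalize();
        return 0;
    }

    /* Run with, e.g.:  mpirun -np 2 -mca btl tcp,self ./repeater */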
On Mar 17, 2008, at 2:50 PM, Jeff Squyres wrote:

WHAT: Bring new version of libevent to the trunk.

WHY: Newer version, slightly better performance (lower overheads / lighter
weight), properly integrate the use of epoll and other scalable fd monitoring
mechanisms.

WHERE: 98% of the changes are in opal/event; there's a few changes to
configury and one change to the orted.

TIMEOUT: COB, Friday, 21 March 2008

DESCRIPTION: George/UTK has done the bulk of the work to integrate a new
version of libevent on the following tmp branch:

https://svn.open-mpi.org/svn/ompi/tmp-public/libevent-merge

** WE WOULD VERY MUCH APPRECIATE IF PEOPLE COULD MTT TEST THIS BRANCH! **

Cisco ran MTT on this branch on Friday and everything checked out (i.e., no
more failures than on the trunk). We just made a few more minor changes today
and I'm running MTT again now, but I'm not expecting any new failures (MTT
will take several hours).

We would like to bring the new libevent in over this upcoming weekend, but
would very much appreciate if others could test on their platforms (Cisco
tests mainly 64 bit RHEL4U4). This new libevent *should* be a fairly
side-effect free change, but it is possible that since we're now using epoll
and other scalable fd monitoring tools, we'll run into some unanticipated
issues on some platforms.

Here's a consolidated diff if you want to see the changes:

https://svn.open-mpi.org/trac/ompi/changeset?old_path=tmp-public%
[OMPI devel] Libtool for 1.3 / trunk builds
Hi all -

Now that Libtool 2.2 has gone stable (2.0 was skipped entirely), it probably
makes sense to update the version of Libtool used to build the nightly tarball
and releases for the trunk (and eventually v1.3) from the nightly snapshot we
have been using to the stable LT 2.2 release.

I've done some testing (ie, I installed LT 2.2 for another project, and
nothing in OMPI broke over the last couple of weeks), so I have some
confidence this should be a smooth transition.

If the group decides this is a good idea, someone at IU would just have to
install the new LT version and change some symlinks and it should all just
work...

Brian
Re: [OMPI devel] Libtool for 1.3 / trunk builds
Should we wait for the next LT point release? I see a fair amount of activity
on the bugs-libtool list; I think they're planning a new release within the
next few weeks.

(I think we will want to go to the LT point release when it comes out; I don't
really have strong feelings about going to 2.2 now or not)

On Mar 19, 2008, at 12:26 PM, Brian W. Barrett wrote:

Hi all -

Now that Libtool 2.2 has gone stable (2.0 was skipped entirely), it probably
makes sense to update the version of Libtool used to build the nightly tarball
and releases for the trunk (and eventually v1.3) from the nightly snapshot we
have been using to the stable LT 2.2 release.

I've done some testing (ie, I installed LT 2.2 for another project, and
nothing in OMPI broke over the last couple of weeks), so I have some
confidence this should be a smooth transition.

If the group decides this is a good idea, someone at IU would just have to
install the new LT version and change some symlinks and it should all just
work...

Brian

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] Libtool for 1.3 / trunk builds
True - I have no objection to waiting for 2.2.1 or 1.3 to be branched,
whichever comes first. The main point is that under no circumstance should 1.3
be shipped with the same 2.1a pre-release as 1.2 uses -- it's time to migrate
to something stable.

Brian

On Wed, 19 Mar 2008, Jeff Squyres wrote:

Should we wait for the next LT point release? I see a fair amount of activity
on the bugs-libtool list; I think they're planning a new release within the
next few weeks.

(I think we will want to go to the LT point release when it comes out; I don't
really have strong feelings about going to 2.2 now or not)

On Mar 19, 2008, at 12:26 PM, Brian W. Barrett wrote:

Hi all -

Now that Libtool 2.2 has gone stable (2.0 was skipped entirely), it probably
makes sense to update the version of Libtool used to build the nightly tarball
and releases for the trunk (and eventually v1.3) from the nightly snapshot we
have been using to the stable LT 2.2 release.

I've done some testing (ie, I installed LT 2.2 for another project, and
nothing in OMPI broke over the last couple of weeks), so I have some
confidence this should be a smooth transition.

If the group decides this is a good idea, someone at IU would just have to
install the new LT version and change some symlinks and it should all just
work...

Brian
Re: [OMPI devel] Libtool for 1.3 / trunk builds
On Mar 19, 2008, at 4:05 PM, Brian W. Barrett wrote:

True - I have no objection to waiting for 2.2.1 or 1.3 to be branched,
whichever comes first. The main point is that under no circumstance should 1.3
be shipped with the same 2.1a pre-release as 1.2 uses -- it's time to migrate
to something stable.

Cool; I think we're agreed. Just for simplicity, let's do whatever comes
first: LT hits 2.2.1 (or 2.2.2? I don't know their numbering scheme) or we
branch for v1.3.

Brian

On Wed, 19 Mar 2008, Jeff Squyres wrote:

Should we wait for the next LT point release? I see a fair amount of activity
on the bugs-libtool list; I think they're planning a new release within the
next few weeks.

(I think we will want to go to the LT point release when it comes out; I don't
really have strong feelings about going to 2.2 now or not)

On Mar 19, 2008, at 12:26 PM, Brian W. Barrett wrote:

Hi all -

Now that Libtool 2.2 has gone stable (2.0 was skipped entirely), it probably
makes sense to update the version of Libtool used to build the nightly tarball
and releases for the trunk (and eventually v1.3) from the nightly snapshot we
have been using to the stable LT 2.2 release.

I've done some testing (ie, I installed LT 2.2 for another project, and
nothing in OMPI broke over the last couple of weeks), so I have some
confidence this should be a smooth transition.

If the group decides this is a good idea, someone at IU would just have to
install the new LT version and change some symlinks and it should all just
work...

Brian

--
Jeff Squyres
Cisco Systems
Re: [OMPI devel] xensocket btl and migration
Thanks a lot Jeff and Josh. Seems it will be quite an interesting task to
implement a separate btl for xensocket (xs) or anything related to migration.

I plan to stick to the initial design for the time being, which seems ugly but
simple and quite efficient (at the moment). I have bundled xs with the tcp
btl. So instead of using the tcp btl, interested parties will use the xs btl,
which supports tcp inherently. During execution -- which starts over normal
tcp -- if we see that both endpoints are on the same physical host, we
construct the xensockets (two in fact, one per endpoint to receive data -- xs
is unidirectional). Upon a signal that the xensockets are created and
connected, we start to make progress through the xs socket descriptors, which
means that the normal tcp socket descriptors are alive but not in charge, as
no data is being sent or received through them. When we migrate to another
physical host, our plan is to somehow make the xs_socket invalid and resort to
normal tcp sockets. If a new endpoint pair is detected on the new physical
host, we will do the same as was done initially. I am not sure if it is an
efficient design, but in theory it seems interesting, although it has a slight
overhead. The worst part of the design is that it is highly tcp-centric.

My current status is that I am able to run normal MPI programs on the xs btl,
but am having some problems with some benchmark programs using non-blocking
sends and receives coupled with MPI_Barrier(). Something somewhere somehow
gets lost. Xensockets initially were non-blocking send/recv and did not have
the necessary code for supporting epoll/select. We had to add the necessary
code to the module, so I am quite sure that they will work with the new
opal/libevent.

Best Regards,
Muhammad Atif
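A rough sketch of the send-path switch described above might look like the
following; the names and structures here are hypothetical, not the actual
xensocket btl code.

    #include <stdbool.h>

    /* Hypothetical endpoint structure -- illustrative only. */
    struct xs_btl_endpoint {
        int  tcp_sd;          /* always-present tcp socket descriptor */
        int  xs_send_sd;      /* xensocket used to send (unidirectional) */
        int  xs_recv_sd;      /* xensocket used to receive */
        bool xs_connected;    /* both xensockets created and connected? */
        bool same_host;       /* is the peer on the same physical host? */
    };

    /* Pick the descriptor to send on: prefer the xensocket when the peer is
     * co-located and the xs pair is up; otherwise fall back to plain tcp.
     * After a migration the caller would clear xs_connected (and re-check
     * same_host), so traffic drops back to the tcp descriptor. */
    static int xs_btl_send_fd(const struct xs_btl_endpoint *ep)
    {
        if (ep->same_host && ep->xs_connected) {
            return ep->xs_send_sd;
        }
        return ep->tcp_sd;
    }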
----- Original Message -----
From: Josh Hursey
To: Open MPI Developers
Sent: Wednesday, March 19, 2008 2:20:59 AM
Subject: Re: [OMPI devel] xensocket btl and migration

Muhammad,

With regard to your question on migration, you will likely have to reload the
BTL components when a migration occurs. Open MPI currently assumes that once
the set of BTLs is decided upon in a process, they are to be used until the
application completes.

There is some limited support for failover in which, if one BTL 'fails', then
it is disregarded and a previously defined alternative path is used. For
example, if between two peers Open MPI has the choice of using tcp or openib
then it will use openib. If openib were to fail during the running of the job
then it may be possible for Open MPI to fail over and use just tcp. I'm not
sure how well tested this ability is; others can comment if you are interested
in this.

However, failover is not really what you are looking for. What it seems you
are looking for is the ability to tell two processes that they should no
longer communicate over tcp, but continue communication over xensockets, or
vice versa. One technique would be, upon migration, to unload the BTLs
(component_close), then reopen (component_open) and reselect
(component_select), then re-exchange the modex; the processes should settle
into the new configuration. You will have to make sure that any state Open MPI
has cached, such as network addresses and node name data, is refreshed upon
restart. Take a look at the checkpoint/restart logic for how I do this in the
code base ([opal|orte|ompi]/runtime/*_cr.c). It is likely that there is
another, more efficient method, but I don't have anything to point you to at
the moment.

One idea would be to add a refresh function to the modex which would force the
re-exchange of a single process's address set. There are a slew of problems
with this that you will have to overcome, including race conditions, but I
think they can be surmounted.

I'd be interested in hearing your experiences implementing this in Open MPI.
Let me know if I can be of any more help.

Cheers,
Josh

On Mar 9, 2008, at 6:13 AM, Muhammad Atif wrote:

> Okay guys.. with all your support and help in understanding ompi
> architecture, I was able to get Xensocket to work. Only minor changes to the
> xensocket kernel module made it compatible with libevent. I am getting
> results which are bad, but I am sure I have to clean up the code. At least
> my results have improved over native netfront-netback of xen for messages of
> size larger than 1 MB.
>
> I started with making minor changes in the TCP btl, but it seems it is not
> the best way, as the changes are quite huge and it is better to have a
> separate dedicated btl for xensockets. As you guys might be aware, Xen
> supports live migration; now I have one stupid question. My knowledge so far
> suggests that a btl component is initialized only once. The scenario here is
> if my guest os is migrated from one physical node to another, and realizes
> that the communicating processes are now on one physical host and they
> should abandon use of TCP btl and make use of Xensocket btl. I am sure it