Re: [OMPI devel] [devel-core] [RFC] Exit without finalize
Gleb Natapov wrote:
On Thu, Sep 06, 2007 at 06:50:43AM -0600, Ralph H Castain wrote:
WHAT: Decide upon how to handle MPI applications where one or more processes exit without calling MPI_Finalize
WHY: Some applications abort via an exit call, rather than MPI_Abort, when a library (or something else) calls exit. This situation is outside the user's control, so they cannot fix it.
WHERE: Refer to ticket #1144 - code changes are TBD
WHEN: Up to the group
[snip]
Does the general community feel we should do anything here, or is this a "bug" that should be fixed by the entity calling "exit"? I should note that it actually is bad behavior (IMHO) for any library to call "exit" - but then, we do that in some situations too, so perhaps we shouldn't cast stones! Any suggested solutions or comments on whether or not we should do anything would be appreciated.
IMO (a) should be implemented.
I don't think (b) should be implemented. However, one could register an atexit handler that calls MPI_Finalize; the exiting process would then be stuck until everyone else reaches its own exit or MPI_Finalize. That being said, I think (a) probably makes more sense and adheres to the MPI standard. --td
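To illustrate the atexit idea mentioned above, here is a minimal, hypothetical sketch in C (not proposed Open MPI code); it assumes the application, rather than the offending library, registers the handler:

    #include <stdlib.h>
    #include <mpi.h>

    /* Hypothetical sketch only: arrange for MPI_Finalize to run even if a
     * third-party library calls exit().  Caveat from the discussion above:
     * the exiting process then blocks until the other ranks reach their
     * own exit or MPI_Finalize. */
    static void finalize_at_exit(void)
    {
        int finalized = 0;
        MPI_Finalized(&finalized);
        if (!finalized) {
            MPI_Finalize();
        }
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        atexit(finalize_at_exit);
        /* ... application and library code that might call exit() ... */
        MPI_Finalize();
        return 0;
    }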
Re: [OMPI devel] SM BTL hang issue
Scott Atchley wrote: Terry, Are you testing on Linux? If so, which kernel? No, I am running into issues on Solaris but Ollie's run of the test code on Linux seems to work fine. --td See the patch to iperf to handle kernel 2.6.21 and the issue that they had with usleep(0): http://dast.nlanr.net/Projects/Iperf2.0/patch-iperf-linux-2.6.21.txt Scott On Aug 31, 2007, at 1:36 PM, Terry D. Dontje wrote: Ok, I have an update to this issue. I believe there is an implementation difference of sched_yield between Linux and Solaris. If I change the sched_yield in opal_progress to be a usleep(500) then my program completes quite quickly. I have sent a few questions to a Solaris engineer and hopefully will get some useful information. That being said, CT-6's implementation also used yield calls (note this actually is what sched_yield reduces down to in Solaris) and we did not see the same degradation issue as with Open MPI. I believe the reason is because CT-6's SM implementation is not calling CT-6's version of progress recursively and forcing all the unexpected to be read in before continuing. CT-6 also has a natural flow control in it's implementation (ie it has a fixed set fifo for eager messages. I believe both of these characteristics lend CT-6 to not being completely killed by the yield differences. --td Li-Ta Lo wrote: On Thu, 2007-08-30 at 12:45 -0400, terry.don...@sun.com wrote: Li-Ta Lo wrote: On Thu, 2007-08-30 at 12:25 -0400, terry.don...@sun.com wrote: Li-Ta Lo wrote: On Wed, 2007-08-29 at 14:06 -0400, Terry D. Dontje wrote: hmmm, interesting since my version doesn't abort at all. Some problem with fortran compiler/language binding? My C translation doesn't have any problem. [ollie@exponential ~]$ mpirun -np 4 a.out 10 Target duration (seconds): 10.00, #of msgs: 50331, usec per msg: 198.684707 Did you oversubscribe? I found np=10 on a 8 core system clogged things up sufficiently. Yea, I used np 10 on a 2 proc, 2 hyper-thread system (total 4 threads). Is this using Linux? Yes. Ollie ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
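To make the experiment described above concrete, here is a rough sketch (not the actual opal_progress() source; the function and flag names are invented for illustration) of the kind of back-off change being tested:

    #include <sched.h>
    #include <unistd.h>

    /* Illustrative only: when a progress loop makes no progress, either
     * yield the CPU (current behavior) or sleep briefly.  The report above
     * is that on Solaris sched_yield() did not let the other oversubscribed
     * ranks drain their fifos, while usleep(500) did. */
    static void idle_backoff(int events_completed, int use_usleep)
    {
        if (events_completed > 0) {
            return;              /* made progress; keep polling */
        }
        if (use_usleep) {
            usleep(500);         /* experimental back-off from this thread */
        } else {
            sched_yield();       /* default yield under discussion */
        }
    }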
Re: [OMPI devel] SM BTL hang issue
Ok, I have an update to this issue. I believe there is an implementation difference of sched_yield between Linux and Solaris. If I change the sched_yield in opal_progress to be a usleep(500) then my program completes quite quickly. I have sent a few questions to a Solaris engineer and hopefully will get some useful information. That being said, CT-6's implementation also used yield calls (note this actually is what sched_yield reduces down to in Solaris) and we did not see the same degradation issue as with Open MPI. I believe the reason is because CT-6's SM implementation is not calling CT-6's version of progress recursively and forcing all the unexpected to be read in before continuing. CT-6 also has a natural flow control in it's implementation (ie it has a fixed set fifo for eager messages. I believe both of these characteristics lend CT-6 to not being completely killed by the yield differences. --td Li-Ta Lo wrote: On Thu, 2007-08-30 at 12:45 -0400, terry.don...@sun.com wrote: Li-Ta Lo wrote: On Thu, 2007-08-30 at 12:25 -0400, terry.don...@sun.com wrote: Li-Ta Lo wrote: On Wed, 2007-08-29 at 14:06 -0400, Terry D. Dontje wrote: hmmm, interesting since my version doesn't abort at all. Some problem with fortran compiler/language binding? My C translation doesn't have any problem. [ollie@exponential ~]$ mpirun -np 4 a.out 10 Target duration (seconds): 10.00, #of msgs: 50331, usec per msg: 198.684707 Did you oversubscribe? I found np=10 on a 8 core system clogged things up sufficiently. Yea, I used np 10 on a 2 proc, 2 hyper-thread system (total 4 threads). Is this using Linux? Yes. Ollie ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] SM BTL hang issue
hmmm, interesting since my version doesn't abort at all. --td

Li-Ta Lo wrote:
On Wed, 2007-08-29 at 11:36 -0400, Terry D. Dontje wrote:
To run the code I usually do "mpirun -np 6 a.out 10" on a 2 core system. It'll print out the following and then hang:
Target duration (seconds): 10.00
# of messages sent in that time: 589207
Microseconds per message: 16.972

I know almost nothing about FORTRAN but the stack dump told me it got NULL pointer reference when accessing the "me" variable in the do .. while loop. How can this happen?

[ollie@exponential ~]$ mpirun -np 2 a.out 100
[exponential:22145] *** Process received signal ***
[exponential:22145] Signal: Segmentation fault (11)
[exponential:22145] Signal code: Address not mapped (1)
[exponential:22145] Failing at address: (nil)
[exponential:22145] [ 0] [0xb7f2a440]
[exponential:22145] [ 1] a.out(MAIN__+0x54a) [0x804909e]
[exponential:22145] [ 2] a.out(main+0x27) [0x8049127]
[exponential:22145] [ 3] /lib/libc.so.6(__libc_start_main+0xe0) [0x4e75ef70]
[exponential:22145] [ 4] a.out [0x8048aa1]
[exponential:22145] *** End of error message ***

    call MPI_Send(keep_going,1,MPI_LOGICAL,me+1,1,
   $     MPI_COMM_WORLD,ier)
    804909e:  8b 45 d4    mov    0xffd4(%ebp),%eax
    80490a1:  83 c0 01    add    $0x1,%eax

It is compiled with g77/g90.
Ollie
Re: [OMPI devel] SM BTL hang issue
To run the code I usually do "mpirun -np 6 a.out 10" on a 2 core system. It'll print out the following and then hang: Target duration (seconds): 10.00 # of messages sent in that time: 589207 Microseconds per message: 16.972 --td Terry D. Dontje wrote: Heard you the first time Gleb, just been backed up with other stuff. Following is the code: include "mpif.h" character(20) cmd_line_arg ! We'll use the first command-line argument ! to set the duration of the test. real(8) :: duration = 10 ! The default duration (in seconds) can be ! set here. real(8) :: endtime ! This is the time at which we'll end the ! test. integer(8) :: nmsgs = 1! We'll count the number of messages sent ! out from each MPI process. There will be ! at least one message (at the very end), ! and we'll count all the others. logical :: keep_going = .true. ! This flag says whether to keep going. ! Initialize MPI stuff. call MPI_Init(ier) call MPI_Comm_rank(MPI_COMM_WORLD, me, ier) call MPI_Comm_size(MPI_COMM_WORLD, np, ier) if ( np == 1 ) then ! Test to make sure there is at least one other process. write(6,*) "Need at least 2 processes." write(6,*) "Try resubmitting the job with" write(6,*) " 'mpirun -np '" write(6,*) "where is at least 2." else if ( me == 0 ) then ! The first command-line argument is the duration of the test (seconds). call get_command_argument(1,cmd_line_arg,len,istat) if ( istat == 0 ) read(cmd_line_arg,*) duration ! Loop until test is done. endtime = MPI_Wtime() + duration ! figure out when to end do while ( MPI_Wtime() < endtime ) call MPI_Send(keep_going,1,MPI_LOGICAL,1,1,MPI_COMM_WORLD,ier) nmsgs = nmsgs + 1 end do ! Then, send the closing signal. keep_going = .false. call MPI_Send(keep_going,1,MPI_LOGICAL,1,1,MPI_COMM_WORLD,ier) ! Write summary information. write(6,'("Target duration (seconds):",f18.6)') duration write(6,'("# of messages sent in that time:", i12)') nmsgs write(6,'("Microseconds per message:", f19.3)') 1.d6 * duration / nmsgs else ! If you're not Process 0, you need to receive messages ! (and possibly relay them onward). do while ( keep_going ) call MPI_Recv(keep_going,1,MPI_LOGICAL,me-1,1,MPI_COMM_WORLD, & MPI_STATUS_IGNORE,ier) if ( me == np - 1 ) cycle ! The last process only receives messages. call MPI_Send(keep_going,1,MPI_LOGICAL,me+1,1,MPI_COMM_WORLD,ier) end do end if ! Finalize. call MPI_Finalize(ier) end Sorry it is in Fortran. --td Gleb Natapov wrote: On Wed, Aug 29, 2007 at 11:01:14AM -0400, Richard Graham wrote: If you are going to look at it, I will not bother with this. I need the code to reproduce the problem. Otherwise I have nothing to look at. Rich On 8/29/07 10:47 AM, "Gleb Natapov" wrote: On Wed, Aug 29, 2007 at 10:46:06AM -0400, Richard Graham wrote: Gleb, Are you looking at this ? Not today. And I need the code to reproduce the bug. Is this possible? Rich On 8/29/07 9:56 AM, "Gleb Natapov" wrote: On Wed, Aug 29, 2007 at 04:48:07PM +0300, Gleb Natapov wrote: Is this trunk or 1.2? Oops. I should read more carefully :) This is trunk. On Wed, Aug 29, 2007 at 09:40:30AM -0400, Terry D. Dontje wrote: I have a program that does a simple bucket brigade of sends and receives where rank 0 is the start and repeatedly sends to rank 1 until a certain amount of time has passed and then it sends and all done packet. Running this under np=2 always works. 
However, when I run with greater than 2 using only the SM btl the program usually hangs and one of the processes has a long stack that has a lot of the following 3 calls in it:
[25] opal_progress(), line 187 in "opal_progress.c"
[26] mca_btl_sm_component_progress(), line 397 in "btl_sm_component.c"
[27] mca_bml_r2_progress(), line 110 in "bml_r2.c"
When stepping through the ompi_fifo_write_to_head routine it looks like the fifo has overflowed. I am wondering if what is happening is rank 0 has sent a bunch of messages that have exhausted the resources such that one of the middle ranks which is in the process of sending cannot send and therefore never gets to the point of trying to receive the messages from rank 0? Is the above a possible scenario or are messages periodically bled off the SM BTL's fifos? Note, I have seen np=3 pass sometimes and I can get it to pass reliably if I raise the shared memory space used by the BTL. This is using the trunk. --td
Re: [OMPI devel] SM BTL hang issue
Heard you the first time Gleb, just been backed up with other stuff. Following is the code:

      include "mpif.h"

      character(20) cmd_line_arg     ! We'll use the first command-line argument
                                     ! to set the duration of the test.
      real(8) :: duration = 10       ! The default duration (in seconds) can be
                                     ! set here.
      real(8) :: endtime             ! This is the time at which we'll end the
                                     ! test.
      integer(8) :: nmsgs = 1        ! We'll count the number of messages sent
                                     ! out from each MPI process. There will be
                                     ! at least one message (at the very end),
                                     ! and we'll count all the others.
      logical :: keep_going = .true. ! This flag says whether to keep going.

      ! Initialize MPI stuff.
      call MPI_Init(ier)
      call MPI_Comm_rank(MPI_COMM_WORLD, me, ier)
      call MPI_Comm_size(MPI_COMM_WORLD, np, ier)

      if ( np == 1 ) then
         ! Test to make sure there is at least one other process.
         write(6,*) "Need at least 2 processes."
         write(6,*) "Try resubmitting the job with"
         write(6,*) "   'mpirun -np <np>'"
         write(6,*) "where <np> is at least 2."
      else if ( me == 0 ) then
         ! The first command-line argument is the duration of the test (seconds).
         call get_command_argument(1,cmd_line_arg,len,istat)
         if ( istat == 0 ) read(cmd_line_arg,*) duration

         ! Loop until test is done.
         endtime = MPI_Wtime() + duration   ! figure out when to end
         do while ( MPI_Wtime() < endtime )
            call MPI_Send(keep_going,1,MPI_LOGICAL,1,1,MPI_COMM_WORLD,ier)
            nmsgs = nmsgs + 1
         end do

         ! Then, send the closing signal.
         keep_going = .false.
         call MPI_Send(keep_going,1,MPI_LOGICAL,1,1,MPI_COMM_WORLD,ier)

         ! Write summary information.
         write(6,'("Target duration (seconds):",f18.6)') duration
         write(6,'("# of messages sent in that time:", i12)') nmsgs
         write(6,'("Microseconds per message:", f19.3)') 1.d6 * duration / nmsgs
      else
         ! If you're not Process 0, you need to receive messages
         ! (and possibly relay them onward).
         do while ( keep_going )
            call MPI_Recv(keep_going,1,MPI_LOGICAL,me-1,1,MPI_COMM_WORLD, &
                          MPI_STATUS_IGNORE,ier)
            if ( me == np - 1 ) cycle   ! The last process only receives messages.
            call MPI_Send(keep_going,1,MPI_LOGICAL,me+1,1,MPI_COMM_WORLD,ier)
         end do
      end if

      ! Finalize.
      call MPI_Finalize(ier)

      end

Sorry it is in Fortran. --td

Gleb Natapov wrote:
On Wed, Aug 29, 2007 at 11:01:14AM -0400, Richard Graham wrote:
If you are going to look at it, I will not bother with this.
I need the code to reproduce the problem. Otherwise I have nothing to look at.
Rich
On 8/29/07 10:47 AM, "Gleb Natapov" wrote:
On Wed, Aug 29, 2007 at 10:46:06AM -0400, Richard Graham wrote:
Gleb, Are you looking at this ?
Not today. And I need the code to reproduce the bug. Is this possible?
Rich
On 8/29/07 9:56 AM, "Gleb Natapov" wrote:
On Wed, Aug 29, 2007 at 04:48:07PM +0300, Gleb Natapov wrote:
Is this trunk or 1.2?
Oops. I should read more carefully :) This is trunk.
On Wed, Aug 29, 2007 at 09:40:30AM -0400, Terry D. Dontje wrote:
I have a program that does a simple bucket brigade of sends and receives where rank 0 is the start and repeatedly sends to rank 1 until a certain amount of time has passed and then it sends and all done packet. Running this under np=2 always works. However, when I run with greater than 2 using only the SM btl the program usually hangs and one of the processes has a long stack that has a lot of the following 3 calls in it:
[25] opal_progress(), line 187 in "opal_progress.c"
[26] mca_btl_sm_component_progress(), line 397 in "btl_sm_component.c"
[27] mca_bml_r2_progress(), line 110 in "bml_r2.c"
When stepping through the ompi_fifo_write_to_head routine it looks like the fifo has overflowed.
I am wondering if what is happening is rank 0 has sent a bunch of messages that have exhausted the resources such that one of the middle ranks which is in the process of sending cannot send and therefore never gets to the point of trying to receive the messages from rank 0? Is the above a possible scenario or are messages periodically bled off the SM BTL's fifos? Note, I have seen np=3 pass sometimes and I can get it to pass reliably if I raise the shared memory space used by the BTL. This is using the trunk. --td
-- Gleb.
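Since the thread mentions that a C translation of this reproducer ran cleanly on Linux, here is a hedged reconstruction of such a translation (an illustrative rewrite, not the exact file Ollie used):

    /* Bucket-brigade reproducer, C version.  Rank 0 sends to rank 1 for
     * 'duration' seconds; every other rank receives from rank-1 and
     * forwards to rank+1, except the last rank, which only receives. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int me, np, keep_going = 1;
        long nmsgs = 1;
        double duration = 10.0, endtime;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &me);
        MPI_Comm_size(MPI_COMM_WORLD, &np);

        if (np == 1) {
            printf("Need at least 2 processes.\n");
        } else if (me == 0) {
            if (argc > 1) duration = atof(argv[1]);
            endtime = MPI_Wtime() + duration;
            while (MPI_Wtime() < endtime) {
                MPI_Send(&keep_going, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
                nmsgs++;
            }
            keep_going = 0;                      /* closing signal */
            MPI_Send(&keep_going, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
            printf("Target duration (seconds): %f\n", duration);
            printf("# of messages sent in that time: %ld\n", nmsgs);
            printf("Microseconds per message: %f\n", 1.0e6 * duration / nmsgs);
        } else {
            while (keep_going) {
                MPI_Recv(&keep_going, 1, MPI_INT, me - 1, 1, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                if (me == np - 1) continue;      /* last rank only receives */
                MPI_Send(&keep_going, 1, MPI_INT, me + 1, 1, MPI_COMM_WORLD);
            }
        }

        MPI_Finalize();
        return 0;
    }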
Re: [OMPI devel] SM BTL hang issue
Trunk. --td Gleb Natapov wrote: Is this trunk or 1.2? On Wed, Aug 29, 2007 at 09:40:30AM -0400, Terry D. Dontje wrote: I have a program that does a simple bucket brigade of sends and receives where rank 0 is the start and repeatedly sends to rank 1 until a certain amount of time has passed and then it sends and all done packet. Running this under np=2 always works. However, when I run with greater than 2 using only the SM btl the program usually hangs and one of the processes has a long stack that has a lot of the following 3 calls in it: [25] opal_progress(), line 187 in "opal_progress.c" [26] mca_btl_sm_component_progress(), line 397 in "btl_sm_component.c" [27] mca_bml_r2_progress(), line 110 in "bml_r2.c" When stepping through the ompi_fifo_write_to_head routine it looks like the fifo has overflowed. I am wondering if what is happening is rank 0 has sent a bunch of messages that have exhausted the resources such that one of the middle ranks which is in the process of sending cannot send and therefore never gets to the point of trying to receive the messages from rank 0? Is the above a possible scenario or are messages periodically bled off the SM BTL's fifos? Note, I have seen np=3 pass sometimes and I can get it to pass reliably if I raise the shared memory space used by the BTL. This is using the trunk. --td ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel -- Gleb. ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
[OMPI devel] SM BTL hang issue
I have a program that does a simple bucket brigade of sends and receives where rank 0 is the start and repeatedly sends to rank 1 until a certain amount of time has passed, and then it sends an "all done" packet. Running this under np=2 always works. However, when I run with greater than 2 using only the SM btl the program usually hangs and one of the processes has a long stack that has a lot of the following 3 calls in it:
[25] opal_progress(), line 187 in "opal_progress.c"
[26] mca_btl_sm_component_progress(), line 397 in "btl_sm_component.c"
[27] mca_bml_r2_progress(), line 110 in "bml_r2.c"
When stepping through the ompi_fifo_write_to_head routine it looks like the fifo has overflowed. I am wondering whether rank 0 has sent enough messages to exhaust the resources, so that one of the middle ranks, which is in the middle of a send, can never complete it and therefore never gets to the point of receiving the messages from rank 0. Is that a possible scenario, or are messages periodically bled off the SM BTL's fifos? Note, I have seen np=3 pass sometimes, and I can get it to pass reliably if I raise the shared memory space used by the BTL. This is using the trunk. --td
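For anyone trying to reproduce this, a plausible invocation would look like the following (the BTL restriction is the standard MCA syntax; the mpool_sm_max_size value is the workaround discussed later in this digest, so treat the exact number as illustrative):

    # restrict the run to the shared-memory BTL (plus self)
    mpirun -np 6 --mca btl self,sm a.out 10

    # same run, but with a larger shared-memory backing segment
    mpirun -np 6 --mca btl self,sm --mca mpool_sm_max_size 2147483647 a.out 10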
[OMPI devel] vpath and vt-integration tmp branch
I've tried to do a vpath configure on the vt-integration tmp branch and get the following: configure: Entering directory './tracing/vampirtrace' /workspace/tdd/ct7/ompi-ws-vt//ompi-vt-integration/builds/ompi-vt-integration/configure: line 144920: cd: ./tracing/vampirtrace: No such file or directory Has this branch been tested with vpath? --td
Re: [OMPI devel] Maximum Shared Memory Segment - OK to increase?
Maybe a clarification of the SM BTL implementation is needed. Does the SM BTL not set a limit based on np, using the max allowable as a ceiling? If not, and all jobs are allowed to use up to the max allowable, I see the reason for not wanting to raise the max allowable. That being said, it seems to me that the memory usage of the SM BTL is a lot larger than it should be. Wasn't there some work done around June that looked at why the SM BTL was allocating a lot of memory - did anything come out of that? --td

Markus Daene wrote:
Rolf, I think it is not a good idea to increase the default value to 2G. You have to keep in mind that there are not so many people who have a machine with 128 and more cores on a single node. The average people will have nodes with 2, 4, maybe 8 cores, and therefore it is not necessary to set this parameter to such a high value. Eventually it allocates all of this memory per node, and if you have only 4 or 8G per node it will be imbalanced. For my 8-core nodes I have even decreased the sm_max_size to 32G and I had no problems with that. As far as I know (if not otherwise specified during runtime) this parameter is global. So even if you run on your machine with 2 procs it might allocate the 2G for the MPI smp module. I would recommend, like Richard suggests, to set the parameter for your machine in etc/openmpi-mca-params.conf and not to change the default value. Markus

Rolf vandeVaart wrote:
We are running into a problem when running on one of our larger SMPs using the latest Open MPI v1.2 branch. We are trying to run a job with np=128 within a single node. We are seeing the following error: "SM failed to send message due to shortage of shared memory." We then increased the allowable maximum size of the shared segment to 2 Gigabytes - 1, which is the maximum allowed for a 32-bit application. We used the mca parameter to increase it as shown here: -mca mpool_sm_max_size 2147483647 This allowed the program to run to completion. Therefore, we would like to increase the default maximum from 512 Mbytes to 2G-1. Does anyone have an objection to this change? Soon we are going to have larger CPU counts and would like to increase the odds that things work "out of the box" on these large SMPs. On a side note, I did a quick comparison of the shared memory needs of the old Sun ClusterTools to Open MPI and came up with this table.

     np   Sun ClusterTools 6   Open MPI current   Open MPI suggested
      2          20M                 128M               128M
      4          20M                 128M               128M
      8          22M                 256M               256M
     16          27M                 512M               512M
     32          48M                 512M                 1G
     64         133M                 512M               2G-1
    128         476M                 512M               2G-1
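To make Markus's suggestion concrete, the default can be raised per installation (or per user) in the MCA parameter file instead of changing the compiled-in value; a sketch of the relevant line, using the 2G-1 figure from Rolf's test (adjust to taste):

    # $prefix/etc/openmpi-mca-params.conf  (or ~/.openmpi/mca-params.conf)
    mpool_sm_max_size = 2147483647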
Re: [OMPI devel] [devel-core] [RFC] Runtime Services Layer
George Bosilca wrote: Looks like I'm the only one barely excited about this idea. The system that you described, is well known. It been around for around 10 years, and it's called PMI. The interface you have in the tmp branch as well as the description you gave in your email are more than similar with what they sketch in the following two documents: http://www-unix.mcs.anl.gov/mpi/mpich/developer/design/pmiv2draft.htm http://www-unix.mcs.anl.gov/mpi/mpich/developer/design/pmiv2.htm Now, there is something wrong with reinventing the wheel if there are no improvements. And so far I'm unable to notice any major improvement neither compared with PMI nor with what we have today (except maybe being able to use PMI inside Open MPI). I agree with the first sentence above. I think this goes along the line of Raplh's comment of "what are we trying to solve here?" When this all started about 6 months ago I think the main concern was finding what interfaces existed between ORTE and OMPI. Though I am not sure how that blossomed into redesigning the interface. Not saying there isn't a reason to just that we should step back and make sure we know why we are. Again, my main concern is about fault tolerance. There is nothing in PMI (and nothing in RSL so far) that allow any kind of fault tolerance [And believe me re-writing the MPICH mpirun to allow checkpoint/restart is a hassle]. Moreover, your approach seems to open the possibility of having heterogeneous RTE (in terms of features) which in my view is definitively the wrong approach. I am curious about this last paragraph. Is it your belief that the current ORTE does lend itself to being extended to incorporate fault tolerance? Also, by heterogenous RTE are you meaning RTE running on a cluster of heterogenous set of platforms? If so, I would like to understand why you think that is the "wrong" approach. --td george. On Aug 16, 2007, at 9:47 PM, Tim Prins wrote: WHAT: Solicitation of feedback on the possibility of adding a runtime services layer to Open MPI to abstract out the runtime. WHY: To solidify the interface between OMPI and the runtime environment, and to allow the use of different runtime systems, including different versions of ORTE. WHERE: Addition of a new framework to OMPI, and changes to many of the files in OMPI to funnel all runtime request through this framework. Few changes should be required in OPAL and ORTE. WHEN: Development has started in tmp/rsl, but is still in its infancy. We hope to have a working system in the next month. TIMEOUT: 8/29/07 -- Short version: I am working on creating an interface between OMPI and the runtime system. This would make a RSL framework in OMPI which all runtime services would be accessed from. Attached is a graphic depicting this. This change would be invasive to the OMPI layer. Few (if any) changes will be required of the ORTE and OPAL layers. At this point I am soliciting feedback as to whether people are supportive or not of this change both in general and for v1.3. Long version: The current model used in Open MPI assumes that one runtime system is the best for all environments. However, in many environments it may be beneficial to have specialized runtime systems. With our current system this is not easy to do. With this in mind, the idea of creating a 'runtime services layer' was hatched. This would take the form of a framework within OMPI, through which all runtime functionality would be accessed. This would allow new or different runtime systems to be used with Open MPI. 
Additionally, with such a system it would be possible to have multiple versions of open rte coexisting, which may facilitate development and testing. Finally, this would solidify the interface between OMPI and the runtime system, as well as provide documentation and side effects of each interface function. However, such a change would be fairly invasive to the OMPI layer, and needs a buy-in from everyone for it to be possible. Here is a summary of the changes required for the RSL (at least how it is currently envisioned): 1. Add a framework to ompi for the rsl, and a component to support orte. 2. Change ompi so that it uses the new interface. This involves: a. Moving runtime specific code into the orte rsl component. b. Changing the process names in ompi to an opaque object. c. change all references to orte in ompi to be to the rsl. 3. Change the configuration code so that open-rte is only linked where needed. Of course, all this would happen on a tmp branch. The design of the rsl is not solidified. I have been playing in a tmp branch (located at https://svn.open-mpi.org/svn/ompi/tmp/rsl) which everyone is welcome to look at and comment on, but be advised that things here are subject to change (I don't think it even compiles right now). There are some fairly large open questions on this, including:
Re: [OMPI devel] Potential issue with PERUSE_COMM_MSG_MATCH_POSTED_REQ event called for unexpected matches
Nevermind my message below, things seem to be working for me now. Not sure what happened. --td Terry D. Dontje wrote: Rainer Keller wrote: Hi Terry, On Wednesday 22 August 2007 16:22, Terry D. Dontje wrote: I thought I would run this by the group before trying to unravel the code and figure out how to fix the problem. It looks to me from some experiementation that when a process matches an unexpected message that the PERUSE framework incorrectly fires a PERUSE_COMM_MSG_MATCH_POSTED_REQ in addition to a PERUSE_COMM_REQ_MATCH_UNEX event. I believe this is wrong that the former event should not be fired in this case. You are right, the former event PERUSE_COMM_MSG_MATCH_POSTED_Q should not be posted, as this was an unexpected message. If the above assumption is true I think the problem arises because PERUSE_COMM_MSG_MATCH_POSTED_REQ event is fired in function mca_pml_ob1_recv_request_progress which is called by mca_pml_ob1_recv_request_match_specific when a match of an unexpected message has occurred. I am wondering if the PERUSE_COMM_MSG_MATCH_POSTED_REQ event should be moved to a more posted queue centric routine something like mca_pml_ob1_recv_frag_match? I believe, this is correct -- at least this works for a large message late sender and late receiver test program mpi_peruse.c. Should be fixed with the committed patch v15947. Actually, there are two other items, one is a missing PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q... This works for large posted messges but when the posted message is small you don't see the unexpected messages at all now. --td Additionally, we have a problem that we fire PERUSE_COMM_REQ_ACTIVATE event for MPI_*Probe-function calls. The solution is to move the pml_base_sendreq.h / pml_base_recv_req.h into pml_ob1_irecv.c, and resp. pml_ob1_isend.c Please see the v15945. With best regards, Rainer ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] Potential issue with PERUSE_COMM_MSG_MATCH_POSTED_REQ event called for unexpected matches
Rainer Keller wrote: Hi Terry, On Wednesday 22 August 2007 16:22, Terry D. Dontje wrote: I thought I would run this by the group before trying to unravel the code and figure out how to fix the problem. It looks to me from some experiementation that when a process matches an unexpected message that the PERUSE framework incorrectly fires a PERUSE_COMM_MSG_MATCH_POSTED_REQ in addition to a PERUSE_COMM_REQ_MATCH_UNEX event. I believe this is wrong that the former event should not be fired in this case. You are right, the former event PERUSE_COMM_MSG_MATCH_POSTED_Q should not be posted, as this was an unexpected message. If the above assumption is true I think the problem arises because PERUSE_COMM_MSG_MATCH_POSTED_REQ event is fired in function mca_pml_ob1_recv_request_progress which is called by mca_pml_ob1_recv_request_match_specific when a match of an unexpected message has occurred. I am wondering if the PERUSE_COMM_MSG_MATCH_POSTED_REQ event should be moved to a more posted queue centric routine something like mca_pml_ob1_recv_frag_match? I believe, this is correct -- at least this works for a large message late sender and late receiver test program mpi_peruse.c. Should be fixed with the committed patch v15947. Actually, there are two other items, one is a missing PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q... This works for large posted messges but when the posted message is small you don't see the unexpected messages at all now. --td Additionally, we have a problem that we fire PERUSE_COMM_REQ_ACTIVATE event for MPI_*Probe-function calls. The solution is to move the pml_base_sendreq.h / pml_base_recv_req.h into pml_ob1_irecv.c, and resp. pml_ob1_isend.c Please see the v15945. With best regards, Rainer
[OMPI devel] Potential issue with PERUSE_COMM_MSG_MATCH_POSTED_REQ event called for unexpected matches
I thought I would run this by the group before trying to unravel the code and figure out how to fix the problem. It looks to me from some experiementation that when a process matches an unexpected message that the PERUSE framework incorrectly fires a PERUSE_COMM_MSG_MATCH_POSTED_REQ in addition to a PERUSE_COMM_REQ_MATCH_UNEX event. I believe this is wrong that the former event should not be fired in this case. If the above assumption is true I think the problem arises because PERUSE_COMM_MSG_MATCH_POSTED_REQ event is fired in function mca_pml_ob1_recv_request_progress which is called by mca_pml_ob1_recv_request_match_specific when a match of an unexpected message has occurred. I am wondering if the PERUSE_COMM_MSG_MATCH_POSTED_REQ event should be moved to a more posted queue centric routine something like mca_pml_ob1_recv_frag_match? Suggestions...thoughts? --td
Re: [OMPI devel] [RFC] Runtime Services Layer
I think the concept is a good idea. A few questions that come to mind: 1. Do you have a set of APIs you plan on supporting? 2. Are you planning on adding new APIs (not currently supported by ORTE)? 3. Do any of the ORTE replacement APIs differ in how they work? 4. Will RSL change in how we access information from the GPR? If not how does this layer really separate us from ORTE? 5. How will RSL handle OOB functionality (routing of messages)? 6. How does making the process names opaque differ from how ORTE names processes? Do you still need a global namespace for a "universe"? I like the idea but I really wonder if this will even be half-baked in time for 1.3 (same concern as Jeff's). --td Tim Prins wrote: WHAT: Solicitation of feedback on the possibility of adding a runtime services layer to Open MPI to abstract out the runtime. WHY: To solidify the interface between OMPI and the runtime environment, and to allow the use of different runtime systems, including different versions of ORTE. WHERE: Addition of a new framework to OMPI, and changes to many of the files in OMPI to funnel all runtime request through this framework. Few changes should be required in OPAL and ORTE. WHEN: Development has started in tmp/rsl, but is still in its infancy. We hope to have a working system in the next month. TIMEOUT: 8/29/07 -- Short version: I am working on creating an interface between OMPI and the runtime system. This would make a RSL framework in OMPI which all runtime services would be accessed from. Attached is a graphic depicting this. This change would be invasive to the OMPI layer. Few (if any) changes will be required of the ORTE and OPAL layers. At this point I am soliciting feedback as to whether people are supportive or not of this change both in general and for v1.3. Long version: The current model used in Open MPI assumes that one runtime system is the best for all environments. However, in many environments it may be beneficial to have specialized runtime systems. With our current system this is not easy to do. With this in mind, the idea of creating a 'runtime services layer' was hatched. This would take the form of a framework within OMPI, through which all runtime functionality would be accessed. This would allow new or different runtime systems to be used with Open MPI. Additionally, with such a system it would be possible to have multiple versions of open rte coexisting, which may facilitate development and testing. Finally, this would solidify the interface between OMPI and the runtime system, as well as provide documentation and side effects of each interface function. However, such a change would be fairly invasive to the OMPI layer, and needs a buy-in from everyone for it to be possible. Here is a summary of the changes required for the RSL (at least how it is currently envisioned): 1. Add a framework to ompi for the rsl, and a component to support orte. 2. Change ompi so that it uses the new interface. This involves: a. Moving runtime specific code into the orte rsl component. b. Changing the process names in ompi to an opaque object. c. change all references to orte in ompi to be to the rsl. 3. Change the configuration code so that open-rte is only linked where needed. Of course, all this would happen on a tmp branch. The design of the rsl is not solidified. 
I have been playing in a tmp branch (located at https://svn.open-mpi.org/svn/ompi/tmp/rsl) which everyone is welcome to look at and comment on, but be advised that things here are subject to change (I don't think it even compiles right now). There are some fairly large open questions on this, including: 1. How to handle mpirun (that is, when a user types 'mpirun', do they always get ORTE, or do they sometimes get a system specific runtime). Most likely mpirun will always use ORTE, and alternative launching programs would be used for other runtimes. 2. Whether there will be any performance implications. My guess is not, but am not quite sure of this yet. Again, I am interested in people's comments on whether they think adding such abstraction is good or not, and whether it is reasonable to do such a thing for v1.3. Thanks, Tim Prins ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
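As a purely illustrative sketch of item 2b above (making the process name an opaque object behind the rsl framework), something along these lines is what "opaque" usually implies; none of these names are taken from the tmp/rsl branch:

    /* Hypothetical illustration only -- not code from tmp/rsl. */
    typedef struct ompi_rsl_process_name_t ompi_rsl_process_name_t;

    /* The MPI layer would only ever hold handles and call comparison or
     * stringification functions supplied by the active rsl component; it
     * would never look inside the name itself, so different runtimes can
     * use different name representations. */
    int ompi_rsl_process_name_compare(const ompi_rsl_process_name_t *a,
                                      const ompi_rsl_process_name_t *b);
    const char *ompi_rsl_process_name_to_string(const ompi_rsl_process_name_t *p);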
Re: [OMPI devel] openib btl header caching
Jeff Squyres wrote:
With Mellanox's new HCA (ConnectX), extremely low latencies are possible for short messages between two MPI processes. Currently, OMPI's latency is around 1.9us while all other MPI's (HP MPI, Intel MPI, MVAPICH[2], etc.) are around 1.4us. A big reason for this difference is that, at least with MVAPICH[2], they are doing wire protocol header caching where the openib BTL does not. Specifically:
- Mellanox tested MVAPICH with the header caching; latency was around 1.4us
- Mellanox tested MVAPICH without the header caching; latency was around 1.9us
Given that OMPI is the lone outlier around 1.9us, I think we have no choice except to implement the header caching and/or examine our header to see if we can shrink it. Mellanox has volunteered to implement header caching in the openib btl. Any objections? We can discuss what approaches we want to take (there's going to be some complications because of the PML driver, etc.); perhaps in the Tuesday Mellanox teleconf...?

This sounds great. Sun would like to hear how things are being done so we can possibly port the solution to the udapl btl. --td
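For readers unfamiliar with the technique, here is a hedged, conceptual sketch of what "wire protocol header caching" generally means; this is not the openib BTL's actual header layout nor the scheme Mellanox will implement, just the basic idea of sending a tiny token when the match header is unchanged from the previous message to the same peer:

    /* Conceptual sketch of header caching, not OMPI/MVAPICH code. */
    #include <stdint.h>
    #include <string.h>
    #include <stddef.h>

    typedef struct {
        uint16_t ctx_id;     /* communicator id */
        uint16_t src_rank;   /* sender's rank    */
        uint32_t tag;        /* message tag      */
    } match_hdr_t;

    typedef struct {
        match_hdr_t last_hdr;   /* last full header sent on this connection */
        int         valid;
    } hdr_cache_t;

    /* If the new header matches what the receiver already has cached for
     * this connection, emit only a 1-byte "same as before" marker; else
     * emit the full header and refresh the cache. */
    static size_t build_header(hdr_cache_t *cache, const match_hdr_t *hdr,
                               unsigned char *wire_buf)
    {
        if (cache->valid && memcmp(&cache->last_hdr, hdr, sizeof(*hdr)) == 0) {
            wire_buf[0] = 0x01;                 /* cached-header marker */
            return 1;
        }
        wire_buf[0] = 0x00;                     /* full header follows */
        memcpy(wire_buf + 1, hdr, sizeof(*hdr));
        cache->last_hdr = *hdr;
        cache->valid = 1;
        return 1 + sizeof(*hdr);
    }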
Re: [OMPI devel] [RFC] New command line options to replace persistent daemon operations
Ralph Castain wrote: WHAT: Proposal to add two new command line options that will allow us to replace the current need to separately launch a persistent daemon to support connect/accept operations WHY:Remove problems of confusing multiple allocations, provide a cleaner method for connect/accept between jobs WHERE: minor changes in orterun and orted, some code in rmgr and each pls to ensure the proper jobid and connect info is passed to each app_context as it is launched It is my opinion that we would be better off attacking the issues of the persistent daemons described below then creating a new set of options to mpirun for process placement. (more comments below on the actual proposal). TIMOUT: 8/10/07 We currently do not support connect/accept operations in a clean way. Users are required to first start a persistent daemon that operates in a user-named universe. They then must enter the mpirun command for each application in a separate window, providing the universe name on each command line. This is required because (a) mpirun will not run in the background (in fact, at one point in time it would segfault, though I believe it now just hangs), and (b) we require that all applications using connect/accept operate under the same HNP. This is burdensome and appears to be causing problems for users as it requires them to remember to launch that persistent daemon first - otherwise, the applications execute, but never connect. Additionally, we have the problem of confused allocations from the different login sessions. This has caused numerous problems of processes going to incorrect locations, allocations timing out at different times and causing jobs to abort, etc. What I propose here is to eliminate the confusion in a manner that minimizes code complexity. The idea is to utilize our so-painfully-developed multiple app_context capability to have the user launch all the interacting applications with the same mpirun command. This not only eliminates the annoyance factor for users by eliminating the need for multiple steps and login sessions, but also solves the problem of ensuring that all applications are running in the same allocation (so we don't have to worry any more about timeouts in one allocation aborting another job). The proposal is to add two command line options that are associated with a specific app_context (feel free to redefine the name of the option - I don't personally care): 1. --independent-job - indicates that this app_context is to be launched as an independent job. We will assign it a separate jobid, though we will map it as part of the overall command (e.g., if by slot and no other directives provided, it will start mapping where the prior app_context left off) I am unclear what does the option --connect really do? The MPI codes actually have to call MPI_Comm_connect to really connect to a process. Can we get away with just the above option? 2. --connect x,y,z - only valid when combined with the above option, indicates that this independent job is to be MPI-connected to app_contexts x,y,z (where x,y,z are the number of the app_context, counting from the beginning of the command - you choose if we start from 0 or 1). Alternatively, we can default to connecting to everyone, and then use --disconnect to indicate we -don't- want to be connected. Note that this means the entire allocation for the combined app_contexts must be provided. 
This helps the RTE tremendously to keep things straight, and ensures that all the app_contexts will be able to complete (or not) in a synchronized fashion. It also allows us to eliminate the persistent daemon and multiple login session requirements for connect/accept. That does not mean we cannot have a persistent daemon to create a virtual machine, assuming we someday want to support that mode of operation. This simply removes the requirement that the user start one just so they can use connect/accept. Comments? ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
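For context on Terry's question about --connect: the application side of connect/accept is the standard MPI-2 dynamics sequence, independent of how the jobs are launched. A minimal sketch using standard MPI calls (each function would be called between MPI_Init and MPI_Finalize in its own job; how the port string gets from server to client - printing it, a file, a name server - is left out):

    #include <stdio.h>
    #include <mpi.h>

    /* Server job: open a port and accept a connection from another job. */
    static void server(void)
    {
        char port[MPI_MAX_PORT_NAME];
        MPI_Comm inter;

        MPI_Open_port(MPI_INFO_NULL, port);
        printf("port: %s\n", port);        /* hand this string to the client */
        MPI_Comm_accept(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        /* ... communicate over 'inter' ... */
        MPI_Comm_disconnect(&inter);
        MPI_Close_port(port);
    }

    /* Client job: connect using the port name obtained from the server. */
    static void client(char *port)
    {
        MPI_Comm inter;

        MPI_Comm_connect(port, MPI_INFO_NULL, 0, MPI_COMM_SELF, &inter);
        /* ... communicate over 'inter' ... */
        MPI_Comm_disconnect(&inter);
    }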
[OMPI devel] Locking issue with OB1 PML
I think I've found a problem that is causing at least some of my runs of the MT tests to abort or hang. The issue is that in the OB1 request structure there is a req_send_range_lock that is never initialized with the appropriate (pthread_)mutex_init call. I've put in the following patch (given to me by Jeff) in ompi/mca/pml/ob1/pml_ob1_sendreq.c

Index: pml_ob1_sendreq.c
===================================================================
--- pml_ob1_sendreq.c   (revision 15535)
+++ pml_ob1_sendreq.c   (working copy)
@@ -136,12 +136,18 @@
     req->req_rdma_cnt = 0;
     req->req_throttle_sends = false;
     OBJ_CONSTRUCT(&req->req_send_ranges, opal_list_t);
+    OBJ_CONSTRUCT(&req->req_send_range_lock, opal_mutex_t);
 }

+static void mca_pml_ob1_send_request_destruct (mca_pml_ob1_send_request_t* req)
+{
+    OBJ_DESTRUCT(&req->req_send_range_lock);
+}
+
 OBJ_CLASS_INSTANCE(
     mca_pml_ob1_send_request_t,
     mca_pml_base_send_request_t,
     mca_pml_ob1_send_request_construct,
-    NULL );
+    mca_pml_ob1_send_request_destruct);

 /**
  * Completion of a short message - nothing left to schedule. Note that this

The above seems to at least allow one of my tests to consistently pass (haven't tried the other tests yet). I was wanting to see if the above fix makes sense and if possibly there are similar issues with the other PMLs. Thanks, --td
[OMPI devel] Call for OMPI Binary Distributions
This announcement is to request links to Binary Distributions of Open MPI that our community may have on the web for users to download. We'd like to take those links and post them on our download page to make it easier for those who are interested in getting binaries to install rather than the source code. This information will be posted on the download pages like http://www.open-mpi.org/software/ompi/v1.2/. What we need is the following:
1. What OMPI release the distribution is based off of (v1.2...)
2. The content description/name of your distribution
3. The link to your distribution(s)
4. The date the distribution was released
Thanks, --td
Re: [OMPI devel] Multi-environment builds
Jeff Squyres wrote:
On Jul 10, 2007, at 1:26 PM, Ralph H Castain wrote:
2. It may be useful to have some high-level parameters to specify a specific run-time environment, since ORTE has multiple, related frameworks (e.g., RAS and PLS). E.g., "orte_base_launcher=tm", or somesuch.
I was just writing this up in an enhancement ticket when the thought hit me: isn't this aggregate MCA parameters? I.e.: mpirun --am tm ... Specifically, we'll need to make a "tm" AMCA file (and whatever other ones we want), but my point is: does AMCA already give us what we want?

The above sounds like a possible solution as long as we are going to deliver a set of such files and not require each site to create their own. Also, can one pull in multiple AMCA files for one run, so that a user could specify the tm AMCA file plus some other AMCA file of their own? --td
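For illustration, a hypothetical "tm" AMCA file could be little more than component selections for the RAS and PLS frameworks Ralph mentions. The parameter names and file location below are assumptions for the sketch, not a tested file:

    # share/openmpi/amca-param-sets/tm  (hypothetical contents)
    ras = tm
    pls = tm

which would then be pulled in with the usage Jeff describes: mpirun --am tm -np 16 ./a.out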
Re: [OMPI devel] Modex
Cool this sounds good enough to me. --td Brian Barrett wrote: THe function name changes are pretty obvious (s/mca_pml_base/ompi/), and I thought I'd try something new and actually document the interface in the header file :). So we should be good on that front. Brian On Jun 27, 2007, at 6:38 AM, Terry D. Dontje wrote: I am ok with the following as long as we can have some sort of documenation describing what changed like which old functions are replaced with newer functions and any description of changed assumptions. --td Brian Barrett wrote: On Jun 26, 2007, at 6:08 PM, Tim Prins wrote: Some time ago you were working on moving the modex out of the pml and cleaning it up a bit. Is this work still ongoing? The reason I ask is that I am currently working on integrating the RSL, and would rather build on the new code rather than the old... Tim Prins brings up a point I keep meaning to ask the group about. A long time ago in a galaxy far, far away (aka, last fall), Galen and I started working on the BTL / PML redesign that morphed into some smaller changes, including some interesting IB work. Anyway, I rewrote large chunks of the modex, which did a couple of things: * Moved the modex out of the pml base and into the general OMPI code (renaming the functions in the process) * Fixed the hang if a btl doesn't publish contact information (we wait until we receive a key pushed into the modex at the end of MPI_INIT) * Tried to reduce the number of required memory copies in the interface It's a fairly big change, in that all the BTLs have to be updated due to the function name differences. It's fairly well tested, and would be really nice for dealing with platforms where there are different networks on different machines. If no one has any objections, I'll probably do this next week... Brian ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] Modex
I am ok with the following as long as we can have some sort of documenation describing what changed like which old functions are replaced with newer functions and any description of changed assumptions. --td Brian Barrett wrote: On Jun 26, 2007, at 6:08 PM, Tim Prins wrote: Some time ago you were working on moving the modex out of the pml and cleaning it up a bit. Is this work still ongoing? The reason I ask is that I am currently working on integrating the RSL, and would rather build on the new code rather than the old... Tim Prins brings up a point I keep meaning to ask the group about. A long time ago in a galaxy far, far away (aka, last fall), Galen and I started working on the BTL / PML redesign that morphed into some smaller changes, including some interesting IB work. Anyway, I rewrote large chunks of the modex, which did a couple of things: * Moved the modex out of the pml base and into the general OMPI code (renaming the functions in the process) * Fixed the hang if a btl doesn't publish contact information (we wait until we receive a key pushed into the modex at the end of MPI_INIT) * Tried to reduce the number of required memory copies in the interface It's a fairly big change, in that all the BTLs have to be updated due to the function name differences. It's fairly well tested, and would be really nice for dealing with platforms where there are different networks on different machines. If no one has any objections, I'll probably do this next week... Brian ___ devel mailing list de...@open-mpi.org http://www.open-mpi.org/mailman/listinfo.cgi/devel
Re: [OMPI devel] MPI_REAL2 support and Fortran ddt numbering
Rainer Keller wrote:
Hello dear all, with the current numbering in mpif-common.h, the optional ddt MPI_REAL2 will break the binary compatibility of the fortran interface from v1.2 to v1.3 (see r15133). Now, apart from MPI_REAL2 being of, let's say, rather minor importance, the group may feel that the numbering of datatypes is crucial to the end user and that the (once agreed upon) allowed binary incompatibility for major version number changes is void. (The most important datatype that this change affects is MPI_DOUBLE_PRECISION: users will need to recompile their code with v1.3...) Please raise Your hand if anybody cares.

Sun cares very much about this, for exactly the reason you state (binary compatibility). I'd prefer that this ddt be placed at the end of the list. thanks, --td
Re: [OMPI devel] [OMPI bugs] [Open MPI] #898: Move MPI exception man page fixes to v1.2
Open MPI wrote: #898: Move MPI exception man page fixes to v1.2 ---+ Reporter: jsquyres|Owner: Type: changeset move request | Status: new Priority: major |Milestone: Open MPI 1.2 Version: trunk | Resolution: Keywords: | ---+ Changes (by jsquyres): * cc: t...@sun.com, rlgra...@ornl.gov (added) Comment: Should put the RM's in the CC so that they know that this is up for RM blessing... I am ok with these changes going into 1.2. --td