Re: [OMPI devel] [devel-core] [RFC] Exit without finalize

2007-09-06 Thread Terry D. Dontje

Gleb Natapov wrote:


On Thu, Sep 06, 2007 at 06:50:43AM -0600, Ralph H Castain wrote:
 


WHAT:   Decide upon how to handle MPI applications where one or more
   processes exit without calling MPI_Finalize

WHY:Some applications can abort via an exit call instead of
   calling MPI_Abort when a library (or something else) calls
   exit. This situation is outside a user's control, so they
   cannot fix it.

WHERE:  Refer to ticket #1144 - code changes are TBD

WHEN:   Up to the group

   


[snip]
 


Does the general community feel we should do anything here, or is this a
"bug" that should be fixed by the entity calling "exit"? I should note that
it actually is bad behavior (IMHO) for any library to call "exit" - but
then, we do that in some situations too, so perhaps we shouldn't cast
stones!

Any suggested solutions or comments on whether or not we should do anything
would be appreciated.

   


IMO (a) should be implemented.

 

I don't think (b) should be implemented.  However, one could register an 
atexit handler that calls MPI_Finalize.  Therefore, the exiting process 
would be stuck until everyone else reaches their exit or MPI_Finalize.
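
A minimal sketch of that atexit idea (illustrative only, not Open MPI code;
the MPI_Finalized guard is an assumption about how one would avoid calling
MPI_Finalize twice):

    /* exit_finalize.c -- hypothetical example of finalizing on exit() */
    #include <mpi.h>
    #include <stdlib.h>

    static void finalize_on_exit(void)
    {
        int finalized = 0;
        MPI_Finalized(&finalized);
        if (!finalized) {
            MPI_Finalize();   /* may block until the peers finalize too */
        }
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        atexit(finalize_on_exit);
        /* ... code (or a third-party library) that may call exit() ... */
        MPI_Finalize();
        return 0;
    }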


That being said I think (a) probably makes more sense and adheres to the 
MPI standard.


--td



Re: [OMPI devel] SM BTL hang issue

2007-08-31 Thread Terry D. Dontje

Scott Atchley wrote:


Terry,

Are you testing on Linux? If so, which kernel?

 

No, I am running into issues on Solaris but Ollie's run of the test code 
on Linux seems to work fine.


--td

See the patch to iperf to handle kernel 2.6.21 and the issue that  
they had with usleep(0):


http://dast.nlanr.net/Projects/Iperf2.0/patch-iperf-linux-2.6.21.txt

Scott

On Aug 31, 2007, at 1:36 PM, Terry D. Dontje wrote:

 


Ok, I have an update to this issue.  I believe there is an
implementation difference of sched_yield between Linux and Solaris.  If
I change the sched_yield in opal_progress to be a usleep(500) then my
program completes quite quickly.  I have sent a few questions to a
Solaris engineer and hopefully will get some useful information.

That being said, CT-6's implementation also used yield calls (note this
actually is what sched_yield reduces down to in Solaris) and we did not
see the same degradation issue as with Open MPI.  I believe the reason
is that CT-6's SM implementation is not calling CT-6's version of
progress recursively and forcing all the unexpected messages to be read in
before continuing.  CT-6 also has a natural flow control in its
implementation (i.e., it has a fixed set fifo for eager messages).

I believe both of these characteristics lend CT-6 to not being
completely killed by the yield differences.

--td


Li-Ta Lo wrote:

On Thu, 2007-08-30 at 12:45 -0400, terry.don...@sun.com wrote:

Li-Ta Lo wrote:

On Thu, 2007-08-30 at 12:25 -0400, terry.don...@sun.com wrote:

Li-Ta Lo wrote:

On Wed, 2007-08-29 at 14:06 -0400, Terry D. Dontje wrote:

hmmm, interesting since my version doesn't abort at all.

Some problem with fortran compiler/language binding? My C translation
doesn't have any problem.

[ollie@exponential ~]$ mpirun -np 4 a.out 10
Target duration (seconds): 10.00, #of msgs: 50331, usec per msg:
198.684707

Did you oversubscribe?  I found np=10 on an 8 core system clogged things
up sufficiently.

Yea, I used np 10 on a 2 proc, 2 hyper-thread system (total 4 threads).

Is this using Linux?

Yes.

Ollie


Re: [OMPI devel] SM BTL hang issue

2007-08-31 Thread Terry D. Dontje
Ok, I have an update to this issue.  I believe there is an 
implementation difference of sched_yield between Linux and Solaris.  If 
I change the sched_yield in opal_progress to be a usleep(500) then my 
program completes quite quickly.  I have sent a few questions to a 
Solaris engineer and hopefully will get some useful information.

That being said, CT-6's implementation also used yield calls (note this 
actually is what sched_yield reduces down to in Solaris) and we did not 
see the same degradation issue as with Open MPI.  I believe the reason 
is that CT-6's SM implementation is not calling CT-6's version of 
progress recursively and forcing all the unexpected messages to be read in 
before continuing.  CT-6 also has a natural flow control in its implementation 
(i.e., it has a fixed set fifo for eager messages).


I believe both of these characteristics lend CT-6 to not being 
completely killed by the yield differences.
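
For illustration, a minimal sketch of the experiment described above (this is
not the actual opal_progress() code; the function and variable names below
are made up):

    /* progress_backoff.c -- hypothetical sketch of swapping a yield for a
     * short sleep inside a progress loop. */
    #include <sched.h>
    #include <unistd.h>

    static int use_usleep = 1;       /* toggle for the experiment */

    static void progress_backoff(void)
    {
        if (use_usleep) {
            usleep(500);             /* give peers time to drain the FIFOs */
        } else {
            sched_yield();           /* behaves differently on Solaris vs Linux */
        }
    }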


--td


Li-Ta Lo wrote:

On Thu, 2007-08-30 at 12:45 -0400, terry.don...@sun.com wrote:

Li-Ta Lo wrote:

On Thu, 2007-08-30 at 12:25 -0400, terry.don...@sun.com wrote:

Li-Ta Lo wrote:

On Wed, 2007-08-29 at 14:06 -0400, Terry D. Dontje wrote:

hmmm, interesting since my version doesn't abort at all.

Some problem with fortran compiler/language binding? My C translation 
doesn't have any problem.

[ollie@exponential ~]$ mpirun -np 4 a.out 10
Target duration (seconds): 10.00, #of msgs: 50331, usec per msg:
198.684707

Did you oversubscribe?  I found np=10 on an 8 core system clogged things 
up sufficiently.

Yea, I used np 10 on a 2 proc, 2 hyper-thread system (total 4 threads).

Is this using Linux?

Yes.

Ollie




Re: [OMPI devel] SM BTL hang issue

2007-08-29 Thread Terry D. Dontje

hmmm, interesting since my version doesn't abort at all.

--td

Li-Ta Lo wrote:


On Wed, 2007-08-29 at 11:36 -0400, Terry D. Dontje wrote:
 

To run the code I usually do "mpirun -np 6 a.out 10" on a 2 core 
system.  It'll print out the following and then hang:

Target duration (seconds): 10.00
# of messages sent in that time:  589207
Microseconds per message: 16.972

   




I know almost nothing about FORTRAN, but the stack dump told me
it got a NULL pointer reference when accessing the "me" variable
in the do .. while loop. How can this happen?

[ollie@exponential ~]$ mpirun -np 2 a.out 100
[exponential:22145] *** Process received signal ***
[exponential:22145] Signal: Segmentation fault (11)
[exponential:22145] Signal code: Address not mapped (1)
[exponential:22145] Failing at address: (nil)
[exponential:22145] [ 0] [0xb7f2a440]
[exponential:22145] [ 1] a.out(MAIN__+0x54a) [0x804909e]
[exponential:22145] [ 2] a.out(main+0x27) [0x8049127]
[exponential:22145] [ 3] /lib/libc.so.6(__libc_start_main+0xe0)
[0x4e75ef70]
[exponential:22145] [ 4] a.out [0x8048aa1]
[exponential:22145] *** End of error message ***

   call MPI_Send(keep_going,1,MPI_LOGICAL,me+1,1,
$   MPI_COMM_WORLD,ier)
804909e:   8b 45 d4mov0xffd4(%ebp),%eax
80490a1:   83 c0 01add$0x1,%eax

It is compiled with g77/g90.

Ollie





Re: [OMPI devel] SM BTL hang issue

2007-08-29 Thread Terry D. Dontje
To run the code I usually do "mpirun -np 6 a.out 10" on a 2 core 
system.  It'll print out the following and then hang:

Target duration (seconds): 10.00
# of messages sent in that time:  589207
Microseconds per message: 16.972

--td

Terry D. Dontje wrote:

Heard you the first time Gleb, just been backed up with other stuff.  
Following is the code:


 include "mpif.h"

 character(20) cmd_line_arg ! We'll use the first command-line 
argument

! to set the duration of the test.

 real(8) :: duration = 10   ! The default duration (in seconds) 
can be

! set here.

 real(8) :: endtime ! This is the time at which we'll end the
! test.

 integer(8) :: nmsgs = 1! We'll count the number of messages sent
! out from each MPI process.  There 
will be

! at least one message (at the very end),
! and we'll count all the others.

 logical :: keep_going = .true. ! This flag says whether to keep going.

 ! Initialize MPI stuff.

 call MPI_Init(ier)
 call MPI_Comm_rank(MPI_COMM_WORLD, me, ier)
 call MPI_Comm_size(MPI_COMM_WORLD, np, ier)

 if ( np == 1 ) then

   ! Test to make sure there is at least one other process.

   write(6,*) "Need at least 2 processes."
   write(6,*) "Try resubmitting the job with"
   write(6,*) "   'mpirun -np '"
   write(6,*) "where  is at least 2."

 else if ( me == 0 ) then

   ! The first command-line argument is the duration of the test 
(seconds).


   call get_command_argument(1,cmd_line_arg,len,istat)
   if ( istat == 0 ) read(cmd_line_arg,*) duration

   ! Loop until test is done.

   endtime = MPI_Wtime() + duration ! figure out when to end
   do while ( MPI_Wtime() < endtime )
 call MPI_Send(keep_going,1,MPI_LOGICAL,1,1,MPI_COMM_WORLD,ier)
 nmsgs = nmsgs + 1
   end do

   ! Then, send the closing signal.

   keep_going = .false.
   call MPI_Send(keep_going,1,MPI_LOGICAL,1,1,MPI_COMM_WORLD,ier)

   ! Write summary information.

   write(6,'("Target duration (seconds):",f18.6)') duration
   write(6,'("# of messages sent in that time:", i12)') nmsgs
   write(6,'("Microseconds per message:", f19.3)') 1.d6 * duration / 
nmsgs


 else

   ! If you're not Process 0, you need to receive messages
   ! (and possibly relay them onward).

   do while ( keep_going )

 call MPI_Recv(keep_going,1,MPI_LOGICAL,me-1,1,MPI_COMM_WORLD, &
MPI_STATUS_IGNORE,ier)

 if ( me == np - 1 ) cycle ! The last process only receives 
messages.


 call MPI_Send(keep_going,1,MPI_LOGICAL,me+1,1,MPI_COMM_WORLD,ier)

   end do

 end if

 ! Finalize.

 call MPI_Finalize(ier)

end

Sorry it is in Fortran.

--td
Gleb Natapov wrote:

On Wed, Aug 29, 2007 at 11:01:14AM -0400, Richard Graham wrote:

If you are going to look at it, I will not bother with this.

I need the code to reproduce the problem. Otherwise I have nothing to
look at.

Rich

On 8/29/07 10:47 AM, "Gleb Natapov"  wrote:

On Wed, Aug 29, 2007 at 10:46:06AM -0400, Richard Graham wrote:

Gleb,
 Are you looking at this ?

Not today. And I need the code to reproduce the bug. Is this possible?

Rich

On 8/29/07 9:56 AM, "Gleb Natapov"  wrote:

On Wed, Aug 29, 2007 at 04:48:07PM +0300, Gleb Natapov wrote:

Is this trunk or 1.2?

Oops. I should read more carefully :) This is trunk.

On Wed, Aug 29, 2007 at 09:40:30AM -0400, Terry D. Dontje wrote:

I have a program that does a simple bucket brigade of sends and receives
where rank 0 is the start and repeatedly sends to rank 1 until a certain
amount of time has passed and then it sends an all done packet.

Running this under np=2 always works.  However, when I run with greater
than 2 using only the SM btl the program usually hangs and one of the
processes has a long stack that has a lot of the following 3 calls in it:

[25] opal_progress(), line 187 in "opal_progress.c"
 [26] mca_btl_sm_component_progress(), line 397 in "btl_sm_component.c"
 [27] mca_bml_r2_progress(), line 110 in "bml_r2.c"

When stepping through the ompi_fifo_write_to_head routine it looks like
the fifo has overflowed.

I am wondering if what is happening is rank 0 has sent a bunch of
messages that have exhausted the resources such that one of the middle
ranks which is in the process of sending cannot send and therefore
never gets to the point of trying to receive the messages from rank 0?

Is the above a possible scenario or are messages periodically bled off
the SM BTL's fifos?

Note, I have seen np=3 pass sometimes and I can get it to pass 

Re: [OMPI devel] SM BTL hang issue

2007-08-29 Thread Terry D. Dontje
Heard you the first time Gleb, just been backed up with other stuff.  
Following is the code:


 include "mpif.h"

 character(20) cmd_line_arg ! We'll use the first command-line argument
! to set the duration of the test.

 real(8) :: duration = 10   ! The default duration (in seconds) can be
! set here.

 real(8) :: endtime ! This is the time at which we'll end the
! test.

 integer(8) :: nmsgs = 1! We'll count the number of messages sent
! out from each MPI process.  There will be
! at least one message (at the very end),
! and we'll count all the others.

 logical :: keep_going = .true. ! This flag says whether to keep going.

 ! Initialize MPI stuff.

 call MPI_Init(ier)
 call MPI_Comm_rank(MPI_COMM_WORLD, me, ier)
 call MPI_Comm_size(MPI_COMM_WORLD, np, ier)

 if ( np == 1 ) then

   ! Test to make sure there is at least one other process.

   write(6,*) "Need at least 2 processes."
   write(6,*) "Try resubmitting the job with"
   write(6,*) "   'mpirun -np '"
   write(6,*) "where  is at least 2."

 else if ( me == 0 ) then

   ! The first command-line argument is the duration of the test (seconds).

   call get_command_argument(1,cmd_line_arg,len,istat)
   if ( istat == 0 ) read(cmd_line_arg,*) duration

   ! Loop until test is done.

   endtime = MPI_Wtime() + duration ! figure out when to end
   do while ( MPI_Wtime() < endtime )
 call MPI_Send(keep_going,1,MPI_LOGICAL,1,1,MPI_COMM_WORLD,ier)
 nmsgs = nmsgs + 1
   end do

   ! Then, send the closing signal.

   keep_going = .false.
   call MPI_Send(keep_going,1,MPI_LOGICAL,1,1,MPI_COMM_WORLD,ier)

   ! Write summary information.

   write(6,'("Target duration (seconds):",f18.6)') duration
   write(6,'("# of messages sent in that time:", i12)') nmsgs
   write(6,'("Microseconds per message:", f19.3)') 1.d6 * duration / nmsgs

 else

   ! If you're not Process 0, you need to receive messages
   ! (and possibly relay them onward).

   do while ( keep_going )

 call MPI_Recv(keep_going,1,MPI_LOGICAL,me-1,1,MPI_COMM_WORLD, &
MPI_STATUS_IGNORE,ier)

 if ( me == np - 1 ) cycle ! The last process only receives messages.


 call MPI_Send(keep_going,1,MPI_LOGICAL,me+1,1,MPI_COMM_WORLD,ier)

   end do

 end if

 ! Finalize.

 call MPI_Finalize(ier)

end

Sorry it is in Fortran.
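
For reference, a typical way to build and run it (the compiler wrapper and
file name below are illustrative; the mpirun line matches the one quoted
elsewhere in the thread):

    mpif90 bucket_brigade.f90 -o a.out
    mpirun -np 6 a.out 10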

--td
Gleb Natapov wrote:


On Wed, Aug 29, 2007 at 11:01:14AM -0400, Richard Graham wrote:
 


If you are going to look at it, I will not bother with this.
   


I need the code to reproduce the problem. Otherwise I have nothing to
look at. 

 


Rich


On 8/29/07 10:47 AM, "Gleb Natapov"  wrote:

   


On Wed, Aug 29, 2007 at 10:46:06AM -0400, Richard Graham wrote:
 


Gleb,
 Are you looking at this ?
   


Not today. And I need the code to reproduce the bug. Is this possible?

 


Rich


On 8/29/07 9:56 AM, "Gleb Natapov"  wrote:

   


On Wed, Aug 29, 2007 at 04:48:07PM +0300, Gleb Natapov wrote:
 


Is this trunk or 1.2?
   


Oops. I should read more carefully :) This is trunk.

 


On Wed, Aug 29, 2007 at 09:40:30AM -0400, Terry D. Dontje wrote:
   


I have a program that does a simple bucket brigade of sends and receives
where rank 0 is the start and repeatedly sends to rank 1 until a certain
amount of time has passed and then it sends an all done packet.

Running this under np=2 always works.  However, when I run with greater
than 2 using only the SM btl the program usually hangs and one of the
processes has a long stack that has a lot of the following 3 calls in it:

[25] opal_progress(), line 187 in "opal_progress.c"
 [26] mca_btl_sm_component_progress(), line 397 in "btl_sm_component.c"
 [27] mca_bml_r2_progress(), line 110 in "bml_r2.c"

When stepping through the ompi_fifo_write_to_head routine it looks like
the fifo has overflowed.

I am wondering if what is happening is rank 0 has sent a bunch of
messages that have exhausted the
resources such that one of the middle ranks which is in the process of
sending cannot send and therefore
never gets to the point of trying to receive the messages from rank 0?

Is the above a possible scenario or are messages periodically bled off
the SM BTL's fifos?

Note, I have seen np=3 pass sometimes and I can get it to pass reliably
if I raise the shared memory space used by the BTL.   This is using the
trunk.


--td


 



--
Gleb.
 





Re: [OMPI devel] SM BTL hang issue

2007-08-29 Thread Terry D. Dontje

Trunk.

--td
Gleb Natapov wrote:


Is this trunk or 1.2?

On Wed, Aug 29, 2007 at 09:40:30AM -0400, Terry D. Dontje wrote:
 

I have a program that does a simple bucket brigade of sends and receives 
where rank 0 is the start and repeatedly sends to rank 1 until a certain 
amount of time has passed and then it sends an all done packet.


Running this under np=2 always works.  However, when I run with greater 
than 2 using only the SM btl the program usually hangs and one of the 
processes has a long stack that has a lot of the following 3 calls in it:


[25] opal_progress(), line 187 in "opal_progress.c"
 [26] mca_btl_sm_component_progress(), line 397 in "btl_sm_component.c"
 [27] mca_bml_r2_progress(), line 110 in "bml_r2.c"

When stepping through the ompi_fifo_write_to_head routine it looks like 
the fifo has overflowed.


I am wondering if what is happening is rank 0 has sent a bunch of 
messages that have exhausted the
resources such that one of the middle ranks which is in the process of 
sending cannot send and therefore

never gets to the point of trying to receive the messages from rank 0?

Is the above a possible scenario or are messages periodically bled off 
the SM BTL's fifos?


Note, I have seen np=3 pass sometimes and I can get it to pass reliably 
if I raise the shared memory space used by the BTL.   This is using the 
trunk.



--td


   



--
Gleb.
 





[OMPI devel] SM BTL hang issue

2007-08-29 Thread Terry D. Dontje
I have a program that does a simple bucket brigade of sends and receives 
where rank 0 is the start and repeatedly sends to rank 1 until a certain 
amount of time has passed and then it sends an all done packet.


Running this under np=2 always works.  However, when I run with greater 
than 2 using only the SM btl the program usually hangs and one of the 
processes has a long stack that has a lot of the following 3 calls in it:


[25] opal_progress(), line 187 in "opal_progress.c"
 [26] mca_btl_sm_component_progress(), line 397 in "btl_sm_component.c"
 [27] mca_bml_r2_progress(), line 110 in "bml_r2.c"

When stepping through the ompi_fifo_write_to_head routine it looks like 
the fifo has overflowed.


I am wondering if what is happening is rank 0 has sent a bunch of 
messages that have exhausted the
resources such that one of the middle ranks which is in the process of 
sending cannot send and therefore

never gets to the point of trying to receive the messages from rank 0?

Is the above a possible scenario or are messages periodically bled off 
the SM BTL's fifos?


Note, I have seen np=3 pass sometimes and I can get it to pass reliably 
if I raise the shared memory space used by the BTL.   This is using the 
trunk.
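
For reference, one way to raise the SM BTL's shared memory space (discussed
later in the "Maximum Shared Memory Segment" thread) is the mpool_sm_max_size
MCA parameter; the value below is purely illustrative:

    mpirun -mca mpool_sm_max_size 1073741824 -np 6 a.out 10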



--td




[OMPI devel] vpath and vt-integration tmp branch

2007-08-29 Thread Terry D. Dontje
I've tried to do a vpath configure on the vt-integration tmp branch and 
get the following:


configure: Entering directory './tracing/vampirtrace'
/workspace/tdd/ct7/ompi-ws-vt//ompi-vt-integration/builds/ompi-vt-integration/configure: 
line 144920: cd: ./tracing/vampirtrace: No such file or directory


Has this branch been tested with vpath?
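
For reference, the out-of-tree (VPATH) build being attempted looks roughly
like the usual autotools recipe below; the directory names are illustrative:

    mkdir -p $HOME/builds/ompi-vt-vpath
    cd $HOME/builds/ompi-vt-vpath
    /path/to/ompi-vt-integration/configure --prefix=/opt/ompi-vt
    make -j4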

--td


Re: [OMPI devel] Maximum Shared Memory Segment - OK to increase?

2007-08-28 Thread Terry D. Dontje
Maybe a clarification of the SM BTL implementation is needed.  Does the 
SM BTL not set a limit based on np, using the max allowable as a 
ceiling?  If not, and all jobs are allowed to use up to the max allowable, I 
see the reason for not wanting to raise the max allowable. 

That being said, it seems to me that the memory usage of the SM BTL is a 
lot larger than it should be.   Wasn't there some work done around June 
that looked at why the SM BTL was allocating a lot of memory; did anything 
come out of that? 


--td


Markus Daene wrote:


Rolf,

I think it is not a good idea to increase the default value to 2G.  You
have to keep in mind that there are not so many people who have  a 
machine with 128 and more cores on a single node. The average people

will have nodes with 2,4 maybe 8 cores and therefore it is not necessary
to set this parameter to such a high value. Eventually it allocates all
of this memory per node, and if you have only 4 or 8G per node it will
be inbalanced. For my 8core nodes I have even decreased the sm_max_size
to 32G and I had no problems with that. As far as I know (if not
otherwise specified during runtime) this parameter is global. So even if
you  run on your machine with 2 procs it might allocate the 2G for the
MPI smp module.
I would recommend like Richard suggests to set the parameter for your
machine in
etc/openmpi-mca-params.conf
and not to change the default value.
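
For example, a site-local override along those lines (using the 2G-1 value
from Rolf's message; adjust to taste) would look like:

    # <prefix>/etc/openmpi-mca-params.conf  (site-local settings)
    mpool_sm_max_size = 2147483647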

Markus


Rolf vandeVaart wrote:
 


We are running into a problem when running on one of our larger SMPs
using the latest Open MPI v1.2 branch.  We are trying to run a job
with np=128 within a single node.  We are seeing the following error:

"SM failed to send message due to shortage of shared memory."

We then increased the allowable maximum size of the shared segment to
2 Gbytes-1, which is the maximum allowed for a 32-bit application.  We
used the mca parameter to increase it as shown here.

-mca mpool_sm_max_size 2147483647

This allowed the program to run to completion.  Therefore, we would
like to increase the default maximum from 512 Mbytes to 2G-1 bytes.
Does anyone have an objection to this change?  Soon we are going to
have larger CPU counts and would like to increase the odds that things
work "out of the box" on these large SMPs.

On a side note, I did a quick comparison of the shared memory needs of
the old Sun ClusterTools to Open MPI and came up with this table.

                               Open MPI
 np   Sun ClusterTools 6    current   suggested
------------------------------------------------
  2          20M              128M       128M
  4          20M              128M       128M
  8          22M              256M       256M
 16          27M              512M       512M
 32          48M              512M         1G
 64         133M              512M       2G-1
128         476M              512M       2G-1

 





Re: [OMPI devel] [devel-core] [RFC] Runtime Services Layer

2007-08-24 Thread Terry D. Dontje

George Bosilca wrote:

Looks like I'm the only one barely excited about this idea. The
system that you described is well known. It has been around for around
10 years, and it's called PMI. The interface you have in the tmp
branch as well as the description you gave in your email are more
than similar to what they sketch in the following two documents:

http://www-unix.mcs.anl.gov/mpi/mpich/developer/design/pmiv2draft.htm
http://www-unix.mcs.anl.gov/mpi/mpich/developer/design/pmiv2.htm

Now, there is something wrong with reinventing the wheel if there are
no improvements. And so far I'm unable to notice any major
improvement, neither compared with PMI nor with what we have today
(except maybe being able to use PMI inside Open MPI).


 


I agree with the first sentence above.  I think this goes along
the line of Ralph's comment of "what are we trying to solve here?"
When this all started about 6 months ago I think the main concern
was finding what interfaces existed between ORTE and OMPI, though
I am not sure how that blossomed into redesigning the interface.
I'm not saying there isn't a reason to, just that we should step back
and make sure we know why we are doing it.

Again, my main concern is about fault tolerance. There is nothing in
PMI (and nothing in RSL so far) that allows any kind of fault
tolerance [and believe me, re-writing the MPICH mpirun to allow
checkpoint/restart is a hassle]. Moreover, your approach seems to
open the possibility of having heterogeneous RTEs (in terms of
features), which in my view is definitively the wrong approach.


 


I am curious about this last paragraph.  Is it your belief that the current
ORTE does lend itself to being extended to incorporate fault tolerance?

Also, by heterogeneous RTE do you mean an RTE running on a cluster
of a heterogeneous set of platforms?  If so, I would like to understand why
you think that is the "wrong" approach. 


--td


  george.

On Aug 16, 2007, at 9:47 PM, Tim Prins wrote:

 


WHAT: Solicitation of feedback on the possibility of adding a runtime
services layer to Open MPI to abstract out the runtime.

WHY: To solidify the interface between OMPI and the runtime environment,
and to allow the use of different runtime systems, including different
versions of ORTE.

WHERE: Addition of a new framework to OMPI, and changes to many of the
files in OMPI to funnel all runtime request through this framework. Few
changes should be required in OPAL and ORTE.

WHEN: Development has started in tmp/rsl, but is still in its infancy. We hope
to have a working system in the next month.

TIMEOUT: 8/29/07

--
Short version:

I am working on creating an interface between OMPI and the runtime system.
This would make a RSL framework in OMPI which all runtime services would be
accessed from. Attached is a graphic depicting this.

This change would be invasive to the OMPI layer. Few (if any) changes
will be required of the ORTE and OPAL layers.

At this point I am soliciting feedback as to whether people are
supportive or not of this change both in general and for v1.3.


Long version:

The current model used in Open MPI assumes that one runtime system is
the best for all environments. However, in many environments it may be
beneficial to have specialized runtime systems. With our current system this
is not easy to do.

With this in mind, the idea of creating a 'runtime services layer' was
hatched. This would take the form of a framework within OMPI, through which
all runtime functionality would be accessed. This would allow new or
different runtime systems to be used with Open MPI. Additionally, with such a
system it would be possible to have multiple versions of open rte coexisting,
which may facilitate development and testing. Finally, this would solidify the
interface between OMPI and the runtime system, as well as provide
documentation and side effects of each interface function.

However, such a change would be fairly invasive to the OMPI layer, and
needs a buy-in from everyone for it to be possible.

Here is a summary of the changes required for the RSL (at least how it is
currently envisioned):

1. Add a framework to ompi for the rsl, and a component to support orte.
2. Change ompi so that it uses the new interface. This involves:
a. Moving runtime specific code into the orte rsl component.
b. Changing the process names in ompi to an opaque object.
c. change all references to orte in ompi to be to the rsl.
3. Change the configuration code so that open-rte is only linked where needed.

Of course, all this would happen on a tmp branch.

The design of the rsl is not solidified. I have been playing in a tmp branch
(located at https://svn.open-mpi.org/svn/ompi/tmp/rsl) which everyone is
welcome to look at and comment on, but be advised that things here are
subject to change (I don't think it even compiles right now). There are
some fairly large open questions on this, including:

Re: [OMPI devel] Potential issue with PERUSE_COMM_MSG_MATCH_POSTED_REQ event called for unexpected matches

2007-08-23 Thread Terry D. Dontje
Nevermind my message below, things seem to be working for me now.  Not 
sure what happened.


--td
Terry D. Dontje wrote:


Rainer Keller wrote:

 


Hi Terry,
On Wednesday 22 August 2007 16:22, Terry D. Dontje wrote:


   


I thought I would run this by the group before trying to unravel the
code and figure out how to fix the problem.  It looks to me from some
experimentation that when a process matches an unexpected message,
the PERUSE framework incorrectly fires a
PERUSE_COMM_MSG_MATCH_POSTED_REQ in addition to a
PERUSE_COMM_REQ_MATCH_UNEX event.  I believe this is wrong; the
former event should not be fired in this case.
  

 

You are right, the former event PERUSE_COMM_MSG_MATCH_POSTED_Q should not be 
posted, as this was an unexpected message.




   


If the above assumption is true I think the problem arises because
PERUSE_COMM_MSG_MATCH_POSTED_REQ event is fired in function
mca_pml_ob1_recv_request_progress which is called by
mca_pml_ob1_recv_request_match_specific when a match of an unexpected
message has occurred.  I am wondering if the
PERUSE_COMM_MSG_MATCH_POSTED_REQ event should be moved to a more posted
queue centric routine something like mca_pml_ob1_recv_frag_match?
  

 

I believe, this is correct -- at least this works for a large message late 
sender and late receiver test program mpi_peruse.c.

Should be fixed with the committed patch v15947.
Actually, there are two other items, one is a missing 
PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q...




   


This works for large posted messages, but when the posted message is small
you don't see the unexpected messages at all now.

--td

 

Additionally, we have a problem that we fire PERUSE_COMM_REQ_ACTIVATE event 
for MPI_*Probe-function calls. The solution is to move 
the pml_base_sendreq.h / pml_base_recv_req.h

into
pml_ob1_irecv.c, and resp. pml_ob1_isend.c
Please see the v15945.

With best regards,
Rainer


   








Re: [OMPI devel] Potential issue with PERUSE_COMM_MSG_MATCH_POSTED_REQ event called for unexpected matches

2007-08-23 Thread Terry D. Dontje

Rainer Keller wrote:


Hi Terry,
On Wednesday 22 August 2007 16:22, Terry D. Dontje wrote:
 


I thought I would run this by the group before trying to unravel the
code and figure out how to fix the problem.  It looks to me from some
experimentation that when a process matches an unexpected message,
the PERUSE framework incorrectly fires a
PERUSE_COMM_MSG_MATCH_POSTED_REQ in addition to a
PERUSE_COMM_REQ_MATCH_UNEX event.  I believe this is wrong; the
former event should not be fired in this case.
   

You are right, the former event PERUSE_COMM_MSG_MATCH_POSTED_Q should not be 
posted, as this was an unexpected message.


 


If the above assumption is true I think the problem arises because
PERUSE_COMM_MSG_MATCH_POSTED_REQ event is fired in function
mca_pml_ob1_recv_request_progress which is called by
mca_pml_ob1_recv_request_match_specific when a match of an unexpected
message has occurred.  I am wondering if the
PERUSE_COMM_MSG_MATCH_POSTED_REQ event should be moved to a more posted
queue centric routine something like mca_pml_ob1_recv_frag_match?
   

I believe, this is correct -- at least this works for a large message late 
sender and late receiver test program mpi_peruse.c.

Should be fixed with the committed patch v15947.
Actually, there are two other items, one is a missing 
PERUSE_COMM_REQ_REMOVE_FROM_POSTED_Q...


 


This works for large posted messages, but when the posted message is small
you don't see the unexpected messages at all now.

--td

Additionally, we have a problem that we fire PERUSE_COMM_REQ_ACTIVATE event 
for MPI_*Probe-function calls. The solution is to move 
 the pml_base_sendreq.h / pml_base_recv_req.h

into
 pml_ob1_irecv.c, and resp. pml_ob1_isend.c
Please see the v15945.

With best regards,
Rainer
 





[OMPI devel] Potential issue with PERUSE_COMM_MSG_MATCH_POSTED_REQ event called for unexpected matches

2007-08-22 Thread Terry D. Dontje
I thought I would run this by the group before trying to unravel the 
code and figure out how to fix the problem.  It looks to me from some 
experimentation that when a process matches an unexpected message, 
the PERUSE framework incorrectly fires a 
PERUSE_COMM_MSG_MATCH_POSTED_REQ in addition to a 
PERUSE_COMM_REQ_MATCH_UNEX event.  I believe this is wrong; the 
former event should not be fired in this case.


If the above assumption is true I think the problem arises because 
PERUSE_COMM_MSG_MATCH_POSTED_REQ event is fired in function 
mca_pml_ob1_recv_request_progress which is called by 
mca_pml_ob1_recv_request_match_specific when a match of an unexpected 
message has occurred.  I am wondering if the 
PERUSE_COMM_MSG_MATCH_POSTED_REQ event should be moved to a more posted 
queue centric routine something like mca_pml_ob1_recv_frag_match?


Suggestions...thoughts?

--td



Re: [OMPI devel] [RFC] Runtime Services Layer

2007-08-20 Thread Terry D. Dontje

I think the concept is a good idea.  A few questions that come to mind:

1.  Do you have a set of APIs you plan on supporting? 
2.  Are you planning on adding new APIs (not currently supported by ORTE)?

3.  Do any of the ORTE replacement APIs differ in how they work?
4.  Will RSL change in how we access information from the GPR?  If not
how does this layer really separate us from ORTE?
5.  How will RSL handle OOB functionality (routing of messages)?
6.  How does making the process names opaque differ from how ORTE
names processes?  Do you still need a global namespace for a 
"universe"?


I like the idea but I really wonder if this will even be half-baked in
time for 1.3 (same concern as Jeff's).

--td

Tim Prins wrote:

WHAT: Solicitation of feedback on the possibility of adding a runtime 
services layer to Open MPI to abstract out the runtime.


WHY: To solidify the interface between OMPI and the runtime environment, 
and to allow the use of different runtime systems, including different 
versions of ORTE.


WHERE: Addition of a new framework to OMPI, and changes to many of the 
files in OMPI to funnel all runtime request through this framework. Few 
changes should be required in OPAL and ORTE.


WHEN: Development has started in tmp/rsl, but is still in its infancy. We hope 
to have a working system in the next month.


TIMEOUT: 8/29/07

--
Short version:

I am working on creating an interface between OMPI and the runtime system. 
This would make a RSL framework in OMPI which all runtime services would be 
accessed from. Attached is a graphic depicting this.


This change would be invasive to the OMPI layer. Few (if any) changes 
will be required of the ORTE and OPAL layers.


At this point I am soliciting feedback as to whether people are 
supportive or not of this change both in general and for v1.3.



Long version:

The current model used in Open MPI assumes that one runtime system is 
the best for all environments. However, in many environments it may be 
beneficial to have specialized runtime systems. With our current system this 
is not easy to do.


With this in mind, the idea of creating a 'runtime services layer' was 
hatched. This would take the form of a framework within OMPI, through which 
all runtime functionality would be accessed. This would allow new or 
different runtime systems to be used with Open MPI. Additionally, with such a

system it would be possible to have multiple versions of open rte coexisting,
which may facilitate development and testing. Finally, this would solidify the 
interface between OMPI and the runtime system, as well as provide 
documentation and side effects of each interface function.


However, such a change would be fairly invasive to the OMPI layer, and 
needs a buy-in from everyone for it to be possible.


Here is a summary of the changes required for the RSL (at least how it is 
currently envisioned):


1. Add a framework to ompi for the rsl, and a component to support orte.
2. Change ompi so that it uses the new interface. This involves:
a. Moving runtime specific code into the orte rsl component.
b. Changing the process names in ompi to an opaque object.
c. change all references to orte in ompi to be to the rsl.
3. Change the configuration code so that open-rte is only linked where needed.

Of course, all this would happen on a tmp branch.

The design of the rsl is not solidified. I have been playing in a tmp branch 
(located at https://svn.open-mpi.org/svn/ompi/tmp/rsl) which everyone is 
welcome to look at and comment on, but be advised that things here are 
subject to change (I don't think it even compiles right now). There are 
some fairly large open questions on this, including:


1. How to handle mpirun (that is, when a user types 'mpirun', do they 
always get ORTE, or do they sometimes get a system specific runtime). Most 
likely mpirun will always use ORTE, and alternative launching programs would 
be used for other runtimes.
2. Whether there will be any performance implications. My guess is not, 
but am not quite sure of this yet.


Again, I am interested in people's comments on whether they think adding 
such abstraction is good or not, and whether it is reasonable to do such a 
thing for v1.3.


Thanks,

Tim Prins








Re: [OMPI devel] openib btl header caching

2007-08-13 Thread Terry D. Dontje

Jeff Squyres wrote:

With Mellanox's new HCA (ConnectX), extremely low latencies are  
possible for short messages between two MPI processes.  Currently,  
OMPI's latency is around 1.9us while all other MPI's (HP MPI, Intel  
MPI, MVAPICH[2], etc.) are around 1.4us.  A big reason for this  
difference is that, at least with MVAPICH[2], they are doing wire  
protocol header caching where the openib BTL does not.  Specifically:


- Mellanox tested MVAPICH with the header caching; latency was around  
1.4us
- Mellanox tested MVAPICH without the header caching; latency was  
around 1.9us


Given that OMPI is the lone outlier around 1.9us, I think we have no  
choice except to implement the header caching and/or examine our  
header to see if we can shrink it.  Mellanox has volunteered to  
implement header caching in the openib btl.


Any objections?  We can discuss what approaches we want to take  
(there's going to be some complications because of the PML driver,  
etc.); perhaps in the Tuesday Mellanox teleconf...?
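
For readers unfamiliar with the idea, wire-protocol header caching in the
abstract looks something like the following sketch (purely illustrative; this
is not the openib BTL or MVAPICH wire format, and the field names are made
up):

    /* Fields that are identical for every short message on a given
     * connection (source rank, communicator id, tag, ...) are sent once and
     * cached by the receiver; subsequent messages carry only a small index
     * into that cache plus the fields that actually change, so the header on
     * the wire shrinks. */
    #include <stdint.h>

    typedef struct {          /* full header, sent the first time */
        uint32_t src_rank;
        uint32_t context_id;
        uint32_t tag;
        uint32_t seq;
    } full_hdr_t;

    typedef struct {          /* cached variant, sent afterwards */
        uint16_t cache_index; /* receiver looks up the cached fields */
        uint16_t seq;
    } cached_hdr_t;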
 


This sounds great.  Sun would like to hear how things are being done
so we can possibly port the solution to the udapl btl.

--td


Re: [OMPI devel] [RFC] New command line options to replace persistent daemon operations

2007-07-27 Thread Terry D. Dontje

Ralph Castain wrote:


WHAT:   Proposal to add two new command line options that will allow us to
   replace the current need to separately launch a persistent daemon to
   support connect/accept operations

WHY:Remove problems of confusing multiple allocations, provide a cleaner
   method for connect/accept between jobs

WHERE:  minor changes in orterun and orted, some code in rmgr and each pls
   to ensure the proper jobid and connect info is passed to each
   app_context as it is launched

 


It is my opinion that we would be better off attacking the issues of
the persistent daemons described below than creating a new set of
options to mpirun for process placement.  (More comments below on
the actual proposal.)


TIMEOUT: 8/10/07

We currently do not support connect/accept operations in a clean way. Users
are required to first start a persistent daemon that operates in a
user-named universe. They then must enter the mpirun command for each
application in a separate window, providing the universe name on each
command line. This is required because (a) mpirun will not run in the
background (in fact, at one point in time it would segfault, though I
believe it now just hangs), and (b) we require that all applications using
connect/accept operate under the same HNP.

This is burdensome and appears to be causing problems for users as it
requires them to remember to launch that persistent daemon first -
otherwise, the applications execute, but never connect. Additionally, we
have the problem of confused allocations from the different login sessions.
This has caused numerous problems of processes going to incorrect locations,
allocations timing out at different times and causing jobs to abort, etc.

What I propose here is to eliminate the confusion in a manner that minimizes
code complexity. The idea is to utilize our so-painfully-developed multiple
app_context capability to have the user launch all the interacting
applications with the same mpirun command. This not only eliminates the
annoyance factor for users by eliminating the need for multiple steps and
login sessions, but also solves the problem of ensuring that all
applications are running in the same allocation (so we don't have to worry
any more about timeouts in one allocation aborting another job).

The proposal is to add two command line options that are associated with a
specific app_context (feel free to redefine the name of the option - I don't
personally care):

1. --independent-job - indicates that this app_context is to be launched as
an independent job. We will assign it a separate jobid, though we will map
it as part of the overall command (e.g., if by slot and no other directives
provided, it will start mapping where the prior app_context left off)

 

I am unclear on what the option --connect really does.  The MPI codes actually
have to call MPI_Comm_connect to really connect to a process.  Can we get away
with just the above option?


2. --connect x,y,z  - only valid when combined with the above option,
indicates that this independent job is to be MPI-connected to app_contexts
x,y,z (where x,y,z are the number of the app_context, counting from the
beginning of the command - you choose if we start from 0 or 1).
Alternatively, we can default to connecting to everyone, and then use
--disconnect to indicate we -don't- want to be connected.

Note that this means the entire allocation for the combined app_contexts
must be provided. This helps the RTE tremendously to keep things straight,
and ensures that all the app_contexts will be able to complete (or not) in a
synchronized fashion.

It also allows us to eliminate the persistent daemon and multiple login
session requirements for connect/accept. That does not mean we cannot have a
persistent daemon to create a virtual machine, assuming we someday want to
support that mode of operation. This simply removes the requirement that the
user start one just so they can use connect/accept.
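
As a purely hypothetical illustration of the proposed options (these flags do
not exist today; names and numbering follow the proposal above), a combined
launch might look like:

    mpirun -np 4 ./server : \
           -np 8 --independent-job --connect 1 ./client

i.e., both app_contexts share one allocation, the second one gets its own
jobid, and it is MPI-connected to app_context 1.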

Comments?


 





[OMPI devel] Locking issue with OB1 PML

2007-07-20 Thread Terry D. Dontje
I think I've found a problem that is causing at least some of my runs of 
the MT tests to abort or hang.  The issue is that in the OB1 request 
structure there is a req_send_range_lock that is never initialized with 
the appropriate (pthread_)mutex_init call.  I've put in the following 
patch (given to me by Jeff) in ompi/mca/pml/ob1/pml_ob1_sendreq.c


Index: pml_ob1_sendreq.c
===
--- pml_ob1_sendreq.c   (revision 15535)
+++ pml_ob1_sendreq.c   (working copy)
@@ -136,12 +136,18 @@
req->req_rdma_cnt = 0;
req->req_throttle_sends = false;
OBJ_CONSTRUCT(&req->req_send_ranges, opal_list_t);
+OBJ_CONSTRUCT(&req->req_send_range_lock, opal_mutex_t);
}
+static void mca_pml_ob1_send_request_destruct(mca_pml_ob1_send_request_t* req)
+{
+OBJ_DESTRUCT(&req->req_send_range_lock);
+}
+
OBJ_CLASS_INSTANCE( mca_pml_ob1_send_request_t,
mca_pml_base_send_request_t,
mca_pml_ob1_send_request_construct,
-NULL );
+mca_pml_ob1_send_request_destruct);
/**
 * Completion of a short message - nothing left to schedule. Note that this


The above seems to at least allow one of my tests to consistently pass 
(I haven't tried the other tests yet).  I wanted to see if the above 
fix makes sense and whether there are possibly similar issues with the other 
PMLs.


Thanks,

--td



[OMPI devel] Call for OMPI Binary Distributions

2007-07-17 Thread Terry D. Dontje
This announcement is to request links to Binary Distributions of Open 
MPI that our community may have on the web for users to download.  We'd 
like to take those links and post them on our download page to make it 
easier for those who are interested in getting binaries to install 
rather than the source code.


This information will be posted on the download pages like 
http://www.open-mpi.org/software/ompi/v1.2/.


What we need is the following:

1. What OMPI release the distribution is based off of (v1.2...)
2. The content description/name of your distribution,
3. The link to your distribution(s)
4. The date the distribution was released.

Thanks,

--td


Re: [OMPI devel] Multi-environment builds

2007-07-11 Thread Terry D. Dontje

Jeff Squyres wrote:


On Jul 10, 2007, at 1:26 PM, Ralph H Castain wrote:

 


2. It may be useful to have some high-level parameters to specify a
specific run-time environment, since ORTE has multiple, related
frameworks (e.g., RAS and PLS).  E.g., "orte_base_launcher=tm", or
somesuch.
 



I was just writing this up in an enhancement ticket when the though  
hit me: isn't this aggregate MCA parameters?  I.e.:


mpirun --am tm ...

Specifically, we'll need to make a "tm" AMCA file (and whatever other  
ones we want), but my point is: does AMCA already give us what we want?
 

The above sounds like a possible solution as long as we are going to 
deliver a set of such files and not require each site to create their 
own.  Also, can one pull in multiple AMCA files for one run, so that you can 
specify a tm AMCA file and possibly some other AMCA file that the user may want?
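
For illustration, such a "tm" AMCA parameter file might contain something like
the following (the exact MCA key names are an assumption; the point is simply
that the file pins the launcher/allocator selection):

    # tm.conf -- aggregate MCA parameter file selected via "mpirun --am tm"
    ras = tm
    pls = tm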


--td


Re: [OMPI devel] Modex

2007-06-27 Thread Terry D. Dontje

Cool, this sounds good enough to me.

--td

Brian Barrett wrote:

The function name changes are pretty obvious (s/mca_pml_base/ompi/),
and I thought I'd try something new and actually document the
interface in the header file :).  So we should be good on that front.


Brian

On Jun 27, 2007, at 6:38 AM, Terry D. Dontje wrote:

 


I am ok with the following as long as we can have some sort of
documentation describing what changed, like which old functions
are replaced with newer functions, and any description of changed
assumptions.

--td
Brian Barrett wrote:

   


On Jun 26, 2007, at 6:08 PM, Tim Prins wrote:



 


Some time ago you were working on moving the modex out of the pml
and cleaning
it up a bit. Is this work still ongoing? The reason I ask is that  
I am

currently working on integrating the RSL, and would rather build on
the new
code rather than the old...


   


Tim Prins brings up a point I keep meaning to ask the group about.  A
long time ago in a galaxy far, far away (aka, last fall), Galen and I
started working on the BTL / PML redesign that morphed into some
smaller changes, including some interesting IB work.  Anyway, I
rewrote large chunks of the modex, which did a couple of things:

* Moved the modex out of the pml base and into the general OMPI code
(renaming
 the functions in the process)
* Fixed the hang if a btl doesn't publish contact information (we
wait until we
 receive a key pushed into the modex at the end of MPI_INIT)
* Tried to reduce the number of required memory copies in the  
interface


It's a fairly big change, in that all the BTLs have to be updated due
to the function name differences.  It's fairly well tested, and would
be really nice for dealing with platforms where there are different
networks on different machines.  If no one has any objections, I'll
probably do this next week...

Brian






Re: [OMPI devel] Modex

2007-06-27 Thread Terry D. Dontje

I am ok with the following as long as we can have some sort of
documentation describing what changed, like which old functions
are replaced with newer functions, and any description of changed
assumptions.

--td
Brian Barrett wrote:


On Jun 26, 2007, at 6:08 PM, Tim Prins wrote:

 

Some time ago you were working on moving the modex out of the pml  
and cleaning

it up a bit. Is this work still ongoing? The reason I ask is that I am
currently working on integrating the RSL, and would rather build on  
the new

code rather than the old...
   



Tim Prins brings up a point I keep meaning to ask the group about.  A  
long time ago in a galaxy far, far away (aka, last fall), Galen and I  
started working on the BTL / PML redesign that morphed into some  
smaller changes, including some interesting IB work.  Anyway, I  
rewrote large chunks of the modex, which did a couple of things:


* Moved the modex out of the pml base and into the general OMPI code  
(renaming

  the functions in the process)
* Fixed the hang if a btl doesn't publish contact information (we  
wait until we

  receive a key pushed into the modex at the end of MPI_INIT)
* Tried to reduce the number of required memory copies in the interface

It's a fairly big change, in that all the BTLs have to be updated due  
to the function name differences.  It's fairly well tested, and would  
be really nice for dealing with platforms where there are different  
networks on different machines.  If no one has any objections, I'll  
probably do this next week...
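
To make the impact on the BTLs concrete, the s/mca_pml_base/ompi/ rename
implies call-site changes roughly like the following (the exact prototypes
and the "foo" BTL are assumptions for illustration, not code from the branch;
rc and addr come from the surrounding, hypothetical BTL code):

    /* before: publishing a BTL's contact info through the PML-base modex */
    rc = mca_pml_base_modex_send(&mca_btl_foo_component.super.btl_version,
                                 addr, sizeof(addr));

    /* after: the same call through the renamed, OMPI-level interface */
    rc = ompi_modex_send(&mca_btl_foo_component.super.btl_version,
                         addr, sizeof(addr));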


Brian

 





Re: [OMPI devel] MPI_REAL2 support and Fortran ddt numbering

2007-06-19 Thread Terry D. Dontje

Rainer Keller wrote:


Hello dear all,
with the current numbering in mpif-common.h, the optional ddt MPI_REAL2 will
break the binary compatibility of the fortran interface from v1.2 to v1.3
(see r15133).

Now apart from MPI_REAL2 being of, let's say, rather minor importance, the group
may feel that the numbering of datatypes is crucial to the end user and that the
(once agreed upon) allowed binary incompatibility for major version number
changes is void.

(The most important datatype that this change affects is MPI_DOUBLE_PRECISION:
users will need to recompile their code with v1.3...)

Please raise your hand if anybody cares.



Sun cares very much about this for the exact reason you state (binary 
compatibility).

I'd prefer this ddt be placed at the end of the list.

thanks,

--td



Re: [OMPI devel] [OMPI bugs] [Open MPI] #898: Move MPI exception man page fixes to v1.2

2007-02-12 Thread Terry D. Dontje

Open MPI wrote:


#898: Move MPI exception man page fixes to v1.2
-------------------------------------+----------------------------
 Reporter:  jsquyres                 |       Owner:
     Type:  changeset move request   |      Status:  new
 Priority:  major                    |   Milestone:  Open MPI 1.2
  Version:  trunk                    |  Resolution:
 Keywords:                           |
-------------------------------------+----------------------------

Changes (by jsquyres):

* cc: t...@sun.com, rlgra...@ornl.gov (added)

Comment:

Should put the RM's in the CC so that they know that this is up for RM
blessing...

 


I am ok with these changes going into 1.2.

--td