Re: [OMPI devel] 1.3 release date?

2008-10-22 Thread Greg Watson

Brad,

Many thanks for the update.

Greg

On Oct 22, 2008, at 8:43 PM, Brad Benton wrote:


Greg,

Here is the latest schedule that we have for getting 1.3 out the door:
https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.3

Basically, this schedule sets Nov. 10 as the release date with a
backup date of Nov. 17.
Here is a bit more detail as to the release to beta and then to
release candidate 1, prior to the general release (lifted from the wiki):
   1.3 beta: Target: October 27, 2008
   1.3 rc1: Target: November 3, 2008
   1.3 release: Target: November 10, 2008

--Brad


On Fri, Oct 17, 2008 at 5:38 AM, Jeff Squyres wrote:

Greg -- I defer to George/Brad for plans of the specific release date.

We hope to be feature complete by early next week.  This clears the
way for a "beta" release.  Specifically, there are two things we're
waiting for:


1. Some FT stuff that Tim/Josh think can be done by this weekend
2. A critical code review for a big openib BTL change that will be  
done when Pasha and I are at the Chicago Forum meeting on Monday




On Oct 15, 2008, at 4:48 PM, Greg Watson wrote:

Hi all,

Has a release date been set for 1.3?

Thanks,

Greg
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel


--
Jeff Squyres
Cisco Systems






Re: [OMPI devel] 1.3 release date?

2008-10-22 Thread Brad Benton
Greg,
Here is the latest schedule that we have for getting 1.3 out the door:
https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.3

Basically, this schedule sets Nov. 10 as the release date with a backup date
of Nov. 17.
Here is a bit more detail as to the release to beta and then to release
candidate 1,
prior to the general release (lifted from the wiki):
   1.3 beta: Target: October 27, 2008
   1.3 rc1: Target: November 3, 2008
   1.3 release: Target: November 10, 2008
--Brad


On Fri, Oct 17, 2008 at 5:38 AM, Jeff Squyres  wrote:

> Greg -- I defer to George/Brad for plans of the specific release date.
>
> We hope to be feature complete by early next week.  This clears the way for
> a "beta" release.  Specifically, there are two things we're waiting for:
>
> 1. Some FT stuff that Tim/Josh think can be done by this weekend
> 2. A critical code review for a big openib BTL change that will be done
> when Pasha and I are at the Chicago Forum meeting on Monday
>
>
>
> On Oct 15, 2008, at 4:48 PM, Greg Watson wrote:
>
>> Hi all,
>>
>> Has a release date been set for 1.3?
>>
>> Thanks,
>>
>> Greg
>>
>
>
> --
> Jeff Squyres
> Cisco Systems
>
>
>


Re: [OMPI devel] Restarting processes on different node

2008-10-22 Thread Paul H. Hargrove

Leonardo,

 As you say, there is the possibility that moving from one node to
another has caused problems due to different shared libraries.  The
result from this could be a segmentation fault, an illegal instruction
or even a bus error.  In all three cases, however, this failure
generates a signal (SIGSEGV, SIGILL or SIGBUS).  So, it is possible that
you are seeing the failure mode that you were expecting.
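
One way to confirm which of these signals ended a restarted process is to look at how its exit status is reported. The sketch below is a hypothetical helper, not part of Open MPI or BLCR. It assumes a POSIX system, where Python's subprocess module reports death by signal as a negative return code.

```python
import signal
import subprocess


def describe_exit(returncode):
    """Translate a subprocess return code into a readable description."""
    if returncode < 0:
        # On POSIX, a negative return code -N means "terminated by signal N".
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = "signal %d" % -returncode
        return "killed by " + name
    return "exited with status %d" % returncode


def run_and_describe(argv):
    """Run a command (hypothetical restart command line) and describe its end."""
    return describe_exit(subprocess.run(argv).returncode)
```

Calling run_and_describe() on a restart command line would then separate an ordinary non-zero exit from a crash caused by, for example, mismatched shared libraries.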
 There are at least 2 ways you can deal with heterogeneous libraries.
The first is that if the libs are only different due to prelinking, you
can undo the prelinking as described in the BLCR FAQ
(http://mantis.lbl.gov/blcr/doc/html/FAQ.html#prelink).
 The second would be to include the shared libraries in the checkpoint
itself.  While this is very costly in terms of storage, you may find it
lets you restart in cases where you might not otherwise be able to.  The
trick is to add --save-private or --save-all to the checkpoint command
that Open MPI uses to checkpoint the application processes.


-Paul

Leonardo Fialho wrote:

Hi All,

I'm trying to implement my FT architecture in Open MPI. Right now I
need to restart a faulty process from a checkpoint. I saw that Josh
uses orte-restart, which calls opal-restart through an ordinary mpirun
call. That is not good for me, because in this case the restarted process
ends up in a new job. I need to restart the process checkpoint in the
same job and on another node under an existing orted. The checkpoints
are taken without the "--term" option.


My modified orted receives a "restart request" from my modified
heartbeat mechanism. I have tried to restart using the BLCR cr_restart
command. It does not work, I think because stderr/stdin/stdout were
not handled by the opal environment. So, I tried to restart the
checkpoint by forking the orted and doing an execvp of opal-restart.
It recovers the checkpoint, but after "opal_cr_init" it dies (***
Process received signal ***).


As follows is the job structure (from ompi-ps) after a fault:

Process Name | ORTE Name    | Local Rank | PID   | Node   | State   | HB Dest.     |
------------------------------------------------------------------------------------
     orterun | [[8002,0],0] |      65535 | 30434 | aoclsb | Running |              |
       orted | [[8002,0],1] |      65535 | 30435 |  nodo1 | Running | [[8002,0],3] |
       orted | [[8002,0],2] |      65535 | 30438 |  nodo2 |  Faulty | [[8002,0],3] |
       orted | [[8002,0],3] |      65535 | 30441 |  nodo3 | Running | [[8002,0],4] |
       orted | [[8002,0],4] |      65535 | 30444 |  nodo4 | Running | [[8002,0],1] |


Process Name | ORTE Name    | Local Rank | PID  | Node  | State     | Ckpt State | Ckpt Loc     | Protector    |
-----------------------------------------------------------------------------------------------------------------
 ./ping/wait | [[8002,1],0] |          0 | 9069 | nodo1 |   Running |   Finished | /tmp/radic/0 | [[8002,0],2] |
 ./ping/wait | [[8002,1],1] |          0 | 6086 | nodo2 | Restoring |   Finished | /tmp/radic/1 | [[8002,0],3] |
 ./ping/wait | [[8002,1],2] |          0 | 5864 | nodo3 |   Running |   Finished | /tmp/radic/2 | [[8002,0],4] |
 ./ping/wait | [[8002,1],3] |          0 | 7405 | nodo4 |   Running |   Finished | /tmp/radic/3 | [[8002,0],1] |



The orted running on "nodo2" dies. This was detected by the orted
[[8002,0],1] running on "nodo1" and reported to the HNP. The HNP
updates the procs structure and looks for processes that were running
on the faulty node, then sends a restart request to the orted which
holds the checkpoint of the faulty processes.


Below is the log generated:

[aoclsb:30434] [[8002,0],0] orted_recv: update state request from [[8002,0],3]
[aoclsb:30434] [[8002,0],0] orted_update_state: updating state (17) for orted process (vpid=2)
[aoclsb:30434] [[8002,0],0] orted_update_state: found process [[8002,1],1] on node nodo2, requesting recovery task for that
[aoclsb:30434] [[8002,0],0] orted_update_state: sending restore ([[8002,1],1] process) request to [[8002,0],3]
[nodo3:05841] [[8002,0],3] orted_recv: restore checkpoint request from [[8002,0],0]
[nodo3:05841] [[8002,0],3] orted_restore_checkpoint: restarting process from checkpoint file (/tmp/radic/1/ompi_blcr_context.6086)
[nodo3:05841] [[8002,0],3] orted_restore_checkpoint: executing restart (opal-restart -mca crs_base_snapshot_dir /tmp/radic/1 .)

[nodo3:05924] opal_cr: init: Verbose Level: 1024
[nodo3:05924] opal_cr: init: FT Enabled: 1
[nodo3:05924] opal_cr: init: Is a tool program: 1
[nodo3:05924] opal_cr: init: Checkpoint Signal: 10
[nodo3:05924] opal_cr: init: Debug SIGPIPE: 0 (False)
[nodo3:05924] opal_cr: init: Temp Directory: /tmp
[nodo2:05965] *** Process received signal ***

The orted which receives the restart request forks and then calls
execvp on opal-restart, and then, unfortunately, it dies. I know
that the restarted process should generate errors because the URI of
its daemon is in

Re: [OMPI devel] adding new functions to a BTL

2008-10-22 Thread Eugene Loh




Ralf Wildenhues wrote:

> Jeff Squyres wrote:
>
>> We use lt_dlopen() to open the plugins (Libtool's wrapper for a
>> portable dlopen).  It opens all plugins (DSOs) in a private scope.
>> That private scope is kept deep in the OPAL MCA base and not exposed
>> elsewhere in the code base.  So if you manually dlopen a plugin again,
>> I'll bet that the linker realizes that that DSO has already been
>> loaded into the process space and doesn't actually load it again (but
>> doesn't fail).  So the dlsyms fail because you don't have access to
>> the private scope from where Libtool originally opened the DSO.
>
> Shouldn't it work to re-dlopen the lib with RTLD_GLOBAL?

I used dlopen("...", RTLD_LAZY | RTLD_GLOBAL).  It gave me a non-NULL
handle, but subsequent dlsyms failed.
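
A non-NULL handle and a successful symbol lookup are two separate checks, which is the distinction at issue here. The following is a small Python sketch using the standard ctypes module, which wraps dlopen/dlsym on POSIX systems. The helper names are made up for illustration, nothing here touches Open MPI or libltdl, and it does not reproduce the private-scope behaviour itself. It only shows testing each lookup individually.

```python
import ctypes


def open_global(path=None):
    """dlopen() a library with RTLD_GLOBAL.

    path=None asks for a handle on the running program itself (POSIX only).
    """
    return ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)


def lookup(handle, symbol):
    """Return the symbol from the handle, or None if the lookup fails."""
    try:
        # ctypes resolves the name via dlsym and raises AttributeError
        # when the symbol cannot be found through this handle.
        return getattr(handle, symbol)
    except AttributeError:
        return None


if __name__ == "__main__":
    handle = open_global()
    for name in ("getpid", "hypothetical_btl_function"):
        found = lookup(handle, name) is not None
        print(name, "found" if found else "NOT found")
```

Getting a handle back says only that the open step succeeded. Whether a given symbol is visible through that handle has to be tested per symbol.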

> Also, recent libltdl should allow you to choose which scope you want in
> the first place, local or global, through lt_dladvise.

I'm just learning all this dl stuff right now.  Jeff's --enable-static
seems to do exactly what I want (namely, make things work in the way
that I'm familiar with!).  I did try to figure out what OMPI was doing
and it seemed to me it was using RTLD_LAZY | RTLD_GLOBAL, which is why
I chose that.

For now, --enable-static seems to do exactly what I want.  Further
workarounds probably don't make any sense.




Re: [OMPI devel] adding new functions to a BTL

2008-10-22 Thread Ralf Wildenhues
Hello Jeff, Eugene,

> Jeff Squyres wrote:
>
>> We use lt_dlopen() to open the plugins (Libtool's wrapper for a   
>> portable dlopen).  It opens all plugins (DSOs) in a private scope.
>> That private scope is kept deep in the OPAL MCA base and not exposed   
>> elsewhere in the code base.  So if you manually dlopen a plugin again,  
>> I'll bet that the linker realizes that that DSO has already been  
>> loaded into the process space and doesn't actually load it again (but  
>> doesn't fail).  So the dlsyms fail because you don't have access to  
>> the private scope from where Libtool originally opened the DSO.

Shouldn't it work to re-dlopen the lib with RTLD_GLOBAL?

Also, recent libltdl should allow you to choose which scope you want in
the first place, local or global, through lt_dladvise.

Hope that helps.

Cheers,
Ralf


Re: [OMPI devel] adding new functions to a BTL

2008-10-22 Thread Eugene Loh

Jeff Squyres wrote:

We use lt_dlopen() to open the plugins (Libtool's wrapper for a
portable dlopen).  It opens all plugins (DSOs) in a private scope.
That private scope is kept deep in the OPAL MCA base and not exposed
elsewhere in the code base.  So if you manually dlopen a plugin again,
I'll bet that the linker realizes that that DSO has already been
loaded into the process space and doesn't actually load it again (but
doesn't fail).  So the dlsyms fail because you don't have access to
the private scope from where Libtool originally opened the DSO.


Make sense?


Yes, I'm nodding my head vigorously (with a vacuous stare on my face).  
Mostly, I think you're very smart and I'm not!  I get the general 
principles, but am unfamiliar with the details.


Never mind:  --enable-static is exactly the flavor of suggestion I was 
looking for.  Thanks.  I'm back in the saddle.  Onward.


Re: [OMPI devel] Component open

2008-10-22 Thread Ralph Castain
Hmmm...interesting. I see what's going on - I'm having a build system  
issue that is causing some of the dynamic libraries to not be seen.


Red herring - thanks for clarifying!

Camille: thanks for fixing this way back when.

Ralph


On Oct 22, 2008, at 1:17 PM, George Bosilca wrote:


Ralph,

This problem was fixed long ago by some of the work Camille did. The
exact revision number is r15402 (https://svn.open-mpi.org/trac/ompi/changeset/15402).
I'm using this feature daily and so far I haven't had any problems with it.


To reuse your example here is what Camille came up with.

$ mpiexec --mca routed_base_verbose 30 -n 3 hostname
[dancer:09638] mca: base: components_open: Looking for routed components
[dancer:09638] mca: base: components_open: opening routed components
[dancer:09638] mca: base: components_open: found loaded component binomial
[dancer:09638] mca: base: components_open: component binomial has no register function
[dancer:09638] mca: base: components_open: component binomial has no open function
[dancer:09638] mca: base: components_open: found loaded component direct
[dancer:09638] mca: base: components_open: component direct has no register function
[dancer:09638] mca: base: components_open: component direct has no open function
[dancer:09638] mca: base: components_open: found loaded component linear
[dancer:09638] mca: base: components_open: component linear has no register function
[dancer:09638] mca: base: components_open: component linear has no open function

[dancer:09638] mca:base:select: Auto-selecting routed components
[...]

And if we force a special component:

$ mpiexec --mca routed linear --mca routed_base_verbose 30 -n 3 hostname
[dancer:09642] mca: base: components_open: Looking for routed components
[dancer:09642] mca: base: components_open: opening routed components
[dancer:09642] mca: base: components_open: found loaded component linear
[dancer:09642] mca: base: components_open: component linear has no register function
[dancer:09642] mca: base: components_open: component linear has no open function

[dancer:09642] mca:base:select: Auto-selecting routed components
[...]

I wonder what configuration options you're using?

 george.

On Oct 22, 2008, at 1:30 PM, Ralph Castain wrote:

I've been digging a little into optimization and found something  
that seems counterintuitive in the way OMPI is handling components.  
Specifically, if I specify a component I want used for a framework,  
OMPI still does a component load and open on every component in the  
framework - it only uses my specification during "select".


Thus, the cmd line

mpirun -mca routed linear

still results in the loading and opening of the direct and binomial  
components - even though we have directed the framework not to use  
them.


This causes us to waste memory when there is no possibility of a  
different component being selected. Is there a reason why "open"  
isn't using the mca params to guide the components it is loading?


Ralph





Re: [OMPI devel] Component open

2008-10-22 Thread George Bosilca

Ralph,

This problem was fixed long ago by some of the work Camille did. The
exact revision number is r15402 (https://svn.open-mpi.org/trac/ompi/changeset/15402).
I'm using this feature daily and so far I haven't had any problems with it.


To reuse your example here is what Camille came up with.

$ mpiexec --mca routed_base_verbose 30 -n 3 hostname
[dancer:09638] mca: base: components_open: Looking for routed components
[dancer:09638] mca: base: components_open: opening routed components
[dancer:09638] mca: base: components_open: found loaded component binomial
[dancer:09638] mca: base: components_open: component binomial has no register function
[dancer:09638] mca: base: components_open: component binomial has no open function
[dancer:09638] mca: base: components_open: found loaded component direct
[dancer:09638] mca: base: components_open: component direct has no register function
[dancer:09638] mca: base: components_open: component direct has no open function
[dancer:09638] mca: base: components_open: found loaded component linear
[dancer:09638] mca: base: components_open: component linear has no register function
[dancer:09638] mca: base: components_open: component linear has no open function

[dancer:09638] mca:base:select: Auto-selecting routed components
[...]

And if we force a special component:

$ mpiexec --mca routed linear --mca routed_base_verbose 30 -n 3 hostname
[dancer:09642] mca: base: components_open: Looking for routed components
[dancer:09642] mca: base: components_open: opening routed components
[dancer:09642] mca: base: components_open: found loaded component linear
[dancer:09642] mca: base: components_open: component linear has no register function
[dancer:09642] mca: base: components_open: component linear has no open function

[dancer:09642] mca:base:select: Auto-selecting routed components
[...]
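
The difference between the two runs above (all three routed components opened versus only "linear") can be modeled as a filter applied before the open step. This is only an illustrative Python sketch, not the code from r15402. The function name is hypothetical, and the "^" prefix for exclusion is taken from the usual Open MPI MCA parameter convention rather than from this thread.

```python
def components_to_open(available, spec=None):
    """Return the components a framework would open for a selection value.

    spec models the value of the framework's MCA parameter:
      None or ""  -> no preference, open everything that was found
      "a,b"       -> include list, open only the named components
      "^a,b"      -> exclude list, open everything except the named ones
    """
    if not spec:
        return list(available)
    if spec.startswith("^"):
        excluded = {name.strip() for name in spec[1:].split(",")}
        return [c for c in available if c not in excluded]
    requested = {name.strip() for name in spec.split(",")}
    return [c for c in available if c in requested]


if __name__ == "__main__":
    routed = ["binomial", "direct", "linear"]
    print(components_to_open(routed))            # no --mca routed given
    print(components_to_open(routed, "linear"))  # --mca routed linear
```

Filtering at open time rather than at select time is what avoids loading components that can never be chosen.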

I wonder what configuration options you're using?

  george.

On Oct 22, 2008, at 1:30 PM, Ralph Castain wrote:

I've been digging a little into optimization and found something  
that seems counterintuitive in the way OMPI is handling components.  
Specifically, if I specify a component I want used for a framework,  
OMPI still does a component load and open on every component in the  
framework - it only uses my specification during "select".


Thus, the cmd line

mpirun -mca routed linear

still results in the loading and opening of the direct and binomial  
components - even though we have directed the framework not to use  
them.


This causes us to waste memory when there is no possibility of a  
different component being selected. Is there a reason why "open"  
isn't using the mca params to guide the components it is loading?


Ralph



Re: [OMPI devel] Comm_spawn limits

2008-10-22 Thread Ralph Castain
I can't swear to this because I haven't fully grokked it yet, but I  
believe the answer is:


1. if child jobs have completed, it won't hurt. I think the various
subsystems clean up their bookkeeping when a job completes, so we could
possibly reuse the number. Might be some race conditions we would have
to resolve.


2. if child jobs haven't completed (which is the  situation this  
particular user was attempting), then we would have a problem with  
jobid confusion. Once we get the procs launched, though, I'm not sure  
how much of a problem there is - would have to investigate. Could  
cause some bookkeeping problems for job completion.


Interesting possibility, though...consider it another option for now.



On Oct 22, 2008, at 12:53 PM, George Bosilca wrote:


What happens if the counter rolls over?

 george.

On Oct 22, 2008, at 2:49 PM, Ralph Castain wrote:

There recently was activity on the mailing lists where someone was  
attempting to call comm_spawn 100,000 times. Setting aside the  
threading issues that were the focus of that exchange, the fact is  
that OMPI currently cannot handle that many comm_spawns.


The ORTE jobid is composed of two elements:

1. the top 16 bits are an "identifier" for that mpirun

2. the lower 16 bits are a running counter identifying the specific
job/launch for those procs.


Thus, we are limited to 64k comm_spawns.

Expanding this would require either revamping the entire way we
handle jobs (e.g., removing the mpirun identifier - major effort),
or expanding the orte_jobid_t from its current 32 bits to 64 bits.


Is this a problem we want to address?
Ralph





Re: [OMPI devel] Comm_spawn limits

2008-10-22 Thread George Bosilca

What happens if the counter rolls over?

  george.

On Oct 22, 2008, at 2:49 PM, Ralph Castain wrote:

There recently was activity on the mailing lists where someone was  
attempting to call comm_spawn 100,000 times. Setting aside the  
threading issues that were the focus of that exchange, the fact is  
that OMPI currently cannot handle that many comm_spawns.


The ORTE jobid is composed of two elements:

1. the top 16 bits are an "identifier" for that mpirun

2. the lower 16 bits are a running counter identifying the specific
job/launch for those procs.


Thus, we are limited to 64k comm_spawns.

Expanding this would require either revamping the entire way we
handle jobs (e.g., removing the mpirun identifier - major effort),
or expanding the orte_jobid_t from its current 32 bits to 64 bits.


Is this a problem we want to address?
Ralph



[OMPI devel] Comm_spawn limits

2008-10-22 Thread Ralph Castain
There recently was activity on the mailing lists where someone was  
attempting to call comm_spawn 100,000 times. Setting aside the  
threading issues that were the focus of that exchange, the fact is  
that OMPI currently cannot handle that many comm_spawns.


The ORTE jobid is composed of two elements:

1. the top 16 bits are an "identifier" for that mpirun

2. the lower 16 bits are a running counter identifying the specific
job/launch for those procs.


Thus, we are limited to 64k comm_spawns.
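
The layout and the resulting limit can be sketched as plain bit arithmetic. The Python below is an illustration only: the constant and function names are invented for this example and are not ORTE identifiers, and the wrap-around behaviour shown is simply what an unchecked 16-bit counter would do, not a statement about what ORTE actually does on overflow.

```python
# Illustrative model of a 32-bit jobid split into two 16-bit fields.
LOCAL_JOBID_BITS = 16                            # lower half: launch counter
LOCAL_JOBID_MASK = (1 << LOCAL_JOBID_BITS) - 1   # 0xFFFF
JOB_FAMILY_MAX = 0xFFFF                          # upper half: mpirun identifier


def make_jobid(job_family, local_jobid):
    """Pack the mpirun identifier and the launch counter into one value."""
    if not 0 <= job_family <= JOB_FAMILY_MAX:
        raise ValueError("mpirun identifier does not fit in 16 bits")
    if not 0 <= local_jobid <= LOCAL_JOBID_MASK:
        raise ValueError("launch counter does not fit in 16 bits")
    return (job_family << LOCAL_JOBID_BITS) | local_jobid


def split_jobid(jobid):
    """Return (mpirun identifier, launch counter) for a packed jobid."""
    return jobid >> LOCAL_JOBID_BITS, jobid & LOCAL_JOBID_MASK


def next_local_jobid(local_jobid):
    """Advance the counter, wrapping as an unchecked 16-bit field would."""
    return (local_jobid + 1) & LOCAL_JOBID_MASK


if __name__ == "__main__":
    jobid = make_jobid(8002, 1)
    print(hex(jobid), split_jobid(jobid))
    print(next_local_jobid(LOCAL_JOBID_MASK))  # counter is back at 0
```

With 16 bits for the counter there are 65536 distinct values, so the 65537th launch under one mpirun cannot receive a number that has not already been used.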

Expanding this would require either revamping the entire way we handle
jobs (e.g., removing the mpirun identifier - major effort), or
expanding the orte_jobid_t from its current 32 bits to 64 bits.


Is this a problem we want to address?
Ralph



Re: [OMPI devel] adding new functions to a BTL

2008-10-22 Thread Jeff Squyres

George reminds me that I forgot to explain why you couldn't dlsym.

We use lt_dlopen() to open the plugins (Libtool's wrapper for a  
portable dlopen).  It opens all plugins (DSOs) in a private scope.   
That private scope is kept deep in the OPAL MCA base and not exposed  
elsewhere in the code base.  So if you manually dlopen a plugin again,  
I'll bet that the linker realizes that that DSO has already been  
loaded into the process space and doesn't actually load it again (but  
doesn't fail).  So the dlsyms fail because you don't have access to  
the private scope from where Libtool originally opened the DSO.


Make sense?




On Oct 22, 2008, at 1:04 PM, Eugene Loh wrote:

I'm trying to prototype an idea inside OMPI and am running into a  
problem.


I want to add a new function to a BTL and to have the PML call this  
function.  I can't just put such a function call into the PML (not  
even for my prototype) since the PML is loaded before the BTL and so  
the PML will complain about a missing symbol.


So, the PML will just have to refer to the function symbolically and  
I need to figure out the BTL function address "at the appropriate  
time" (after the BTL is loaded but before I need to call my function).


I tried to dlopen the BTL (seemed successful... I got back a
non-NULL handle), but dlsym can't seem to find any of the symbols in the
BTL (not even ones that existed before I started any of my work).


I can describe other things I tried or other things I think are  
supposed to work (but that I am reluctant to try), but let's cut to  
the chase:  HELP!


Please note that I'm a newbie OMPI developer and so I'm really  
interested in doing the simplest thing possible to try my  
prototype.  I recognize that certain things will have to be done to  
add "real code" back to the code base, but at this point I'd prefer  
to defer difficult work and just test the ideas of my prototype.




--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] Direct routed module

2008-10-22 Thread George Bosilca

Youpiii!

 george.

On Oct 21, 2008, at 4:53 PM, Ralph Castain wrote:


Hello all

I am working on adding a new radix tree routed module and am
simultaneously doing a little streamlining to the overall
routed-related code for scalability. One thing that would help clean up
several areas of the code base would be to finally dump the "direct"
routed module.


As you may recall, this module has been continued for historical  
purposes. It is not scalable since it requires that every process  
open a direct connection to every other process in the job. This is  
what pre-1.3 systems do. We originally left it alive for two  
reasons: (a) we wanted to have a fallback position while we  
developed the more scalable alternatives, and (b) the C/R code  
didn't support routed RML comm.


The latter situation was resolved some months ago, and we have had  
plenty of validation of our routed comm system. Thus, if there are  
no objections by the end of the week, I will remove this module and
clean up the code.


Please let me know if this is a concern.
Ralph



Re: [OMPI devel] adding new functions to a BTL

2008-10-22 Thread Jeff Squyres

Short answer because we're all still in Chicago...

Terry tells me that you're just hacking around trying to see what  
works, etc.  So adding direct calls to the BTL in this kind of  
scenario is ok.  I'm sure you're aware that this is not good for real  
code.  :-)


To directly call a BTL function, you might just want to configure OMPI  
with --enable-static; this will suck in all the plugins into libmpi,  
and therefore all symbols are directly available at link time.


There are other, more elegant ways for this hackaround, but if you're
just playing/testing, this is probably good enough.



On Oct 22, 2008, at 1:04 PM, Eugene Loh wrote:

I'm trying to prototype an idea inside OMPI and am running into a  
problem.


I want to add a new function to a BTL and to have the PML call this  
function.  I can't just put such a function call into the PML (not  
even for my prototype) since the PML is loaded before the BTL and so  
the PML will complain about a missing symbol.


So, the PML will just have to refer to the function symbolically and  
I need to figure out the BTL function address "at the appropriate  
time" (after the BTL is loaded but before I need to call my function).


I tried to dlopen the BTL (seemed successful... I got back a
non-NULL handle), but dlsym can't seem to find any of the symbols in the
BTL (not even ones that existed before I started any of my work).


I can describe other things I tried or other things I think are  
supposed to work (but that I am reluctant to try), but let's cut to  
the chase:  HELP!


Please note that I'm a newbie OMPI developer and so I'm really  
interested in doing the simplest thing possible to try my  
prototype.  I recognize that certain things will have to be done to  
add "real code" back to the code base, but at this point I'd prefer  
to defer difficult work and just test the ideas of my prototype.




--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] OOB-TCP Retries

2008-10-22 Thread Ralph Castain
Sorry for delayed response - had some things to finish, then had to  
stare at this code for awhile.


Unfortunately, the OOB is a snarled can of hideous worms. It looks to  
me that the OOB continues to attempt to complete any pending message  
requests once it detects that retries have exceeded the limit. In  
doing so, it looks like it triggers pending events, which would  
include pending sends - thus causing it to again emit that error  
message.


I can't swear to any of this, of course - the worms are really deep  
and tangled down there.


A rewrite of the OOB is planned for next year - hopefully, the last of  
the spaghetti to be unraveled. Not sure if that will really happen,  
though, as I think everyone is afraid of that black hole of despair.  
If it does, this is one thing we can try to address.


Any volunteers??

Ralph


On Oct 17, 2008, at 11:02 AM, Leonardo Fialho wrote:


Hi All,

I'm doing some experiments and modifications in my heartbeat code,
which uses the OOB-TCP communication channel.


My modified orteds and orterun do not abort all processes when one
orted dies.


The problem is:

1) I kill an orted, so another orted detects the fault when it tries to
send a heartbeat to the faulty orted.


2) The RTE becomes stable again, but the orted which sent the
heartbeat prints the following oob-tcp message:
"[node1:21582] [[12518,0],1]-[[12518,0],2] oob-tcp: Communication
retries exceeded.  Can not communicate with peer"


And the questions are:

a) Once an oob-tcp instance gets the mca_oob_tcp_peer_shutdown, it
discards this peer, no?


b) The message is removed from the queue with the ORTE_ERR_UNREACH code,
no?


c) Why, after the retries are exceeded, does the orted continue to print
this message?

Thanks,
--

Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478






Re: [OMPI devel] Direct routed module

2008-10-22 Thread Jeff Squyres

Sounds good to me.

On Oct 21, 2008, at 3:53 PM, Ralph Castain wrote:


Hello all

I am working on adding a new radix tree routed module and am
simultaneously doing a little streamlining to the overall
routed-related code for scalability. One thing that would help clean up
several areas of the code base would be to finally dump the "direct"
routed module.


As you may recall, this module has been continued for historical  
purposes. It is not scalable since it requires that every process  
open a direct connection to every other process in the job. This is  
what pre-1.3 systems do. We originally left it alive for two  
reasons: (a) we wanted to have a fallback position while we  
developed the more scalable alternatives, and (b) the C/R code  
didn't support routed RML comm.


The latter situation was resolved some months ago, and we have had  
plenty of validation of our routed comm system. Thus, if there are  
no objections by the end of the week, I will remove this module and
clean up the code.


Please let me know if this is a concern.
Ralph




--
Jeff Squyres
Cisco Systems



[OMPI devel] adding new functions to a BTL

2008-10-22 Thread Eugene Loh

I'm trying to prototype an idea inside OMPI and am running into a problem.

I want to add a new function to a BTL and to have the PML call this 
function.  I can't just put such a function call into the PML (not even 
for my prototype) since the PML is loaded before the BTL and so the PML 
will complain about a missing symbol.


So, the PML will just have to refer to the function symbolically and I 
need to figure out the BTL function address "at the appropriate time" 
(after the BTL is loaded but before I need to call my function).


I tried to dlopen the BTL (seemed successful... I got back a non-NULL 
handle), but dlsym can't seem to find any of the symbols in the BTL (not 
even ones that existed before I started any of my work).


I can describe other things I tried or other things I think are supposed 
to work (but that I am reluctant to try), but let's cut to the chase:  HELP!


Please note that I'm a newbie OMPI developer and so I'm really 
interested in doing the simplest thing possible to try my prototype.  I 
recognize that certain things will have to be done to add "real code" 
back to the code base, but at this point I'd prefer to defer difficult 
work and just test the ideas of my prototype.


Re: [OMPI devel] -display-map

2008-10-22 Thread Greg Watson

Ralph,

I guess the issue for us is that we will have to run two commands to  
get the information we need. One to get the configuration information,  
such as version and MCA parameters, and one to get the host  
information, whereas it would seem more logical that this should all  
be available via some kind of "configuration discovery" command. I  
understand the issue with supplying the hostfile though, so maybe this  
just points at the need for us to separate configuration information  
from the host information. In any case, we'll work with what you think  
is best.


Greg

On Oct 20, 2008, at 4:49 PM, Ralph Castain wrote:

Hmmm...just to be sure we are all clear on this. The reason we  
proposed to use mpirun is that "hostfile" has no meaning outside of  
mpirun. That's why ompi_info can't do anything in this regard.


We have no idea what hostfile the user may specify until we actually  
get the mpirun cmd line. They may have specified a default-hostfile,  
but they could also specify hostfiles for the individual  
app_contexts. These may or may not include the node upon which  
mpirun is executing.


So the only way to provide you with a separate command to get a
hostfile<->nodename mapping would require you to provide us with the
default-hostfile and/or hostfile cmd line options just as if you
were issuing the mpirun cmd. We just wouldn't launch - but it would
be the exact equivalent of doing "mpirun --do-not-launch".


Am I missing something? If so, please do correct me - I would be  
happy to provide a tool if that would make it easier. Just not sure  
what that tool would do.


Thanks
Ralph


On Oct 19, 2008, at 1:59 PM, Greg Watson wrote:


Ralph,

It seems a little strange to be using mpirun for this, but barring  
providing a separate command, or using ompi_info, I think this  
would solve our problem.


Thanks,

Greg

On Oct 17, 2008, at 10:46 AM, Ralph Castain wrote:


Sorry for the delay - had to ponder this one for a while.

Jeff and I agree that adding something to ompi_info would not be a  
good idea. Ompi_info has no knowledge or understanding of  
hostfiles, and adding that capability to it would be a major  
distortion of its intended use.


However, we think we can offer an alternative that might better  
solve the problem. Remember, we now treat hostfiles in a very  
different manner than before - see the wiki page for a complete  
description, or "man orte_hosts".


So the problem is that, to provide you with what you want, we need  
to "dump" the information from whatever default-hostfile was  
provided, and, if no default-hostfile was provided, then the  
information from each hostfile that was provided with an  
app_context.


The best way we could think of to do this is to add another mpirun
cmd line option --dump-hostfiles that would output the line-by-line
name from the hostfile plus the name we resolved it to. Of
course, --xml would cause it to be in xml format.


Would that meet your needs?

Ralph


On Oct 15, 2008, at 3:12 PM, Greg Watson wrote:


Hi Ralph,

We've been discussing this back and forth a bit internally and
don't really see an easy solution. Our problem is that Eclipse is
not running on the head node, so gethostbyname will not
necessarily resolve to the same address. For example, the
hostfile might refer to the head node by an internal network
address that is not visible to the outside world. Since
gethostbyname also looks in /etc/hosts, a name may resolve locally
but not on a remote system. The only thing I can think of would be,
rather than us reading the hostfile directly as we do now, to
provide an option to ompi_info that would dump the hostfile using
the same rules that you apply when you're using the hostfile.
Would that be feasible?


Greg

On Sep 22, 2008, at 4:25 PM, Ralph Castain wrote:

Sorry for delay - was on vacation and am now trying to work my  
way back to the surface.


I'm not sure I can fix this one for two reasons:

1. In general, OMPI doesn't really care what name is used for  
the node. However, the problem is that it needs to be  
consistent. In this case, ORTE has already used the name  
returned by gethostname to create its session directory  
structure long before mpirun reads a hostfile. This is why we  
retain the value from gethostname instead of allowing it to be  
overwritten by the name in whatever allocation we are given.  
Using the name in hostfile would require that I either find some  
way to remember any prior name, or that I tear down and rebuild  
the session directory tree - neither seems attractive nor simple  
(e.g., what happens when the user provides multiple entries in  
the hostfile for the node, each with a different IP address  
based on another interface in that node? Sounds crazy, but we  
have already seen it done - which one do I use?).


2. We don't actually store the hostfile info anywhere - we just  
use it and forget it. For us to add an XML attribute containing  
any host

[OMPI devel] Component open

2008-10-22 Thread Ralph Castain
I've been digging a little into optimization and found something that  
seems counterintuitive in the way OMPI is handling components.  
Specifically, if I specify a component I want used for a framework,  
OMPI still does a component load and open on every component in the  
framework - it only uses my specification during "select".


Thus, the cmd line

mpirun -mca routed linear

still results in the loading and opening of the direct and binomial  
components - even though we have directed the framework not to use them.


This causes us to waste memory when there is no possibility of a  
different component being selected. Is there a reason why "open" isn't  
using the mca params to guide the components it is loading?
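The behaviour being asked about can be sketched as a filter applied before a component is loaded and opened, rather than only at "select". This is a minimal sketch under stated assumptions, not the MCA base code: component_requested and count_components_to_open are hypothetical helpers, and the sketch only handles a plain comma-separated include list.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not an OMPI function: returns 1 if `name` appears
 * in the comma-separated list `requested`, or if `requested` is NULL or
 * empty (no selection given, so every component is a candidate). */
static int component_requested(const char *requested, const char *name)
{
    size_t name_len = strlen(name);
    const char *p = requested;

    if (NULL == requested || '\0' == requested[0]) {
        return 1;
    }
    while ('\0' != *p) {
        const char *comma = strchr(p, ',');
        size_t len = (NULL != comma) ? (size_t)(comma - p) : strlen(p);
        if (len == name_len && 0 == strncmp(p, name, len)) {
            return 1;
        }
        if (NULL == comma) {
            break;
        }
        p = comma + 1;
    }
    return 0;
}

/* Counts how many of the available components would be loaded and opened
 * if the filter ran before "open" instead of only at "select". */
static int count_components_to_open(const char *requested,
                                    const char *const *available, size_t n)
{
    int count = 0;
    size_t i;

    for (i = 0; i < n; ++i) {
        if (component_requested(requested, available[i])) {
            ++count;
        }
    }
    return count;
}
```

For the routed framework from the example above (direct, binomial, linear), a request of "linear" would open one component instead of three under this scheme.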


Ralph



[OMPI devel] Restarting processes on different node

2008-10-22 Thread Leonardo Fialho

Hi All,

I'm trying to implement my FT architecture in Open MPI. Right now I need
to restart a faulty process from a checkpoint. I saw that Josh uses
orte-restart, which calls opal-restart through an ordinary mpirun call.
That is not good for me, because in this case the restarted process
becomes part of a new job. I need to restart the process from its
checkpoint in the same job, on another node, under an existing orted.
The checkpoints are taken without the "--term" option.


My modified orted receives a "restart request" from my modified heartbeat
mechanism. I first tried to restart using the BLCR cr_restart command.
That does not work, I think because stderr/stdin/stdout were not handled
by the opal environment. So I tried to restart from the checkpoint by
forking the orted and calling execvp on opal-restart. It recovers the
checkpoint, but after "opal_cr_init" it dies (*** Process received
signal ***).


The job structure (from ompi-ps) after a fault is as follows:

Process Name | ORTE Name    | Local Rank |   PID |   Node |   State | HB Dest.     |
------------------------------------------------------------------------------------
     orterun | [[8002,0],0] |      65535 | 30434 | aoclsb | Running |              |
       orted | [[8002,0],1] |      65535 | 30435 |  nodo1 | Running | [[8002,0],3] |
       orted | [[8002,0],2] |      65535 | 30438 |  nodo2 |  Faulty | [[8002,0],3] |
       orted | [[8002,0],3] |      65535 | 30441 |  nodo3 | Running | [[8002,0],4] |
       orted | [[8002,0],4] |      65535 | 30444 |  nodo4 | Running | [[8002,0],1] |



Process Name | ORTE Name    | Local Rank |  PID |  Node |     State | Ckpt State | Ckpt Loc     | Protector    |
---------------------------------------------------------------------------------------------------------------
 ./ping/wait | [[8002,1],0] |          0 | 9069 | nodo1 |   Running |   Finished | /tmp/radic/0 | [[8002,0],2] |
 ./ping/wait | [[8002,1],1] |          0 | 6086 | nodo2 | Restoring |   Finished | /tmp/radic/1 | [[8002,0],3] |
 ./ping/wait | [[8002,1],2] |          0 | 5864 | nodo3 |   Running |   Finished | /tmp/radic/2 | [[8002,0],4] |
 ./ping/wait | [[8002,1],3] |          0 | 7405 | nodo4 |   Running |   Finished | /tmp/radic/3 | [[8002,0],1] |



The orted running on "nodo2" dies. This was detected by the orted
[[8002,0],1] running on "nodo1", which informed the HNP. The HNP updates
the procs structure and looks for processes that were running on the
faulty node, then sends a restart request to the orted that holds the
checkpoint of the faulty processes.
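The "HB Dest." column in the first table above appears to form a ring over the daemons that skips the faulty one (daemon 1 points at daemon 3 once daemon 2 is marked Faulty). A minimal sketch of that selection rule follows, assuming this reading of the table is right; next_hb_dest is a hypothetical helper and not part of the modified orted described here.

```c
#include <stddef.h>

/* Hypothetical helper: daemons are numbered 1..n, and alive[v] is nonzero
 * if daemon v is running (index 0 is unused). Returns the next running
 * daemon after `vpid` in ring order, or 0 if no other daemon is running. */
static int next_hb_dest(int vpid, const int *alive, int n)
{
    int step;

    for (step = 1; step < n; ++step) {
        /* Walk the ring 1 -> 2 -> ... -> n -> 1, starting after vpid. */
        int candidate = ((vpid - 1 + step) % n) + 1;
        if (alive[candidate]) {
            return candidate;
        }
    }
    return 0;
}
```

With four daemons and daemon 2 marked as not running, this rule gives the same destinations as the table: 1 to 3, 2 to 3, 3 to 4 and 4 to 1.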


Below is the log generated:

[aoclsb:30434] [[8002,0],0] orted_recv: update state request from [[8002,0],3]
[aoclsb:30434] [[8002,0],0] orted_update_state: updating state (17) for orted process (vpid=2)
[aoclsb:30434] [[8002,0],0] orted_update_state: found process [[8002,1],1] on node nodo2, requesting recovery task for that
[aoclsb:30434] [[8002,0],0] orted_update_state: sending restore ([[8002,1],1] process) request to [[8002,0],3]
[nodo3:05841] [[8002,0],3] orted_recv: restore checkpoint request from [[8002,0],0]
[nodo3:05841] [[8002,0],3] orted_restore_checkpoint: restarting process from checkpoint file (/tmp/radic/1/ompi_blcr_context.6086)
[nodo3:05841] [[8002,0],3] orted_restore_checkpoint: executing restart (opal-restart -mca crs_base_snapshot_dir /tmp/radic/1 .)

[nodo3:05924] opal_cr: init: Verbose Level: 1024
[nodo3:05924] opal_cr: init: FT Enabled: 1
[nodo3:05924] opal_cr: init: Is a tool program: 1
[nodo3:05924] opal_cr: init: Checkpoint Signal: 10
[nodo3:05924] opal_cr: init: Debug SIGPIPE: 0 (False)
[nodo3:05924] opal_cr: init: Temp Directory: /tmp
[nodo2:05965] *** Process received signal ***

The orted that receives the restart request forks and then calls
execvp on opal-restart, and then, unfortunately, it dies. I know
that the restarted process should generate errors because the URI of its
daemon is incorrect, like all the other environment variables, but I
would expect a communication error, or some kind of error other than
the process being killed. My question is:


1) Why does this process die? I suspect that the checkpoint has pointers
to libraries that are not loaded, or that are loaded at a different
memory position (because this checkpoint comes from another node). In
that case the error should be a "segmentation fault" or something
similar, no?
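One way to find out which signal actually killed the restarted process is for the parent that does the fork and execvp to inspect the child's wait status. This is a minimal sketch using only POSIX calls; run_and_report is a hypothetical helper, not ORTE code, and the return-value convention is an assumption made for the example.

```c
#include <signal.h>     /* signal numbers such as SIGSEGV, for callers */
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper, not part of ORTE: fork, exec argv[0] in the child,
 * and wait for it in the parent.
 * Returns the exit code (0..255) if the child exited normally,
 * -signo if a signal killed it, or -1000 if fork/waitpid failed. */
static int run_and_report(char *const argv[])
{
    int status;
    pid_t pid = fork();

    if (pid < 0) {
        perror("fork");
        return -1000;
    }
    if (0 == pid) {
        execvp(argv[0], argv);
        perror("execvp");      /* only reached if the exec itself failed */
        _exit(127);
    }
    if (waitpid(pid, &status, 0) < 0) {
        perror("waitpid");
        return -1000;
    }
    if (WIFSIGNALED(status)) {
        fprintf(stderr, "child %ld killed by signal %d\n",
                (long)pid, WTERMSIG(status));
        return -WTERMSIG(status);
    }
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1000;
}
```

Called with the argument vector from the log above ("opal-restart", "-mca", "crs_base_snapshot_dir", "/tmp/radic/1", ".", NULL), a return value of -SIGSEGV would point to a segmentation fault, while any other negative signal number would point to a different cause.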



If somebody has some information or can give me some help with this
error, I'll be grateful.


Thanks,

--

Leonardo Fialho
Computer Architecture and Operating Systems Department - CAOS
Universidad Autonoma de Barcelona - UAB
ETSE, Edifcio Q, QC/3088
http://www.caos.uab.es
Phone: +34-93-581-2888
Fax: +34-93-581-2478