[OMPI devel] Something wrong with vt?

2008-02-11 Thread Gleb Natapov
I get the following error during "make install":

make[2]: Entering directory `/home_local/glebn/build_dbg/ompi/contrib/vt'
Making install in vt
make[3]: Entering directory `/home_local/glebn/build_dbg/ompi/contrib/vt/vt'
make[3]: *** No rule to make target `install'.  Stop.
make[3]: Leaving directory `/home_local/glebn/build_dbg/ompi/contrib/vt/vt'
make[2]: *** [install-recursive] Error 1
make[2]: Leaving directory `/home_local/glebn/build_dbg/ompi/contrib/vt'
make[1]: *** [install-recursive] Error 1
make[1]: Leaving directory `/home_local/glebn/build_dbg/ompi'
make: *** [install-recursive] Error 1

ompi/contrib/vt/vt/Makefile is empty!
--
Gleb.


[OMPI devel] status of LSF integration work?

2008-02-11 Thread Eric Jones

Greetings, MPI mavens,

Perhaps this belongs on users@, but since it's about development status 
I thought I'd start here.  I've fairly recently gotten involved in getting 
an MPI environment configured for our institute.  We have an existing 
LSF cluster because most of our work is more High-Throughput than 
High-Performance, so if I can use LSF to underlie our MPI environment, 
that'd be administratively easiest.


I tried to compile the LSF support in the public SVN repo and noticed it 
was, er, broken.  I'll include the trivial changes we made below.  But 
the behavior is still fairly unpredictable, mostly involving mpirun 
never spinning up daemons on other nodes.


I saw mention that work was being suspended on LSF support pending 
technical improvements on the LSF side (mentioning that Platform had 
provided a patch to try).


Can I assume, based on the inactivity in the repo, that Platform hasn't 
resolved the issue?


Thanks,
Eric


Here're the diffs to get LSF support to compile.  We also made a change 
so it would report the LSF failure code instead of an uninitialized 
variable when it fails:


Index: pls_lsf_module.c
===================================================================
--- pls_lsf_module.c    (revision 17234)
+++ pls_lsf_module.c    (working copy)
@@ -304,7 +304,7 @@
  */
 if (lsb_launch(nodelist_argv, argv, LSF_DJOB_NOWAIT, env) < 0) {
 ORTE_ERROR_LOG(ORTE_ERR_FAILED_TO_START);
-opal_output(0, "lsb_launch failed: %d", rc);
+opal_output(0, "lsb_launch failed: %d", lsberrno);
 rc = ORTE_ERR_FAILED_TO_START;
 goto cleanup;
 }
@@ -356,7 +356,7 @@

 /* check for failed launch - if so, force terminate */
 if (failed_launch) {
-if (ORTE_SUCCESS !=
+/*if (ORTE_SUCCESS != */
 orte_pls_base_daemon_failed(jobid, false, -1, 0, 
ORTE_JOB_STATE_FAILED_TO_START);

 }
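
For completeness, a small standalone sketch (not the actual pls_lsf_module.c
code) of the error-reporting pattern the first hunk is after: check the
return of lsb_launch() and report lsberrno instead of an unrelated local
variable.  lsb_sysmsg() is assumed here to provide the matching
human-readable message.

/* Illustrative sketch only -- not the real Open MPI launcher code.
 * Assumes an LSF installation that provides <lsf/lsbatch.h>. */
#include <stdio.h>
#include <lsf/lsbatch.h>

static int launch_daemons_sketch(char **nodelist_argv, char **argv, char **env)
{
    if (lsb_launch(nodelist_argv, argv, LSF_DJOB_NOWAIT, env) < 0) {
        /* lsberrno is set by the failed lsbatch call; lsb_sysmsg()
         * (assumed here) turns it into a readable string. */
        fprintf(stderr, "lsb_launch failed: %d (%s)\n", lsberrno, lsb_sysmsg());
        return -1;
    }
    return 0;
}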


Re: [OMPI devel] more vt woes

2008-02-11 Thread Matthias Jurenz
This problem should be fixed now.
Thanks for the hint.

On Sat, 2008-02-09 at 08:47 -0500, Jeff Squyres wrote:

> While doing some pathscale compiler testing on the trunk (r17407), I  
> ran into this compile problem (the first is a warning, the second is  
> an error):
> 
> pathCC -DHAVE_CONFIG_H -I. -I../.. -I../../extlib/otf/otflib -I../../ 
> extlib/otf/otflib -I../../vtlib/ -I../../vtlib   -openmp -DVT_OMP -g - 
> Wall -Wundef -Wno-long-long -finline-functions -pthread -MT vtfilter- 
> vt_tracefilter.o -MD -MP -MF .deps/vtfilter-vt_tracefilter.Tpo -c -o  
> vtfilter-vt_tracefilter.o `test -f 'vt_tracefilter.cc' || echo  
> './'`vt_tracefilter.cc
> mv -f .deps/vtfilter-vt_otfhandler.Tpo .deps/vtfilter-vt_otfhandler.Po
> mv -f .deps/vtfilter-vt_filthandler.Tpo .deps/vtfilter-vt_filthandler.Po
> "vt_tracefilter.cc", line 451: Warning: Referenced scalar variable  
> _ZZ4mainE5retev is SHARED by default
> "vt_tracefilter.cc", line 921: Warning: Referenced scalar variable  
> _ZZ4mainE5retev is SHARED by default
> "vt_tracefilter.cc", line 950: Warning: Referenced scalar variable  
> _ZZ4mainE5retst is SHARED by default
> "vt_tracefilter.cc", line 977: Warning: Referenced scalar variable  
> _ZZ4mainE5retsn is SHARED by default
> mv -f .deps/vtfilter-vt_filter.Tpo .deps/vtfilter-vt_filter.Po
> mv -f .deps/vtfilter-vt_tracefilter.Tpo .deps/vtfilter-vt_tracefilter.Po
> pathCC -openmp -DVT_OMP -g -Wall -Wundef -Wno-long-long -finline- 
> functions -pthread -openmp  -o vtfilter vtfilter-vt_filter.o vtfilter- 
> vt_filthandler.o vtfilter-vt_otfhandler.o vtfilter-vt_tracefilter.o - 
> L../../extlib/otf/otflib -L../../extlib/otf/otflib/.libs -lotf  -lz - 
> lnsl -lutil  -lm
> vtfilter-vt_tracefilter.o(.text+0x309b): In function `main':
> /home/jsquyres/svn/ompi2/ompi/contrib/vt/vt/tools/vtfilter/ 
> vt_tracefilter.cc:794: undefined reference to  
> `FiltHandlerArgument::FiltHandlerArgument(FiltHandlerArgument const&)'
> vtfilter-vt_tracefilter.o(.text+0x312f):/home/jsquyres/svn/ompi2/ompi/ 
> contrib/vt/vt/tools/vtfilter/vt_tracefilter.cc:802: undefined  
> reference to  
> `FiltHandlerArgument::FiltHandlerArgument(FiltHandlerArgument const&)'
> vtfilter-vt_tracefilter.o(.text+0x577b): In function `__ompdo_main2':
> /home/jsquyres/svn/ompi2/ompi/contrib/vt/vt/tools/vtfilter/ 
> vt_tracefilter.cc:802: undefined reference to  
> `FiltHandlerArgument::FiltHandlerArgument(FiltHandlerArgument const&)'
> collect2: ld returned 1 exit status
> make[6]: *** [vtfilter] Error 1
> make[6]: Leaving directory `/home/jsquyres/svn/ompi2/ompi/contrib/vt/ 
> vt/tools/vtfilter'
> 
> This is with the pathscale v3.0 compilers.
> 

--
Matthias Jurenz,
Center for Information Services and 
High Performance Computing (ZIH), TU Dresden, 
Willersbau A106, Zellescher Weg 12, 01062 Dresden
phone +49-351-463-31945, fax +49-351-463-37773




[OMPI devel] VT integration: make distclean problem

2008-02-11 Thread Josh Hursey
I've been noticing another problem with the VT integration. If you do  
a "./configure --enable-contrib-no-build=vt" a subsequent 'make  
distclean' will fail in contrib/vt. The 'make distclean' will succeed  
with VT enabled (default).


---

Making distclean in contrib/vt
make[2]: Entering directory `/san/homedirs/jjhursey/svn/ompi/ompi/ 
contrib/vt'

make[2]: *** No rule to make target `distclean'.  Stop.
make[2]: Leaving directory `/san/homedirs/jjhursey/svn/ompi/ompi/ 
contrib/vt'

make[1]: *** [distclean-recursive] Error 1
make[1]: Leaving directory `/san/homedirs/jjhursey/svn/ompi/ompi'
make: *** [distclean-recursive] Error 1
---

I haven't looked at how to fix this, but maybe it is as simple as  
adding a flag to the Makefile.am in that directory.


-- Josh



[OMPI devel] New Driver BTL

2008-02-11 Thread Cedric Desmoulin

Hello!

I don't know if this is the right way to ask for help with developing  
with Open MPI.


We are four French students and we have a project: we have to write a  
new driver (a new BTL) between Open MPI and NewMadeleine (see the web  
page, http://pm2.gforge.inria.fr/newmadeleine/doc/html/). With NewMadeleine  
we use the send/receive interface, so we just need the part of the BTL  
that is able to do that.


Do you have any documentation about structures like mca_btl_base_module  
and its friends? I can't find where the function mca_btl_tcp_send is used.  
Do you know where it is?



PLEASE HELP US!

Team



Re: [OMPI devel] VT integration: make distclean problem

2008-02-11 Thread Ralf Wildenhues
* Josh Hursey wrote on Mon, Feb 11, 2008 at 07:31:25PM CET:
> I've been noticing another problem with the VT integration. If you do  
> a "./configure --enable-contrib-no-build=vt" a subsequent 'make  
> distclean' will fail in contrib/vt. The 'make distclean' will succeed  
> with VT enabled (default).

ATM the toplevel configury does not run configure in contrib/vt/vt, if
that is disabled.  I think that's intended.  But it also means that a
distribution built from such a build tree cannot be complete, i.e.,
contain all contribs, because the Makefiles that contain the respective
dist rules do not exist.  Likewise for distclean and maintainer-clean.

I suppose for distclean, this could be worked around uglily (please
speak up if you want me to take a shot at it), but if you want all of
these to work out of the box even for --enable-contrib-no-build=vt, then
you need to run configure for vt every time.  Sorry 'bout that.

Cheers,
Ralf


[OMPI devel] Fixlet for config/ompi_contrib.m4

2008-02-11 Thread Ralf Wildenhues
Hello,

please apply this patch to make future contrib integration just a tad
bit easier.  I verified that the generated configure script is
identical, minus whitespace and comments.

Cheers,
Ralf

2008-02-11  Ralf Wildenhues  

* config/ompi_contrib.m4 (OMPI_CONTRIB): Unify listings of
contrib software packages.

Index: config/ompi_contrib.m4
===================================================================
--- config/ompi_contrib.m4  (revision 17419)
+++ config/ompi_contrib.m4  (working copy)
@@ -67,20 +67,13 @@
 # Cycle through each of the hard-coded software packages and
 # configure them if not disabled.  May someday be expanded to have
 # autogen find the packages instead of this hard-coded list
-# (https://svn.open-mpi.org/trac/ompi/ticket/1162).  I couldn't
-# figure out a simple/easy way to have the m4 foreach do the m4
-# include *and* all the rest of the stuff, so I settled for having
-# two lists: each contribted software package will need to add its
-# configure.m4 list here and then add its name to the m4 define
-# for contrib_software_list.  Cope.
-#dnlm4_include(ompi/contrib/libnbc/configure.m4)
-m4_include(ompi/contrib/vt/configure.m4)
-
-m4_define(contrib_software_list, [vt])
-#dnlm4_define(contrib_software_list, [libnbc, vt])
+# (https://svn.open-mpi.org/trac/ompi/ticket/1162).
+# m4_define([contrib_software_list], [libnbc, vt])
+m4_define([contrib_software_list], [vt])
 m4_foreach(software, [contrib_software_list],
-   [OMPI_CONTRIB_DIST_SUBDIRS="$OMPI_CONTRIB_DIST_SUBDIRS 
contrib/software"
-   _OMPI_CONTRIB_CONFIGURE(software)])
+  [m4_include([ompi/contrib/]software[/configure.m4])
+  OMPI_CONTRIB_DIST_SUBDIRS="$OMPI_CONTRIB_DIST_SUBDIRS 
contrib/software"
+  _OMPI_CONTRIB_CONFIGURE(software)])

 # Setup the top-level glue
 AC_SUBST(OMPI_CONTRIB_SUBDIRS)


[OMPI devel] Leopard problems

2008-02-11 Thread Greg Watson

Hi,

Since I upgraded to MacOS X 10.5.1, I've been having problems running  
MPI programs (using both 1.2.4 and 1.2.5). The symptoms are  
intermittent (i.e. sometimes the application runs fine), and appear as  
follows:


1. One or more of the application processes die (I've seen both one and  
two processes die).


2. It appears that the orteds associated with these application  
processes then spin continually.


Here is what I see when I run "mpirun -np 4 ./mpitest":

12467   ??  Rs 1:26.52 orted --bootproxy 1 --name 0.0.1 -- 
num_procs 5 --vpid_start 0 --nodename node0 --universe  
greg@Jarrah.local:default-universe-12462 --nsreplica "0.0.0;tcp:// 
10.0.1.200:56749;tcp://9.67.176.162:56749;tcp:// 
10.37.129.2:56749;tcp://10.211.55.2:56749" --gprreplica "0.0.0;tcp:// 
10.0.1.200:56749;tcp://9.67.176.162:56749;tcp:// 
10.37.129.2:56749;tcp://10.211.55.2:56749" --set-sid
12468   ??  Rs 1:26.63 orted --bootproxy 1 --name 0.0.2 -- 
num_procs 5 --vpid_start 0 --nodename node1 --universe  
greg@Jarrah.local:default-universe-12462 --nsreplica "0.0.0;tcp:// 
10.0.1.200:56749;tcp://9.67.176.162:56749;tcp:// 
10.37.129.2:56749;tcp://10.211.55.2:56749" --gprreplica "0.0.0;tcp:// 
10.0.1.200:56749;tcp://9.67.176.162:56749;tcp:// 
10.37.129.2:56749;tcp://10.211.55.2:56749" --set-sid
12469   ??  Ss 0:00.04 orted --bootproxy 1 --name 0.0.3 -- 
num_procs 5 --vpid_start 0 --nodename node2 --universe  
greg@Jarrah.local:default-universe-12462 --nsreplica "0.0.0;tcp:// 
10.0.1.200:56749;tcp://9.67.176.162:56749;tcp:// 
10.37.129.2:56749;tcp://10.211.55.2:56749" --gprreplica "0.0.0;tcp:// 
10.0.1.200:56749;tcp://9.67.176.162:56749;tcp:// 
10.37.129.2:56749;tcp://10.211.55.2:56749" --set-sid
12470   ??  Ss 0:00.04 orted --bootproxy 1 --name 0.0.4 -- 
num_procs 5 --vpid_start 0 --nodename node3 --universe  
greg@Jarrah.local:default-universe-12462 --nsreplica "0.0.0;tcp:// 
10.0.1.200:56749;tcp://9.67.176.162:56749;tcp:// 
10.37.129.2:56749;tcp://10.211.55.2:56749" --gprreplica "0.0.0;tcp:// 
10.0.1.200:56749;tcp://9.67.176.162:56749;tcp:// 
10.37.129.2:56749;tcp://10.211.55.2:56749" --set-sid

12471   ??  S  0:00.05 ./mpitest
12472   ??  S  0:00.05 ./mpitest

Killing the mpirun results in:

$ mpirun -np 4 ./mpitest
^Cmpirun: killing job...

^C--------------------------------------------------------------------------

WARNING: mpirun is in the process of killing a job, but has detected an
interruption (probably control-C).

It is dangerous to interrupt mpirun while it is killing a job (proper
termination may not be guaranteed).  Hit control-C again within 1
second if you really want to kill mpirun immediately.
--------------------------------------------------------------------------
^Cmpirun: forcibly killing job...
--------------------------------------------------------------------------
WARNING: mpirun has exited before it received notification that all
started processes had terminated.  You should double check and ensure
that there are no runaway processes still executing.
--------------------------------------------------------------------------

At this point, the two spinning orted's are left running, and the only  
way to kill them is with -9.


Is anyone else seeing this problem?

Greg


Re: [OMPI devel] New Driver BTL

2008-02-11 Thread George Bosilca

Cedric,

There is not much documentation about this subject. However, we have  
some templates. Look in ompi/mca/btl/template to see how a new driver  
is supposed to be written.
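
As a side note on the question of where mca_btl_tcp_send is "used": a BTL's
entry points are normally reached through function pointers stored in its
module structure, not by direct calls, so grepping for the name mostly turns
up the place where it is assigned.  A simplified sketch of that pattern
follows (type and field names are abbreviated for illustration; see
ompi/mca/btl/btl.h and the template component for the real definitions):

/* Simplified, self-contained sketch of the BTL function-pointer pattern.
 * These are NOT the real Open MPI declarations. */
#include <stddef.h>
#include <stdio.h>

struct btl_module_sketch;

typedef int (*btl_send_fn_t)(struct btl_module_sketch *btl,
                             const void *payload, size_t size);

typedef struct btl_module_sketch {
    btl_send_fn_t btl_send;    /* upper layers call through this pointer */
} btl_module_sketch_t;

/* A component such as the TCP BTL provides the implementation... */
static int tcp_send_sketch(btl_module_sketch_t *btl,
                           const void *payload, size_t size)
{
    (void)btl;
    (void)payload;
    printf("sending %zu bytes\n", size);
    return 0;
}

/* ...and "uses" it only once, by installing it in its module struct. */
static btl_module_sketch_t tcp_module_sketch = { tcp_send_sketch };

int main(void)
{
    const char msg[] = "hello";
    /* The upper layer never calls tcp_send_sketch() by name: */
    return tcp_module_sketch.btl_send(&tcp_module_sketch, msg, sizeof msg);
}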


I have a question. As far as I understand, NewMadeleine already supports  
multiple devices, so I guess the matching is done internally. In that case  
the best approach for Open MPI would be to create an MTL instead of a BTL.  
The MTL interface is much simpler, basically a one-to-one wrapper for the  
point-to-point MPI functions. However, if you take this approach, there are  
a few things that will be left out: for example, no data resilience, no  
striping, no pipelining. But if you do all of this internally in  
NewMadeleine, I guess you don't need the Open MPI PML support.
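
To make the "one-to-one wrapper" idea concrete, here is a rough sketch of
what an MTL-style send boils down to.  The nm_* name below is a hypothetical
stand-in for whatever NewMadeleine's send/receive interface provides, and
this is not the actual Open MPI MTL API (see ompi/mca/mtl/mtl.h for that):

/* Rough sketch of an MTL-style send: a thin wrapper over a library that
 * already does its own matching.  Hypothetical names, illustration only. */
#include <stddef.h>
#include <stdio.h>

/* Stand-in for a library call that performs a tagged, matched send. */
static int nm_tagged_send_stub(int dest, int tag, const void *buf, size_t len)
{
    (void)buf;
    printf("library sends %zu bytes to rank %d with tag %d\n", len, dest, tag);
    return 0;
}

/* The MTL layer adds essentially nothing here: no fragmentation, striping,
 * or pipelining -- those would otherwise come from the PML over BTLs. */
static int mtl_newmad_send_sketch(int dest_rank, int mpi_tag,
                                  const void *buf, size_t len)
{
    return nm_tagged_send_stub(dest_rank, mpi_tag, buf, len);
}

int main(void)
{
    const char msg[] = "hello";
    return mtl_newmad_send_sketch(1, 42, msg, sizeof msg);
}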


  Thanks,
george.

On Feb 11, 2008, at 1:52 PM, Cedric Desmoulin wrote:


Hello!

I don't know if this is the right way to ask for help with developing
with Open MPI.

We are four French students and we have a project: we have to write a
new driver (a new BTL) between Open MPI and NewMadeleine (see the web
page, http://pm2.gforge.inria.fr/newmadeleine/doc/html/). With NewMadeleine
we use the send/receive interface, so we just need the part of the BTL
that is able to do that.

Do you have any documentation about structures like mca_btl_base_module
and its friends? I can't find where the function mca_btl_tcp_send is used.
Do you know where it is?


PLEASE HELP US!

Team







[OMPI devel] 1.3 Release schedule and contents

2008-02-11 Thread Brad Benton
All:

The latest scrub of the 1.3 release schedule and contents is ready for
review and comment.  Please use the following links:
  1.3 milestones:
https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.3
  1.3.1 milestones:
https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.3.1

In order to try and keep the dates for 1.3 in, I've pushed a bunch of stuff
(particularly ORTE things) to 1.3.1.  Even though there will be new
functionality slated for 1.3.1, the goal is to not have any interface
changes between the phases.

Please look over the list and schedules and let me or my fellow 1.3
co-release manager George Bosilca (bosi...@eecs.utk.edu) know of any
issues, errors, suggestions, omissions, heartburn, etc.

Thanks,
--Brad

Brad Benton
IBM


Re: [OMPI devel] 1.3 Release schedule and contents

2008-02-11 Thread Brian W. Barrett
Out of curiosity, why is the one-sided RDMA component struck from 1.3?  As 
far as I'm aware, the code is in the trunk and ready for release.


Brian

On Mon, 11 Feb 2008, Brad Benton wrote:


All:

The latest scrub of the 1.3 release schedule and contents is ready for
review and comment.  Please use the following links:
 1.3 milestones:
https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.3
 1.3.1 milestones:
https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.3.1

In order to try and keep the dates for 1.3 in, I've pushed a bunch of stuff
(particularly ORTE things) to 1.3.1.  Even though there will be new
functionality slated for 1.3.1, the goal is to not have any interface
changes between the phases.

Please look over the list and schedules and let me or my fellow 1.3
co-release manager George Bosilca (bosi...@eecs.utk.edu) know of any
issues, errors, suggestions, omissions, heartburn, etc.

Thanks,
--Brad

Brad Benton
IBM



Re: [OMPI devel] 1.3 Release schedule and contents

2008-02-11 Thread Ralph Castain
Yo Brian

The line through that item means it has already been completed and is ready
to go.

There should also be a line through item 1.3.a.vi - it has also been fixed.


On 2/11/08 8:29 PM, "Brian W. Barrett"  wrote:

> Out of curiosity, why is the one-sided RDMA component struck from 1.3?  As
> far as I'm aware, the code is in the trunk and ready for release.
> 
> Brian
> 
> On Mon, 11 Feb 2008, Brad Benton wrote:
> 
>> All:
>> 
>> The latest scrub of the 1.3 release schedule and contents is ready for
>> review and comment.  Please use the following links:
>>  1.3 milestones:
>> https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.3
>>  1.3.1 milestones:
>> https://svn.open-mpi.org/trac/ompi/milestone/Open%20MPI%201.3.1
>> 
>> In order to try and keep the dates for 1.3 in, I've pushed a bunch of stuff
>> (particularly ORTE things) to 1.3.1.  Even though there will be new
>> functionality slated for 1.3.1, the goal is to not have any interface
>> changes between the phases.
>> 
>> Please look over the list and schedules and let me or my fellow 1.3
>> co-release manager George Bosilca (bosi...@eecs.utk.edu) know of any
>> issues, errors, suggestions, omissions, heartburn, etc.
>> 
>> Thanks,
>> --Brad
>> 
>> Brad Benton
>> IBM
>> 




Re: [OMPI devel] Leopard problems

2008-02-11 Thread Ralph Castain
There is a known problem with Leopard and all versions of Open MPI. We
haven't had time to chase it down yet - probably still a few weeks away.

Ralph



On 2/11/08 1:39 PM, "Greg Watson"  wrote:

> Hi,
> 
> Since I upgraded to MacOS X 10.5.1, I've been having problems running
> MPI programs (using both 1.2.4 and 1.2.5). The symptoms are
> intermittent (i.e. sometimes the application runs fine), and appear as
> follows:
> 
> 1. One or more of the application processes die (I've seen both one and
> two processes die).
> 
> 2. It appears that the orteds associated with these application
> processes then spin continually.
> 
> Here is what I see when I run "mpirun -np 4 ./mpitest":
> 
> 12467   ??  Rs 1:26.52 orted --bootproxy 1 --name 0.0.1 --
> num_procs 5 --vpid_start 0 --nodename node0 --universe
> greg@Jarrah.local:default-universe-12462 --nsreplica "0.0.0;tcp://
> 10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://
> 10.37.129.2:56749;tcp://10.211.55.2:56749" --gprreplica "0.0.0;tcp://
> 10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://
> 10.37.129.2:56749;tcp://10.211.55.2:56749" --set-sid
> 12468   ??  Rs 1:26.63 orted --bootproxy 1 --name 0.0.2 --
> num_procs 5 --vpid_start 0 --nodename node1 --universe
> greg@Jarrah.local:default-universe-12462 --nsreplica "0.0.0;tcp://
> 10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://
> 10.37.129.2:56749;tcp://10.211.55.2:56749" --gprreplica "0.0.0;tcp://
> 10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://
> 10.37.129.2:56749;tcp://10.211.55.2:56749" --set-sid
> 12469   ??  Ss 0:00.04 orted --bootproxy 1 --name 0.0.3 --
> num_procs 5 --vpid_start 0 --nodename node2 --universe
> greg@Jarrah.local:default-universe-12462 --nsreplica "0.0.0;tcp://
> 10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://
> 10.37.129.2:56749;tcp://10.211.55.2:56749" --gprreplica "0.0.0;tcp://
> 10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://
> 10.37.129.2:56749;tcp://10.211.55.2:56749" --set-sid
> 12470   ??  Ss 0:00.04 orted --bootproxy 1 --name 0.0.4 --
> num_procs 5 --vpid_start 0 --nodename node3 --universe
> greg@Jarrah.local:default-universe-12462 --nsreplica "0.0.0;tcp://
> 10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://
> 10.37.129.2:56749;tcp://10.211.55.2:56749" --gprreplica "0.0.0;tcp://
> 10.0.1.200:56749;tcp://9.67.176.162:56749;tcp://
> 10.37.129.2:56749;tcp://10.211.55.2:56749" --set-sid
> 12471   ??  S  0:00.05 ./mpitest
> 12472   ??  S  0:00.05 ./mpitest
> 
> Killing the mpirun results in:
> 
> $ mpirun -np 4 ./mpitest
> ^Cmpirun: killing job...
> 
> ^C--------------------------------------------------------------------------
> WARNING: mpirun is in the process of killing a job, but has detected an
> interruption (probably control-C).
> 
> It is dangerous to interrupt mpirun while it is killing a job (proper
> termination may not be guaranteed).  Hit control-C again within 1
> second if you really want to kill mpirun immediately.
> --------------------------------------------------------------------------
> ^Cmpirun: forcibly killing job...
> --------------------------------------------------------------------------
> WARNING: mpirun has exited before it received notification that all
> started processes had terminated.  You should double check and ensure
> that there are no runaway processes still executing.
> --------------------------------------------------------------------------
> 
> At this point, the two spinning orted's are left running, and the only
> way to kill them is with -9.
> 
> Is anyone else seeing this problem?
> 
> Greg




Re: [OMPI devel] status of LSF integration work?

2008-02-11 Thread Ralph Castain
Jeff and I chatted about this today, in fact. We know the LSF support is
borked, but neither of us has time right now to fix it. We plan to do so,
though, before the 1.3 release - just can't promise when.

Ralph



On 2/11/08 8:00 AM, "Eric Jones"  wrote:

> Greetings, MPI mavens,
> 
> Perhaps this belongs on users@, but since it's about development status
> I thought I'd start here.  I've fairly recently gotten involved in getting
> an MPI environment configured for our institute.  We have an existing
> LSF cluster because most of our work is more High-Throughput than
> High-Performance, so if I can use LSF to underlie our MPI environment,
> that'd be administratively easiest.
> 
> I tried to compile the LSF support in the public SVN repo and noticed it
> was, er, broken.  I'll include the trivial changes we made below.  But
> the behavior is still fairly unpredictable, mostly involving mpirun
> never spinning up daemons on other nodes.
> 
> I saw mention that work was being suspended on LSF support pending
> technical improvements on the LSF side (mentioning that Platform had
> provided a patch to try).
> 
> Can I assume, based on the inactivity in the repo, that Platform hasn't
> resolved the issue?
> 
> Thanks,
> Eric
> 
> 
> Here're the diffs to get LSF support to compile.  We also made a change
> so it would report the LSF failure code instead of an uninitialized
> variable when it fails:
> 
> Index: pls_lsf_module.c
> ===================================================================
> --- pls_lsf_module.c    (revision 17234)
> +++ pls_lsf_module.c    (working copy)
> @@ -304,7 +304,7 @@
>*/
>   if (lsb_launch(nodelist_argv, argv, LSF_DJOB_NOWAIT, env) < 0) {
>   ORTE_ERROR_LOG(ORTE_ERR_FAILED_TO_START);
> -opal_output(0, "lsb_launch failed: %d", rc);
> +opal_output(0, "lsb_launch failed: %d", lsberrno);
>   rc = ORTE_ERR_FAILED_TO_START;
>   goto cleanup;
>   }
> @@ -356,7 +356,7 @@
> 
>   /* check for failed launch - if so, force terminate */
>   if (failed_launch) {
> -if (ORTE_SUCCESS !=
> +/*if (ORTE_SUCCESS != */
>   orte_pls_base_daemon_failed(jobid, false, -1, 0,
> ORTE_JOB_STATE_FAILED_TO_START);
>   }




[OMPI devel] Scheduled merge of ORTE devel branch to trunk

2008-02-11 Thread Ralph Castain
Hello all

Per last week's telecon, we planned the merge of the latest ORTE devel
branch to the OMPI trunk for after Sun had committed its C++ changes. That
happened over the weekend.

Therefore, based on the requests at the telecon, I will be merging the
current ORTE devel branch to the trunk on Wed 2/13. I'll make the commit
around 4:30pm Eastern time - I will send out a warning shortly before the commit
to let you know it is coming. I'll advise of any delays.

This will be a snapshot of that devel branch - it will include the upgraded
launch system, remove the GPR, add the new tool communication library, allow
arbitrary mpiruns to interconnect, support the revamped hostfile and
dash-host behaviors per the wiki, etc.

However, it is incomplete and contains some known flaws. For example,
totalview support has not been enabled yet. Comm_spawn, which is currently
broken on the OMPI trunk, is fixed - but singleton comm_spawn remains
broken. I am in the process of establishing support for direct and
standalone launch capabilities, but those won't be in the merge. I have
updated all of the launchers, but can only certify the SLURM, TM, and RSH
ones to work - the Xgrid launcher is known to not compile, so if you have
Xgrid on your Mac, you need to tell the build system to not build that
component.

This will give you a chance to look over the new architecture, though, and I
understand that people would like to begin testing and reviewing the revised
code. Hopefully, you will find most of the bugs to be minor.

Please advise of any concerns about this merge. The schedule is totally
driven by the requests of the MPI team members (delaying the merge has no
impact on ORTE development), so requests to shift the schedule should be
discussed amongst the community.

Thanks
Ralph