Re: [OMPI devel] rankfile questions

2008-03-19 Thread Lenny Verkhovsky

Hi,

> -Original Message-
> From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On
> Behalf Of Ralph Castain
> Sent: Wednesday, March 19, 2008 3:19 AM
> To: Open MPI Developers
> Subject: Re: [OMPI devel] rankfile questions
> 
> Not trying to pile on here...but I do have a question.
> 
> This commit inserted a bunch of affinity-specific code in ompi_mpi_init.c.
> Was this truly necessary?
> 
> It seems to me this violates our code architecture. Affinity-specific code
> belongs in the opal_p[m]affinity functions. Why aren't we just calling a
> "opal_paffinity_set_my_processor" function (or whatever name you like) in
> mpi_init, and doing all this paffinity stuff there?

This is the only place where this code is used. These functions process
the info from the ODLS and set paffinity appropriately. Moving this code to
OPAL would require unnecessary changes to the paffinity base API.

> 
> It would make mpi_init a lot cleaner, and preserve the code standards we
> have had since the beginning.
> 
> In addition, the code that has been added returns ORTE error and success
> codes. Given the location, it should be OMPI error and success codes - if
> we move it to where I think it belongs (in OPAL), then those codes should
> obviously be OPAL codes.


Will be cleaned up,
thanks.

> 
> If I'm missing some reason why these things can't be done, please
> enlighten me. Otherwise, it would be nice if this could be cleaned up.
> 
> Thanks
> Ralph
> 
> On 3/18/08 8:39 AM, "Jeff Squyres"  wrote:
> 
> > On Mar 18, 2008, at 9:32 AM, Jeff Squyres wrote:
> >
> >> I notice that rankfile didn't compile properly on some platforms and
> >> issued warnings on other platforms.  Thanks to Ralph for cleaning it
> >> up...
> >>
> >> 1. I see a getenv("slot_list") in the MPI side of the code; it looks
> >> like $slot_list is set by the odls for the MPI process.  Why isn't it
> >> an MCA parameter?  That's what all other values passed by the orted to
> >> the MPI process appear to be.

"slot_list" consist of socket:core pair for the rank to be bind to. This
info changes according to rankfile and different for each node and rank,
therefore it cannot be passed via mca parameter.
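For illustration, a rankfile of the kind this component parses might look
like the following (the exact syntax here is reconstructed from memory, so
treat it as approximate rather than authoritative):

rank 0=hostA slot=0:0
rank 1=hostA slot=0:1
rank 2=hostB slot=1:2

Each rank thus ends up with its own socket:core value (e.g. "1:2" for rank
2), and that per-rank value is what the odls hands to that one process.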

> >>
> >> 2. I see that ompi_mpi_params.c is now registering 2 rmaps-level MCA
> >> parameters.  Why?  Shouldn't these be in ORTE somewhere?

If you mean paffinity_alone and rank_file_debug, then:
1. paffinity_alone was there before.
2. After getting some answers from Ralph about orte_debug in
ompi_mpi_init, I intend to introduce an ompi_debug MCA parameter that will
be used in this library, and rank_file_debug will be removed.

> >
> >
> > A few more notes:
> >
> > 3. Most of the files in orte/mca/rmaps/rankfile do not obey the prefix
> > rule.  I think that they should be renamed.

The rank_file component was copied from round_robin; I thought it would be
strange if it looked different.

> >
> > 4. A quick look through rankfile_lex.l seems to show that there are
> > global variables that are not protected by the prefix rule (or
> > static).  Ditto in rmaps_rf.c.  These should be fixed.

What do you mean?

> >
> > 5. rank_file_done was instantiated in both rankfile_lex.l and
> > rmaps_rf.c (causing a duplicate symbol linker error on OS X).  I
> > removed it from rmaps_rf.c (it was declared "extern" in
> > rankfile_lex.h, assumedly to indicate that it is "owned" by the lex.l
> > file...?).
thanks

> >
> > 6. svn:ignore was not set in the new rankfile directory.
Will be fixed.


I guess that, due to the heavy network traffic these days, all these comments
came now and not 2 weeks ago when I sent the code out for review :) :) :).

Best Regards,
Lenny.

> 
> 
> ___
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel



Re: [OMPI devel] rankfile questions

2008-03-19 Thread Jeff Squyres
Yes, you're right -- we should have reviewed this code 2 weeks ago  
when you asked.  Sorry about that.  :-\


Per adding lots of affinity code in ompi_mpi_init.c: perhaps that
code belongs down in the paffinity (or rmaps?) base.  It doesn't have
to become part of any specific paffinity component (because it can be
used with any paffinity component).  This makes it callable by anyone
(including orte) and keeps the abstraction barriers clean.




On Mar 19, 2008, at 5:36 AM, Lenny Verkhovsky wrote:


1. I see a getenv("slot_list") in the MPI side of the code; it

looks

like $slot_list is set by the odls for the MPI process.  Why isn't

it

an MCA parameter?  That's what all other values passed by the orted

to

the MPI process appear to be.


"slot_list" consist of socket:core pair for the rank to be bind to.  
This
info changes according to rankfile and different for each node and  
rank,

therefore it cannot be passed via mca parameter.


I don't follow the logic here.  MCA parameters can certainly be unique  
per MPI process...


Remember that MCA parameters can be environment variables.  The  
advantage of using MCA params as env variables is that we enforce a  
common prefix to ensure that we don't collide with other environment  
variables.  There are functions to get the environment variable names of
MCA parameters, for example, so that you can setenv them to pass them  
to another process (e.g., in the odls).  Then you use the normal MCA  
parameter lookup functions to retrieve them in the target/receiver  
process.
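
A rough sketch of that setenv/lookup pattern, assuming the circa-2008
mca_base_param API (the function names and signatures below are from memory,
so treat them as approximations rather than the exact calls):

#include <stdlib.h>
#include "opal/mca/base/mca_base_param.h"
#include "opal/util/opal_environ.h"

/* Parent side (e.g. in the odls): turn an MCA parameter name into its
 * environment variable form and push the per-rank value into the
 * child's environment. */
static int pass_slot_list(char ***child_env, const char *slot_list)
{
    /* yields something like "OMPI_MCA_rmaps_rank_file_slot_list" */
    char *var = mca_base_param_environ_variable("rmaps", "rank_file",
                                                "slot_list");
    int rc = opal_setenv(var, slot_list, true, child_env);
    free(var);
    return rc;
}

/* Child side (the MPI process): use the normal MCA registration and
 * lookup instead of a raw getenv("slot_list"). */
static char *lookup_slot_list(void)
{
    char *value = NULL;
    int idx = mca_base_param_reg_string_name("rmaps", "rank_file_slot_list",
                                             "socket:core to bind this rank to",
                                             false, false, NULL, NULL);
    mca_base_param_lookup_string(idx, &value);
    return value;    /* NULL if the odls did not set it */
}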



2. I see that ompi_mpi_params.c is now registering 2 rmaps-level MCA
parameters.  Why?  Shouldn't these be in ORTE somewhere?


If you mean paffinity_alone and rank_file_debug, then:
1. paffinity_alone was there before.
2. After getting some answers from Ralph about orte_debug in
ompi_mpi_init, I intend to introduce an ompi_debug MCA parameter that will
be used in this library, and rank_file_debug will be removed.


rmaps_rank_file_path and rmaps_rank_file_debug.  These have no place  
being registered in the OMPI layer.


It looks like rank_file_path is only registered in ompi_mpi_init.c as  
an error check.  Why isn't this done in the rmaps rankfile component  
itself?  This would execute in mpirun and avoid launching at all if an  
error is detected (vs. detecting the error in each MPI process and  
aborting each one).
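
As a sketch of what that early check might look like in the rankfile
component's open function (the function and symbol names here are
illustrative, not taken from the actual rankfile sources):

#include <unistd.h>
#include "opal/mca/base/mca_base_param.h"
/* plus the ORTE constants header for ORTE_SUCCESS / ORTE_ERROR */

/* Hypothetical open function for the rankfile rmaps component: register
 * the path parameter and fail in mpirun, before anything is launched,
 * if the file cannot be read. */
static int orte_rmaps_rank_file_open(void)
{
    char *path = NULL;

    mca_base_param_reg_string_name("rmaps", "rank_file_path",
                                   "Path to the rankfile",
                                   false, false, NULL, &path);
    if (NULL != path && 0 != access(path, R_OK)) {
        /* error is reported once, in mpirun, instead of in every
         * MPI process after launch */
        return ORTE_ERROR;
    }
    return ORTE_SUCCESS;
}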



A few more notes:

3. Most of the files in orte/mca/rmaps/rankfile do not obey the prefix
rule.  I think that they should be renamed.


The rank_file component was copied from round_robin; I thought it would be
strange if it looked different.


Blah -- it looks like round robin's files don't adhere to the prefix
rule.  In fairness, those files *may* be old enough to predate the prefix
rule...?


Regardless, I think the rankfile files should be named in accordance  
with the rest of the code base and adhere to the prefix rule.  round  
robin should probably be fixed as well.



4. A quick look through rankfile_lex.l seems to show that there are
global variables that are not protected by the prefix rule (or
static).  Ditto in rmaps_rf.c.  These should be fixed.


What do you mean?


From lex.l:

int rank_file_line=1;
rank_file_value_t rank_file_value;
bool rank_file_done = false;

These are neither static nor do they adhere to the prefix rule  
(obviously, if a symbol is static, it doesn't have to adhere to the  
prefix rule).  Ditto for "rank_file_path" and "rankmap" in  
rmaps_rf.c.  There may be others; that's all I looked through (e.g., I  
didn't check other files or check function symbols).
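
For example (the names below are only illustrative -- the point is the
prefix, not the exact spelling), the offending globals could either be
renamed to carry the framework/component prefix, or hidden entirely:

/* rankfile_lex.l globals renamed to obey the prefix rule; anything not
 * needed outside the file could instead simply be declared "static". */
int orte_rmaps_rank_file_line = 1;
orte_rmaps_rank_file_value_t orte_rmaps_rank_file_value;
bool orte_rmaps_rank_file_done = false;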


--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] RFC: libevent update

2008-03-19 Thread Jeff Squyres
I re-merged down to the libevent-merge branch (to include r17872) and  
a new tarball has been uploaded to http://www.open-mpi.org/~jsquyres/unofficial/




On Mar 18, 2008, at 10:11 PM, George Bosilca wrote:


Commit 17872 is the one you're looking for.

https://svn.open-mpi.org/trac/ompi/changeset/17872

george.

On Mar 18, 2008, at 9:12 PM, Jeff Squyres wrote:


When did you fix it?  I merged the trunk down to the libevent-merge
branch late this afternoon (r17869).


On Mar 18, 2008, at 7:29 PM, George Bosilca wrote:


This has been fixed in the trunk, but not yet merged in the branch.

george.

On Mar 18, 2008, at 7:17 PM, Josh Hursey wrote:


I found another problem with the libevent branch.

If I set "-mca btl tcp,self" on the command line then I get a  
segfult
when sending messages > 16 KB. I can try to make a smaller  
repeater,
but if you use the "progress" or "simple" tests in ompi-tests  
below:

https://svn.open-mpi.org/svn/ompi-tests/trunk/iu/ft/correctness

To build:
shell$ make
To run with failure:
shell$ mpirun  -np 2 -mca btl tcp,self progress  -s 16 -v 1
To run without failure:
shell$ mpirun  -np 2 -mca btl tcp,self progress  -s 15 -v 1

This program will display the message "Checkpoint at any time...". If
you send mpirun SIGUSR2 it will progress to the next stage of the
test. The failure occurs on the first message, before this becomes
an issue, though.

I was using Odin, and if I do not specify the btls then the test will
pass as normal.

The backtrace is below:
--
...
Core was generated by `progress -s 16 -v 1'.
Program terminated with signal 11, Segmentation fault.
#0  0x002a9793318b in mca_bml_base_free
(bml_btl=0x736275705f61636d, des=0x559700) at ../../../../ompi/mca/
bml/bml.h:267
267 bml_btl->btl_free( bml_btl->btl, des );
(gdb) bt
#0  0x002a9793318b in mca_bml_base_free
(bml_btl=0x736275705f61636d, des=0x559700) at ../../../../ompi/mca/
bml/bml.h:267
#1  0x002a9793304d in mca_pml_ob1_put_completion (btl=0x5598c0,
ep=0x0, des=0x559700, status=0) at pml_ob1_recvreq.c:190
#2  0x002a97930069 in mca_pml_ob1_recv_frag_callback
(btl=0x5598c0, tag=64 '@', des=0x2a989d2b00, cbdata=0x0) at
pml_ob1_recvfrag.c:149
#3  0x002a97d5f3e0 in mca_btl_tcp_endpoint_recv_handler (sd=10,
flags=2, user=0x5a5df0) at btl_tcp_endpoint.c:696
#4  0x002a95a0ab93 in event_process_active (base=0x508c80) at
event.c:591
#5  0x002a95a0af59 in opal_event_base_loop (base=0x508c80,
flags=2) at event.c:763
#6  0x002a95a0ad2b in opal_event_loop (flags=2) at event.c:670
#7  0x002a959fadf8 in opal_progress () at runtime/
opal_progress.c:
169
#8  0x002a9792caae in opal_condition_wait (c=0x2a9587d940,
m=0x2a9587d9c0) at ../../../../opal/threads/condition.h:93
#9  0x002a9792c9dd in ompi_request_wait_completion  
(req=0x5a5380)

at ../../../../ompi/request/request.h:381
#10 0x002a9792c920 in mca_pml_ob1_recv (addr=0x5baf70,
count=16384, datatype=0x503770, src=1, tag=1001, comm=0x5039a0,
status=0x0)
 at pml_ob1_irecv.c:104
#11 0x002a956f1f00 in PMPI_Recv (buf=0x5baf70, count=16384,
type=0x503770, source=1, tag=1001, comm=0x5039a0, status=0x0) at
precv.c:75
#12 0x0040211f in exchange_stage1 (ckpt_num=1) at
progress.c:414
#13 0x00401295 in main (argc=5, argv=0x7fbfffe668) at
progress.c:131
(gdb) p bml_btl
$1 = (mca_bml_base_btl_t *) 0x736275705f61636d
(gdb) p *bml_btl
Cannot access memory at address 0x736275705f61636d
--

-- Josh

On Mar 17, 2008, at 2:50 PM, Jeff Squyres wrote:


WHAT: Bring new version of libevent to the trunk.

WHY: Newer version, slightly better performance (lower overheads /
lighter weight), properly integrate the use of epoll and other
scalable fd monitoring mechanisms.

WHERE: 98% of the changes are in opal/event; there's a few changes to
configury and one change to the orted.

TIMEOUT: COB, Friday, 21 March 2008

DESCRIPTION:

George/UTK has done the bulk of the work to integrate a new version of
libevent on the following tmp branch:

 https://svn.open-mpi.org/svn/ompi/tmp-public/libevent-merge

** WE WOULD VERY MUCH APPRECIATE IF PEOPLE COULD MTT TEST THIS BRANCH! **

Cisco ran MTT on this branch on Friday and everything checked out
(i.e., no more failures than on the trunk).  We just made a few more
minor changes today and I'm running MTT again now, but I'm not
expecting any new failures (MTT will take several hours).  We would
like to bring the new libevent in over this upcoming weekend, but
would very much appreciate if others could test on their platforms
(Cisco tests mainly 64 bit RHEL4U4).  This new libevent *should* be a
fairly side-effect free change, but it is possible that since we're
now using epoll and other scalable fd monitoring tools, we'll run into
some unanticipated issues on some platforms.

Here's a consolidated diff if you want to see the changes:

https://svn.open-mpi.org/trac/ompi/changeset?old_path=tmp-public%

[OMPI devel] Libtool for 1.3 / trunk builds

2008-03-19 Thread Brian W. Barrett

Hi all -

Now that Libtool 2.2 has gone stable (2.0 was skipped entirely), it 
probably makes sense to update the version of Libtool used to build the 
nightly tarball and releases for the trunk (and eventually v1.3) from the 
nightly snapshot we have been using to the stable LT 2.2 release.


I've done some testing (ie, I installed LT 2.2 for another project, and 
nothing in OMPI broke over the last couple of weeks), so I have some 
confidence this should be a smooth transition.  If the group decides this 
is a good idea, someone at IU would just have to install the new LT 
version and change some symlinks and it should all just work...


Brian


Re: [OMPI devel] Libtool for 1.3 / trunk builds

2008-03-19 Thread Jeff Squyres
Should we wait for the next LT point release?  I see a fair amount of  
activity on the bugs-libtool list; I think they're planning a new  
release within the next few weeks.


(I think we will want to go to the LT point release when it comes out;  
I don't really have strong feelings about going to 2.2 now or not)




On Mar 19, 2008, at 12:26 PM, Brian W. Barrett wrote:


Hi all -

Now that Libtool 2.2 has gone stable (2.0 was skipped entirely), it
probably makes sense to update the version of Libtool used to build the
nightly tarball and releases for the trunk (and eventually v1.3) from the
nightly snapshot we have been using to the stable LT 2.2 release.

I've done some testing (ie, I installed LT 2.2 for another project, and
nothing in OMPI broke over the last couple of weeks), so I have some
confidence this should be a smooth transition.  If the group decides this
is a good idea, someone at IU would just have to install the new LT
version and change some symlinks and it should all just work...

Brian
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] Libtool for 1.3 / trunk builds

2008-03-19 Thread Brian W. Barrett
True - I have no objection to waiting for 2.2.1 or 1.3 to be branched, 
whichever comes first.  The main point is that under no circumstance 
should 1.3 be shipped with the same 2.1a pre-release as 1.2 uses -- it's 
time to migrate to something stable.


Brian

On Wed, 19 Mar 2008, Jeff Squyres wrote:


Should we wait for the next LT point release?  I see a fair amount of
activity on the bugs-libtool list; I think they're planning a new
release within the next few weeks.

(I think we will want to go to the LT point release when it comes out;
I don't really have strong feelings about going to 2.2 now or not)



On Mar 19, 2008, at 12:26 PM, Brian W. Barrett wrote:


Hi all -

Now that Libtool 2.2 has gone stable (2.0 was skipped entirely), it
probably makes sense to update the version of Libtool used to build the
nightly tarball and releases for the trunk (and eventually v1.3) from the
nightly snapshot we have been using to the stable LT 2.2 release.

I've done some testing (ie, I installed LT 2.2 for another project, and
nothing in OMPI broke over the last couple of weeks), so I have some
confidence this should be a smooth transition.  If the group decides this
is a good idea, someone at IU would just have to install the new LT
version and change some symlinks and it should all just work...

Brian
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel






Re: [OMPI devel] Libtool for 1.3 / trunk builds

2008-03-19 Thread Jeff Squyres

On Mar 19, 2008, at 4:05 PM, Brian W. Barrett wrote:


True - I have no objection to waiting for 2.2.1 or 1.3 to be branched,
whichever comes first.  The main point is that under no circumstance
should 1.3 be shipped with the same 2.1a pre-release as 1.2 uses -- it's
time to migrate to something stable.


Cool; I think we're agreed.  Just for simplicity, let's do whatever
comes first: LT hits 2.2.1 (or 2.2.2?  I don't know their numbering
scheme) or we branch for v1.3.




Brian

On Wed, 19 Mar 2008, Jeff Squyres wrote:


Should we wait for the next LT point release?  I see a fair amount of
activity on the bugs-libtool list; I think they're planning a new
release within the next few weeks.

(I think we will want to go to the LT point release when it comes out;
I don't really have strong feelings about going to 2.2 now or not)



On Mar 19, 2008, at 12:26 PM, Brian W. Barrett wrote:


Hi all -

Now that Libtool 2.2 has gone stable (2.0 was skipped entirely), it
probably makes sense to update the version of Libtool used to build the
nightly tarball and releases for the trunk (and eventually v1.3) from the
nightly snapshot we have been using to the stable LT 2.2 release.

I've done some testing (ie, I installed LT 2.2 for another project, and
nothing in OMPI broke over the last couple of weeks), so I have some
confidence this should be a smooth transition.  If the group decides this
is a good idea, someone at IU would just have to install the new LT
version and change some symlinks and it should all just work...

Brian
___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel





___
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems



Re: [OMPI devel] xensocket btl and migration

2008-03-19 Thread Muhammad Atif
Thanks a lot Jeff and Josh.
Seems it will be quite an interesting task to implement a separate btl for
xensocket (xs) or anything related to migration.  I plan to stick to the
initial design for the time being, which seems ugly but simple and quite
efficient (at the moment). I have bundled xs with the tcp btl. So instead of
using the tcp btl, interested parties will use the xs btl, which supports tcp
inherently. During execution - which starts over normal tcp - if we see that
both endpoints are on the same physical host, we construct the xensockets
(two in fact, one per endpoint to receive data -- xs is unidirectional). Upon
a signal that the xensockets are created and connected, we start to make
progress through the xs socket descriptors, which means that the normal tcp
socket descriptors are alive but not in charge, as no data is being sent or
received through them. When we migrate to another physical host, our plan is
to somehow make the xs sockets invalid and resort to normal tcp sockets. If a
new endpoint pair is detected on the new physical host, we will do the same
as was done initially.
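
A very rough sketch of the per-endpoint switching described above (every
structure and field name here is hypothetical; it only illustrates the idea,
not the actual xs btl code):

/* Hypothetical per-peer endpoint state for the xs btl: the plain tcp
 * descriptor stays alive, and traffic moves to the xensocket pair only
 * while it is valid on the current physical host. */
struct xs_endpoint {
    int tcp_sd;        /* always-valid tcp socket */
    int xs_send_sd;    /* xensocket used to send (unidirectional) */
    int xs_recv_sd;    /* xensocket used to receive */
    int xs_ready;      /* nonzero once both xensockets are connected */
};

/* Which descriptor should carry outgoing data right now?  After a
 * migration, xs_ready is cleared and we fall back to plain tcp until
 * (and unless) a new xensocket pair is set up on the new host. */
static int xs_active_send_sd(const struct xs_endpoint *ep)
{
    return ep->xs_ready ? ep->xs_send_sd : ep->tcp_sd;
}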

I am not sure if it is an efficient design, but in theory it seems
interesting, although it has a slight overhead. The worst part of the design
is that it is highly tcp centric.  My current status is that I am able to run
normal MPI programs on the xs btl, but am having some problems with some
benchmark programs using non-blocking sends and receives coupled with
MPI_Barrier(). Something somewhere somehow gets lost.

Xensockets initially were non-blocking send/recv only, and did not have the
necessary code for supporting epoll/select. We had to add the necessary code
in the module, so I am quite sure that they will work with the new
opal/libevent.

Best Regards,
Muhammad Atif

- Original Message 
From: Josh Hursey 
To: Open MPI Developers 
Sent: Wednesday, March 19, 2008 2:20:59 AM
Subject: Re: [OMPI devel] xensocket btl and migration

Muhammad,

With regard to your question on migration you will likely have to  
reload the BTL components when a migration occurs. Open MPI currently  
assumes that once the set of BTLs are decided upon in a process they  
are to be used until the application completes. There is some limited  
support for failover in which if one BTL 'fails' then it is  
disregarded and a previously defined alternative path is used. For  
example if between two peers Open MPI has the choice of using tcp or  
openib then it will use openib. If openib were to fail during the  
running of the job then it may be possible for Open MPI to fail over  
and use just tcp. I'm not sure how well tested this ability is, others  
can comment if you are interested in this.

However, failover is not really what you are looking for. What it seems
you are looking for is the ability to tell two processes that they
should no longer communicate over tcp, but continue communication over
xensockets, or vice versa. One technique would be, upon migration, to
unload the BTLs (component_close), then reopen (component_open) and
reselect (component_select), then re-exchange the modex; the processes
should settle into the new configuration. You will have to make sure
that any state Open MPI has cached, such as network addresses and node
name data, is refreshed upon restart. Take a look at the
checkpoint/restart logic for how I do this in the code base
([opal|orte|ompi]/runtime/*_cr.c).
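
In code terms, that sequence would look roughly like the sketch below. The
base open/select/close entry points are assumptions about the framework API
of that era (treat the names and signatures as approximate), and the modex
re-exchange is only indicated in a comment because no single call for it
exists:

/* Conceptual sketch only: tear down and re-select the BTLs after a
 * migration.  Function names/signatures below are assumptions modeled
 * on the close/open/select pattern described above. */
static int reload_btls_after_migration(void)
{
    int rc;

    /* close the currently selected BTL modules */
    if (OMPI_SUCCESS != (rc = mca_btl_base_close())) {
        return rc;
    }

    /* reopen and reselect; on the new host the xs btl may now (or may
     * no longer) be usable for some peers */
    if (OMPI_SUCCESS != (rc = mca_btl_base_open())) {
        return rc;
    }
    if (OMPI_SUCCESS != (rc = mca_btl_base_select(false, false))) {
        return rc;
    }

    /* a modex re-exchange would go here so that peers pick up refreshed
     * addresses; see the checkpoint/restart logic in
     * [opal|orte|ompi]/runtime/*_cr.c for how cached state is refreshed */
    return OMPI_SUCCESS;
}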

It is likely that there is another, more efficient method, but I don't
have anything to point you to at the moment. One idea would be to add
a refresh function to the modex which would force the re-exchange of a
single process's address set. There are a slew of problems with this
that you will have to overcome, including race conditions, but I think
they can be surmounted.

I'd be interested in hearing your experiences implementing this in  
Open MPI. Let me know if I can be of any more help.

Cheers,
Josh

On Mar 9, 2008, at 6:13 AM, Muhammad Atif wrote:

> Okay guys.. with all your support and help in understanding ompi  
> architecture, I was able to get Xensocket to work.  Only minor  
> changes to the xensocket kernel module made it compatible with  
> libevent. I am getting results which are bad but I am sure, I have  
> to cleanup the code. At least my results have improved over native  
> netfront-netback of xen for messages of size larger than 1 MB.
>
> I started with making minor changes in the TCP btl, but it seems it  
> is not the best way, as changes are quite huge and it is better to  
> have separate dedicated btl for xensockets. As you guys might be  
> aware Xen supports live migration, now I have one stupid question.  
> My knowledge so far suggests that btl component is initialized only  
> once. The scerario here is if my guest os is migrated from one  
> physical node to another, and realizes that the communicating  
> processes are now on one physical host and they should abandon use  
> of TCP btl and make use of Xensocket btl. I am sure it