[OMPI devel] RFC: VM launch

2011-01-03 Thread Ralph Castain
WHAT: convert orte to start by launching a virtual machine across all allocated 
nodes

WHY: support topologically-aware mapping methods

WHEN: sometime over the next couple of months

***
Several of us (including Jeff, Terry, Josh, and Ralph) are working to create 
topologically-aware mapping modules. This includes modules that correctly map 
processes to cores/sockets and perhaps take into account NIC proximity, switch 
connectivity, etc.

In order to make this work, the rmaps components in mpirun need to know the 
local topology of the nodes in the allocation. We currently obtain that info 
from the orteds: each orted samples its local topology via the opal sysinfo 
framework and then reports it back to mpirun. Unfortunately, we currently don't 
launch the orteds until -after- we map the job, so the topology info cannot be 
used in the mapping algorithm.

This work will modify the launch procedure to:

1. determine the final "allocation" using the current ras + hostfile + 
dash-host method.

2. launch a daemon on every node in the final "allocation"

3. each daemon discovers the local resources and reports that info back to 
mpirun

4. mpirun maps the job against the daemons using the node resource info

5. mpirun sends the launch msg to all daemons.

6. the daemons launch the job -and- provide a global topology map to all procs 
for their subsequent use

Note the significant change here: in the current procedure, we map the job on 
the nodes-to-be-used and then only launch daemons on nodes that have 
application procs on them. If the app then calls comm_spawn, we launch any 
additional daemons as required.

Under this revised procedure, we might launch daemons on nodes that are not 
used by the initial job. If the app then calls comm_spawn, no additional 
daemons will be required as we already have daemons on all available nodes. 
This simplifies comm_spawn, but precludes the ability of an app to dynamically 
discover and add nodes to the "allocation". There has been sporadic interest in 
such a feature, but nothing concrete.
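
To make the ordering change concrete, here is a minimal, self-contained C 
sketch of the revised flow inside mpirun.  All of the function names below 
(get_allocation, launch_daemon, report_topology, map_job, send_launch_msg) are 
hypothetical placeholders standing in for the ras, plm, and rmaps framework 
calls -- they are not the actual ORTE API.

    /* Hypothetical sketch only: these names do not exist in ORTE. */
    #include <stdio.h>

    #define MAX_NODES 64

    typedef struct { int sockets, cores_per_socket; } node_topo_t;

    static int get_allocation(void)              { return 4; }   /* 1: ras + hostfile + dash-host */
    static void launch_daemon(int node)          { (void)node; } /* 2: an orted on *every* node   */
    static node_topo_t report_topology(int node) { node_topo_t t = {2, 8}; (void)node; return t; } /* 3 */
    static void map_job(node_topo_t *t, int n)   { (void)t; (void)n; }  /* 4: rmaps sees topology  */
    static void send_launch_msg(int n)           { (void)n; }           /* 5-6: daemons launch job */

    int main(void)
    {
        node_topo_t topo[MAX_NODES];
        int nnodes = get_allocation();

        for (int i = 0; i < nnodes; i++) launch_daemon(i);
        for (int i = 0; i < nnodes; i++) topo[i] = report_topology(i);

        /* the key change: mapping happens only after every daemon has
           reported its local topology, instead of before any daemon exists */
        map_job(topo, nnodes);
        send_launch_msg(nnodes);
        printf("mapped and launched across %d daemons\n", nnodes);
        return 0;
    }

The point of the sketch is just the ordering: daemon launch and topology 
collection come before the mapping step rather than after it.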





Re: [OMPI devel] async thread in openib BTL

2011-01-03 Thread Jeff Squyres
In addition, it would be really, really nice if someone would consolidate the 
watching of these devices into other mechanisms.

The idea here is that these errors need to be noticed asynchronously, so the 
watching can't be part of the main libevent fd-watching (which is only checked 
once in a while).  The async watcher needs to watch all the time.  But there 
are also the RDMA CM / IB CM fd watchers.  At a minimum, these could be 
combined.  They weren't combined at the time for expediency -- there's no real 
technical reason why they can't be merged.  While the cost of having 2 threads 
is pretty minimal, having 2 threads (or 3 or ... N threads) instead of 1 does 
take up a few resources.

Pasha and I never got the time to unify this fd monitoring, and we've now moved 
on such that it's unlikely that we'll get the opportunity to do it.  It would 
be great if one of the vendors still working in the openib BTL could do this, 
someday.  :-)

Additionally, with the new libevent work occurring, it could be possible to 
simply have a separate libevent base that handles all of these fds, which would 
be nice.
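
As a rough illustration of that consolidation, the following self-contained 
libevent sketch watches two unrelated fds (say, the verbs async fd and the 
RDMA CM channel fd) from a single event base in a single thread.  The fd 
arguments and the callback are made up for the example; this is not how the 
opal event framework or the openib BTL is actually wired.

    #include <event2/event.h>
    #include <stdio.h>

    /* one callback shared by every monitored fd; 'arg' labels which
       subsystem (async errors, RDMA CM, IB CM, ...) the fd belongs to */
    static void fd_ready(evutil_socket_t fd, short what, void *arg)
    {
        (void)what;
        printf("activity on %s fd %d\n", (const char *)arg, (int)fd);
    }

    /* watch several fds from one thread instead of one thread per fd */
    void watch_fds(int async_fd, int cm_fd)
    {
        struct event_base *base = event_base_new();
        struct event *ev_async = event_new(base, async_fd, EV_READ | EV_PERSIST,
                                           fd_ready, (void *)"ibv-async");
        struct event *ev_cm    = event_new(base, cm_fd, EV_READ | EV_PERSIST,
                                           fd_ready, (void *)"rdma-cm");

        event_add(ev_async, NULL);
        event_add(ev_cm, NULL);
        event_base_dispatch(base);   /* blocks; run in the single watcher thread */

        event_free(ev_async);
        event_free(ev_cm);
        event_base_free(base);
    }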


On Dec 23, 2010, at 10:28 AM, Shamis, Pavel wrote:

> The async thread is used to handle asynchronous error/notification events, 
> like port up/down, HCA errors, etc. 
> So most of the time the thread sleeps, and in a healthy network you are not 
> supposed to see any events.
> 
> Regards,
> 
> Pasha
> 
> On Dec 23, 2010, at 12:49 AM, Eugene Loh wrote:
> 
>> I'm starting to look at the openib BTL for the first time and am 
>> puzzled.  In btl_openib_async.c, it looks like an asynchronous thread is 
>> started.  During MPI_Init(), the main thread sends the async thread a 
>> file descriptor for each IB interface to be polled.  In MPI_Finalize(), 
>> the main thread asks the async thread to shut down.  Between MPI_Init() 
>> and MPI_Finalize(), I would think that the async thread would poll on 
>> the IB fd's and handle events that come up.  If I stick print statements 
>> into the async thread, however, I don't see any events come up on the IB 
>> fd's.  So, the async thread is useless.  Yes?  It starts up and shuts 
>> down, but never sees any events on the IB devices?


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
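
For anyone coming to this code fresh, the watcher Eugene describes boils down 
to roughly the following standalone libibverbs loop: block in poll() on the 
device's async fd, then retrieve and acknowledge whatever asynchronous events 
the HCA has raised.  This is a simplified sketch, not the actual 
btl_openib_async.c logic, which also listens on a command pipe from the main 
thread.

    #include <infiniband/verbs.h>
    #include <poll.h>
    #include <stdio.h>

    /* simplified async watcher for one device: block on the async fd,
       then retrieve and acknowledge each asynchronous event */
    static void watch_device(struct ibv_context *ctx)
    {
        struct pollfd pfd = { .fd = ctx->async_fd, .events = POLLIN };

        while (poll(&pfd, 1, -1) > 0) {
            struct ibv_async_event ev;

            if (ibv_get_async_event(ctx, &ev) != 0) {
                break;
            }
            switch (ev.event_type) {
            case IBV_EVENT_PORT_ACTIVE:
            case IBV_EVENT_PORT_ERR:
                printf("port %d changed state\n", ev.element.port_num);
                break;
            case IBV_EVENT_QP_ACCESS_ERR:
                printf("access error on QP %u\n", ev.element.qp->qp_num);
                break;
            default:
                break;
            }
            ibv_ack_async_event(&ev);   /* every retrieved event must be acked */
        }
    }

In a healthy run the poll() simply never returns, which matches what the 
print statements show.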




Re: [OMPI devel] IBV_EVENT_QP_ACCESS_ERR

2011-01-03 Thread Eugene Loh

George Bosilca wrote:


Eugene,

This error indicates that somehow we're accessing the QP while the QP is in 
the "down" state. As the asynchronous thread is the one that sees this error, I 
wonder if it is looking for information about a QP that has already been 
destroyed by the main thread (as this only occurs in MPI_Finalize).

Can you look in the syslog to see if there is any additional info related to 
this issue?


Not much.  A one-liner like this:

Dec 27 21:49:36 burl-ct-x4150-11 hermon: [ID 492207 kern.info] hermon1: 
EQE local access violation



On Dec 30, 2010, at 20:43, Eugene Loh  wrote:

I was running a bunch of np=4 test programs over two nodes.  Occasionally, 
*one* of the codes would see an IBV_EVENT_QP_ACCESS_ERR during MPI_Finalize().  
I traced the code and ran another program that mimicked the particular MPI 
calls made by that program.  This other program, too, would occasionally 
trigger this error.  I never saw the problem with other tests.  Rate of 
incidence could go from consecutive runs (I saw this once) to 1:100s (more 
typically) to even less frequently -- I've had 1000s of consecutive runs with 
no problems.  (The tests run a few seconds apiece.)  The traffic pattern is 
sends from non-zero ranks to rank 0, with root-0 gathers, and lots of 
Allgathers.  The largest messages are 1000 bytes.  It appears the problem is 
always seen on rank 3.

Now, I wouldn't mind someone telling me, based on that little information, what 
the problem is here, but I guess I don't expect that.  What I am asking is what 
IBV_EVENT_QP_ACCESS_ERR means.  Again, it's seen during MPI_Finalize.  The 
async thread is seeing this.  What is this error trying to tell me?



Re: [OMPI devel] IBV_EVENT_QP_ACCESS_ERR

2011-01-03 Thread Jeff Squyres (jsquyres)
I'd guess the same thing as George - a race condition in the shutdown of the 
async thread...?  I haven't looked at that code in a long time, so I don't 
remember how it tried to defend against the race condition. 

Sent from my PDA. No type good. 




[OMPI devel] mca_bml_r2_del_proc_btl()

2011-01-03 Thread Eugene Loh
I can't tell if this is a problem, though I suspect it's a small one 
even if it's a problem at all.


In mca_bml_r2_del_proc_btl(), a BTL is removed from the send list and 
from the RDMA list.


If the BTL is removed from the send list, the end-point's max send size 
is recomputed to be the minimum of the max send sizes of the remaining 
BTLs.  The code looks like this, where I've removed some code to focus 
on the parts that matter:


    /* remove btl from send list */
    if (mca_bml_base_btl_array_remove(&ep->btl_send, btl)) {

        /* reset max_send_size to the min of all btl's */
        for (b = 0; b < mca_bml_base_btl_array_get_size(&ep->btl_send); b++) {
            bml_btl = mca_bml_base_btl_array_get_index(&ep->btl_send, b);
            ep_btl = bml_btl->btl;

            if (ep_btl->btl_max_send_size < ep->btl_max_send_size) {
                ep->btl_max_send_size = ep_btl->btl_max_send_size;
            }
        }
    }

Shouldn't that inner loop be preceded by initialization of 
ep->btl_max_send_size to some very large value (ironically enough, 
perhaps "-1")?


Something similar happens in the same function when the BTL is removed 
from the RDMA list and  ep->btl_pipeline_send_length and 
ep->btl_send_limit are recomputed.
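

For concreteness, this is the kind of initialization being suggested (a 
sketch, not a committed fix).  It assumes ep->btl_max_send_size is unsigned 
(it appears to be a size_t), which is exactly why "-1" works as a very large 
value:

    /* remove btl from send list */
    if (mca_bml_base_btl_array_remove(&ep->btl_send, btl)) {

        /* start from "unlimited" so the minimum is recomputed from scratch
           and a stale value from the removed btl cannot survive the loop */
        ep->btl_max_send_size = (size_t)-1;

        for (b = 0; b < mca_bml_base_btl_array_get_size(&ep->btl_send); b++) {
            bml_btl = mca_bml_base_btl_array_get_index(&ep->btl_send, b);
            ep_btl = bml_btl->btl;

            if (ep_btl->btl_max_send_size < ep->btl_max_send_size) {
                ep->btl_max_send_size = ep_btl->btl_max_send_size;
            }
        }
    }

The same pattern would apply to ep->btl_pipeline_send_length and 
ep->btl_send_limit when the BTL is removed from the RDMA list (and, if the 
array can end up empty, the sentinel value itself may need a second look).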


Re: [OMPI devel] IBV_EVENT_QP_ACCESS_ERR

2011-01-03 Thread Shamis, Pavel
It looks like we are touching a QP that has already been released. Before closing 
a QP we make sure to complete all outstanding messages on the endpoint. Once all 
QPs (and other resources) are closed, we signal the async thread to remove this 
HCA from its monitoring list.  To me it looks like we are somehow closing the QP 
before all outstanding requests have completed.

Regards
---
Pavel Shamis (Pasha)
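
A sketch of the shutdown ordering described above -- drain, destroy, then tell 
the watcher -- with made-up helper names (drain_endpoint, async_cmd_fd, etc.) 
that do not correspond to the actual openib BTL symbols:

    #include <infiniband/verbs.h>
    #include <unistd.h>

    struct endpoint {
        struct ibv_qp *qp;
        int outstanding;            /* sends posted but not yet completed */
    };

    /* 1. wait until every outstanding request on the endpoint has completed */
    static void drain_endpoint(struct ibv_cq *cq, struct endpoint *ep)
    {
        struct ibv_wc wc;

        while (ep->outstanding > 0) {
            if (ibv_poll_cq(cq, 1, &wc) > 0) {
                ep->outstanding--;
            }
        }
    }

    static void shutdown_endpoint(struct ibv_cq *cq, struct endpoint *ep,
                                  int async_cmd_fd)
    {
        char cmd = 'R';             /* "remove this HCA from the watch list" */

        drain_endpoint(cq, ep);

        /* 2. only now is it safe to destroy the QP */
        ibv_destroy_qp(ep->qp);

        /* 3. finally signal the async thread (here: a byte on a command pipe) */
        if (write(async_cmd_fd, &cmd, 1) < 0) {
            /* ignored in this sketch */
        }
    }

If step 2 ever runs before step 1 has really finished (for example because the 
outstanding count is decremented too early), the HCA can still reference the 
destroyed QP and raise exactly this kind of access error.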
