[OMPI devel] RFC: VM launch
WHAT: convert ORTE to start by launching a virtual machine across all allocated nodes

WHY: support topologically-aware mapping methods

WHEN: sometime over the next couple of months

***

Several of us (including Jeff, Terry, Josh, and Ralph) are working to create topologically-aware mapping modules. This includes modules that correctly map processes to cores/sockets, perhaps taking into account NIC proximity, switch connectivity, etc. To make this work, the rmaps components in mpirun need to know the local topology of the nodes in the allocation. We currently obtain that info from the orteds, as each orted samples the local topology via the opal sysinfo framework and then reports it back to mpirun. Unfortunately, we currently don't launch the orteds until -after- we map the job, so the topology info cannot be used in the mapping algorithm.

This work will modify the launch procedure to (a rough code sketch of this flow follows below):

1. determine the final "allocation" using the current ras + hostfile + dash-host method
2. launch a daemon on every node in the final "allocation"
3. each daemon discovers the local resources and reports that info back to mpirun
4. mpirun maps the job against the daemons using the node resource info
5. mpirun sends the launch msg to all daemons
6. the daemons launch the job -and- provide a global topology map to all procs for their subsequent use

Note the significant change here: in the current procedure, we map the job on the nodes-to-be-used and then only launch daemons on nodes that have application procs on them. If the app then calls comm_spawn, we launch any additional daemons as required.

Under this revised procedure, we might launch daemons on nodes that are not used by the initial job. If the app then calls comm_spawn, no additional daemons will be required, as we already have daemons on all available nodes. This simplifies comm_spawn, but precludes the ability of an app to dynamically discover and add nodes to the "allocation". There has been sporadic interest in such a feature, but nothing concrete.
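For illustration only, the reordering could be sketched like this. Every name below is a placeholder standing in for the corresponding phase, not a real ORTE API:

    /* Placeholder phase functions -- not real ORTE APIs. */
    typedef struct job job_t;
    int determine_allocation(job_t *j);        /* 1. ras + hostfile + dash-host */
    int launch_daemons_on_all_nodes(job_t *j); /* 2. daemon on every node       */
    int collect_node_resources(job_t *j);      /* 3. daemons report topology    */
    int map_job_to_resources(job_t *j);        /* 4. map using that info        */
    int send_launch_msg(job_t *j);             /* 5./6. launch msg; daemons
                                                  start procs + hand out map    */

    int revised_launch(job_t *job)
    {
        int rc;

        if ((rc = determine_allocation(job)))        return rc;
        /* daemons now start *before* mapping, instead of after it */
        if ((rc = launch_daemons_on_all_nodes(job))) return rc;
        if ((rc = collect_node_resources(job)))      return rc;
        if ((rc = map_job_to_resources(job)))        return rc;
        return send_launch_msg(job);
    }

The only structural difference from today's flow is that launch_daemons_on_all_nodes() runs before map_job_to_resources(), so the mapper can use the resource info the daemons report back.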
Re: [OMPI devel] async thread in openib BTL
In addition, it would be really, really nice if someone would consolidate the watching of these devices into other mechanisms. The idea here is that the error needs to be noticed asynchronously, so it can't be part of the main libevent fd-watching (which is only checked once in a while); the async watcher needs to be watching all the time. But there are also the RDMA CM / IB CM fd watchers, too. At a minimum, these could be combined. They weren't combined at the time for expediency -- there's no real technical reason why they can't be merged. While the cost of having 2 threads is pretty minimal, having 2 threads (or 3 or ... N threads) instead of 1 does take up a few resources.

Pasha and I never got the time to unify this fd monitoring, and we've now moved on such that it's unlikely that we'll get the opportunity to do it. It would be great if one of the vendors still working on the openib BTL could do this someday. :-)

Additionally, with the new libevent work occurring, it could be possible to simply have a separate libevent base that handles all of these fds, which would be nice.

On Dec 23, 2010, at 10:28 AM, Shamis, Pavel wrote:

> The async thread is used to handle asynchronous error/notification events,
> like port up/down, HCA errors, etc.
> So most of the time the thread sleeps, and in a healthy network you are not
> supposed to see any events.
>
> Regards,
>
> Pasha
>
> On Dec 23, 2010, at 12:49 AM, Eugene Loh wrote:
>
>> I'm starting to look at the openib BTL for the first time and am puzzled.
>> In btl_openib_async.c, it looks like an asynchronous thread is started.
>> During MPI_Init(), the main thread sends the async thread a file
>> descriptor for each IB interface to be polled. In MPI_Finalize(), the
>> main thread asks the async thread to shut down. Between MPI_Init() and
>> MPI_Finalize(), I would think that the async thread would poll on the IB
>> fd's and handle events that come up. If I stick print statements into
>> the async thread, however, I don't see any events come up on the IB fd's.
>> So, the async thread is useless. Yes? It starts up and shuts down, but
>> never sees any events on the IB devices?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
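To make the consolidation idea concrete, here is a minimal sketch (not Open MPI code) of one dedicated thread running a separate libevent base that watches an IB device's async fd; the RDMA CM / IB CM fds could be added to the same base with further event_new()/event_add() calls. The function and thread names are made up for illustration.

    #include <event2/event.h>
    #include <infiniband/verbs.h>
    #include <pthread.h>
    #include <stdio.h>

    /* Fired whenever the device's async fd becomes readable. */
    static void ib_async_cb(evutil_socket_t fd, short what, void *arg)
    {
        struct ibv_context *ctx = (struct ibv_context *) arg;
        struct ibv_async_event ev;

        (void) fd; (void) what;
        if (0 == ibv_get_async_event(ctx, &ev)) {
            /* e.g. port up/down, IBV_EVENT_QP_ACCESS_ERR, ... */
            fprintf(stderr, "IB async event %d\n", (int) ev.event_type);
            ibv_ack_async_event(&ev);
        }
    }

    /* One thread runs the base; every registered fd is serviced here. */
    static void *fd_monitor_thread(void *arg)
    {
        event_base_dispatch((struct event_base *) arg);
        return NULL;
    }

    /* Register one device's async fd with the shared base.  RDMA CM and
     * IB CM fds would be added the same way, with their own callbacks. */
    static int watch_device(struct event_base *base, struct ibv_context *ctx)
    {
        struct event *ev = event_new(base, ctx->async_fd,
                                     EV_READ | EV_PERSIST, ib_async_cb, ctx);
        return (NULL != ev) ? event_add(ev, NULL) : -1;
    }

A single pthread_create() on fd_monitor_thread() would then replace the separate async and CM watcher threads; shutdown is an event_base_loopbreak() from the main thread followed by a pthread_join().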
Re: [OMPI devel] IBV_EVENT_QP_ACCESS_ERR
George Bosilca wrote:

> Eugene,
>
> This error indicates that somehow we're accessing the QP while the QP is in
> "down" state. As the asynchronous thread is the one that sees this error, I
> wonder if it isn't looking for some information about a QP that has been
> destroyed by the main thread (as this only occurs in MPI_Finalize).
>
> Can you look in the syslog to see if there is any additional info related
> to this issue there?

Not much. A one-liner like this:

Dec 27 21:49:36 burl-ct-x4150-11 hermon: [ID 492207 kern.info] hermon1: EQE local access violation

> On Dec 30, 2010, at 20:43, Eugene Loh wrote:
>
>> I was running a bunch of np=4 test programs over two nodes. Occasionally,
>> *one* of the codes would see an IBV_EVENT_QP_ACCESS_ERR during
>> MPI_Finalize(). I traced the code and ran another program that mimicked
>> the particular MPI calls made by that program. This other program, too,
>> would occasionally trigger this error. I never saw the problem with other
>> tests. The rate of incidence could range from consecutive runs (I saw this
>> once) to 1 in hundreds (more typically) to even less frequent -- I've had
>> 1000s of consecutive runs with no problems. (The tests run a few seconds
>> apiece.) The traffic pattern is sends from non-zero ranks to rank 0, with
>> root-0 gathers, and lots of Allgathers. The largest messages are 1000
>> bytes. It appears the problem is always seen on rank 3.
>>
>> Now, I wouldn't mind someone telling me, based on that little information,
>> what the problem is here, but I guess I don't expect that. What I am
>> asking is what IBV_EVENT_QP_ACCESS_ERR means. Again, it's seen during
>> MPI_Finalize. The async thread is seeing this. What is this error trying
>> to tell me?
Re: [OMPI devel] IBV_EVENT_QP_ACCESS_ERR
I'd guess the same thing as George - a race condition in the shutdown of the async thread...? I haven't looked at that code in a long, long time to remember how it tried to defend against the race condition.

Sent from my PDA. No type good.

On Jan 3, 2011, at 2:31 PM, "Eugene Loh" wrote:

> George Bosilca wrote:
>
>> This error indicates that somehow we're accessing the QP while the QP is
>> in "down" state. As the asynchronous thread is the one that sees this
>> error, I wonder if it isn't looking for some information about a QP that
>> has been destroyed by the main thread (as this only occurs in
>> MPI_Finalize).
>
> Not much. A one-liner like this:
>
> Dec 27 21:49:36 burl-ct-x4150-11 hermon: [ID 492207 kern.info] hermon1:
> EQE local access violation
>
> [...]
[OMPI devel] mca_bml_r2_del_proc_btl()
I can't tell if this is a problem, though I suspect it's a small one even if it's a problem at all. In mca_bml_r2_del_proc_btl(), a BTL is removed from the send list and from the RDMA list. If the BTL is removed from the send list, the endpoint's max send size is recomputed to be the minimum of the max send sizes of the remaining BTLs. The code looks like this, where I've removed some code to focus on the parts that matter:

    /* remove btl from send list */
    if(mca_bml_base_btl_array_remove(&ep->btl_send, btl)) {

        /* reset max_send_size to the min of all btl's */
        for(b=0; b< mca_bml_base_btl_array_get_size(&ep->btl_send); b++) {
            bml_btl = mca_bml_base_btl_array_get_index(&ep->btl_send, b);
            ep_btl = bml_btl->btl;
            if (ep_btl->btl_max_send_size < ep->btl_max_send_size) {
                ep->btl_max_send_size = ep_btl->btl_max_send_size;
            }
        }
    }

Shouldn't that inner loop be preceded by initialization of ep->btl_max_send_size to some very large value (ironically enough, perhaps "-1")? Something similar happens in the same function when the BTL is removed from the RDMA list and ep->btl_pipeline_send_length and ep->btl_send_limit are recomputed.
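If that reading is right, the fix would just be to reset the cached value before the loop. A sketch of the suggested change (not an actual committed fix; it assumes the field is unsigned, which is what makes the "-1" trick work):

    /* remove btl from send list */
    if(mca_bml_base_btl_array_remove(&ep->btl_send, btl)) {

        /* start from "no limit" before taking the min; for an unsigned
         * size field, (size_t) -1 is the largest representable value */
        ep->btl_max_send_size = (size_t) -1;

        /* reset max_send_size to the min of all btl's */
        for(b=0; b< mca_bml_base_btl_array_get_size(&ep->btl_send); b++) {
            bml_btl = mca_bml_base_btl_array_get_index(&ep->btl_send, b);
            ep_btl = bml_btl->btl;
            if (ep_btl->btl_max_send_size < ep->btl_max_send_size) {
                ep->btl_max_send_size = ep_btl->btl_max_send_size;
            }
        }
    }

The same kind of re-initialization would presumably be needed for btl_pipeline_send_length and btl_send_limit in the RDMA branch.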
Re: [OMPI devel] IBV_EVENT_QP_ACCESS_ERR
It looks like we are touching some QP that was already released. Before closing the QP we make sure to complete all outstanding messages on the endpoint. Once all QPs (and other resources) are closed, we signal the async thread to remove this HCA from its monitoring list. To me it looks like somehow we close the QP before all outstanding requests were completed.

Regards

---

Pavel Shamis (Pasha)

On Jan 3, 2011, at 12:44 PM, Jeff Squyres (jsquyres) wrote:

> I'd guess the same thing as George - a race condition in the shutdown of
> the async thread...? I haven't looked at that code in a long, long time to
> remember how it tried to defend against the race condition.
>
> Sent from my PDA. No type good.
>
> [...]
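The intended ordering Pasha describes could be sketched roughly as follows. The endpoint type and the two helper functions are placeholders invented for illustration, not the actual openib BTL code; only ibv_destroy_qp() is a real verbs call.

    #include <infiniband/verbs.h>

    /* Toy stand-in for the real (much richer) openib BTL endpoint. */
    typedef struct {
        struct ibv_qp      **qps;
        int                  num_qps;
        struct ibv_context  *device;
    } toy_endpoint_t;

    /* Placeholder helpers -- assumptions, not real openib BTL functions. */
    void wait_for_outstanding_requests(toy_endpoint_t *ep);
    int  notify_async_thread_remove_hca(int async_cmd_fd,
                                        struct ibv_context *dev);

    int endpoint_finalize(toy_endpoint_t *ep, int async_cmd_fd)
    {
        int i;

        /* 1. drain: every outstanding send/receive on this endpoint must
         *    complete before we touch the QPs */
        wait_for_outstanding_requests(ep);

        /* 2. only then destroy the QPs (and other verbs resources) */
        for (i = 0; i < ep->num_qps; i++) {
            ibv_destroy_qp(ep->qps[i]);
        }

        /* 3. finally tell the async thread to drop this HCA from its
         *    monitoring list, e.g. via a command written down a pipe */
        return notify_async_thread_remove_hca(async_cmd_fd, ep->device);
    }

The symptom in this thread would fit step 1 returning before the hardware is really done with a QP, so that step 2 destroys a QP the HCA later flags with IBV_EVENT_QP_ACCESS_ERR.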