Greg: Can you run with "--mca btl_base_verbose 100" on your debug build so that we can get some additional output to see why UDCM is failing to set up properly?
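
For anyone following along, the assert Nathan describes in the quoted text (a list destructor running on a list whose constructor was never called) can be illustrated with a small stand-alone C sketch. This is not the Open MPI code: toy_list_t, toy_module_t, and every function name here are hypothetical stand-ins, and the setup_fails flag only simulates a failed resource setup. It shows the reordering idea, constructing the list before any step that can fail, so that the error path can call finalize safely.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for a list class with a constructor and destructor. */
#define TOY_LIST_MAGIC 0x4c495354u

typedef struct {
    unsigned magic;  /* set by the constructor, checked by the destructor */
    int length;
} toy_list_t;

void toy_list_construct(toy_list_t *l)
{
    l->magic = TOY_LIST_MAGIC;
    l->length = 0;
}

bool toy_list_is_constructed(const toy_list_t *l)
{
    return l->magic == TOY_LIST_MAGIC;
}

void toy_list_destruct(toy_list_t *l)
{
    /* Destructing a list that was never constructed trips this assert. */
    assert(toy_list_is_constructed(l));
    l->magic = 0;
}

/* Hypothetical module state: one list plus a flag for "resources set up". */
typedef struct {
    toy_list_t pending;
    bool resources_ready;
} toy_module_t;

void toy_module_finalize(toy_module_t *m)
{
    toy_list_destruct(&m->pending);
    m->resources_ready = false;
}

/* setup_fails simulates an error while creating CQs, buffers, and so on. */
int toy_module_init(toy_module_t *m, bool setup_fails)
{
    memset(m, 0, sizeof(*m));

    /* Construct first, before any step that can fail, so that every
     * error path below can call finalize without hitting the assert. */
    toy_list_construct(&m->pending);

    if (setup_fails) {
        toy_module_finalize(m);
        return -1;
    }

    m->resources_ready = true;
    return 0;
}
```

If toy_list_construct() were moved below the setup_fails check, the error path would destruct a zeroed list and the assert would fire, which is the shape of the failure discussed below. As Nathan notes, reordering only hides the symptom; the setup error itself still needs explaining.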

On Jun 10, 2014, at 10:25 AM, Nathan Hjelm <hje...@lanl.gov> wrote:

> On Tue, Jun 10, 2014 at 12:10:28AM +0000, Jeff Squyres (jsquyres) wrote:
>> I seem to recall that you have an IB-based cluster, right?
>>
>> From a *very quick* glance at the code, it looks like this might be a simple
>> incorrect-finalization issue. That is:
>>
>> - you run the job on a single server
>> - openib disqualifies itself because you're running on a single server
>> - openib then goes to finalize/close itself
>> - but openib didn't fully initialize itself (because it disqualified itself
>>   early in the initialization process), and something in the finalization
>>   process didn't take that into account
>>
>> Nathan -- is that anywhere close to correct?
>
> Nope. udcm_module_finalize is being called because there was an error
> setting up the udcm state. See btl_openib_connect_udcm.c:476. The
> opal_list_t destructor is getting an assert failure. Probably because
> the constructor wasn't called. I can rearrange the constructors to be
> called first but there appears to be a deeper issue with the user's
> system: udcm_module_init should not be failing! It creates a couple of
> CQs, allocates a small number of registered buffers and starts
> monitoring the fd for the completion channel. All these things are also
> done in the setup of the openib btl itself. Keep in mind that the openib
> btl will not disqualify itself when running single server. Openib may be
> used to communicate on node and is needed for the dynamics case.
>
> The user might try adding -mca btl_base_verbose 100 to shed some
> light on what the real issue is.
>
> BTW, I no longer monitor the user mailing list. If something needs my
> attention forward it to me directly.
>
> -Nathan

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/