Greg: 

Can you run with "--mca btl_base_verbose 100" on your debug build so that we 
can get some additional output to see why UDCM is failing to set up properly?



On Jun 10, 2014, at 10:25 AM, Nathan Hjelm <hje...@lanl.gov> wrote:

> On Tue, Jun 10, 2014 at 12:10:28AM +0000, Jeff Squyres (jsquyres) wrote:
>> I seem to recall that you have an IB-based cluster, right?
>> 
>> From a *very quick* glance at the code, it looks like this might be a simple 
>> incorrect-finalization issue.  That is:
>> 
>> - you run the job on a single server
>> - openib disqualifies itself because you're running on a single server
>> - openib then goes to finalize/close itself
>> - but openib didn't fully initialize itself (because it disqualified itself 
>> early in the initialization process), and something in the finalization 
>> process didn't take that into account
>> 
>> Nathan -- is that anywhere close to correct?
> 
> Nope. udcm_module_finalize is being called because there was an error
> setting up the udcm state. See btl_openib_connect_udcm.c:476. The
> opal_list_t destructor is getting an assert failure. Probably because
> the constructor wasn't called. I can rearrange the constructors to be
> called first but there appears to be a deeper issue with the user's
> system: udcm_module_init should not be failing! It creates a couple of
> CQs, allocates a small number of registered buffers and starts
> monitoring the fd for the completion channel. All these things are also
> done in the setup of the openib btl itself. Keep in mind that the openib
> btl will not disqualify itself when running on a single server. Openib may
> be used to communicate on-node and is needed for the dynamics case.
> 
> The user might try adding -mca btl_base_verbose 100 to shed some
> light on what the real issue is.
> 
> BTW, I no longer monitor the user mailing list. If something needs my
> attention, forward it to me directly.
> 
> -Nathan
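
To make the failure mode Nathan describes concrete (a destructor running on a list whose constructor was never called because init bailed out early), here is a minimal C sketch. It is only an illustration under stated assumptions: it does not use the real OPAL object system or the actual udcm code, and every name in it (toy_list_t, toy_module_t, toy_module_init, toy_module_finalize, the magic value) is hypothetical. It shows the "construct first, then do the fallible setup" ordering, so that the error path through finalize always sees a constructed list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a list object; this is NOT opal_list_t. */
#define TOY_LIST_MAGIC 0x600DC0DEu

typedef struct toy_list {
    unsigned magic;   /* set by the constructor, cleared by the destructor */
    size_t   length;
} toy_list_t;

typedef struct toy_module {
    toy_list_t pending;      /* list owned by the module */
    bool       resources_ok; /* true once the fallible setup succeeded */
} toy_module_t;

void toy_list_construct(toy_list_t *list)
{
    list->magic  = TOY_LIST_MAGIC;
    list->length = 0;
}

bool toy_list_is_constructed(const toy_list_t *list)
{
    return list->magic == TOY_LIST_MAGIC;
}

/* Like a debug-build destructor, this asserts if the object was
 * never constructed. */
void toy_list_destruct(toy_list_t *list)
{
    assert(toy_list_is_constructed(list));
    list->magic = 0;
}

void toy_module_finalize(toy_module_t *module)
{
    module->resources_ok = false;
    toy_list_destruct(&module->pending);
}

/* Construct the list BEFORE any step that can fail, so that the error
 * path below can call finalize safely. simulate_failure stands in for
 * a failed resource setup (hypothetical; the real cause is unknown). */
int toy_module_init(toy_module_t *module, bool simulate_failure)
{
    toy_list_construct(&module->pending);
    module->resources_ok = false;

    if (simulate_failure) {
        toy_module_finalize(module); /* safe: the list is constructed */
        return -1;
    }

    module->resources_ok = true;
    return 0;
}
```

If the toy_list_construct call were moved below the failure check, the error path would hit the assert in toy_list_destruct, which is the shape of the crash reported here. Reordering only hides the symptom, though; the underlying setup failure still needs to be diagnosed.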


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
